Spam Mail Detection through Data Mining – A Comparative Performance Analysis

Full Text (PDF, 253KB), PP.31-39


Author(s)

Megha Rathi 1,*, Vikas Pareek 2

1. Department of Computer Science Engineering, Jaypee Institute of Information Technology, Noida, India

2. Department of Computer Science of Banasthali University, Banasthali, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijmecs.2013.12.05

Received: 26 Aug. 2013 / Revised: 6 Oct. 2013 / Accepted: 2 Nov. 2013 / Published: 8 Dec. 2013

Index Terms

Classifier, Feature Selection, E-mails, Spam Mails.

Abstract

As the web expands and people increasingly rely on it for communication, e-mail has become the fastest way to send information from one place to another. Today most transactions and communication, whether general or business, take place through e-mail. E-mail is an effective communication tool because it saves considerable time and cost. However, e-mail is also subject to attacks, including spam. Spam is the use of electronic messaging systems to send bulk messages: it floods the Internet with many copies of the same message in an attempt to force that message on people who would not otherwise choose to receive it. In this study, we apply various data mining approaches to a spam dataset in order to find the best classifier for e-mail classification, and we analyze the performance of the classifiers both with and without a feature selection algorithm. We first experiment with the entire dataset, without selecting features, applying the classifiers one by one and checking the results. We then apply the Best-First feature selection algorithm to select the desired features and apply the classifiers to the reduced feature set. We find that accuracy improves when the feature selection process is embedded in the experiment. Random Tree is the best classifier for spam mail classification, with an accuracy of 99.72%. None of the algorithms achieves 100% accuracy in classifying spam e-mails, but Random Tree comes very close.
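The two-stage protocol described in the abstract (score a classifier on all features, then again on a selected subset) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the authors' setup: the data is synthetic rather than the spam dataset, a nearest-centroid classifier stands in for Random Tree and the other classifiers, and a greedy forward search stands in for Best-First selection (which can also backtrack). The function names are hypothetical, and the accuracies it prints are not comparable with the 99.72% reported above.

```python
import random

def make_data(n, n_features, n_informative, rng):
    """Synthetic stand-in for a spam dataset: label 1 = spam, 0 = legitimate."""
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        # Informative features are shifted by 4 for spam; the rest are noise.
        x = [rng.gauss(4.0 * y if j < n_informative else 0.0, 1.0)
             for j in range(n_features)]
        rows.append(x)
        labels.append(y)
    return rows, labels

def fit_centroids(X, y, feats):
    """Per-class mean vector, restricted to the chosen feature indices."""
    centroids = {}
    for label in set(y):
        members = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(x[j] for x in members) / len(members)
                            for j in feats]
    return centroids

def accuracy(train, test, feats):
    """Fit a nearest-centroid classifier on `train`, score it on `test`."""
    (X_tr, y_tr), (X_te, y_te) = train, test
    centroids = fit_centroids(X_tr, y_tr, feats)
    correct = 0
    for x, t in zip(X_te, y_te):
        pred = min(centroids, key=lambda c: sum(
            (x[j] - m) ** 2 for j, m in zip(feats, centroids[c])))
        correct += pred == t
    return correct / len(y_te)

def forward_select(train, val, n_features):
    """Greedy forward search: repeatedly add the feature that most improves
    validation accuracy, and stop when no candidate improves it."""
    selected, best = [], 0.0
    while len(selected) < n_features:
        candidates = [j for j in range(n_features) if j not in selected]
        scores = {j: accuracy(train, val, selected + [j]) for j in candidates}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break
        selected.append(j_best)
        best = scores[j_best]
    return selected

rng = random.Random(0)
X, y = make_data(600, 10, 2, rng)
train = (X[:300], y[:300])
val = (X[300:450], y[300:450])
test = (X[450:], y[450:])

# Stage 1: all features. Stage 2: only the features picked by the search.
acc_all = accuracy(train, test, list(range(10)))
selected = forward_select(train, val, 10)
acc_selected = accuracy(train, test, selected)
print("all features:", acc_all)
print("selected features", selected, ":", acc_selected)
```

The search uses a separate validation split so that the final comparison on the test split is not biased by the selection step. Whether selection helps depends on the data and the classifier; the sketch only shows how the two conditions are set up and compared.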

Cite This Paper

Megha Rathi, Vikas Pareek, "Spam Mail Detection through Data Mining – A Comparative Performance Analysis", International Journal of Modern Education and Computer Science (IJMECS), vol. 5, no. 12, pp. 31-39, 2013. DOI: 10.5815/ijmecs.2013.12.05
