Comparative Analysis of Classification Algorithms for Email Spam Detection

Full Text (PDF, 609KB), PP.60-67

Author(s)

Shafii Muhammad Abdulhamid 1,*, Maryam Shuaib 1, Oluwafemi Osho 1, Idris Ismaila 1, John K. Alhassan 1

1. Department of Cyber Security, Federal University of Technology, Minna, Nigeria

* Corresponding author.

DOI: https://doi.org/10.5815/ijcnis.2018.01.07

Received: 24 Jun. 2017 / Revised: 28 Jul. 2017 / Accepted: 10 Aug. 2017 / Published: 8 Jan. 2018

Index Terms

Email spam, classification algorithms, Bayesian Logistic Regression, Hidden Naive Bayes, Rotation Forest

Abstract

The growing use of email for everyday business transactions and general communication, driven by its cost effectiveness and efficiency, has made email vulnerable to attacks, including spamming. Spam emails, also called junk emails, are unsolicited, almost identical messages sent to multiple recipients at random. This study analyses the performance of a set of classification algorithms: Bayesian Logistic Regression, Hidden Naïve Bayes, Radial Basis Function (RBF) Network, Voted Perceptron, Lazy Bayesian Rule, Logit Boost, Rotation Forest, NNge, Logistic Model Tree, REP Tree, Naïve Bayes, Multilayer Perceptron, Random Tree and J48. The performance of the algorithms was measured in terms of Accuracy, Precision, Recall, F-Measure, Root Mean Squared Error, Receiver Operating Characteristic (ROC) Area and Root Relative Squared Error using the WEKA data mining tool. To give a balanced view of the algorithms' performance, no feature selection or performance-boosting method was employed. The research showed that a number of classification algorithms exist that, if properly explored with feature selection, can yield more accurate results for email classification. Rotation Forest was found to give the best accuracy, 94.2%. Although none of the algorithms achieved 100% accuracy in sorting spam emails, Rotation Forest came closest.
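The evaluation measures named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' WEKA setup: the function names, the 0/1 label encoding (1 = spam), and the toy data are assumptions made for illustration, and the error measures are simplified to a single predicted spam probability per message, whereas WEKA computes them over per-class probability distributions.

```python
from math import sqrt


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F-measure for the spam (1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam flagged as spam
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed as ham
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / denom if denom else 0.0,
    }


def error_metrics(y_true, p_spam):
    """RMSE and RRSE of predicted spam probabilities against 0/1 labels.

    RRSE is taken relative to a baseline that always predicts the mean
    label of the evaluated data (a simplification of WEKA's definition).
    """
    n = len(y_true)
    mean_true = sum(y_true) / n
    sq_err = sum((p - t) ** 2 for t, p in zip(y_true, p_spam))
    baseline = sum((t - mean_true) ** 2 for t in y_true)
    rmse = sqrt(sq_err / n)
    rrse = sqrt(sq_err / baseline) if baseline else float("inf")
    return rmse, rrse


def roc_area(y_true, p_spam):
    """ROC area as the probability that a random spam message is scored
    higher than a random ham message (ties count as one half)."""
    pos = [p for t, p in zip(y_true, p_spam) if t == 1]
    neg = [p for t, p in zip(y_true, p_spam) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    labels = [1, 1, 1, 0, 0, 0, 0, 1]
    scores = [0.9, 0.8, 0.4, 0.2, 0.6, 0.1, 0.3, 0.7]
    predictions = [1 if s >= 0.5 else 0 for s in scores]
    print(classification_metrics(labels, predictions))
    print(error_metrics(labels, scores))
    print(roc_area(labels, scores))
```

Accuracy, Precision, Recall and F-Measure depend only on the hard spam/ham decisions, while Root Mean Squared Error, Root Relative Squared Error and ROC Area depend on the scores behind those decisions, which is why two classifiers with similar accuracy can still differ on the latter measures.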

Cite This Paper

Shafi’i Muhammad Abdulhamid, Maryam Shuaib, Oluwafemi Osho, Idris Ismaila, John K. Alhassan, "Comparative Analysis of Classification Algorithms for Email Spam Detection", International Journal of Computer Network and Information Security(IJCNIS), Vol.10, No.1, pp.60-67, 2018. DOI:10.5815/ijcnis.2018.01.07
