K-MLP Based Classifier for Discernment of Gratuitous Mails using N-Gram Filtration

Full Text (PDF, 874KB), PP.45-58


Author(s)

Harjot Kaur 1,*, Er. Prince Verma 1

1. CT Group of Institution/CSE, Jalandhar, 144041, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijcnis.2017.07.06

Received: 11 Jan. 2017 / Revised: 2 Apr. 2017 / Accepted: 11 May 2017 / Published: 8 Jul. 2017

Index Terms

E-Mail, Spam Filters, N-Gram feature selection, K-Means clustering algorithm, Multi-Layer Perceptron Neural Network (MLP-NN) algorithm, Support Vector Machine (SVM) algorithm

Abstract

Electronic spam is a serious problem on the internet, affecting organisations such as Google and Yahoo. Email spam consumes memory space, causes financial loss, degrades computation speed and power, and exposes authenticated account holders to several threats. It allows spammers to pose as legitimate account holders of these organisations in order to defraud victims of money and other useful information. It is therefore necessary to control the spread of spam and to develop an effective and efficient defence mechanism. In this research, we propose an efficient method for characterising spam emails that combines supervised and unsupervised approaches to boost the algorithm's performance. The study refines a supervised approach, the Multi-Layer Perceptron (MLP), using a fast and efficient unsupervised approach, K-Means, to detect spam emails, with the best features selected by the N-Gram technique. The proposed system shows higher accuracy and a lower error rate than the existing technique. It also shows a reduction in vague information when MLP is combined with the K-Means algorithm for selecting initial clusters. N-Gram selects the 100 best features from the data. Finally, the results are presented and the output of the proposed technique is compared with that of the existing technique.
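
To make the pipeline described in the abstract more concrete, the following is a minimal sketch of its first two stages: selecting the most frequent word N-Grams as features and grouping the resulting vectors with K-Means. It is an illustration under stated assumptions, not the authors' implementation. The abstract does not say how the best features are ranked, so ranking by raw frequency is an assumption, and the function names (word_ngrams, top_ngrams, vectorize, kmeans) and the sample messages are hypothetical. The MLP stage is omitted; in a hybrid of this kind the cluster assignments or centroids could be handed to a classifier as a starting point.

```python
from collections import Counter
import random


def word_ngrams(text, n=2):
    """Return the word n-grams of `text` (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(docs, n=2, k=100):
    """Pick the k most frequent n-grams over all documents.

    Assumption: frequency is the ranking criterion; the abstract only
    states that 100 best features are selected.
    """
    counts = Counter()
    for doc in docs:
        counts.update(word_ngrams(doc, n))
    return [gram for gram, _ in counts.most_common(k)]


def vectorize(doc, vocab, n=2):
    """Count how often each selected n-gram occurs in one document."""
    counts = Counter(word_ngrams(doc, n))
    return [float(counts[gram]) for gram in vocab]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k=2, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, labels


# Hypothetical toy messages: two spam-like, two legitimate.
docs = [
    "win free money now",
    "free money for you",
    "meeting at noon today",
    "project meeting at noon",
]
vocab = top_ngrams(docs, n=2, k=3)
points = [vectorize(d, vocab) for d in docs]
centroids, labels = kmeans(points, k=2)
```

On this toy input the two spam-like messages fall into one cluster and the two legitimate messages into the other. Real email data would need tokenisation, stop-word handling and a much larger vocabulary (the abstract mentions 100 features).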

Cite This Paper

Harjot Kaur, Er. Prince Verma, "K-MLP Based Classifier for Discernment of Gratuitous Mails using N-Gram Filtration", International Journal of Computer Network and Information Security (IJCNIS), Vol.9, No.7, pp.45-58, 2017. DOI: 10.5815/ijcnis.2017.07.06
