Developing an Efficient Text Pre-Processing Method with Sparse Generative Naive Bayes for Text Mining


Author(s)

Mrutyunjaya Panda 1,*

1. Department of Computer Science and Applications, Utkal University, Vani Vihar, Bhubaneswar-4, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijmecs.2018.09.02

Received: 3 Jun. 2018 / Revised: 26 Jul. 2018 / Accepted: 14 Aug. 2018 / Published: 8 Sep. 2018

Index Terms

Information Retrieval, Stemming, Tokenization, Stop-word, Sparse Generative Classifier, Naive Bayes, Accuracy, W-T-L

Abstract

With the explosive growth of the internet, a large amount of data is being collected in the form of text documents, which attracts many researchers to text mining. Traditional data mining methods struggle with the scale of text data. Such large-scale data can be handled by parallel computing frameworks such as Hadoop and MapReduce; however, these are not without their own challenges. On the other hand, Naive Bayes (NB) and its variant Multinomial Naive Bayes (MNB) play an important role in text mining owing to their simplicity and robustness, but if the number of words, documents, or labels grows beyond linear scaling, MNB becomes intractable and quickly runs out of memory on a single computer. Given the high-dimensional, sparse nature of documents in text datasets, a scalable sparse generative Naive Bayes (SGNB) classifier is also proposed to build a good text classification model. Unlike parallelization, SGNB reduces the time complexity non-linearly and is therefore expected to provide the best results. In this paper, an efficient Lovins stemmer, combined with Snowball-based stop-word removal and a word tokenizer, is proposed for text pre-processing. Extensive experiments conducted on well-known, publicly available text datasets demonstrate the effectiveness of the proposed approach in terms of accuracy, F-score, and time, in comparison with many baseline methods from the recent literature.
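The pipeline the abstract describes (tokenize, remove stop-words, build a sparse document-term matrix, train a multinomial Naive Bayes classifier) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: scikit-learn's `MultinomialNB` stands in for the sparse generative NB classifier, its built-in English stop-word list approximates the Snowball-based stop-word step, Lovins stemming is omitted, and the tiny corpus is hypothetical rather than drawn from the paper's datasets.

```python
# Sketch of a sparse text-classification pipeline (assumptions noted above).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative corpus (hypothetical, not from the paper's datasets).
docs = [
    "stocks rallied as markets closed higher",
    "the team won the championship game",
    "investors sold shares amid market fears",
    "players scored twice in the final match",
]
labels = ["finance", "sports", "finance", "sports"]

# Tokenization + stop-word removal; yields a sparse document-term matrix,
# matching the high-dimensional sparse representation the paper relies on.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # SciPy CSR sparse matrix

clf = MultinomialNB()  # stand-in for the sparse generative NB classifier
clf.fit(X, labels)

test = vectorizer.transform(["market shares fell sharply"])
print(clf.predict(test)[0])
```

Because the document-term matrix stays sparse end to end, memory grows with the number of non-zero entries rather than vocabulary size times document count, which is the property the paper exploits.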

Cite This Paper

Mrutyunjaya Panda, "Developing an Efficient Text Pre-Processing Method with Sparse Generative Naive Bayes for Text Mining", International Journal of Modern Education and Computer Science (IJMECS), Vol. 10, No. 9, pp. 11-19, 2018. DOI: 10.5815/ijmecs.2018.09.02
