Arabic Text Categorization Using Logistic Regression

Full Text (PDF, 543KB), pp. 71-78

Author(s)

Mayy M. Al-Tahrawi 1,*

1. Computer Science Department, Faculty of Information Technology, Al-Ahliyya Amman University, Amman, Jordan

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2015.06.08

Received: 11 Jul. 2014 / Revised: 9 Dec. 2014 / Accepted: 22 Feb. 2015 / Published: 8 May 2015

Index Terms

Logistic Regression, Arabic Text Categorization, Arabic Document Classification

Abstract

Several Text Categorization (TC) techniques and algorithms have been investigated in the limited research literature on Arabic TC. This research investigates Logistic Regression (LR) for Arabic TC; to the best of our knowledge, LR has not been used for Arabic TC before. Experiments are conducted on the Aljazeera Arabic News (Alj-News) dataset, which is preprocessed to handle the special nature of Arabic text. The experimental results show that the LR classifier is competitive with state-of-the-art Arabic TC algorithms: it recorded a precision of 96.5% on one category and above 90% on three of the five categories of the Alj-News dataset. Overall, LR recorded a macroaverage precision of 87%, a recall of 86.33% and an F-measure of 86.5%.

Cite This Paper

Mayy M. Al-Tahrawi, "Arabic Text Categorization Using Logistic Regression", International Journal of Intelligent Systems and Applications (IJISA), vol. 7, no. 6, pp. 71-78, 2015. DOI: 10.5815/ijisa.2015.06.08
