Combining Different Approaches to Improve Arabic Text Documents Classification

PP. 39-52

Author(s)

Ibrahim S. I. Abuhaiba 1,*, Hassan M. Dawoud 1

1. Computer Engineering Department, Islamic University, P. O. Box 108, Gaza, Palestine

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2017.04.05

Received: 5 Jun. 2016 / Revised: 19 Sep. 2016 / Accepted: 5 Dec. 2016 / Published: 8 Apr. 2017

Index Terms

Text classification, combining classifiers, fixed combining rules, stacking, boosting, bagging

Abstract

The objective of this research is to improve the classification of Arabic text documents by combining different classification algorithms. To achieve this objective, we build four models using different combination methods.
The first combined model is built using fixed combination rules. Five rules are used, and each rule is tested with different numbers of classifiers. The best classification accuracy, 95.3%, is achieved using the majority voting rule with seven classifiers; building this model takes 836 seconds.
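As an illustration of how fixed combination rules operate, the following minimal sketch combines the class posteriors of several classifiers. The abstract names only the majority voting rule; the other four rules shown here (product, mean, max, min) are an assumption based on the commonly used set, and the function name `combine` and the example probabilities are hypothetical.

```python
from collections import Counter
from math import prod

def combine(posteriors, rule):
    """Combine per-classifier class posteriors with a fixed rule.

    posteriors: one list of class probabilities per classifier.
    Returns the index of the winning class (ties go to the lowest index).
    """
    n_classes = len(posteriors[0])
    if rule == "majority":
        # Each classifier votes for its own most probable class.
        votes = Counter(max(range(n_classes), key=p.__getitem__) for p in posteriors)
        return max(range(n_classes), key=lambda c: votes[c])
    reducers = {
        "product": prod,
        "mean": lambda values: sum(values) / len(values),
        "max": max,
        "min": min,
    }
    # per_class[c] holds every classifier's probability for class c.
    per_class = list(zip(*posteriors))
    scores = [reducers[rule](values) for values in per_class]
    return max(range(n_classes), key=scores.__getitem__)

# Hypothetical posteriors from three classifiers over three classes.
posteriors = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.8, 0.1],
]
for rule in ("majority", "product", "mean", "max", "min"):
    print(rule, combine(posteriors, rule))
```

In this toy input, two of the three classifiers narrowly prefer class 0 while the third strongly prefers class 1, so the vote-based rule and the probability-based rules can disagree.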
The second combination approach is stacking, which consists of two stages of classification: the first is performed by base classifiers and the second by a meta classifier. In our experiments, we used different numbers of base classifiers and two different meta classifiers: Naïve Bayes and linear regression. Stacking achieved very high classification accuracy, 99.2% and 99.4%, using Naïve Bayes and linear regression as meta classifiers, respectively. Because it involves two stages of learning, stacking takes a long time to build its models: 1963 seconds using Naïve Bayes and 3718 seconds using linear regression.
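The two-stage structure of stacking can be sketched as follows. This is a toy illustration, not the authors' Weka or RapidMiner setup: the three keyword-based base classifiers, the held-out documents, and the class `NaiveBayesMeta` are all hypothetical, and the meta classifier is a small categorical Naïve Bayes with Laplace smoothing trained on the base classifiers' predictions.

```python
from collections import Counter, defaultdict
from math import log

class NaiveBayesMeta:
    """Categorical Naive Bayes over base-classifier predictions (Laplace-smoothed)."""

    def fit(self, meta_features, labels):
        self.classes = sorted(set(labels))
        self.values = sorted({v for row in meta_features for v in row})
        self.class_counts = Counter(labels)
        self.counts = defaultdict(int)  # (position, predicted value, true label) -> count
        for row, y in zip(meta_features, labels):
            for i, v in enumerate(row):
                self.counts[(i, v, y)] += 1
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())

        def log_posterior(y):
            score = log(self.class_counts[y] / total)
            for i, v in enumerate(row):
                smoothed = self.counts[(i, v, y)] + 1
                score += log(smoothed / (self.class_counts[y] + len(self.values)))
            return score

        return max(self.classes, key=log_posterior)

# Stage 1: hypothetical, already-trained base classifiers.
def base_a(doc):
    return "sport" if "match" in doc else "economy"

def base_b(doc):
    return "sport" if "goal" in doc else "economy"

def base_c(doc):
    return "sport"  # deliberately uninformative

BASE = [base_a, base_b, base_c]

# Held-out documents (assumed unseen by the base classifiers) for the meta level.
held_out = [
    ("match goal", "sport"),
    ("match report", "sport"),
    ("market goal", "economy"),
    ("market prices", "economy"),
]

# Stage 2: the meta classifier learns from the base classifiers' outputs.
meta_features = [[clf(doc) for clf in BASE] for doc, _ in held_out]
meta = NaiveBayesMeta().fit(meta_features, [label for _, label in held_out])

def stack_predict(doc):
    return meta.predict([clf(doc) for clf in BASE])

print(stack_predict("bank goal"))
```

For "bank goal", two of the three base classifiers say "sport", so a plain vote would pick that label; the meta classifier has learned that `base_a` is the reliable one and follows it instead.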
The third model uses AdaBoost to boost a C4.5 classifier with different numbers of iterations. Boosting improves the classification accuracy of the C4.5 classifier: with 5 iterations the accuracy is 95.3% and the model takes 1175 seconds to build, while with 10 iterations the accuracy is 99.5% and the model takes 1966 seconds to build.
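The reweighting idea behind AdaBoost can be sketched in a few lines. To keep the example short, it boosts one-dimensional decision stumps rather than C4.5 trees and uses binary labels of +1 and -1; the data and function names are hypothetical, so this is an illustration of the technique, not of the experimental setup.

```python
from math import exp, log

def stump_predict(threshold, polarity, x):
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(threshold, polarity, x) != y
            )
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, iterations):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(iterations):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * log((1 - err) / err)     # weight of this weak learner
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified examples, decrease the rest.
        weights = [
            w * exp(-alpha * y * stump_predict(threshold, polarity, x))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(t, p, x) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single stump can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]

single = adaboost(xs, ys, 1)
boosted = adaboost(xs, ys, 3)
print([predict(single, x) for x in xs])
print([predict(boosted, x) for x in xs])
```

On this toy data a single stump misclassifies part of the training set, while three boosting rounds fit it exactly, which mirrors the way additional iterations raise accuracy at the cost of longer build time.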
The fourth model uses bagging with a decision tree. The accuracy is 93.7%, achieved in 296 seconds, when using 5 iterations, and 99.4%, achieved in 471 seconds, when using 10 iterations.
We used three datasets to test the combined models: BBC Arabic, CNN Arabic, and OSAC. The experiments were performed using the Weka and RapidMiner data mining tools on a platform with a 2.2 GHz Intel Core i3 CPU and 4 GB of RAM.
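The bagging procedure used in the fourth model can be illustrated with a minimal sketch: each iteration trains a model on a bootstrap sample of the training data, and the final label is chosen by majority vote. The example assumes a one-dimensional hypothetical feature and uses a single-split decision stump in place of a full decision tree; all names and data are illustrative.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-split tree on (x, label) pairs.

    Returns (threshold, left_label, right_label) with the fewest errors.
    """
    labels = sorted({y for _, y in sample})
    best = None
    for threshold in sorted({x for x, _ in sample}):
        for left in labels:
            for right in labels:
                errors = sum(
                    (right if x >= threshold else left) != y for x, y in sample
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, left, right)
    return best[1:]

def bagging(data, n_models, rng):
    models = []
    for _ in range(n_models):
        # Bootstrap sample: draw len(data) examples with replacement.
        sample = rng.choices(data, k=len(data))
        models.append(train_stump(sample))
    return models

def predict(models, x):
    votes = Counter(right if x >= t else left for t, left, right in models)
    return votes.most_common(1)[0][0]

# Hypothetical, well-separated training data.
data = [(x, "sport") for x in range(20)] + [(x, "economy") for x in range(100, 120)]

models = bagging(data, 5, random.Random(0))
print(predict(models, 0), predict(models, 119))
```

Passing an explicit `random.Random` instance keeps the bootstrap sampling reproducible; the number of models plays the role of the iteration count mentioned above.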
The results of all models show that combining classifiers can effectively improve the accuracy of Arabic text document classification.

Cite This Paper

Ibrahim S. I. Abuhaiba, Hassan M. Dawoud, "Combining Different Approaches to Improve Arabic Text Documents Classification", International Journal of Intelligent Systems and Applications (IJISA), Vol. 9, No. 4, pp. 39-52, 2017. DOI: 10.5815/ijisa.2017.04.05
