Text Classification based on Discriminative-Semantic Features and Variance of Fuzzy Similarity

Full Text (PDF, 415 KB), pp. 26-39

Author(s)

Pouyan Parsafard 1, Hadi Veisi 2, Niloofar Aflaki 3, Siamak Mirzaei 4,*

1. Kish International Campus, University of Tehran, Kish, Iran

2. Faculty of New Sciences and Technologies (FNST), University of Tehran, Tehran, Iran

3. Geoinformatics Collaboratory and School of Natural and Computational Sciences, Massey University, Auckland, New Zealand

4. College of Science and Engineering, Flinders University, South Australia

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2022.02.03

Received: 24 Sep. 2021 / Revised: 7 Nov. 2021 / Accepted: 2 Dec. 2021 / Published: 8 Apr. 2022

Index Terms

Persian topic identification, Discriminative features, Semantic similarities, Fuzzy similarities, Natural language processing

Abstract

Due to the rapid growth of the Internet, large amounts of unlabelled textual data are produced daily. Clearly, identifying the subject of a text document is a primary source of information in text processing applications. In this paper, a text classification method is presented and evaluated for Persian and English. The proposed technique combines the variance of fuzzy similarity with discriminative and semantic feature selection methods. Discriminative features are those that distinguish categories most strongly, while semantic features bring the similarity between features and documents into the calculation using only the available documents. The proposed method also incorporates fuzzy weighting as a measure of similarity. The fuzzy weights are derived from the concept of fuzzy similarity, which is defined as the variance of a document's membership values across all categories: a document may belong to several categories at the same time, each with some membership value, and these membership values must sum to 1. The proposed method is evaluated on three datasets (one Persian and two English) using two classifiers, a support vector machine (SVM) and an artificial neural network (ANN). Comparison with other text classification methods demonstrates the consistent superiority of the proposed technique in all cases. The weighted average F-measure of our method is 82% and 97.8% in the classification of Persian and English documents, respectively.
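
The variance-of-membership idea can be illustrated with a minimal sketch. The abstract does not say how membership values are obtained, so the code below assumes they come from normalizing non-negative per-category similarity scores; the function names `membership_values` and `fuzzy_weight` are hypothetical and are not taken from the paper.

```python
from statistics import pvariance


def membership_values(similarities):
    """Normalize non-negative per-category scores so that they sum to 1.

    Assumption: memberships are obtained by simple normalization; the
    abstract only states that they must sum to 1.
    """
    if not similarities:
        raise ValueError("at least one category score is required")
    total = sum(similarities)
    if total <= 0:
        # No evidence for any category: fall back to uniform membership.
        return [1.0 / len(similarities)] * len(similarities)
    return [s / total for s in similarities]


def fuzzy_weight(similarities):
    """Population variance of a document's membership values."""
    return pvariance(membership_values(similarities))


# A document spread evenly over four categories has zero variance,
# while a document belonging to a single category has the largest variance.
even = fuzzy_weight([1.0, 1.0, 1.0, 1.0])
peaked = fuzzy_weight([1.0, 0.0, 0.0, 0.0])
print(even, peaked)
```

Under this assumption, a higher variance indicates that a document's membership is concentrated in few categories, which is one plausible reading of why the variance is usable as a weight.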

Cite This Paper

Pouyan Parsafard, Hadi Veisi, Niloofar Aflaki, Siamak Mirzaei, "Text Classification based on Discriminative-Semantic Features and Variance of Fuzzy Similarity", International Journal of Intelligent Systems and Applications (IJISA), Vol.14, No.2, pp.26-39, 2022. DOI: 10.5815/ijisa.2022.02.03
