Influence of GUJarati STEmmeR in Supervised Learning of Web Page Categorization

Full Text (PDF, 380KB), PP.23-34


Author(s)

Chandrakant D. Patel 1,* Jayesh M. Patel 2

1. Hemchandracharya North Gujarat University, Patan, Gujarat, India

2. Acharya Motibhai Patel Institute of Computer Studies, Ganpat University, Kherva, Gujarat, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2021.03.03

Received: 21 Nov. 2020 / Revised: 2 Jan. 2021 / Accepted: 19 Feb. 2021 / Published: 8 Jun. 2021

Index Terms

Stemming, Gujarati Language, Supervised algorithms, Machine Learning, Accuracy

Abstract

With the large quantity of information available online, it is essential to retrieve the correct information for a user query. A large amount of data is available in digital form in multiple languages. Various approaches aim to increase the effectiveness of online information retrieval, but the standard approach to answering a user query is to search the documents in the corpus word by word for the given query. This approach is very time-consuming and may miss many related documents that are equally important. To avoid these issues, stemming has been used extensively in numerous Information Retrieval Systems (IRS) to improve retrieval accuracy across languages. This paper addresses the problem of stemming in Web Page Categorization for the Gujarati language, deriving stem words with the GUJSTER algorithm [1]. The GUJSTER algorithm is based on morphological rules that derive the root or stem word from inflected words of the same class. In particular, we consider the influence of the extracted stem or root word on the integrity of web page classification using supervised machine learning algorithms. This research work focuses on the analysis of Web Page Categorization (WPC) for the Gujarati language and on verifying the influence of a stemming algorithm in a WPC application for Gujarati, with improved accuracy ranging from 63% to 98% across supervised machine learning models using a standard split of 80% for training and 20% for testing.
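The idea of rule-based stemming described in the abstract can be illustrated with a minimal Python sketch. It is not the GUJSTER rule set: the suffix list, the function names `stem` and `stem_tokens`, and the `min_stem_len` threshold are assumptions chosen only to show how stripping common Gujarati case and plural markers maps inflected forms to a shared stem before features are passed to a classifier.

```python
# Illustrative suffix list (a few common Gujarati case/plural markers).
# This is an assumption for demonstration, NOT the GUJSTER rules.
SUFFIXES = ["માં", "થી", "નો", "ની", "નું", "ના", "ને", "ઓ"]


def stem(word, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len code points."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applied


def stem_tokens(tokens):
    """Apply the stemmer to every token of a tokenized page."""
    return [stem(token) for token in tokens]


if __name__ == "__main__":
    # Inflected forms of "ઘર" (house) collapse to one feature.
    print(stem_tokens(["ઘરમાં", "ઘરથી", "ઘરનો"]))
```

In a categorization pipeline, such a function would typically run after tokenization and before feature weighting, so that inflected variants of a word count toward the same term.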

Cite This Paper

Chandrakant D. Patel, Jayesh M. Patel, "Influence of GUJarati STEmmeR in Supervised Learning of Web Page Categorization", International Journal of Intelligent Systems and Applications(IJISA), Vol.13, No.3, pp.23-34, 2021. DOI:10.5815/ijisa.2021.03.03

Reference

[1]C. D. Patel and J. M. Patel, “GUJSTER: a Rule based stemmer using Dictionary Approach,” IEEE International Conference on Inventive Communication and Computational Technologies (ICICCT 2017), pp. 496–499, 2017.
[2]D. Harman, “How effective is suffixing?,” Journal of the American Society for Information Science, vol. 42, no. 1, pp. 7–15, 1991.
[3]P. Majumder, M. Mitra, S. Parui, and G. Kole, “YASS: Yet Another Suffix Stripper,” ACM Transactions on Information Systems, vol. 25, pp. 18–37, 2007.
[4]W. B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures & Algorithms. 2004.
[5]W. B. Croft and J. Xu, “Corpus-specific stemming using word form co-occurrence,” pp. 147–159, 1995.
[6]J. Anjali Ganesh, “A Comparative Study of Stemming Algorithms,” IJCTA, vol. 2, no. 2004, pp. 1930–1938, 2011.
[7]Neha Garg, R.K. Gupta,"Exploration of Various Clustering Algorithms for Text Mining", International Journal of Education and Management Engineering, Vol.8, No.4, pp.10-18, 2018.
[8]C. D. Patel and J. M. Patel, “Improving a Lightweight Stemmer for Gujarati Language,” International Journal of Information Sciences and Techniques, vol. 6, no. 1, pp. 135–142, 2016.
[9]C. D. Patel and J. M. Patel, “A Review of Indian and Non-Indian Stemming : A focus on Gujarati Stemming Algorithms,” International Journal of Advanced Research, vol. 3, no. 12, pp. 1701–1706, 2015.
[10]V. Gupta and G. S. Lehal, “A survey of common stemming techniques and existing stemmers for indian languages,” Journal of Emerging Technologies in Web Intelligence, vol. 5, no. 2, pp. 157–161, 2013.
[11]J. B. Lovins, “Development of a stemming algorithm,” Mechanical Translation and Computational Linguistics, vol. 11, no. June, pp. 22–31, 1968.
[12]J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets. Cambridge: Cambridge University Press, 2014.
[13]J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd Edition. Morgan Kaufmann Publishers, 2012.
[14]E. Riloff, “Little words can make a big difference for text classification,” SIGIR Forum (ACM Special Interest Group on Information Retrieval), pp. 130–136, 1995.
[15]M. Spitters, “Comparing feature sets for learning text categorization,” Proceedings of the Sixth Conference on Content-Based Multimedia Access (RIAO 2002), pp. 1124–1135, 2000.
[16]T. Gaustad and G. Bouma, “Accurate Stemming of Dutch for Text Classification,” Computational Linguistics in the Netherlands 2001, pp. 1–14, 2016.
[17]A. M. Cohen, J. Yang, and W. R. Hersh, “Retrieval of Biomedical Documents,” Medical Informatics, pp. 1–9, 2004.
[18]M. Panda, “Developing an Efficient Text Pre-Processing Method with Sparse Generative Naive Bayes for Text Mining,” International Journal of Modern Education and Computer Science, vol. 10, no. 9, pp. 11–19, 2018.
[19]M. A. Hafer and S. F. Weiss, “Word Segmentation by Letter Successor Varieties,” Information Storage and Retrieval, vol. 10, pp. 371–385, 1974.
[20]M. Porter, “Snowball: A language for stemming algorithms,” http://snowball.tartarus.org/texts/introduction.html, pp. 1–15, 2001.
[21]M. F. Porter, “An algorithm for suffix stripping,” Program, vol. 14, no. 3, pp. 130–137, 1980.
[22]D. Harman, “How effective is suffixing?,” Journal of the American Society for Information Science, vol. 42, no. 1, pp. 7–15, 1991.
[23]J. L. Dawson, “Suffix Removal and Word Conflation,” ALLC Bulletin, vol. 2, no. 3, pp. 33–46, 1974.
[24]C. D. Paice, “An Evaluation Method for Stemming Algorithms,” In Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 42–50, 1994.
[25]R. Krovetz, “Viewing Morphology as an Inference Process,” 16th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 191–202, 1993.
[26]J. Allan and G. Kumaran, “Stemming in the language modeling framework,” CIIR Technical Report, vol. IR-289, no. June, p. 455, 2003.
[27]N. L. Bhamidipati and S. K. Pal, “Stemming via distribution-based word segregation for classification and retrieval,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 37, no. 2, pp. 350–360, 2007.
[28]V. Gurusamy and S. K. K. Nandhini, “Performance Analysis : Stemming Algorithm for the English Language,” IJSRD - International Journal for Scientific Research & Development, vol. 5, no. 05, pp. 1933–1938, 2017.
[29]A. Ramanathan and D. D. Rao, “A Lightweight Stemmer for Hindi,” Proceedings of the EACL 2003 Workshop on Computational Linguistics for South Asian Languages, pp. 43–48, 2003.
[30]L. S. Larkey, M. E. Connell, and N. Abduljaleel, “Hindi CLIR in thirty days,” ACM Transactions on Asian Language Information Processing, vol. 2, no. 2, pp. 130–142, 2003.
[31]A. K. Pandey and T. J. Siddiqui, “An unsupervised hindi stemmer with heuristic improvements,” Proceedings of SIGIR 2008 Workshop on Analytics for Noisy Unstructured Text Data, AND’08, pp. 99–105, 2008.
[32]N. Aswani and R. Gaizauskas, “Developing Morphological Analysers for South Asian Languages: Experimenting with the Hindi and Gujarati Languages.,” pp. 811–815, 2010.
[33]U. Mishra and C. Prakash, “MAULIK: An Effective Stemmer for Hindi Language,” International Journal on Computer Science and Engineering, vol. 4, no. 5, pp. 711–717, 2012.
[34]A. Jain and S. Das, “Hindi stemmer @ fire-2013,” ACM International Conference Proceeding Series, pp. 4–6, 2013.
[35]V. Gupta, “Hindi Rule Based Stemmer for Nouns,” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 4, no. 1, pp. 1–4, 2014.
[36]R. Joon and A. Singhal, “Analysis of MWES in Hindi Text Using NLTK,” International Journal on Natural Language Computing, vol. 6, no. 1, pp. 13–22, 2017.
[37]P. Patel, K. Popat, and P. Bhattacharyya, “Hybrid Stemmer for Gujarati,” Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (WSSANLP),the 23rd International Conference on Computational Linguistics (COLING), Beijing, no. August, pp. 51–55, 2010.
[38]K. Suba, D. Jiandani, and P. Bhattacharyya, “Hybrid Inflectional Stemmer and Rule-based Derivational Stemmer for Gujarati,” Proceedings of the 2nd Workshop on South Southeast Asian Natural Language Processing (WSSANLP), pp. 1–8, 2011.
[39]J. Ameta, N. Joshi, and I. Mathur, “A Lightweight Stemmer for Gujarati,” In Proceedings of 46th Annual National Convention of Computer Society of India., 2011.
[40]J. R. Sheth and B. C. Patel, “Stemming Techniques and Naïve Approach for Gujarati Stemmer,” International Conference in Recent Trends in Information Technology and Computer Science (ICRTITCS - 2012) Proceedings published in International Journal of Computer Applications, pp. 9–11, 2012.
[41]J. Sheth and B. Patel, “Dhiya: A stemmer for morphological level analysis of Gujarati language,” Proceedings of the 2014 International Conference on Issues and Challenges in Intelligent Computing Techniques, ICICT 2014, pp. 151–154, 2014.
[42]B. Dalwadi and N. Desai, “An Affix Removal Stemmer for Gujarati Text,” IEEE, pp. 2296–2299, 2016.
[43]M. Patel, “A Suffix Stripper for Gujarati Noun,” Discovery, vol. 47, no. 218, pp. 95–101, 2015.
[44]Sharvari S. Govilkar, J. W. Bakal, Sagar R. Kulkarni,"Extraction of Root Words using Morphological Analyzer for Devanagari Script", International Journal of Information Technology and Computer Science, Vol.8, No.1, pp.33-39, 2016.
[45]X. Qi and B. D. Davison, “Web page classification: Features and algorithms,” ACM Computing Surveys, vol. 41, no. 2, p. 31, 2009.
[46]P. Bhati, Hand book of Gujarati Grammer. 1889.
[47]G. of Gujarat, ભાષા વિવેક, Second. Gandhinagar: ભાષા નિયામકની કચેરી, 2010.
[48]Mavajibhai, વ્યાકરણ પરિચય. 2010.
[49]G. Grefenstette and P. Tapanainen, “What is a Word, What is sentence? Problems of Tokenization,” Rank Xerox Research Centre, vol. 3, p. 9, 1994.
[50]L. H. Patil and M. Atique, “A novel approach for feature selection method TF-IDF in document clustering,” 2013 3rd IEEE International Advance Computing Conference (IACC), pp. 858–862, 2013.
[51]M. A. Hafer and S. F. Weiss, “Word segmentation by letter successor varieties,” Information Storage and Retrieval, vol. 10, no. 11–12, pp. 371–385, 1974.
[52]L. Hao and L. Hao, “Automatic identification of stop words in chinese text classification,” Proceedings - International Conference on Computer Science and Software Engineering, CSSE 2008, vol. 1, pp. 718–722, 2008.
[53]T. K. Ho, “Fast identification of stop words for font learning and keyword spotting,” Proceedings of the Fifth International Conference on Document Analysis and Recognition, ICDAR ’99 (Cat. No.PR00318), 1999.
[54]Pareek, Jyoti and J. H. Pareek Jyoti, “Evaluation of some Information Retrieval models for Gujarati Ad hoc Monolingual Tasks,” VNSGU Journal of Science & Technology, vol. 3, no. 2, pp. 176–181, 2012.