Extraction of Root Words using Morphological Analyzer for Devanagari Script

pp. 33-39


Author(s)

Sharvari S. Govilkar 1,*, J. W. Bakal 2, Sagar R. Kulkarni 2

1. Department of Information Technology, TSEC, Mumbai, India

2. Department of Computer Engineering, PIIT, New Panvel, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2016.01.04

Received: 23 Apr. 2015 / Revised: 1 Aug. 2015 / Accepted: 15 Sep. 2015 / Published: 8 Jan. 2016

Index Terms

Morphological analyzer, text mining, tokenization, stop words in Devanagari, suffixes in Devanagari, stemming, removing inflections using rules

Abstract

In India, more than 300 million people use the Devanagari script for documentation. The two main languages written in Devanagari are Marathi, the primary language of the state of Maharashtra, and Hindi, an official language of India. Compared with English, languages written in Devanagari are rich in morphemes, so lemmatization is considerably more complex than for English. Resources such as WordNet, ontology representations, keyword parsers, and part-of-speech taggers are scarce for Devanagari, which makes information retrieval complex and time-consuming. Words in Devanagari documents carry suffixes that can hinder accurate information retrieval. We propose a method for extracting root words from Devanagari documents that can be used for information retrieval, text summarization, text categorization, ontology building, and similar tasks. To this end, we design a morphological analyzer for Devanagari script and build a corpus of more than 3000 possible stop words and suffixes for the Marathi language. The morphological analyzer can act as a preliminary stage for any information retrieval application in Devanagari script. In experiments on randomly selected Marathi documents, the analyzer achieved an accuracy of up to 96%.
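
The pipeline named in the index terms (tokenization, stop-word removal, rule-based suffix stripping) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the stop-word and suffix lists below are tiny hypothetical samples standing in for the 3000-entry corpus, and the names `tokenize`, `strip_suffix`, `extract_roots`, and `MIN_STEM_LENGTH` are assumptions introduced only for illustration.

```python
import re
import unicodedata

# Hypothetical sample lists; the paper's corpus holds 3000+ entries.
STOP_WORDS = {"आणि", "आहे", "हे", "व"}
SUFFIXES = ["मध्ये", "साठी", "ला", "ने", "ची", "चा", "चे"]

# Try longer suffixes first so "मध्ये" wins over any shorter overlap.
SUFFIXES_BY_LENGTH = sorted(SUFFIXES, key=len, reverse=True)

# Assumed guard: never strip a suffix if fewer than 2 code points remain.
MIN_STEM_LENGTH = 2

# Runs of Devanagari code points (U+0900-U+097F), excluding the
# danda punctuation marks U+0964 and U+0965.
TOKEN_PATTERN = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")


def tokenize(text):
    """Split text into Devanagari tokens after Unicode NFC normalization."""
    return TOKEN_PATTERN.findall(unicodedata.normalize("NFC", text))


def strip_suffix(word):
    """Remove the longest matching suffix, if a long enough stem remains."""
    for suffix in SUFFIXES_BY_LENGTH:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if len(stem) >= MIN_STEM_LENGTH:
                return stem
    return word


def extract_roots(text):
    """Tokenize, drop stop words, then strip suffixes from what is left."""
    return [strip_suffix(tok) for tok in tokenize(text) if tok not in STOP_WORDS]


if __name__ == "__main__":
    print(extract_roots("मुंबईमध्ये नदीला पूर आला आणि शाळा बंद आहे."))
```

In this sketch the minimum stem length keeps a short verb form such as "आला" intact even though it ends in the suffix "ला". Plain suffix stripping does not restore oblique stems, which many inflected Marathi nouns take, so a fuller analyzer would need additional rewrite rules after the stripping step.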

Cite This Paper

Sharvari S. Govilkar, J. W. Bakal, Sagar R. Kulkarni, "Extraction of Root Words using Morphological Analyzer for Devanagari Script", International Journal of Information Technology and Computer Science (IJITCS), Vol. 8, No. 1, pp. 33-39, 2016. DOI: 10.5815/ijitcs.2016.01.04
