A Semi-Automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

Author(s)

Noureddine Doumi 1,*, Ahmed Lehireche 2, Denis Maurel 3, Ahmed Abdelali 4

1. Computer Science Dept., University of Saïda, Algeria

2. Computer Science Dept., University of SBA, Algeria

3. Université François Rabelais Tours, LI computer laboratory, France

4. Qatar Computing Research Institute, Qatar

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2016.02.01

Received: 20 Apr. 2015 / Revised: 11 Aug. 2015 / Accepted: 7 Oct. 2015 / Published: 8 Feb. 2016

Index Terms

Arabic NLP, Arabic linguistic resources, Arabic verbs, Finite state transducers, Unitex

Abstract

This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The method is low-cost, time-efficient, scalable, and extensible: it is incremental both in the resources it processes and in the lexicons it generates. Starting from a corpus, tokens are first extracted and lemmatized. Second, finite-state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One strength of the algorithm is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experiments use a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The resulting open lexical resources have high coverage: they cover more than 70% of Arabic verbs. The resources contain 16,855 verb lemmas and 11,080,355 fully vocalized, partially vocalized, and unvocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.
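The lemma-to-inflected-form step described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm or the Unitex implementation: the names `Transition`, `PERFECTIVE_SOUND`, `inflect`, and `build_lexicon` are hypothetical, the forms are given in Latin transliteration, and the toy paradigm covers only five singular perfective forms of a sound verb, whereas the generated transducers reach 184 transitions.

```python
from typing import Dict, List, NamedTuple, Tuple


class Transition(NamedTuple):
    """One path of a toy inflection transducer (hypothetical structure)."""
    prefix: str
    suffix: str
    features: str


# Toy paradigm: singular perfective of a sound verb, in transliteration.
# This is an illustrative assumption, not one of the paper's 171 transducers.
PERFECTIVE_SOUND: List[Transition] = [
    Transition("", "a", "3ms"),
    Transition("", "at", "3fs"),
    Transition("", "ta", "2ms"),
    Transition("", "ti", "2fs"),
    Transition("", "tu", "1s"),
]


def inflect(lemma: str, paradigm: List[Transition]) -> List[Tuple[str, str]]:
    """Apply every transition of the paradigm to a lemma.

    The lemma is assumed to be the citation form (3rd person masculine
    singular perfective, e.g. "kataba"), so the stem is the lemma without
    its final "a".
    """
    if not lemma.endswith("a"):
        raise ValueError(f"unexpected citation form: {lemma!r}")
    stem = lemma[:-1]
    return [(t.prefix + stem + t.suffix, t.features) for t in paradigm]


def build_lexicon(
    lemmas: List[str], paradigm: List[Transition]
) -> Dict[str, List[Tuple[str, str]]]:
    """Map each lemma to its list of (inflected form, features) pairs."""
    return {lemma: inflect(lemma, paradigm) for lemma in lemmas}


if __name__ == "__main__":
    for form, features in inflect("kataba", PERFECTIVE_SOUND):
        print(form, features)
```

Because the lexicon is keyed by lemma, new lemmas can be added by calling `build_lexicon` on them and merging the result, which mirrors in a very simplified way the incremental character the abstract claims for the method.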

Cite This Paper

Noureddine Doumi, Ahmed Lehireche, Denis Maurel, Ahmed Abdelali, "A Semi-Automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs", International Journal of Information Technology and Computer Science (IJITCS), Vol.8, No.2, pp.1-13, 2016. DOI: 10.5815/ijitcs.2016.02.01
