Noureddine Doumi

Work place: Computer Science Dept., University of Saïda, Algeria

E-mail: noureddine.doumi@univ-saida.dz

Research Interests: Computational Learning Theory

Biography

Noureddine Doumi is currently an assistant professor in the computer science department at Tahar Moulay University of Saida. He received his Magister degree in computer science from the University of Sidi-Bel-Abbes in 2005. He is a member of the EEDIS lab at UDL-SBA and an active developer in the Unitex/GramLab project at the University of Paris-Est Marne-La-Vallée. His research interests include Arabic NLP, linguistic resources, finite-state machines and machine learning.

Author Articles
A Semi-Automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

By Noureddine Doumi, Ahmed Lehireche, Denis Maurel, Ahmed Abdelali

DOI: https://doi.org/10.5815/ijitcs.2016.02.01, Pub. Date: 8 Feb. 2016

This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low-cost and time-efficient, in addition to being scalable and extensible. Its extensibility is reflected in the method's ability to be incremental in both respects: processing resources and generating lexicons. The method starts from a corpus. First, tokens are drawn from the corpus and lemmatized. Second, finite-state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One strength of the algorithm is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experiments use a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The coverage of the resulting open lexical resources is high: they cover more than 70% of Arabic verbs. The built resources contain 16,855 verb lemmas and 11,080,355 fully vocalized, partially vocalized and unvocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.
