Denis Maurel

Work place: Université François Rabelais Tours, LI computer laboratory, France

E-mail: denis.maurel@univ-tours.fr

Research Interests: Data Structures and Algorithms

Biography

Denis Maurel was born in 1956. He received his Ph.D. in computer science from the University Paris 7 (France) in 1989. He is qualified in both computer science and linguistics. He has been a Professor at the Université François Rabelais at Tours (France) since 2000. He is Head of the BdTln (Database and NLP) team of the LI (Computer Science Laboratory) at this university and a member of the steering committee of the CIAA conferences. His fields of interest are NLP (with a focus on named entities, morphology and resources) and finite-state machines.

Author Articles
A Semi-Automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

By Noureddine Doumi, Ahmed Lehireche, Denis Maurel, Ahmed Abdelali

DOI: https://doi.org/10.5815/ijitcs.2016.02.01, Pub. Date: 8 Feb. 2016

This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low cost and time efficient, as well as scalable and extensible. Its extensibility shows in the method being incremental in two respects: processing resources and generating lexicons. The method works from a corpus in three steps. First, tokens are drawn from the corpus and lemmatized. Second, finite state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One of the algorithm's strengths is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experiments use a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The resulting open lexical resources have high coverage: they cover more than 70% of Arabic verbs. The built resources contain 16,855 verb lemmas and 11,080,355 inflected verb forms, which are fully vocalized, partially vocalized or unvocalized. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.
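As an illustration of the final step only (using a transducer to enumerate inflected forms together with their features), here is a minimal sketch. It is not the authors' algorithm or the Unitex transducer format: the function name `generate_forms`, the transition layout, and the toy English suffixes are all assumptions chosen to keep the example readable, and the transducer is assumed to be acyclic.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: state -> list of (output string, feature tag, next state).
Transitions = Dict[int, List[Tuple[str, str, int]]]

# Toy, English-like inflection transducer (an assumption for illustration,
# not the Arabic inflection scheme described in the abstract).
TOY_FST: Transitions = {
    0: [("", "PRES", 1), ("ed", "PAST", 2)],
    1: [("s", "3SG", 2), ("", "NON3SG", 2)],
}


def generate_forms(
    stem: str, transitions: Transitions, start: int, finals: Set[int]
) -> List[Tuple[str, str]]:
    """Enumerate every path from `start` to a final state.

    Each path yields one (surface form, feature string) pair.
    Assumes the transducer has no cycles, so the walk terminates.
    """
    results: List[Tuple[str, str]] = []
    stack: List[Tuple[int, str, List[str]]] = [(start, stem, [])]
    while stack:
        state, surface, feats = stack.pop()
        if state in finals:
            results.append((surface, "+".join(feats)))
        for output, tag, nxt in transitions.get(state, []):
            stack.append((nxt, surface + output, feats + [tag]))
    return sorted(results)


if __name__ == "__main__":
    for form, features in generate_forms("walk", TOY_FST, 0, {2}):
        print(form, features)
```

Each lemma shares the same transducer here, which hints at why generating transducers per inflection class, rather than listing forms by hand, scales to large lexicons.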

Other Articles