ADPBC: Arabic Dependency Parsing Based Corpora for Information Extraction

Full Text (PDF, 400KB), PP.54-61


Author(s)

Sally Mohamed 1,2,* Mahmoud Hussien 3 Hamdy M. Mousa 3

1. Computer Science Department, Faculty of Computer and Information, Menoufia University, Egypt

2. Higher Institute of Engineering and Technology, Tanta, Egypt

3. Computer Science Department, Faculty of Computer and Information, Menoufia University

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2021.01.04

Received: 4 Jun. 2020 / Revised: 13 Aug. 2020 / Accepted: 6 Sep. 2020 / Published: 8 Feb. 2021

Index Terms

Dependency parsing, Arabic corpora, information extraction

Abstract

The World Wide Web holds a massive amount of varied information and data, and the number of Arabic users and the volume of Arabic content are growing rapidly. Information extraction is essential for accessing and sorting data on the web, yet it remains a challenge, especially for languages with complex morphology such as Arabic. Consequently, the current trend is to build new corpora that make information extraction easier and more precise. This paper presents a linguistically analyzed Arabic corpus that includes dependency relations. The collected data covers five domains: sports, religion, weather, news, and biomedicine. The output is in the CoNLL universal lattice file format (CoNLL-UL). The corpus contains an index of the sentences and their linguistic metadata to enable quick mining and search across the corpus. It has seventeen morphological annotations and eight features based on the identification of textual structures, which help to recognize and understand the grammatical characteristics of the text and to derive the dependency relations. Parsing and dependency annotation were performed with the universal dependency model and corrected manually. The designed Arabic corpus helps users quickly obtain linguistic annotations for a text and makes information extraction techniques easier and clearer to learn. The results show an average improvement in the dependency relations of the corpus.
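
To make the idea of an indexed, dependency-annotated corpus concrete, the following is a minimal sketch of how such a file could be read and searched. It is not the authors' implementation. It assumes the plain ten-column CoNLL-U layout rather than the CoNLL-UL lattice layout the paper uses, and the sample sentence, the function names (`parse_conllu`, `build_index`, `dependency_edges`), and the placeholder tokens are all hypothetical.

```python
from collections import defaultdict

# The ten standard CoNLL-U columns. Assumption: CoNLL-UL lattice files
# use a different layout, which this sketch does not handle.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Hypothetical two-token sample with placeholder words,
# not taken from the ADPBC corpus.
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tfoo\tfoo\tNOUN\t_\tNumber=Sing\t2\tnsubj\t_\t_\n"
    "2\tbar\tbar\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)


def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, tokens = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                sentences.append(tokens)
                tokens = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment line
        cols = line.split("\t")
        # Skip malformed rows, multiword ranges (1-2) and empty nodes (1.1).
        if len(cols) != len(FIELDS) or not cols[0].isdigit():
            continue
        tokens.append(dict(zip(FIELDS, cols)))
    if tokens:
        sentences.append(tokens)
    return sentences


def build_index(sentences, field="lemma"):
    """Map each value of `field` to the set of sentence numbers containing it."""
    index = defaultdict(set)
    for number, sentence in enumerate(sentences):
        for token in sentence:
            index[token[field]].add(number)
    return index


def dependency_edges(sentence):
    """Return (head form, relation, dependent form) triples for one sentence."""
    forms = {token["id"]: token["form"] for token in sentence}
    return [(forms.get(token["head"], "ROOT"), token["deprel"], token["form"])
            for token in sentence]


if __name__ == "__main__":
    corpus = parse_conllu(SAMPLE)
    print(dependency_edges(corpus[0]))
    print(dict(build_index(corpus, field="upos")))
```

An inverted index of this kind is one simple way to support the quick search across sentences that the abstract describes; the paper's actual index structure may differ.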

Cite This Paper

Sally Mohamed, Mahmoud Hussien, Hamdy M. Mousa, "ADPBC: Arabic Dependency Parsing Based Corpora for Information Extraction", International Journal of Information Technology and Computer Science (IJITCS), Vol. 13, No. 1, pp. 54-61, 2021. DOI: 10.5815/ijitcs.2021.01.04
