Exploring Semantic Relatedness in Arabic Corpora using Paradigmatic and Syntagmatic Models

Full Text (PDF, 887 KB), pp. 37-47


Author(s)

Adil Toumouh 1,* Dominic Widdows 2 Ahmed Lehireche 1

1. Computer Science department, Djillali Liabes University, Sidi Bel Abbes, 22000, Algeria

2. Microsoft Bing, Bellevue WA, 98004, USA

* Corresponding author.

DOI: https://doi.org/10.5815/ijieeb.2016.01.05

Received: 14 Sep. 2015 / Revised: 6 Oct. 2015 / Accepted: 26 Nov. 2015 / Published: 8 Jan. 2016

Index Terms

Relatedness, syntagmatic model, paradigmatic model, HAL model, term-document model, word order information, permutation, Arabic corpus

Abstract

In this paper we explore two paradigms: firstly, paradigmatic representation via the native HAL model, including a variant enriched with word order information using the permutation technique of Sahlgren et al. [21]; and secondly, syntagmatic representation via a words-by-documents model constructed with the Random Indexing method. We demonstrate that these word space models, originally designed to extract similarity, can also be effective for extracting relatedness from Arabic corpora. For a given word, the proposed models search for the words related to it. A result is counted as a failure when the number of related words returned by a model is less than or equal to 4; otherwise it is counted as a success. To decide whether a word is related to another, we rely on an expert in the economic domain and on a glossary of the domain. We first compare the native HAL model with a term-document model; the simple HAL model records the better result, with a success rate of 72.92%. In a second stage, we boost the HAL model results by adding word order information via the permutation technique of Sahlgren et al. [21]. The success rate of the enriched HAL model reaches 79.2%.
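To make the permutation-based enrichment concrete, the sketch below (a minimal illustration, not the authors' implementation; the dimensionality, window size, sparsity and toy corpus are assumptions chosen for brevity) builds Random Indexing context vectors in which each neighbour's sparse index vector is rotated by its signed offset before being added, in the spirit of Sahlgren et al. [21].

```python
# Minimal sketch of Random Indexing with permutation-encoded word order.
# DIM, NONZERO, WINDOW and the toy corpus are illustrative assumptions only.
import numpy as np

DIM, NONZERO, WINDOW = 2000, 10, 2
rng = np.random.default_rng(0)

def index_vector():
    """Sparse ternary random vector: a few +1/-1 entries, zeros elsewhere."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], size=NONZERO)
    return v

def build_context_vectors(sentences):
    """Sum permuted index vectors of sliding-window neighbours for each word."""
    vocab = {w for s in sentences for w in s}
    index = {w: index_vector() for w in vocab}    # static random index vectors
    context = {w: np.zeros(DIM) for w in vocab}   # learned context vectors
    for sent in sentences:
        for i, w in enumerate(sent):
            for j in range(max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)):
                if j == i:
                    continue
                # Permutation step: rotate the neighbour's index vector by its
                # signed offset, so left and right contexts add different vectors.
                context[w] += np.roll(index[sent[j]], j - i)
    return context

def related(word, context, k=5):
    """Nearest neighbours by cosine similarity: candidate related words."""
    target = context[word]
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    scores = [(w, cos(target, v)) for w, v in context.items() if w != word]
    return sorted(scores, key=lambda x: -x[1])[:k]

if __name__ == "__main__":
    corpus = [["market", "prices", "rise"], ["market", "prices", "fall"],
              ["oil", "prices", "rise"]]
    print(related("prices", build_context_vectors(corpus)))
```

Dropping the np.roll rotation gives an order-free sliding-window variant comparable to the plain HAL-style model, while summing one index vector per document in which a word occurs yields the words-by-documents (syntagmatic) variant; in either case the nearest neighbours by cosine similarity are the candidate related words that are then checked against the domain expert and the glossary.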

Cite This Paper

Adil Toumouh, Dominic Widdows, Ahmed Lehireche, "Exploring Semantic Relatedness in Arabic Corpora using Paradigmatic and Syntagmatic Models", International Journal of Information Engineering and Electronic Business (IJIEEB), Vol.8, No.1, pp.37-47, 2016. DOI:10.5815/ijieeb.2016.01.05

References

[1]M. Abbas and K. Smaili, “Comparison of Topic Identification Methods for Arabic Language,” International conference RANLP05: Recent Advances in Natural Language Processing, 21-23 September 2005, Borovets, Bulgaria.
[2]K. Ben Sidi Ahmed and A. Toumouh, “Effective Ontology Learning: Concepts' Hierarchy Building using Plain Text Wikipedia,” ICWIT 2012: 170-178.
[3]K. Ben Sidi Ahmed, A. Toumouh and D. Widdows, “Lightweight domain ontology learning from texts: graph theory–based approach using Wikipedia,” International Journal of Metadata, Semantics and Ontologies, Volume 9 Issue 2, April 2014, Pages 83-90.
[4]P. Blouw, and C. Eliasmith, “A Neurally Plausible Encoding of Word Order Information into a Semantic Vector Space,” 35th Annual Conference of the Cognitive Science Society, 2013.
[5]A. Budanitsky and G. Hirst, “Evaluating WordNet-based Measures of Semantic Distance,” Computational Linguistics, 32(1), 2006.
[6]W. B. Johnson, and J. Lindenstrauss, “Extensions of Lipschitz Mappings into a Hilbert Space,” In Conference in modern analysis and probability, volume 26 of Contemporary Mathematics, pages 189-206. Amer. Math. Soc., (1984).
[7]M. N. Jones, and D. J. K. Mewhort, “Representing word meaning and order information in a composite holographic lexicon,” Psychological Review, 114, 1-37 (2007).
[8]S. Jonnalagadda, T. Cohen, S. Tze-Inn Wu, and G. Gonzalez, “Enhancing clinical concept extraction with distributional semantics,” Journal of Biomedical Informatics 45(1):129-140 (2012).
[9]P. Kanerva, “Sparse Distributed Memory”, MIT Press, (1988).
[10]P. Kanerva, J. Kristofersson, and A. Holst, “Random Indexing of text samples for Latent Semantic Analysis,” in Proceedings of the 22nd annual conference of the cognitive science society, New Jersey: Erlbaum, (2000).
[11]S. Kaski, “Dimensionality reduction by random mapping: Fast similarity computation for clustering,” In Proceedings of the IJCNN’98, International Joint Conference on Neural Networks. IEEE Service Center (1999).
[12]T. Landauer, and M. Littman, “Fully automatic cross-language document retrieval using latent semantic indexing,” In Proceedings of the Sixth Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research, (1990) pages 31–38, Waterloo, Ontario, October.
[13]K. Lund, C. Burgess, “Producing high-dimensional semantic spaces from lexical co-occurrence,” Behavior Research Methods, Instrumentation, and Computers, (1996) 28, 203-208.
[14]C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala, “Latent semantic indexing: A probabilistic analysis” In Proceedings of the 17th ACM Symposium on the Principles of Database Systems. ACM Press (1998).
[15]T. A. Plate, “Holographic reduced representations,” IEEE Transactions on Neural Networks, 6, 623-641, (1995).
[16]G. L. Recchia, M. N. Jones, M. Sahlgren, and P. Kanerva, “Encoding sequential information in vector space models of semantics: Comparing holographic reduced representation and random permutation,” In S. Ohlsson and R. Catrambone (Eds.), Proceedings of the 32nd Cognitive Science Society, 865-870, (2010).
[17]M. Sahlgren, and R. Coster, “Using bag-of-concepts to improve the performance of support vector machines in text categorization,” In Proceedings of the 20th International Conference on Computational Linguistics, COLING'04, pp. 487-493, (2004).
[18]M. Sahlgren, “An Introduction to Random Indexing,” In Proceedings of the Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE, Copenhagen, Denmark,(2005).
[19]M. Sahlgren, “The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces,” Ph.D. Dissertation, Department of Linguistics, Stockholm University (2006).
[20]M. Sahlgren, “The Distributional Hypothesis. From context to meaning,” Distributional models of the lexicon in linguistics and cognitive science (Special issue of the Italian Journal of Linguistics), Rivista di Linguistica, volume 20, numero 1, (2008).
[21]M. Sahlgren, A. Holst, and P. Kanerva, “Permutations as a Means to Encode Order in Word Space,” Proceedings of the 30th Annual Meeting of the Cognitive Science Society (CogSci'08), July 23-26, Washington D.C., USA, (2008).
[22]G. Salton, “The Smart Retrieval System – Experiments in Automatic Document Processing,” Prentice-Hall, Englewood Cliffs, NJ, 1971.
[23]G. Salton and M. McGill, “Introduction to modern information retrieval,” McGraw-Hill, New York, NY, 1983.
[24]H. Schutze, “Automatic word sense discrimination,” Computational Linguistics (1998) 24(1):97–124.
[25]A. Toumouh, A. Lehireche, D. Widdows, and M. Malki, “Adapting WordNet to the Medical Domain using Lexicosyntactic Patterns in the Ohsumed Corpus,” In Proceedings of The 4th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA-06), Dubai/Sharjah, UAE, pp. 1029-1036, (2006).
[26]A. Toumouh, D. Widdows, and A. Lehireche, “Using Word Space Models for Enriching Multilingual Lexical Resources and Detecting the Relation Between Morphological and Semantic Composition,” International Conference on Web and Information Technologies (ICWIT '08), pp. 195-201, (2008).
[27]A. Toumouh, D. Widdows, and A. Lehireche, “Parallel corpora and WordSpace models: using a third language as an Interlingua to enrich multilingual resources,” International Journal of Information and Communication Technology, Vol. 3, No. 4, pp.299-313, (2011).
[28]T. Zesch, “Study of Semantic Relatedness of Words Using Collaboratively Constructed Semantic Resources,” TU Darmstadt, Ph.D. Thesis, (2010).
[29]D. Widdows, “Geometry and Meaning,” CSLI Publications (2004).
[30]D. Widdows, A. Toumouh, B. Dorow, and A. Lehireche, “Ongoing Developments in Automatically Adapting Lexical Resources to the Biomedical Domain,” Fifth International Conference on Language Resources and Evaluation, LREC, Genoa, Italy, (2006).
[31]D. Widdows, and K. Ferraro, “Semantic vectors: a scalable open source package and online technology management application,” The 6th Edition of the Language Resources and Evaluation, (LREC2008), Marrakech, Morocco (2008).
[32]R. Baeza-Yates, and B. Ribeiro-Neto, “Modern Information Retrieval,” Addison Wesley / ACM Press (1999).