A Partial String Matching Approach for Named Entity Recognition in Unstructured Bengali Data

Full Text (PDF, 1046KB), PP.36-45

Author(s)

Nabil Ibtehaz 1,*, Abdus Satter 2

1. Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka 1000, Bangladesh

2. Institute of Information Technology, University of Dhaka, Dhaka 1000, Bangladesh

* Corresponding author.

DOI: https://doi.org/10.5815/ijmecs.2018.01.04

Received: 30 Oct. 2017 / Revised: 12 Nov. 2017 / Accepted: 29 Nov. 2017 / Published: 8 Jan. 2018

Index Terms

Named Entity Recognition, Dynamic Programming, Trie, String Matching, Edit Distance

Abstract

In today's data-driven, automated, and digitized world, a significant stage of information extraction is the search for special keywords, more formally known as named entities. This has been an active research topic for more than two decades, and significant progress has been made. Today we have models powered by deep learning that, although not perfect, achieve near-human accuracy on certain occasions. Unfortunately, these algorithms require a lot of annotated training data, which is scarce for the Bengali language. This paper proposes a partial string matching approach to identify named entities in an unstructured text corpus in Bengali. The algorithm is based on Breadth First Search (BFS) over a Trie data structure, augmented with dynamic programming. This technique is capable not only of identifying named entities present in a text, but also of estimating the intended named entities from erroneous data. To evaluate the proposed technique, we conducted experiments in a closed domain, applying the approach to a text corpus with a set of predefined named entities. The texts used in the experiments were both structured and unstructured, and our algorithm succeeded in both cases.
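
As a rough illustration of the idea summarized in the abstract, the sketch below stores a small entity dictionary in a trie and walks it breadth-first, carrying one row of the edit-distance (Levenshtein) table per node. It is not the authors' implementation: the names `TrieNode`, `build_trie`, `fuzzy_lookup`, and `max_dist` are hypothetical, the pruning rule is an assumption, and the sample entities are Latin transliterations chosen only for readability.

```python
from collections import deque


class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.word = None    # set when a dictionary entry ends at this node


def build_trie(entities):
    """Insert every known entity into a trie and return its root."""
    root = TrieNode()
    for entity in entities:
        node = root
        for ch in entity:
            node = node.children.setdefault(ch, TrieNode())
        node.word = entity
    return root


def fuzzy_lookup(root, query, max_dist):
    """Return {entity: edit_distance} for entries within max_dist of query.

    Assumption: plain Levenshtein distance with unit costs; the paper's
    actual cost model may differ.
    """
    # DP row for the empty trie prefix: distance from "" to query[:j] is j.
    first_row = list(range(len(query) + 1))
    matches = {}
    queue = deque([(root, first_row)])
    while queue:
        node, prev_row = queue.popleft()
        if node.word is not None and prev_row[-1] <= max_dist:
            matches[node.word] = prev_row[-1]
        for ch, child in node.children.items():
            # Extend the trie prefix by one character and compute the next row.
            row = [prev_row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1,           # insertion
                               prev_row[j] + 1,          # deletion
                               prev_row[j - 1] + cost))  # match or substitution
            # The row minimum never decreases deeper in the trie, so this
            # branch can be dropped once it exceeds the threshold.
            if min(row) <= max_dist:
                queue.append((child, row))
    return matches


root = build_trie(["dhaka", "dinajpur", "khulna"])
print(fuzzy_lookup(root, "dhaca", 1))  # {'dhaka': 1}
```

Because Python compares strings by Unicode code point, the same functions accept Bengali input unchanged, although treating combined characters as single units would need extra normalization that this sketch does not attempt.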

Cite This Paper

Nabil Ibtehaz, Abdus Satter, "A Partial String Matching Approach for Named Entity Recognition in Unstructured Bengali Data", International Journal of Modern Education and Computer Science (IJMECS), Vol. 10, No. 1, pp. 36-45, 2018. DOI: 10.5815/ijmecs.2018.01.04
