Analog Document Search Using CRNN and Keyphrase Extraction

Author(s)

Lokeshwar S 1,*, Vadiraja Rao M. K 1, Sujay Kumar P. S 1, Vishveshwara Guthal Gowda 1, Hemavathi P. 1

1. Bangalore Institute of Technology (VTU), Bengaluru, Karnataka, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijigsp.2021.02.02

Received: 30 Jun. 2020 / Revised: 23 Jul. 2020 / Accepted: 2 Sep. 2020 / Published: 8 Apr. 2021

Index Terms

Analog document search, CRNN, Keyphrase Extraction, Position Rank

Abstract

Information consumption is shifting to digital media, not only for newspapers but for books as well. With advances in Optical Character Recognition (OCR), Style Transfer Mapping (STM), and efficient keyphrase extraction, printed documents can now be digitized into a form that can be read across multiple platforms and searched efficiently. This lets users find relevant documents without the tedious process of manual searching.
We propose a system that uses a CRNN model to recognize English characters in a document with high accuracy. We pair it with a hybrid keyphrase extraction technique that uses PositionRank as its graph-based ranking and then re-ranks the keyphrases using the C-value method. This process automatically digitizes a printed document and summarizes it as high-quality keyphrases, which can be used to search for and retrieve relevant printed documents efficiently.
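
The hybrid ranking step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the OCR output has already been tokenized and filtered to candidate words, it uses the original C-value formula of Frantzi et al. (2000), and the way the two scores are combined (shortlist by graph score, then sort by C-value) is an assumption made only for illustration. The function names `position_rank`, `c_values`, and `rerank` are hypothetical.

```python
import math
from collections import defaultdict


def position_rank(tokens, window=2, damping=0.85, iterations=50):
    """Position-biased PageRank over a word co-occurrence graph.

    Assumes `tokens` is already filtered to candidate words.
    """
    # Undirected edges weighted by co-occurrence within `window` tokens.
    weights = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[j] != w:
                weights[w][tokens[j]] += 1.0
                weights[tokens[j]][w] += 1.0
    # Bias each word by the sum of its inverse positions, then normalise.
    bias = defaultdict(float)
    for pos, w in enumerate(tokens, start=1):
        bias[w] += 1.0 / pos
    total = sum(bias.values())
    bias = {w: b / total for w, b in bias.items()}
    words = list(bias)
    scores = {w: 1.0 / len(words) for w in words}
    for _ in range(iterations):
        new_scores = {}
        for w in words:
            incoming = sum(
                weights[v][w] / sum(weights[v].values()) * scores[v]
                for v in weights[w]
            )
            new_scores[w] = (1 - damping) * bias[w] + damping * incoming
        scores = new_scores
    return scores


def c_values(freq):
    """C-value per candidate; `freq` maps a tuple of words to its frequency.

    With the original formula, single-word candidates score 0 (log2(1) = 0).
    """
    def contains(longer, shorter):
        n = len(shorter)
        return len(longer) > n and any(
            longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
        )

    result = {}
    for a, f_a in freq.items():
        nesting = [f_b for b, f_b in freq.items() if contains(b, a)]
        if nesting:
            # Discount occurrences that are part of longer candidates.
            f_a = f_a - sum(nesting) / len(nesting)
        result[a] = math.log2(len(a)) * f_a
    return result


def rerank(tokens, freq, top_k=5):
    """Assumed combination: shortlist by graph score, then sort by C-value."""
    word_scores = position_rank(tokens)
    graph_score = {p: sum(word_scores.get(w, 0.0) for w in p) for p in freq}
    shortlist = sorted(freq, key=graph_score.get, reverse=True)[:top_k]
    cv = c_values(freq)
    return sorted(shortlist, key=cv.get, reverse=True)
```

For example, `rerank(tokens, {("neural", "network"): 5, ("recurrent", "neural", "network"): 2})` returns the candidate phrases ordered by C-value after the graph-based shortlist. A production system would also need part-of-speech filtering and candidate phrase extraction, which are omitted here.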

Cite This Paper

Lokeshwar S, Vadiraja Rao M. K, Sujay Kumar P. S, Vishveshwara Guthal Gowda, Hemavathi P., "Analog Document Search Using CRNN and Keyphrase Extraction", International Journal of Image, Graphics and Signal Processing (IJIGSP), Vol.13, No.2, pp. 16-24, 2021. DOI: 10.5815/ijigsp.2021.02.02
