A Full-text Website Search Engine Powered by Lucene and The Depth First Search Algorithm

Full Text (PDF, 1666KB), PP.1-12

Author(s)

Modinat. A. Mabayoje 1,*, O. S. Oni 1, Olawale S. Adebayo 2

1. Department of Computer Science, University of Ilorin, P.M.B 1515, Ilorin, Nigeria

2. Cyber Security Science Department, Federal University of Technology, P.M.B 65, Minna, Nigeria

* Corresponding author.

DOI: https://doi.org/10.5815/ijcnis.2013.03.01

Received: 25 Jun. 2012 / Revised: 11 Oct. 2012 / Accepted: 20 Dec. 2012 / Published: 8 Mar. 2013

Index Terms

Full Text search engine, Relational Database, Information Retrieval, Lucene, Depth first search algorithm

Abstract

With the amount of text data available on the web growing rapidly, users' need to search that information is increasing dramatically. Full-text search engines and relational databases each have distinct strengths as development tools, but their capabilities also overlap: both can store and update data, and both support searching it. Full-text systems are better at quickly searching high volumes of unstructured text for the presence of any word or combination of words. They provide rich text search capabilities and sophisticated relevancy ranking tools for ordering results by how well they match a potentially fuzzy search request. Relational databases, on the other hand, excel at storing and manipulating structured data, that is, records made up of fields of specific types (text, integer, currency, etc.). They can do so with little or no redundancy. They support flexible search of multiple record types for specific field values, as well as strong tools for quickly and securely updating individual records. Because the web is an ever-growing collection of largely unstructured documents, using an RDBMS to search it has become very costly.
This paper describes the architecture, design and implementation of a prototype website search engine, powered by Lucene, that can search through any website. The approach involves developing a small-scale web crawler to gather information from the target website. The gathered information is then converted into Lucene documents and stored in the index. Searching the index takes very little time compared with how long a relational database takes to process a query.
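
The crawl, index and search pipeline outlined above can be illustrated with a minimal sketch. It is not the paper's implementation: Lucene is replaced here by a plain in-memory inverted index, the website is a hypothetical dictionary named `SITE`, and the function names (`crawl_dfs`, `build_index`, `search`) are assumptions made only for illustration. The crawler follows links depth-first, as the title suggests, and every crawled page becomes one indexed document.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "website": URL -> (page text, outgoing links).
SITE = {
    "/": ("Welcome to the department home page", ["/about", "/courses"]),
    "/about": ("About the department of computer science", ["/", "/staff"]),
    "/courses": ("Courses on information retrieval and databases", ["/about"]),
    "/staff": ("Staff list and contact information", []),
}

def crawl_dfs(site, start):
    """Visit pages depth-first from `start`; return URLs in visit order."""
    visited, order, stack = set(), [], [start]
    while stack:
        url = stack.pop()
        if url in visited or url not in site:
            continue  # skip pages already seen and links leaving the site
        visited.add(url)
        order.append(url)
        # Push links in reverse so the first link on a page is explored first.
        stack.extend(reversed(site[url][1]))
    return order

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(site, urls):
    """Build an inverted index: term -> set of URLs containing that term."""
    index = defaultdict(set)
    for url in urls:
        for term in tokenize(site[url][0]):
            index[term].add(url)
    return index

def search(index, query):
    """Return the sorted URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

pages = crawl_dfs(SITE, "/")
index = build_index(SITE, pages)
print(pages)
print(search(index, "information retrieval"))
```

A query is answered by looking up each term in the index and intersecting the resulting sets, so its cost depends on the number of matching pages rather than on scanning every stored page. This is the general reason an inverted index suits full-text search; a real engine such as Lucene adds analysis, relevancy ranking and on-disk storage on top of this idea.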

Cite This Paper

Modinat. A. Mabayoje, O. S. Oni, Olawale S. Adebayo, "A Full-text Website Search Engine Powered by Lucene and The Depth First Search Algorithm", International Journal of Computer Network and Information Security (IJCNIS), vol.5, no.3, pp.1-12, 2013. DOI:10.5815/ijcnis.2013.03.01
