Improved Architecture of Focused Crawler on the basis of Content and Link Analysis

Full Text (PDF, 969KB), PP.33-40

Author(s)

Bhupinderjit Singh 1,*, Deepak Kumar Gupta 1, Raj Mohan Singh 1

1. Department of Computer Science & Engineering, Dr. B R Ambedkar National Institute of Technology, Jalandhar, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijmecs.2017.11.04

Received: 5 Aug. 2017 / Revised: 11 Sep. 2017 / Accepted: 16 Oct. 2017 / Published: 8 Nov. 2017

Index Terms

Focused Crawler, Topic Weight Table, Search Engine, Page Score, Link Score, URL Queue Optimization

Abstract

The World Wide Web is a vast, dynamic, and continuously growing collection of web documents. Because of its size, users find it difficult to locate relevant information on a particular topic of interest. This paper proposes an improved focused crawler architecture that combines several techniques used in earlier crawlers. The main goal of a focused crawler is to fetch web documents related to a predefined set of topics or domains and to ignore irrelevant pages. To check the relevance of a web page, a Page Score is computed from the content similarity between the page and the topic keywords. A URL priority queue is implemented by calculating a Link Score for each extracted URL from the URL's attributes. The URL queue is further optimized by removing duplicate content. The Topic Keywords Weight Table is expanded by extracting additional keywords from the database of relevant pages and recalculating the keyword weights. Experimental results show that the proposed crawler is more efficient than earlier crawlers.
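The scoring and queueing ideas summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything here is an assumption: the Page Score is taken to be the cosine similarity between a page's term counts and the topic keyword weights, the Link Score is a hypothetical unweighted average of anchor-text similarity, URL-text similarity, and the parent page's score, and duplicates are detected by exact URL match only. The names `page_score`, `link_score`, and `Frontier` are illustrative and do not come from the paper. Expansion of the Topic Keywords Weight Table is not shown.

```python
import heapq
import math
import re
from collections import Counter


def page_score(text, topic_weights):
    """Cosine similarity between a text's term counts and topic keyword weights.

    Assumption: the paper's actual Page Score formula may differ.
    """
    terms = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    dot = sum(weight * terms[word] for word, weight in topic_weights.items())
    page_norm = math.sqrt(sum(count * count for count in terms.values()))
    topic_norm = math.sqrt(sum(w * w for w in topic_weights.values()))
    if page_norm == 0 or topic_norm == 0:
        return 0.0
    return dot / (page_norm * topic_norm)


def link_score(anchor_text, url, parent_score, topic_weights):
    """Hypothetical Link Score: plain average of three URL attributes."""
    anchor_sim = page_score(anchor_text, topic_weights)
    url_sim = page_score(url, topic_weights)
    return (anchor_sim + url_sim + parent_score) / 3.0


class Frontier:
    """URL priority queue that pops the highest-scoring URL first.

    Assumption: a URL is a duplicate if the same string was pushed before.
    """

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so equal scores pop in insertion order

    def push(self, url, score):
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the maximum.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

In a crawl loop built on this sketch, each fetched page would be scored with `page_score`, stored only if the score passes a chosen relevance threshold, and its outgoing links pushed onto the `Frontier` with their `link_score` so that the most promising URLs are fetched next.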

Cite This Paper

Bhupinderjit Singh, Deepak Kumar Gupta, Raj Mohan Singh, "Improved Architecture of Focused Crawler on the basis of Content and Link Analysis", International Journal of Modern Education and Computer Science (IJMECS), Vol. 9, No. 11, pp. 33-40, 2017. DOI: 10.5815/ijmecs.2017.11.04
