The Impact of Feature Selection on Web Spam Detection

Full Text (PDF, 451KB), PP.61-67

Author(s)

Jaber Karimpour 1,*, Ali A. Noroozi 1, Adeleh Abadi 1

1. Dept. of Computer Science, University of Tabriz, Tabriz, Iran

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2012.09.08

Received: 3 Sep. 2011 / Revised: 4 Jan. 2012 / Accepted: 16 Mar. 2012 / Published: 8 Aug. 2012

Index Terms

Web Spam Detection, Feature Selection, Imperialistic Competitive Algorithm, Genetic Algorithm

Abstract

Search engines are among the most important tools for managing the massive amount of distributed web content. Web spamming tries to deceive search engines into ranking some pages higher than they deserve. Many methods have been proposed to combat web spamming and to detect spam pages. One basic approach is classification, i.e., learning a model that classifies web pages as spam or non-spam. This work tries to select the best feature set for web spam classification using the imperialist competitive algorithm (ICA) and the genetic algorithm (GA). The imperialist competitive algorithm is a novel optimization algorithm inspired by the socio-political process of imperialism in the real world. Experiments carried out on the WEBSPAM-UK2007 data set show that feature selection improves classification accuracy and that the imperialist competitive algorithm outperforms the GA.
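The general idea of wrapper-style feature selection with a genetic algorithm can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the classifier, fitness function, or GA parameters, so everything below is an assumption made for illustration. Synthetic data stands in for WEBSPAM-UK2007, a nearest-centroid classifier stands in for the paper's classifier, and the names `make_data`, `centroid_accuracy`, and `ga_select` are hypothetical. The imperialist competitive algorithm is not sketched here.

```python
import random

def make_data(n=200, n_features=10, n_informative=3, seed=0):
    """Synthetic two-class data (assumption): only the first
    n_informative features carry any class signal."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [rng.gauss(label * 1.5 if j < n_informative else 0.0, 1.0)
             for j in range(n_features)]
        rows.append((x, label))
    return rows

def centroid_accuracy(mask, train, valid):
    """Wrapper fitness: accuracy of a nearest-centroid classifier
    that only sees the features switched on in the binary mask."""
    idx = [j for j, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0  # an empty feature set cannot classify anything
    centroids = {}
    for label in (0, 1):
        pts = [x for x, y in train if y == label]
        centroids[label] = [sum(p[j] for p in pts) / len(pts) for j in idx]
    correct = 0
    for x, y in valid:
        dist = {c: sum((x[j] - m) ** 2 for j, m in zip(idx, mu))
                for c, mu in centroids.items()}
        correct += min(dist, key=dist.get) == y
    return correct / len(valid)

def ga_select(train, valid, n_features, pop_size=20, generations=15,
              p_mut=0.1, seed=1):
    """Evolve binary feature masks; all parameter values are illustrative."""
    rng = random.Random(seed)
    fitness = lambda m: centroid_accuracy(m, train, valid)
    # Seed the population with the all-features mask as a baseline individual.
    pop = [[1] * n_features] + [
        [rng.randint(0, 1) for _ in range(n_features)]
        for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        best_so_far = max(pop, key=fitness)
        nxt = [best_so_far[:]]  # elitism: the best mask survives unchanged
        while len(nxt) < pop_size:
            # Tournament selection (size 3) of two parents.
            a = max(rng.sample(pop, 3), key=fitness)
            b = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability p_mut per feature.
            child = [bit ^ (rng.random() < p_mut) for bit in child]
            nxt.append(child)
        pop = nxt
    best = max(pop, key=fitness)
    return best, fitness(best)

data = make_data()
train, valid = data[:150], data[150:]
mask, acc = ga_select(train, valid, n_features=10)
baseline = centroid_accuracy([1] * 10, train, valid)
print("selected mask:", mask)
print("accuracy with selected features:", acc)
print("accuracy with all features:", baseline)
```

Because the all-features mask is in the initial population and the best individual is always carried forward, the selected subset scores at least as well as the full feature set on the validation split. A real study would score the final subset on a separate held-out test set, since the validation split is reused during the search.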

Cite This Paper

Jaber Karimpour, Ali A. Noroozi, Adeleh Abadi, "The Impact of Feature Selection on Web Spam Detection", International Journal of Intelligent Systems and Applications (IJISA), vol. 4, no. 9, pp. 61-67, 2012. DOI: 10.5815/ijisa.2012.09.08
