A link and Content Hybrid Approach for Arabic Web Spam Detection

Full Text (PDF, 456KB), pp. 30-43

Author(s)

Heider A. Wahsheh 1,* Mohammed N. Al-Kabi 1 Izzat M. Alsmadi 1

1. Dept. of Computer Information Systems, IT & CS Faculty, Yarmouk University, Irbid, Jordan

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2013.01.03

Received: 6 Apr. 2012 / Revised: 4 Aug. 2012 / Accepted: 3 Oct. 2012 / Published: 8 Dec. 2012

Index Terms

Arabic Web Spam, Content-Based Detection, Link-Based Detection, Content/Link Arabic Web Spam

Abstract

Some Web site developers act as spammers, trying to mislead search engines with illegitimate Search Engine Optimization (SEO) techniques that raise the rank of their Web documents so that they appear among the top 10 results of the search engine results page (SERP). Their aim is to attract more visitors for marketing and commercial purposes. This study continues a series of Arabic Web spam studies conducted by the authors and is dedicated to building the first Arabic content/link Web spam detection system. This novel system can extract the content and link features of Web pages in order to build the largest Arabic Web spam dataset. The constructed dataset contains three groups with the following percentages of spam content: 2%, 30%, and 40%. The three groups were collected through the crawler embedded in the proposed system. Spam Web pages are classified automatically based on the features in the benchmark dataset. The proposed system uses the rules of a Decision Tree, which was considered the best classifier for detecting Arabic content/link Web spam. The proposed system helps to clean the SERP of all URLs referring to Arabic spam Web pages. It achieves an accuracy of 90.1099% for Arabic content-based detection, 93.1034% for Arabic link-based detection, and 89.011% for detecting both Arabic content and link Web spam, based on the collected dataset and the conducted analysis.
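
The pipeline summarized in the abstract (extract content and link features from a page, then apply decision rules) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the feature names (`top_word_share`, `n_external_links`), the thresholds, and the helper functions `extract_features` and `classify` are hypothetical and are not taken from the paper, whose actual feature set and learned Decision Tree are not given in this abstract.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class FeatureParser(HTMLParser):
    """Collects visible words and hyperlink targets from one HTML page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Whitespace splitting also works for Arabic text.
        if not self._skip:
            self.words.extend(data.split())


def extract_features(html, page_host):
    """Return a few hypothetical content and link features for one page."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()

    n_words = len(parser.words)
    top_share = 0.0  # share of the text taken by the most frequent word
    if n_words:
        counts = {}
        for word in parser.words:
            counts[word] = counts.get(word, 0) + 1
        top_share = max(counts.values()) / n_words

    # A link is external when it names a host other than the page's own.
    external = sum(
        1 for href in parser.links
        if urlparse(href).netloc not in ("", page_host)
    )
    return {
        "n_words": n_words,
        "top_word_share": top_share,
        "n_links": len(parser.links),
        "n_external_links": external,
    }


def classify(features):
    """Hand-written rules with made-up thresholds (not the paper's tree)."""
    if features["top_word_share"] > 0.3 and features["n_words"] >= 10:
        return "spam"  # keyword stuffing: one word dominates the text
    if features["n_external_links"] > 20:
        return "spam"  # unusually many outgoing links
    return "non-spam"
```

In a real system the rules would be learned from a labeled dataset rather than written by hand; the sketch only shows how content features and link features can feed a single set of decision rules.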

Cite This Paper

Heider A. Wahsheh, Mohammed N. Al-Kabi, Izzat M. Alsmadi, "A link and Content Hybrid Approach for Arabic Web Spam Detection", International Journal of Intelligent Systems and Applications (IJISA), vol. 5, no. 1, pp. 30-43, 2013. DOI: 10.5815/ijisa.2013.01.03
