Detection of Plagiarism in Arabic Documents

Author(s)

Mohamed El Bachir Menai 1,*

1. Department of Computer Science, College of Computer and Information Sciences, King Saud University, P.O. Box 51178, Riyadh 11543, Saudi Arabia

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2012.10.10

Received: 9 Jan. 2012 / Revised: 3 Apr. 2012 / Accepted: 21 Jun. 2012 / Published: 8 Sep. 2012

Index Terms

Plagiarism Detection, Similarity Detection, Arabic, Fingerprinting, Heuristic Algorithm

Abstract

Many language-sensitive tools for detecting plagiarism in natural language documents have been developed, particularly for English. Language-independent tools exist as well, but are considered restrictive because they usually do not take specific language features into account. Detecting plagiarism in Arabic documents is a particularly challenging task because of the complex linguistic structure of Arabic. In this paper, we present a plagiarism detection tool that compares Arabic documents to identify potential similarities. The tool is based on a new comparison algorithm that uses heuristics to compare suspect documents at different hierarchical levels, thereby avoiding unnecessary comparisons. We evaluate its performance in terms of precision and recall on a large data set of Arabic documents, and show its capability in identifying both direct copying and more sophisticated copying, such as sentence reordering and synonym substitution. We also demonstrate its advantages over other plagiarism detection tools, including Turnitin, the well-known language-independent tool.
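
The idea of comparing documents at successive hierarchical levels can be illustrated with a minimal sketch. This is not the algorithm presented in the paper: the function names, the word bigram fingerprints, the Jaccard measure, the three levels (document, paragraph, sentence), and all thresholds are assumptions chosen purely for illustration, and no Arabic stemming or synonym handling is attempted.

```python
import re

def fingerprints(text, k=2):
    """Return the set of lowercase word k-grams of `text` (a simple fingerprint)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two fingerprint sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def split_paragraphs(doc):
    """Split on blank lines (assumed paragraph separator)."""
    return [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]

def split_sentences(paragraph):
    """Split on Latin and Arabic sentence-ending punctuation (a rough heuristic)."""
    return [s.strip() for s in re.split(r"[.!?\u061F]+", paragraph) if s.strip()]

def compare(doc_a, doc_b, doc_t=0.1, par_t=0.2, sent_t=0.5, k=2):
    """Return (sentence_a, sentence_b, score) triples judged similar.

    Each level acts as a filter: sentences are compared only inside
    paragraph pairs that passed par_t, and paragraphs only if the
    whole documents passed doc_t.
    """
    if jaccard(fingerprints(doc_a, k), fingerprints(doc_b, k)) < doc_t:
        return []  # documents too dissimilar: skip all finer comparisons
    matches = []
    pars_b = split_paragraphs(doc_b)
    for pa in split_paragraphs(doc_a):
        fa = fingerprints(pa, k)
        for pb in pars_b:
            if jaccard(fa, fingerprints(pb, k)) < par_t:
                continue  # prune this paragraph pair
            for sa in split_sentences(pa):
                for sb in split_sentences(pb):
                    score = jaccard(fingerprints(sa, k), fingerprints(sb, k))
                    if score >= sent_t:
                        matches.append((sa, sb, score))
    return matches
```

Calling `compare(doc_a, doc_b)` on two strings returns the sentence pairs that survive all three filters. Because sentences are fingerprinted individually, reordering sentences within a paragraph does not hide a match in this sketch; synonym substitution, however, would require language-specific resources that it does not model.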

Cite This Paper

Mohamed El Bachir Menai, "Detection of Plagiarism in Arabic Documents", International Journal of Information Technology and Computer Science (IJITCS), vol.4, no.10, pp.80-89, 2012. DOI:10.5815/ijitcs.2012.10.10
