A Methodology for Reliable Code Plagiarism Detection Using Complete and Language Agnostic Code Clone Classification

Full Text (PDF, 902KB), PP.34-56

Author(s)

Sanjay B. Ankali 1,*, Latha Parthiban 2

1. VTU-RRC, & Faculty, KLE College of Engineering and Technology, Chikodi-India-591201

2. Department of Computer Science, Pondicherry University, Community College, Lawspet-India-605008

* Corresponding author.

DOI: https://doi.org/10.5815/ijmecs.2021.03.04

Received: 17 Jan. 2021 / Revised: 26 Feb. 2021 / Accepted: 14 Mar. 2021 / Published: 8 Jun. 2021

Index Terms

Clone types, functional tree, TF-IDF, cosine similarity, Code plagiarism

Abstract

Code clone detection plays a vital role in both industry and academia. The last three decades have seen more than 250 clone detection techniques, yet no single framework detects and classifies all 4 basic types of code clones with high precision. This gap in clone classification weighs heavily on universities and online learning platforms, which are unable to validate the projects or coding assignments submitted online. In this paper, we propose a complete and language-agnostic technique to detect and classify all 4 clone types in C, C++, and Java programs. The method first generates the parse tree and then extracts the functional tree, which eliminates the need for the preprocessing stage employed by previous clone detection techniques. The generated parse tree contains all the information necessary for detecting code clones. We employ TF-IDF cosine similarity to classify the clone types. The proposed technique achieves a precision of 100% in detecting the first two types of clones and 98% in detecting type-3 and type-4 clones for small C, C++, and Java programs with an average line count of 5. It outperforms existing tree-based clone detection tools, with an average precision of 98.07% on C, C++, and Java programs crawled from GitHub with an average line count of 15. These results indicate that a cosine similarity measure on the ANTLR functional tree accurately detects all 4 types of small clones and can serve as a validation tool for assessing the learning level reflected in submitted programming assignments.
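
As an illustration of the similarity step named in the abstract, the following is a minimal sketch of TF-IDF cosine similarity over sequences of tree node labels. It is not the authors' implementation: the node labels, the smoothed IDF formula, and the function names `tf_idf_vectors` and `cosine` are assumptions made for this example, and the abstract does not state the thresholds used to separate clone types.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per list of node labels."""
    n = len(docs)
    # Document frequency: in how many label lists each label occurs.
    df = Counter(label for doc in docs for label in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        # Smoothed IDF (an assumption of this sketch), so labels shared by
        # every tree still carry a nonzero weight.
        vectors.append({
            label: (count / total) * (math.log((1 + n) / (1 + df[label])) + 1)
            for label, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(label, 0.0) for label, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical node labels, as if read off a functional tree.
tree_a = ["functionDefinition", "declaration", "forStatement",
          "assignmentExpression", "returnStatement"]
# Same structure as tree_a (e.g. only identifiers were renamed).
tree_b = list(tree_a)
# Loop construct changed and one statement added.
tree_c = ["functionDefinition", "declaration", "whileStatement",
          "assignmentExpression", "assignmentExpression", "returnStatement"]

vec_a, vec_b, vec_c = tf_idf_vectors([tree_a, tree_b, tree_c])
print("a vs b:", cosine(vec_a, vec_b))
print("a vs c:", cosine(vec_a, vec_c))
```

In this sketch, structurally identical trees score at the maximum similarity and a modified tree scores lower; how such scores map to clone types 1 through 4 would depend on thresholds that the abstract does not give.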

Cite This Paper

Sanjay B. Ankali, Latha Parthiban, "A Methodology for Reliable Code Plagiarism Detection Using Complete and Language Agnostic Code Clone Classification", International Journal of Modern Education and Computer Science (IJMECS), Vol.13, No.3, pp. 34-56, 2021. DOI: 10.5815/ijmecs.2021.03.04
