Author Attribution of Arabic Texts Using Extended Probabilistic Context Free Grammar Language Model



Author(s)

Ibrahim S. I. Abuhaiba 1,*, Mohammad F. Eltibi 1

1. Computer Engineering Department, Islamic University, P. O. Box 108, Gaza, Palestine

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2016.06.04

Received: 11 Sep. 2015 / Revised: 1 Dec. 2015 / Accepted: 25 Jan. 2016 / Published: 8 Jun. 2016

Index Terms

Author attribution, author identification, language model, PCFG language model, Chi-square score, genetic algorithm

Abstract

Author attribution is the problem of identifying the author of a text whose authorship is unknown. We propose a new approach based on an extended version of the probabilistic context-free grammar (PCFG) language model, enriched with more informative lexical and syntactic features. In addition to the probabilities of the production rules in the generated model, we add probabilities for terminals, non-terminals, and punctuation marks. The new model is also augmented with a scoring function that assigns a score to each production rule. Because the model combines several different features, optimum weights, found using a genetic algorithm, are added to govern how much each feature contributes to the classification. Using many features helps capture the different writing styles of authors, the scoring function identifies the most discriminative rules, and the optimum weights further improve the classifier's performance. The model is tested on nine authors with 20 Arabic documents per author, with training and testing done using the leave-one-out method. The initial error rate of the system is 20.6%; using the optimum feature weights reduces it to 12.8%.
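
As an illustration of how per-feature probabilities and feature weights could interact, the following Python sketch scores a document against each author's model and picks the best match. It is a minimal sketch under stated assumptions, not the paper's implementation: the feature group names, the add-alpha smoothing, the weighted sum of log-probabilities, and all function names are hypothetical, and the weights are set by hand here rather than found by a genetic algorithm. The rule-scoring function and the parsing step are omitted.

```python
import math
from collections import Counter

# Hypothetical feature groups, named after the abstract's description.
FEATURES = ("rules", "terminals", "nonterminals", "punctuation")


def train(docs):
    """Count how often each item occurs per feature group in one author's documents."""
    return {f: Counter(item for d in docs for item in d.get(f, [])) for f in FEATURES}


def log_prob(counts, items, alpha=1.0):
    """Add-alpha smoothed log-probability of items under one feature's counts (assumed smoothing)."""
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen items
    return sum(math.log((counts[i] + alpha) / (total + alpha * vocab)) for i in items)


def classify(doc, models, weights):
    """Return the author with the highest weighted sum of per-feature log-probabilities."""
    def score(model):
        return sum(weights[f] * log_prob(model[f], doc.get(f, [])) for f in FEATURES)
    return max(models, key=lambda author: score(models[author]))


# Toy data: the rule strings and author names are made up for illustration.
models = {
    "author_a": train([{"rules": ["S -> NP VP", "NP -> DT NN"], "punctuation": [",", ","]}]),
    "author_b": train([{"rules": ["S -> VP NP", "NP -> NN"], "punctuation": [".", "."]}]),
}
equal_weights = {f: 1.0 for f in FEATURES}
unknown = {"rules": ["S -> NP VP"], "punctuation": [","]}
print(classify(unknown, models, equal_weights))
```

Changing the weights changes which feature group dominates the decision, which is the role the abstract assigns to the optimized weights. In a leave-one-out setup, each document would in turn be held out, the models trained on the remaining documents, and the held-out document classified.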

Cite This Paper

Ibrahim S. I. Abuhaiba, Mohammad F. Eltibi, "Author Attribution of Arabic Texts Using Extended Probabilistic Context Free Grammar Language Model", International Journal of Intelligent Systems and Applications (IJISA), Vol.8, No.6, pp.27-39, 2016. DOI:10.5815/ijisa.2016.06.04
