Disinformation, Fakes and Propaganda Identifying Methods in Online Messages Based on NLP and Machine Learning Methods

Author(s)

Victoria Vysotska 1, Krzysztof Przystupa 2, Lyubomyr Chyrun 3, Serhii Vladov 4, Yuriy Ushenko 5, Dmytro Uhryn 5, Zhengbing Hu 6

1. Department of Information Systems and Networks, Lviv Polytechnic National University, Lviv, 79013, Ukraine

2. Department of Automation, Lublin University of Technology, Poland

3. Applied Mathematics Department, Ivan Franko National University of Lviv, Lviv, 79000, Ukraine

4. Scientific Work Organization and Gender Issues Department, Kremenchuk Flight College of Kharkiv National University of Internal Affairs, Kremenchuk, 39605, Ukraine

5. Department of Computer Science of the Yuriy Fedkovych Chernivtsi National University, Chernivtsi, 58012, Ukraine

6. School of Computer Science, Hubei University of Technology, Wuhan, China

* Corresponding author.

DOI: https://doi.org/10.5815/ijcnis.2024.05.06

Received: 10 Mar. 2023 / Revised: 11 Jun. 2023 / Accepted: 23 Aug. 2023 / Published: 8 Oct. 2024

Index Terms

Information Security, Cybersecurity, Logistic Regression, NLP, Propaganda, Disinformation, Fake News, Message, Text, Linguistic Analysis, Artificial Intelligence, Cyber Warfare, Machine Learning, Information Technology

Abstract

A new method of propaganda analysis is proposed that uses machine learning to identify the signs of coordinated groups and changes in the dynamics of their behaviour at the disinformation processing stages. In the course of the work, two models were implemented to recognise propaganda in textual data: one at the message level and one at the phrase level. In the analysis and recognition of text data, in particular fake news on the Internet, an important component of NLP (natural language processing) technology is the classification of words in text data. In this context, classification is the assignment of textual data to one or more predefined categories or classes. For this purpose, the task of binary text classification was solved. Both models are built on logistic regression. Data preparation and feature extraction use TF-IDF (Term Frequency – Inverse Document Frequency) vectorisation, the BOW (Bag-of-Words) model, POS (Part-Of-Speech) tagging, word embedding with the two-layer Word2Vec neural network, and manual feature extraction aimed at identifying specific techniques of political propaganda in texts. Analogues of the project under development are analysed, and the subject area (the propaganda used in the media and the methods by which it is produced) is studied. The software is implemented in Python using the seaborn, matplotlib, gensim, spaCy, NLTK (Natural Language Toolkit), NumPy, pandas and scikit-learn libraries. The models scored 0.74 for propaganda recognition at the phrase level and 0.99 at the message level. Applying the results will significantly reduce the time required to make the most appropriate decision on counter-disinformation measures against the identified coordinated groups that generate disinformation, fake news and propaganda.
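The core of the pipeline described above, TF-IDF vectorisation followed by binary logistic regression, can be illustrated with a minimal sketch. It is written in plain Python rather than with scikit-learn so that each step is visible. It is not the authors' implementation: the toy texts, the labels, the smoothed IDF formula and the training hyperparameters (`epochs`, `lr`) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z']+", text.lower())

def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(doc, idf):
    """L2-normalised sparse TF-IDF vector, stored as {term: weight}."""
    tf = Counter(t for t in doc if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(vec, weights, bias):
    """Probability that a vector belongs to class 1 (propaganda)."""
    return sigmoid(bias + sum(weights.get(t, 0.0) * v for t, v in vec.items()))

def train_logreg(vectors, labels, epochs=200, lr=0.5):
    """Binary logistic regression fitted by stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            err = predict_proba(vec, weights, bias) - y
            bias -= lr * err
            for t, v in vec.items():
                weights[t] = weights.get(t, 0.0) - lr * err * v
    return weights, bias

# Hypothetical toy corpus: 1 = propaganda-like, 0 = neutral.
texts = [
    "the enemy is a monster that must be destroyed",
    "traitors and monsters are destroying our glorious nation",
    "only a glorious leader can crush the enemy traitors",
    "the council approved the budget on tuesday",
    "the museum opens a new exhibition next week",
    "the city council will repair the bridge next month",
]
labels = [1, 1, 1, 0, 0, 0]

docs = [tokenize(t) for t in texts]
idf = fit_idf(docs)
weights, bias = train_logreg([tfidf(d, idf) for d in docs], labels)

def score(text):
    """Propaganda probability for a new message."""
    return predict_proba(tfidf(tokenize(text), idf), weights, bias)

p_propaganda = score("the enemy traitors must be destroyed")
p_neutral = score("the council opens the bridge on tuesday")
print(round(p_propaganda, 3), round(p_neutral, 3))
```

On data this small the model only memorises vocabulary. A realistic setup would add the other features the abstract lists (POS tags, Word2Vec embeddings, hand-crafted propaganda cues) and evaluate on held-out data.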
Several classification algorithms were compared for identifying non-fake and fake news from Internet resources and social media. Their identification accuracies (non-fakes/fakes) were: decision tree (0.98/0.9903), k-nearest neighbours (0.83/0.999), random forest (0.991/0.933), multilayer perceptron (0.9979/0.9945), logistic regression (0.9965/0.9988), and the Bayes classifier (0.998/0.913). For identifying non-fake news, the Bayes classifier (0.998), the multilayer perceptron (0.9979) and logistic regression (0.9965) performed best. For identifying fake news, k-nearest neighbours (0.999), logistic regression (0.9988) and the multilayer perceptron (0.9945) performed best.
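The abstract does not define how the separate non-fakes and fakes accuracies are computed. One plausible reading, assumed here, is per-class accuracy: the share of items of each true class that the classifier labels correctly. A minimal sketch with invented labels:

```python
def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of items with that true label predicted correctly}."""
    totals, correct = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (1 if t == p else 0)
    return {label: correct[label] / totals[label] for label in totals}

# Hypothetical labels: 0 = non-fake, 1 = fake.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]

scores = per_class_accuracy(y_true, y_pred)
print(scores)
```

Reporting the two classes separately shows asymmetries that a single overall accuracy would hide, such as the k-nearest neighbours result quoted above (0.83 for non-fakes against 0.999 for fakes).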

Cite This Paper

Victoria Vysotska, Krzysztof Przystupa, Lyubomyr Chyrun, Serhii Vladov, Yuriy Ushenko, Dmytro Uhryn, Zhengbing Hu, "Disinformation, Fakes and Propaganda Identifying Methods in Online Messages Based on NLP and Machine Learning Methods", International Journal of Computer Network and Information Security (IJCNIS), Vol.16, No.5, pp.57-85, 2024. DOI: 10.5815/ijcnis.2024.05.06
