Improving Classification by Using MASI Algorithm for Resampling Imbalanced Dataset

Full Text (PDF, 300KB), PP.33-41


Author(s)

Thuy Nguyen Thi Thu 1,*, Lich Nghiem Thi 1, Nguyen Thu Thuy 1, Toan Nghiem Thi 2, Nguyen Chi Trung 3

1. ThuongMai University, Hanoi, Vietnam

2. LyNhanTong High School, Bacninh, Vietnam

3. Hanoi National University of Education, Hanoi, Vietnam

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2019.10.04

Received: 20 Apr. 2019 / Revised: 23 May 2019 / Accepted: 7 Jun. 2019 / Published: 8 Oct. 2019

Index Terms

Classification, Transaction Fraudulent Detection, Imbalanced Dataset, Resampling

Abstract

Financial fraud detection currently attracts considerable interest from machine learning researchers. One reason is the large ratio between normal and abnormal transactions in the data. As a result, a high overall prediction rate does not imply good fraud detection, because the experimental result may be affected by the imbalance in the dataset. Resampling a dataset before classification can therefore be seen as a required step in financial fraud detection research. This paper proposes an algorithm, called MASI, to improve classification results. The algorithm reduces the imbalance in the dataset by re-labelling majority-class samples (normal transactions) as minority-class samples based on their nearest neighbors. The algorithm has been validated on data from the UCI Machine Learning Repository and then applied to data from a Vietnamese financial company. The results show better sensitivity, specificity, and G-mean values than other published control methods (Random Over-sampling, Random Under-sampling, SMOTE, and Borderline SMOTE). Because MASI only re-labels samples, it also preserves the training dataset, whereas the other methods add or remove samples. Moreover, classifiers trained on MASI-resampled data detected more abnormal transactions than classifiers trained without resampling (the original training data).
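The re-labelling idea and the G-mean metric mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' MASI implementation: the function names, the neighbor count `k`, and the threshold `min_minority_neighbors` are assumptions made for illustration, and the abstract does not specify the actual decision rule. G-mean is taken here as the commonly used square root of sensitivity times specificity.

```python
import math
import numpy as np


def relabel_majority_by_neighbors(X, y, k=3, min_minority_neighbors=2,
                                  majority=0, minority=1):
    """Illustrative rule (an assumption, not the published MASI rule):
    a majority sample becomes minority when at least
    `min_minority_neighbors` of its k nearest neighbors are minority."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    # A sample is never its own neighbor.
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # Count minority neighbors using the original labels only.
    minority_counts = np.sum(y[neighbors] == minority, axis=1)
    flip = (y == majority) & (minority_counts >= min_minority_neighbors)
    new_y = y.copy()
    new_y[flip] = minority
    return new_y


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


# Toy data (made up): four normal points near the origin, two abnormal
# points near (5, 5), and one normal point sitting between the abnormal ones.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [5.5, 5.5]]
y = [0, 0, 0, 0, 1, 1, 0]
y_resampled = relabel_majority_by_neighbors(X, y)
```

In this sketch the number of samples is unchanged after resampling; only labels move from the majority class to the minority class, which is the property the abstract contrasts with over-sampling and under-sampling methods.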

Cite This Paper

Thuy Nguyen Thi Thu, Lich Nghiem Thi, Nguyen Thu Thuy, Toan Nghiem Thi, Nguyen Chi Trung, "Improving Classification by Using MASI Algorithm for Resampling Imbalanced Dataset", International Journal of Intelligent Systems and Applications (IJISA), Vol. 11, No. 10, pp. 33-41, 2019. DOI: 10.5815/ijisa.2019.10.04
