Genetic-based Summarization for Local Outlier Detection in Data Stream

Full Text (PDF, 328KB), PP.58-68

Views: 0 Downloads: 0

Author(s)

Mohamed Sakr 1,* Walid Atwa 1 Arabi Keshk 1

1. Computer Science Dept. Faculty of Computers and Information, Menoufia University, Egypt

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2021.01.05

Received: 3 Feb. 2020 / Revised: 8 Mar. 2020 / Accepted: 16 Mar. 2020 / Published: 8 Feb. 2021

Index Terms

Outlier detection, data streams, local outlier factor, genetics

Abstract

Outlier detection is one of the important tasks in data mining. Detecting outliers over streaming data has become an important task in many applications, such as network analysis, fraud detections, and environment monitoring. One of the well-known outlier detection algorithms called Local Outlier Factor (LOF). However, the original LOF has many drawbacks that can’t be used with data streams: 1- it needs a lot of processing power (CPU) and large memory to detect the outliers. 2- it deals with static data which mean that in any change in data the LOF recalculates the outliers from the beginning on the whole data. These drawbacks make big challenges for existing outlier detection algorithms in terms of their accuracies when they are implemented in the streaming environment. In this paper, we propose a new algorithm called GSILOF that focuses on detecting outliers from data streams using genetics. GSILOF solve the problem of large memory needed as it has fixed memory bound. GSILOF has two phases. First, the summarization phase that tries to summarize the past data arrived. Second, the detection phase detects the outliers from the new arriving data. The summarization phase uses a genetic algorithm to try to find the subset of points that can represent the whole original set. our experiments have been done over real datasets. Our experiments confirming the effectiveness of the proposed approach and the high quality of approximate solutions in a set of real-world streaming data.

Cite This Paper

Mohamed Sakr, Walid Atwa, Arabi Keshk, "Genetic-based Summarization for Local Outlier Detection in Data Stream", International Journal of Intelligent Systems and Applications(IJISA), Vol.13, No.1, pp.58-68, 2021. DOI:10.5815/ijisa.2021.01.05

Reference

[1]S. Sadik and L. Gruenwald, "Research issues in outlier detection for data streams," ACM SIGKDD Explorations Newsletter, vol. 15, pp. 33-40, 2014.
[2]M. Salehi, C. Leckie, J. C. Bezdek, T. Vaithianathan and X. Zhang, "Fast memory efficient local outlier detection in data streams," IEEE Transactions on Knowledge and Data Engineering, vol. 28, pp. 3246-3260, 2016.
[3]Y. Yan, L. Cao, C. Kulhman and E. Rundensteiner, "Distributed local outlier detection in big data," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017.
[4]M. Sakr, W. Atwa, and A. Keshk. "Research Article Distributed Anomaly Detection Over Big Data." Research Journal of Applied Sciences, Engineering and Technology 16, no. 2 (2019): 77-87.
[5]A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern recognition letters, vol. 31, pp. 651-666, 2010.
[6]D. Pokrajac, A. Lazarevic and L. J. Latecki, "Incremental local outlier detection for data streams," in Computational intelligence and data mining, 2007. CIDM 2007. IEEE symposium on, 2007.
[7]V. Chandola, A. Banerjee and V. Kumar, "Anomaly detection: A survey," ACM computing surveys (CSUR), vol. 41, p. 15, 2009.
[8]M. Sakr, W. Atwa, and A. Keshk. "Sub-Grid Partitioning Algorithm for Distributed Outlier Detection on Big Data." In 2018 13th International Conference on Computer Engineering and Systems (ICCES), pp. 252-257. IEEE, 2018.
[9]M. Gupta, J. Gao, C. C. Aggarwal and J. Han, "Outlier detection for temporal data: A survey," IEEE Transactions on Knowledge and Data Engineering, vol. 26, pp. 2250-2267, 2014.
[10]W. Atwa, and L. Kan. "Clustering evolving data stream with affinity propagation algorithm." In International Conference on Database and Expert Systems Applications, pp. 446-453. Springer, Cham, 2014.
[11]E. M. Knox and R. T. Ng, "Algorithms for mining distancebased outliers in large datasets," in Proceedings of the international conference on very large data bases, 1998.
[12]F. Angiulli and F. Fassetti, "Detecting distance-based outliers in streams of data," in Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, 2007.
[13]D. Yang, E. A. Rundensteiner and M. O. Ward, "Neighbor-based pattern detection for windows over streaming data," in Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, 2009.
[14]M. Kontaki, A. Gounaris, A. N. Papadopoulos, K. Tsichlas and Y. Manolopoulos, "Continuous monitoring of distance-based outliers over data streams," in Data Engineering (ICDE), 2011 IEEE 27th International Conference on, 2011.
[15]M. M. Breunig, H.-P. Kriegel, R. T. Ng and J. Sander, "LOF: identifying density-based local outliers," in ACM sigmod record, 2000.
[16]K. Yamanishi, J.-I. Takeuchi, G. Williams and P. Milne, "On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms," Data Mining and Knowledge Discovery, vol. 8, pp. 275-300, 2004.
[17]K. Yamanishi and J.-i. Takeuchi, "A unifying framework for detecting outliers and change points from non-stationary time series data," in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002.
[18]W. Atwa, and L. Kan. "Constraint-based clustering algorithm for multi-density data and arbitrary shapes." In Industrial Conference on Data Mining, pp. 78-92. Springer, Cham, 2017.
[19]F. Cao, M. Estert, W. Qian and A. Zhou, "Density-based clustering over an evolving data stream with noise," in Proceedings of the 2006 SIAM international conference on data mining, 2006.
[20]W. Atwa, and L. Kan. "Semi-supervised Clustering Method for Multi-density Data." In International Conference on Database Systems for Advanced Applications, pp. 313-319. Springer, Cham, 2015.
[21]Elahi, M.; Li, K.; Nisar, W.; Lv, X.; Wang, H. Efficient Clustering-Based Outlier Detection Algorithm for Dynamic Data Stream. In Proceedings of the 5th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Shandong, China, 18–20 October 2008; Volume 5, pp. 298–304.
[22]W. Atwa, and L. Kan. "Affinity propagation-based clustering for data streams." Applied Mathematics & Information Sciences 9, no. 4, 2015.
[23]I. Assent, P. Kranen, C. Baldauf and T. Seidl, "Anyout: Anytime outlier detection on streaming data," in International Conference on Database Systems for Advanced Applications, 2012.
[24]M. Salehi, C. A. Leckie, M. Moshtaghi and T. Vaithianathan, "A relevance weighted ensemble model for anomaly detection in switching data streams," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2014.
[25]D. E. Goldberg and J. H. Holland, "Genetic algorithms and machine learning," Machine learning, vol. 3, pp. 95-99, 1988.
[26]G. S. Na, D. Kim and H. Yu, "DILOF: Effective and Memory Efficient Local Outlier Detection in Data Streams," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
[27]B. Póczos, L. Xiong and J. Schneider, "Nonparametric divergence estimation with applications to machine learning on distributions," arXiv preprint arXiv:1202.3758, 2012.
[28]D. Dheeru and E. Karra Taniskidou, UCI Machine Learning Repository, 2017.
[29]C. C. Aggarwal and S. Sathe, "Theoretical foundations and algorithms for outlier ensembles," ACM SIGKDD Explorations Newsletter, vol. 17, pp. 24-47, 2015.
[30]J. Tang, Z. Chen, A. W.-C. Fu and D. W. Cheung, "Enhancing effectiveness of outlier detections for low density patterns," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2002.
[31]S. Ramaswamy, R. Rastogi and K. Shim, "Efficient algorithms for mining outliers from large data sets," in ACM Sigmod Record, 2000.
[32]E. Lozano and E. Acufia, "Parallel algorithms for distance-based and density-based outliers," in Data Mining, Fifth IEEE International Conference on, 2005.
[33]W. Jin, A. K. H. Tung, J. Han and W. Wang, "Ranking outliers using symmetric neighborhood relationship," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2006.
[34]D. M. Hawkins, Identification of outliers, vol. 11, Springer, 1980.
[35]M. Bai, X. Wang, J. Xin and G. Wang, "An efficient algorithm for distributed density-based outlier detection on big data," Neurocomputing, vol. 181, pp. 19-28, 2016.
[36]C. C. Aggarwal and P. S. Yu, "Outlier detection with uncertain data," in Proceedings of the 2008 SIAM International Conference on Data Mining, 2008.
[37]C. C. Aggarwal and P. S. Yu, "Outlier detection for high dimensional data," in ACM Sigmod Record, 2001.