Parallel DBSCAN Clustering Algorithm Using Hadoop Map-reduce Framework for Spatial Data

Author(s)

Maithri. C. 1,* Chandramouli H. 2

1. Department of Computer Science and Engineering, Kalpataru Institute of Technology, Tiptur, India

2. Department of Computer Science and Engineering, East Point College of Engineering and Technology, Bangalore, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2022.06.01

Received: 10 Jul. 2022 / Revised: 14 Sep. 2022 / Accepted: 14 Oct. 2022 / Published: 8 Dec. 2022

Index Terms

Artificial Intelligence, Data mining, DBSCAN, Hadoop, Parallel Clustering

Abstract

Data clustering is a first step in many big data analysis applications and a driving model for Artificial Intelligence and Machine Learning architectures. Processing large volumes of data quickly is a major challenge in these applications and requires fast, efficient algorithms. Parallel clustering algorithms are one promising design for speeding up the handling of such data. In this paper, a parallel algorithm for clustering spatial datasets, called P-DBSCAN, is implemented using the Hadoop MapReduce framework. The work targets improved data clustering for data analytics applications. P-DBSCAN is executed over a generated dataset, and its results are compared with those of the existing DBSCAN algorithm to show the improvement in runtime performance. The outcome also shows how P-DBSCAN addresses the scalability problem of large datasets.
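To make the idea of a MapReduce-style DBSCAN concrete, the following is a minimal sketch in plain Python. It is not the authors' P-DBSCAN implementation and does not use the Hadoop API; it only simulates the map, shuffle, reduce, and merge steps in one process. The function names (`map_phase`, `local_dbscan`, `p_dbscan`), the strip-based partitioning along the x axis, and the `width` parameter are all assumptions made for illustration. The sketch assumes 2-D points and `width >= eps`.

```python
from collections import defaultdict
from math import floor, hypot

NOISE = -1

def local_dbscan(points, eps, min_pts):
    """Plain DBSCAN on one partition; returns (labels, core_flags) by index."""
    n = len(points)
    neighbors = [
        [j for j in range(n)
         if hypot(points[i][0] - points[j][0], points[i][1] - points[j][1]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbors]  # neighborhood includes the point itself
    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != NOISE:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:  # expand the cluster through core points only
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster
                    if core[q]:
                        stack.append(q)
        cluster += 1
    return labels, core

def map_phase(points, eps, width):
    """Hypothetical mapper: key each point by its x-strip, copying it into an
    adjacent strip when it lies within eps of that strip (assumes width >= eps)."""
    for pid, (x, y) in enumerate(points):
        home = floor(x / width)
        yield home, (pid, (x, y), True)
        if floor((x - eps) / width) < home:
            yield home - 1, (pid, (x, y), False)
        if floor((x + eps) / width) > home:
            yield home + 1, (pid, (x, y), False)

def p_dbscan(points, eps, min_pts, width):
    # Shuffle: group mapper output by strip id.
    strips = defaultdict(list)
    for strip, record in map_phase(points, eps, width):
        strips[strip].append(record)

    # Reduce: cluster each strip independently.
    seen = defaultdict(list)  # pid -> [(local cluster key or None, is_home, is_core)]
    for strip, records in strips.items():
        labels, core = local_dbscan([pt for _, pt, _ in records], eps, min_pts)
        for (pid, _, is_home), lab, c in zip(records, labels, core):
            seen[pid].append((None if lab == NOISE else (strip, lab), is_home, c))

    # Merge: union local clusters that share a point which is core in its home strip.
    parent = {}
    def find(k):
        parent.setdefault(k, k)
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for entries in seen.values():
        keys = [k for k, _, _ in entries if k is not None]
        if any(is_home and c for _, is_home, c in entries):
            for k in keys[1:]:
                parent[find(k)] = find(keys[0])

    # Final labels: prefer the home-strip label, fall back to a copy's label.
    result, ids = [], {}
    for pid in range(len(points)):
        entries = sorted(seen[pid], key=lambda e: not e[1])
        key = next((k for k, _, _ in entries if k is not None), None)
        result.append(NOISE if key is None else ids.setdefault(find(key), len(ids)))
    return result
```

Calling `p_dbscan(points, eps, min_pts, width)` on a list of `(x, y)` tuples returns one integer label per point, with `-1` marking noise. Because each point's full eps-neighborhood is present in its home strip, core-point decisions there match those of a sequential run, which is what makes the merge rule safe in this sketch.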

Cite This Paper

Maithri. C., Chandramouli H., "Parallel DBSCAN Clustering Algorithm Using Hadoop Map-reduce Framework for Spatial Data", International Journal of Information Technology and Computer Science (IJITCS), Vol. 14, No. 6, pp. 1-12, 2022. DOI: 10.5815/ijitcs.2022.06.01
