Performance Analysis of Improved Clustering Algorithm on Real and Synthetic Data

Full Text (PDF, 524KB), pp. 57-65


Author(s)

Anand Khandare 1,*, A. S. Alvi 2

1. Department of CSE, SGB Amravati University, Amravati, India

2. Department of CSE, PRMIT&R, Badnera, Amravati, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijcnis.2017.10.07

Received: 15 Jun. 2017 / Revised: 20 Jul. 2017 / Accepted: 10 Aug. 2017 / Published: 8 Oct. 2017

Index Terms

Data mining, Clustering Algorithm, Validity Measure, Run time, Optimal Clusters

Abstract

Clustering is an important data mining technique that partitions data objects into groups, or clusters. Many clustering methods are discussed in the literature; some are efficient on large data, while others are not. Common algorithms include k-means, Partitioning Around Medoids (PAM, also called k-medoids), hierarchical clustering, and DBSCAN. Of these, k-means, which partitions data into k clusters, is more popular than the others. However, k must be provided explicitly, and the initial means are chosen randomly, which may produce clusters of poor quality. This paper studies and implements an improved clustering algorithm that automatically predicts the value of k and uses a new technique to select the initial means. It presents a performance analysis of the improved algorithm and other algorithms on real and synthetic datasets. Performance is measured by running time and by several cluster validity measures: sum of squared errors, silhouette score, compactness, separation, Dunn index, and Davies-Bouldin (DB) index. The k predicted by the improved algorithm is also compared with the optimal k suggested by the elbow method, and the two values are found to be nearly the same. For the improved algorithm, most of the validity measures are found to be optimal.
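
As a rough illustration of the elbow method and the sum of squared errors mentioned above, the following is a minimal sketch in plain Python. It runs standard k-means, not the improved algorithm studied in the paper, whose details are not given here. The function names (sq_dist, kmeans, sse, sse_curve), the toy points, and the choice to seed the means with the first k points are all assumptions made for this example.

```python
# Illustrative sketch only: standard k-means plus SSE for an elbow curve.
# This is NOT the paper's improved algorithm; all names here are hypothetical.

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, means):
    """Index of the nearest mean for every point."""
    return [min(range(len(means)), key=lambda j: sq_dist(p, means[j]))
            for p in points]

def kmeans(points, k, iters=100):
    # Assumption: seed with the first k points so the run is deterministic.
    means = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        labels = assign(points, means)
        new_means = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                # New mean is the coordinate-wise average of the members.
                new_means.append(tuple(sum(c) / len(members)
                                       for c in zip(*members)))
            else:
                # Keep the old mean if a cluster ends up empty.
                new_means.append(means[j])
        if new_means == means:
            break
        means = new_means
    return means, assign(points, means)

def sse(points, means, labels):
    """Sum of squared errors: squared distance of each point to its mean."""
    return sum(sq_dist(p, means[l]) for p, l in zip(points, labels))

def sse_curve(points, k_max):
    """SSE for k = 1..k_max; the 'elbow' is where the drop flattens out."""
    curve = {}
    for k in range(1, k_max + 1):
        means, labels = kmeans(points, k)
        curve[k] = sse(points, means, labels)
    return curve

# Toy data: two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
curve = sse_curve(points, 4)
for k in sorted(curve):
    print(k, curve[k])
```

On this toy data the SSE falls sharply from k = 1 to k = 2 and changes little afterwards, which is the kind of bend the elbow method looks for. A fixed seed keeps the example reproducible, but random or more careful initialization would normally be used in practice.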

Cite This Paper

Anand Khandare, A. S. Alvi, "Performance Analysis of Improved Clustering Algorithm on Real and Synthetic Data", International Journal of Computer Network and Information Security (IJCNIS), Vol. 9, No. 10, pp. 57-65, 2017. DOI: 10.5815/ijcnis.2017.10.07
