An Univariate Feature Elimination Strategy for Clustering Based on Metafeatures

Author(s)

Saptarsi Goswami 1,*, Sanjay Chakraborty 2, Himadri Nath Saha 3

1. A.K. Choudhury Institute of Technology, Calcutta University, Kolkata, India

2. Department of Computer Science & Engineering, Institute of Engineering & Management, Kolkata, India

3. Department of Electrical and Electronics Engineering, Institute of Engineering & Management, Kolkata, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2017.10.03

Received: 26 Jan. 2017 / Revised: 1 Apr. 2017 / Accepted: 5 Jun. 2017 / Published: 8 Oct. 2017

Index Terms

Feature Selection, Feature Elimination, Entropy, Skewness, Kurtosis, Coefficient of Variation, Correlation

Abstract

Feature selection plays a very important role in all pattern recognition tasks. It offers several benefits: reduced data collection effort, better interpretability of the models, and reduced model building and execution time. Many problems in feature selection have been shown to be NP-hard. There has been significant research on feature selection over the last three decades; however, feature selection for clustering remains quite an open area, chiefly because, unlike in supervised tasks, no target variable is available. In this paper, five properties, or metafeatures, of a feature, namely its entropy, skewness, kurtosis, coefficient of variation, and average correlation with the other features, are studied and analysed. An extensive study has been conducted over 21 publicly available datasets to evaluate the viability of a feature elimination strategy based on the values of these metafeatures for feature selection in clustering. A strategy for selecting the most appropriate metafeatures for a particular dataset has also been outlined. The results indicate that the decrease in clustering performance after elimination is not statistically significant.
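To make the idea concrete, below is a minimal sketch of metafeature-based univariate feature elimination, assuming a numeric dataset held in a NumPy array. The metafeature names follow the abstract; the function names (`metafeatures`, `eliminate`), the histogram bin count used to discretise features for entropy, and the fraction of features dropped are illustrative assumptions, not the authors' exact procedure.

```python
# Sketch: rank features by a single metafeature and drop the least desirable ones.
import numpy as np
from scipy import stats

def metafeatures(X, bins=10):
    """Return a list with one dict of metafeatures per column (feature) of X."""
    n_features = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))    # |Pearson| between features
    rows = []
    for j in range(n_features):
        x = X[:, j]
        counts, _ = np.histogram(x, bins=bins)     # discretise for entropy
        rows.append({
            "entropy": stats.entropy(counts + 1e-12),
            "skewness": stats.skew(x),
            "kurtosis": stats.kurtosis(x),
            "coef_variation": np.std(x) / (np.abs(np.mean(x)) + 1e-12),
            # average correlation with the other features (self-correlation removed)
            "avg_correlation": (corr[j].sum() - 1.0) / (n_features - 1),
        })
    return rows

def eliminate(X, key, keep_high=True, drop_frac=0.25):
    """Drop the drop_frac of features with the least desirable value of one metafeature."""
    values = np.array([m[key] for m in metafeatures(X)])
    order = np.argsort(values)                     # ascending
    n_drop = int(drop_frac * X.shape[1])
    dropped = order[:n_drop] if keep_high else order[-n_drop:]
    kept = np.setdiff1d(np.arange(X.shape[1]), dropped)
    return X[:, kept], kept

# Example: drop the quarter of features with the lowest entropy,
# then pass X_reduced to any clustering algorithm.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X_reduced, kept = eliminate(X, key="entropy", keep_high=True)
print("kept feature indices:", kept)
```

Whether high or low values of a metafeature are desirable (the `keep_high` flag here) depends on the metafeature and the dataset; choosing which metafeature to rank by for a given dataset corresponds to the selection strategy the abstract mentions.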

Cite This Paper

Saptarsi Goswami, Sanjay Chakraborty, Himadri Nath Saha, "An Univariate Feature Elimination Strategy for Clustering Based on Metafeatures", International Journal of Intelligent Systems and Applications (IJISA), Vol.9, No.10, pp.20-30, 2017. DOI: 10.5815/ijisa.2017.10.03
