Microarray Gene-expression Data Classification using Less Gene Expressions by Combining Feature Selection Methods and Classifiers

Author(s)

Aarti Bhalla 1,*, R. K. Agrawal 1

1. School of Computer and Systems Sciences, Jawaharlal Nehru University, New Delhi, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijieeb.2013.05.06

Received: 10 Jul. 2013 / Revised: 15 Aug. 2013 / Accepted: 18 Sep. 2013 / Published: 8 Nov. 2013

Index Terms

Microarrays, Feature Selection, Hypothesis testing, Classification with less genes

Abstract

Microarray data, often characterised by high dimensionality and small sample sizes, are used in cancer classification problems, which classify given tissue samples as diseased or healthy on the basis of their gene expression profiles. The goal of feature selection is to find the most relevant features among the thousands of related features in a particular problem domain. This study focuses on a method that relaxes the maximum-accuracy criterion for feature selection: it selects the combination of feature selection method and classifier that, using a small subset of features, obtains an accuracy not statistically significantly different from the maximum accuracy. Selecting a classifier that uses a small number of features while retaining good accuracy reduces the risk of overfitting (bias). This is corroborated empirically using four common attribute selection methods (ReliefF, SVM-RFE, FCBF, and Gain Ratio) and three classifiers (3-Nearest Neighbour, Naive Bayes, and SVM) applied to six microarray cancer data sets. We use hypothesis testing to compare several configurations and select those that perform well with few genes on these data sets.
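
The selection rule described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: a plain paired t-test over 10 cross-validation folds stands in for the paper's hypothesis test, the per-fold accuracies are invented for illustration, and the names `paired_t`, `select_config`, and `T_CRIT_DF9` are hypothetical. Fold accuracies from cross-validation are not independent, so a variance-corrected resampled test may be a better choice in practice.

```python
from math import copysign, sqrt
from statistics import mean, stdev

# Two-sided 5% critical value of Student's t with 9 degrees of freedom
# (10 paired folds). Assumption: a plain paired t-test is used here only
# as a stand-in for the paper's actual hypothesis test.
T_CRIT_DF9 = 2.262


def paired_t(a, b):
    """Paired t statistic for two equal-length lists of per-fold accuracies."""
    diffs = [x - y for x, y in zip(a, b)]
    m, sd = mean(diffs), stdev(diffs)
    if sd == 0:
        # Every fold differs by the same amount: no variance to test against.
        return 0.0 if m == 0 else copysign(float("inf"), m)
    return m / (sd / sqrt(len(diffs)))


def select_config(results, t_crit=T_CRIT_DF9):
    """Pick the configuration with the fewest genes whose accuracy is not
    significantly different from the most accurate configuration.

    `results` maps (selector, classifier, n_genes) -> per-fold accuracies.
    """
    best = max(results, key=lambda k: mean(results[k]))
    candidates = [
        k for k in results
        if abs(paired_t(results[best], results[k])) < t_crit
    ]
    # Fewest genes first; ties broken by higher mean accuracy.
    return min(candidates, key=lambda k: (k[2], -mean(results[k])))


# Hypothetical per-fold accuracies, invented for illustration only.
results = {
    ("SVM-RFE", "SVM", 200): [0.95, 0.97, 0.96, 0.94, 0.98,
                              0.95, 0.96, 0.97, 0.95, 0.96],
    ("FCBF", "NaiveBayes", 20): [0.94, 0.97, 0.95, 0.95, 0.97,
                                 0.94, 0.97, 0.96, 0.95, 0.95],
    ("GainRatio", "3-NN", 5): [0.85, 0.88, 0.86, 0.84, 0.87,
                               0.85, 0.86, 0.88, 0.84, 0.86],
}

chosen = select_config(results)
print(chosen)
```

With these invented numbers, the 200-gene configuration has the highest mean accuracy, the 20-gene configuration is statistically indistinguishable from it, and the 5-gene configuration is clearly worse, so the sketch picks the 20-gene configuration.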

Cite This Paper

Aarti Bhalla, R. K. Agrawal, "Microarray gene-expression data classification using less gene expressions by combining feature selection methods and classifiers", International Journal of Information Engineering and Electronic Business (IJIEEB), vol. 5, no. 5, pp. 42-48, 2013. DOI: 10.5815/ijieeb.2013.05.06
