A Hybrid Data Mining Technique for Improving the Classification Accuracy of Microarray Data Set

Full Text (PDF, 182KB), PP.43-50

Author(s)

Sujata Dash 1,*, Bichitrananda Patra 1, B.K. Tripathy 2

1. KMBB College of Engineering & Technology, Bhubaneswar, India

2. VIT-University, Vellore, Tamilnadu, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijieeb.2012.02.07

Received: 10 Jan. 2012 / Revised: 14 Feb. 2012 / Accepted: 16 Mar. 2012 / Published: 8 Apr. 2012

Index Terms

Partial least square, feature reduction, feature selection, microarrays, gene expression

Abstract

A major challenge in biomedical studies in recent years has been the classification of gene expression profiles into categories, such as cases and controls. This is done by first training a classifier on a training set of labeled samples from the two populations, and then using that classifier to predict the labels of new samples. Such predictions have recently been shown to improve diagnosis and treatment selection for several diseases. This procedure is complicated, however, by the high dimensionality of the data. While microarrays can measure the expression levels of thousands of genes per sample, case-control microarray studies usually involve no more than several dozen samples. Standard classifiers do not work well in these situations, where the number of features (gene expression levels measured in these microarrays) far exceeds the number of samples. Selecting only the features that are most relevant for discriminating between the two categories can help construct better classifiers, in terms of both accuracy and efficiency. This paper compares a dimension reduction technique, the Partial Least Squares (PLS) method, with a hybrid feature selection scheme, and evaluates the relative performance of four supervised classification procedures incorporating those methods: Radial Basis Function Network (RBFN), Multilayer Perceptron Network (MLP), Support Vector Machine with a polynomial kernel function (Polynomial-SVM), and Support Vector Machine with an RBF kernel function (RBF-SVM). Experimental results show that the Partial Least Squares (PLS) regression method is an appropriate feature selection method, and that combining different classification and feature selection approaches makes it possible to construct high-performance classification models for microarray data.

Cite This Paper

Sujata Dash, Bichitrananda Patra, B.K. Tripathy, "A Hybrid Data Mining Technique for Improving the Classification Accuracy of Microarray Data Set", International Journal of Information Engineering and Electronic Business (IJIEEB), vol. 4, no. 2, pp. 43-50, 2012. DOI: 10.5815/ijieeb.2012.02.07
