Modified linear discriminant analysis approaches for classification of high-dimensional microarray data. (English) Zbl 1453.62255

Summary: Linear discriminant analysis (LDA) is one of the most popular classification methods. In high-dimensional microarray data, where samples are few and features are many, classical LDA performs sub-optimally because the within-group covariance matrix is singular and unstable. Two modified LDA approaches (MLDA and NLDA) were applied to microarray classification, and their performance was compared with that of other popular classification algorithms across a range of feature set sizes (number of genes) using both simulated and real datasets. The results showed that the overall performance of the two modified LDA approaches was competitive with support vector machines and other regularized LDA approaches, and better than diagonal linear discriminant analysis, \(k\)-nearest neighbor, and classical LDA. It was concluded that the modified LDA approaches are an effective classification tool for high-dimensional microarray problems with limited sample size.
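The singularity mentioned in the summary can be illustrated with a minimal sketch. It assumes synthetic Gaussian data and a generic shrinkage toward a scaled identity matrix; this is not the MLDA or NLDA estimator of the paper, whose definitions the summary does not give. The function names `pooled_within_cov` and `shrink_to_identity` and the value `alpha=0.2` are hypothetical choices made for illustration only.

```python
import numpy as np

def pooled_within_cov(X, y):
    """Pooled within-group covariance of the rows of X, grouped by label y."""
    n, p = X.shape
    classes = np.unique(y)
    S = np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        centered = Xc - Xc.mean(axis=0)
        S += centered.T @ centered
    return S / (n - len(classes))

def shrink_to_identity(S, alpha):
    """Generic shrinkage (an assumption, not MLDA/NLDA):
    convex combination of S and a scaled identity matrix."""
    p = S.shape[0]
    target = (np.trace(S) / p) * np.eye(p)
    return (1.0 - alpha) * S + alpha * target

# Synthetic "few samples, many features" setting: 20 samples, 50 features.
rng = np.random.default_rng(0)
n_per_class, p = 10, 50
X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, p)),
               rng.normal(0.5, 1.0, (n_per_class, p))])
y = np.repeat([0, 1], n_per_class)

S = pooled_within_cov(X, y)
rank_S = np.linalg.matrix_rank(S)        # at most n - 2 = 18, so S is singular

S_reg = shrink_to_identity(S, alpha=0.2)
rank_reg = np.linalg.matrix_rank(S_reg)  # full rank after shrinkage

# With an invertible estimate, the LDA direction can be computed.
mean_diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
w = np.linalg.solve(S_reg, mean_diff)

print(f"rank of pooled covariance: {rank_S} (p = {p})")
print(f"rank after shrinkage:      {rank_reg}")
```

Because the pooled estimate has rank at most the number of samples minus the number of groups, classical LDA cannot invert it when features outnumber samples; any regularized variant has to replace or modify this matrix in some way.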


62-08 Computational methods for problems pertaining to statistics
62P10 Applications of statistics to biology and medical sciences; meta analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
92D20 Protein sequences, DNA sequences

