Feature selection in omics prediction problems using cat scores and false nondiscovery rate control. (English) Zbl 1189.62102

Summary: We revisit the problem of feature selection in linear discriminant analysis (LDA), that is, when features are correlated. First, we introduce a pooled centroids formulation of the multiclass LDA predictor function, in which the relative weights of Mahalanobis-transformed predictors are given by correlation-adjusted \(t\)-scores (cat scores). Second, for feature selection we propose thresholding cat scores by controlling false nondiscovery rates (FNDR). Third, training of the classifier is based on James-Stein shrinkage estimates of correlations and variances, where regularization parameters are chosen analytically without resampling. Overall, this results in an effective and computationally inexpensive framework for high-dimensional prediction with natural feature selection. The proposed shrinkage discriminant procedures are implemented in the R package “sda” available from the R repository CRAN.
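As a rough illustration of the first point, cat scores for class \(k\) can be sketched as \(\tau_k = P^{-1/2} t_k\), where \(t_k\) is the vector of \(t\)-scores comparing the class centroid with the pooled centroid and \(P\) is the feature correlation matrix. The Python sketch below makes several assumptions that are not in the summary: it uses plain empirical pooled variances and correlations in place of the James-Stein shrinkage estimates (so it needs more samples than features), and the function names `inv_sqrt_sym` and `cat_scores` are hypothetical and are not the interface of the "sda" package.

```python
import numpy as np


def inv_sqrt_sym(mat, eps=1e-12):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, eps, None)  # guard against tiny or negative eigenvalues
    # V diag(vals^-1/2) V^T; the division scales column j by sqrt(vals[j]).
    return (vecs / np.sqrt(vals)) @ vecs.T


def cat_scores(X, y):
    """Return (cat, t), each of shape (K, p), for K classes and p features.

    Assumption of this sketch: empirical pooled estimates, no shrinkage,
    so the correlation matrix must be invertible (n - K >= p).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n, p = X.shape
    pooled_mean = X.mean(axis=0)

    # Pooled within-class covariance from class-centered data.
    resid = X.copy()
    for k in classes:
        resid[y == k] -= X[y == k].mean(axis=0)
    cov = resid.T @ resid / (n - len(classes))
    sd = np.sqrt(np.diag(cov))
    corr = cov / np.outer(sd, sd)

    # t-scores of each class centroid against the pooled centroid.
    # Var(mean_k - pooled_mean) = sigma^2 * (1/n_k - 1/n).
    t = np.empty((len(classes), p))
    for i, k in enumerate(classes):
        n_k = np.sum(y == k)
        scale = 1.0 / np.sqrt(1.0 / n_k - 1.0 / n)
        t[i] = scale * (X[y == k].mean(axis=0) - pooled_mean) / sd

    # Decorrelate: row i of cat is P^{-1/2} t_i (P^{-1/2} is symmetric).
    cat = t @ inv_sqrt_sym(corr)
    return cat, t


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 4))
    y = np.repeat([0, 1, 2], 20)
    X[y == 2, 0] += 2.0  # make feature 0 informative for class 2
    cat, t = cat_scores(X, y)
    # Rank features by summed squared cat scores across classes.
    print(np.argsort(-(cat ** 2).sum(axis=0)))
```

In the two-class case the two rows of `t` differ only in sign, which recovers the usual two-sample \(t\)-score. Feature selection would then threshold a per-feature summary of the cat scores; the FNDR-based choice of that threshold is not reproduced here.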


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H20 Measures of association (correlation, canonical correlation, etc.)
65C60 Computational problems in statistics (MSC2010)


CMA; R; rda; scout; fdrtool; sda; corpor

