Two-group classification with high-dimensional correlated data: a factor model approach. (English) Zbl 1218.62064
Summary: A class of linear classification rules specifically designed for high-dimensional problems is proposed. The new rules are based on Gaussian factor models and successfully incorporate the information contained in the sample correlations. Asymptotic results that allow the number of variables to grow faster than the number of observations demonstrate that the worst possible expected error rate of the proposed rules converges to the error of the optimal Bayes rule when the postulated model is true, and to a slightly larger constant when this model is a reasonable approximation to the data-generating process. Numerical comparisons suggest that, when combined with appropriate variable selection strategies, rules derived from one-factor models perform comparably to, or better than, the most successful extant alternatives under the conditions they were designed for. The proposed methods are implemented in an R package named HiDimDA, available from the CRAN repository.
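As a rough illustration of why a one-factor model suits high-dimensional linear rules, the sketch below assumes the covariance has the form Sigma = diag(d) + b b^T with known parameters and equal group priors. Under that assumption the discriminant direction w = Sigma^{-1}(mu1 - mu2) can be obtained in O(p) time with the Sherman-Morrison identity, without forming a p x p matrix. The function names (`one_factor_solve`, `linear_rule`) and the toy numbers are hypothetical; this is neither the estimator analysed in the paper nor the HiDimDA interface.

```python
from typing import Callable, List, Sequence


def one_factor_solve(d: Sequence[float], b: Sequence[float],
                     v: Sequence[float]) -> List[float]:
    """Solve (diag(d) + b b^T) w = v via the Sherman-Morrison identity.

    d: specific variances (all positive), b: factor loadings, v: right-hand side.
    """
    dinv_v = [vi / di for vi, di in zip(v, d)]   # D^{-1} v
    dinv_b = [bi / di for bi, di in zip(b, d)]   # D^{-1} b
    denom = 1.0 + sum(bi * ui for bi, ui in zip(b, dinv_b))
    scale = sum(bi * ui for bi, ui in zip(b, dinv_v)) / denom
    return [vi - scale * ui for vi, ui in zip(dinv_v, dinv_b)]


def linear_rule(mu1: Sequence[float], mu2: Sequence[float],
                d: Sequence[float], b: Sequence[float]
                ) -> Callable[[Sequence[float]], int]:
    """Build a two-group linear rule assuming equal priors (an assumption)."""
    diff = [a - c for a, c in zip(mu1, mu2)]
    w = one_factor_solve(d, b, diff)             # discriminant direction
    mid = [(a + c) / 2.0 for a, c in zip(mu1, mu2)]

    def classify(x: Sequence[float]) -> int:
        score = sum(wi * (xi - mi) for wi, xi, mi in zip(w, x, mid))
        return 1 if score > 0 else 2

    return classify


# Toy example: Sigma = [[2, 1], [1, 2]], i.e. d = (1, 1) and b = (1, 1).
w_demo = one_factor_solve([1.0, 1.0], [1.0, 1.0], [1.0, 0.0])
factor_rule = linear_rule([1.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0])
naive_rule = linear_rule([1.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0])
```

In this toy setting the point (1, 2) is assigned to group 2 by `factor_rule` but to group 1 by `naive_rule`, which sets the loadings to zero and so ignores the correlation between the two variables.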

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H25 Factor analysis and principal components; correspondence analysis
65C60 Computational problems in statistics (MSC2010)