Testing significance of features by lassoed principal components. (English) Zbl 1149.62092

Summary: We consider the problem of testing the significance of features in high-dimensional settings. In particular, we test for differentially expressed genes in a microarray experiment. We wish to identify genes that are associated with some type of outcome, such as survival time or cancer type. We propose a new procedure, called Lassoed Principal Components (LPC), that builds upon existing methods and can provide a sizable improvement. For instance, in the case of two-class data, a standard (albeit simple) approach might be to compute a two-sample \(t\)-statistic for each gene. The LPC method involves projecting these conventional gene scores onto the eigenvectors of the gene expression data covariance matrix and then applying an \(L_{1}\) penalty in order to de-noise the resulting projections.
We present a theoretical framework under which LPC is the logical choice for identifying significant genes, and we show that LPC can provide a marked reduction in false discovery rates over the conventional methods on both real and simulated data. Moreover, this flexible procedure can be applied to a variety of types of data and can be used to improve many existing methods for the identification of significant features.
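
The projection-and-penalty step described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation (their `lpc` software may differ in scaling, choice of eigenvectors, and tuning). It assumes a samples-by-genes matrix `X`, binary labels `y`, and hypothetical parameters `n_components` and `lam`. Because the eigenvectors are orthonormal, an \(L_{1}\)-penalized regression of the \(t\)-statistics onto them reduces to soft-thresholding the projection coefficients, which is what the sketch uses.

```python
import numpy as np

def two_sample_t(X, y):
    """Welch two-sample t-statistic for each gene.

    X: (n_samples, n_genes) expression matrix; y: array of 0/1 class labels.
    """
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (b.mean(axis=0) - a.mean(axis=0)) / se

def soft_threshold(z, lam):
    """Shrink each entry of z toward zero by lam (the lasso solution
    for a design matrix with orthonormal columns)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lpc_sketch(X, y, n_components=5, lam=1.0):
    """Illustrative LPC-style scores (names and defaults are assumptions).

    Returns the de-noised scores and the raw t-statistics.
    """
    t = two_sample_t(X, y)
    Xc = X - X.mean(axis=0)  # center each gene
    # Rows of Vt are eigenvectors of the gene-by-gene covariance matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T            # (n_genes, k), orthonormal columns
    coef = soft_threshold(V.T @ t, lam)  # L1-penalized projection coefficients
    return V @ coef, t
```

Calling `lpc_sketch(X, y)` returns one de-noised score per gene alongside the raw \(t\)-statistics; a larger `lam` zeroes out more eigenvector coefficients, and `lam=0` gives the plain projection onto the leading eigenvectors.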


62P10 Applications of statistics to biology and medical sciences; meta analysis
62H25 Factor analysis and principal components; correspondence analysis
65C60 Computational problems in statistics (MSC2010)


Software: lpc; Eigenstrat

