Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. (English) Zbl 1257.62115

Summary: In high throughput settings we inspect a great many candidate variables (e.g., genes) searching for associations with a primary variable (e.g., a phenotype). High throughput hypothesis testing can be made difficult by the presence of systemic effects and other latent variables. It is well known that those variables alter the level of tests and induce correlations between tests. They also change the relative ordering of significance levels among the hypotheses. Poor rankings lead to wasteful and ineffective follow-up studies. The problem becomes acute for latent variables that are correlated with the primary variable.
We propose a two-stage analysis to counter the effects of latent variables on the ranking of hypotheses. Our method, called LEAPP, statistically isolates the latent variables from the primary one. In simulations, it gives better ordering of the hypotheses than competing methods such as SVA and EIGENSTRAT. For an illustration, we turn to data from the AGEMAP study relating gene expression to age for 16 tissues in the mouse. LEAPP generates rankings with greater consistency across tissues than the rankings attained by the other methods.
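The ranking problem described in the summary can be illustrated with a small simulation. The sketch below is not LEAPP and is not taken from the paper. It only assumes a linear model in which each gene responds to a primary variable and to one latent variable correlated with it. All names and parameter values (`simulate_rank_distortion`, `latent_corr`, the effect size 0.5) are hypothetical choices made for illustration. It compares a per-gene t-test that ignores the latent variable with an "oracle" test that adjusts for it. The oracle is unavailable in practice because the latent variable is unobserved.

```python
import numpy as np


def abs_t_first_column(y, design):
    """Per-gene |t| statistic for the first column of an OLS design matrix."""
    n, p = design.shape
    coef = y @ np.linalg.pinv(design).T           # shape (n_genes, p)
    resid = y - coef @ design.T
    sigma2 = np.sum(resid ** 2, axis=1) / (n - p)  # residual variance per gene
    cov00 = np.linalg.inv(design.T @ design)[0, 0]
    return np.abs(coef[:, 0]) / np.sqrt(sigma2 * cov00)


def simulate_rank_distortion(n_genes=2000, n_samples=60, n_signal=100,
                             latent_corr=0.7, seed=0):
    """Count true signals among the top-ranked genes, with and without
    adjusting for a latent variable (assumed model, for illustration only)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)            # primary variable
    noise = rng.standard_normal(n_samples)
    # Latent variable correlated with the primary variable.
    z = latent_corr * x + np.sqrt(1.0 - latent_corr ** 2) * noise

    beta = np.zeros(n_genes)
    beta[:n_signal] = 0.5                         # only these genes are truly associated
    loadings = rng.standard_normal(n_genes)       # every gene feels the latent variable

    y = (np.outer(beta, x) + np.outer(loadings, z)
         + rng.standard_normal((n_genes, n_samples)))

    ones = np.ones(n_samples)
    naive = abs_t_first_column(y, np.column_stack([x, ones]))
    oracle = abs_t_first_column(y, np.column_stack([x, z, ones]))

    def hits(stat):
        top = np.argsort(-stat)[:n_signal]        # indices of the top-ranked genes
        return int(np.sum(top < n_signal))        # how many are true signals

    return hits(naive), hits(oracle)


if __name__ == "__main__":
    naive_hits, oracle_hits = simulate_rank_distortion()
    print("true signals in top 100, unadjusted:", naive_hits)
    print("true signals in top 100, oracle-adjusted:", oracle_hits)
```

Under these assumptions the unadjusted ranking is dominated by null genes with large latent loadings, which is the effect the summary attributes to latent variables that are correlated with the primary variable.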


62P10 Applications of statistics to biology and medical sciences; meta analysis
92D10 Genetics and epigenetics
62J15 Paired and multiple comparisons; multiple testing
65C60 Computational problems in statistics (MSC2010)


leapp; FAMT; Eigenstrat

