
Detection boundary and higher criticism approach for rare and weak genetic effects. (English) Zbl 1454.62420

Summary: Genome-wide association studies (GWAS) have identified many genetic factors underlying complex human traits. However, these factors have explained only a small fraction of these traits’ genetic heritability. It is argued that many more genetic factors remain undiscovered. These genetic factors are likely weakly associated at the population level and sparsely distributed across the genome. In this paper, we adapt the recent innovations on J. W. Tukey’s Higher Criticism [“The higher criticism”, in: Course Notes Statistics 411. Princeton, NJ: Princeton Univ. (1976); D. Donoho and J. Jin, Ann. Stat. 32, No. 3, 962–994 (2004; Zbl 1092.62051)] to SNP-set analysis of GWAS, and develop a new theoretical framework in large-scale inference to assess the joint significance of such rare and weak effects for a quantitative trait. At the core of our theory is the so-called detection boundary, a curve in the two-dimensional phase space that quantifies the rarity and strength of genetic effects. Above the detection boundary, the overall effects of genetic factors are strong enough for reliable detection. Below the detection boundary, the genetic factors are simply too rare and too weak for reliable detection. We show that the HC-type methods are optimal in that they reliably yield detection once the parameters of the genetic effects fall above the detection boundary and that many commonly used SNP-set methods are suboptimal. The superior performance of the HC-type approach is demonstrated through simulations and the analysis of a GWAS data set of Crohn’s disease.
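
For orientation, the two objects named in the summary can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's own procedure: `higher_criticism` computes the classical Donoho–Jin form of the HC statistic from a set of p-values (for example, marginal SNP p-values within a SNP set), and `dj_detection_boundary` evaluates the detection boundary of the sparse Gaussian mixture model in Donoho and Jin (2004), which may differ from the boundary derived in the paper for its regression setting. The function names and the `alpha0` truncation parameter are illustrative choices.

```python
import math


def higher_criticism(p_values, alpha0=0.5):
    """Classical Donoho-Jin HC statistic (an illustrative sketch).

    HC = max over the smallest alpha0*n p-values of
         sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
    where p_(i) is the i-th smallest p-value.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    sorted_p = sorted(p_values)
    limit = max(1, int(alpha0 * n))  # only scan the smallest p-values
    best = -math.inf
    for i, p in enumerate(sorted_p[:limit], start=1):
        if p <= 0.0 or p >= 1.0:
            continue  # the standardization is undefined at 0 and 1
        z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, z)
    return best


def dj_detection_boundary(beta):
    """Detection boundary rho*(beta) of the Donoho-Jin (2004) mixture model.

    beta in (1/2, 1) is the sparsity exponent: a fraction n**(-beta) of the
    n features carries signal of strength sqrt(2 * r * log(n)). Detection is
    asymptotically possible when r > rho*(beta).
    """
    if not 0.5 < beta < 1.0:
        raise ValueError("beta must lie in (1/2, 1)")
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2


if __name__ == "__main__":
    # Roughly uniform p-values (no signal) versus the same set with a few
    # very small p-values mixed in (rare effects).
    null_p = [(i - 0.5) / 100 for i in range(1, 101)]
    signal_p = [1e-6, 1e-5, 1e-4] + null_p[3:]
    print("HC under null-like p-values:", higher_criticism(null_p))
    print("HC with a few rare signals: ", higher_criticism(signal_p))
    print("Boundary at beta = 0.6:     ", dj_detection_boundary(0.6))
```

A large HC value relative to its null distribution indicates that the p-values contain more small values than uniformity would allow; in practice the null distribution would have to be calibrated, for instance by permutation, which this sketch does not do.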

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62G10 Nonparametric hypothesis testing
92D10 Genetics and epigenetics

Citations:

Zbl 1092.62051

References:

[1] Ansorge, W. J. (2009). Next-generation DNA sequencing techniques. N. Biotechnol. 25 195-203.
[2] Arias-Castro, E., Candès, E. J. and Plan, Y. (2011). Global testing under sparse alternatives: ANOVA, multiple comparisons and the Higher Criticism. Ann. Statist. 39 2533-2556. · Zbl 1231.62136 · doi:10.1214/11-AOS910
[3] Arnon, T. I., Xu, Y., Lo, C., Pham, T., An, J., Coughlin, S., Dorn, G. W. and Cyster, J. G. (2011). GRK2-dependent S1PR1 desensitization is required for lymphocytes to overcome their attraction to blood. Science 333 1898.
[4] Ayers, K. L. and Cordell, H. J. (2010). SNP selection in genome-wide and candidate gene studies via penalized logistic regression. Genet. Epidemiol. 34 879-891.
[5] Ballard, D. H., Cho, J. and Zhao, H. (2010). Comparisons of multi-marker association methods to detect association between a candidate region and disease. Genet. Epidemiol. 34 201-212.
[6] Baumgart, D. C. and Sandborn, W. J. (2007). Inflammatory bowel disease: Clinical aspects and established and evolving therapies. Lancet 369 1641-1657.
[7] Baumgart, D. C. and Sandborn, W. J. (2012). Crohn’s disease. Lancet 380 1590-1605.
[8] Benjamini, Y. and Hochberg, Y. (1995). Controlling the False Discovery Rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. 57 289-300. · Zbl 0809.62014
[9] Binns, D., Dimmer, E., Huntley, R., Barrell, D., O’Donovan, C. and Apweiler, R. (2009). QuickGO: A web-based tool for Gene Ontology searching. Bioinformatics 25 3045-3046.
[10] Brandtzaeg, P. and Pabst, R. (2004). Let’s go mucosal: Communication on slippery ground. Trends Immunol. 25 570-577.
[11] By, K. and Qaqish, B. (2011). mvtBinaryEP: Generates correlated binary data (R package).
[12] Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist. 32 962-994. · Zbl 1092.62051 · doi:10.1214/009053604000000265
[13] Donoho, D. and Jin, J. (2008). Higher criticism thresholding: Optimal feature selection when useful features are rare and weak. Proc. Natl. Acad. Sci. USA 105 14790-14795. · Zbl 1357.62212
[14] Duerr, R. H., Taylor, K. D., Brant, S. R., Rioux, J. D., Silverberg, M. S., Daly, M. J., Steinhart, A. H., Abraham, C., Regueiro, M., Griffiths, A. et al. (2006). A genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science 314 1461.
[15] Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 99 96-104. · Zbl 1089.62502 · doi:10.1198/016214504000000089
[16] Efron, B. (2007a). Correlation and large-scale simultaneous significance testing. J. Amer. Statist. Assoc. 102 93-103. · Zbl 1284.62340 · doi:10.1198/016214506000001211
[17] Efron, B. (2007b). Size, power and false discovery rates. Ann. Statist. 35 1351-1377. · Zbl 1123.62008 · doi:10.1214/009053606000001460
[18] Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96 1151-1160. · Zbl 1073.62511 · doi:10.1198/016214501753382129
[19] Emrich, L. J. and Piedmonte, M. R. (1991). A method for generating high-dimensional multivariate binary variates. Amer. Statist. 45 302-304.
[20] Falconer, D. S., Mackay, T. F. C. and Frankham, R. (1996). Introduction to quantitative genetics (4th edition). Trends in Genetics 12 280.
[21] Franke, A., McGovern, D. P. B., Barrett, J. C., Wang, K., Radford-Smith, G. L., Ahmad, T., Lees, C. W., Balschun, T., Lee, J., Roberts, R. et al. (2010). Genome-wide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nature Genetics 42 1118-1125.
[22] Genovese, C., Jin, J. and Wasserman, L. (2009). Revisiting marginal regression. Preprint. Available at arXiv:0911.4080v1.
[23] Goldstein, D. B. (2009). Common genetic variation and human traits. N. Engl. J. Med. 360 1696-1698.
[24] Guan, Y. and Stephens, M. (2011). Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann. Appl. Stat. 5 1780-1815. · Zbl 1229.62145 · doi:10.1214/11-AOAS455
[25] Hall, P. and Jin, J. (2008). Properties of higher criticism under strong dependence. Ann. Statist. 36 381-402. · Zbl 1139.62049 · doi:10.1214/009053607000000767
[26] Hall, P. and Jin, J. (2010). Innovated higher criticism for detecting sparse signals in correlated noise. Ann. Statist. 38 1686-1732. · Zbl 1189.62080 · doi:10.1214/09-AOS764
[27] Hall, P., Jin, J. and Miller, H. (2009). Feature selection when there are many influential features. Preprint. Available at arXiv:0911.4076. · Zbl 1398.62162 · doi:10.3150/13-BEJ536
[28] He, S. and Wu, Z. (2011). Gene-based Higher Criticism methods for large-scale exonic single-nucleotide polymorphism data. BMC Proceedings 5 S65.
[29] Hoggart, C. J., Whittaker, J. C., Iorio, M. D. and Balding, D. J. (2008). Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genetics 4 e1000130.
[30] Hoh, J. and Ott, J. (2003). Mathematical multi-locus approaches to localizing complex human trait genes. Nat. Rev. Genet. 4 701-709.
[31] Hoh, J., Wille, A. and Ott, J. (2001). Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res. 11 2115-2119.
[32] Ingster, Y. I. (2002). Adaptive detection of a signal of growing dimension. II. Math. Methods Statist. 11 37-68. · Zbl 1005.62052
[33] Ingster, Y. I., Tsybakov, A. B. and Verzelen, N. (2010). Detection boundary in sparse regression. Electron. J. Stat. 4 1476-1526. · Zbl 1329.62314 · doi:10.1214/10-EJS589
[34] Jin, J. and Wang, L. (2013). Spectral clustering by Higher Criticism Thresholding. Unpublished manuscript.
[35] Kraft, P. and Hunter, D. J. (2009). Genetic risk prediction-Are we there yet? New England Journal of Medicine 360 1701.
[36] Li, M., Wang, K., Grant, S. F. A., Hakonarson, H. and Li, C. (2009). ATOM: A powerful gene-based association test by combining optimally weighted markers. Bioinformatics 25 497-503.
[37] Liu, D., Lin, X. and Ghosh, D. (2007). Semiparametric regression of multidimensional genetic pathway data: Least-squares kernel machines and linear mixed models. Biometrics 63 1079-1088, 1311. · Zbl 1274.62825 · doi:10.1111/j.1541-0420.2007.00799.x
[38] Loftus, E. V., Schoenfeld, P. and Sandborn, W. J. (2002). The epidemiology and natural history of Crohn’s disease in population-based patient cohorts from North America: A systematic review. Alimentary Pharmacology & Therapeutics 16 51-60.
[39] Luo, L., Peng, G., Zhu, Y., Dong, H., Amos, C. I. and Xiong, M. (2010). Genome-wide gene and pathway analysis. Eur. J. Hum. Genet. 18 1045-1053.
[40] Mardis, E. R. (2008). Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum. Genet. 9 387-402.
[41] McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis, J. P. A. and Hirschhorn, J. N. (2008). Genome-wide association studies for complex traits: Consensus, uncertainty and challenges. Nat. Rev. Genet. 9 356-369.
[42] Mendel, G. (1866). Versuche über Pflanzen-Hybriden. Verhandlungen des naturforschenden Vereines in Brünn, Bd. IV für das Jahr 1865, Abhandlungen, 3-47. Genetic Theory 295 3-47.
[43] Metzker, M. L. (2010). Sequencing technologies-The next generation. Nat. Rev. Genet. 11 31-46.
[44] Mukhopadhyay, I., Feingold, E., Weeks, D. E. and Thalamuthu, A. (2010). Association tests using kernel-based measures of multi-locus genotype similarity between individuals. Genet. Epidemiol. 34 213-221.
[45] Pearson, K. (1904). Mathematical contributions to the theory of evolution. XII. On a generalised theory of alternative inheritance, with special reference to Mendel’s laws. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 203 53-86. · JFM 35.0242.02
[46] Peng, G., Luo, L., Siu, H., Zhu, Y., Hu, P., Hong, S., Zhao, J., Zhou, X., Reveille, J. D. and Jin, L. (2009). Gene and pathway-based second-wave analysis of genome-wide association studies. European Journal of Human Genetics 18 111-117.
[47] The UniProt Consortium (2012). Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic Acids Res. 40 D71-D75.
[48] Tukey, J. W. (1976). The higher criticism. Course Notes, Statistics 411, Princeton Univ.
[49] Wade, N. (2009). Genes show limited value in predicting diseases. New York Times April 16.
[50] Wallukat, G., Homuth, V., Fischer, T., Lindschau, C., Horstkamp, B., Jüpner, A., Baur, E., Nissen, E., Vetter, K., Neichel, D. et al. (1999). Patients with preeclampsia develop agonistic autoantibodies against the angiotensin AT\(_{1}\) receptor. Journal of Clinical Investigation 103 945-952.
[51] Wang, K. and Abbott, D. (2008). A principal components regression approach to multilocus genetic association studies. Genet. Epidemiol. 32 108-118.
[52] Wang, K., Li, M. and Bucan, M. (2007). Pathway-based approaches for analysis of genomewide association studies. Am. J. Hum. Genet. 81 1278-1283.
[53] Wellner, J. A. (1978). Limit theorems for the ratio of the empirical distribution function to the true distribution function. Z. Wahrsch. Verw. Gebiete 45 73-88. · Zbl 0382.60031 · doi:10.1007/BF00635964
[54] Wu, Z. and Zhao, H. (2009). Statistical power of model selection strategies for genome-wide association studies. PLoS Genet. 5 e1000582.
[55] Wu, Z. and Zhao, H. (2012). On model selection strategies to identify genes underlying binary traits using genome-wide association data. Statist. Sinica 22 1041-1074. · Zbl 1257.62116
[56] Wu, T. T., Chen, Y. F., Hastie, T., Sobel, E. and Lange, K. (2009). Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics 25 714-721.
[57] Wu, M. C., Kraft, P., Epstein, M. P., Taylor, D. M., Chanock, S. J., Hunter, D. J. and Lin, X. (2010). Powerful SNP-set analysis for case-control genome-wide association studies. Am. J. Hum. Genet. 86 929-942.
[58] Wu, Z., Sun, Y., He, S., Cho, J. H., Zhao, H. and Jin, J. (2014). Supplement to “Detection boundary and Higher Criticism approach for rare and weak genetic effects.” · Zbl 1454.62420
[59] Xie, J., Cai, T. T. and Li, H. (2011). Sample size and power analysis for sparse signal recovery in genome-wide association studies. Biometrika 98 273-290. · Zbl 1215.62118 · doi:10.1093/biomet/asr003
[60] Yang, H. C., Hsieh, H. Y. and Fann, C. S. J. (2008). Kernel-based association test. Genetics 179 1057-1068.
[61] Yu, K., Li, Q., Bergen, A. W., Pfeiffer, R. M., Rosenberg, P. S., Caporaso, N., Kraft, P. and Chatterjee, N. (2009). Pathway analysis by adaptive combination of \(P\)-values. Genet. Epidemiol. 33 700-709.
[62] Yule, G. U. (1902). Mendel’s laws and their probable relations to intra-racial heredity. The New Phytologist 1 193-207.
[63] Zhang, D. and Lin, X. (2003). Hypothesis testing in semiparametric additive mixed models. Biostatistics 4 57-74. · Zbl 1139.62310 · doi:10.1093/biostatistics/4.1.57
[64] Zuo, Y., Zou, G. and Zhao, H. (2006). Two-stage designs in case-control association analysis. Genetics 173 1747-1760.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.