
Bayesian variable selection regression for genome-wide association studies and other large-scale problems. (English) Zbl 1229.62145
Summary: We consider applying Bayesian Variable Selection Regression, or BVSR, to genome-wide association studies and similar large-scale regression problems. Currently, typical genome-wide association studies measure hundreds of thousands, or millions, of genetic variants (SNPs), in thousands or tens of thousands of individuals, and attempt to identify regions harboring SNPs that affect some phenotype or outcome of interest. This goal can naturally be cast as a variable selection regression problem, with the SNPs as the covariates in the regression. Characteristic features of genome-wide association studies include the following: (i) a focus primarily on identifying relevant variables, rather than on prediction; and (ii) many relevant covariates may have tiny effects, making it effectively impossible to confidently identify the complete “correct” subset of variables. Taken together, these factors put a premium on having interpretable measures of confidence for individual covariates being included in the model, which we argue is a strength of BVSR compared with alternatives such as penalized regression methods. We focus primarily on analysis of quantitative phenotypes, and on appropriate prior specification for BVSR in this setting, emphasizing the idea of considering what the priors imply about the total proportion of variance in outcome explained by relevant covariates. We also emphasize the potential for BVSR to estimate this proportion of variance explained, and hence shed light on the issue of “missing heritability” in genome-wide association studies. More generally, we demonstrate that, despite the apparent computational challenges, BVSR can provide useful inferences in these large-scale problems, and in our simulations produces better power and predictive performance compared with standard single-SNP analyses and the penalized regression method LASSO. 
Methods described here are implemented in a software package, pi-MASS, available from the Guan Lab website http://bcm.edu/cnrc/mcmcmc/pimass.
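
The summary's emphasis on what a prior implies about the total proportion of variance explained (PVE) can be illustrated with a small Monte Carlo sketch. This is not the pi-MASS implementation, nor necessarily the exact prior used in the paper. It assumes, purely for illustration, that covariates are independent and standardized to unit variance, that the residual variance is 1, that each covariate is relevant independently with probability `pi`, and that relevant effects are drawn from N(0, sigma_a^2). The function name and its parameters are hypothetical.

```python
import random


def simulate_prior_pve(p, pi, sigma_a, n_draws=500, seed=0):
    """Draw PVE values implied by a sparse prior on regression effects.

    Illustrative assumptions (not taken from the paper): independent
    covariates with unit variance, residual variance equal to 1.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        signal = 0.0  # variance explained by the relevant covariates
        for _ in range(p):
            # Each covariate is relevant with prior probability pi.
            if rng.random() < pi:
                beta = rng.gauss(0.0, sigma_a)
                # With unit-variance, independent covariates, each relevant
                # covariate contributes beta^2 to the explained variance.
                signal += beta * beta
        # PVE = explained variance / (explained variance + residual variance).
        draws.append(signal / (signal + 1.0))
    return draws


if __name__ == "__main__":
    # Compare two effect-size scales at the same sparsity level.
    for sigma_a in (0.1, 1.0):
        pve = sorted(simulate_prior_pve(p=200, pi=0.05, sigma_a=sigma_a))
        mean = sum(pve) / len(pve)
        median = pve[len(pve) // 2]
        print(f"sigma_a={sigma_a}: mean PVE={mean:.3f}, median PVE={median:.3f}")
```

Under these assumptions, the implied PVE depends jointly on the sparsity `pi`, the number of covariates `p`, and the effect-size scale `sigma_a`, so a prior on effect sizes that looks harmless in isolation can imply that the covariates explain almost all, or almost none, of the variance in the outcome.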

MSC:
62P10 Applications of statistics to biology and medical sciences; meta analysis
62F15 Bayesian inference
92D10 Genetics and epigenetics
65C60 Computational problems in statistics (MSC2010)
62J05 Linear regression; mixed models
Software: Eigenstrat; pi-MASS