Variable prioritization in nonlinear black box methods: a genetic association case study. (English) Zbl 1423.62062

Summary: The central aim in this paper is to address variable selection questions in nonlinear and nonparametric regression. Motivated by statistical genetics, where nonlinear interactions are of particular interest, we introduce a novel and interpretable way to summarize the relative importance of predictor variables. Methodologically, we develop the “RelATive cEntrality” (RATE) measure to prioritize candidate genetic variants that are not just marginally important, but whose associations also stem from significant covarying relationships with other variants in the data. We illustrate RATE through Bayesian Gaussian process regression, but the methodological innovations apply to other “black box” methods. It is known that nonlinear models often exhibit greater predictive accuracy than linear models, particularly for phenotypes generated by complex genetic architectures. With detailed simulations and two real data association mapping studies, we show that applying RATE enables an explanation for this improved performance.

MSC:

62J02 General nonlinear regression
62P10 Applications of statistics to biology and medical sciences; meta analysis
60G15 Gaussian processes
92D10 Genetics and epigenetics
62H20 Measures of association (correlation, canonical correlation, etc.)
62G08 Nonparametric regression and quantile regression

References:

[1] Alaa, A. M. and van der Schaar, M. (2017). Bayesian nonparametric causal inference: Information rates and learning algorithms. Available at arXiv:1712.08914.
[2] Ankra-Badu, G. A., Pomp, D., Shriner, D., Allison, D. B. and Yi, N. (2009). Genetic influences on growth and body composition in mice: Multilocus interactions. Int. J. Obes. 33 89-95. DOI:10.1038/ijo.2008.215.
[3] Barbieri, M. M. and Berger, J. O. (2004). Optimal predictive model selection. Ann. Statist. 32 870-897. · Zbl 1092.62033
[4] Brockmann, G. A., Haley, C. S., Renne, U., Knott, S. A. and Schwerin, M. (1998). Quantitative trait loci affecting body weight and fatness from a mouse line selected for extreme high growth. Genetics 150 369-381.
[5] Bross, C. D., Howes, T. R., Abolhassani Rad, S., Kljakic, O. and Kohalmi, S. E. (2017). Subcellular localization of Arabidopsis arogenate dehydratases suggests novel and non-enzymatic roles. J. Exp. Bot. 68 1425-1440.
[6] Carvalho, C. M., Polson, N. G. and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika 97 465-480. · Zbl 1406.62021
[7] Carvalho, C. M. and West, M. (2007). Dynamic matrix-variate graphical models. Bayesian Anal. 2 69-97. · Zbl 1331.62040
[8] Chaudhuri, A., Kakde, D., Sadek, C., Gonzalez, L. and Kong, S. (2017). The mean and median criterion for automatic kernel bandwidth selection for support vector data description. Available at arXiv:1708.05106.
[9] Chen, X., McClusky, R., Chen, J., Beaven, S. W., Tontonoz, P., Arnold, A. P. and Reue, K. (2012). The number of X chromosomes causes sex differences in adiposity in mice. PLoS Genet. 8 e1002709.
[10] Chen, X., McClusky, R., Itoh, Y., Reue, K. and Arnold, A. P. (2013). X and Y chromosome complement influence adiposity and metabolism in mice. Endocrinology 154 1092-1104. DOI:10.1210/en.2012-2098.
[11] Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. Ann. Appl. Stat. 4 266-298. · Zbl 1189.62066
[12] Cotter, A., Keshet, J. and Srebro, N. (2011). Explicit approximations of the Gaussian kernel. Available at arXiv:1109.4603.
[13] Cox, K. H., Bonthuis, P. J. and Rissman, E. F. (2014). Mouse model systems to study sex chromosome genes and behavior: Relevance to humans. Front. Neuroendocrinol. 35 405-419. DOI:10.1016/j.yfrne.2013.12.004.
[14] Crawford, L. and Zhou, X. (2018). Genome-wide marginal epistatic association mapping in case-control studies. BioRxiv 374983.
[15] Crawford, L., Zeng, P., Mukherjee, S. and Zhou, X. (2017). Detecting epistasis with the marginal epistasis test in genetic mapping studies of quantitative traits. PLoS Genet. 13 e1006869.
[16] Crawford, L., Wood, K. C., Zhou, X. and Mukherjee, S. (2018). Bayesian approximate kernel regression with variable selection. J. Amer. Statist. Assoc. 113 1710-1721. · Zbl 1409.62132
[17] Crawford, L., Flaxman, S. R., Runcie, D. E. and West, M. (2019). Supplement to “Variable prioritization in nonlinear black box methods: A genetic association case study.” DOI:10.1214/18-AOAS1222SUPPA, DOI:10.1214/18-AOAS1222SUPPB, DOI:10.1214/18-AOAS1222SUPPC, DOI:10.1214/18-AOAS1222SUPPD. · Zbl 1423.62062
[18] Cuevas, J., Crossa, J., Montesinos-López, O. A., Burgueño, J., Pérez-Rodríguez, P. and de Los Campos, G. (2017). Bayesian genomic prediction with genotype \(\times\) environment interaction kernel models. G3 (Bethesda) 7 41-53.
[19] Demetrashvili, N., van den Heuvel, E. R. and Wit, E. C. (2013). Probability genotype imputation method and integrated weighted lasso for QTL identification. BMC Genet. 14 125.
[20] de los Campos, G., Naya, H., Gianola, D., Crossa, J., Legarra, A., Manfredi, E., Weigel, K. and Cotes, J. (2009). Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics 182 375-385.
[21] de los Campos, G., Gianola, D., Rosa, G. J. M., Weigel, K. A. and Crossa, J. (2010). Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genet. Res. 92 295-308.
[22] Diament, A. L. and Warden, C. H. (2003). Multiple linked mouse chromosome 7 loci influence body fat mass. Int. J. Obes. 28 199.
[23] Drineas, P. and Mahoney, M. W. (2005). On the Nyström method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res. 6 2153-2175. · Zbl 1222.68186
[24] Fasshauer, G. and McCourt, M. (2016). Kernel-Based Approximation Methods Using MATLAB. World Scientific, Hackensack, NJ. · Zbl 1318.00001
[25] Gelman, A., Hwang, J. and Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Stat. Comput. 24 997-1016. · Zbl 1332.62090
[26] Goutis, C. and Robert, C. P. (1998). Model choice in generalised linear models: A Bayesian approach via Kullback-Leibler projections. Biometrika 85 29-37. · Zbl 0903.62061
[27] Gruber, L. and West, M. (2016). GPU-accelerated Bayesian learning and forecasting in simultaneous graphical dynamic linear models. Bayesian Anal. 11 125-149. · Zbl 1359.62367
[28] Gruber, L. F. and West, M. (2017). Bayesian online variable selection and scalable multivariate volatility forecasting in simultaneous graphical dynamic linear models. Econ. Stat. 3 3-22.
[29] Guan, Y. and Stephens, M. (2011). Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann. Appl. Stat. 5 1780-1815. · Zbl 1229.62145
[30] Hemani, G., Knott, S. and Haley, C. (2013). An evolutionary perspective on epistasis and the missing heritability. PLoS Genet. 9 e1003295.
[31] Hemani, G., Shakhbazov, K., Westra, H.-J., Esko, T., Henders, A. K., McRae, A. F., Yang, J., Gibson, G., Martin, N. G., Metspalu, A., Franke, L., Montgomery, G. W., Visscher, P. M. and Powell, J. E. (2014). Detection and replication of epistasis influencing transcription in humans. Nature 508 249-253.
[32] Hill, W. G., Goddard, M. E. and Visscher, P. M. (2008). Data and theory point to mainly additive genetic variance for complex traits. PLoS Genet. 4 e1000008.
[33] Horn, T., Sandmann, T., Fischer, B., Axelsson, E., Huber, W. and Boutros, M. (2011). Mapping of signaling networks through synthetic genetic interaction analysis by RNAi. Nat. Methods 8 341-346.
[34] Hou, Q. and Bartels, D. (2015). Comparative study of the aldehyde dehydrogenase (ALDH) gene superfamily in the glycophyte Arabidopsis thaliana and Eutrema halophytes. Ann. Bot. 115 465-479.
[35] Howard, R., Carriquiry, A. L. and Beavis, W. D. (2014). Parametric and nonparametric statistical methods for genomic selection of traits with additive and epistatic genetic architectures. G3 (Bethesda) 4 1027-1046.
[36] Jiang, Y. and Reif, J. C. (2015). Modeling epistasis in genomic selection. Genetics 201 759-768.
[37] Kang, H. M., Sul, J. H., Service, S. K., Zaitlen, N. A., Kong, S.-y., Freimer, N. B., Sabatti, C. and Eskin, E. (2010). Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42 348-354.
[38] Kim, S. V., Mehal, W. Z., Dong, X., Heinrich, V., Pypaert, M., Mellman, I., Dembo, M., Mooseker, M. S., Wu, D. and Flavell, R. A. (2006). Modulation of cell adhesion and motility in the immune system by Myo1f. Science 314 136-139.
[39] Kirch, H.-H., Bartels, D., Wei, Y., Schnable, P. S. and Wood, A. J. (2004). The ALDH gene superfamily of Arabidopsis. Trends Plant Sci. 9 371-377.
[40] Kleyn, P. W., Fan, W., Kovats, S. G., Lee, J. J., Pulido, J. C., Wu, Y., Berkemeier, L. R., Misumi, D. J., Holmgren, L. et al. (1996). Identification and characterization of the mouse obesity gene tubby: A member of a novel gene family. Cell 85 281-290.
[41] Kolmogorov, A. N. and Rozanov, Ju. A. (1960). On a strong mixing condition for stationary Gaussian processes. Theory Probab. Appl. 5 222-227. · Zbl 0091.30001
[42] Liang, F., Paulo, R., Molina, G., Clyde, M. A. and Berger, J. O. (2008). Mixtures of \(g\) priors for Bayesian variable selection. J. Amer. Statist. Assoc. 103 410-423. · Zbl 1335.62026
[43] Lim, C. and Yu, B. (2016). Estimation stability with cross-validation (ESCV). J. Comput. Graph. Statist. 25 464-492.
[44] Lin, L., Chan, C. and West, M. (2016). Discriminative variable subsets in Bayesian classification with mixture models, with application in flow cytometry studies. Biostatistics 17 40-53.
[45] Lippert, C., Listgarten, J., Liu, Y., Kadie, C. M., Davidson, R. I. and Heckerman, D. (2011). FaST linear mixed models for genome-wide association studies. Nat. Methods 8 833-835.
[46] Loudet, O., Chaillou, S., Camilleri, C., Bouchez, D. and Daniel-Vedele, F. (2002). Bay-0 \(\times\) Shahdara recombinant inbred line population: A powerful tool for the genetic dissection of complex traits in Arabidopsis. Theor. Appl. Genet. 104 1173-1184.
[47] Mackay, T. F. C. (2014). Epistasis and quantitative traits: Using model organisms to study gene-gene interactions. Nat. Rev. Genet. 15 22-33.
[48] Mathai, A. M. and Provost, S. B. (1992). Quadratic Forms in Random Variables. Theory and Applications. Statistics: Textbooks and Monographs 126. Dekker, New York. · Zbl 0792.62045
[49] Mercer, J. (1909). Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. R. Soc. Lond. Ser. A 209 415-446. · JFM 40.0408.02
[50] Paigen, B., Mitchell, D., Reue, K., Morrow, A., Lusis, A. J. and LeBoeuf, R. C. (1987). Ath-1, a gene determining atherosclerosis susceptibility and high density lipoprotein levels in mice. Proc. Natl. Acad. Sci. USA 84 3763-3767.
[51] Phillips, P. C. (2008). Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nat. Rev. Genet. 9 855-867. DOI:10.1038/nrg2452.
[52] Piironen, J. and Vehtari, A. (2016). Projection predictive model selection for Gaussian processes. In IEEE International Workshop on Machine Learning for Signal Processing 1-6. IEEE, New York. · Zbl 1505.62321
[53] Piironen, J. and Vehtari, A. (2017). Comparison of Bayesian predictive methods for model selection. Stat. Comput. 27 711-735. · Zbl 1505.62321
[54] Pillai, N. S., Wu, Q., Liang, F., Mukherjee, S. and Wolpert, R. L. (2007). Characterizing the function space for Bayesian kernel models. J. Mach. Learn. Res. 8 1769-1797. · Zbl 1222.62039
[55] Prabhu, S. and Pe’er, I. (2012). Ultrafast genome-wide scan for SNP-SNP interactions in common complex disease. Genome Res. 22 2230-2240.
[56] Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A. R., Bender, D., Maller, J., Sklar, P., de Bakker, P. I. W., Daly, M. J. and Sham, P. C. (2007). PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81 559-575. DOI:10.1086/519795.
[57] Rahimi, A. and Recht, B. (2007). Random features for large-scale kernel machines. Adv. Neural Inf. Process. Syst. 3 5.
[58] Rance, K. A., Hill, W. G. and Keightley, P. D. (1997). Mapping quantitative trait loci for body weight on the X chromosome in mice. I. Analysis of a reciprocal F2 population. Genet. Res. 70 117-124.
[59] Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA. · Zbl 1177.68165
[60] Richard, M. D. and Lippmann, R. P. (1991). Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comput. 3 461-483.
[61] Schölkopf, B., Herbrich, R. and Smola, A. J. (2001). A generalized representer theorem. In Computational Learning Theory (Amsterdam, 2001). Lecture Notes in Computer Science 2111 416-426. Springer, Berlin. · Zbl 0992.68088
[62] Shi, J. Q., Wang, B., Will, E. J. and West, R. M. (2012). Mixed-effects Gaussian process functional regression models with application to dose-response curve prediction. Stat. Med. 31 3165-3177.
[63] Smith, A., Naik, P. A. and Tsai, C.-L. (2006). Markov-switching model selection using Kullback-Leibler divergence. J. Econometrics 134 553-577. · Zbl 1418.62537
[64] Stephens, M. and Balding, D. J. (2009). Bayesian statistical methods for genetic association studies. Nat. Rev. Genet. 10 681-690.
[65] Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J. et al. (2015). UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12 e1001779.
[66] Tan, S., Caruana, R., Hooker, G. and Lou, Y. (2017). Detecting bias in black-box models using transparent model distillation. Available at arXiv:1710.06169.
[67] The 1000 Genomes Project Consortium (2010). A map of human genome variation from population-scale sequencing. Nature 467 1061-1073.
[68] The Wellcome Trust Case Control Consortium (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447 661-678.
[69] Valdar, W., Solberg, L. C., Gauguier, D., Burnett, S., Klenerman, P., Cookson, W. O., Taylor, M. S., Rawlins, J. N. P., Mott, R. and Flint, J. (2006). Genome-wide genetic association of complex traits in heterogeneous stock mice. Nat. Genet. 38 879-887.
[70] Wahba, G. (1990). Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics 59. SIAM, Philadelphia, PA. · Zbl 0813.62001
[71] Waldmann, P., Mészáros, G., Gredler, B., Fürst, C. and Sölkner, J. (2013). Evaluation of the lasso and the elastic net in genome-wide association studies. Front. Genet. 4 270.
[72] Wan, X., Yang, C., Yang, Q., Xue, H., Fan, X., Tang, N. L. and Yu, W. (2010). BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies. Am. J. Hum. Genet. 87 325-340.
[73] Wang, X., Elston, R. C. and Zhu, X. (2011a). Statistical interaction in human genetics: How should we model it if we are looking for biological interaction? Nat. Rev. Genet. 12 74.
[74] Wang, X., Elston, R. C. and Zhu, X. (2011b). The meaning of interaction. Hum. Hered. 70 269-277.
[75] Weissbrod, O., Geiger, D. and Rosset, S. (2016). Multikernel linear mixed models for complex phenotype prediction. Genome Res. 26 969-979.
[76] Wentzell, A. M., Rowe, H. C., Hansen, B. G., Ticconi, C., Halkier, B. A. and Kliebenstein, D. J. (2007). Linking metabolic QTLs with network and cis-eQTLs controlling biosynthetic pathways. PLoS Genet. 3 e162.
[77] Woo, J. H., Shimoni, Y., Yang, W. S., Subramaniam, P., Iyer, A., Nicoletti, P., Rodríguez Martínez, M., López, G., Mattioli, M. et al. (2015). Elucidating compound mechanism of action by network perturbation analysis. Cell 162 441-451.
[78] Wood, A. R., Tuke, M. A., Nalls, M. A., Hernandez, D. G., Bandinelli, S., Singleton, A. B., Melzer, D., Ferrucci, L., Frayling, T. M. and Weedon, M. N. (2014). Another explanation for apparent epistasis. Nature 514 E3-E5.
[79] Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M. and Lin, X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89 82-93.
[80] Wu, J., Zhao, Q., Yang, Q., Liu, H., Li, Q., Yi, X., Cheng, Y., Guo, L., Fan, C. and Zhou, Y. (2016). Comparative transcriptomic analysis uncovers the complex genetic network for resistance to Sclerotinia sclerotiorum in Brassica napus. Sci. Rep. 6 19007.
[81] Yalcin, B., Nicod, J., Bhomra, A., Davidson, S., Cleak, J., Farinelli, L., Østerås, M., Whitley, A., Yuan, W. et al. (2010). Commercially available outbred mice for genome-wide association studies. PLoS Genet. 6 e1001085.
[82] Yandell, B. S., Mehta, T., Banerjee, S., Shriner, D., Venkataraman, R., Moon, J. Y., Neely, W. W., Wu, H., von Smith, R. and Yi, N. (2007). R/qtlbim: QTL with Bayesian interval mapping in experimental crosses. Bioinformatics 23 641-643. DOI:10.1093/bioinformatics/btm011.
[83] Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. and Price, A. L. (2014). Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 46 100-106.
[84] Zeng, P. and Zhou, X. (2017). Non-parametric genetic prediction of complex traits with latent Dirichlet process regression models. Nat. Commun. 8 456.
[85] Zhang, Z., Dai, G. and Jordan, M. I. (2011). Bayesian generalized kernel mixed models. J. Mach. Learn. Res. 12 111-139. · Zbl 1280.68221
[86] Zhang, Y. and Liu, J. S. (2007). Bayesian inference of epistatic interactions in case-control studies. Nat. Genet. 39 1167-1173.
[87] Zhang, X., Huang, S., Zou, F. and Wang, W. (2010). TEAM: Efficient two-locus epistasis tests in human genome-wide association study. Bioinformatics 26 i217-i227. DOI:10.1093/bioinformatics/btq186.
[88] Zhou, X. (2017). A unified framework for variance component estimation with summary statistics in genome-wide association studies. Ann. Appl. Stat. 11 2027-2051. · Zbl 1383.62305
[89] Zhou, X. and Stephens, M. (2012). Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44 821-825.
[90] Zhou, X. and Stephens, M. (2014). Efficient multivariate linear mixed model algorithms for genomewide association studies. Nat. Methods 11 407-409.

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.