Sparse partitioning: nonlinear regression with binary or tertiary predictors, with application to association studies. (English) Zbl 1232.62049

Summary: This paper presents sparse partitioning, a Bayesian method for identifying predictors that either individually or in combination with others affect a response variable. The method is designed for regression problems involving binary or tertiary predictors and allows the number of predictors to exceed the size of the sample, two properties which make it well suited for association studies.
Sparse partitioning differs from other regression methods by placing no restrictions on how the predictors may influence the response. To compensate for this generality, sparse partitioning implements a novel way of exploring the model space. It searches for high posterior probability partitions of the predictor set, where each partition defines groups of predictors that jointly influence the response.
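The idea of grouping predictors that jointly influence the response can be illustrated with a small sketch. It is not the authors' algorithm: it computes neither a prior nor a posterior over partitions, and it only compares two hand-picked partitions by residual sum of squares. The function names (`group_cells`, `fit_group`, `partition_rss`), the toy data, and the convention that predictors left out of a partition have no effect are all assumptions made for illustration.

```python
from itertools import product

def group_cells(X, group):
    """Map each sample to the joint configuration of the predictors in `group`."""
    return [tuple(row[j] for j in group) for row in X]

def fit_group(X, y, group):
    """Unrestricted effect of one group: the mean response in each joint cell."""
    sums, counts = {}, {}
    for cell, yi in zip(group_cells(X, group), y):
        sums[cell] = sums.get(cell, 0.0) + yi
        counts[cell] = counts.get(cell, 0) + 1
    return {cell: sums[cell] / counts[cell] for cell in sums}

def partition_rss(X, y, partition):
    """Residual sum of squares after fitting each group in turn on the residuals.

    Predictors that appear in no group are treated as having no effect
    (an assumption of this sketch).
    """
    resid = [float(v) for v in y]
    for group in partition:
        means = fit_group(X, resid, group)
        cells = group_cells(X, group)
        resid = [r - means[c] for r, c in zip(resid, cells)]
    return sum(r * r for r in resid)

# Toy data: three binary predictors; the response is the XOR of the first two,
# so neither predictor has an effect on its own. Predictor 2 is pure noise.
X = list(product([0, 1], repeat=3))
y = [row[0] ^ row[1] for row in X]

rss_separate = partition_rss(X, y, [(0,), (1,)])  # predictors 0 and 1 in separate groups
rss_joint = partition_rss(X, y, [(0, 1)])         # predictors 0 and 1 in one group

print(rss_separate, rss_joint)
```

On this toy data, the partition that places predictors 0 and 1 in the same group fits the response exactly, while the partition that separates them explains none of the variation. A search over partitions, as described in the summary, would be expected to favour the former.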
The result is a robust method that requires no prior knowledge of the true predictor-response relationship. Testing on simulated data suggests that sparse partitioning will typically match the performance of an existing method on a data set which obeys the existing method’s model assumptions. When these assumptions are violated, sparse partitioning will generally offer superior performance.


62F15 Bayesian inference
62J02 General nonlinear regression
62F35 Robustness and adaptive procedures (parametric inference)
65C60 Computational problems in statistics (MSC2010)


SSS; BayesDA; LogicReg

