A nonparametric spatial test to identify factors that shape a microbiome. (English) Zbl 1435.62400

Summary: The advent of high-throughput sequencing technologies has made data from DNA material readily available, leading to a surge of microbiome-related research establishing links between markers of microbiome health and specific outcomes. However, to harness the power of microbial communities we must understand not only how they affect us, but also how they can be influenced to improve outcomes. This area has been dominated by methods that reduce community composition to summary metrics, which can fail to fully exploit the complexity of community data. Recently, methods have been developed to model the abundance of taxa in a community, but they can be computationally intensive and do not account for spatial effects underlying microbial settlement. These spatial effects are particularly relevant in the microbiome setting because we expect communities that are close together to be more similar than those that are far apart. In this paper, we propose a flexible Bayesian spike-and-slab variable selection model for presence-absence indicators that accounts for spatial dependence and cross-dependence between taxa while reducing dimensionality in both directions. We show by simulation that in the presence of spatial dependence, popular distance-based hypothesis testing methods fail to preserve their advertised size, and the proposed method improves variable selection. Finally, we present an application of our method to an indoor fungal community found within homes across the contiguous United States.
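
To illustrate the spike-and-slab idea named in the summary, the following is a minimal sketch and not the authors' model. It simulates presence-absence indicators for a single taxon, where each covariate is either excluded (coefficient exactly zero, the "spike") or included with a Gaussian coefficient (the "slab"). The probit link, the function name `simulate_presence_absence`, and all parameter values are assumptions made for illustration. The sketch omits the spatial dependence, the cross-dependence between taxa, and the dimension reduction that the paper proposes.

```python
import math
import random


def simulate_presence_absence(n_sites=50, n_covariates=6, inclusion_prob=0.3,
                              slab_sd=2.0, seed=0):
    """Draw one data set from a toy spike-and-slab probit model (illustrative only)."""
    rng = random.Random(seed)

    # Spike-and-slab prior: gamma_j = 1 keeps covariate j (slab),
    # gamma_j = 0 sets its coefficient to zero (spike).
    gamma = [1 if rng.random() < inclusion_prob else 0 for _ in range(n_covariates)]
    beta = [g * rng.gauss(0.0, slab_sd) for g in gamma]

    # Independent standard-normal covariates for each site (assumed, not from the paper).
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_covariates)] for _ in range(n_sites)]

    # Probit link (an assumption): P(present) = Phi(x' beta).
    def phi(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    y = []
    for row in X:
        eta = sum(x * b for x, b in zip(row, beta))
        y.append(1 if rng.random() < phi(eta) else 0)
    return gamma, beta, X, y


gamma, beta, X, y = simulate_presence_absence()
# Every excluded covariate has a coefficient of exactly zero.
excluded_are_zero = all(b == 0.0 for g, b in zip(gamma, beta) if g == 0)
```

In a full analysis the inclusion indicators `gamma` would be inferred from the data rather than simulated, and their posterior inclusion probabilities would drive variable selection.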


62P10 Applications of statistics to biology and medical sciences; meta analysis
62G10 Nonparametric hypothesis testing
62H25 Factor analysis and principal components; correspondence analysis


ElemStatLearn; MIMIX

