Modeling microbial abundances and dysbiosis with beta-binomial regression. (English) Zbl 1439.62223

Summary: Using a sample from a population to estimate the proportion of the population with a certain category label is a broadly important problem. In the context of microbiome studies, this problem arises when researchers wish to use a sample from a population of microbes to estimate the population proportion of a particular taxon, known as the taxon’s relative abundance. In this paper, we propose a beta-binomial model for this task. Like existing models, our model allows for a taxon’s relative abundance to be associated with covariates of interest. However, unlike existing models, our proposal also allows for the overdispersion in the taxon’s counts to be associated with covariates of interest. We exploit this model in order to propose tests not only for differential relative abundance, but also for differential variability. The latter is particularly valuable in light of speculation that dysbiosis, the perturbation from a normal microbiome that can occur in certain disease conditions, may manifest as a loss of stability, or increase in variability, of the counts associated with each taxon. We demonstrate the performance of our proposed model using a simulation study and an application to soil microbial data.
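The summary does not spell out the parameterization, so the following is only a minimal sketch of how such a model could be written down. It assumes a mean/overdispersion form of the beta-binomial (mean mu, overdispersion phi, with Beta shape parameters a = mu(1 - phi)/phi and b = (1 - mu)(1 - phi)/phi), logit links for both parameters, and a single covariate. All function names, coefficient names, and data values are hypothetical illustrations, not the authors' code.

```python
import math


def inv_logit(x):
    """Map a real-valued linear predictor into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def betabinom_logpmf(w, m, mu, phi):
    """Log-probability of observing w reads of a taxon out of m total reads.

    mu  : relative abundance (mean of the latent proportion), 0 < mu < 1
    phi : overdispersion, 0 < phi < 1 (assumed parameterization)
    """
    a = mu * (1.0 - phi) / phi
    b = (1.0 - mu) * (1.0 - phi) / phi
    log_choose = math.lgamma(m + 1) - math.lgamma(w + 1) - math.lgamma(m - w + 1)
    log_beta_post = math.lgamma(w + a) + math.lgamma(m - w + b) - math.lgamma(m + a + b)
    log_beta_prior = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_choose + log_beta_post - log_beta_prior


def log_likelihood(counts, depths, x, beta, beta_star):
    """Sum of log-probabilities when both mu and phi depend on covariate x.

    beta      : (intercept, slope) for logit(mu)
    beta_star : (intercept, slope) for logit(phi)
    """
    total = 0.0
    for w, m, xi in zip(counts, depths, x):
        mu = inv_logit(beta[0] + beta[1] * xi)
        phi = inv_logit(beta_star[0] + beta_star[1] * xi)
        total += betabinom_logpmf(w, m, mu, phi)
    return total


# Hypothetical data: taxon counts, sequencing depths, and a binary covariate.
counts = [12, 30, 7, 55, 2, 90]
depths = [1000, 1200, 900, 1100, 950, 1300]
group = [0, 0, 0, 1, 1, 1]

# Compare a fit where overdispersion is shared with one where it varies by group.
ll_shared = log_likelihood(counts, depths, group, (-4.5, 1.0), (-4.0, 0.0))
ll_varying = log_likelihood(counts, depths, group, (-4.5, 1.0), (-5.0, 2.5))
print(ll_shared, ll_varying)
```

Under this parameterization the count variance is m * mu * (1 - mu) * (1 + (m - 1) * phi), so a nonzero slope in `beta_star` corresponds to variability that differs with the covariate. A test for differential variability could compare the maximized likelihood with that slope fixed at zero against the maximized likelihood with it free; the coefficients above are fixed by hand rather than optimized.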


62P10 Applications of statistics to biology and medical sciences; meta analysis
62P12 Applications of statistics to environmental and related topics
62H20 Measures of association (correlation, canonical correlation, etc.)

