Gaussian process modelling in approximate Bayesian computation to estimate horizontal gene transfer in bacteria. (English) Zbl 1411.62320

Summary: Approximate Bayesian computation (ABC) can be used for model fitting when the likelihood function is intractable but simulating from the model is feasible. However, even a single evaluation of a complex model may take several hours, limiting the number of model evaluations available. Modelling the discrepancy between the simulated and observed data using a Gaussian process (GP) can be used to reduce the number of model evaluations required by ABC, but the sensitivity of this approach to a specific GP formulation has not yet been thoroughly investigated. We begin with a comprehensive empirical evaluation of using GPs in ABC, including various transformations of the discrepancies and two novel GP formulations. Our results indicate that the choice of GP may significantly affect the accuracy of the estimated posterior distribution. Selection of an appropriate GP model is thus important. We formulate an expected utility to measure the accuracy of classifying discrepancies below or above the ABC threshold, and show that it can be used to automate the GP model selection step. Finally, based on the understanding gained with toy examples, we fit a population genetic model for bacteria, providing insight into horizontal gene transfer events within the population and from external origins.
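
For readers unfamiliar with the setting, the following is a minimal sketch of plain ABC rejection sampling, the brute-force baseline whose simulation cost motivates the GP approach described in the summary. It is not the authors' method and does not include any GP modelling. The toy Gaussian simulator, the uniform prior on [-5, 5], the mean-based discrepancy, and all function names are assumptions chosen only for illustration.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Toy simulator (an assumption for illustration): n draws from N(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def discrepancy(simulated, observed):
    """Absolute difference between the two sample means."""
    return abs(statistics.fmean(simulated) - statistics.fmean(observed))


def abc_rejection(observed, threshold, n_draws, rng):
    """Keep prior draws whose simulated data fall within `threshold` of the observed data."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)  # hypothetical uniform prior on [-5, 5]
        simulated = simulate(theta, len(observed), rng)
        # The ABC decision: is the discrepancy below or above the threshold?
        if discrepancy(simulated, observed) <= threshold:
            accepted.append(theta)
    return accepted


rng = random.Random(0)
observed = simulate(2.0, 50, rng)  # synthetic "observed" data, true theta = 2
posterior_sample = abc_rejection(observed, threshold=0.1, n_draws=20000, rng=rng)
posterior_mean = statistics.fmean(posterior_sample)
print(len(posterior_sample), posterior_mean)
```

Every candidate parameter costs one simulator run here, and most runs are rejected. When a single run takes hours, this loop becomes impractical, which is why the summary considers replacing it with a GP model of the discrepancy as a function of the parameter.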


62P10 Applications of statistics to biology and medical sciences; meta analysis
60G15 Gaussian processes
62F15 Bayesian inference
92D10 Genetics and epigenetics

