
zbMATH — the first resource for mathematics

Stochastic gradient Markov chain Monte Carlo. (English) Zbl 1457.62024
Summary: Markov chain Monte Carlo (MCMC) algorithms are generally regarded as the gold standard technique for Bayesian inference. They are theoretically well understood and conceptually simple to apply in practice. The drawback of MCMC is that performing exact inference generally requires all of the data to be processed at each iteration of the algorithm. For large datasets, the computational cost of MCMC can be prohibitive, which has led to recent developments in scalable Monte Carlo algorithms with a significantly lower computational cost than standard MCMC. In this article, we focus on a particular class of scalable Monte Carlo algorithms, stochastic gradient Markov chain Monte Carlo (SGMCMC), which utilizes data subsampling techniques to reduce the per-iteration cost of MCMC. We provide an introduction to some popular SGMCMC algorithms, review the supporting theoretical results, and compare the efficiency of SGMCMC algorithms against MCMC on benchmark examples. The supporting R code is available online at https://github.com/chris-nemeth/sgmcmc-review-paper.
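The subsampling idea described in the summary, replacing the full-data gradient of the log-posterior with an unbiased minibatch estimate inside a Langevin update, can be sketched with stochastic gradient Langevin dynamics (SGLD), the algorithm of Welling and Teh [90]. The paper's own companion code is in R; the following is an independent illustrative Python sketch on a toy Gaussian mean model, with all parameter values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y_i ~ N(theta, 1), with a N(0, 10) prior on theta.
N = 10_000
theta_true = 2.0
y = rng.normal(theta_true, 1.0, size=N)

def grad_log_prior(theta):
    # Gradient of log N(0, 10) prior density.
    return -theta / 10.0

def grad_log_lik_minibatch(theta, batch):
    # Minibatch gradient, rescaled by N/n so it is an unbiased
    # estimate of the full-data log-likelihood gradient.
    return (N / len(batch)) * np.sum(batch - theta)

def sgld(n_iter=5000, batch_size=100, step=1e-5):
    theta = 0.0
    samples = np.empty(n_iter)
    for t in range(n_iter):
        # Subsample the data (with replacement) at each iteration.
        batch = y[rng.integers(0, N, size=batch_size)]
        grad = grad_log_prior(theta) + grad_log_lik_minibatch(theta, batch)
        # Euler discretization of the Langevin diffusion: a gradient
        # step plus Gaussian noise with variance equal to the step size.
        theta += 0.5 * step * grad + rng.normal(0.0, np.sqrt(step))
        samples[t] = theta
    return samples

samples = sgld()
# After burn-in, the chain fluctuates around the posterior mean (close to 2).
print(samples[2500:].mean())
```

Each iteration touches only `batch_size` of the `N` observations, which is the source of the per-iteration savings over standard MCMC; the price is a discretization bias controlled by the step size, which is the focus of much of the theory reviewed in the paper.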
MSC:
62-08 Computational methods for problems pertaining to statistics
62F15 Bayesian inference
62L20 Stochastic approximation
65C05 Monte Carlo methods
References:
[1] Ahn, S.; Korattikara, A.; Liu, N.; Rajan, S.; Welling, M., Large-Scale Distributed Bayesian Matrix Factorization Using Stochastic Gradient MCMC, Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 9-18 (2015)
[2] Ahn, S.; Korattikara, A.; Welling, M., “Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring,” 1591-1598 (2012)
[3] Aicher, C.; Ma, Y.-A.; Foti, N. J.; Fox, E. B., “Stochastic Gradient MCMC for State Space Models,” SIAM Journal on Mathematics of Data Science, 1, 555-587 (2019)
[4] Aicher, C., Putcha, S., Nemeth, C., Fearnhead, P., and Fox, E. B. (2019), “Stochastic Gradient MCMC for Nonlinear State Space Models,” arXiv no. 1901.10568.
[5] Andersen, M.; Winther, O.; Hansen, L. K.; Poldrack, R.; Koyejo, O., Bayesian Structure Learning for Dynamic Brain Connectivity, 1436-1446 (2018)
[6] Baker, J.; Fearnhead, P.; Fox, E. B.; Nemeth, C., Large-Scale Stochastic Sampling From the Probability Simplex, Advances in Neural Information Processing Systems, 6721-6731 (2018)
[7] Baker, J.; Fearnhead, P.; Fox, E. B.; Nemeth, C., “Control Variates for Stochastic Gradient MCMC,” Statistics and Computing, 29, 599-615 (2019) · Zbl 1430.62265
[8] Baker, J.; Fearnhead, P.; Fox, E. B.; Nemeth, C., “sgmcmc: An R Package for Stochastic Gradient Markov Chain Monte Carlo,” Journal of Statistical Software, 91, 1-27 (2019)
[9] Balan, A. K.; Rathod, V.; Murphy, K. P.; Welling, M., Bayesian Dark Knowledge, Advances in Neural Information Processing Systems, 3438-3446 (2015)
[10] Bardenet, R.; Doucet, A.; Holmes, C., Towards Scaling Up Markov Chain Monte Carlo: An Adaptive Subsampling Approach, 405-413 (2014)
[11] Bardenet, R.; Doucet, A.; Holmes, C., “On Markov Chain Monte Carlo Methods for Tall Data,” The Journal of Machine Learning Research, 18, 1515-1557 (2017)
[12] Beck, A.; Teboulle, M., “Mirror Descent and Nonlinear Projected Subgradient Methods for Convex Optimization,” Operations Research Letters, 31, 167-175 (2003) · Zbl 1046.90057
[13] Besag, J., “Comments on ‘Representations of Knowledge in Complex Systems’ by U. Grenander and M. I. Miller,” Journal of the Royal Statistical Society, Series B, 56, 591-592 (1994)
[14] Bierkens, J.; Fearnhead, P.; Roberts, G. O., “The Zig-Zag Process and Super-Efficient Sampling for Bayesian Analysis of Big Data,” The Annals of Statistics, 47, 1288-1320 (2019) · Zbl 1417.65008
[15] Bishop, C. M., Pattern Recognition and Machine Learning (2006), New York: Springer · Zbl 1107.68072
[16] Blei, D. M.; Kucukelbir, A.; McAuliffe, J. D., “Variational Inference: A Review for Statisticians,” Journal of the American Statistical Association, 112, 859-877 (2017)
[17] Blei, D. M.; Ng, A. Y.; Jordan, M. I., “Latent Dirichlet Allocation,” Journal of Machine Learning Research, 3, 993-1022 (2003) · Zbl 1112.68379
[18] Bouchard-Côté, A.; Vollmer, S. J.; Doucet, A., “The Bouncy Particle Sampler: A Nonreversible Rejection-Free Markov Chain Monte Carlo Method,” Journal of the American Statistical Association, 113, 855-867 (2018) · Zbl 1398.60084
[19] Brooks, S.; Gelman, A., “General Methods for Monitoring Convergence of Iterative Simulations,” Journal of Computational and Graphical Statistics, 7, 434-455 (1998)
[20] Brooks, S.; Gelman, A.; Jones, G.; Meng, X.-L., Handbook of Markov Chain Monte Carlo (2011), Boca Raton, FL: CRC Press · Zbl 1218.65001
[21] Brosse, N.; Durmus, A.; Moulines, É., The Promises and Pitfalls of Stochastic Gradient Langevin Dynamics, Advances in Neural Information Processing Systems, 8278-8288 (2018)
[22] Brosse, N.; Durmus, A.; Moulines, É.; Pereyra, M., Sampling From a Log-Concave Distribution With Compact Support With Proximal Langevin Monte Carlo, Conference on Learning Theory, 319-342 (2017)
[23] Bubeck, S.; Eldan, R.; Lehec, J., “Sampling From a Log-Concave Distribution With Projected Langevin Monte Carlo,” Discrete & Computational Geometry, 59, 757-783 (2018) · Zbl 1397.65010
[24] Carpenter, B.; Gelman, A.; Hoffman, M. D.; Lee, D.; Goodrich, B.; Betancourt, M.; Brubaker, M.; Guo, J.; Li, P.; Riddell, A., “Stan: A Probabilistic Programming Language,” Journal of Statistical Software, 76, 1-32 (2017)
[25] Chatterji, N.; Flammarion, N.; Ma, Y.; Bartlett, P.; Jordan, M., On the Theory of Variance Reduction for Stochastic Gradient Monte Carlo, Proceedings of Machine Learning Research (PMLR), 80, 764-773 (2018)
[26] Chen, T.; Fox, E.; Guestrin, C., Stochastic Gradient Hamiltonian Monte Carlo, 1683-1691 (2014)
[27] Cheng, X.; Chatterji, N. S.; Bartlett, P. L.; Jordan, M. I., “Underdamped Langevin MCMC: A Non-Asymptotic Analysis,” Proceedings of the 31st Conference on Learning Theory, Proceedings of Machine Learning Research (PMLR), eds. S. Bubeck, V. Perchet, and P. Rigollet, 300-323 (2018)
[28] Dalalyan, A. S., “Theoretical Guarantees for Approximate Sampling From Smooth and Log-Concave Densities,” Journal of the Royal Statistical Society, Series B, 79, 651-676 (2017) · Zbl 1411.62030
[29] Dalalyan, A. S.; Karagulyan, A., “User-Friendly Guarantees for the Langevin Monte Carlo With Inaccurate Gradient,” Stochastic Processes and Their Applications, 129, 5278-5311 (2019) · Zbl 1428.62316
[30] de Valpine, P.; Turek, D.; Paciorek, C. J.; Anderson-Bergman, C.; Lang, D. T.; Bodik, R., “Programming With Models: Writing Statistical Algorithms for General Model Structures With NIMBLE,” Journal of Computational and Graphical Statistics, 26, 403-413 (2017)
[31] Ding, N.; Fang, Y.; Babbush, R.; Chen, C.; Skeel, R. D.; Neven, H., Bayesian Sampling Using Stochastic Gradient Thermostats, Advances in Neural Information Processing Systems, 3203-3211 (2014)
[32] Dubey, K. A.; Reddi, S. J.; Williamson, S. A.; Poczos, B.; Smola, A. J.; Xing, E. P., Variance Reduction in Stochastic Gradient Langevin Dynamics, Advances in Neural Information Processing Systems, 1154-1162 (2016)
[33] Dunson, D. B.; Johndrow, J., “The Hastings Algorithm at Fifty,” Biometrika, 107, 1-23 (2020) · Zbl 1435.62042
[34] Durmus, A.; Moulines, E., “Nonasymptotic Convergence Analysis for the Unadjusted Langevin Algorithm,” The Annals of Applied Probability, 27, 1551-1587 (2017) · Zbl 1377.65007
[35] Ermak, D. L., “A Computer Simulation of Charged Particles in Solution. I. Technique and Equilibrium Properties,” The Journal of Chemical Physics, 62, 4189-4196 (1975)
[36] Fearnhead, P.; Bierkens, J.; Pollock, M.; Roberts, G. O., “Piecewise Deterministic Markov Processes for Continuous-Time Monte Carlo,” Statistical Science, 33, 386-412 (2018) · Zbl 1403.62148
[37] Fearnhead, P.; Papaspiliopoulos, O.; Roberts, G. O., “Particle Filters for Partially Observed Diffusions,” Journal of the Royal Statistical Society, Series B, 70, 755-777 (2008) · Zbl 05563368
[38] Gan, Z.; Chen, C.; Henao, R.; Carlson, D.; Carin, L., Scalable Deep Poisson Factor Analysis for Topic Modeling, 1823-1832 (2015)
[39] Gibbs, A. L.; Su, F. E., “On Choosing and Bounding Probability Metrics,” International Statistical Review, 70, 419-435 (2002) · Zbl 1217.62014
[40] Girolami, M.; Calderhead, B., “Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods” (with discussion), Journal of the Royal Statistical Society, Series B, 73, 123-214 (2011) · Zbl 1411.62071
[41] Gorham, J.; Duncan, A. B.; Vollmer, S. J.; Mackey, L., “Measuring Sample Quality With Diffusions,” The Annals of Applied Probability, 29, 2884-2928 (2019) · Zbl 1439.60073
[42] Gorham, J.; Mackey, L., Measuring Sample Quality With Stein’s Method, Advances in Neural Information Processing Systems, 226-234 (2015)
[43] Gorham, J.; Mackey, L., Measuring Sample Quality With Kernels, Proceedings of the 34th International Conference on Machine Learning, 70, 1292-1301 (2017)
[44] Hastings, W. K., “Monte Carlo Sampling Methods Using Markov Chains and Their Applications,” Biometrika, 57, 97-109 (1970) · Zbl 0219.65008
[45] Hoffman, M. D.; Gelman, A., “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Research, 15, 1593-1623 (2014) · Zbl 1319.60150
[46] Hsieh, Y.-P.; Kavis, A.; Rolland, P.; Cevher, V., Mirrored Langevin Dynamics, Advances in Neural Information Processing Systems, 2883-2892 (2018)
[47] Huggins, J.; Zou, J., Quantifying the Accuracy of Approximate Diffusions and Markov Chains, Artificial Intelligence and Statistics, 382-391 (2017)
[48] Korattikara, A.; Chen, Y.; Welling, M., Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget, International Conference on Machine Learning, 181-189 (2014)
[49] Kucukelbir, A.; Ranganath, R.; Gelman, A.; Blei, D., Automatic Variational Inference in Stan, Advances in Neural Information Processing Systems, 568-576 (2015)
[50] LeCun, Y.; Cortes, C.; Burges, C., “MNIST Handwritten Digit Database” (2010)
[51] Li, W.; Ahn, S.; Welling, M., Scalable MCMC for Mixed Membership Stochastic Blockmodels, Artificial Intelligence and Statistics, 723-731 (2016)
[52] Lunn, D. J.; Thomas, A.; Best, N.; Spiegelhalter, D., WinBUGS—A Bayesian Modelling Framework: Concepts, Structure, and Extensibility, Statistics and Computing, 10, 325-337 (2000)
[53] Ma, Y.-A.; Chen, T.; Fox, E., A Complete Recipe for Stochastic Gradient MCMC, Advances in Neural Information Processing Systems, 2917-2925 (2015)
[54] Ma, Y.-A., Foti, N. J., and Fox, E. B. (2017), “Stochastic Gradient MCMC Methods for Hidden Markov Models,” arXiv no. 1706.04632.
[55] Majka, M. B.; Mijatović, A.; Szpruch, Ł., “Nonasymptotic Bounds for Sampling Algorithms Without Log-Concavity,” The Annals of Applied Probability, 30, 1534-1581 (2020)
[56] Metropolis, N.; Rosenbluth, A. W.; Rosenbluth, M. N.; Teller, A. H.; Teller, E., “Equation of State Calculations by Fast Computing Machines,” The Journal of Chemical Physics, 21, 1087-1092 (1953) · Zbl 1431.65006
[57] Meyn, S. P.; Tweedie, R. L., “Computable Bounds for Geometric Convergence Rates of Markov Chains,” The Annals of Applied Probability, 4, 981-1011 (1994) · Zbl 0812.60059
[58] Minka, T. P., “Expectation Propagation for Approximate Bayesian Inference,” Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, 362-369 (2001), Morgan Kaufmann Publishers Inc.
[59] Minsker, S.; Srivastava, S.; Lin, L.; Dunson, D. B., “Robust and Scalable Bayes via a Median of Subset Posterior Measures,” The Journal of Machine Learning Research, 18, 4488-4527 (2017)
[60] Nagapetyan, T., Duncan, A. B., Hasenclever, L., Vollmer, S. J., Szpruch, L. and Zygalakis, K. (2017), “The True Cost of Stochastic Gradient Langevin Dynamics,” arXiv no. 1706.02692.
[61] Neal, R. M., “MCMC Using Hamiltonian Dynamics,” in Handbook of Markov Chain Monte Carlo, eds. S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, 113-162 (2011), Boca Raton, FL: CRC Press
[62] Neal, R. M., Bayesian Learning for Neural Networks, 118 (2012), New York: Springer
[63] Neiswanger, W.; Wang, C.; Xing, E. P., Asymptotically Exact, Embarrassingly Parallel MCMC, 623-632 (2014)
[64] Nemeth, C.; Sherlock, C., “Merging MCMC Subposteriors Through Gaussian-Process Approximations,” Bayesian Analysis, 13, 507-530 (2018) · Zbl 1407.62081
[65] Parisi, G., “Correlation Functions and Computer Simulations,” Nuclear Physics B, 180, 378-384 (1981)
[66] Patterson, S.; Teh, Y. W., Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex, Advances in Neural Information Processing Systems, 3102-3110 (2013)
[67] Pillai, N. S.; Stuart, A. M.; Thiéry, A. H., “Optimal Scaling and Diffusion Limits for the Langevin Algorithm in High Dimensions,” The Annals of Applied Probability, 22, 2320-2356 (2012) · Zbl 1272.60053
[68] Plummer, M., JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling, Proceedings of the 3rd International Workshop on Distributed Statistical Computing, 124, 10 (2003)
[69] Pollock, M.; Fearnhead, P.; Johansen, A. M.; Roberts, G. O., “Quasi-Stationary Monte Carlo and the ScaLE Algorithm,” Journal of the Royal Statistical Society, Series B (2020)
[70] Quiroz, M.; Kohn, R.; Villani, M.; Tran, M.-N., “Speeding Up MCMC by Efficient Data Subsampling,” Journal of the American Statistical Association, 114, 831-843 (2019) · Zbl 1420.62121
[71] Rabinovich, M.; Angelino, E.; Jordan, M. I., Variational Consensus Monte Carlo, Advances in Neural Information Processing Systems, 1207-1215 (2015)
[72] Raginsky, M.; Rakhlin, A.; Telgarsky, M., Non-Convex Learning via Stochastic Gradient Langevin Dynamics: A Nonasymptotic Analysis, Conference on Learning Theory, 1674-1703 (2017)
[73] Raj, A.; Stephens, M.; Pritchard, J. K., “fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Data Sets,” Genetics, 197, 573-589 (2014)
[74] Ripley, B. D., Stochastic Simulation (1987), New York: Wiley · Zbl 0613.65006
[75] Robbins, H.; Monro, S., “A Stochastic Approximation Method,” The Annals of Mathematical Statistics, 22, 400-407 (1951) · Zbl 0054.05901
[76] Roberts, G. O.; Rosenthal, J. S., “Geometric Ergodicity and Hybrid Markov Chains,” Electronic Communications in Probability, 2, 13-25 (1997) · Zbl 0890.60061
[77] Roberts, G. O.; Rosenthal, J. S., “Optimal Scaling of Discrete Approximations to Langevin Diffusions,” Journal of the Royal Statistical Society, Series B, 60, 255-268 (1998) · Zbl 0913.60060
[78] Roberts, G. O.; Rosenthal, J. S., “Optimal Scaling for Various Metropolis-Hastings Algorithms,” Statistical Science, 16, 351-367 (2001) · Zbl 1127.65305
[79] Roberts, G. O.; Rosenthal, J. S., “General State Space Markov Chains and MCMC Algorithms,” Probability Surveys, 1, 20-71 (2004) · Zbl 1189.60131
[80] Roberts, G. O.; Tweedie, R. L., “Exponential Convergence of Langevin Distributions and Their Discrete Approximations,” Bernoulli, 2, 341-363 (1996) · Zbl 0870.60027
[81] Salakhutdinov, R.; Mnih, A., Bayesian Probabilistic Matrix Factorization Using Markov Chain Monte Carlo, Proceedings of the 25th International Conference on Machine Learning, 880-887 (2008)
[82] Salehi, F., Celis, L. E., and Thiran, P. (2017), “Stochastic Optimization With Bandit Sampling,” arXiv no. 1708.02544.
[83] Scott, S. L.; Blocker, A. W.; Bonassi, F. V.; Chipman, H. A.; George, E. I.; McCulloch, R. E., “Bayes and Big Data: The Consensus Monte Carlo Algorithm,” International Journal of Management Science and Engineering Management, 11, 78-88 (2016)
[84] Sen, D.; Sachs, M.; Lu, J.; Dunson, D. B., “Efficient Posterior Sampling for High-Dimensional Imbalanced Logistic Regression,” Biometrika (2020) · Zbl 1457.62221
[85] Srivastava, S.; Li, C.; Dunson, D. B., “Scalable Bayes via Barycenter in Wasserstein Space,” The Journal of Machine Learning Research, 19, 312-346 (2018) · Zbl 1444.62037
[86] Teh, Y. W.; Thiery, A. H.; Vollmer, S. J., “Consistency and Fluctuations for Stochastic Gradient Langevin Dynamics,” The Journal of Machine Learning Research, 17, 193-225 (2016)
[87] Tran, D., Kucukelbir, A., Dieng, A. B., Rudolph, M., Liang, D., and Blei, D. M. (2016), “Edward: A Library for Probabilistic Modeling, Inference, and Criticism,” arXiv no. 1610.09787.
[88] Vollmer, S. J.; Zygalakis, K. C.; Teh, Y. W., “Exploration of the (Non-)Asymptotic Bias and Variance of Stochastic Gradient Langevin Dynamics,” The Journal of Machine Learning Research, 17, 5504-5548 (2016)
[89] Wang, Y.-X.; Fienberg, S.; Smola, A., Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo, 2493-2502 (2015)
[90] Welling, M.; Teh, Y. W., Bayesian Learning via Stochastic Gradient Langevin Dynamics, 681-688 (2011)
[91] Yogatama, D.; Wang, C.; Routledge, B. R.; Smith, N. A.; Xing, E. P., “Dynamic Language Models for Streaming Text,” Transactions of the Association for Computational Linguistics, 2, 181-192 (2014)
[92] Zuanetti, D.; Müller, P.; Zhu, Y.; Yang, S.; Ji, Y., “Bayesian Nonparametric Clustering for Large Data Sets,” Statistics and Computing, 29, 203-215 (2019) · Zbl 1430.62146