zbMATH — the first resource for mathematics

A stochastic variational framework for fitting and diagnosing generalized linear mixed models. (English) Zbl 1327.62167
Summary: In stochastic variational inference, the variational Bayes objective function is optimized using stochastic gradient approximation, where gradients computed on small random subsets of data are used to approximate the true gradient over the whole data set. This enables complex models to be fit to large data sets as data can be processed in mini-batches. In this article, we extend stochastic variational inference for conjugate-exponential models to nonconjugate models and present a stochastic nonconjugate variational message passing algorithm for fitting generalized linear mixed models that is scalable to large data sets. In addition, we show that diagnostics for prior-likelihood conflict, which are useful for Bayesian model criticism, can be obtained from nonconjugate variational message passing automatically, as an alternative to simulation-based Markov chain Monte Carlo methods. Finally, we demonstrate that for moderate-sized data sets, convergence can be accelerated by using the stochastic version of nonconjugate variational message passing in the initial stage of optimization before switching to the standard version.
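The mini-batch idea in the summary can be sketched on a toy problem. The code below is an illustration only, not the article's stochastic nonconjugate variational message passing algorithm. It assumes a Gaussian linear model with unit noise variance and a zero-mean Gaussian prior; for that model the gradient of the variational objective with respect to the mean of a Gaussian approximation equals the log-joint gradient at that mean. All function names, the batch size, and the step-size schedule are hypothetical choices. A gradient computed on a random subset is rescaled by N divided by the batch size, so that it is an unbiased estimate of the gradient over the whole data set, and the step sizes decrease in the Robbins-Monro manner.

```python
import numpy as np


def full_gradient(m, X, y, prior_prec=1.0):
    """Gradient of the log joint density with respect to the coefficients m,
    using every row of the data (unit noise variance, N(0, 1/prior_prec) prior)."""
    return X.T @ (y - X @ m) - prior_prec * m


def minibatch_gradient(m, X, y, idx, prior_prec=1.0):
    """Estimate of full_gradient from the rows in idx only.
    The likelihood part is rescaled by N / |idx| so the estimate is unbiased
    when idx is a uniformly drawn subset."""
    scale = X.shape[0] / len(idx)
    Xb, yb = X[idx], y[idx]
    return scale * (Xb.T @ (yb - Xb @ m)) - prior_prec * m


def fit_stochastic(X, y, batch_size=50, n_steps=2000, seed=0):
    """Stochastic gradient ascent on the coefficient mean.
    Step sizes decay like t**-0.7, so they sum to infinity while their
    squares have a finite sum (Robbins-Monro conditions)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = np.zeros(p)
    for t in range(1, n_steps + 1):
        idx = rng.choice(n, size=batch_size, replace=False)
        step = (t + 1) ** -0.7 / n  # assumed schedule, scaled by 1/n for stability
        m = m + step * minibatch_gradient(m, X, y, idx)
    return m


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=2000)
    exact = np.linalg.solve(X.T @ X + np.eye(3), X.T @ y)  # exact posterior mean
    print("stochastic:", fit_stochastic(X, y))
    print("exact:     ", exact)
```

Each update touches only `batch_size` rows, which is what lets the data be processed in mini-batches. In this toy the exact posterior mean is available in closed form and serves only as a check on the stochastic iterates.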

62F15 Bayesian inference
62J12 Generalized linear models (logistic models)
62J20 Diagnostics, and linear inference and regression