Consistency of variational Bayes inference for estimation and model selection in mixtures. (English) Zbl 1403.62035

Summary: Mixture models are widely used in Bayesian statistics and machine learning, in particular in computational biology, natural language processing and many other fields. Variational inference, a technique that approximates intractable posteriors by means of optimization algorithms, is extremely popular in practice when dealing with complex models such as mixtures. The contribution of this paper is two-fold. First, we study the concentration of variational approximations of posteriors, which is still an open problem for general mixtures, and we derive consistency and rates of convergence. Second, we tackle the problem of model selection for the number of components: we study the approach already used in practice, which consists of maximizing a numerical criterion (the Evidence Lower Bound). We prove that this strategy indeed leads to strong oracle inequalities. We illustrate our theoretical results by applications to Gaussian and multinomial mixtures.
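
For readers unfamiliar with the criterion named in the summary, the standard (untempered) form of the Evidence Lower Bound and the resulting selection rule can be sketched as follows. The notation is assumed here for illustration and is not taken from this record: \(L_n\) is the likelihood of the \(n\) observations, \(\pi_K\) the prior on the \(K\)-component mixture, and \(\mathcal{F}_K\) the variational family. The paper itself may work with a tempered variant of this quantity.

```latex
% Assumed notation (hypothetical, for illustration only):
%   L_n(\theta)    likelihood of the n observations
%   \pi_K          prior on the parameters of the K-component mixture
%   \mathcal{F}_K  variational family, \rho a candidate approximation
\[
  \mathrm{ELBO}(K)
  = \max_{\rho \in \mathcal{F}_K}
    \left\{ \mathbb{E}_{\theta \sim \rho}\!\left[\log L_n(\theta)\right]
            - \mathrm{KL}\!\left(\rho \,\|\, \pi_K\right) \right\}
  \le \log \int L_n(\theta)\, \pi_K(\mathrm{d}\theta),
\]
\[
  \widehat{K} \in \operatorname*{arg\,max}_{K \ge 1} \mathrm{ELBO}(K).
\]
```

The inequality holds for every \(\rho\), and the gap equals the Kullback-Leibler divergence between \(\rho\) and the exact posterior, so the bound is tight when \(\mathcal{F}_K\) contains that posterior.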

MSC:

62F15 Bayesian inference
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62F12 Asymptotic properties of parametric estimators
62F35 Robustness and adaptive procedures (parametric inference)

Software:

nmfem

References:

[1] H. Akaike. A new look at the statistical model identification., IEEE Transactions on Automatic Control, 19:716–723, 1974. · Zbl 0314.62039 · doi:10.1109/TAC.1974.1100705
[2] P. Alquier and V. Cottet. 1-bit matrix completion: PAC-Bayesian analysis of a variational approximation., Machine Learning, 107(3):579–603, 2018. · Zbl 1461.15032 · doi:10.1007/s10994-017-5667-z
[3] P. Alquier and J. Ridgway. Concentration of tempered posteriors and of their variational approximations., arXiv preprint arXiv:1706.09293, 2017.
[4] P. Alquier, J. Ridgway, and N. Chopin. On the properties of variational approximations of Gibbs posteriors., JMLR, 17(239):1–41, 2016. · Zbl 1437.62129
[5] S. Ayer and H.S. Sawhney. Layered representation of motion video using robust maximum-likelihood estimation of mixture models and MDL encoding., International Conference on Computer Vision, 1995.
[6] A.G. Bacharoglou. Approximation of probability distributions by convex mixtures of Gaussian measures., Proceedings of the American Mathematical Society, 138(7):2619–2628, 2010. · Zbl 1513.62029 · doi:10.1090/S0002-9939-10-10340-2
[7] G. Behrens, N. Friel, and M. Hurn. Tuning tempered transitions., Statistics and computing, 22(1):65–78, 2012. · Zbl 1322.62008 · doi:10.1007/s11222-010-9206-z
[8] A. Bhattacharya, D. Pati, and Y. Yang. Bayesian fractional posteriors., arXiv preprint arXiv:1611.01125 (to appear in the Annals of Statistics), 2016.
[9] C. Biernacki, G. Celeux, and G. Govaert. An improvement of the NEC criterion for assessing the number of clusters in a mixture model., Pattern Recognition Letters, 20(3):267–272, 1999. · Zbl 0933.68117 · doi:10.1016/S0167-8655(98)00144-5
[10] D. M. Blei, A.Y. Ng, C. Wang, and M.I. Jordan. Latent Dirichlet allocation., The Journal of Machine Learning Research, 3:993–1022, 2003. · Zbl 1112.68379
[11] D.M. Blei, A. Kucukelbir, and J.D. McAuliffe. Variational inference: a review for statisticians., arXiv preprint arXiv:1601.00670, 2017.
[12] C. Bouveyron and C. Brunet-Saumard. Model-based clustering of high-dimensional data: a review., Computational Statistics and Data Analysis, 71:52–78, 2014. · Zbl 1306.65033 · doi:10.1016/j.csda.2012.12.008
[13] P. Carbonetto and M. Stephens. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies., Bayesian analysis, 7(1):73–108, 2012. · Zbl 1330.62089 · doi:10.1214/12-BA703
[14] L. Carel and P. Alquier. Simultaneous dimension reduction and clustering via the NMF-EM algorithm., arXiv preprint arXiv:1709.03346, 2017.
[15] O. Catoni., Statistical learning theory and stochastic optimization. Saint-Flour Summer School on Probability Theory 2001 (Jean Picard ed.), Lecture Notes in Mathematics. Springer, 2004.
[16] O. Catoni., PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Institute of Mathematical Statistics Lecture Notes—Monograph Series, 56. Institute of Mathematical Statistics, Beachwood, OH, 2007. · Zbl 1277.62015
[17] B. E. Chérief-Abdellatif and P. Alquier., Supplement to “Consistency of Variational Bayes Inference for Estimation and Model Selection in Mixtures”, DOI: 10.1214/18-EJS1475SUPP, 2018. · Zbl 1403.62035
[18] G. Celeux, S. Frühwirth-Schnatter, and C. P. Robert (Editors)., Handbook of mixture analysis. CRC Press, 2018.
[19] P. Deb, W.T. Gallo, P. Ayyagari, J.M. Fletcher, and J.L. Sindelar. The effect of job loss on overweight and drinking., Journal of Health Economics, 2011.
[20] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm., Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977. · Zbl 0364.62022 · doi:10.1111/j.2517-6161.1977.tb01600.x
[21] M. N. Do. Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models., IEEE Signal Processing Letters, 10(4):115–118, 2003.
[22] E. Gassiat, J. Rousseau, and E. Vernet. Efficient semiparametric estimation and model selection for multidimensional mixtures., Electronic Journal of Statistics, 12(1):703–740, 2018. · Zbl 1473.62106 · doi:10.1214/17-EJS1387
[23] S. Ghosal, J. K. Ghosh, and A. W. Van Der Vaart. Convergence rates of posterior distributions., Annals of Statistics, pages 500–531, 2000. · Zbl 1105.62315 · doi:10.1214/aos/1016218228
[24] L. Gordon. A stochastic approach to the gamma function., The American Mathematical Monthly, 101(9):858–865, 1994. · Zbl 0823.33001 · doi:10.1080/00029890.1994.11997039
[25] P. D. Grünwald and T. Van Ommen. Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it., Bayesian Analysis, 12(4):1069–1103, 2017. · Zbl 1384.62088 · doi:10.1214/17-BA1085
[26] S. Guo. Monotonicity and concavity properties of some functions involving the gamma function with applications., JIPAM. Journal of Inequalities in Pure & Applied Mathematics [electronic only], 7, 2006. · Zbl 1133.33002
[27] J.R. Hershey and P.A. Olsen. Approximating the Kullback Leibler divergence between Gaussian mixture models., IEEE International Conference on Acoustics, Speech and Signal Processing, 4, 2007.
[28] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference., The Journal of Machine Learning Research, 14(1):1303–1347, 2013. · Zbl 1317.68163
[29] W. Kruijer, J. Rousseau, and A. Van Der Vaart. Adaptive Bayesian density estimation with location-scale mixtures., Electronic Journal of Statistics, 4:1225–1257, 2010. · Zbl 1329.62188 · doi:10.1214/10-EJS584
[30] A. Laforgia and P. Natalini. On some inequalities for the gamma function., Advances in Dynamical Systems and Applications, 8(2):261–267, 2013.
[31] P. Massart., Concentration inequalities and model selection. Saint-Flour Summer School on Probability Theory 2003 (Jean Picard ed.), Lecture Notes in Mathematics. Springer, 2007.
[32] P. D. McNicholas. Model-based clustering., Journal of Classification, 33(3):331–373, 2016. · Zbl 1364.62155 · doi:10.1007/s00357-016-9211-9
[33] N. Nasios and A.G. Bors. Variational learning for Gaussian mixture models., IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 36:849–862, 2006.
[34] R. M. Neal. Sampling from multimodal distributions using tempered transitions., Statistics and Computing, 6(4):353–366, 1996.
[35] A. O’Hagan, T. B. Murphy, and I. C. Gormley. Computational aspects of fitting mixture models via the expectation–maximization algorithm., Computational Statistics & Data Analysis, 56(12):3843–3864, 2012. · Zbl 1255.62180 · doi:10.1016/j.csda.2012.05.011
[36] W. Pan, J. Lin, and C.T. Le. A mixture model approach to detecting differentially expressed genes with microarray data., Functional & Integrative Genomics, 3:117–124, 2003.
[37] L. Rigouste, O. Cappé, and F. Yvon. Inference and evaluation of the multinomial mixture model for text clustering., Information Processing & Management, 43:1260–1280, 2007.
[38] G. Schwarz. Estimating the dimension of a model., The Annals of Statistics, 6(2):461–464, 1978. · Zbl 0379.62005 · doi:10.1214/aos/1176344136
[39] Y. Singer and M. K. Warmuth. Batch and online parameter estimation of Gaussian mixtures based on the joint entropy. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 11. MIT Press, Cambridge, MA, 1999.
[40] C. J. Stoneking. Bayesian inference of Gaussian mixture models with noninformative priors., arXiv preprint arXiv:1405.4895, 2014.
[41] E.B. Sudderth and M.I. Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes., Advances in Neural Information Processing Systems, pages 1585–1592, 2009.
[42] N. Syring and R. Martin. Scaling the Gibbs posterior credible regions., Preprint, 2015.
[43] T. Van Erven and P. Harremoës. Rényi divergence and Kullback-Leibler divergence., IEEE Transactions on Information Theory, 60(7):3797–3820, 2014. · Zbl 1360.94180 · doi:10.1109/TIT.2014.2320500
[44] Y. Wang and D.M. Blei. Frequentist consistency of variational Bayes., arXiv preprint arXiv:1705.03439v1, accepted for publication in JASA, 2017.
[45] L. Watier, S. Richardson, and P. J. Green. Using Gaussian mixtures with unknown number of components for mixed model estimation., 14th International Workshop on Statistical Modelling, Graz, Austria, 1999.
[46] Y. Wu and P. Yang. Optimal estimation of Gaussian mixtures via denoised method of moments., working paper, 2018.
[47] Y. Yang. Can the strengths of AIC and BIC be shared? A conflict between model indentification and regression estimation., Biometrika, 92(4):937–950, 2005. · Zbl 1151.62301 · doi:10.1093/biomet/92.4.937
[48] Y. Yang, D. Pati, and A. Bhattacharya. \(\alpha\)-variational inference with statistical guarantees., arXiv preprint arXiv:1710.03266v1, 2017.
[49] F. Zhang and C. Gao. Convergence rates of variational posterior distributions., arXiv preprint arXiv:1712.02519v1, 2017.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.