
Trust-region variational inference with Gaussian mixture models. (English) Zbl 07307467

Summary: Many machine-learning methods rely on approximate inference from intractable probability distributions. Variational inference approximates such distributions by tractable models that can subsequently be used for approximate inference. Learning sufficiently accurate approximations requires a rich model family and careful exploration of the relevant modes of the target distribution. We propose a method for learning accurate Gaussian mixture model (GMM) approximations of intractable probability distributions that builds on insights from policy search, using information-geometric trust regions for principled exploration. For efficient improvement of the GMM approximation, we derive a lower bound on the corresponding optimization objective that enables us to update the components independently; the use of this lower bound ensures convergence to a stationary point of the original objective. The number of components is adapted online by adding new components in promising regions and deleting components with negligible weight. We demonstrate on several domains that we can learn approximations of complex, multimodal distributions with a quality unmet by previous variational inference methods, and that the GMM approximation can be used to draw samples on par with those produced by state-of-the-art MCMC samplers while requiring up to three orders of magnitude less computation.
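As an illustrative sketch (not the authors' implementation), the objective such methods maximize is the evidence lower bound E_{x∼q}[log p̃(x) − log q(x)] for a GMM approximation q of an unnormalized target density p̃. The hypothetical helpers below (`log_gmm_pdf`, `elbo_estimate` are names chosen here for illustration) estimate this quantity by Monte Carlo:

```python
import numpy as np

def log_gmm_pdf(x, weights, means, covs):
    """Log density of a Gaussian mixture, evaluated at points x of shape (N, D)."""
    comps = []
    for w, mu, cov in zip(weights, means, covs):
        d = mu.shape[0]
        diff = x - mu
        prec = np.linalg.inv(cov)
        quad = np.einsum("ni,ij,nj->n", diff, prec, diff)   # Mahalanobis terms
        logdet = np.linalg.slogdet(cov)[1]
        comps.append(np.log(w) - 0.5 * (quad + logdet + d * np.log(2 * np.pi)))
    # log-sum-exp over components for numerical stability
    return np.logaddexp.reduce(np.stack(comps), axis=0)

def elbo_estimate(log_target, weights, means, covs, n_samples=2000, rng=None):
    """Monte Carlo estimate of E_q[log p~(x) - log q(x)] for a GMM q."""
    rng = np.random.default_rng(rng)
    # Ancestral sampling: pick a component, then draw from its Gaussian.
    comps = rng.choice(len(weights), size=n_samples, p=weights)
    xs = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in comps])
    return np.mean(log_target(xs) - log_gmm_pdf(xs, weights, means, covs))
```

When `log_target` is a normalized density, this estimate equals −KL(q‖p): it is zero exactly when q matches the target and negative otherwise, which is why maximizing it drives q toward the target's modes.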

MSC:

68T05 Learning and adaptive systems in artificial intelligence
Full Text: arXiv Link
