
Posterior sampling strategies based on discretized stochastic differential equations for machine learning applications. (English) Zbl 07307488

Summary: With the advent of GPU-assisted hardware and maturing high-efficiency software platforms such as TensorFlow and PyTorch, Bayesian posterior sampling for neural networks becomes feasible. In this article we discuss Bayesian parametrization in machine learning based on Markov chain Monte Carlo methods, specifically discretized stochastic differential equations such as Langevin dynamics and extended system methods in which an ensemble of walkers is employed to enhance sampling. We provide a glimpse of the potential of the sampling-intensive approach by studying (and visualizing) the loss landscape of a neural network applied to the MNIST data set. Moreover, we investigate how the sampling efficiency itself can be significantly enhanced through an ensemble quasi-Newton preconditioning method. This article accompanies the release of a new TensorFlow software package, the Thermodynamic Analytics ToolkIt, which is used in the computational experiments.
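The summary names discretized Langevin dynamics as the basic posterior sampling mechanism. As a rough illustration only (a minimal sketch, not the Thermodynamic Analytics ToolkIt API and not the authors' exact scheme), the snippet below applies an Euler-Maruyama discretization of overdamped Langevin dynamics to a toy quadratic potential; the names `grad_potential`, `step_size`, and `beta` are illustrative assumptions.

```python
# Sketch of overdamped Langevin (Euler-Maruyama) sampling on a toy
# quadratic potential. Names and defaults are illustrative assumptions,
# not part of the Thermodynamic Analytics ToolkIt.
import numpy as np

def grad_potential(theta):
    """Gradient of the toy potential U(theta) = 0.5 * ||theta||^2."""
    return theta

def langevin_sample(theta0, n_steps=10_000, step_size=1e-2, beta=1.0, seed=0):
    """Iterate theta <- theta - h * grad U(theta) + sqrt(2 h / beta) * N(0, I)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    samples = np.empty((n_steps, theta.size))
    for k in range(n_steps):
        noise = rng.standard_normal(theta.size)
        theta = (theta - step_size * grad_potential(theta)
                 + np.sqrt(2.0 * step_size / beta) * noise)
        samples[k] = theta
    return samples

if __name__ == "__main__":
    chain = langevin_sample(np.zeros(2))
    # For this quadratic potential at beta = 1 the target distribution is a
    # standard Gaussian, so the empirical variance should be close to 1.
    print("empirical variance per coordinate:", chain.var(axis=0))
```

In practice the exact gradient above would be replaced by a stochastic (mini-batch) gradient of the neural-network loss, which is the setting the article studies.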

MSC:

68T05 Learning and adaptive systems in artificial intelligence
