# zbMATH — the first resource for mathematics

Bridging the gap between constant step size stochastic gradient descent and Markov chains. (English) Zbl 07241594
The paper studies the minimization of a strongly convex objective function, given access to unbiased estimates of its gradient, via stochastic gradient descent (SGD), also known as the Robbins-Monro algorithm [H. Robbins and S. Monro, Ann. Math. Stat. 22, 400–407 (1951; Zbl 0054.05901)], run with a constant step size. Since a detailed analysis had previously been carried out only for quadratic functions, the authors provide an explicit asymptotic expansion of the moments of the averaged SGD iterates that captures the dependence on the initial conditions, the effect of the noise and of the step size, as well as the lack of convergence in the general (nonquadratic) case. The analysis brings tools from Markov chain theory into the study of stochastic gradient methods. It is shown that Richardson-Romberg extrapolation [G. Pagès, Monte Carlo Methods Appl. 13, No. 1, 37–70 (2007; Zbl 1119.65004); N. Frikha and L. Huang, Stochastic Processes Appl. 125, No. 11, 4066–4101 (2015; Zbl 1336.60137)] may be used to get closer to the global optimum. Empirical improvements from the new extrapolation scheme are demonstrated.
This methodological problem is of interest in a variety of practical tasks arising in large-scale machine learning, optimization and stochastic approximation.
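The two ingredients discussed above can be sketched in a few lines of code. The following is an illustrative toy example, not the paper's experimental setup: a hypothetical one-dimensional strongly convex, nonquadratic objective with additive Gaussian gradient noise. Constant-step-size SGD with Polyak-Ruppert averaging settles at distance $O(\gamma)$ from the optimum, and the Richardson-Romberg combination $2\bar{x}_\gamma - \bar{x}_{2\gamma}$ cancels the first-order bias term.

```python
import math
import random

def grad(x):
    # Gradient of a strongly convex, nonquadratic toy objective
    # f(x) = x^2/2 + log(1 + exp(-x)); its minimizer solves x = 1/(1 + e^x),
    # i.e. x* ~ 0.4013. (This objective is an illustrative choice only.)
    return x - 1.0 / (1.0 + math.exp(x))

def averaged_sgd(step, n_iter, seed=0):
    """Constant-step-size SGD with Polyak-Ruppert averaging of the iterates."""
    rng = random.Random(seed)
    x, total = 5.0, 0.0
    for _ in range(n_iter):
        g = grad(x) + rng.gauss(0.0, 1.0)  # unbiased estimate of the gradient
        x -= step * g
        total += x
    return total / n_iter  # averaged iterate; its bias is O(step)

gamma = 0.05
xbar_g = averaged_sgd(gamma, 200_000)            # run with step gamma
xbar_2g = averaged_sgd(2.0 * gamma, 200_000, seed=1)  # run with step 2*gamma
# Richardson-Romberg extrapolation: cancels the O(gamma) bias term.
xbar_rr = 2.0 * xbar_g - xbar_2g
```

In the general (nonquadratic) case, `xbar_g` does not converge to the minimizer as the number of iterations grows; it stabilizes near a point whose distance to the optimum is of order the step size, which is precisely what the extrapolated combination reduces.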
##### MSC:
 62L20 Stochastic approximation
 90C15 Stochastic programming
 93E35 Stochastic learning and adaptive control
 60J22 Computational methods in Markov chains
##### References:
[1] Abdulle, A., Vilmart, G. and Zygalakis, K. C. (2014). High order numerical approximation of the invariant measure of ergodic SDEs. SIAM J. Numer. Anal. 52 1600-1622. · Zbl 1310.65007
[2] Aguech, R., Moulines, E. and Priouret, P. (2000). On a perturbation approach for the analysis of stochastic tracking algorithms. SIAM J. Control Optim. 39 872-899. · Zbl 0972.60026
[3] Bach, F. (2014). Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. J. Mach. Learn. Res. 15 595-627. · Zbl 1318.62224
[4] Bach, F. and Moulines, E. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (NIPS) 451-459.
[5] Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate $$O(1/n)$$. In Advances in Neural Information Processing Systems (NIPS).
[6] Benaim, M. (1996). A dynamical system approach to stochastic approximations. SIAM J. Control Optim. 34 437-472. · Zbl 0841.62072
[7] Benveniste, A., Métivier, M. and Priouret, P. (1990). Adaptive Algorithms and Stochastic Approximations. Applications of Mathematics (New York) 22. Springer, Berlin. Translated from the French by Stephen S. Wilson. · Zbl 0752.93073
[8] Bertsekas, D. P. (1999). Nonlinear Programming, 2nd ed. Athena Scientific Optimization and Computation Series. Athena Scientific, Belmont, MA. · Zbl 1015.90077
[9] Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS).
[10] Bouton, C. and Pagès, G. (1997). About the multidimensional competitive learning vector quantization algorithm with constant gain. Ann. Appl. Probab. 7 679-710. · Zbl 0892.60082
[11] Chee, J. and Toulis, P. (2017). Convergence diagnostics for stochastic gradient descent with constant step size. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
[12] Chen, C., Ding, N. and Carin, L. (2015). On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems (NIPS) 2269-2277.
[13] Chen, X., Lee, J. D., Tong, X. T. and Zhang, Y. (2016). Statistical inference for model parameters in stochastic gradient descent. ArXiv preprint. Available at arXiv:1610.08637. · Zbl 1440.62287
[14] Dalalyan, A. S. (2017). Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. R. Stat. Soc. Ser. B. Stat. Methodol. 79 651-676. · Zbl 1411.62030
[15] Défossez, A. and Bach, F. (2015). Averaged least-mean-squares: Bias-variance trade-offs and optimal sampling distributions. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
[16] Dieuleveut, A. and Bach, F. (2016). Nonparametric stochastic approximation with large step-sizes. Ann. Statist. 44 1363-1399. · Zbl 1346.60041
[17] Dieuleveut, A., Durmus, A. and Bach, F. (2019). Supplement to “Bridging the gap between constant step size stochastic gradient descent and Markov chains.” https://doi.org/10.1214/19-AOS1850SUPP.
[18] Dieuleveut, A., Flammarion, N. and Bach, F. (2017). Harder, better, faster, stronger convergence rates for least-squares regression. J. Mach. Learn. Res. 18 101. · Zbl 1441.62215
[19] Durmus, A. and Moulines, É. (2017). Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab. 27 1551-1587. · Zbl 1377.65007
[20] Durmus, A., Simsekli, U., Moulines, E., Badeau, R. and Richard, G. (2016). Stochastic gradient Richardson-Romberg Markov chain Monte Carlo. In Advances in Neural Information Processing Systems (NIPS) 2047-2055.
[21] Fort, J.-C. and Pagès, G. (1996). Convergence of stochastic algorithms: From the Kushner-Clark theorem to the Lyapounov functional method. Adv. in Appl. Probab. 28 1072-1094. · Zbl 0881.62085
[22] Fort, J.-C. and Pagès, G. (1999). Asymptotic behavior of a Markovian stochastic algorithm with constant step. SIAM J. Control Optim. 37 1456-1482. · Zbl 0954.60057
[23] Freidlin, M. I. and Wentzell, A. D. (1998). Random Perturbations of Dynamical Systems, 2nd ed. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences] 260. Springer, New York. · Zbl 0922.60006
[24] Glynn, P. W. and Meyn, S. P. (1996). A Liapounov bound for solutions of the Poisson equation. Ann. Probab. 24 916-931. · Zbl 0863.60063
[25] Hartman, P. (2002). Ordinary Differential Equations, 2nd ed. Classics in Applied Mathematics 38. SIAM, Philadelphia, PA. Corrected reprint of the second (1982) edition [Birkhäuser, Boston, MA; MR0658490 (83e:34002)]. With a foreword by Peter Bates. · Zbl 1009.34001
[26] He, K., Zhang, X., Ren, S. and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770-778.
[27] Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P. and Sidford, A. (2018). Accelerating stochastic gradient descent. In Proceedings of the International Conference on Learning Theory (COLT). · Zbl 06982979
[28] Jain, P., Kakade, S. M., Kidambi, R., Netrapalli, P. and Sidford, A. (2017). Parallelizing stochastic gradient descent for least squares regression: Mini-batching, averaging, and model misspecification. J. Mach. Learn. Res. 18 223. · Zbl 06982979
[29] Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS) 1097-1105.
[30] Kushner, H. J. and Clark, D. S. (1978). Stochastic Approximation Methods for Constrained and Unconstrained Systems. Applied Mathematical Sciences 26. Springer, New York-Berlin. · Zbl 0381.60004
[31] Lan, G. (2012). An optimal method for stochastic composite optimization. Math. Program. 133 365-397. · Zbl 1273.90136
[32] Levy, E. (2006). Why do partitions occur in Faà di Bruno’s chain rule for higher derivatives? Technical Report 0602183.
[33] Ljung, L. (1977). Analysis of recursive stochastic algorithms. IEEE Trans. Automat. Control AC-22 551-575. · Zbl 0362.93031
[34] Ljung, L., Pflug, G. and Walk, H. (1992). Stochastic Approximation and Optimization of Random Systems. DMV Seminar 17. Birkhäuser, Basel. · Zbl 0747.62090
[35] Mandt, S., Hoffman, M. and Blei, D. M. (2016). A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning 354-363.
[36] Mandt, S., Hoffman, M. D. and Blei, D. M. (2017). Stochastic gradient descent as approximate Bayesian inference. J. Mach. Learn. Res. 18 134. · Zbl 1442.62055
[37] Mattingly, J. C., Stuart, A. M. and Higham, D. J. (2002). Ergodicity for SDEs and approximations: Locally Lipschitz vector fields and degenerate noise. Stochastic Process. Appl. 101 185-232. · Zbl 1075.60072
[38] Métivier, M. and Priouret, P. (1984). Applications of a Kushner and Clark lemma to general classes of stochastic algorithms. IEEE Trans. Inform. Theory 30 140-151. · Zbl 0546.62056
[39] Métivier, M. and Priouret, P. (1987). Théorèmes de convergence presque sûre pour une classe d’algorithmes stochastiques à pas décroissant. Probab. Theory Related Fields 74 403-428. · Zbl 0588.62153
[40] Meyn, S. and Tweedie, R. L. (2009). Markov Chains and Stochastic Stability, 2nd ed. Cambridge Univ. Press, Cambridge. · Zbl 1165.60001
[41] Moulines, E., Priouret, P. and Roueff, F. (2005). On recursive estimation for time varying autoregressive processes. Ann. Statist. 33 2610-2654. · Zbl 1084.62089
[42] Nedic, A. and Bertsekas, D. (2001). Convergence rate of incremental subgradient algorithms. In Stochastic Optimization: Algorithms and Applications (Gainesville, FL, 2000). Appl. Optim. 54 223-264. Kluwer Academic, Dordrecht. · Zbl 0984.90033
[43] Needell, D., Ward, R. and Srebro, N. (2014). Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In Advances in Neural Information Processing Systems (NIPS) 1017-1025. · Zbl 1333.65070
[44] Nemirovski, A., Juditsky, A., Lan, G. and Shapiro, A. (2008). Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19 1574-1609. · Zbl 1189.90109
[45] Nemirovsky, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, New York. Translated from the Russian and with a preface by E. R. Dawson.
[46] Nesterov, Y. and Vial, J.-P. (2008). Confidence level solutions for stochastic programming. Automatica J. IFAC 44 1559-1568. · Zbl 1283.93314
[47] Pflug, G. C. (1986). Stochastic minimization with constant step-size: Asymptotic laws. SIAM J. Control Optim. 24 655-666. · Zbl 0594.90089
[48] Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30 838-855. · Zbl 0762.62022
[49] Priouret, P. and Veretennikov, A. Y. (1998). A remark on the stability of the LMS tracking algorithm. Stoch. Anal. Appl. 16 119-129. · Zbl 0907.93065
[50] Rakhlin, A., Shamir, O. and Sridharan, K. (2011). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning.
[51] Robbins, H. and Monro, S. (1951). A stochastic approximation method. Ann. Math. Stat. 22 400-407. · Zbl 0054.05901
[52] Ruppert, D. (1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering.
[53] Shalev-Shwartz, S., Shamir, O., Srebro, N. and Sridharan, K. (2009). Stochastic convex optimization. In Proceedings of the International Conference on Learning Theory (COLT).
[54] Shalev-Shwartz, S., Singer, Y. and Srebro, N. (2007). Pegasos: Primal estimated sub-GrAdient SOlver for SVM. In Proceedings of the International Conference on Machine Learning (ICML) 807-814. · Zbl 1211.90239
[55] Shamir, O. and Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the International Conference on Machine Learning.
[56] Stoer, J. and Bulirsch, R. (2002). Introduction to Numerical Analysis, 3rd ed. Texts in Applied Mathematics 12. Springer, New York. · Zbl 1004.65001
[57] Su, W. J. and Zhu, Y. (2018). Uncertainty quantification for online learning and stochastic approximation via hierarchical incremental gradient descent. ArXiv preprint. Available at arXiv:1802.04876.
[58] Tadic, V. B. and Doucet, A. (2017). Asymptotic bias of stochastic gradient search. Ann. Appl. Probab. 27 3255-3304. · Zbl 1387.49044
[59] Talay, D. and Tubaro, L. (1990). Expansion of the global error for numerical schemes solving stochastic differential equations. Stoch. Anal. Appl. 8 483-509. · Zbl 0718.60058
[60] Villani, C. (2009). Optimal Transport: Old and New. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences] 338. Springer, Berlin. · Zbl 1156.53003
[61] Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the International Conference on Machine Learning (ICML) 681-688.
[62] Zhu, D. · Zbl 0855.47043