
Stochastic heavy ball. (English) Zbl 1392.62244

Summary: This paper deals with a natural stochastic optimization procedure derived from the so-called Heavy-ball differential equation, which was introduced by B. T. Poljak in the 1960s with his seminal contribution [U.S.S.R. Comput. Math. Math. Phys. 4, No. 5, 1–17 (1967; Zbl 0147.35301); translation from Zh. Vychisl. Mat. Mat. Fiz. 4, 791–803 (1964)]. The Heavy-ball method is a second-order dynamics introduced to minimize convex functions \(f\). This family of second-order methods has recently received a large amount of attention, in particular since the famous contribution of Yu. E. Nesterov [Sov. Math., Dokl. 27, 372–376 (1983; Zbl 0535.90071); translation from Dokl. Akad. Nauk SSSR 269, 543–547 (1983)], spurred by the explosion of large-scale optimization problems. This work provides an in-depth description of the stochastic heavy-ball method, which is an adaptation of the deterministic one to the situation where only unbiased evaluations of the gradient are available and used throughout the iterations of the algorithm. We first describe some almost sure convergence results in the case of general non-convex coercive functions \(f\). We then examine the situation of convex and strongly convex potentials and derive some non-asymptotic results about the stochastic heavy-ball method. We end our study with limit theorems on several rescaled algorithms.
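To fix ideas, the dynamics at stake can be sketched as follows; this is a minimal sketch with assumed notation (\(a(t)\) and \(a_n\) denote a friction coefficient and its discretization, \((\gamma_n)\) a step-size sequence, and \(\Delta M_{n+1}\) a martingale increment modelling the gradient noise), and the paper's exact normalization of these sequences may differ. The heavy-ball dynamics is the second-order differential equation
\[
\ddot{x}_t + a(t)\,\dot{x}_t + \nabla f(x_t) = 0,
\]
which, rewritten as a first-order system in the position \(x\) and the velocity \(y\) and discretized by an Euler scheme in which the gradient is only accessible through the unbiased evaluation \(\nabla f(X_n) + \Delta M_{n+1}\), leads to a stochastic recursion of the form
\[
X_{n+1} = X_n + \gamma_{n+1} Y_n, \qquad
Y_{n+1} = Y_n - \gamma_{n+1}\bigl(a_n Y_n + \nabla f(X_n) + \Delta M_{n+1}\bigr).
\]
With constant step size \(\gamma\) and friction \(a\), the second recursion reads \(Y_{n+1} = \beta Y_n - \gamma(\nabla f(X_n) + \Delta M_{n+1})\) with \(\beta = 1 - \gamma a\), i.e., the familiar momentum form of stochastic gradient descent.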

MSC:

62L20 Stochastic approximation
35H10 Hypoelliptic equations
60F05 Central limit and other weak theorems
60J70 Applications of Brownian motions and diffusion theory (population genetics, absorption problems, etc.)
35P15 Estimates of eigenvalues in context of PDEs

References:

[1] F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15:595-627, 2014. · Zbl 1318.62224
[2] O. Brandière and M. Duflo. Les algorithmes stochastiques contournent-ils les pièges ? Annales de l’I.H.P. Probabilités et Statistiques, 32:395-427, 1996. · Zbl 0849.62043
[3] M. Benaïm. Dynamics of stochastic approximation algorithms. In Séminaire de Probabilités XXXIII, Lecture Notes in Mathematics. Springer-Verlag, 1999.
[4] M. Benaïm and M. W. Hirsch. Asymptotic pseudotrajectories and chain recurrent flows, with applications. J. Dynam. Differential Equations, 8:141-176, 1996. · Zbl 0878.58053
[5] P. Billingsley. Convergence of Probability Measures. Wiley Series in Probability and Statistics, New York, 1995.
[6] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford, 2013. With a foreword by M. Ledoux. · Zbl 1337.60003
[7] M. Benaïm, M. Ledoux, and O. Raimond. Self-interacting diffusions. Probab. Theory Related Fields, 122:1-41, 2002. · Zbl 1042.60060
[8] S. Burer and R. Monteiro. Local minima and convergence in low-rank semidefinite programming. Math. Program., 103(3, Ser. A):427-444, 2005. · Zbl 1099.90040
[9] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (NIPS), 2011.
[10] V. S. Borkar. Stochastic approximation with two time scales. Systems Control Lett., 29(5):291-294, 1997. · Zbl 0895.62085
[11] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, Cambridge; Hindustan Book Agency, New Delhi, 2008. · Zbl 1159.60002
[12] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177-186. Physica-Verlag/Springer, Heidelberg, 2010.
[13] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183-202, 2009. · Zbl 1175.94009
[14] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004. · Zbl 1058.90049
[15] A. Cabot, H. Engler, and S. Gadat. On the long time behavior of second order differential equations with asymptotically small dissipation. Trans. Amer. Math. Soc., 361(11):5983-6017, 2009. · Zbl 1191.34078
[16] A. Cabot, H. Engler, and S. Gadat. Second-order differential equations with asymptotically small dissipation and piecewise flat potentials. In Proceedings of the Seventh Mississippi State-UAB Conference on Differential Equations and Computational Simulations, volume 17 of Electron. J. Differ. Equ. Conf., pages 33-38. Southwest Texas State Univ., San Marcos, TX, 2009. · Zbl 1171.34323
[17] M. Duflo. Random iterative models, adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1997.
[18] S. Ethier and T. Kurtz. Markov Processes: Characterization and Convergence. John Wiley and Sons, New York, 1986. · Zbl 0592.60049
[19] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Res. Logist. Quart., 3:95-110, 1956.
[20] M. Freidlin and A. Wentzell. Random Perturbations of Dynamical Systems, volume 260 of Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]. Springer, Heidelberg, 1984. Translated from the 1979 Russian original by Joseph Szücs. · Zbl 0522.60055
[21] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23:2341-2368, 2013. · Zbl 1295.90026
[22] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program., 156:59-99, 2016. · Zbl 1335.62121
[23] S. Gadat, L. Miclo, and F. Panloup. A stochastic model for speculative bubbles. ALEA: Latin American Journal of Probability and Mathematical Statistics, 12:491-532, 2015. · Zbl 1339.60119
[24] S. Gadat and F. Panloup. Long time behaviour and stationary regime of memory gradient diffusions. Annales de l’Institut Henri Poincaré (B), 50:564-601, 2014. · Zbl 1299.60092
[25] S. Gadat and F. Panloup. Optimal non-asymptotic bound of the Ruppert-Polyak averaging without strong convexity. Preprint, 2018.
[26] S. Gadat and L. Younes. A stochastic algorithm for feature selection in pattern recognition. Journal of Machine Learning Research, 8:509-547, 2007. · Zbl 1222.68349
[27] P. Hartman. Ordinary Differential Equations. Classics in Applied Mathematics. Wiley, 1982. · Zbl 0476.34002
[28] A. Haraux. Systèmes dynamiques dissipatifs et applications. R.M.A. Masson, Paris, 1991.
[29] C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, 2009.
[30] P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Accelerating stochastic gradient descent. arXiv e-prints, April 2017.
[31] C. Jin, S. M. Kakade, and P. Netrapalli. Provable efficient online matrix completion via non-convex stochastic gradient descent. In NIPS, 2016.
[32] C. Jin, P. Netrapalli, and M. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. Preprint, 2017.
[33] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Ann. Math. Statist., 23:462-466, 1952. · Zbl 0049.36601
[34] H. J. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, second edition, 2003. · Zbl 1026.62084
[35] G. Lan. An optimal method for stochastic composite optimization. Math. Program., 133(1-2, Ser. A):365-397, 2012. · Zbl 1273.90136
[36] V. Lemaire. An adaptive scheme for the approximation of dissipative systems. Stochastic Processes and their Applications, 117(10):1491-1518, 2007. · Zbl 1126.65005
[37] J. Lee, M. Simchowitz, M. Jordan, and B. Recht. Gradient descent converges to minimizers. Preprint, 2016.
[38] P. Mertikopoulos and M. Staudigl. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, to appear, 2017. · Zbl 1387.90187
[39] J. C. Mattingly, A. M. Stuart, and D. J. Higham. Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise. Stochastic Process. Appl., 101(2):185-232, 2002. · Zbl 1075.60072
[40] S. Meyn and R. Tweedie. Stability of Markovian processes. III. Foster-Lyapunov criteria for continuous-time processes. Adv. in Appl. Probab., 25(3):518-548, 1993. · Zbl 0781.60053
[41] Y. Nesterov. A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Soviet Mathematics Doklady, 27(2):372-376, 1983. · Zbl 0535.90071
[42] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004.
[43] A. Nitanda. Accelerated stochastic gradient descent for minimizing finite sums. arXiv e-prints, June 2015.
[44] A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. John Wiley, 1983.
[45] R. Pemantle. Non-convergence to unstable points in urn models and stochastic approximations. Annals of Probability, 18:698-712, 1990. · Zbl 0709.60054
[46] B. T. Polyak and A. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30:838-855, 1992. · Zbl 0762.62022
[47] H. Poincaré. Mémoire sur les courbes définies par une équation différentielle (IV). Journal de Mathématiques Pures et Appliquées, 4:151-217, 1886.
[48] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4:1-17, 1964.
[49] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., 22:400-407, 1951. · Zbl 0054.05901
[50] D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report 781, Cornell University Operations Research and Industrial Engineering, 1988.
[51] K. R. Stromberg. Probability for Analysts. Chapman & Hall/CRC, New York, 1994. · Zbl 0803.60001
[52] D. W. Stroock and S. R. S. Varadhan. Multidimensional Diffusion Processes. Classics in Mathematics. Springer-Verlag, Berlin, 2006. Reprint of the 1997 edition. · Zbl 1103.60005
[53] C. Villani. Hypocoercivity. Mem. Amer. Math. Soc., 202(950), 2009. · Zbl 1197.35004
[54] W. Su, S. Boyd, and E. J. Candès. A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. Journal of Machine Learning Research, to appear, 2016. · Zbl 1391.90667
[55] T. Yang, Q. Lin, and Z. Li. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. Preprint, 2016.