zbMATH — the first resource for mathematics

Stochastic subgradient method converges on tame functions. (English) Zbl 1433.65141
Summary: This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a wide class of problems arising in data science—including all popular deep learning architectures.

65L20 Stability and convergence of numerical methods for ordinary differential equations
34A60 Ordinary differential inclusions
60H10 Stochastic ordinary differential equations (aspects of stochastic analysis)
PyTorch; TensorFlow
Full Text: DOI
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Benaïm, M.; Hofbauer, J.; Sorin, S., Stochastic approximations and differential inclusions, SIAM J. Control Optim., 44, 1, 328-348 (2005) · Zbl 1087.62091
[3] Benaïm, M.; Hofbauer, J.; Sorin, S., Stochastic approximations and differential inclusions, II. Applications. Math. Oper. Res., 31, 4, 673-695 (2006) · Zbl 1284.62511
[4] Bolte, J.; Daniilidis, A.; Lewis, As; Shiota, M., Clarke subgradients of stratifiable functions, SIAM Journal on Optimization, 18, 2, 556-572 (2007) · Zbl 1142.49006
[5] V.S. Borkar. Stochastic approximation. Cambridge University Press, Cambridge; Hindustan Book Agency, New Delhi, 2008. A dynamical systems viewpoint. · Zbl 1181.62119
[6] Borwein, Jm; Wang, X., Lipschitz func tions with maximal Clarke subdifferentials are generic, Proc. Amer. Math. Soc., 128, 11, 3221-3229 (2000) · Zbl 0963.49015
[7] H. Brézis. Opérateurs maximaux monotones et semi-groupes de contraction dans des espaces de Hilbert. North-Holland Math. Stud. 5, North-Holland, Amsterdam, 1973. · Zbl 0252.47055
[8] Bruck, Re Jr, Asymptotic convergence of nonlinear contraction semigroups in Hilbert space, J. Funct. Anal., 18, 15-26 (1975) · Zbl 0319.47041
[9] J.V. Burke, X. Chen, and H. Sun. Subdifferentiation and smoothing of nonsmooth integral functionals. Preprint, Optimization-Online, May 2017.
[10] Clarke, Fh, Generalized gradients and applications, Trans. Amer. Math. Soc., 205, 247-262 (1975) · Zbl 0307.26012
[11] F.H. Clarke. Optimization and nonsmooth analysis, volume 5 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, second edition, 1990. · Zbl 0696.49002
[12] F.H. Clarke, Y.S. Ledyaev, R.J. Stern, and P.R. Wolenski. Nonsmooth analysis and control theory, volume 178. Springer Science & Business Media, 2008. · Zbl 1047.49500
[13] M. Coste. An introduction to o-minimal geometry. RAAG Notes, 81 pages, Institut de Recherche Mathématiques de Rennes, November 1999.
[14] M. Coste. An Introduction to Semialgebraic Geometry. RAAG Notes, 78 pages, Institut de Recherche Mathématiques de Rennes, October 2002.
[15] D. Davis and D. Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. To Appear in SIAM J. Optim.,arXiv:1803.06523, 2018. · Zbl 1415.65136
[16] D. Davis and D. Drusvyatskiy. Stochastic subgradient method converges at the rate \({O}(k^{-1/4})\) on weakly convex functions. arXiv:1802.02988, 2018.
[17] A. Dembo. Probability theory: Stat310/math230 september 3, 2016. 2016. Available at http://statweb.stanford.edu/ adembo/stat-310b/lnotes.pdf.
[18] Drusvyatskiy, D.; Ioffe, Ad; Lewis, As, Curves of descent, SIAM J. Control Optim., 53, 1, 114-138 (2015) · Zbl 1345.26027
[19] J.C. Duchi and F. Ruan. Stochastic methods for composite optimization problems. Preprint arXiv:1703.08570, 2017. · Zbl 06989166
[20] Ghadimi, S.; Lan, G., Stochastic first- and zeroth-order methods for nonconvex stochastic programming, SIAM J. Optim., 23, 4, 2341-2368 (2013) · Zbl 1295.90026
[21] Ioffe, A. D., Critical values of set-valued maps with stratifiable graphs. Extensions of Sard and Smale-Sard theorems, Proceedings of the American Mathematical Society, 136, 9, 3111-3119 (2008) · Zbl 1191.49015
[22] Ioffe, Ad, An invitation to tame optimization, SIAM J. Optim., 19, 4, 1894-1917 (2008) · Zbl 1182.90083
[23] A.D. Ioffe. Variational analysis of regular mappings. Springer Monographs in Mathematics. Springer, Cham, 2017. Theory and applications. · Zbl 1381.49001
[24] S. Kakade and J.D. Lee. Provably correct automatic subdifferentiation for qualified programs. arXiv preprint arXiv:1809.08530, 2018.
[25] Khan, Kamil A.; Barton, Paul I., Evaluating an element of the Clarke generalized Jacobian of a composite piecewise differentiable function, ACM Transactions on Mathematical Software, 39, 4, 1-28 (2013) · Zbl 1295.65076
[26] Khan, Ka; Barton, Pi, A vector forward mode of automatic differentiation for generalized derivative evaluation, Optimization Methods and Software, 30, 6, 1185-1212 (2015) · Zbl 1329.49023
[27] H.J. Kushner and G.G. Yin. Stochastic approximation and recursive algorithms and applications, volume 35 of Applications of Mathematics (New York). Springer-Verlag, New York, second edition, 2003. Stochastic Modelling and Applied Probability. · Zbl 1026.62084
[28] S. Łojasiewicz. Ensemble semi-analytiques. IHES Lecture Notes, 1965.
[29] S. Majewski, B. Miasojedow, and E. Moulines. Analysis of nonsmooth stochastic approximation: the differential inclusion approach. Preprint arXiv:1805.01916, 2018.
[30] Nemirovski, A.; Juditsky, A.; Lan, G.; Shapiro, A., Robust stochastic approximation approach to stochastic programming, SIAM J. Optim., 19, 4, 1574-1609 (2008) · Zbl 1189.90109
[31] Nemirovsky, As; Yudin, Db, Problem complexity and method efficiency in optimization (1983), New York: Wiley, New York
[32] Nurminskii, Ea, Minimization of nondifferentiable functions in the presence of noise, Cybernetics, 10, 4, 619-621 (1974)
[33] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
[34] Robbins, H.; Monro, S., A stochastic approximation method, Ann. Math. Statistics, 22, 400-407 (1951) · Zbl 0054.05901
[35] R.T. Rockafellar. The theory of subgradients and its applications to problems of optimization, volume 1 of R & E. Heldermann Verlag, Berlin, 1981. · Zbl 0462.90052
[36] R.T. Rockafellar and R.J-B. Wets. Variational Analysis. Grundlehren der mathematischen Wissenschaften, Vol 317, Springer, Berlin, 1998.
[37] G.V. Smirnov. Introduction to the theory of differential inclusions, volume 41 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2002. · Zbl 0992.34001
[38] T. Tao. An introduction to measure theory, volume 126. American Mathematical Soc., 2011. · Zbl 1231.28001
[39] Van Den Dries, L.; Miller, C., Geometric categories and o-minimal structures, Duke Math. J., 84, 497-540 (1996) · Zbl 0889.03025
[40] Whitney, Hassler, A function not constant on a connected set of critical points, Duke Mathematical Journal, 1, 4, 514-517 (1935) · Zbl 0013.05801
[41] Wilkie, Aj, Model completeness results for expansions of the ordered field of real numbers by restricted Pfaffian functions and the exponential function, J. Amer. Math. Soc., 9, 4, 1051-1094 (1996) · Zbl 0892.03013
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.