Stochastic (approximate) proximal point methods: convergence, optimality, and adaptivity.

*(English)*Zbl 07105236##### MSC:

65K10 | Numerical optimization and variational techniques |

90C06 | Large-scale problems in mathematical programming |

##### References:

[1] | H. Asi and J. C. Duchi, Stochastic (Approximate) Proximal Point Methods: Convergence, Optimality, and Adaptivity, preprint, https://arxiv.org/abs/1810.05633, 2018. |

[2] | F. Bach and E. Moulines, Non-asymptotic analysis of stochastic approximation algorithms for machine learning, in Adv. Neural Inf. Process. Syst. 25, Curran Associates, Red Hook, NY, 2011, pp. 451–459, |

[3] | H. Bauschke and J. Borwein, On projection algorithms for solving convex feasibility problems, SIAM Rev., 38 (1996), pp. 367–426. · Zbl 0865.47039 |

[4] | M. Belkin, D. Hsu, and P. Mitra, Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate, in Adv. Neural. Inf. Process. Syst. 31, Curran Associates, Red Hook, NY, 2018, pp. 2300–2311. |

[5] | M. Belkin, A. Rakhlin, and A. B. Tsybakov, Does data interpolation contradict statistical optimality?, in Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res. 89, 2019, pp. 1611–1619; available at http://proceedings.mlr.press/v89/belkin19a.html. |

[6] | D. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, MA, 1999. |

[7] | D. P. Bertsekas, Stochastic optimization problems with nondifferentiable cost functionals, J. Optim. Theory Appl., 12 (1973), pp. 218–231. |

[8] | D. P. Bertsekas, Incremental proximal methods for large scale convex optimization, Math. Program., 129 (2011), pp. 163–195. |

[9] | P. Bianchi, Ergodic convergence of a stochastic proximal point algorithm, SIAM J. Optim., 26 (2016), pp. 2235–2260. · Zbl 1355.90062 |

[10] | L. Bottou and O. Bousquet, The tradeoffs of large scale learning, in Adv. Neural Inf. Process. Syst. 20, Curran Associates, Red Hook, NY, 2007, pp. 161–168. |

[11] | S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, 2004. |

[12] | J. Burke, Descent methods for composite nondifferentiable optimization problems, Math. Program., 33 (1985), pp. 260–279. · Zbl 0581.90084 |

[13] | J. Burke and M. Ferris, A Gauss–Newton method for convex composite optimization, Math. Program., 71 (1995), pp. 179–194. · Zbl 0846.90083 |

[14] | K. Crammer and Y. Singer, On the algorithmic implementation of multiclass kernel-based vector machines, J. Mach. Learn. Res., 2 (2001), pp. 265–292. · Zbl 1037.68110 |

[15] | D. Davis and D. Drusvyatskiy, Stochastic model-based minimization of weakly convex functions, SIAM J. Optim., 29 (2019), pp. 207–239. · Zbl 1415.65136 |

[16] | D. Davis, D. Drusvyatskiy, and C. Paquette, The Nonsmooth Landscape of Phase Retrieval, preprint, https://arxiv.org/abs/1711.03247, 2017. |

[17] | J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, Large scale distributed deep networks, in Adv. Neural Inf. Process. Syst. 25, Curran Associates, Red Hook, NY, 2012, pp. 1223–1231. |

[18] | D. Drusvyatskiy and A. Lewis, Error bounds, quadratic growth, and linear convergence of proximal methods, Math. Oper. Res., 43 (2018), pp. 919–948. |

[19] | J. C. Duchi and F. Ruan, Stochastic methods for composite and weakly convex optimization problems, SIAM J. Optim., 28 (2018), pp. 3229–3259. · Zbl 06989166 |

[20] | J. C. Duchi and F. Ruan, Asymptotic optimality in stochastic optimization, Ann. Statist., to appear. |

[21] | S. Ghadimi and G. Lan, Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, I: A generic algorithmic framework, SIAM J. Optim., 22 (2012), pp. 1469–1492. · Zbl 1301.62077 |

[22] | S. Ghadimi and G. Lan, Stochastic first- and zeroth-order methods for nonconvex stochastic programming, SIAM J. Optim., 23 (2013), pp. 2341–2368. · Zbl 1295.90026 |

[23] | T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed., Springer, New York, 2009. · Zbl 1273.62005 |

[24] | E. Hazan and S. Kale, An Optimal Algorithm for Stochastic Strongly Convex Optimization, preprint, http://arxiv.org/abs/1006.2425, 2011. |

[25] | J. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms I: Fundamentals, Grundlehren Math. Wiss. 305, Springer, New York, 1993. · Zbl 0795.49001 |

[26] | J. Hiriart-Urruty and C. Lemaréchal, Convex Analysis and Minimization Algorithms II: Advanced Theory and Bundle Methods, Grundlehren Math. Wiss. 306, Springer, New York, 1993. · Zbl 0795.49002 |

[27] | A. J. Hoffman, On approximate solutions of systems of linear inequalities, J. Res. Natl. Inst. Stand., 49 (1952), pp. 263–265. |

[28] | H. Hu and Q. Wang, On approximate solutions of infinite systems of linear inequalities, Linear Algebra Appl., 114–115 (1989), pp. 429–438. |

[29] | N. Karampatziakis and J. Langford, Online importance weight aware updates, in Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, AUAI Press, Arlington, VA, 2011, pp. 392–399. |

[30] | J. E. Kelley, The cutting-plane method for solving convex programs, J. Soc. Indust. Appl. Math., 8 (1960), pp. 703–712. · Zbl 0098.12104 |

[31] | B. Kulis and P. Bartlett, Implicit online learning, in Proceedings of the 27th International Conference on Machine Learning, Omnipress, Madison, WI, 2010, pp. 575–582. |

[32] | H. J. Kushner and G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed., Springer, New York, 2003. |

[33] | G. Lan, An optimal method for stochastic composite optimization, Math. Program., 133 (2012), pp. 365–397. · Zbl 1273.90136 |

[34] | L. Le Cam and G. L. Yang, Asymptotics in Statistics: Some Basic Concepts, Springer, New York, 2000. · Zbl 0952.62002 |

[35] | Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, 521 (2015), pp. 436–444. |

[36] | D. Leventhal and A. S. Lewis, Randomized methods for linear constraints: Convergence rates and conditioning, Math. Oper. Res., 35 (2010), pp. 641–654. · Zbl 1216.15006 |

[37] | A. S. Lewis, D. R. Luke, and J. Malick, Local linear convergence for alternating and averaged nonconvex projections, Found. Comput. Math., 9 (2009), pp. 485–513. · Zbl 1169.49030 |

[38] | S. Ma, R. Bassily, and M. Belkin, The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning, in Proceedings of the 35th International Conference on Machine Learning, Proc. Mach. Learn. Res. 80, 2018, pp. 3325–3334; available at http://proceedings.mlr.press/v80/. |

[39] | P. McCullagh and J. Nelder, Generalized Linear Models, Chapman and Hall, London, 1989. |

[40] | S. Mendelson, Learning without concentration, J. ACM, 62 (2015), Article no. 21. |

[41] | D. Needell and J. Tropp, Paved with good intentions: Analysis of a randomized block Kaczmarz method, Linear Algebra Appl., 441 (2014), pp. 199–221. · Zbl 1282.65042 |

[42] | D. Needell, R. Ward, and N. Srebro, Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm, in Adv. Neural Inf. Process. Syst. 27, Curran Associates, Red Hook, NY, 2014, pp. 1017–1025. |

[43] | A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim., 19 (2009), pp. 1574–1609. · Zbl 1189.90109 |

[44] | Y. Nesterov, Introductory Lectures on Convex Optimization, Kluwer Academic, Norwell, MA, 2004. |

[45] | J. Nocedal and S. J. Wright, Numerical Optimization, Springer, New York, 2006. |

[46] | A. Patrascu and I. Necoara, Nonasymptotic Convergence of Stochastic Proximal Point Algorithms for Constrained Convex Optimization, preprint, https://arxiv.org/abs/1706.06297, 2017. |

[47] | B. T. Polyak, Introduction to Optimization, Optimization Software, New York, 1987. |

[48] | B. T. Polyak and A. B. Juditsky, Acceleration of stochastic approximation by averaging, SIAM J. Control Optim., 30 (1992), pp. 838–855. · Zbl 0762.62022 |

[49] | H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951), pp. 400–407. · Zbl 0054.05901 |

[50] | H. Robbins and D. Siegmund, A convergence theorem for non-negative almost supermartingales and some applications, in Optimizing Methods in Statistics, Academic Press, New York, 1971, pp. 233–257. |

[51] | R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877–898. · Zbl 0358.90053 |

[52] | E. Ryu and S. Boyd, Stochastic Proximal Iteration: A Non-Asymptotic Improvement upon Stochastic Gradient Descent, preprint, http://www.math.ucla.edu/ eryu/papers/spi.pdf, 2014. |

[53] | M. Schmidt and N. Le Roux, Fast Convergence of Stochastic Gradient Descent under a Strong Growth Condition, preprint, https://arxiv.org/abs/1308.6370, 2013. |

[54] | S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, PEGASOS: Primal estimated sub-gradient solver for SVM, Math. Program., 127 (2011), pp. 3–30. · Zbl 1211.90239 |

[55] | A. Shapiro, D. Dentcheva, and A. Ruszczyński, Lectures on Stochastic Programming: Modeling and Theory, MPS-SIAM Ser. Optim., SIAM, Philadelphia, PA, 2009. · Zbl 1183.90005 |

[56] | T. Strohmer and R. Vershynin, A randomized Kaczmarz algorithm with exponential convergence, J. Fourier Anal. Appl., 15 (2009), pp. 262–278. · Zbl 1169.68052 |

[57] | C. H. Teo, S. Vishwanthan, A. J. Smola, and Q. V. Le, Bundle methods for regularized risk minimization, J. Mach. Learn. Res., 11 (2010), pp. 311–365. · Zbl 1242.68253 |

[58] | P. Toulis and E. Airoldi, Asymptotic and finite-sample properties of estimators based on stochastic gradients, Ann. Statist., 45 (2017), pp. 1694–1727. · Zbl 1378.62046 |

[59] | P. Toulis, D. Tran, and E. Airoldi, Towards stability and optimality in stochastic gradient descent, in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Proc. Mach. Learn. Res. 51, 2016, pp. 1290–1298; available at http://proceedings.mlr.press/v51/. |

[60] | L. N. Trefethen and D. Bau III, Numerical Linear Algebra, SIAM, Philadelphia, PA, 1997. |

[61] | A. W. van der Vaart, Asymptotic Statistics, Camb. Ser. Stat. Probab. Math., Cambridge University Press, Cambridge, 1998. |

[62] | R. Vershynin, Introduction to the non-asymptotic analysis of random matrices, in Compressed Sensing: Theory and Applications, Cambridge University Press, Cambridge, 2012, pp. 210–268. |

[63] | B. Widrow and M. E. Hoff, Adaptive Switching Circuits, in 1960 IRE WESCON Convention Record, Part 4, IRE, New York, 1960, pp. 96–104. |

[64] | C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding deep learning requires rethinking generalization, in Proceedings of the Fifth International Conference on Learning Representations, 2017; available at https://openreview.net/forum?id=Sy8gdB9xx. |

[65] | T. Zhang, Solving large scale linear prediction problems using stochastic gradient descent algorithms, in Proceedings of the Twenty-First International Conference on Machine Learning, ACM Press, New York, 2004, . |

[66] | M. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, in Proceedings of the Twentieth International Conference on Machine Learning, AAAI Press, Palo Alto, CA, 2003, pp. 928–935. |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.