# zbMATH — the first resource for mathematics

Optimization methods for large-scale machine learning. (English) Zbl 1397.65085
This paper provides a review and commentary on the past, present, and future of numerical optimization algorithms in the context of machine learning applications. Through case studies on text classification and the training of deep neural networks, the authors discuss how optimization problems arise in machine learning and what makes them challenging. A main theme of the study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter. Based on this viewpoint, the paper presents a comprehensive theory of a straightforward yet versatile stochastic gradient algorithm, discusses its practical behavior, and highlights opportunities for designing algorithms with improved performance. This leads to a discussion of the next generation of optimization methods for large-scale machine learning, including an investigation of two main streams of research: techniques that diminish noise in the stochastic directions, and methods that make use of second-order derivative approximations.
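The stochastic gradient method at the heart of the paper can be illustrated in a few lines: instead of computing the full gradient of the empirical objective at each iteration, one samples a single component function and steps against its (unbiased) gradient estimate. The sketch below is an illustrative toy on a least-squares problem, not the authors' code; the fixed step size `alpha` and the sampling-by-permutation scheme are assumptions chosen for simplicity.

```python
import numpy as np

def sgd(grad_i, w0, n, alpha=0.01, epochs=50, seed=0):
    """Minimal stochastic gradient sketch: at each inner step, pick one
    component index i and move against the gradient of that component."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(epochs):
        # One pass over the data in a fresh random order per epoch.
        for i in rng.permutation(n):
            w -= alpha * grad_i(w, i)
    return w

# Toy problem: minimize (1/n) * sum_i (x_i^T w - y_i)^2 with noiseless data.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

# Gradient of a single squared-error term.
grad_i = lambda w, i: 2.0 * (X[i] @ w - y[i]) * X[i]

w_hat = sgd(grad_i, np.zeros(3), n=100)
```

Each iteration touches only one data point, which is precisely why the method scales to large training sets; the noise-reduction and second-order techniques surveyed in the paper aim to improve on this baseline's slow sublinear convergence.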

##### MSC:
65K05 Numerical mathematical programming methods
68Q25 Analysis of algorithms and problem complexity
68T05 Learning and adaptive systems in artificial intelligence
90C06 Large-scale problems in mathematical programming
90C30 Nonlinear programming
49Q10 Optimization of shapes other than minimal surfaces