
zbMATH — the first resource for mathematics

Minimizing finite sums with the stochastic average gradient. (English) Zbl 1358.90073
Math. Program. 162, No. 1-2 (A), 83-112 (2017); erratum ibid. 162, No. 1-2 (A), 113 (2017).
Summary: We analyze the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method’s iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values, the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from \(O(1/\sqrt{k})\) to \(O(1/k)\) in general, and when the sum is strongly-convex the convergence rate is improved from the sub-linear \(O(1/k)\) to a linear convergence rate of the form \(O(\rho^k)\) for \(\rho < 1\). Further, in many cases the convergence rate of the new method is also faster than that of black-box deterministic gradient methods, in terms of the number of gradient evaluations. This extends our earlier work [the second author et al., “A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets”, in: Advances in Neural Information Processing Systems. Proceedings of the first 12 conferences. Cambridge, MA: The MIT Press. 2663–2671 (2012)], which only led to a faster rate for well-conditioned strongly-convex problems. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies.
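The mechanism described in the summary, keeping a memory of the most recently evaluated gradient of each term and stepping along their running average, can be illustrated with a short sketch. The Python code below applies this idea to a simple least-squares sum; the constant step size alpha, the uniform sampling, and the toy problem are illustrative assumptions, not the authors' recommended configuration or implementation.

    import numpy as np

    def sag_least_squares(A, b, alpha=1e-2, iters=5000, seed=0):
        # Minimize f(x) = (1/n) * sum_i (a_i^T x - b_i)^2 with a SAG-style update.
        rng = np.random.default_rng(seed)
        n, p = A.shape
        x = np.zeros(p)
        y = np.zeros((n, p))   # memory of the last gradient computed for each term
        d = np.zeros(p)        # running sum of the stored gradients

        for _ in range(iters):
            i = rng.integers(n)                  # sample one term uniformly at random
            g = 2.0 * (A[i] @ x - b[i]) * A[i]   # fresh gradient of the i-th term only
            d += g - y[i]                        # swap the old stored gradient for the new one
            y[i] = g
            x -= (alpha / n) * d                 # step along the average of the stored gradients
        return x

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        A = rng.standard_normal((200, 5))
        x_true = rng.standard_normal(5)
        b = A @ x_true
        print(np.linalg.norm(sag_least_squares(A, b) - x_true))

Each iteration evaluates only one component gradient, as in SG methods, yet the direction d/n aggregates information from all n terms, which is what enables the linear rate quoted in the summary for strongly-convex sums.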

MSC:
90C06 Large-scale problems in mathematical programming
90C15 Stochastic programming
90C25 Convex programming
90C30 Nonlinear programming
65K05 Numerical mathematical programming methods
68Q25 Analysis of algorithms and problem complexity
62L20 Stochastic approximation
References:
[1] Agarwal, A., Bartlett, P.L., Ravikumar, P., Wainwright, M.J.: Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory 58(5), 3235-3249 (2012) · Zbl 1365.94132
[2] Bach, F., Moulines, E.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Adv. Neural Inf. Process. Syst. 773-781 (2013)
[3] Bach, F., Moulines, E.: Non-strongly-convex smooth stochastic approximation with convergence rate \(O(1/n)\). arXiv preprint (2013) · Zbl 1321.65016
[4] Bertsekas, DP, A new class of incremental gradient methods for least squares problems, SIAM J. Optim., 7, 913-926, (1997) · Zbl 0887.49025
[5] Blatt, D; Hero, AO; Gauchman, H, A convergent incremental gradient method with a constant step size, SIAM J. Optim., 18, 29-51, (2007) · Zbl 1154.90015
[6] Bordes, A; Bottou, L; Gallinari, P, SGD-QN: careful quasi-Newton stochastic gradient descent, J. Mach. Learn. Res., 10, 1737-1754, (2009) · Zbl 1235.68130
[7] Bottou, L., LeCun, Y.: Large scale online learning. Adv. Neural Inf. Process. Syst. 217-224 (2003)
[8] Carbonetto, P.: New probabilistic inference algorithms that harness the strengths of variational and Monte Carlo methods. Ph.D. thesis, University of British Columbia (2009) · Zbl 0762.62022
[9] Caruana, R; Joachims, T; Backstrom, L, KDD-cup 2004: results and analysis, ACM SIGKDD Newsl., 6, 95-108, (2004)
[10] Cauchy, A, Méthode générale pour la résolution des systèmes d’équations simultanées, Comptes rendus des séances de l’Académie des sciences de Paris, 25, 536-538, (1847)
[11] Collins, M; Globerson, A; Koo, T; Carreras, X; Bartlett, P, Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks, J. Mach. Learn. Res., 9, 1775-1822, (2008) · Zbl 1225.68167
[12] Cormack, G.V., Lynam, T.R.: Spam corpus creation for TREC. In: Proceedings of 2nd Conference on Email and Anti-Spam (2005). http://plg.uwaterloo.ca/~gvcormac/treccorpus/
[13] Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Adv. Neural Inf. Process. Syst. 1646-1654 (2014) · Zbl 1225.68167
[14] Delyon, B; Juditsky, A, Accelerated stochastic approximation, SIAM J. Optim., 3, 868-881, (1993) · Zbl 0801.62071
[15] Duchi, J; Hazan, E; Singer, Y, Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res., 12, 2121-2159, (2011) · Zbl 1280.68164
[16] Frank, A., Asuncion, A.: UCI machine learning repository (2010). http://archive.ics.uci.edu/ml · Zbl 0922.90131
[17] Friedlander, MP; Schmidt, M, Hybrid deterministic-stochastic methods for data fitting, SIAM J. Sci. Comput., 34, A1351-A1379, (2012) · Zbl 1262.90090
[18] Ghadimi, S., Lan, G.: Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization. Optim. Online (2010) · Zbl 1293.62167
[19] Gong, P., Ye, J.: Linear convergence of variance-reduced projected stochastic gradient without strong convexity. arXiv preprint (2014) · Zbl 1154.90015
[20] Guyon, I.: SIDO: A pharmacology dataset (2008). http://www.causality.inf.ethz.ch/data/SIDO.html
[21] Hazan, E., Kale, S.: Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. J. Mach. Learn. Res. Workshop Conf. Proc. 15, 2489-2512 (2011) · Zbl 1319.90050
[22] Hu, C., Kwok, J., Pan, W.: Accelerated gradient methods for stochastic optimization and online learning. Adv. Neural Inf. Process. Syst. 781-789 (2009)
[23] Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 315-323 (2013)
[24] Keerthi, S; DeCoste, D, A modified finite Newton method for fast solution of large scale linear SVMs, J. Mach. Learn. Res., 6, 341-361, (2005) · Zbl 1222.68231
[25] Kesten, H, Accelerated stochastic approximation, Ann. Math. Stat., 29, 41-59, (1958) · Zbl 0087.13404
[26] Konečný, J., Richtárik, P.: Semi-stochastic gradient descent methods. arXiv preprint (2013)
[27] Konečný, J., Qu, Z., Richtárik, P.: Semi-stochastic coordinate descent. arXiv preprint (2014)
[28] Kushner, H.J., Yin, G.: Stochastic Approximation and Recursive Algorithms and Applications, 2nd edn. Springer, New York (2003) · Zbl 1026.62084
[29] Lacoste-Julien, S., Jaggi, M., Schmidt, M., Pletscher, P.: Block-coordinate Frank-Wolfe optimization for structural SVMs. Int. Conf. Mach. Learn. J. Mach. Learn. Res. Workshop Conf. Proc. 28, 53-61 (2013) · Zbl 1242.62011
[30] Le Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Adv. Neural Inf. Process. Syst. 2663-2671 (2012) · Zbl 0922.90131
[31] Lewis, D; Yang, Y; Rose, T; Li, F, RCV1: a new benchmark collection for text categorization research, J. Mach. Learn. Res., 5, 361-397, (2004)
[32] Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. Adv. Neural Inf. Process. Syst. 3384-3392 (2015) · Zbl 1169.68052
[33] Liu, J., Chen, J., Ye, J.: Large-scale sparse logistic regression. In: ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2009)
[34] Mahdavi, M., Zhang, L., Jin, R.: Mixed optimization of smooth functions. Adv. Neural Inf. Process. Syst. 674-682 (2013) · Zbl 0054.05901
[35] Mairal, J.: Optimization with first-order surrogate functions. J. Mach. Learn. Res. Workshop Conf. Proc. 28, 783-791 (2013) · Zbl 0915.65061
[36] Martens, J.: Deep learning via Hessian-free optimization. Int. Conf. Mach. Learn. 735-742 (2010)
[37] Nedic, A., Bertsekas, D.: Convergence rate of incremental subgradient algorithms. In: Uryasev, S., Pardalos, P.M. (eds.) Stochastic Optimization: Algorithms and Applications, pp. 263-304. Kluwer Academic Publishers, Dordrecht (2000) · Zbl 0984.90033
[38] Needell, D., Srebro, N., Ward, R.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Adv. Neural Inf. Process. Syst. 1017-1025 (2014) · Zbl 1333.65070
[39] Nemirovski, A., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley, New York (1983)
[40] Nemirovski, A; Juditsky, A; Lan, G; Shapiro, A, Robust stochastic approximation approach to stochastic programming, SIAM J. Optim., 19, 1574-1609, (2009) · Zbl 1189.90109
[41] Nesterov, Y, A method for unconstrained convex minimization problem with the rate of convergence \({O}(1/k^2)\), Doklady AN SSSR, 269, 543-547, (1983)
[42] Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Dordrecht (2004) · Zbl 1086.90045
[43] Nesterov, Y, Smooth minimization of non-smooth functions, Math. Program., 103, 127-152, (2005) · Zbl 1079.90102
[44] Nesterov, Y, Primal-dual subgradient methods for convex problems, Math. Program., 120, 221-259, (2009) · Zbl 1191.90038
[45] Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. CORE Discussion Paper (2010) · Zbl 1257.90073
[46] Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer, New York (2006) · Zbl 1104.65059
[47] Polyak, BT; Juditsky, AB, Acceleration of stochastic approximation by averaging, SIAM J. Control Optim., 30, 838-855, (1992) · Zbl 0762.62022
[48] Rakhlin, A., Shamir, O., Sridharan, K.: Making gradient descent optimal for strongly convex stochastic optimization. Int. Conf. Mach. Learn. 449-456 (2012) · Zbl 1191.90038
[49] Robbins, H; Monro, S, A stochastic approximation method, Ann. Math. Stat., 22, 400-407, (1951) · Zbl 0054.05901
[50] Robert, C.P., Casella, G.: Monte Carlo Statistical Methods, 2nd edn. Springer, New York (2004) · Zbl 1096.62003
[51] Schmidt, M.: minFunc: unconstrained differentiable multivariate optimization in Matlab. https://www.cs.ubc.ca/~schmidtm/Software/minFunc.html (2005)
[52] Schmidt, M., Le Roux, N.: Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint (2013) · Zbl 1225.68167
[53] Schmidt, M., Babanezhad, R., Ahemd, M., Clifton, A., Sarkar, A.: Non-uniform stochastic average gradient method for training conditional random fields. Int. Conf. Artif. Intell. Stat. J. Mach. Learn. Res. Workshop Conf. Proc. 38, 819-828 (2015) · Zbl 1280.68164
[54] Shalev-Schwartz, S; Zhang, T, Stochastic dual coordinate ascent methods for regularized loss minimization, J. Mach. Learn. Res., 14, 567-599, (2013) · Zbl 1307.68073
[55] Shalev-Shwartz, S; Singer, Y; Srebro, N; Cotter, A, Pegasos: primal estimated sub-gradient solver for SVM, Math. Program., 127, 3-30, (2011) · Zbl 1211.90239
[56] Sohl-Dickstein, J., Poole, B., Ganguli, S.: Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. J. Mach. Learn. Res. Workshop Conf. Proc. 32 (2014) · Zbl 1253.15035
[57] Solodov, M, Incremental gradient algorithms with stepsizes bounded away from zero, Comput. Optim. Appl., 11, 23-35, (1998) · Zbl 0915.65061
[58] Srebro, N., Sridharan, K.: Theoretical basis for “more data less work”? NIPS Workshop on Computational Trade-offs in Statistical Learning (2011) · Zbl 1280.68164
[59] Strohmer, T; Vershynin, R, A randomized Kaczmarz algorithm with exponential convergence, J. Fourier Anal. Appl., 15, 262-278, (2009) · Zbl 1169.68052
[60] Sunehag, P., Trumpf, J., Vishwanathan, S., Schraudolph, N.: Variable metric stochastic approximation theory. J. Mach. Learn. Res. Workshop Conf. Proc. 5, 560-566 (2009) · Zbl 1222.68231
[61] Teo, C.H., Le, Q., Smola, A.J., Vishwanathan, S.V.N.: A scalable modular convex solver for regularized risk minimization. In: ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2007)
[62] Tseng, P, An incremental gradient(-projection) method with momentum term and adaptive stepsize rule, SIAM J. Optim., 8, 506-531, (1998) · Zbl 0922.90131
[63] Xiao, L, Dual averaging methods for regularized stochastic learning and online optimization, J. Mach. Learn. Res., 11, 2543-2596, (2010) · Zbl 1242.62011
[64] Xiao, L; Zhang, T, A proximal stochastic gradient method with progressive variance reduction, SIAM J. Optim., 24, 2057-2075, (2014) · Zbl 1321.65016
[65] Zhong, L., Kwok, J.: Fast stochastic alternating direction method of multipliers. J. Mach. Learn. Res. Workshop Conf. Proc. 65 (2014)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.