
Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. (English) Zbl 1466.90065

Summary: In this paper we study several classes of stochastic optimization algorithms enriched with heavy ball momentum. Among the methods studied are: stochastic gradient descent, stochastic Newton, stochastic proximal point and stochastic dual subspace ascent. This is the first time momentum variants of several of these methods are studied. We choose to perform our analysis in a setting in which all of the above methods are equivalent: convex quadratic problems. We prove global non-asymptotic linear convergence rates for all methods and various measures of success, including primal function values, primal iterates, and dual function values. We also show that the primal iterates converge at an accelerated linear rate in a somewhat weaker sense. This is the first time a linear rate is shown for the stochastic heavy ball method (i.e., stochastic gradient descent method with momentum). Under somewhat weaker conditions, we establish a sublinear convergence rate for Cesàro averages of primal iterates. Moreover, we propose a novel concept, which we call stochastic momentum, aimed at decreasing the cost of performing the momentum step. We prove linear convergence of several stochastic methods with stochastic momentum, and show that in some sparse data regimes and for sufficiently small momentum parameters, these methods enjoy better overall complexity than methods with deterministic momentum. Finally, we perform extensive numerical testing on artificial and real datasets, including data coming from average consensus problems.
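To make the heavy ball update concrete, here is a minimal Python sketch of SGD with momentum in its randomized-Kaczmarz special case, i.e., applied to the convex quadratic associated with a consistent linear system Ax = b. This is an illustration only, not code from the paper: the function name, the uniform row sampling, and the default values of the step size omega and momentum parameter beta are assumptions made for the example.

```python
import numpy as np

def stochastic_heavy_ball_kaczmarz(A, b, omega=1.0, beta=0.4, iters=5000, seed=0):
    """Randomized Kaczmarz iteration with heavy ball momentum (illustrative sketch).

    Update, with row i sampled uniformly at random:
        x_{k+1} = x_k - omega * (a_i^T x_k - b_i) / ||a_i||^2 * a_i + beta * (x_k - x_{k-1})
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x_prev = np.zeros(n)
    x = np.zeros(n)
    for _ in range(iters):
        i = rng.integers(m)                                   # sample one row uniformly
        a_i = A[i]
        sgd_step = (a_i @ x - b[i]) / (a_i @ a_i) * a_i       # stochastic gradient step
        x_next = x - omega * sgd_step + beta * (x - x_prev)   # heavy ball momentum term
        x_prev, x = x, x_next
    return x

# Usage on a small consistent system (synthetic data for illustration):
A = np.random.randn(100, 20)
x_star = np.random.randn(20)
b = A @ x_star
x_hat = stochastic_heavy_ball_kaczmarz(A, b)
print(np.linalg.norm(x_hat - x_star))
```

The stochastic momentum variant proposed in the paper replaces the full momentum difference x_k - x_{k-1} by a cheaper sparse unbiased estimate of it; that modification is not shown in this sketch.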

MSC:

90C20 Quadratic programming
90C25 Convex programming
68Q25 Analysis of algorithms and problem complexity
68W20 Randomized algorithms
68W40 Analysis of algorithms
65Y20 Complexity and performance of numerical algorithms
90C15 Stochastic programming
15A06 Linear equations (linear algebraic aspects)
15B52 Random matrices (algebraic aspects)
65F10 Iterative numerical methods for linear systems
