×

Decentralized and parallel primal and dual accelerated methods for stochastic convex programming problems. (English) Zbl 1472.90087

Summary: We introduce primal and dual stochastic gradient oracle methods for decentralized convex optimization problems. Both for primal and dual oracles, the proposed methods are optimal in terms of the number of communication steps. However, for all classes of the objective, the optimality in terms of the number of oracle calls per node takes place only up to a logarithmic factor and the notion of smoothness. By using mini-batching technique, we show that the proposed methods with stochastic oracle can be additionally parallelized at each node. The considered algorithms can be applied to many data science problems and inverse problems.

MSC:

90C25 Convex programming
90C06 Large-scale problems in mathematical programming
90C15 Stochastic programming

References:

[1] Z. Allen-Zhu, Katyusha: The first direct acceleration of stochastic gradient methods, STOC’17—Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, ACM, New York (2017), 1200-1205. · Zbl 1369.68273
[2] Z. Allen-Zhu, How to make the gradients small stochastically: Even faster convex and nonconvex SGD, Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Neural Information Processing Systems Foundation, San Diego (2018), 1157-1167.
[3] Z. Allen-Zhu and E. Hazan, Optimal black-box reductions between optimization objectives, Advances in Neural Information Processing Systems 29 (NeurIPS 2016), Neural Information Processing Systems Foundation, San Diego (2016), 1614-1622.
[4] A. S. Anikin, A. V. Gasnikov, P. E. Dvurechensky, A. I. Tyurin and A. V. Chernov, Dual approaches to the minimization of strongly convex functionals with a simple structure under affine constraints, Comput. Math. Math. Phys. 57 (2017), no. 8, 1262-1276. · Zbl 1380.49046
[5] Y. Arjevani and O. Shamir, Communication complexity of distributed convex learning and optimization, Advances in Neural Information Processing Systems 28 (NeurIPS 2015), Neural Information Processing Systems Foundation, San Diego (2015), 1756-1764.
[6] A. d’Aspremont, Smooth optimization with approximate gradient, SIAM J. Optim. 19 (2008), no. 3, 1171-1183. · Zbl 1180.90378
[7] A. Ben-Tal and A. Nemirovski, Lectures on Modern Convex Optimization. Analysis, Algorithms, and Engineering Applications, MPS/SIAM Ser. Optim., Society for Industrial and Applied Mathematics, Philadelphia, 2001. · Zbl 0986.90032
[8] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, Upper Saddle River, 1989. · Zbl 0743.65107
[9] A. Beznosikov, E. Gorbunov and A. Gasnikov, Derivative-free method for decentralized distributed non-smooth optimization, preprint (2019), https://arxiv.org/abs/1911.10645.
[10] C. L. Byrne, Iterative Optimization in Inverse Problems, Monogr. Research Notes Math., CRC Press, Boca Raton, 2014. · Zbl 1285.65035
[11] Y. Carmon and J. Duchi, Gradient descent finds the cubic-regularized nonconvex Newton step, SIAM J. Optim. 29 (2019), no. 3, 2146-2178. · Zbl 1461.65135
[12] A. Chernov, P. Dvurechensky and A. Gasnikov, Fast primal-dual gradient method for strongly convex minimization problems with linear constraints, Discrete Optimization and Operations Research, Lecture Notes in Comput. Sci. 9869, Springer, Cham (2016), 391-403. · Zbl 1391.90471
[13] M. B. Cohen, J. Diakonikolas and L. Orecchia, On acceleration with noise-corrupted gradients, preprint (2018), https://arxiv.org/abs/1805.12591.
[14] O. Devolder, Exactness, inexactness and stochasticity in first-order methods for large-scale convex optimization, PhD thesis, ICTEAM and CORE, Université Catholique de Louvain, 2013.
[15] O. Devolder, F. Glineur and Y. Nesterov, First-order methods of smooth convex optimization with inexact oracle, Math. Program. 146 (2014), no. 1-2, 37-75. · Zbl 1317.90196
[16] D. Dvinskikh, E. Gorbunov, A. Gasnikov, P. Dvurechensky and C. A. Uribe, On dual approach for distributed stochastic convex optimization over networks, preprint (2019), https://arxiv.org/abs/1903.09844.
[17] D. Dvinskikh, A. Turin, A. Gasnikov and S. Omelchenko, Accelerated and Unaccelerated Stochastic Gradient Descent in Model Generality, Mat. Zametki 108 (2020), no. 4, 515-528. · Zbl 1452.90226
[18] P. Dvurechensky, D. Dvinskikh, A. Gasnikov, C. A. Uribe and A. Nedich, Decentralize and randomize: Faster algorithm for wasserstein barycenters, Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Neural Information Processing Systems Foundation, San Diego (2018), 10760-10770.
[19] P. Dvurechensky and A. Gasnikov, Stochastic intermediate gradient method for convex problems with stochastic inexact oracle, J. Optim. Theory Appl. 171 (2016), no. 1, 121-145. · Zbl 1351.90150
[20] P. Dvurechensky, A. Gasnikov and A. Lagunovskaya, Parallel algorithms and probability of large deviation for stochastic convex optimization problems, Numer. Anal. Appl. 11 (2018), no. 1, 33-37. · Zbl 1399.90205
[21] P. Dvurechensky, A. Gasnikov and A. Tiurin, Randomized similar triangles method: A unifying framework for accelerated randomized optimization methods (coordinate descent, directional search, derivative-free method), preprint (2017), https://arxiv.org/abs/1707.08486.
[22] P. Dvurechensky, E. Gorbunov and A. Gasnikov, An accelerated directional derivative method for smooth stochastic convex optimization, European J. Oper. Res. 290 (2021), no. 2, 601-621. · Zbl 1487.90524
[23] A. Fallah, M. Gurbuzbalaban, A. Ozdaglar, U. Simsekli and L. Zhu, Robust distributed accelerated stochastic gradient methods for multi-agent networks, preprint (2019), https://arxiv.org/abs/1910.08701.
[24] D. Foster, A. Sekhari, O. Shamir, N. Srebro, K. Sridharan and B. Woodworth, The complexity of making the gradient small in stochastic convex optimization, preprint (2019), https://arxiv.org/abs/1902.04686.
[25] Y. Gao and T. Blumensath, Distributed computation of linear inverse problems with application to computed tomography, preprint (2017), https://arxiv.org/abs/1709.00953.
[26] A. Gasnikov, Universal gradient descent, preprint (2017), https://arxiv.org/abs/1711.00394.
[27] A. Gasnikov, S. Kabanikhin, A. Mohammed and M. Shishlenin, Convex optimization in hilbert space with applications to inverse problems, preprint (2017), https://arxiv.org/abs/1703.00267.
[28] A. Gasnikov and Y. Nesterov, Universal method for stochastic composite optimization problems, Comput. Math. Math. Phys. 58 (2018), no. 1, 48-64. · Zbl 1457.90099
[29] S. Ghadimi and G. Lan, Stochastic first- and zeroth-order methods for nonconvex stochastic programming, SIAM J. Optim. 23 (2013), no. 4, 2341-2368. · Zbl 1295.90026
[30] A. Godichon-Baggioni and S. Saadane, On the rates of convergence of parallelized averaged stochastic gradient algorithms, Statistics 54 (2020), no. 3, 618-635. · Zbl 1440.62313
[31] E. Gorbunov, D. Dvinskikh and A. Gasnikov, Optimal decentralized distributed algorithms for stochastic convex optimization, preprint (2019), https://arxiv.org/abs/1911.07363.
[32] V. Guigues, A. Juditsky and A. Nemirovski, Non-asymptotic confidence bounds for the optimal value of a stochastic program, Optim. Methods Softw. 32 (2017), no. 5, 1033-1058. · Zbl 1386.90091
[33] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning. Data Mining, Inference, and Prediction, 2nd ed., Springer Ser. Statist., Springer, New York, 2009. · Zbl 1273.62005
[34] H. Hendrikx, F. Bach and L. Massoulié, Accelerated decentralized optimization with local updates for smooth and strongly convex objectives, preprint (2018), https://arxiv.org/abs/1810.02660.
[35] H. Hendrikx, F. Bach and L. Massoulié, An accelerated decentralized stochastic proximal algorithm for finite sums, preprint (2019), https://arxiv.org/abs/1905.11394.
[36] H. Hendrikx, F. Bach and L. Massoulié, Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums, preprint (2019), https://arxiv.org/abs/1901.09865.
[37] H. Hendrikx, F. Bach and L. Massoulié, An optimal algorithm for decentralized finite sum optimization, preprint (2020), https://arxiv.org/abs/2005.10675.
[38] H. Hendrikx, F. Bach and L. Massoulié, Dual-free stochastic decentralized optimization with variance reduction, preprint (2020), https://arxiv.org/abs/2006.14384.
[39] H. Hendrikx, L. Xiao, S. Bubeck, F. Bach and L. Massoulié, Statistically preconditioned accelerated gradient method for distributed optimization, preprint (2020), https://arxiv.org/abs/2002.10726.
[40] A. Ivanova, D. Grishchenko, A. Gasnikov and E. Shulgin, Adaptive catalyst for smooth convex optimization, preprint (2019), https://arxiv.org/abs/1911.11271.
[41] C. Jin, P. Netrapalli, R. Ge, S. M. Kakade and M. I. Jordan, A short note on concentration inequalities for random vectors with subgaussian norm, preprint (2019), https://arxiv.org/abs/1902.03736.
[42] S. Kakade, S. Shalev-Shwartz and A. Tewari, On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization, unpublished manuscript (2009), http://ttic.uchicago.edu/shai/papers/KakadeShalevTewari09.pdf.
[43] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich and A. T. Suresh, Scaffold: Stochastic controlled averaging for federated learning, preprint (2019), https://arxiv.org/abs/1910.06378.
[44] V. M. Kibardin, Decomposition into functions in the minimization problem, Avtom. Telem. 1979 (1979), no. 9, 66-79. · Zbl 0428.49026
[45] D. Kim and J. A. Fessler, Optimized first-order methods for smooth convex minimization, Math. Program. 159 (2016), no. 1-2, 81-107. · Zbl 1345.90113
[46] A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi and S. U. Stich, A unified theory of decentralized SGD with changing topology and local updates, Proceedings of the 37th International Conference on Machine Learning. ICML 2020, ICML, San Diego (2020), 5381-5393; https://arxiv.org/abs/2003.10422.
[47] D. Kovalev, A. Salim and P. Richtarik, Optimal and practical algorithms for smooth and strongly convex decentralized optimization, preprint (2020), https://arxiv.org/abs/2006.11773.
[48] A. Kulunchakov and J. Mairal, A generic acceleration framework for stochastic composite optimization, preprint (2019), https://arxiv.org/abs/1906.01164.
[49] A. Kulunchakov and J. Mairal, Estimate sequences for stochastic composite optimization: Variance reduction, acceleration, and robustness to noise, preprint (2019), https://arxiv.org/abs/1901.08788. · Zbl 1527.90160
[50] A. Kulunchakov and J. Mairal, Estimate sequences for variance-reduced stochastic composite optimization, preprint (2019), https://arxiv.org/abs/1905.02374.
[51] G. Lan, Gradient sliding for composite optimization, Math. Program. 159 (2016), no. 1-2, 201-235. · Zbl 1346.90667
[52] G. Lan, Lectures on optimization methods for machine learning, Lecture notes (2019), http://pwp.gatech.edu/guanghui-lan/wp-content/uploads/sites/330/2019/08/LectureOPTML.pdf.
[53] G. Lan, S. Lee and Y. Zhou, Communication-efficient algorithms for decentralized and stochastic optimization, Math. Program. 180 (2020), no. 1-2, 237-284. · Zbl 1437.90125
[54] G. Lan and Y. Zhou, An optimal randomized incremental gradient method, Math. Program. 171 (2018), no. 1-2, 167-215. · Zbl 1432.90115
[55] G. Lan and Y. Zhou, Random gradient extrapolation for distributed and stochastic optimization, SIAM J. Optim. 28 (2018), no. 4, 2753-2782. · Zbl 1401.90156
[56] H. Li, C. Fang, W. Yin and Z. Lin, A sharp convergence rate analysis for distributed accelerated gradient methods, preprint (2018), https://arxiv.org/abs/1810.01053.
[57] H. Li and Z. Lin, Revisiting EXTRA for smooth distributed optimization, SIAM J. Optim. 30 (2020), no. 3, 1795-1821. · Zbl 1447.90030
[58] H. Li, Z. Lin and Y. Fang, Optimal accelerated variance reduced extra and diging for strongly convex and smooth decentralized optimization, preprint (2020), https://arxiv.org/abs/2009.04373.
[59] H. Lin, J. Mairal and Z. Harchaoui, A universal catalyst for first-order optimization, Proceedings of the 28th International Conference on Neural Information Processing Systems - NIPS’15, MIT Press, Cambridge (2015), 3384-3392.
[60] B. Mathieu, T. Adrien and B. Francis, Principled analyses and design of first-order methods with proximal inexact proximal operator, preprint (2020), https://arxiv.org/abs/2006.06041.
[61] H. B. McMahan, E. Moore, D. Ramage, S. Hampson and B. Agüera y Arcas, Communication-efficient learning of deep networks from decentralized data, preprint (2016), https://arxiv.org/abs/1602.05629.
[62] A. Nedić, Distributed optimization over networks, Multi-Agent Optimization, Lecture Notes in Math. 2224, Springer, Cham (2018), 1-84. · Zbl 1461.90160
[63] A. Nedić, Distributed gradient methods for convex machine learning problems in networks: Distributed optimization, IEEE Signal Process. Mag. 37 (2020), no. 3, 92-101.
[64] A. Nedić, A. Olshevsky and C. A. Uribe, Graph-theoretic analysis of belief system dynamics under logic constraints, preprint (2018), https://arxiv.org/abs/1810.02456.
[65] A. Nemirovski, S. Onn and U. G. Rothblum, Accuracy certificates for computational problems with convex structure, Math. Oper. Res. 35 (2010), no. 1, 52-78. · Zbl 1216.90067
[66] Y. Nesterov, Smooth minimization of non-smooth functions, Math. Program. 103 (2005), no. 1, 127-152. · Zbl 1079.90102
[67] Y. Nesterov, Primal-dual subgradient methods for convex problems, Math. Program. 120 (2009), no. 1, 221-259. · Zbl 1191.90038
[68] Y. Nesterov, Introduction to Convex Optimization, MCCME, Moscow, 2010.
[69] Y. Nesterov, How to make the gradients small, Optima 88 (2012), 10-11.
[70] Y. Nesterov, Gradient methods for minimizing composite functions, Math. Program. 140 (2013), no. 1, 125-161. · Zbl 1287.90067
[71] Y. Nesterov, Universal gradient methods for convex optimization problems, Math. Program. 152 (2015), no. 1-2, 381-404. · Zbl 1327.90216
[72] Y. Nesterov, Implementable tensor methods in unconstrained convex optimization, CORE Discussion Paper 2018/05, CORE UCL, 2018. · Zbl 1459.90157
[73] Y. Nesterov, Lectures on Convex Optimization, Springer Optim. Appl. 137, Springer, Cham, 2018. · Zbl 1427.90003
[74] Y. Nesterov and S. U. Stich, Efficiency of the accelerated coordinate descent method on structured optimization problems, SIAM J. Optim. 27 (2017), no. 1, 110-123. · Zbl 1359.90073
[75] A. Olshevsky, I. C. Paschalidis and S. Pu, Asymptotic network independence in distributed optimization for machine learning, preprint (2019), https://arxiv.org/abs/1906.12345.
[76] A. Olshevsky, I. C. Paschalidis and S. Pu, A non-asymptotic analysis of network independence for distributed stochastic gradient descent, preprint (2019), https://arxiv.org/abs/1906.02702.
[77] B. T. Poljak, Iterative algorithms for singular minimization problems, Nonlinear Programming 4 (Madison 1980), Academic Press, New York (1981), 147-166. · Zbl 0546.90078
[78] B. T. Polyak, Introduction to Optimization, Transl. Ser. Math. Eng., Optimization Software, New York, 1987.
[79] R. T. Rockafellar, Convex Analysis, Princeton Math. Ser. 28, Princeton University Press, Princeton, 1970. · Zbl 0202.14303
[80] A. Rogozin and A. Gasnikov, Projected gradient method for decentralized optimization over time-varying networks, preprint (2019), https://arxiv.org/abs/1911.08527.
[81] A. Rogozin and A. Gasnikov, Penalty-based method for decentralized optimization over time-varying graphs, Optimization and Applications: Proceedings of the 11th International Conference. OPTIMA 2020 (Moscow 2020), Springer, Cham (2020), 239-256. · Zbl 1506.90075
[82] A. Rogozin, V. Lukoshkin, A. Gasnikov, D. Kovalev and E. Shulgin, Towards accelerated rates for distributed optimization over time-varying networks, preprint (2020), https://arxiv.org/abs/2009.11069.
[83] A. Rogozin, C. A. Uribe, A. V. Gasnikov, N. Malkovsky and A. Nedić, Optimal distributed convex optimization on slowly time-varying graphs, IEEE Trans. Control Netw. Syst. 7 (2020), no. 2, 829-841. · Zbl 1516.93098
[84] K. Scaman, F. Bach, S. Bubeck, Y. T. Lee and L. Massoulié, Optimal algorithms for smooth and strongly convex distributed optimization in networks, Proceedings of the 34th International Conference on Machine Learning. ICML 2017 (Sysney 2017), ICML, San Diego (2017), 3027-3036.
[85] K. Scaman, F. Bach, S. Bubeck, L. Massoulié and Y. T. Lee, Optimal algorithms for non-smooth distributed optimization in networks, Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Neural Information Processing Systems Foundation, San Diego (2018), 2745-2754.
[86] S. Shalev-Shwartz, O. Shamir, N. Srebro and K. Sridharan, Stochastic convex optimization, Proceedings of the Conference on Learning Theory (COLT), COLT (2009), https://www.cs.mcgill.ca/ colt2009/papers/018.pdf.
[87] S. Shalev-Shwartz and T. Zhang, Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization, Proceedings of the 34th International Conference on Machine Learning. ICML 2014 (Bejing 2014), ICML, San Diego (2014), (2014), 64-72.
[88] O. Shamir and S. Shalev-Shwartz, Matrix completion with the trace norm: learning, bounding, and transducing, J. Mach. Learn. Res. 15 (2014), 3401-3423. · Zbl 1318.68152
[89] A. Shapiro, D. Dentcheva and A. Ruszczyński, Lectures on Stochastic Programming. Modeling and Theory, MPS/SIAM Ser. Optimiz. 9, Society for Industrial and Applied Mathematics, Philadelphia, 2009. · Zbl 1183.90005
[90] V. Spokoiny, Parametric estimation. Finite sample theory, Ann. Statist. 40 (2012), no. 6, 2877-2909. · Zbl 1296.62051
[91] S. Sra, Tractable optimization in machine learning, Tractability, Cambridge University Press, Cambridge (2014), 202-230.
[92] F. Stonyakin, D. Dvinskikh, P. Dvurechensky, A. Kroshnin, O. Kuznetsova, A. Agafonov, A. Gasnikov, A. Tyurin, C. A. Uribe, D. Pasechnyuk and S. Artamonov, Gradient methods for problems with inexact model of the objective, International Conference on Mathematical Optimization Theory and Operations Research, Lecture Notes in Comput. Sci. 11548, Springer, Cham (2019), 97-114. · Zbl 1437.90126
[93] F. Stonyakin, A. Gasnikov, A. Tyurin, D. Pasechnyuk, A. Agafonov, P. Dvurechensky, D. Dvinskikh and V. Piskunova, Inexact model: A framework for optimization and variational inequalities, preprint (2019), https://arxiv.org/abs/1902.00990.
[94] F. Stonyakin, A. Stepanov, A. Gasnikov and A. Titov, Mirror descent for constrained optimization problems with large subgradient values of functional constraints, Comput. Res. Modell. 12 (2020), no. 2, 301-317.
[95] H. Sun and M. Hong, Distributed non-convex first-order optimization and information processing: Lower complexity bounds and rate optimal algorithms, IEEE Trans. Signal Process. 67 (2019), no. 22, 5912-5928.
[96] J. Tang, K. Egiazarian, M. Golbabaee and M. Davies, The practicality of stochastic optimization in imaging inverse problems, IEEE Trans. Comput. Imaging 6 (2020), 1471-1485.
[97] C. A. Uribe, D. Dvinskikh, P. Dvurechensky, A. Gasnikov and A. Nedić, Distributed computation of Wasserstein barycenters over networks, 2018 IEEE 57th Annual Conference on Decision and Control (CDC), IEEE Press, Piscataway (2018), 6544-6549.
[98] C. R. Vogel, Computational Methods for Inverse Problems, Front. Appl. Math. 23, Society for Industrial and Applied Mathematics, Philadelphia, 2002. · Zbl 1008.65103
[99] B. E. Woodworth, K. K. Patel, S. U. Stich, Z. Dai, B. Bullins, H. B. McMahan, O. Shamir and N. Srebro, Is local SGD better than minibatch SGD?, preprint (2020), https://arxiv.org/abs/2002.07839.
[100] B. E. Woodworth, K. K. Patel and N. Srebro, Minibatch vs local SGD for heterogeneous distributed learning, preprint (2020), https://arxiv.org/abs/2006.04735.
[101] B. E. Woodworth and N. Srebro, Tight complexity bounds for optimizing composite objectives, Advances in Neural Information Processing Systems 29 (NeurIPS 2016), Neural Information Processing Systems Foundation, San Diego (2016), 3639-3647.
[102] B. E. Woodworth, J. Wang, A. Smith, B. McMahan and N. Srebro, Graph oracle models, lower bounds, and gaps for parallel stochastic optimization, Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Neural Information Processing Systems Foundation, San Diego (2018), 8505-8515.
[103] J. Xu, Y. Tian, Y. Sun and G. Scutari, Accelerated primal-dual algorithms for distributed smooth convex optimization over networks, preprint (2019), https://arxiv.org/abs/1910.10666.
[104] H. Ye, L. Luo, Z. Zhou and T. Zhang, Multi-consensus decentralized accelerated gradient descent, preprint (2020), https://arxiv.org/abs/2005.00797.
[105] N. Ye, F. Roosta-Khorasani and T. Cui, Optimization methods for inverse problems, 2017 MATRIX Annals, MATRIX Book Ser. 2, Springer, Cham (2019), 121-140.
[106] H. Yuan and T. Ma, Federated accelerated stochastic gradient descent, preprint (2020), https://arxiv.org/abs/2006.08950.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.