zbMATH — the first resource for mathematics

The error-feedback framework: SGD with delayed gradients. (English) Zbl 07307490
Summary: We analyze (stochastic) gradient descent (SGD) with delayed updates on smooth quasi-convex and non-convex functions and derive concise, non-asymptotic, convergence rates. We show that the rate of convergence in all cases consists of two terms: (i) a stochastic term which is not affected by the delay, and (ii) a higher order deterministic term which is only linearly slowed down by the delay. Thus, in the presence of noise, the effects of the delay become negligible after a few iterations and the algorithm converges at the same optimal rate as standard SGD. This result extends a line of research that showed similar results in the asymptotic regime or for strongly-convex quadratic functions only. < < We further show similar results for SGD with more intricate form of delayed gradients – compressed gradients under error compensation and for local SGD where multiple workers perform local steps before communicating with each other. In all of these settings, we improve upon the best known rates. < < These results show that SGD is robust to compressed and/or delayed stochastic gradient updates. This is in particular important for distributed parallel implementations, where asynchronous and communication efficient methods are the key to achieve linear speedups for optimization with multiple devices.
68T05 Learning and adaptive systems in artificial intelligence
Full Text: Link
[1] Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. InAdvances in Neural Information Processing Systems 24, pages 873-881. Curran Associates, Inc., 2011.
[2] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. InAdvances in Neural Information Processing Systems 30, pages 1709-1720. Curran Associates, Inc., 2017.
[3] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cedric Renggli. The convergence of sparsified gradient methods. InAdvances in Neural Information Processing Systems 31, pages 5977-5987. Curran Associates, Inc., 2018.
[4] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. InAdvances in Neural Information Processing Systems 28, pages 1756- 1764. Curran Associates, Inc., 2015.
[5] Yossi Arjevani, Yair Carmon, John C. Duchi, Dylan J. Foster, Nathan Srebro, and Blake Woodworth.Lower bounds for non-convex stochastic optimization.arXiv preprint arXiv:1902.04686, 2019.
[6] Yossi Arjevani, Ohad Shamir, and Nathan Srebro. A tight convergence analysis for stochastic gradient descent with delayed updates. InProceedings of the 31st International Conference on Algorithmic Learning Theory, volume 117 ofPMLR, pages 111-132, 2020.
[7] Francis R. Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. InAdvances in Neural Information Processing Systems 24, pages 451-459. Curran Associates, Inc., 2011.
[8] D.P. Bertsekas and J.N. Tsitsiklis.Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.
[9] L. Bottou, F. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning.SIAM Review, 60(2):223-311, 2018. doi: 10.1137/16M1080173.
[10] L´eon Bottou.Large-scale machine learning with stochastic gradient descent.In Yves Lechevallier and Gilbert Saporta, editors,Proceedings of COMPSTAT’2010, pages 177- 186, Heidelberg, 2010. Physica-Verlag HD.
[11] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i.Mathematical Programming, pages 1-50, 2017.
[12] Sorathan Chaturapruek, John C Duchi, and Christopher R´e.Asynchronous stochastic convex optimization: the noise is in the noise and SGD don’t care. InAdvances in Neural Information Processing Systems 28, pages 1531-1539. Curran Associates, Inc., 2015.
[13] Jean-Baptiste Cordonnier. Convex optimization using sparsified stochastic gradient descent with memory. Master thesis (Adv: S. U. Stich, M. Jaggi), EPFL, 2018.
[14] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches.Journal of Machine Learning Research, 13(1):165-202, January 2012. ISSN 1532-4435.
[15] Aymeric Dieuleveut, Nicolas Flammarion, and Francis Bach. Harder, better, faster, stronger convergence rates for least-squares regression.Journal of Machine Learning Research, 18 (101):1-51, 2017.
[16] H. R. Feyzmahdavian, A. Aytekin, and M. Johansson. An asynchronous mini-batch algorithm for regularized stochastic optimization.IEEE Transactions on Automatic Control, 61(12):3740-3754, Dec 2016. ISSN 0018-9286.
[17] Saeed Ghadimi and Guanghui Lan.Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469-1492, 2012.
[18] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341-2368, 2013.
[19] Antoine Godichon-Baggioni and Sofiane Saadane. On the rates of convergence of parallelized averaged stochastic gradient algorithms.arXiv preprint arXiv:1710.07926, 2017.
[20] Robert M. Gower, Peter Richt´arik, and Francis Bach. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching.arXiv preprint arXiv:1805.02632, 2018.
[21] Robert M. Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richt´arik. SGD: General analysis and improved rates. InProceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 5200-5209. PMLR, 2019.
[22] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems.Journal of Machine Learning Research, 19(29):1-44, 2018.
[23] Oliver Hinder, Aaron Sidford, and Nimit S. Sohoni. Near-optimal methods for minimizing star-convex functions and beyond.arXiv preprint arXiv:1906.11985, 2019.
[24] Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: Mini-batching, averaging, and model misspecification.Journal of Machine Learning Research, 18(223):1-42, 2018.
[25] Hamed Karimi, Julie Nutini, and Mark Schmidt.Linear convergence of gradient and proximal-gradient methods under the Polyak- Lojasiewicz condition. InEuropean Conference on Machine Learning and Knowledge Discovery in Databases - Volume 9851, ECML PKDD 2016, pages 795-811, Berlin, Heidelberg, 2016. Springer-Verlag.
[26] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. InProceedings of the 36th International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 3252-3261. PMLR, 2019.
[27] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD: stochastic controlled averaging for on-device federated learning. InProceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2020.
[28] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian U Stich. A unified theory of decentralized SGD with changing topology and local updates. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2020.
[29] Remi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. Improved asynchronous parallel optimization analysis for stochastic incremental methods.Journal of Machine Learning Research, 19(81):1-68, 2018.
[30] J. C. H. Lee and P. Valiant. Optimizing star-convex functions. In2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 603-614, 2016.
[31] Haizhou Li, Helen M. Meng, Bin Ma, Engsiong Chng, and Lei Xie, editors.INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association, Dresden, Germany, September 6-10, 2015, 2015. ISCA.
[32] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. InAdvances in Neural Information Processing Systems, pages 2737-2745, 2015.
[33] Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, and Martin Jaggi. Don’t use large minibatches, use local SGD.International Conference on Learning Representations (ICLR), 2020.
[34] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 3325-3334. PMLR, 2018.
[35] Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I. Jordan.Perturbed iterate analysis for asynchronous stochastic optimization.SIAM Journal on Optimization, 27(4):2202-2229, 2017.
[36] Ryan McDonald, Mehryar Mohri, Nathan Silberman, Dan Walker, and Gideon S. Mann. Efficient large-scale distributed training of conditional maximum entropy models.In Advances in Neural Information Processing Systems 22, pages 1231-1239. Curran Associates, Inc., 2009.
[37] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 ofProceedings of Machine Learning Research, pages 1273-1282. PMLR, 2017.
[38] I. Necoara, Yu. Nesterov, and F. Glineur. Linear convergence of first order methods for non-strongly convex optimization.Mathematical Programming, 175(1):69-107, May 2019. ISSN 1436-4646.
[39] Deanna Needell, Nathan Srebro, and Rachel Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm.Mathematical Programming, 155(1): 549-573, Jan 2016. ISSN 1436-4646.
[40] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming.SIAM Journal on Optimization, 19(4):1574-1609, 2009.
[41] A. S. Nemirovski and D.B. Yudin.Problem Complexity and Method Efficiency in Optimization. Wiley, 1983.
[42] Yurii Nesterov.Introductory Lectures on Convex Optimization, volume 87 ofSpringer Science & Business Media. Springer US, Boston, MA, 2004.
[43] Yurii Nesterov and B.T. Polyak. Cubic regularization of newton method and its global performance.Mathematical Programming, 108(1):177-205, 2006.
[44] Feng Niu, Benjamin Recht, Christopher Re, and Stephen J. Wright. HOGWILD!: A lockfree approach to parallelizing stochastic gradient descent. InProceedings of the 24th International Conference on Neural Information Processing Systems, pages 693-701. Curran Associates Inc., 2011.
[45] Kumar Kshitij Patel and Aymeric Dieuleveut. Communication trade-offs for synchronized distributed SGD with large step size. InAdvances in Neural Information Processing Systems 32, pages 13601-13612. Curran Associates, Inc., 2019.
[46] B. T. Polyak. New method of stochastic approximation type.Autom. Remote Control, 51 (7):937-946, 1990.
[47] Boris T. Polyak.Introduction to Optimization. OptimizationSoftware, Inc., 1987.
[48] Herbert Robbins and Sutton Monro. A stochastic approximation method.The Annals of Mathematical Statistics, 22(3):400-407, September 1951.
[49] Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition.arXiv preprint arXiv:1308.6370, 2013.
[50] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Li et al. (2015), pages 1058-1062.
[51] O. Shamir and N. Srebro. Distributed stochastic optimization and learning. In2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 850-857, 2014.
[52] Suvrit Sra, Adams Wei Yu, Mu Li, and Alex Smola. Adadelay: Delay adaptive distributed stochastic optimization. InProceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 ofProceedings of Machine Learning Research, pages 957-965. PMLR, 2016.
[53] Sebastian U. Stich. Local SGD converges fast and communicates little.International Conference on Learning Representations (ICLR), 2019a.
[54] Sebastian U. Stich. Unified optimal analysis of the (stochastic) gradient method.arXiv preprint arXiv:1907.04232v2, 2019b.
[55] Sebastian U. Stich. On communication compression for distributed optimization on heterogeneous data.arXiv preprint arXiv:2009.02388, 2020.
[56] Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. InAdvances in Neural Information Processing Systems 31, pages 4447-4458. Curran Associates, Inc., 2018.
[57] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Li et al. (2015), pages 1488-1492.
[58] Sharan Vaswani, Francis Bach, and Mark Schmidt.Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron.arXiv preprint arXiv:1810.07288, 2018.
[59] Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms.arXiv preprint arXiv:1808.07576, 2018.
[60] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang.Gradient sparsification for communication-efficient distributed optimization. InAdvances in Neural Information Processing Systems 31, pages 1306-1316. Curran Associates, Inc., 2018.
[61] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems 30, pages 1509-1519. Curran Associates, Inc., 2017.
[62] Blake E Woodworth, Jialei Wang, Adam Smith, Brendan McMahan, and Nati Srebro. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In Advances in Neural Information Processing Systems 31, pages 8496-8506. Curran Associates, Inc., 2018.
[63] Blake E. Woodworth, Kumar Kshitij Patel, Sebastian U. Stich, Zhen Dai, Brian Bullins, H. Brendan McMahan, Ohad Shamir, and Nathan Srebro. Is local SGD better than minibatch sgd? InProceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2020.
[64] Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. InProceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 5325-5333. PMLR, 2018.
[65] Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD for non-convex optimization with faster convergence and less communication.arXiv preprint arXiv:1807.06629, 2018.
[66] Jian Zhang, Christopher De Sa, Ioannis Mitliagkas, and Christopher R´e. Parallel SGD: When does averaging help?arXiv preprint arXiv:1904.11325, 2016.
[67] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization.Journal of Machine Learning Research, 14:3321-3363, 2013.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.