
Communication-efficient distributed optimization in networks with gradient tracking and variance reduction. (English) Zbl 07307473

Summary: There is growing interest in large-scale machine learning and optimization over decentralized networks, e.g., in the context of multi-agent learning and federated learning. Due to the pressing need to alleviate the communication burden, the investigation of communication-efficient distributed optimization algorithms, particularly for empirical risk minimization, has flourished in recent years. A large fraction of these algorithms have been developed for the master/slave setting, relying on the presence of a central parameter server that can communicate with all agents.

This paper focuses on distributed optimization over networks, or decentralized optimization, where each agent is only allowed to aggregate information from its neighbors over a network (namely, no centralized coordination is present). By properly adjusting the global gradient estimate via local averaging in conjunction with proper correction, we develop a communication-efficient approximate Newton-type method, called Network-DANE, which generalizes DANE to accommodate decentralized scenarios. Our key ideas can be applied, in a systematic manner, to obtain decentralized versions of other master/slave distributed algorithms. A notable development is Network-SVRG/SARAH, which employs variance reduction at each agent to further accelerate local computation. We establish linear convergence of Network-DANE and Network-SVRG for strongly convex losses, and of Network-SARAH for quadratic losses, which shed light on the impacts of data homogeneity, network connectivity, and local averaging upon the rate of convergence. We further extend Network-DANE to composite optimization by allowing a nonsmooth penalty term. Numerical evidence is provided to demonstrate the appealing performance of our algorithms over competitive baselines, in terms of both communication and computation efficiency. Our work suggests that performing a judiciously chosen amount of local communication and computation per iteration can substantially improve the overall efficiency.
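For orientation, the local-averaging-with-correction idea described above can be illustrated by the standard gradient-tracking template that such methods build on; the sketch below is schematic (the actual Network-DANE update replaces the plain gradient step with a DANE-type local subproblem) and assumes a doubly stochastic mixing matrix $W=(w_{ij})$ supported on the network graph:
\[
x_i^{t+1} = \sum_{j} w_{ij}\, x_j^{t} - \alpha\, s_i^{t}, \qquad
s_i^{t+1} = \sum_{j} w_{ij}\, s_j^{t} + \nabla f_i\bigl(x_i^{t+1}\bigr) - \nabla f_i\bigl(x_i^{t}\bigr),
\]
where $f_i$ denotes agent $i$'s local empirical risk and $\alpha>0$ is a step size. With the initialization $s_i^{0}=\nabla f_i(x_i^{0})$, double stochasticity of $W$ yields $\sum_i s_i^{t}=\sum_i \nabla f_i(x_i^{t})$ for every $t$, so each $s_i^{t}$ serves as a local estimate of the global gradient; the variance-reduced variants mentioned above replace $\nabla f_i$ with an SVRG/SARAH-type stochastic estimate to reduce local computation.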

MSC:

68T05 Learning and adaptive systems in artificial intelligence

Software:

AIDE; HOGWILD; ADD-OPT
Full Text: arXiv
