
Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. (English) Zbl 1435.68380

Summary: We study the round complexity of minimizing the average of convex functions in a new distributed optimization setting where each machine receives two subsets of functions: the first comes from a random partition of the data, and the second is sampled with replacement. Under this setting, we define a broad class of distributed algorithms whose local computation can utilize both subsets, and we design a distributed stochastic variance reduced gradient method belonging to this class. When the condition number of the problem is small, our method achieves the optimal parallel runtime, amount of communication, and rounds of communication among all distributed first-order methods, up to constant factors. When the condition number is relatively large, we provide a lower bound on the number of communication rounds needed by any algorithm in this class. We then present an accelerated version of our method whose number of communication rounds matches the lower bound up to logarithmic factors, establishing that this accelerated algorithm has the lowest round complexity among all algorithms in our class under this new setting.
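
The sketch below is a minimal, single-process simulation (in Python) of the data-access pattern the summary describes: each "machine" holds one block of a random partition of the data plus an extra block sampled with replacement, and runs SVRG-style local updates between communication rounds. It assumes a ridge-regression objective; the step size, sample sizes, and the iterate-averaging step are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 1200, 20, 4          # samples, dimension, number of machines
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
lam = 1e-2                      # ridge parameter (strong convexity)

def full_grad(w):
    """Gradient of the average loss, computed once per snapshot (one round)."""
    return A.T @ (A @ w - b) / n + lam * w

def sample_grad(w, idx):
    """Gradient of a single component function."""
    a, y = A[idx], b[idx]
    return a * (a @ w - y) + lam * w

# Data distribution: a random partition plus an extra subset sampled with replacement.
perm = rng.permutation(n)
partition = np.array_split(perm, m)
extra = [rng.integers(0, n, size=n // m) for _ in range(m)]

w = np.zeros(d)
eta, rounds, local_steps = 0.05, 30, n // m

for r in range(rounds):
    g_snap = full_grad(w)        # communication round: full gradient at the snapshot
    w_snap = w.copy()
    local_iterates = []
    for j in range(m):           # local SVRG-style passes, simulated serially
        wj = w_snap.copy()
        pool = np.concatenate([partition[j], extra[j]])
        for idx in rng.choice(pool, size=local_steps):
            vr = sample_grad(wj, idx) - sample_grad(w_snap, idx) + g_snap
            wj -= eta * vr
        local_iterates.append(wj)
    w = np.mean(local_iterates, axis=0)  # communication round: average local iterates

print("final objective:", 0.5 * np.mean((A @ w - b) ** 2) + 0.5 * lam * w @ w)
```

In this toy setup, each outer iteration costs two communication rounds (broadcasting the snapshot gradient and averaging the local iterates), which is the quantity the paper's round-complexity bounds control.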

MSC:

68W15 Distributed algorithms
62J07 Ridge regression; shrinkage estimators (Lasso)
90C15 Stochastic programming
90C25 Convex programming
90C90 Applications of mathematical programming
