
Distributed learning with sparse communications by identification. (English) Zbl 07394788

Summary: In distributed optimization for large-scale learning, a major performance limitation stems from the communications between the different entities. In a setting where computations are performed by workers on local data and a coordinator machine coordinates their updates to minimize a global loss, we present an asynchronous optimization algorithm that efficiently reduces the communications between the coordinator and the workers. This reduction comes from a random sparsification of the local updates. We show that this algorithm converges linearly in the strongly convex case and also identifies optimal strongly sparse solutions. We further exploit this identification to propose an automatic dimension reduction, aptly sparsifying all exchanges between coordinator and workers.
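To illustrate the mechanism described in the summary, the following minimal Python sketch mimics a synchronous, simplified variant of communication sparsification for a composite least-squares problem: each worker transmits only a randomly selected subset of coordinates of its local gradient step, the coordinator applies soft-thresholding, and the resulting sparse support could then be used to restrict further exchanges. The loss, step size, sparsification probability, and all names are illustrative assumptions; this is not the authors' exact asynchronous algorithm.

```python
# Toy sketch (assumption-laden, not the paper's algorithm): sparsified-update
# proximal-gradient for a distributed lasso-type problem.
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def worker_update(A_i, b_i, x, step, mask):
    """Local least-squares gradient step, sparsified by a coordinate mask."""
    grad = A_i.T @ (A_i @ x - b_i) / len(b_i)
    delta = -step * grad
    delta[~mask] = 0.0              # only the selected coordinates are communicated
    return delta

rng = np.random.default_rng(0)
n, d, workers, lam, step, p = 200, 50, 4, 0.1, 0.05, 0.3
A = [rng.standard_normal((n, d)) for _ in range(workers)]
x_true = np.zeros(d); x_true[:5] = 1.0          # sparse ground truth
b = [Ai @ x_true + 0.01 * rng.standard_normal(n) for Ai in A]

x = np.zeros(d)
for it in range(500):
    agg = np.zeros(d)
    for i in range(workers):
        mask = rng.random(d) < p                 # random sparsification of the update
        agg += worker_update(A[i], b[i], x, step, mask) / p  # 1/p keeps it unbiased
    x = soft_threshold(x + agg / workers, step * lam)        # coordinator's prox step

# The nonzero pattern of x is the identified support; in an identification-based
# scheme, later exchanges could be restricted to these coordinates.
print("identified support:", np.nonzero(x)[0])
```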

MSC:

65K10 Numerical optimization and variational techniques
90C06 Large-scale problems in mathematical programming
68W15 Distributed algorithms

Software:

Celer; ARock

References:

[1] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli, The convergence of sparsified gradient methods, in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Curran Associates, 2018.
[2] A. Aytekin, H. R. Feyzmahdavian, and M. Johansson, Analysis and Implementation of an Asynchronous Optimization Algorithm for the Parameter Server, preprint, https://arxiv.org/abs/1610.05507, 2016. · Zbl 1359.90080
[3] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, Optimization with sparsity-inducing penalties, Found. Trends Mach. Learn., 4 (2012), pp. 1-106. · Zbl 06064248
[4] D. Basu, D. Data, C. Karakus, and S. Diggavi, Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations, in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Curran Associates, 2019.
[5] H. H. Bauschke and P. L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, Springer, New York, 2011. · Zbl 1218.47001
[6] R. Bellman, R. Kalaba, and J. Lockett, Numerical Inversion of the Laplace Transform: Applications to Biology, Economics, Engineering and Physics, Elsevier, New York, 1966. · Zbl 0147.14003
[7] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice Hall, Englewood Cliffs, NJ, 1989. · Zbl 0743.65107
[8] S. Bubeck, Convex optimization: Algorithms and complexity, Found. Trends Mach. Learn., 8 (2015), pp. 231-357. · Zbl 1365.90196
[9] I. S. Dhillon, P. K. Ravikumar, and A. Tewari, Nearest Neighbor based Greedy Coordinate Descent, in Advances in Neural Information Processing Systems 24, NIPS, 2011.
[10] D. L. Donoho, De-noising by soft-thresholding, IEEE Trans. Inform. Theory, 41 (1995), pp. 613-627. · Zbl 0820.62002
[11] J. M. Fadili, G. Garrigos, J. Malick, and G. Peyré, Model consistency for learning with mirror-stratifiable regularizers, Proceedings of Machine Learning Research, 89 (2019), pp. 1236-124.
[12] M. Fuentes, J. Malick, and C. Lemaréchal, Descentwise inexact proximal algorithms for smooth optimization, Comput. Optim. Appl., 53 (2012), pp. 755-769. · Zbl 1264.90160
[13] D. Grishchenko, F. Iutzeler, and J. Malick, Proximal gradient methods with adaptive subspace sampling, Math. Oper. Res., in press. · Zbl 1467.90024
[14] O. Güler, New proximal point algorithms for convex minimization, SIAM J. Optim., 2 (1992), pp. 649-664, https://doi.org/10.1137/0802032. · Zbl 0778.90052
[15] R. Hannah and W. Yin, On unbounded delays in asynchronous parallel fixed-point algorithms, J. Sci. Comput., (2016), pp. 1-28.
[16] W. L. Hare and A. S. Lewis, Identifying active constraints via partial smoothness and prox-regularity, J. Convex Anal., 11 (2004), pp. 251-266. · Zbl 1178.90314
[17] F. Iutzeler and J. Malick, Nonsmoothness in machine learning: specific structure, proximal identification, and applications, Set-Valued Var. Anal., 28 (2020), pp. 661-678. · Zbl 1506.90272
[18] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D’Oliveira, H. Eichner, S. El Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konečný, A. Korolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Özgür, R. Pagh, M. Raykova, H. Qi, D. Ramage, R. Raskar, D. Song, W. Song, S. U. Stich, Z. Sun, A. T. Suresh, F. Tramèr, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao, Advances and Open Problems in Federated Learning, preprint, https://arxiv.org/abs/1912.04977, 2019.
[19] S. Khirirat, M. Johansson, and D. Alistarh, Gradient compression for communication-limited convex optimization, in Proceedings of the 2018 IEEE Conference on Decision and Control (CDC), IEEE, 2018, pp. 166-171.
[20] J. Konečnỳ, H. B. McMahan, D. Ramage, and P. Richtárik, Federated Optimization: Distributed Machine Learning for On-Device Intelligence, preprint, https://arxiv.org/abs/1610.02527, 2016.
[21] J. Langford, L. Li, and T. Zhang, Sparse online learning via truncated gradient, J. Mach. Learn. Res., 10 (2009), pp. 777-801. · Zbl 1235.68167
[22] S. Lee and S. J. Wright, Manifold identification in dual averaging for regularized stochastic online learning, J. Mach. Learn. Res., 13 (2012), pp. 1705-1744. · Zbl 1432.68392
[23] M. Li, D. G. Andersen, and A. Smola, Distributed delayed proximal gradient methods, in NIPS Workshop on Optimization for Machine Learning, 2013.
[24] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, Federated Learning: Challenges, Methods, and Future Directions, preprint, https://arxiv.org/abs/1908.07873, 2019.
[25] H. Lin, J. Mairal, and Z. Harchaoui, Catalyst acceleration for first-order convex optimization: from theory to practice, J. Mach. Learn. Res., 18 (2017), pp. 7854-7907. · Zbl 1469.68101
[26] H. Lin, J. Mairal, and Z. Harchaoui, An inexact variable metric proximal point algorithm for generic quasi-Newton acceleration, SIAM J. Optim., 29 (2019), pp. 1408-1443, https://doi.org/10.1137/17M1125157. · Zbl 1421.90117
[27] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training, preprint, https://arxiv.org/abs/1712.01887, 2017.
[28] J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar, An asynchronous parallel stochastic coordinate descent algorithm, J. Mach. Learn. Res., 16 (2015), pp. 285-322. · Zbl 1337.68286
[29] I. Loshchilov, M. Schoenauer, and M. Sebag, Adaptive coordinate descent, in Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, ACM, 2011, pp. 885-892.
[30] C. Ma, M. Jaggi, F. E. Curtis, N. Srebro, and M. Takáč, An accelerated communication-efficient optimization framework for structured machine learning, Optim. Methods Softw., 36 (2021), pp. 20-44. · Zbl 1464.90059
[31] B. Martinet, Régularisation d’inéquations variationnelles par approximations successives, Revue Française d’Informatique et de Recherche Opérationnelle, R3 (1970), pp. 154-158. · Zbl 0215.21103
[32] M. Massias, A. Gramfort, and J. Salmon, Celer: A Fast Solver for the Lasso with Dual Extrapolation, preprint, https://arxiv.org/abs/1802.07481, 2018.
[33] K. Mishchenko, F. Iutzeler, and J. Malick, A distributed flexible delay-tolerant proximal gradient algorithm, SIAM J. Optim., 30 (2020), pp. 933-959, https://doi.org/10.1137/18M1194699. · Zbl 1441.90120
[34] K. Mishchenko, F. Iutzeler, J. Malick, and M.-R. Amini, A delay-tolerant proximal-gradient algorithm for distributed learning, in Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, 80 (2018), pp. 3587-3595.
[35] H. Namkoong, A. Sinha, S. Yadlowsky, and J. C. Duchi, Adaptive sampling probabilities for non-smooth optimization, in Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, 70 (2017), pp. 2574-2583.
[36] Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM J. Optim., 22 (2012), pp. 341-362, https://doi.org/10.1137/100802001. · Zbl 1257.90073
[37] J. Nutini, I. Laradji, and M. Schmidt, Let’s Make Block Coordinate Descent Go Fast: Faster Greedy Rules, Message-Passing, Active-Set Complexity, and Superlinear Convergence, preprint, https://arxiv.org/abs/1712.08859, 2017.
[38] J. Nutini, M. Schmidt, and W. Hare, “Active-set complexity” of proximal gradient: How long does it take to find the sparsity pattern?, Optim. Lett., 13 (2019), pp. 645-655. · Zbl 1426.90253
[39] F. Pedregosa, R. Leblond, and S. Lacoste-Julien, Breaking the nonsmooth barrier: A scalable parallel method for composite optimization, in Advances in Neural Information Processing Systems 30, NIPS, 2017. · Zbl 1478.68293
[40] Z. Peng, Y. Xu, M. Yan, and W. Yin, ARock: An algorithmic framework for asynchronous parallel coordinate updates, SIAM J. Sci. Comput., 38 (2016), pp. A2851-A2879, https://doi.org/10.1137/15M1024950. · Zbl 1350.49041
[41] C. Poon, J. Liang, and C. Schoenlieb, Local convergence properties of SAGA/Prox-SVRG and acceleration, in Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, 80 (2018), pp. 4124-4132.
[42] C. Renggli, S. Ashkboos, M. Aghagolzadeh, D. Alistarh, and T. Hoefler, SparCML: High-performance sparse communication for machine learning, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, Article 11.
[43] P. Richtárik and M. Takáč, Distributed coordinate descent method for learning with big data, J. Mach. Learn. Res., 17 (2016), pp. 2657-2681. · Zbl 1360.68709
[44] P. Richtárik and M. Takáč, On optimal probabilities in stochastic coordinate descent methods, Optim. Lett., 10 (2016), pp. 1233-1243. · Zbl 1353.90148
[45] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877-898, https://doi.org/10.1137/0314056. · Zbl 0358.90053
[46] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs, in Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association, 2014, pp. 1058-1062.
[47] M. V. Solodov and B. F. Svaiter, Error bounds for proximal point subproblems and associated inexact proximal point algorithms, Math. Program., 88 (2000), pp. 371-389. · Zbl 0963.90064
[48] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, Sparsified SGD with memory, in Advances in Neural Information Processing Systems, 2018, pp. 4447-4458.
[49] S. U. Stich, A. Raj, and M. Jaggi, Safe adaptive importance sampling, in Advances in Neural Information Processing Systems, 2017, pp. 4381-4391.
[50] N. Strom, Scalable distributed DNN training using commodity GPU cloud computing, in Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, 2015, pp. 1488-1492.
[51] T. Sun, R. Hannah, and W. Yin, Asynchronous coordinate descent under more realistic assumptions, in Advances in Neural Information Processing Systems, 2017.
[52] Y. Sun, H. Jeong, J. Nutini, and M. Schmidt, Are we there yet? Manifold identification of gradient-related proximal methods, in Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, 89 (2019), pp. 1110-1119.
[53] S. Vaiter, G. Peyré, and J. Fadili, Low complexity regularization of linear inverse problems, in Sampling Theory, a Renaissance, Springer, 2015, pp. 103-153. · Zbl 1358.94016
[54] T. Vogels, S. P. Karimireddy, and M. Jaggi, PowerSGD: Practical low-rank gradient compression for distributed optimization, in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Curran Associates, 2019.
[55] J. Wang, M. Kolar, N. Srebro, and T. Zhang, Efficient distributed learning with sparsity, in International Conference on Machine Learning, 2017, pp. 3636-3645.
[56] J. Wangni, J. Wang, J. Liu, and T. Zhang, Gradient sparsification for communication-efficient distributed optimization, in Advances in Neural Information Processing Systems, 2018, pp. 1306-1316.
[57] S. J. Wright, Identifiable surfaces in constrained optimization, SIAM J. Control Optim., 31 (1993), pp. 1063-1079, https://doi.org/10.1137/0331048. · Zbl 0804.90105
[58] P. Zhao and T. Zhang, Stochastic optimization with importance sampling for regularized loss minimization, in Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, 37 (2015), pp. 1-9.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.