
An accelerated communication-efficient primal-dual optimization framework for structured machine learning. (English) Zbl 1464.90059

Summary: Distributed optimization algorithms are essential for training machine learning models on very large-scale datasets. However, they often suffer from communication bottlenecks. To address this issue, the communication-efficient primal-dual coordinate ascent framework CoCoA and its improved variant CoCoA+ were proposed, achieving a convergence rate of \(\mathcal{O}(1/t)\) for solving empirical risk minimization problems with Lipschitz continuous losses. In this paper, an accelerated variant of CoCoA+ is proposed and shown to possess a convergence rate of \(\mathcal{O}(1/t^2)\) in terms of reducing suboptimality. The analysis is also notable in that the convergence rate bounds involve constants that, except in extreme cases, are significantly smaller than those previously established for CoCoA+. Numerical experiments show that the acceleration can lead to significant performance gains.
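The improvement from \(\mathcal{O}(1/t)\) to \(\mathcal{O}(1/t^2)\) described above is characteristic of Nesterov-type (momentum/extrapolation) acceleration. The following Python sketch is only a hedged illustration of that generic mechanism on a simple smooth problem: it wraps an extrapolation step around an inner update routine, in the spirit of accelerating a framework built on local solvers. The objective, the `inner_step` helper, and all constants are illustrative assumptions; this is not the accelerated CoCoA+ algorithm of the paper.

```python
# Minimal, generic sketch (assumptions, not the authors' method): a FISTA-style
# extrapolation step wrapped around an inner update improves the O(1/t) rate of
# plain iterations to O(1/t^2) on a smooth convex objective.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
b = rng.standard_normal(200)
L = np.linalg.norm(A, 2) ** 2          # smoothness constant of f(x) = 0.5*||Ax - b||^2


def f(x):
    return 0.5 * np.linalg.norm(A @ x - b) ** 2


def inner_step(y):
    """Stand-in for a local/inner solver: a single gradient step with step size 1/L."""
    return y - (A.T @ (A @ y - b)) / L


def plain(T):
    x = np.zeros(A.shape[1])
    for _ in range(T):
        x = inner_step(x)              # unaccelerated: suboptimality decays like O(1/t)
    return f(x)


def accelerated(T):
    x = y = np.zeros(A.shape[1])
    t = 1.0
    for _ in range(T):
        x_new = inner_step(y)          # inner update at the extrapolated point
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # Nesterov momentum step
        x, t = x_new, t_new            # suboptimality decays like O(1/t^2)
    return f(x)


if __name__ == "__main__":
    print("plain      :", plain(200))
    print("accelerated:", accelerated(200))
```

Running the script with the same iteration budget typically shows a markedly smaller suboptimality for the accelerated loop, mirroring the kind of gains the summary attributes to acceleration.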

MSC:

90C25 Convex programming

References:

[1] Beck, A.; Teboulle, M., A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM. J. Imaging. Sci., 2, 1, 183-202 (2009) · Zbl 1175.94009 · doi:10.1137/080716542
[2] Bradley, J.K., Kyrola, A., Bickson, D., and Guestrin, C., Parallel coordinate descent for l1-regularized loss minimization (2011). Available at arXiv:1105.5379.
[3] Duchi, J.; Hazan, E.; Singer, Y., Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res., 12, 2121-2159 (2011) · Zbl 1280.68164
[4] Dünner, C., Forte, S., Takáč, M., and Jaggi, M., Primal-dual rates and certificates, in 33rd International Conference on Machine Learning, ICML 2016, 2016.
[5] Fercoq, O.; Richtárik, P., Accelerated, parallel and proximal coordinate descent, SIAM. J. Optim., 25, 4, 1997-2023 (2015) · Zbl 1327.65108 · doi:10.1137/130949993
[6] Jaggi, M., Smith, V., Takáč, M., Terhorst, J., Krishnan, S., Hofmann, T., and Jordan, M.I., Communication-efficient distributed dual coordinate ascent, in Advances in Neural Information Processing Systems, 2014, pp. 3068-3076.
[7] Lin, H., Mairal, J., and Harchaoui, Z., A universal catalyst for first-order optimization, in Advances in Neural Information Processing Systems, 2015, pp. 3366-3374.
[8] Liu, J.; Wright, S. J., Asynchronous stochastic coordinate descent: Parallelism and convergence properties, SIAM. J. Optim., 25, 1, 351-376 (2015) · Zbl 1358.90098 · doi:10.1137/140961134
[9] Ma, C.; Konečný, J.; Jaggi, M.; Smith, V.; Jordan, M. I.; Richtárik, P.; Takáč, M., Distributed optimization with arbitrary local solvers, Optim. Methods Softw., 32, 4, 813-848 (2017) · Zbl 1419.68214 · doi:10.1080/10556788.2016.1278445
[10] Ma, C., Smith, V., Jaggi, M., Jordan, M.I., Richtárik, P., and Takáč, M., Adding vs. averaging in distributed primal-dual optimization, in 32nd International Conference on Machine Learning, ICML 2015, 2015.
[11] Nesterov, Y., Gradient methods for minimizing composite functions, Math. Program., 140, 1, 125-161 (2013) · Zbl 1287.90067 · doi:10.1007/s10107-012-0629-5
[12] Nesterov, Y., Introductory Lectures on Convex Optimization: A Basic Course (2013), Springer, Boston, MA
[13] Recht, B., Re, C., Wright, S., and Niu, F., Hogwild!: A lock-free approach to parallelizing stochastic gradient descent, in Advances in Neural Information Processing Systems, 2011, pp. 693-701.
[14] Rockafellar, R. T., Convex Analysis (1970), Princeton University Press, Princeton, NJ · Zbl 0229.90020
[15] Shalev-Shwartz, S. and Zhang, T., Accelerated mini-batch stochastic dual coordinate ascent, in Advances in Neural Information Processing Systems, 2013, pp. 378-385.
[16] Shalev-Shwartz, S.; Zhang, T., Stochastic dual coordinate ascent methods for regularized loss minimization, J. Mach. Learn. Res., 14, 1, 567-599 (2013) · Zbl 1307.68073
[17] Shamir, O., Srebro, N., and Zhang, T., Communication-efficient distributed optimization using an approximate Newton-type method, in 31st International Conference on Machine Learning, ICML 2014, 2014, pp. 1000-1008.
[18] Smith, V.; Forte, S.; Chenxin, M.; Takáč, M.; Jordan, M. I.; Jaggi, M., CoCoA: A general framework for communication-efficient distributed optimization, J. Mach. Learn. Res., 18, 230 (2018) · Zbl 1473.68167
[19] Takáč, M., Bijral, A., Richtárik, P., and Srebro, N., Mini-batch primal and dual methods for SVMs, in 30th International Conference on Machine Learning, ICML 2013, 2013.
[20] Takáč, M., Richtárik, P., and Srebro, N., Distributed mini-batch SDCA (2015). Available at arXiv:1507.08322.
[21] Yang, T., Zhu, S., Jin, R., and Lin, Y., Analysis of distributed stochastic dual coordinate ascent (2013). Available at arXiv:1312.1031.
[22] Zhang, Y. and Xiao, L., Communication-efficient distributed optimization of self-concordant empirical loss, in Large-Scale and Distributed Optimization, Springer, 2018, pp. 289-341. · Zbl 1412.90118
[23] Zhu, C.; Byrd, R. H.; Lu, P.; Nocedal, J., Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization, ACM Trans. Math. Softw. (TOMS), 23, 4, 550-560 (1997) · Zbl 0912.65057 · doi:10.1145/279232.279236