Importance sampling for minibatches.

*(English)*Zbl 06982318Summary: Minibatching is a very well studied and highly popular technique in supervised learning, used by practitioners due to its ability to accelerate training through better utilization of parallel processing power and reduction of stochastic variance. Another popular technique is importance sampling – a strategy for preferential sampling of more important examples also capable of accelerating the training process. However, despite considerable effort by the community in these areas, and due to the inherent technical difficulty of the problem, there is virtually no existing work combining the power of importance sampling with the strength of minibatching. In this paper we propose the first practical importance sampling for minibatches and give simple and rigorous complexity analysis of its performance. We illustrate on synthetic problems that for training data of certain properties, our sampling can lead to several orders of magnitude improvement in training time. We then test the new sampling on several
popular data sets, and show that the improvement can reach an order of magnitude.

##### MSC:

65C60 | Computational problems in statistics (MSC2010) |

##### Keywords:

empirical risk minimization; importance sampling; minibatching; variance-reduced methods; convex optimization##### Software:

Pegasos
PDF
BibTeX
XML
Cite

\textit{D. Csiba} and \textit{P. Richtárik}, J. Mach. Learn. Res. 19, Paper No. 27, 21 p. (2018; Zbl 06982318)

Full Text:
Link

##### References:

[1] | L´eon Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, pages 177–186. Springer, 2010. |

[2] | Dominik Csiba and Peter Richt´arik. Primal method for ERM with flexible mini-batching schemes and non-convex losses. arXiv:1506.02227, 2015. |

[3] | Dominik Csiba, Zheng Qu, and Peter Richt´arik. Stochastic dual coordinate ascent with adaptive probabilities. ICML, 2015. |

[4] | Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS 27, pages 1646–1654, 2014. |

[5] | Olivier Fercoq and Peter Richt´arik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997–2023, 2015. · Zbl 1327.65108 |

[6] | Olivier Fercoq, Zheng Qu, Peter Richt´arik, and Martin Tak´aˇc. Fast distributed coordinate descent for minimizing non-strongly convex losses. IEEE International Workshop on Machine Learning for Signal Processing, 2014. |

[7] | Michael P Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012. · Zbl 1262.90090 |

[8] | Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Koneˇcn´y, and Scott Sallinen. Stop wasting my gradients: Practical SVRG. In NIPS 28, pages 2251–2259, 2015. |

[9] | Geoffrey E Hinton. Learning multiple layers of representation. Trends in cognitive sciences, 11(10):428–434, 2007. |

[10] | Guang-Bin Huang, Hongming Zhou, Xiaojian Ding, and Rui Zhang. Extreme learning machine for regression and multiclass classification. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 42(2):513–529, 2012. |

[11] | Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS 26, 2013. |

[12] | Jakub Koneˇcn´y and Peter Richt´arik. S2GD: Semi-stochastic gradient descent methods. Frontiers in Applied Mathematics and Statistics, 3(9):1–14, 2017. |

[13] | Jakub Koneˇcn´y, Jie Lu, Peter Richt´arik, and Martin Tak´aˇc. mS2GD: Mini-batch semistochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10(2):242–255, 2016. |

[14] | Jakub Koneˇcn´y, Zheng Qu, and Peter Richt´arik. Semi-stochastic coordinate descent. Optimization Methods and Software, 32(5):993–1005, 2017. · Zbl 1386.90080 |

[15] | Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS 25, pages 1097–1105, 2012. 19 |

[16] | Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM Journal on Optimization, 25(4):2244–2273, 2015. · Zbl 1329.65127 |

[17] | Ji Liu and Stephen J Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351–376, 2015. · Zbl 1358.90098 |

[18] | Julien Mairal.Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015. · Zbl 1320.90047 |

[19] | Ion Necoara and Andrei Patrascu. A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints. Computational Optimization and Applications, 57(2):307–337, 2014. · Zbl 1304.90160 |

[20] | Deanna Needell, Rachel Ward, and Nati Srebro. Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In NIPS 27, pages 1017–1025, 2014. · Zbl 1333.65070 |

[21] | Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012. · Zbl 1257.90073 |

[22] | Zheng Qu and Peter Richt´arik. Coordinate descent methods with arbitrary sampling I: algorithms and complexity. Optimization Methods and Software, 31(5):829–857, 2016. · Zbl 1365.90205 |

[23] | Zheng Qu and Peter Richt´arik. Coordinate descent with arbitrary sampling II: expected separable overapproximation. Optimization Methods and Software, 31(5):858–884, 2016. · Zbl 1365.90206 |

[24] | Zheng Qu, Peter Richt´arik, and Tong Zhang. Quartz: Randomized dual coordinate ascent with arbitrary sampling. In NIPS 28, pages 865–873, 2015. |

[25] | Peter Richt´arik and Martin Tak´aˇc. Distributed coordinate descent method for learning with big data. JMLR, 17(75):1–25, 2016a. · Zbl 1360.68709 |

[26] | Peter Richt´arik and Martin Tak´aˇc. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 10(6):1233–1243, 2016b. · Zbl 1353.90148 |

[27] | Peter Richt´arik and Martin Tak´aˇc. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144 (2):1–38, 2014. · Zbl 1301.65051 |

[28] | Peter Richt´arik and Martin Tak´aˇc. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1):433–484, 2016. · Zbl 1342.90102 |

[29] | Herbert Robbins and Sutton Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 1951. · Zbl 0054.05901 |

[30] | Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv:1309.2388, 2013. · Zbl 1417.90099 |

[31] | Shai Shalev-Shwartz. SDCA without duality. arXiv:1502.06177, 2015. 20 |

[32] | Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for ‘1-regularized loss minimization. JMLR, 12:1865–1892, 2011. · Zbl 1280.62081 |

[33] | Shai Shalev-Shwartz and Tong Zhang.Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012. · Zbl 1342.90103 |

[34] | Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In NIPS 26, pages 378–385. 2013a. · Zbl 1307.68073 |

[35] | Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. JMLR, 14(1):567–599, 2013b. · Zbl 1307.68073 |

[36] | Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical Programming, 127(1):3–30, 2011. · Zbl 1211.90239 |

[37] | Martin Tak´aˇc, Avleen Singh Bijral, Peter Richt´arik, and Nathan Srebro. Mini-batch primal and dual methods for SVMs. In ICML, 2013. |

[38] | Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014. · Zbl 1321.65016 |

[39] | Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML, 2004. |

[40] | Yuchen Zhang and Lin Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In ICML, 2015. · Zbl 1440.62314 |

[41] | Peilin Zhao and Tong Zhang. Accelerating minibatch stochastic gradient descent using stratified sampling. arXiv:1405.3080, 2014. |

[42] | Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling. In ICML, 2015. |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.