Large-scale machine learning with stochastic gradient descent. (English) Zbl 1436.68293

Lechevallier, Yves (ed.) et al., Proceedings of COMPSTAT’2010. 19th international conference on computational statistics, Paris, France, August 22–27, 2010. Keynote, invited and contributed papers. Heidelberg: Physica Verlag. 177-186 (2010).
Summary: During the last decade, the data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods is limited by the computing time rather than the sample size. A more precise analysis uncovers qualitatively different tradeoffs for the case of small-scale and large-scale learning problems. The large-scale case involves the computational complexity of the underlying optimization algorithm in non-trivial ways. Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems. In particular, second order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass on the training set.
For the entire collection see [Zbl 1202.62001].


68T05 Learning and adaptive systems in artificial intelligence
62L20 Stochastic approximation


Full Text: DOI


[1] BORDES. A., BOTTOU, L., and GALLINARI, P. (2009): SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent. Journal of Machine Learning Research, 10:1737-1754. With Erratum (to appear). · Zbl 1235.68130
[2] BOTTOU, L. and BOUSQUET, O. (2008): The Tradeoffs of Large Scale Learning, In Advances in Neural Information Processing Systems, vol.20, 161-168.
[3] BOTTOU, L. and LECUN, Y. (2004): On-line Learning for Very Large Datasets. Applied Stochastic Models in Business and Industry, 21(2):137-151 · Zbl 1091.68063 · doi:10.1002/asmb.538
[4] BOUSQUET, O. (2002): Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. Thèse de doctorat, Ecole Polytechnique, Palaiseau, France.
[5] CORTES, C. and VAPNIK, V. N. (1995): Support Vector Networks, Machine Learning, 20:273-297. · Zbl 0831.68098
[6] DENNIS, J. E., Jr., and SCHNABEL, R. B. (1983): Numerical Methods For Unconstrained Optimization and Nonlinear Equations. Prentice-Hall · Zbl 0579.65058
[7] JOACHIMS, T. (2006): Training Linear SVMs in Linear Time. In Proceedings of the 12th ACM SIGKDD, ACM Press.
[8] LAFFERTY, J. D., MCCALLUM, A., and PEREIRA, F. (2001): Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML 2001, 282-289, Morgan Kaufman.
[9] LEE, W. S., BARTLETT, P. L., and WILLIAMSON, R. C. (1998): The Importance of Convexity in Learning with Squared Loss. IEEE Transactions on Information Theory, 44(5):1974-1980. · Zbl 0935.68091 · doi:10.1109/18.705577
[10] LEWIS, D. D., YANG, Y., ROSE, T. G., and LI, F. (2004): RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397.
[11] LIN, C. J., WENG, R. C., and KEERTHI, S. S. (2007): Trust region Newton methods for large-scale logistic regression. In Proceedings of ICML 2007, 561-568, ACM Press.
[12] MACQUEEN, J. (1967): Some Methods for Classification and Analysis of Multivariate Observations. In Fifth Berkeley Symposium on Mathematics, Statistics, and Probabilities, vol.1, 281-297, University of California Press. · Zbl 0214.46201
[13] MASSART, P. (2000): Some applications of concentration inequalities to Statistics, Annales de la Faculté des Sciences de Toulouse, series 6,9,(2):245-303. · Zbl 0986.62002
[14] MURATA, N. (1998): A Statistical Study of On-line Learning. In Online Learning and Neural Networks, Cambridge University Press. · Zbl 0966.68170
[15] POLYAK, B. T. and JUDITSKY, A. B. (1992): Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization, 30(4):838-855. · Zbl 0762.62022 · doi:10.1137/0330046
[16] ROSENBLATT, F. (1957): The Perceptron: A perceiving and recognizing automaton. Technical Report 85-460-1, Project PARA, Cornell Aeronautical Lab.
[17] RUMELHART, D. E., HINTON, G. E., and WILLIAMS, R. J. (1986): Learning internal representations by error propagation. In Parallel distributed processing: Explorations in the microstructure of cognition, vol.I, 318-362, Bradford Books.
[18] SHALEV-SHWARTZ, S. and SREBRO, N. (2008): SVM optimization: inverse dependence on training set size. In Proceedings of the ICML 2008, 928-935, ACM.
[19] TIBSHIRANI, R. (1996): Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288. · Zbl 0850.62538
[20] TJONG KIM SANG E. F., and BUCHHOLZ, S. (2000): Introduction to the CoNLL-2000 Shared Task: Chunking. In Proceedings of CoNLL-2000, 127-132.
[21] TSYBAKOV, A. B. (2004): Optimal aggregation of classifiers in statistical learning, Annals of Statististics, 32(1). · Zbl 1105.62353
[22] VAPNIK, V. N. and CHERVONENKIS, A. YA. (1971): On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability and its Applications, 16(2):264-280. · Zbl 0247.60005 · doi:10.1137/1116025
[23] WIDROW, B. and HOFF, M. E. (1960): Adaptive switching circuits. IRE WESCON Conv. Record, Part 4., 96-104.
[24] XU, W.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.