
Nonparametric stochastic approximation with large step-sizes. (English) Zbl 1346.60041

Summary: We consider the random-design least-squares regression problem within the reproducing kernel Hilbert space (RKHS) framework. Given a stream of independent and identically distributed input/output data, we aim to learn a regression function within an RKHS \(\mathcal{H}\), even if the optimal predictor (i.e., the conditional expectation) is not in \(\mathcal{H}\). In a stochastic approximation framework where the estimator is updated after each observation, we show that the averaged unregularized least-mean-square algorithm (a form of stochastic gradient descent), given a sufficiently large step-size, attains optimal rates of convergence for a variety of regimes of smoothness of the optimal prediction function and of the functions in \(\mathcal{H}\). Our results also apply in the usual finite-dimensional setting of parametric least-squares regression, showing adaptivity of our estimator to the spectral decay of the covariance matrix of the covariates.
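
For concreteness, the recursion discussed in the summary can be sketched in the standard kernel least-mean-squares notation (the notation below is ours rather than quoted from the paper): with \(K\) the reproducing kernel of \(\mathcal{H}\), \(\gamma\) the constant step-size, and \(g_0 = 0\), each observation \((x_n, y_n)\) triggers the update
\[ g_n = g_{n-1} - \gamma \bigl( g_{n-1}(x_n) - y_n \bigr) K(x_n, \cdot), \]
and the estimator after \(n\) observations is the averaged iterate \(\bar{g}_n = \frac{1}{n+1} \sum_{k=0}^{n} g_k\).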

MSC:

60F99 Limit theorems in probability theory
62G08 Nonparametric regression and quantile regression
46E22 Hilbert spaces with reproducing kernels (= (proper) functional Hilbert spaces, including de Branges-Rovnyak and other structured spaces)
65C60 Computational problems in statistics (MSC2010)

Software:

Forgetron
