zbMATH — the first resource for mathematics

Just interpolate: Kernel “Ridgeless” Regression can generalize. (English) Zbl 07241593
Summary: In the absence of explicit regularization, Kernel “Ridgeless” Regression with nonlinear kernels has the potential to fit the training data perfectly. It has been observed empirically, however, that such interpolated solutions can still generalize well on test data. We isolate a phenomenon of implicit regularization for minimum-norm interpolated solutions, which is due to a combination of the high dimensionality of the input data, the curvature of the kernel function, and favorable geometric properties of the data, such as an eigenvalue decay of the empirical covariance and kernel matrices. In addition to deriving a data-dependent upper bound on the out-of-sample error, we present experimental evidence suggesting that the phenomenon occurs in the MNIST dataset.
68Q32 Computational learning theory
62G08 Nonparametric regression and quantile regression
MNIST; Scikit
Full Text: DOI Euclid
[1] Alvarez, M. A., Rosasco, L. and Lawrence, N. D. (2012). Kernels for vector-valued functions: A review. Found. Trends Mach. Learn. 4 195-266. · Zbl 1301.68212
[2] Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 3 463-482. · Zbl 1084.68549
[3] Belkin, M. (2018). Approximation beats concentration? An approximation view on inference with smooth radial kernels. ArXiv preprint. Available at arXiv:1801.03437.
[4] Belkin, M., Hsu, D. and Mitra, P. (2018). Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. ArXiv preprint. Available at arXiv:1806.05161.
[5] Belkin, M., Ma, S. and Mandal, S. (2018). To understand deep learning we need to understand kernel learning. ArXiv preprint. Available at arXiv:1802.01396.
[6] Belkin, M., Rakhlin, A. and Tsybakov, A. B. (2018). Does data interpolation contradict statistical optimality? ArXiv preprint. Available at arXiv:1806.09471.
[7] Bose, A., Chatterjee, S. and Gangopadhyay, S. (2003). Limiting spectral distributions of large dimensional random matrices. J. Indian Statist. Assoc. 41 221-259.
[8] Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Found. Comput. Math. 7 331-368. · Zbl 1129.68058
[9] Cressie, N. (1990). The origins of kriging. Math. Geol. 22 239-252. · Zbl 0964.86511
[10] Cucker, F. and Smale, S. (2002). Best choices for regularization parameters in learning theory: On the bias-variance problem. Found. Comput. Math. 2 413-428. · Zbl 1057.68085
[11] De Vito, E., Caponnetto, A. and Rosasco, L. (2005). Model selection for regularized least-squares algorithm in learning theory. Found. Comput. Math. 5 59-85. · Zbl 1083.68106
[12] Dou, X. and Liang, T. (2019). Training neural networks as learning data-adaptive kernels: Provable representation and approximation benefits. ArXiv preprint. Available at arXiv:1901.07114.
[13] El Karoui, N. (2010). The spectrum of kernel random matrices. Ann. Statist. 38 1-50. · Zbl 1181.62078
[14] Evgeniou, T., Pontil, M. and Poggio, T. (2000). Regularization networks and support vector machines. Adv. Comput. Math. 13 1-50. · Zbl 0939.68098
[15] Golub, G. H., Heath, M. and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21 215-223. · Zbl 0461.62059
[16] Gunasekar, S., Woodworth, B. E., Bhojanapalli, S., Neyshabur, B. and Srebro, N. (2017). Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems 6151-6159.
[17] Györfi, L., Kohler, M., Krzyzak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics. Springer, New York. · Zbl 1021.62024
[18] Koltchinskii, V. and Beznosova, O. (2005). Exponential convergence rates in classification. In Learning Theory. Lecture Notes in Computer Science 3559 295-307. Springer, Berlin. · Zbl 1137.68546
[19] LeCun, Y., Cortes, C. and Burges, C. J. (2010). MNIST handwritten digit database. AT&T Labs [Online]. Available at http://yann.lecun.com/exdb/mnist.
[20] Li, Y., Ma, T. and Zhang, H. (2017). Algorithmic regularization in over-parameterized matrix recovery. ArXiv preprint. Available at arXiv:1712.09203.
[21] Liang, T. and Rakhlin, A. (2019). Supplement to “Just interpolate: Kernel ‘Ridgeless’ regression can generalize.” https://doi.org/10.1214/19-AOS1849SUPP.
[22] Neyshabur, B., Tomioka, R. and Srebro, N. (2014). In search of the real inductive bias: On the role of implicit regularization in deep learning. ArXiv preprint. Available at arXiv:1412.6614.
[23] Pedregosa, F., Varoquaux, G., Gramfort, A. et al. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12 2825-2830. · Zbl 1280.68189
[24] Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge. · Zbl 0994.68074
[25] Smola, A. J. and Schölkopf, B. (1998). Learning with Kernels 4. Citeseer. · Zbl 1019.68094
[26] Vapnik, V. N. (1998). Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. Wiley, New York.
[27] Vovk, V. (2013). Kernel ridge regression. In Empirical Inference 105-116. Springer, Heidelberg. · Zbl 1325.62094
[28] Wahba, G. (1990). Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics 59. SIAM, Philadelphia, PA. · Zbl 0813.62001
[29] Yao, Y., Rosasco, L. and Caponnetto, A. (2007). On early stopping in gradient descent learning. Constr. Approx. 26 289-315. · Zbl 1125.62035
[30] Yin, Y. Q., Bai, Z. D. and Krishnaiah, P. R. (1988). On the limit of the largest eigenvalue of the large-dimensional sample covariance matrix. Probab. Theory Related Fields 78 509-521. · Zbl 0627.62022
[31] Zhang, C.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.