
Local Rademacher complexities and oracle inequalities in risk minimization. (2004 IMS Medallion Lecture). (With discussions and rejoinder). (English) Zbl 1118.62065

Summary: Let \({\mathcal F}\) be a class of measurable functions \(f:S\to[0,1]\) defined on a probability space \((S,{\mathcal A},P)\). Given a sample \((X_1, \dots,X_n)\) of i.i.d. random variables taking values in \(S\) with common distribution \(P\), let \(P_n\) denote the empirical measure based on \((X_1, \dots,X_n)\). We study an empirical risk minimization problem \(P_n f\to\min\), \(f\in{\mathcal F}\). Given a solution \(\widehat f_n\) of this problem, the goal is to obtain very general upper bounds on its excess risk \[ {\mathcal E}_P(\widehat f_n):=P\widehat f_n-\inf_{f\in{\mathcal F}}Pf, \] expressed in terms of relevant geometric parameters of the class \({\mathcal F}\). Using concentration inequalities and other empirical process tools, we obtain both distribution-dependent and data-dependent upper bounds on the excess risk that are of asymptotically correct order in many examples. The bounds involve localized sup-norms of empirical and Rademacher processes indexed by functions from the class. We use these bounds to develop model selection techniques in abstract risk minimization problems that can be applied to more specialized frameworks of regression and classification.
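As a concrete illustration of the objects above (not taken from the paper; the threshold class \(f_t(x)=1\{x>t\}\), the grid of thresholds, and the localization radius `delta` are hypothetical choices), the following minimal Python sketch forms the empirical risks from a simulated sample, picks the empirical risk minimizer \(\widehat f_n\), evaluates its excess risk against the known \(P\), and computes the supremum of a Rademacher process over an empirically localized subset of the class, the kind of data-dependent quantity that enters the bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: S = [0, 1], P = Uniform[0, 1], and a finite class
# {f_t : t in grid} with f_t(x) = 1{x > t}, so each f_t maps S into [0, 1].
# Because P is known here, the true risk P f_t = 1 - t is available for checking.
grid = np.linspace(0.0, 0.9, 10)            # parameters t indexing the class
n = 200
X = rng.uniform(0.0, 1.0, size=n)           # i.i.d. sample X_1, ..., X_n ~ P

F = (X[None, :] > grid[:, None]).astype(float)   # F[j, i] = f_{t_j}(X_i)
Pn_f = F.mean(axis=1)                       # empirical risks P_n f over the class
P_f = 1.0 - grid                            # true risks P f (toy class only)

j_hat = int(np.argmin(Pn_f))                # empirical risk minimizer \hat f_n
excess_risk = P_f[j_hat] - P_f.min()        # E_P(\hat f_n) = P \hat f_n - inf P f

# Data-dependent complexity: sup of the Rademacher process over a "localized"
# subset of the class, here the f whose empirical risk is within delta of the minimum.
eps = rng.choice([-1.0, 1.0], size=n)       # i.i.d. Rademacher signs
delta = 0.1                                 # illustrative localization radius
local = Pn_f <= Pn_f.min() + delta
local_rademacher = np.max(np.abs(F[local] @ eps)) / n

print(f"ERM threshold t = {grid[j_hat]:.2f}, excess risk = {excess_risk:.3f}")
print(f"localized Rademacher sup = {local_rademacher:.3f}")
```

The localization step is what distinguishes this quantity from a global Rademacher complexity: the supremum is taken only over functions that the data already identify as near-minimizers, which is why the resulting bounds can be of the correct order in examples where global complexities are too crude.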

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
60B99 Probability theory on algebraic and topological structures
68Q32 Computational learning theory
62G08 Nonparametric regression and quantile regression
68T05 Learning and adaptive systems in artificial intelligence
68T10 Pattern recognition, speech recognition
