Local Rademacher complexities.

*(English)*Zbl 1083.62034From the paper: Estimating the performance of statistical procedures is useful for providing a better understanding of the factors that influence their behavior, as well as for suggesting ways to improve them. Although asymptotic analysis is a crucial first step toward understanding the behavior, finite sample error bounds are of more value as they allow the design of model selection (or parameter tuning) procedures. These error bounds typically have the following form: with high probability, the error of the estimator (typically a function in a certain class) is bounded by an empirical estimate of error plus a penalty term depending on the complexity of the class of functions that can be chosen by the algorithm. The differences between the true and empirical errors of functions in that class can be viewed as an empirical process. Many tools have been developed for understanding the behavior of such objects, and especially for evaluating their suprema – which can be thought of as a measure of how hard it is to estimate functions in the class at hand. The goal is thus to obtain the sharpest possible estimates on the complexity of function classes. A problem arises since the notion of complexity might depend on the (unknown) underlying probability measure according to which the data is produced. Distribution-free notions of the complexity, such as the Vapnik-Chervonenkis dimension or the metric entropy, typically give conservative estimates. Distribution-dependent estimates, based for example on entropy numbers in the \(L_2(P)\) distance, where \(P\) is the underlying distribution, are not useful when \(P\) is unknown. Thus, it is desirable to obtain data-dependent estimates which can readily be computed from the sample.

We propose new bounds on the error of learning algorithms in terms of a data-dependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to classification and prediction with convex function classes, and with kernel classes in particular.

We propose new bounds on the error of learning algorithms in terms of a data-dependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to classification and prediction with convex function classes, and with kernel classes in particular.

##### MSC:

62G08 | Nonparametric regression and quantile regression |

68Q32 | Computational learning theory |

68Q25 | Analysis of algorithms and problem complexity |

65Y20 | Complexity and performance of numerical algorithms |

PDF
BibTeX
XML
Cite

\textit{P. L. Bartlett} et al., Ann. Stat. 33, No. 4, 1497--1537 (2005; Zbl 1083.62034)

**OpenURL**

##### References:

[1] | Bartlett, P. L., Boucheron, S. and Lugosi, G. (2002). Model selection and error estimation. Machine Learning 48 85–113. · Zbl 0998.68117 |

[2] | Bartlett, P. L., Jordan, M. I. and McAuliffe, J. D. (2005). Convexity, classification, and risk bounds. J. Amer. Statist. Assoc. · Zbl 1118.62330 |

[3] | Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 3 463–482. · Zbl 1084.68549 |

[4] | Bartlett, P. L. and Mendelson, S. (2003). Empirical minimization. Probab. Theory Related Fields . To appear. Available at www.stat.berkeley.edu/ bartlett/papers/ bm-em-03.pdf. · Zbl 1142.62348 |

[5] | Boucheron, S., Lugosi, G. and Massart, P. (2000). A sharp concentration inequality with applications. Random Structures Algorithms 16 277–292. · Zbl 0954.60008 |

[6] | Boucheron, S., Lugosi, G. and Massart, P. (2003). Concentration inequalities using the entropy method. Ann. Probab. 31 1583–1614. · Zbl 1051.60020 |

[7] | Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Math. Acad. Sci. Paris 334 495–500. · Zbl 1001.60021 |

[8] | Bousquet, O. (2003). Concentration inequalities for sub-additive functions using the entropy method. In Stochastic Inequalities and Applications (E. Giné, C. Houdré and D. Nualart, eds.) 213–247. Birkhäuser, Boston. · Zbl 1037.60015 |

[9] | Bousquet, O., Koltchinskii, V. and Panchenko, D. (2002). Some local measures of complexity of convex hulls and generalization bounds. Computational Learning Theory. Lecture Notes in Artificial Intelligence 2375 59–73. Springer, Berlin. · Zbl 1050.68055 |

[10] | Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition . Springer, New York. · Zbl 0853.68150 |

[11] | Dudley, R. M. (1999). Uniform Central Limit Theorems . Cambridge Univ. Press. · Zbl 0951.60033 |

[12] | Györfi, L., Kohler, M., Krzyżak, A. and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression . Springer, New York. · Zbl 1021.62024 |

[13] | Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Inform. and Comput. 100 78–150. · Zbl 0762.68050 |

[14] | Haussler, D. (1995). Sphere packing numbers for subsets of the Boolean \(n\)-cube with bounded Vapnik–Chervonenkis dimension. J. Combin. Theory Ser. A 69 217–232. · Zbl 0818.60005 |

[15] | Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory 47 1902–1914. · Zbl 1008.62614 |

[16] | Koltchinskii, V. and Panchenko, D. (2000). Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II (E. Giné, D. M. Mason and J. A. Wellner, eds.) 443–459. Birkhäuser, Boston. · Zbl 1106.68385 |

[17] | Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces : Isoperimetry and Processes . Springer, New York. · Zbl 0748.60004 |

[18] | Lee, W. S., Bartlett, P. L. and Williamson, R. C. (1998). The importance of convexity in learning with squared loss. IEEE Trans. Inform. Theory 44 1974–1980. · Zbl 0935.68091 |

[19] | Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random penalties. Ann. Statist. 32 1679–1697. · Zbl 1045.62060 |

[20] | Mammen, E. and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808–1829. · Zbl 0961.62058 |

[21] | Massart, P. (2000). About the constants in Talagrand’s concentration inequalities for empirical processes. Ann. Probab. 28 863–884. · Zbl 1140.60310 |

[22] | Massart, P. (2000). Some applications of concentration inequalities to statistics. Probability theory. Ann. Fac. Sci. Toulouse Math. ( 6 ) 9 245–303. · Zbl 0986.62002 |

[23] | McDiarmid, C. (1998). Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics (M. Habib, C. McDiarmid, J. Ramirez-Alfonsin and B. Reed, eds.) 195–248. Springer, New York. · Zbl 0927.60027 |

[24] | Mendelson, S. (2002). Geometric parameters of kernel machines. Computational Learning Theory. Lecture Notes in Artificial Intelligence 2375 29–43. Springer, Berlin. · Zbl 1050.68070 |

[25] | Mendelson, S. (2002). Rademacher averages and phase transitions in Glivenko–Cantelli classes. IEEE Trans. Inform. Theory 48 251–263. · Zbl 1059.60027 |

[26] | Mendelson, S. (2002). Improving the sample complexity using global data. IEEE Trans. Inform. Theory 48 1977–1991. · Zbl 1061.68128 |

[27] | Mendelson, S. (2003). A few notes on statistical learning theory. Advanced Lectures on Machine Learning . Lecture Notes in Comput. Sci. 2600 1–40. Springer, New York. · Zbl 1015.00030 |

[28] | Pollard, D. (1984). Convergence of Stochastic Processes . Springer, Berlin. · Zbl 0544.60045 |

[29] | Rio, E. (2001). Une inégalité de Bennett pour les maxima de processus empiriques. Ann. Inst. H. Poincaré Probab. Statist. 38 1053–1057. · Zbl 1014.60011 |

[30] | Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. Ann. Probab. 22 28–76. JSTOR: · Zbl 0798.60051 |

[31] | van de Geer, S. (1987). A new approach to least-squares estimation, with applications. Ann. Statist. 15 587–602. JSTOR: · Zbl 0625.62046 |

[32] | van de Geer, S. (2000). Empirical Processes in M-Estimation . Cambridge Univ. Press. · Zbl 1179.62073 |

[33] | van der Vaart, A. (1998). Asymptotic Statistics . Cambridge Univ. Press. · Zbl 0910.62001 |

[34] | van der Vaart, A. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes. With Applications of Statistics . Springer, New York. · Zbl 0862.60002 |

[35] | Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16 264–280. · Zbl 0247.60005 |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.