
Simultaneous adaptation to the margin and to complexity in classification. (English) Zbl 1209.62146

Summary: We consider the problem of adaptation to the margin and to complexity in binary classification. We suggest an exponential weighting aggregation scheme and use it to construct classifiers that adapt automatically to both the margin and the complexity. Two main examples are worked out in which adaptivity is achieved in the frameworks proposed by I. Steinwart and C. Scovel [Learning Theory. Lect. Notes Comput. Sci. 3559, 279–294 (2005; Zbl 1137.68564); Ann. Stat. 35, No. 2, 575–607 (2007; Zbl 1127.68091)] and A. B. Tsybakov [Ann. Stat. 32, No. 1, 135–166 (2004; Zbl 1105.62353)]. Unlike adaptive schemes such as ERM or penalized ERM, which usually involve a minimization step, our procedure requires none.
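The aggregation step admits a simple closed-form description: given base classifiers f_1, …, f_M with values in {-1, +1}, assign to each a weight proportional to exp(-n A_n(f_j)/β), where A_n is an empirical risk and β a temperature parameter, and predict with the sign of the weighted vote. The following Python sketch is illustrative only: the use of the 0-1 empirical risk, the temperature beta, and the toy decision stumps are assumptions made for exposition, not the paper's exact construction.

    import numpy as np

    def exponential_weighting_aggregate(classifiers, X, y, beta=1.0):
        """Aggregate {-1,+1}-valued classifiers by exponential weighting.

        Illustrative sketch: weight_j is proportional to
        exp(-n * empirical_risk_j / beta); the aggregate predicts the
        sign of the weighted vote. The risk and temperature choices
        here are assumptions, not the paper's exact tuning.
        """
        n = len(y)
        # Empirical 0-1 risk of each base classifier on the sample.
        risks = np.array([np.mean(f(X) != y) for f in classifiers])
        # Exponential weights; shift by the minimum risk for numerical stability.
        weights = np.exp(-n * (risks - risks.min()) / beta)
        weights /= weights.sum()

        def aggregate(X_new):
            # Convex combination of base predictions, then back to {-1, +1}.
            votes = sum(w * f(X_new) for w, f in zip(weights, classifiers))
            return np.sign(votes)

        return aggregate, weights

    # Toy usage with hypothetical decision stumps on synthetic data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=200)
    y = np.sign(X + 0.3 * rng.normal(size=200))
    stumps = [lambda x, t=t: np.sign(x - t) for t in (-0.5, 0.0, 0.5)]
    predict, weights = exponential_weighting_aggregate(stumps, X, y)

Note that the weights are computed in closed form from the empirical risks, so no minimization over the base class is performed; this is the sense in which the procedure avoids the minimization step required by ERM-type schemes.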

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T10 Pattern recognition, speech recognition

Software:

AdaBoost.MH

References:

[1] Audibert, J.-Y. and Tsybakov, A. B. (2005). Fast learning rates for plug-in classifiers under the margin condition. Preprint PMA-998. Available at www.proba.jussieu.fr/mathdoc/preprints/index.html#2005. · Zbl 1118.62041 · doi:10.1214/009053606000001217
[2] Bartlett, P., Jordan, M. and McAuliffe, J. (2006). Convexity, classification and risk bounds. J. Amer. Statist. Assoc. 101 138–156. · Zbl 1118.62330 · doi:10.1198/016214505000000907
[3] Birgé, L. (2006). Model selection via testing: An alternative to (penalized) maximum likelihood estimators. Ann. Inst. H. Poincaré Probab. Statist. 42 273–325. · Zbl 1333.62094 · doi:10.1016/j.anihpb.2005.04.004
[4] Blanchard, G., Bousquet, O. and Massart, P. (2004). Statistical performance of support vector machines. Available at ida.first.fraunhofer.de/~blanchard/publi/index.html. · Zbl 1133.62044
[5] Blanchard, G., Lugosi, G. and Vayatis, N. (2004). On the rate of convergence of regularized boosting classifiers. J. Mach. Learn. Res. 4 861–894. · Zbl 1083.68109 · doi:10.1162/1532443041424319
[6] Boucheron, S., Bousquet, O. and Lugosi, G. (2005). Theory of classification: A survey of some recent advances. ESAIM Probab. Stat. 9 323–375. · Zbl 1136.62355 · doi:10.1051/ps:2005018
[7] Bühlmann, P. and Yu, B. (2002). Analyzing bagging. Ann. Statist. 30 927–961. · Zbl 1029.62037 · doi:10.1214/aos/1031689014
[8] Buckland, S. T., Burnham, K. P. and Augustin, N. H. (1997). Model selection: An integral part of inference. Biometrics 53 603–618. · Zbl 0885.62118 · doi:10.2307/2533961
[9] Bunea, F. and Nobel, A. (2005). Sequential procedures for aggregating arbitrary estimators of a conditional mean. Technical Report M984, Dept. Statistics, Florida State Univ. · Zbl 1329.62359
[10] Bunea, F., Tsybakov, A. B. and Wegkamp, M. (2007). Aggregation for Gaussian regression. Ann. Statist. 35 1674–1697. · Zbl 1209.62065 · doi:10.1214/009053606000001587
[11] Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. École d'Été de Probabilités de Saint-Flour 2001. Lecture Notes in Math. 1851. Springer, Berlin. · Zbl 1076.93002 · doi:10.1007/b99352
[12] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge Univ. Press. · Zbl 1114.91001
[13] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning 20 273–297. · Zbl 0831.68098
[14] Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge Univ. Press. · Zbl 0994.68074
[15] Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York. · Zbl 0853.68150
[16] Dudley, R. M. (1974). Metric entropy of some classes of sets with differentiable boundaries. J. Approximation Theory 10 227–236. · Zbl 0275.41011 · doi:10.1016/0021-9045(74)90120-8
[17] Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. System Sci. 55 119–139. · Zbl 0880.68103 · doi:10.1006/jcss.1997.1504
[18] Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion). Ann. Statist. 28 337–407. · Zbl 1106.62323 · doi:10.1214/aos/1016218223
[19] Hartigan, J. A. (2002). Bayesian regression using Akaike priors. Preprint, Dept. Statistics, Yale Univ.
[20] Juditsky, A. B. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681–712. · Zbl 1105.62338 · doi:10.1214/aos/1015951994
[21] Juditsky, A. B., Nazin, A. V., Tsybakov, A. B. and Vayatis, N. (2005). Recursive aggregation of estimators by the mirror descent method with averaging. Problems Inform. Transmission 41 368–384. · Zbl 1123.62044 · doi:10.1007/s11122-006-0005-2
[22] Koltchinskii, V. (2001). Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory 47 1902–1914. · Zbl 1008.62614 · doi:10.1109/18.930926
[23] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization (with discussion). Ann. Statist. 34 2593–2706. · Zbl 1118.62065 · doi:10.1214/009053606000001019
[24] Koltchinskii, V. and Panchenko, D. (2000). Rademacher penalties and bounding the risk of function learning. In High Dimensional Probability II (E. Giné, D. M. Mason and J. A. Wellner, eds.) 443–457. Birkhäuser, Boston. · Zbl 1106.68385
[25] Korostelëv, A. P. and Tsybakov, A. B. (1993). Minimax Theory of Image Reconstruction. Lecture Notes in Statist. 82. Springer, New York. · Zbl 0833.62039
[26] Leung, G. and Barron, A. (2006). Information theory and mixing least-squares regressions. IEEE Trans. Inform. Theory 52 3396–3410. · Zbl 1309.94051 · doi:10.1109/TIT.2006.878172
[27] Lin, Y. (1999). A note on margin-based loss functions in classification. Technical Report 1029r, Dept. Statistics, Univ. Wisconsin-Madison.
[28] Lugosi, G. and Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods. Ann. Statist. 32 30–55. · Zbl 1105.62319 · doi:10.1214/aos/1079120129
[29] Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random penalties. Ann. Statist. 32 1679–1697. · Zbl 1045.62060 · doi:10.1214/009053604000000463
[30] Mammen, E. and Tsybakov, A. B. (1995). Asymptotical minimax recovery of sets with smooth boundaries. Ann. Statist. 23 502–524. · Zbl 0834.62038 · doi:10.1214/aos/1176324533
[31] Mammen, E. and Tsybakov, A. B. (1999). Smooth discrimination analysis. Ann. Statist. 27 1808–1829. · Zbl 0961.62058 · doi:10.1214/aos/1017939240
[32] Massart, P. (2000). Some applications of concentration inequalities to statistics. Probability theory. Ann. Fac. Sci. Toulouse Math. (6) 9 245–303. · Zbl 0986.62002 · doi:10.5802/afst.961
[33] Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Math. 1896. Springer, Berlin. · Zbl 1170.60006
[34] Massart, P. and Nédélec, E. (2006). Risk bounds for statistical learning. Ann. Statist. 34 2326–2366. · Zbl 1108.62007 · doi:10.1214/009053606000000786
[35] Nemirovski, A. (2000). Topics in non-parametric statistics. Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Math. 1738 85–277. Springer, Berlin. · Zbl 0998.62033
[36] Schapire, R. E., Freund, Y., Bartlett, P. and Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651–1686. · Zbl 0929.62069 · doi:10.1214/aos/1024691352
[37] Schölkopf, B. and Smola, A. (2002). Learning with Kernels. MIT Press, Cambridge. · Zbl 1019.68094
[38] Steinwart, I. and Scovel, C. (2005). Fast rates for support vector machines. Learning Theory. Lecture Notes in Comput. Sci. 3559 279–294. Springer, Berlin. · Zbl 1137.68564 · doi:10.1007/b137542
[39] Steinwart, I. and Scovel, C. (2007). Fast rates for support vector machines using Gaussian kernels. Ann. Statist. 35 575–607. · Zbl 1127.68091 · doi:10.1214/009053606000001226
[40] Tarigan, B. and van de Geer, S. A. (2006). Classifiers of support vector machine type with \(l_1\) complexity regularization. Bernoulli 12 1045–1076. · Zbl 1118.62067 · doi:10.3150/bj/1165269150
[41] Tsybakov, A. B. (2003). Optimal rates of aggregation. Learning Theory and Kernel Machines. Lecture Notes in Artificial Intelligence 2777 303–313. Springer, Heidelberg. · Zbl 1208.62073
[42] Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135–166. · Zbl 1105.62353 · doi:10.1214/aos/1079120131
[43] Tsybakov, A. B. and van de Geer, S. A. (2005). Square root penalty: Adaptation to the margin in classification and in edge estimation. Ann. Statist. 33 1203–1224. · Zbl 1080.62047 · doi:10.1214/009053604000001066
[44] van de Geer, S. (2000). Applications of Empirical Process Theory. Cambridge Univ. Press. · Zbl 0953.62049
[45] Vovk, V. (1990). Aggregating strategies. In Proc. Third Annual Workshop on Computational Learning Theory (M. Fulk and J. Case, eds.) 371–383. Morgan Kaufmann, San Mateo, CA.
[46] Yang, Y. (2000). Combining different procedures for adaptive regression. J. Multivariate Anal. 74 135–161. · Zbl 0964.62032 · doi:10.1006/jmva.1999.1884
[47] Yang, Y. (2000). Mixing strategies for density estimation. Ann. Statist. 28 75–87. · Zbl 1106.62322 · doi:10.1214/aos/1016120365
[48] Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist. 32 56–85. · Zbl 1105.62323 · doi:10.1214/aos/1079120130