##
**Complexities of convex combinations and bounding the generalization error in classification.**
*(English)*
Zbl 1080.62045

From the paper: Since the invention of enemble classification methods (such as boosting), the convex hull \(\text{conv}({\mathcal H})\) of a given base function class \({\mathcal H}\) has become an important object of study in the machine learning literature. The reason is that the ensemble algorithms typically output classifiers that are convex combinations of simple classifiers selected by the algorithm from the base class \({\mathcal H}\), and, because of this, measuring the complexity of the whole convex hull as well as of its subsets becomes very important in the analysis of the generalization error of ensemble classifiers. Another important feature of boosting and many other ensemble methods is that they belong to the class of so-called large margin methods, that is, they are based on optimization of the empirical risk with respect to various loss functions that penalize not only for a misclassification (a negative classification margin), but also for a correct classification with too small positive margin. Thus, the very nature of these methods is to produce classifiers that tend to have rather large positive classification margins on the training data. Finding such classifiers becomes possible since the algorithms search for them in rather huge function classes (such as convex hulls of typical VC-classes used in classification).

We introduce and study several measures of complexity of functions from the convex hull of a given base class. These complexity measures take into account the sparsity of the weights of a convex combination as well as certain clustering properties of the base functions involved in it. We prove new upper confidence bounds on the generalization error of ensemble (voting) classification algorithms that utilize the new complexity measures along with the empirical distributions of classification margins, providing a better explanation of generalization performance of large margin classification methods.

We introduce and study several measures of complexity of functions from the convex hull of a given base class. These complexity measures take into account the sparsity of the weights of a convex combination as well as certain clustering properties of the base functions involved in it. We prove new upper confidence bounds on the generalization error of ensemble (voting) classification algorithms that utilize the new complexity measures along with the empirical distributions of classification margins, providing a better explanation of generalization performance of large margin classification methods.

### MSC:

62H30 | Classification and discrimination; cluster analysis (statistical aspects) |

65Y20 | Complexity and performance of numerical algorithms |

68T99 | Artificial intelligence |

62G05 | Nonparametric estimation |

62G20 | Asymptotic properties of nonparametric inference |

60F15 | Strong limit theorems |

68Q32 | Computational learning theory |

68Q25 | Analysis of algorithms and problem complexity |

PDF
BibTeX
XML
Cite

\textit{V. Koltchinskii} and \textit{D. Panchenko}, Ann. Stat. 33, No. 4, 1455--1496 (2005; Zbl 1080.62045)

### References:

[1] | Andonova, S. (2004). Theoretical and experimental analysis of the generalization ability of some statistical learning algorithms. Ph.D. dissertation, Boston Univ. |

[2] | Antos, A., Kégl, B., Linder, T. and Lugosi, G. (2002). Data-dependent margin-based generalization bounds for classification. J. Mach. Learn. Res. 3 73–98. · Zbl 1088.68688 |

[3] | Audibert, J.-Y. (2004). Aggregated estimators and empirical complexity for least square regression. Ann. Inst. H. Poincaré Probab. Statist. 40 685–736. · Zbl 1052.62037 |

[4] | Bartlett, P. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory 44 525–536. · Zbl 0901.68177 |

[5] | Bartlett, P., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497–1537. · Zbl 1083.62034 |

[6] | Bartlett, P., Jordan, M. and McAuliffe, J. (2005). Convexity, classification and risk bounds. J. Amer. Statist. Assoc. · Zbl 1118.62330 |

[7] | Blanchard, G., Lugosi, G. and Vayatis, N. (2004). On the rate of convergence of regularized boosting classifiers. J. Mach. Learn. Res. 4 861–894. · Zbl 1083.68109 |

[8] | Bousquet, O., Koltchinskii, V. and Panchenko, D. (2002). Some local measures of complexity of convex hulls and generalization bounds. Computational Learning Theory . Lecture Notes in Artificial Intelligence 2375 59–73. Springer, Berlin. · Zbl 1050.68055 |

[9] | Breiman, L. (1996). Bagging predictors. Machine Learning 24 123–140. · Zbl 0858.68080 |

[10] | Breiman, L. (1998). Arcing classifiers (with discussion). Ann. Statist. 26 801–849. · Zbl 0934.62064 |

[11] | Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation 11 1493–1517. |

[12] | Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition . Springer, New York. · Zbl 0853.68150 |

[13] | Dudley, R. M. (1999). Uniform Central Limit Theorems . Cambridge Univ. Press. · Zbl 0951.60033 |

[14] | Haussler, D. (1995). Sphere packing numbers for subsets of the Boolean \(n\)-cube with bounded Vapnik–Chervonenkis dimension. J. Combin. Theory Ser. A 69 217–232. · Zbl 0818.60005 |

[15] | Jiang, W. (2004). Process consistency for AdaBoost. Ann. Statist. 32 13–29. · Zbl 1105.62316 |

[16] | Koltchinskii, V. (2003). Bounds on margin distributions in learning problems. Ann. Inst. H. Poincaré Probab. Statist. 39 943–978. · Zbl 1031.60017 |

[17] | Koltchinskii, V. (2003). Local Rademacher complexities and oracle inequalities in risk minimization. · Zbl 1118.62065 |

[18] | Koltchinskii, V. and Panchenko, D. (2000). Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II (E. Giné, D. Mason and J. Wellner, eds.) 443–459. Birkhäuser, Boston. · Zbl 1106.68385 |

[19] | Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30 1–50. · Zbl 1012.62004 |

[20] | Koltchinskii, V., Panchenko, D. and Andonova, S. (2003). Generalization bounds for voting classifiers based on sparsity and clustering. Lecture Notes in Artificial Intelligence . · Zbl 1274.68328 |

[21] | Koltchinskii, V., Panchenko, D. and Lozano, F. (2003). Bounding the generalization error of convex combinations of classifiers: Balancing the dimensionality and the margins. Ann. Appl. Probab. 13 213–252. · Zbl 1073.62535 |

[22] | Lugosi, G. and Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods. Ann. Statist. 32 30–55. · Zbl 1105.62319 |

[23] | Massart, P. (2000). Some applications of concentration inequalities to statistics. Probability theory. Ann. Fac. Sci. Toulouse Math. (6) 9 245–303. · Zbl 0986.62002 |

[24] | Panchenko, D. (2001). A note on Talagrand’s concentration inequality. Electron. Comm. Probab. 6 55–65. · Zbl 0977.60008 |

[25] | Panchenko, D. (2002). Some extensions of an inequality of Vapnik and Chervonenkis. Electron. Comm. Probab. 7 55–65. · Zbl 1008.60035 |

[26] | Panchenko, D. (2003). Symmetrization approach to concentration inequalities for empirical processes. Ann. Probab. 31 2068–2081. · Zbl 1042.60008 |

[27] | Pisier, G. (1981). Remarques sur un résultat non publié de B. Maurey. In Seminar on Functional Analysis , 1980–1981 , École Polytechnic , Palaiseau . Exp. no. V, 13 pp. · Zbl 0491.46017 |

[28] | Schapire, R., Freund, Y., Bartlett, P. and Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Ann. Statist. 26 1651–1686. · Zbl 0929.62069 |

[29] | Schapire, R. and Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning 37 297–336. · Zbl 0945.68194 |

[30] | Steinwart, I. (2004). Sparseness of support vector machines. J. Mach. Learn. Res. 4 1071–1105. · Zbl 1094.68082 |

[31] | Tsybakov, A. (2003). Optimal rates of aggregation. Lecture Notes in Artificial Intelligence . · Zbl 1208.62073 |

[32] | van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes . With Applications to Statistics . Springer, New York. · Zbl 0862.60002 |

[33] | Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist. 32 56–85. · Zbl 1105.62323 |

[34] | Zhang, T. and Yu, B. (2005). Boosting with early stopping: Convergence and consistency. Ann. Statist. 33 1538–1579. · Zbl 1078.62038 |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.