
Statistical performance of support vector machines. (English) Zbl 1133.62044

Summary: The support vector machine (SVM) algorithm is well known to the machine learning community for its very good practical results. The goal of the present paper is to study this algorithm from a statistical perspective, using tools of concentration theory and empirical processes. Our main result builds on the observation, made by other authors, that the SVM can be viewed as a statistical regularization procedure. From this point of view, it can also be interpreted as a model selection principle using a penalized criterion. It is then possible to adapt general methods related to model selection in this framework to study two important points: (1) what is the minimum penalty, and how does it compare with the penalty actually used in the SVM algorithm; (2) is it possible to obtain “oracle inequalities” in this setting for the specific loss function used in the SVM algorithm? We show that the answer to the latter question is positive, which also provides relevant insight into the former. Our result shows that it is possible to obtain fast rates of convergence for SVMs.
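The “penalized criterion” view of the SVM described above can be illustrated with a minimal sketch: minimize the empirical hinge risk plus a squared-norm penalty. This is a simplified linear version for illustration only; the setting studied in the paper penalizes the RKHS norm of a kernel expansion, and the function names, step sizes, and toy data below are this sketch's own assumptions, not the paper's.

```python
import numpy as np

# Penalized criterion (linear simplification of the SVM objective):
#   (1/n) * sum_i max(0, 1 - y_i <w, x_i>)  +  lam * ||w||^2

def penalized_risk(w, X, y, lam):
    """Empirical hinge risk plus the squared-norm penalty."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return hinge.mean() + lam * (w @ w)

def fit_linear_svm(X, y, lam=0.01, lr=0.1, steps=2000):
    """Subgradient descent on the penalized criterion."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        active = y * (X @ w) < 1.0  # points violating the margin
        grad = -(y[active, None] * X[active]).sum(axis=0) / n + 2.0 * lam * w
        w -= lr * grad
    return w

# Toy separable data: positives on the right, negatives on the left.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_hat = fit_linear_svm(X, y)
preds = np.sign(X @ w_hat)
```

The regularization parameter `lam` plays the role of the penalty whose minimal admissible size the paper analyzes: too small a penalty loses the oracle-inequality guarantees, while the rate of convergence depends on how the penalty scales with the sample size.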

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T05 Learning and adaptive systems in artificial intelligence
62G05 Nonparametric estimation
62G20 Asymptotic properties of nonparametric inference
62M45 Neural nets and related approaches to inference from stochastic processes
47N30 Applications of operator theory in probability theory and statistics

References:

[1] Alon, N., Ben-David, S., Cesa-Bianchi, N. and Haussler, D. (1997). Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM 44 615-631. · Zbl 0891.68086 · doi:10.1145/263867.263927
[2] Bartlett, P. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory 44 525-536. · Zbl 0901.68177 · doi:10.1109/18.661502
[3] Bartlett, P., Bousquet, O. and Mendelson, S. (2005). Local Rademacher complexities. Ann. Statist. 33 1497-1537. · Zbl 1083.62034 · doi:10.1214/009053605000000282
[4] Bartlett, P., Jordan, M. and McAuliffe, J. (2006). Convexity, classification, and risk bounds. J. Amer. Statist. Assoc. 101 138-156. · Zbl 1118.62330 · doi:10.1198/016214505000000907
[5] Bartlett, P. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. J. Machine Learning Research 3 463-482. · Zbl 1084.68549 · doi:10.1162/153244303321897690
[6] Bartlett, P. and Mendelson, S. (2006). Empirical minimization. Probab. Theory Related Fields 135 311-334. · Zbl 1142.62348 · doi:10.1007/s00440-005-0462-3
[7] Birman, M. S. and Solomyak, M. Z. (1967). Piecewise polynomial approximations of functions of the class W_p^α. Math. USSR-Sb. 73 295-317. · Zbl 0173.16001 · doi:10.1070/SM1967v002n03ABEH002343
[8] Blanchard, G., Bousquet, O. and Zwald, L. (2007). Statistical properties of kernel principal component analysis. Machine Learning 66 259-294. · Zbl 1078.68133
[9] Blanchard, G., Lugosi, G. and Vayatis, N. (2003). On the rate of convergence of regularized boosting classifiers. J. Machine Learning Research 4 861-894. · Zbl 1083.68109 · doi:10.1162/1532443041424319
[10] Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Acad. Sci. Paris Ser. I 334 495-500. · Zbl 1001.60021 · doi:10.1016/S1631-073X(02)02292-6
[11] Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. J. Machine Learning Research 2 499-526. · Zbl 1007.68083 · doi:10.1162/153244302760200704
[12] Chen, D.-R., Wu, Q., Ying, Y. and Zhou, D.-X. (2004). Support vector machine soft margin classifiers: Error analysis. J. Machine Learning Research 5 1143-1175. · Zbl 1222.68167
[13] Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge Univ. Press. · Zbl 0994.68074
[14] Edmunds, D. E. and Triebel, H. (1996). Function Spaces , Entropy Numbers , Differential Operators . Cambridge Univ. Press. · Zbl 0865.46020
[15] Evgeniou, T., Pontil, M. and Poggio, T. (2000). Regularization networks and support vector machines. Adv. Comput. Math. 13 1-50. · Zbl 0939.68098 · doi:10.1023/A:1018946025316
[16] Koltchinskii, V. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist. 34 2593-2656. · Zbl 1118.62065 · doi:10.1214/009053606000001019
[17] Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30 1-50. · Zbl 1012.62004
[18] Lecué, G. (2007). Simultaneous adaptation to the margin and to complexity in classification. Ann. Statist. 35 1698-1721. · Zbl 1209.62146 · doi:10.1214/009053607000000055
[19] Lin, Y. (2002). Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery 6 259-275. · doi:10.1023/A:1015469627679
[20] Lugosi, G. and Wegkamp, M. (2004). Complexity regularization via localized random penalties. Ann. Statist. 32 1679-1697. · Zbl 1045.62060 · doi:10.1214/009053604000000463
[21] Massart, P. (2000). About the constants in Talagrand’s inequality for empirical processes. Ann. Probab. 28 863-884. · Zbl 1140.60310 · doi:10.1214/aop/1019160263
[22] Massart, P. (2000). Some applications of concentration inequalities in statistics. Ann. Fac. Sci. Toulouse Math. 9 245-303. · Zbl 0986.62002 · doi:10.5802/afst.961
[23] Massart, P. and Nédelec, E. (2006). Risk bounds for statistical learning. Ann. Statist. 34 2326-2366. · Zbl 1108.62007 · doi:10.1214/009053606000000786
[24] Massart, P. (2007). Concentration Inequalities and Model Selection. Lectures on Probability Theory and Statistics. Ecole d’Été de Probabilités de Saint-Flour XXXIII-2003. Lecture Notes in Math. 1896 . Springer, Berlin.
[25] Mendelson, S. (2003). Estimating the performance of kernel classes. J. Machine Learning Research 4 759-771. · Zbl 1083.68097 · doi:10.1162/1532443041424337
[26] De Vito, E., Rosasco, L., Caponnetto, A., De Giovanni, U. and Odone, F. (2005). Learning from examples as an inverse problem. J. Machine Learning Research 6 883-904. · Zbl 1222.68180
[27] Schölkopf, B., Smola, A. J. and Williamson, R. C. (2001). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators. IEEE Trans. Inform. Theory 47 2516-2532. · Zbl 1008.62507 · doi:10.1109/18.945262
[28] Shawe-Taylor, J., Williams, C., Cristianini, N. and Kandola, J. (2005). On the eigenspectrum of the Gram matrix and the generalisation error of kernel PCA. IEEE Trans. Inform. Theory 51 2510-2522. · Zbl 1310.15076 · doi:10.1109/TIT.2005.850052
[29] Schölkopf, B. and Smola, A. (2002). Learning with Kernels . MIT Press. · Zbl 1019.68094
[30] Smola, A. and Schölkopf, B. (1998). From regularization operators to support vector kernels. In Advances in Neural Information Processing Systems 10 (M. I. Jordan, M. J. Kearns and S. A. Solla, eds.) 343-349. MIT Press.
[31] Steinwart, I. (2002). Support vector machines are universally consistent. J. Complexity 18 768-791. · Zbl 1030.68074 · doi:10.1006/jcom.2002.0642
[32] Steinwart, I. and Scovel, C. (2007). Fast rates for support vector machines using Gaussian kernels. Ann. Statist. 35 575-607. · Zbl 1127.68091 · doi:10.1214/009053606000001226
[33] Steinwart, I., Hush, D. and Scovel, C. (2006). A new concentration result for regularized risk minimizers. In High-Dimensional Probability IV 260-275. IMS Lecture Notes-Monograph Series 51 . · Zbl 1127.68090 · doi:10.1214/074921706000000897
[34] Tarigan, B. and van de Geer, S. (2006). Adaptivity of support vector machines with ℓ₁ penalty. Bernoulli 12 1045-1076. · Zbl 1118.62067 · doi:10.3150/bj/1165269150
[35] Tsybakov, A. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist. 32 135-166. · Zbl 1105.62353 · doi:10.1214/aos/1079120131
[36] Tsybakov, A. and van de Geer, S. (2005). Square root penalty: Adaptation to the margin in classification and in edge estimation. Ann. Statist. 33 1203-1224. · Zbl 1080.62047 · doi:10.1214/009053604000001066
[37] Vapnik, V. (1998). Statistical Learning Theory . Wiley, New York. · Zbl 0935.62007
[38] Vapnik, V. and Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16 264-280. · Zbl 0247.60005 · doi:10.1137/1116025
[39] Yang, Y. (1999). Minimax nonparametric classification. I. Rates of convergence. IEEE Trans. Inform. Theory 45 2271-2284. · Zbl 0962.62026 · doi:10.1109/18.796368
[40] Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist. 32 56-85. · Zbl 1105.62323 · doi:10.1214/aos/1079120130
[41] Zhou, D.-X. (2003). Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Inform. Theory 49 1743-1752. · Zbl 1290.62033 · doi:10.1109/TIT.2003.813564