Classifiers of support vector machine type with \(\ell_1\) complexity regularization. (English) Zbl 1118.62067

Summary: We study the binary classification problem with hinge loss. We consider classifiers that are linear combinations of base functions. Instead of an \(\ell_2\) penalty, which is used by the support vector machine, we put an \(\ell_1\) penalty on the coefficients. Under certain conditions on the base functions, hinge loss with this complexity penalty is shown to lead to an oracle inequality involving both model complexity and margin.


62H30 Classification and discrimination; cluster analysis (statistical aspects)
60E15 Inequalities; stochastic orderings
62C10 Bayesian problems; characterization of Bayes procedures


Full Text: DOI Euclid


[1] Audibert, J.-Y. (2004) Classification under polynomial entropy and margin assumptions and randomized estimators. Preprint PMA-908, Laboratoire de Probabilités et Modèles Aléatoires. http://www.proba.jussieu.fr/mathdoc/textes/PMA-908.pdf URL:
[2] Bartlett, P.L., Jordan, M.I. and McAuliffe, J.D. (2006) Convexity, classification and risk bounds. J. Amer. Statist. Assoc., 101, 138-156. · Zbl 1118.62330 · doi:10.1198/016214505000000907
[3] Blanchard, G., Lugosi, G. and Vayatis, N. (2003) On the rate of convergence of regularized boosting classifiers. J. Mach. Learn. Res., 4, 861-894. · Zbl 1083.68109 · doi:10.1162/1532443041424319
[4] Blanchard, G., Bousquet, O. and Massart, P. (2004) Statistical performance of support vector machines. Manuscript. http://ida.first.fraunhofer.de/ blanchard/publi/index.html URL: · Zbl 1133.62044
[5] Boser, B., Guyon, I. and Vapnik, V.N. (1992) A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pp. 142-152. New York: Association for Computing Machine.
[6] Candès, E.J. and Donoho, D.L. (2004) New tight frames of curvelets and optimal representations of objects with piecewise C2 singularities. Comm. Pure Appl. Math., 57, 219-266. · Zbl 1038.94502 · doi:10.1002/cpa.10116
[7] Donoho, D.L. (1995) Denoising via soft-thresholding. IEEE Trans. Inform. Theory, 41, 613-627. · Zbl 0820.62002 · doi:10.1109/18.382009
[8] Donoho, D.L. (1999) Wedgelets: nearly minimax estimation of edges. Ann. Statist., 27, 859-897. · Zbl 0957.62029 · doi:10.1214/aos/1018031261
[9] Donoho, D.L. (2004a) For most large underdetermined systems of equations, the minimal ’1-norm near-solution approximates the sparsest near-solution. Technical report, Stanford University. http://www-stat.stanford.edu/ donoho/Reports/2004/l1l0approx.pdf URL:
[10] Donoho, D.L. (2004b) For most large underdetermined systems of linear equations, the minimal ’1-norm solution is also the sparsest solution. Technical report, Stanford University. http://www-stat.stanford.edu/ donoho/Reports/2004/l1l0EquivCorrected.pdf URL: · Zbl 1113.15004
[11] Hardy, G.H., Littlewood, J.E. and Poĺya, G. (1988) Inequalities, 2nd edn. Cambridge: Cambridge University Press.
[12] Hastie, T., Tibshirani, R. and Friedman, J. (2001) The Elements of Statistical Learning. Data Mining, Inference and Prediction. New York: Springer-Verlag. · Zbl 0973.62007
[13] Koltchinskii, V. (2001) Rademacher penalties and structural risk minimization. IEEE Trans. Inform. Theory, 47, 1902-1914. · Zbl 1008.62614 · doi:10.1109/18.930926
[14] Koltchinskii, V. (2006) Local Rademacher complexities and oracle inequalities in risk minimization. To appear in Ann. Statist., 34(6). · Zbl 1118.62065 · doi:10.1214/009053606000001019
[15] Koltchinskii, V. and Panchenko, D. (2002) Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist., 30, 1-50. · Zbl 1012.62004
[16] Ledoux, M. (1997) On Talagrandś deviation inequalities for product measures. ESAIM Probab. Statist., 1, 63-87. · Zbl 0869.60013 · doi:10.1051/ps:1997103
[17] Ledoux, M. and Talagrand, M. (1991) Probability in Banach Spaces: Isoperimetry and Processes. New York: Springer-Verlag. · Zbl 0748.60004
[18] Lin, Y. (2002) Support vector machines and the Bayes rule in classification. Data Min. Knowledge Discovery, 6, 259-275. · Zbl 05660804 · doi:10.1023/A:1015469627679
[19] Loubes, J.-M. and van de Geer, S. (2002) Adaptive estimation in regression, using soft thresholding type penalties. Statist. Neerlandica, 56, 453-478.
[20] Lugosi, G. and Wegkamp, M. (2004) Complexity regularization via localized random penalties. Ann. Statist., 32, 1679-1697. · Zbl 1045.62060 · doi:10.1214/009053604000000463
[21] Mammen, E. and Tsybakov, A.B. (1999) Smooth discrimination analysis. Ann. Statist., 27, 1808-1829. · Zbl 0961.62058 · doi:10.1214/aos/1017939240
[22] Massart, P. (2000) About the constants in Talagrandś concentration inequalities for empirical processes. Ann. Probab., 28, 863-884. · Zbl 1140.60310 · doi:10.1214/aop/1019160263
[23] Schölkopf, B. and Smola, A. (2002) Learning with Kernels. Cambridge, MA: MIT Press. · Zbl 1019.68094
[24] Scott, C. and Nowak, R. (2006) Minimax-optimal classification with dyadic decision trees. IEEE Trans. Inform. Theory, 52, 1335-1353. · Zbl 1318.62216 · doi:10.1109/TIT.2006.871056
[25] Shorack, G.R. andWellner, J.A. (1986) Empirical Processes with Applications to Statistics.NewYork:Wiley.
[26] Steinwart, I. and Scovel, S. (2005) Fast rates for support vector machines using Gaussian kernels. Technical report LA-UR 04-8796, Los Alamos National Laboratory. http://www.c3.lanl.gov/ml/pubs/2004_fastratesa/paper.pdf URL: · Zbl 1137.68564
[27] Tarigan, B. and van de Geer, S.A. (2004) Adaptivity of support vector machines with ’1 penalty. Technical report MI 2004-14, University of Leiden. http://www.stat.math.ethz.ch/ geer/reports.html URL:
[28] Tibshirani, R. (1996) Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B, 58, 267-288. JSTOR: · Zbl 0850.62538
[29] Tsybakov, A.B. (2004) Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32, 135-166. · Zbl 1105.62353 · doi:10.1214/aos/1079120131
[30] Tsybakov, A.B. and van de Geer, S.A. (2005) Square root penalty: adaptation to the margin in classification and in edge estimation. Ann. Statist., 33, 1203-1224. · Zbl 1080.62047 · doi:10.1214/009053604000001066
[31] van de Geer, S. (2000) Empirical Processes in M-Estimation. Cambridge: Cambridge University Press. · Zbl 0953.62049
[32] van de Geer, S. (2003) Adaptive quantile regression. In M.G. Akritas and D.N. Politis (eds), Recent Advances and Trends in Nonparametric Statistics, pp. 235-250. Amsterdam: Elsevier.
[33] Vapnik, V.N. (1995) The Nature of Statistical Learning Theory. New York: Springer-Verlag. · Zbl 0833.62008
[34] Vapnik, V.N. (1998) Statistical Learning Theory. New York: Wiley. · Zbl 0935.62007
[35] Zhang, T. (2004) Statistical behaviour and consistency of classification methods based on convex risk minimization. Ann. Statist., 32, 56-84. · Zbl 1105.62323 · doi:10.1214/aos/1079120130
[36] Zhu, J., Rosset, S., Hastie, T. and Tibshirani, R. (2003) 1-norm support vector machines. Neural Inform. Process. Syst., 16.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.