Variable selection for multicategory SVM via adaptive sup-norm regularization. (English) Zbl 1135.62056

Summary: The Support Vector Machine (SVM) is a popular classification paradigm in machine learning and has achieved great success in real applications. However, the standard SVM cannot select variables automatically, and therefore its solution typically utilizes all the input variables without discrimination. This makes it difficult to identify important predictor variables, which is often one of the primary goals in data analysis. We propose two novel types of regularization in the context of the multicategory SVM (MSVM) for simultaneous classification and variable selection. The MSVM generally requires estimation of multiple discriminating functions and applies the argmax rule for prediction. For each individual variable, we propose to characterize its importance by the sup-norm of its coefficient vector associated with the different functions, and then minimize the MSVM hinge loss function subject to a penalty on the sum of sup-norms. To further improve the sup-norm penalty, we propose an adaptive regularization, which allows different weights to be imposed on different variables according to their relative importance. Both types of regularization automate variable selection in the process of building classifiers, and lead to sparse multiclassifiers with enhanced interpretability and improved accuracy, especially for high-dimensional, low-sample-size data. One big advantage of the sup-norm penalty is its easy implementation via standard linear programming. Numerous examples and one real gene data analysis demonstrate the outstanding performance of the adaptive sup-norm penalty in various data settings.
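The quantities named in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the coefficient layout `W[j][k]` (variable j, discriminating function k), and in particular the inverse-sup-norm form of the adaptive weights are assumptions made here for illustration, since the summary only says that weights reflect relative importance. The hinge loss and the linear program that would fit the coefficients are left out.

```python
from typing import List, Optional, Sequence

Matrix = Sequence[Sequence[float]]  # W[j][k]: variable j, discriminating function k


def sup_norms(W: Matrix) -> List[float]:
    """Importance of each variable: largest absolute coefficient across functions."""
    return [max(abs(w) for w in row) for row in W]


def sup_norm_penalty(W: Matrix, weights: Optional[Sequence[float]] = None) -> float:
    """Sum of (optionally weighted) sup-norms over all variables."""
    norms = sup_norms(W)
    if weights is None:
        weights = [1.0] * len(norms)  # unweighted sup-norm penalty
    return sum(t * n for t, n in zip(weights, norms))


def adaptive_weights(W_init: Matrix, eps: float = 1e-8) -> List[float]:
    """Assumed weighting: inverse sup-norm of an initial fit, so that
    variables that looked unimportant are penalized more heavily."""
    return [1.0 / (n + eps) for n in sup_norms(W_init)]


def predict(x: Sequence[float], W: Matrix, b: Sequence[float]) -> int:
    """Argmax rule over K linear discriminating functions f_k(x) = b_k + sum_j W[j][k] x_j."""
    scores = [b[k] + sum(W[j][k] * x[j] for j in range(len(x))) for k in range(len(b))]
    return max(range(len(b)), key=lambda k: scores[k])


def selected_variables(W: Matrix, tol: float = 1e-10) -> List[int]:
    """Variables whose whole coefficient vector is nonzero, i.e. kept by the fit."""
    return [j for j, n in enumerate(sup_norms(W)) if n > tol]
```

A variable drops out of every discriminating function at once exactly when its sup-norm is zero, which is why penalizing the sum of sup-norms performs selection at the variable level rather than coefficient by coefficient.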


62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T05 Learning and adaptive systems in artificial intelligence



