Learning sparse classifiers with difference of convex functions algorithms. (English) Zbl 1282.90181

Summary: Sparsity of a classifier is a desirable property for high-dimensional data and large sample sizes. This paper investigates two complementary notions of sparsity for binary classification: sparsity in the number of features and sparsity in the number of examples. Several losses and regularizers are considered: the hinge and ramp losses, and \(\ell_{2}\), \(\ell_{1}\), approximate \(\ell_{0}\), and capped \(\ell_{1}\) regularization. We propose three new objective functions that further promote sparsity: the hinge loss with capped \(\ell_{1}\) regularization, and the ramp-loss versions of approximate \(\ell_{0}\) and capped \(\ell_{1}\) regularization. We derive difference of convex functions algorithms (DCA) for solving these novel non-convex objective functions. The proposed algorithms are shown to converge in a finite number of iterations to a local minimum. Using simulated data and several data sets from the University of California Irvine (UCI) machine learning repository, we empirically investigate the fraction of features and examples required by the different classifiers.
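To illustrate the DCA scheme described in the summary, the sketch below applies it to one of the three objectives: hinge loss with capped \(\ell_{1}\) regularization, \(\min(|w_j|, a)\). The penalty is split as a difference of convex functions, \(g(w) = \text{hinge} + \lambda\|w\|_1\) minus \(h(w) = \lambda\sum_j \max(|w_j|-a, 0)\); each outer DCA step linearizes \(h\) at the current iterate and solves the resulting convex subproblem. This is not the authors' implementation: the subproblem solver (plain subgradient descent), step sizes, and parameter names are all illustrative assumptions.

```python
import numpy as np

def capped_l1_dca(X, y, lam=0.01, cap=1.0, outer_iters=10, inner_iters=200, lr=0.05):
    """DCA sketch for:  min_w  mean hinge(y, Xw) + lam * sum_j min(|w_j|, cap).

    DC decomposition: g(w) = mean hinge + lam*||w||_1   (convex)
                      h(w) = lam * sum_j max(|w_j| - cap, 0)  (convex)
    Each outer step replaces h by its linearization at w_t (a subgradient v),
    then approximately solves the convex subproblem min_w g(w) - v.w.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(outer_iters):
        # Subgradient of h at the current iterate: nonzero only where |w_j| > cap,
        # which cancels the l1 penalty on large weights (the "capping" effect).
        v = lam * np.sign(w) * (np.abs(w) > cap)
        # Convex subproblem solved by plain subgradient descent (illustrative only).
        for _ in range(inner_iters):
            margins = y * (X @ w)
            mask = margins < 1  # examples violating the margin
            grad_hinge = -(X[mask] * y[mask, None]).sum(axis=0) / n
            w = w - lr * (grad_hinge + lam * np.sign(w) - v)
    return w
```

Consistent with the paper's convergence claim, the subgradient `v` takes finitely many values (one per sign/capping pattern of `w`), so the outer DCA loop stabilizes once the pattern stops changing; in practice a few outer iterations suffice.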


90C30 Nonlinear programming

