A novel Frank-Wolfe algorithm. Analysis and applications to large-scale SVM training. (English) Zbl 1355.68234

Summary: Recently, there has been renewed interest in the machine learning community in variants of a sparse greedy approximation procedure for concave optimization known as the Frank-Wolfe (FW) method. In particular, this procedure has been successfully applied to train large-scale instances of non-linear Support Vector Machines (SVMs). Specializing FW to SVM training has yielded efficient algorithms as well as important theoretical results, including convergence analyses of training algorithms and new characterizations of model sparsity.
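For concreteness, the basic FW procedure on the unit simplex can be sketched for a concave quadratic objective (a minimal illustration under assumed names `frank_wolfe_simplex`, `Q`, `b`; this is not code from the paper):

```python
import numpy as np

def frank_wolfe_simplex(Q, b, iters=200):
    """Sketch of the basic Frank-Wolfe method for maximizing the concave
    quadratic f(x) = b.x - 0.5 x.Q.x over the unit simplex.
    Each iterate is a sparse convex combination of simplex vertices."""
    n = len(b)
    x = np.full(n, 1.0 / n)          # feasible starting point
    for _ in range(iters):
        g = b - Q @ x                # gradient of f at x
        i = int(np.argmax(g))        # linear subproblem over the simplex
        d = -x                       # is maximized at a vertex e_i,
        d[i] += 1.0                  # giving the FW direction d = e_i - x
        # exact line search for a quadratic along d, clipped to [0, 1]
        denom = d @ Q @ d
        gamma = 1.0 if denom <= 0 else min(1.0, max(0.0, (g @ d) / denom))
        x = x + gamma * d
    return x
```

Because each step mixes in a single vertex, after k iterations the iterate has at most k + n nonzero coordinates, which is the source of the sparsity properties mentioned above.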
In this paper, we present and analyze a novel variant of the FW method based on a new way to perform away steps, a classic strategy used to accelerate the convergence of the basic FW procedure. Our formulation and analysis are focused on a general concave maximization problem over the simplex. However, the specialization of our algorithm to quadratic forms is strongly related to some classic methods in computational geometry, namely the Gilbert and MDM algorithms.
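The classic away-step scheme that the paper builds on (due to Wolfe, and analyzed by Guélat and Marcotte) can be sketched for the same simplex-constrained quadratic; `away_step_fw`, `Q`, and `b` are hypothetical names used only for illustration, not the paper's new variant:

```python
import numpy as np

def away_step_fw(Q, b, iters=200):
    """Sketch of Frank-Wolfe with classic away steps for maximizing
    f(x) = b.x - 0.5 x.Q.x over the unit simplex. An away step reduces
    the weight of the worst active vertex, which removes the sublinear
    zigzagging of plain FW near boundary solutions."""
    n = len(b)
    x = np.zeros(n)
    x[0] = 1.0                            # start at a vertex
    for _ in range(iters):
        g = b - Q @ x                     # gradient of f at x
        i = int(np.argmax(g))             # toward vertex: best ascent vertex
        active = np.flatnonzero(x > 1e-12)
        j = int(active[np.argmin(g[active])])  # away vertex: worst active one
        d_fw = -x; d_fw[i] += 1.0         # toward direction e_i - x
        d_aw = x.copy(); d_aw[j] -= 1.0   # away direction x - e_j
        if g @ d_fw >= g @ d_aw:          # pick the steeper ascent direction
            d, gamma_max = d_fw, 1.0
        else:                             # x[j] < 1 here, since an away step
            d, gamma_max = d_aw, x[j] / (1.0 - x[j])  # cap keeps x[j] >= 0
        denom = d @ Q @ d                 # exact line search, clipped
        gamma = gamma_max if denom <= 0 else min(gamma_max, max(0.0, (g @ d) / denom))
        x = x + gamma * d
    return x
```

A maximal away step (gamma = gamma_max) drops a vertex from the active set entirely, and it is this ability to remove bad vertices that underlies the linear convergence rate discussed next.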
On the theoretical side, we show that the method matches the guarantees, in terms of convergence rate and number of iterations, obtained with classic away steps. In particular, the method enjoys a linear rate of convergence, a result recently proved for the MDM algorithm on quadratic forms.
On the practical side, we provide experiments on several classification datasets and evaluate the results using statistical tests. The experiments show that our method is faster than the FW method with classic away steps, and works well even in cases where classic away steps slow down the algorithm. Furthermore, these improvements are obtained without sacrificing the predictive accuracy of the resulting SVM model.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
