Cutting-plane training of structural SVMs. (English) Zbl 1235.68161

Summary: Discriminative training approaches like structural SVMs have shown much promise for building highly complex and accurate models in areas like natural language processing, protein structure prediction, and information retrieval. However, current training algorithms are computationally expensive or intractable on large datasets. To overcome this bottleneck, this paper explores how cutting-plane methods can provide fast training not only for classification SVMs, but also for structural SVMs. We show that for an equivalent “1-slack” reformulation of the linear SVM training problem, our cutting-plane method has time complexity linear in the number of training examples. In particular, the number of iterations does not depend on the number of training examples; it grows only linearly with the regularization parameter \(C\) and with the inverse of the desired precision \(\epsilon\). Furthermore, we present an extensive empirical evaluation of the method applied to binary classification, multi-class classification, HMM sequence tagging, and CFG parsing. The experiments show that the cutting-plane algorithm is broadly applicable and fast in practice. On large datasets, it is typically several orders of magnitude faster than conventional decomposition-based training methods such as SVM-light, and than conventional cutting-plane methods. Implementations of our methods are available at http://www.joachims.org.
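For binary classification, the 1-slack reformulation replaces the usual \(n\) slack variables by a single slack \(\xi\) shared across the \(2^n\) combined constraints \(\frac{1}{n} w^\top \sum_{i=1}^n c_i y_i x_i \ge \frac{1}{n}\sum_{i=1}^n c_i - \xi\), one for each \((c_1,\dots,c_n)\in\{0,1\}^n\). The cutting-plane method keeps only a small working set of these constraints: each iteration adds the most violated one (set \(c_i = 1\) exactly for the margin violators) and re-solves the quadratic program over the working set, stopping once no constraint is violated by more than \(\epsilon\). The Python sketch below illustrates that loop under stated assumptions: the function name train_1slack_svm is hypothetical, the bias term is omitted, and the small working-set QP is solved with SciPy's general-purpose SLSQP routine rather than the specialized dual QP solver of the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def train_1slack_svm(X, y, C=1.0, eps=1e-3, max_iter=100):
    """Sketch of 1-slack cutting-plane training for a linear binary SVM.

    X: (n, d) feature matrix; y: labels in {-1, +1}.
    """
    n, d = X.shape
    working_set = []                       # cutting planes (a_k, b_k)
    w, xi = np.zeros(d), 0.0
    for _ in range(max_iter):
        # Separation oracle: c_i = 1 exactly for the margin violators
        # yields the most violated combined constraint  w @ a >= b - xi.
        c = (y * (X @ w) < 1.0).astype(float)
        a = (c[:, None] * y[:, None] * X).mean(axis=0)  # (1/n) sum_i c_i y_i x_i
        b = c.mean()                                    # (1/n) sum_i c_i
        if b - w @ a <= xi + eps:          # every constraint eps-satisfied
            break
        working_set.append((a, b))
        # Re-solve  min_{w, xi >= 0}  0.5 ||w||^2 + C xi  over the working set.
        cons = [{'type': 'ineq',
                 'fun': lambda z, ak=ak, bk=bk: z[:d] @ ak - bk + z[d]}
                for (ak, bk) in working_set]
        cons.append({'type': 'ineq', 'fun': lambda z: z[d]})  # xi >= 0
        res = minimize(lambda z: 0.5 * z[:d] @ z[:d] + C * z[d],
                       np.append(w, xi), constraints=cons, method='SLSQP')
        w, xi = res.x[:d], res.x[d]
    return w

# Hypothetical usage on synthetic data:
# X = np.random.randn(200, 5); y = np.sign(X[:, 0] + 0.1 * np.random.randn(200))
# w = train_1slack_svm(X, y, C=10.0)
```

Since each iteration adds one cut and the iteration count is bounded independently of \(n\), the dominant per-iteration cost is the \(O(nd)\) separation oracle, which is the source of the linear-in-\(n\) training time claimed above.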

MSC:

68T05 Learning and adaptive systems in artificial intelligence
