
SpicyMKL: a fast algorithm for multiple kernel learning with thousands of kernels. (English) Zbl 1237.68166

Summary: We propose a new optimization algorithm for multiple kernel learning (MKL) called SpicyMKL, which is applicable to general convex loss functions and general types of regularization. The proposed SpicyMKL iteratively solves smooth minimization problems; thus, there is no need to solve an SVM, LP, or QP internally. SpicyMKL can be viewed as a proximal minimization method and converges super-linearly. The cost of the inner minimization is roughly proportional to the number of active kernels; therefore, when we aim for a sparse kernel combination, our algorithm scales well as the number of kernels increases. Moreover, we give a general block-norm formulation of MKL that includes non-sparse regularizations, such as elastic-net and \(\ell_p\)-norm regularizations. Extending SpicyMKL, we propose an efficient optimization method for this general regularization framework. Experimental results show that our algorithm is faster than existing methods, especially when the number of kernels is large (\(>\)1000).
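The block-\(\ell_1\) (group-lasso) form of MKL referred to above is, in standard notation, \(\min_{f_1,\dots,f_M} \sum_{i=1}^n \ell\bigl(\sum_{m=1}^M f_m(x_i), y_i\bigr) + C \sum_{m=1}^M \|f_m\|_{\mathcal{H}_m}\), whose unsquared block norm drives whole kernels to exactly zero. As a rough illustration of how a proximal-style method produces this sparsity, below is a minimal proximal-gradient (ISTA) sketch for the finite-dimensional analogue: squared loss with explicit feature maps \(\Phi_m\), i.e. group lasso. This is not the authors' SpicyMKL (a dual augmented Lagrangian method with super-linear convergence); the function names and the squared-loss setup are illustrative assumptions only.

```python
import numpy as np

def block_soft_threshold(w, t):
    """Prox of t * ||.||_2: shrink the whole block; zero it if its norm <= t."""
    norm = np.linalg.norm(w)
    return np.zeros_like(w) if norm <= t else (1.0 - t / norm) * w

def mkl_group_lasso(features, y, C=1.0, n_iter=500):
    """ISTA for  min_w 0.5*||y - sum_m Phi_m w_m||^2 + C * sum_m ||w_m||_2,
    the finite-dimensional analogue of block-l1 MKL: each block m plays
    the role of one kernel, and the prox step zeroes inactive blocks."""
    w = [np.zeros(Phi.shape[1]) for Phi in features]
    # Step size 1/L, with L the Lipschitz constant of the smooth part's gradient.
    step = 1.0 / (np.linalg.norm(np.hstack(features), 2) ** 2)
    for _ in range(n_iter):
        residual = sum(Phi @ wm for Phi, wm in zip(features, w)) - y
        for m, Phi in enumerate(features):
            grad_m = Phi.T @ residual            # gradient w.r.t. block m
            w[m] = block_soft_threshold(w[m] - step * grad_m, step * C)
    active = [m for m in range(len(features)) if np.any(w[m])]
    return w, active

# Toy usage: 20 candidate blocks ("kernels"), only block 0 generates y.
rng = np.random.default_rng(0)
features = [rng.standard_normal((100, 5)) for _ in range(20)]
y = features[0] @ np.ones(5) + 0.1 * rng.standard_normal(100)
w, active = mkl_group_lasso(features, y, C=5.0)
print(active)  # typically a small set containing 0
```

The essential point carried over from the paper is that the prox step sets whole blocks exactly to zero rather than merely shrinking them; SpicyMKL exploits this structure so that the cost of its inner minimization grows with the number of active kernels rather than the total number.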

MSC:

68T05 Learning and adaptive systems in artificial intelligence
90C56 Derivative-free methods and methods using generalized derivatives

References:

[1] Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337-404. · Zbl 0037.20701 · doi:10.1090/S0002-9947-1950-0051437-7
[2] Asuncion, A., & Newman, D. (2007). UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[3] Bach, F. R. (2008). Consistency of the group Lasso and multiple kernel learning. Journal of Machine Learning Research, 9, 1179-1225. · Zbl 1225.68147
[4] Bach, F. R.; Lanckriet, G.; Jordan, M., Multiple kernel learning, conic duality, and the SMO algorithm, 41-48 (2004)
[5] Bach, F. R.; Thibaux, R.; Jordan, M. I., Computing regularization paths for learning multiple kernels, No. 17, 73-80 (2005), Cambridge
[6] Bertsekas, D. P. (1982). Constrained optimization and Lagrange multiplier methods. New York: Academic Press. · Zbl 0572.90067
[7] Bertsekas, D. P. (1999). Nonlinear programming. Nashua: Athena Scientific. · Zbl 1015.90077
[8] Candes, E. J., Romberg, J., & Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2), 489-509. · Zbl 1231.94017 · doi:10.1109/TIT.2005.862083
[9] Chapelle, O.; Rakotomamonjy, A., Second order optimization of kernel parameters (2008), Whistler
[10] Cortes, C. (2009). Can learning kernels help performance? Invited talk at International Conference on Machine Learning (ICML 2009), Montréal, Canada.
[11] Cortes, C.; Mohri, M.; Rostamizadeh, A., L2 regularization for learning kernels (2009), Montréal, Canada
[12] Daubechies, I., Defrise, M., & Mol, C. D. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, LVII, 1413-1457. · Zbl 1077.65055 · doi:10.1002/cpa.20042
[13] Figueiredo, M., & Nowak, R. (2003). An EM algorithm for wavelet-based image restoration. IEEE Transactions on Image Processing, 12, 906-916. · Zbl 1279.94015 · doi:10.1109/TIP.2003.814255
[14] Gehler, P. V.; Nowozin, S., Let the kernel figure it out: principled learning of pre-processing for kernel classifiers (2009)
[15] Hestenes, M. (1969). Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4, 303-320. · Zbl 0174.20705 · doi:10.1007/BF00927673
[16] Kimeldorf, G. S., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33, 82-95. · Zbl 0201.39702 · doi:10.1016/0022-247X(71)90184-3
[17] Kloft, M.; Brefeld, U.; Sonnenburg, S.; Laskov, P.; Müller, K. R.; Zien, A.; Bengio, Y. (ed.); Schuurmans, D. (ed.); Lafferty, J. (ed.); Williams, C. K. I. (ed.); Culotta, A. (ed.), Efficient and accurate ℓp-norm multiple kernel learning, No. 22, 997-1005 (2009), Cambridge
[18] Kloft, M., Rückert, U., & Bartlett, P. L. (2010). A unifying view of multiple kernel learning. arXiv:1005.0437.
[19] Lanckriet, G., Cristianini, N., Ghaoui, L. E., Bartlett, P., & Jordan, M. (2004). Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5, 27-72. · Zbl 1222.68241
[20] Micchelli, C. A., & Pontil, M. (2005). Learning the kernel function via regularization. Journal of Machine Learning Research, 6, 1099-1125. · Zbl 1222.68265
[21] Mosci, S.; Santoro, M.; Verri, A.; Villa, S., A new algorithm to learn an optimal kernel based on Fenchel duality (2008), Whistler
[22] Nath, J. S.; Dinesh, G.; Raman, S.; Bhattacharyya, C.; Ben-Tal, A.; Ramakrishnan, K. R., On the algorithmics and applications of a mixed-norm based kernel learning formulation, No. 22, 844-852 (2009), Cambridge
[23] Palmer, J.; Wipf, D.; Kreutz-Delgado, K.; Rao, B.; Weiss, Y. (ed.); Schölkopf, B. (ed.); Platt, J. (ed.), Variational EM algorithms for non-Gaussian latent variable models, No. 18, 1059-1066 (2006), Cambridge
[24] Platt, J. C., Using sparseness and analytic QP to speed training of support vector machines, No. 11, 557-563 (1999), Cambridge
[25] Powell, M.; Fletcher, R. (ed.), A method for nonlinear constraints in minimization problems, 283-298 (1969), London
[26] Rakotomamonjy, A., Bach, F., Canu, S., & Grandvalet, Y. (2008). SimpleMKL. Journal of Machine Learning Research, 9, 2491-2521. · Zbl 1225.68208
[27] Rätsch, G., Onoda, T., & Müller, K. R. (2001). Soft margins for AdaBoost. Machine Learning, 42(3), 287-320. · Zbl 0969.68128 · doi:10.1023/A:1007618119488
[28] Rockafellar, R. T. (1970). Convex analysis. Princeton: Princeton University Press. · Zbl 0193.18401
[29] Rockafellar, R. T. (1976). Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1, 97-116. · Zbl 0402.90076 · doi:10.1287/moor.1.2.97
[30] Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge: MIT Press.
[31] Sonnenburg, S., Rätsch, G., Schäfer, C., & Schölkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research, 7, 1531-1565. · Zbl 1222.90072
[32] Tomioka, R., & Sugiyama, M. (2009). Dual augmented Lagrangian method for efficient sparse reconstruction. IEEE Signal Processing Letters, 16(12), 1067-1070. · doi:10.1109/LSP.2009.2030111
[33] Tomioka, R., & Suzuki, T. (2009). Sparsity-accuracy trade-off in MKL. arXiv:1001.2615.
[34] Tomioka, R., & Suzuki, T. (2011). Regularization strategies and empirical Bayesian learning for MKL. arXiv:1011.3090.
[35] Tomioka, R., Suzuki, T., & Sugiyama, M. (2011). Super-linear convergence of dual augmented Lagrangian algorithm for sparse learning. Journal of Machine Learning Research, 12, 1501-1550.
[36] Wright, S. J., Nowak, R. D., & Figueiredo, M. A. T. (2009). Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7), 2479-2493. doi:10.1109/TSP.2009.2016892. · Zbl 1391.94442 · doi:10.1109/TSP.2009.2016892
[37] Xu, Z.; Jin, R.; King, I.; Lyu, M. R., An extended level method for efficient multiple kernel learning, No. 21, 1825-1832 (2009), Cambridge
[38] Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1), 49-67. · Zbl 1141.62030 · doi:10.1111/j.1467-9868.2005.00532.x
[39] Zangwill, W. I. (1969). Nonlinear programming: a unified approach. New York: Prentice Hall. · Zbl 0195.20804
[40] Zien, A.; Ong, C., Multiclass multiple kernel learning, 1191-1198 (2007), New York
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases, these data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or perfect matching.