zbMATH — the first resource for mathematics

Hyper-parameter optimization for support vector machines using stochastic gradient descent and dual coordinate descent. (English) Zbl 1432.68387
Summary: We developed a gradient-based method to optimize the regularization hyper-parameter, \(C\), for support vector machines in a bilevel optimization framework. On the upper level, we optimized the hyper-parameter \(C\) to minimize the prediction loss on validation data using stochastic gradient descent. On the lower level, we used dual coordinate descent to optimize the parameters of support vector machines to minimize the loss on training data. The gradient of the loss function on the upper level with respect to the hyper-parameter, \(C\), was computed using the implicit function theorem combined with the optimality condition of the lower-level problem, i.e., the dual problem of support vector machines. We compared our method with the existing gradient-based method in the literature on several datasets. Numerical results showed that our method converges faster to the optimal solution and achieves better prediction accuracy for large-scale support vector machine problems.
68T05 Learning and adaptive systems in artificial intelligence
Full Text: DOI
[1] Bengio Y (1999) Gradient-based optimization of hyper-parameters. https://pdfs.semanticscholar.org/0b2b/806201fd1146f16b4c20fc3dd696e22c3c88.pdf
[2] Bennett KP, Kunapuli G, Hu J, Pang JS (2008) Bilevel optimization and machine learning. In: IEEE world congress on computational intelligence, Springer, pp 25-47
[3] Bottou, L.; Curtis, FE; Nocedal, J., Optimization methods for large-scale machine learning, SIAM Rev, 60, 2, 223-311 (2018) · Zbl 1397.65085
[4] Boyd, SP; Vandenberghe, L., Convex optimization (2004), Cambridge: Cambridge University Press, Cambridge
[5] Chang CC, Lin CJ (2013) LIBSVM: a library for support vector machines. https://www.csie.ntu.edu.tw/ cjlin/papers/libsvm.pdf
[6] Cortes, C.; Vapnik, V., Support-vector networks, Mach Learn, 20, 3, 273-297 (1995) · Zbl 0831.68098
[7] Couellan, N.; Wang, W., Bi-level stochastic gradient for large scale support vector machine, Neurocomputing, 153, 300-308 (2015)
[8] Couellan N, Wang W (2016) On the convergence of stochastic bi-level gradient methods. http://www.optimization-online.org/DB_HTML/2016/02/5323.html
[9] Do CB, Foo CS, Ng AY (2007) Efficient multiple hyperparameter learning for log-linear models. http://ai.stanford.edu/ chuongdo/papers/learn_reg.pdf
[10] Franceschi L, Donini M, Frasconi P, Pontil M (2017) Forward and reverse gradient-based hyperparameter optimization. arXiv:1703.01785
[11] Friedman J, Hastie T, Tibshirani R (2001) The elements of statistical learning, vol 1. Springer series in statistics, New York · Zbl 0973.62007
[12] Fû J, Luô H, Fen J, Hsiang Lo K, Chuâ TS (2016) DrMAD: distilling reverse-mode automatic differentiation for optimizing hyperparameters of deep neural networks. arXiv:1601.00917
[13] Han, P.; Lakshminarayanan, P.; Jiang, W.; Shpitser, I.; Hui, X.; Lee, SH; Cheng, Z.; Guo, Y.; Taylor, RH; Siddiqui, SA, Dose/volume histogram patterns in salivary gland subvolumes influence xerostomia injury and recovery, Sci Rep, 9, 1, 3616 (2019)
[14] Hastie, T.; Edu, H.; Rosset, S.; Tibshirani, R.; Zhu, J.; Edu, J., The entire regularization path for the support vector machine, J Mach Learn Res, 5, 1391-1415 (2004) · Zbl 1222.68213
[15] Holland, JH, Genetic algorithms, Sci Am, 267, 1, 66-73 (1992)
[16] Hsieh CJ, Chang KW, Lin CJ, Keerthi SS, Sundararajan S (2008) A dual coordinate descent method for large-scale linear svm. In: Proceedings of the 25th international conference on machine learning. ACM, pp 408-415
[17] Jiang, W.; Lakshminarayanan, P.; Hui, X.; Han, P.; Cheng, Z.; Bowers, M.; Shpitser, I.; Siddiqui, S.; Taylor, RH; Quon, H., Machine learning methods uncover radiomorphologic dose patterns in salivary glands that predict xerostomia in patients with head and neck cancer, Adv Rad Oncol, 4, 401-412 (2018)
[18] Jules MS (2017) Experiments with scalable gradient-based hyperparameter optimization for deep neural networks. https://pdfs.semanticscholar.org/b694/d478bc9b7e4828c8fca4ff23433f9bf5e9d3.pdf
[19] Kuhn HW, Tucker AW (1951) Nonlinear programming. In: Proceedings of the 2nd Berkeley symposium on mathematical statistics and probability, pp 481-492. 10.1007/978-3-0348-0439-4, https://link.springer.com/content/pdf/10.1007/978-3-0348-0439-4_11.pdf · Zbl 0044.05903
[20] Kunapuli, G.; Bennett, KP; Hu, J.; Js, Pang, Bilevel model selection for support vector machines, Notes, 45, 129-158 (2008) · Zbl 1145.68488
[21] Maclaurin D, Duvenaud D, Adams RP (2015) Gradient-based hyperparameter optimization through reversible learning. arXiv:1502.03492
[22] McNutt, Wong J, Purdy J, et al (2010) OncoSpace: A new paradigm for clinical research and decision support in radiation oncology. In: Proceedings of the 16th international conference on computers in radiation therapy, Amsterdam, Netherlands
[23] Miikkulainen R, Liang J, Meyerson E, Rawal A, Fink D, Francon O, Raju B, Shahrzad H, Navruzyan A, Duffy N, et al (2019) Evolving deep neural networks. In: Artificial intelligence in the age of neural networks and brain computing, Elsevier, pp 293-312
[24] Mockus, J., Bayesian approach to global optimization: theory and applications (2012), Berlin: Springer, Berlin
[25] Pedregosa F (2016) Hyperparameter optimization with approximate gradient. arXiv:1602.02355,
[26] Platt JC (1998) Sequential minimal optimization: a fast algorithm for training support vector machines. Adv Kernel Methods. https://www.microsoft.com/en-us/research/publication/sequential-minimal-optimization-a-fast-algorithm-for-training-support-vector-machines/
[27] Ron Kohavi (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th international joint conference on artificial intelligence—Vol 2, International joint conferences on artificial intelligence, Inc., pp 1137-1143, https://dl.acm.org/citation.cfm?id=1643047
[28] Shalev-Shwartz, S.; Singer, Y.; Srebro, N.; Cotter, A., Pegasos: primal estimated sub-gradient solver for SVM, Math Program, 127, 1, 3-30 (2011) · Zbl 1211.90239
[29] Snoek J, Larochelle H, Adams RP (2012) Practical bayesian optimization of machine learning algorithms. In: Advances in neural information processing systems, pp 2951-2959
[30] UCI Machine Learning Repository (1995a) Connect-4 data set. http://archive.ics.uci.edu/ml/datasets/connect-4
[31] UCI Machine Learning Repository (1995b) UCI machine learning repository: breast cancer wisconsin (diagnostic) data set. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
[32] UCI Machine Learning Repository (1998) UCI machine learning repository. https://archive.ics.uci.edu/ml/index.php
[33] UCI Machine Learning Repository (2007) MAGIC gamma telescope data set. https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope
[34] Zhang T, Tong (2004) Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: 21st International conference on machine learning—ICML ’04, ACM Press, New York, New York, USA, p 116. 10.1145/1015330.1015332, doi:1015330.1015332
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.