Trust-region algorithms for training responses: machine learning methods using indefinite Hessian approximations. (English) Zbl 1440.90092

Summary: Machine learning (ML) problems are often posed as highly nonlinear, nonconvex unconstrained optimization problems. Methods for solving ML problems based on stochastic gradient descent scale easily to very large problems but may require fine-tuning many hyper-parameters. Quasi-Newton approaches based on the limited-memory Broyden-Fletcher-Goldfarb-Shanno (BFGS) update typically do not require manual hyper-parameter tuning but suffer from approximating a potentially indefinite Hessian with a positive-definite matrix. Hessian-free methods leverage the ability to perform Hessian-vector products without forming the entire Hessian matrix, but each iteration's complexity is significantly greater than that of quasi-Newton methods. In this paper we propose an alternative approach for solving ML problems: a quasi-Newton trust-region framework for large-scale optimization that allows for indefinite Hessian approximations. Numerical experiments on a standard test data set show that, for a fixed computational time budget, the proposed methods achieve better results than the traditional limited-memory BFGS and Hessian-free methods.
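The contrast the summary draws, namely that BFGS forces a positive-definite approximation while a trust-region framework can exploit indefinite curvature, can be illustrated with the symmetric-rank-one (SR1) update used in limited-memory trust-region methods such as [8]. The following is a minimal dense NumPy sketch (illustrative only: the function name is ours, and it omits the limited-memory compact representation of [10]); it shows SR1 recovering the indefinite Hessian of a simple quadratic exactly, something a BFGS approximation, being positive definite by construction, cannot do.

```python
import numpy as np

def sr1_update(B, s, y, tol=1e-8):
    """Apply one symmetric-rank-one (SR1) quasi-Newton update.

    Unlike BFGS, SR1 does not force the result to be positive
    definite, so it can capture negative curvature.  The update is
    skipped when the denominator is too small (a standard safeguard,
    cf. [16]).
    """
    r = y - B @ s                      # residual of the secant condition
    denom = r @ s
    if abs(denom) < tol * np.linalg.norm(r) * np.linalg.norm(s):
        return B                       # skip an ill-conditioned update
    return B + np.outer(r, r) / denom

# Indefinite quadratic f(x) = x1^2 - x2^2 with Hessian H = diag(2, -2).
H = np.diag([2.0, -2.0])
B = np.eye(2)                          # initial approximation (positive definite)
for s in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    B = sr1_update(B, s, H @ s)        # for a quadratic, y = H s exactly

print(np.linalg.eigvalsh(B))           # -> [-2.  2.]: indefiniteness recovered
```

For a quadratic, SR1 reproduces the true Hessian after updates along n independent directions; the resulting negative eigenvalue is precisely the curvature information a trust-region subproblem solver can exploit to escape saddle points, whereas a line-search BFGS method must discard or dampen it.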


90C53 Methods of quasi-Newton type
15A06 Linear equations (linear algebraic aspects)
90C06 Large-scale problems in mathematical programming
65K05 Numerical mathematical programming methods
65K10 Numerical optimization and variational techniques
49M15 Newton-type methods
Full Text: DOI arXiv


[1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I.J., Harp, A., Irving, G., Isard, M., Jia, Y., Józefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D.G., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P.A., Vanhoucke, V., Vasudevan, V., Viégas, F.B., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X., Tensorflow: Large-scale machine learning on heterogeneous distributed systems, CoRR abs/1603.04467 (2016). Available at http://arxiv.org/abs/1603.04467.
[2] Bengio, Y., Practical recommendations for gradient-based training of deep architectures, in Neural Networks: Tricks of the Trade, Montavon, G., Orr, G.B. and Müller, K.R., eds., Springer, 2012, pp. 437-478.
[3] Berahas, A.S., Nocedal, J., and Takáč, M., A multi-batch L-BFGS method for machine learning, CoRR abs/1605.06049 (2016). Available at http://arxiv.org/abs/1605.06049. · Zbl 1430.90523
[4] Bergstra, J.; Bengio, Y., Random search for hyper-parameter optimization, J. Mach. Learn. Res., 13, 281-305 (2012) · Zbl 1283.68282
[5] Bergstra, J.S., Bardenet, R., Bengio, Y., and Kégl, B., Algorithms for hyper-parameter optimization, in Advances in Neural Information Processing Systems, Vol. 24, Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., and Weinberger, K.Q., eds., Curran Associates, Inc., 2011, pp. 2546-2554. Available at http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.pdf.
[6] Bergstra, J., Yamins, D., and Cox, D.D., Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures, Proceedings of the 30th International Conference on Machine Learning, ICML 2013, , 2013, pp. 115-123. Available at http://jmlr.org/proceedings/papers/v28/bergstra13.html.
[7] Bottou, L.; Curtis, F.; Nocedal, J., Optimization methods for large-scale machine learning, SIAM Rev., 60, 223-311 (2018) · Zbl 1397.65085
[8] Brust, J.; Erway, J. B.; Marcia, R. F., On solving L-SR1 trust-region subproblems, Comput. Optim. Appl., 66, 245-266 (2017) · Zbl 1364.90239
[9] Burdakov, O.; Gong, L.; Zikrin, S.; Yuan, Y. X., On efficiently combining limited-memory and trust-region techniques, Math. Program. Comput., 1-34 (2016)
[10] Byrd, R. H.; Nocedal, J.; Schnabel, R. B., Representations of quasi-Newton matrices and their use in limited-memory methods, Math. Program., 63, 129-156 (1994) · Zbl 0809.90116
[11] Byrd, R. H.; Chin, G. M.; Nocedal, J.; Wu, Y., Sample size selection in optimization methods for machine learning, Math. Program., 134, 127-155 (2012) · Zbl 1252.49044
[12] Byrd, R. H.; Hansen, S. L.; Nocedal, J.; Singer, Y., A stochastic quasi-Newton method for large-scale optimization, SIAM J. Optim., 26, 1008-1031 (2016) · Zbl 1382.65166
[13] Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B., and LeCun, Y., The loss surface of multilayer networks, CoRR abs/1412.0233 (2014). Available at http://arxiv.org/abs/1412.0233.
[14] Cohen, G., Afshar, S., Tapson, J., and van Schaik, A., EMNIST: An extension of MNIST to handwritten letters, preprint (2017). Available at arXiv:1702.05373.
[15] Conn, A. R.; Gould, N. I.; Toint, P. L., Testing a class of methods for solving minimization problems with simple bounds on the variables, Math. Comput., 50, 399-430 (1988) · Zbl 0645.65033
[16] Conn, A. R.; Gould, N. I.M.; Toint, P. L., Convergence of quasi-Newton matrices generated by the symmetric rank one update, Math. Program., 50, 177-195 (1991) · Zbl 0737.90062
[17] Conn, A. R.; Gould, N. I.M.; Toint, P. L., Trust-Region Methods (2000), Society for Industrial and Applied Mathematics (SIAM): Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA
[18] Curtis, F., A self-correcting variable-metric algorithm for stochastic optimization, Proceedings of The 33rd international conference on machine learning, New York, NY, 2016, pp. 632-641.
[19] Curtis, F. E.; Que, X., A quasi-Newton algorithm for nonconvex, nonsmooth optimization with global convergence guarantees, Math. Program. Comput., 7, 399-428 (2015) · Zbl 1333.49042
[20] Dauphin, Y.N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y., Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, in Advances in Neural Information Processing Systems, Vol. 27, Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, eds., Curran Associates, Inc., Montreal, Canada, 2014, pp. 2933-2941.
[21] Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q.V., Mao, M.Z., Ranzato, M., Senior, A.W., Tucker, P.A., Yang, K., and Ng, A.Y., Large Scale Distributed Deep Networks, in Advances in Neural Information Processing Systems, Vol. 25, 2012, pp. 1232-1240. Available at http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.
[22] Dewancker, I., McCourt, M., Clark, S., Hayes, P., Johnson, A., and Ke, G., A stratified analysis of Bayesian optimization methods, CoRR abs/1603.09441 (2016). Available at http://arxiv.org/abs/1603.09441.
[23] Duchi, J. C.; Hazan, E.; Singer, Y., Adaptive subgradient methods for online learning and stochastic optimization, J. Mach. Learn. Res., 12, 2121-2159 (2011) · Zbl 1280.68164
[24] Erway, J. B.; Gill, P. E., A subspace minimization method for the trust-region step, SIAM. J. Optim., 20, 1439-1461 (2009) · Zbl 1195.49042
[25] Erway, J. B.; Gill, P. E.; Griffin, J. D., Iterative methods for finding a trust-region step, SIAM J. Optim., 20, 1110-1131 (2009) · Zbl 1189.49049
[26] Erway, J. B.; Marcia, R. F., On efficiently computing the eigenvalues of limited-memory quasi-Newton matrices, SIAM J. Matrix Anal. Appl., 36, 1338-1359 (2015) · Zbl 1337.49048
[27] Friedman, J.; Hastie, T.; Tibshirani, R., The Elements of Statistical Learning, Vol. 1, Springer, New York (2001)
[28] Gay, D. M., Computing optimal locally constrained steps, SIAM J. Sci. Statist. Comput., 2, 186-197 (1981) · Zbl 0467.65027
[29] Gould, N., An introduction to algorithms for continuous optimization, Oxford University Computing Laboratory Notes, 2006.
[30] Gould, N. I.M.; Lucidi, S.; Roma, M.; Toint, P. L., Solving the trust-region subproblem using the Lanczos method, SIAM J. Optim., 9, 504-525 (1999) · Zbl 1047.90510
[31] Gower, R., Goldfarb, D., and Richtarik, P., Stochastic block BFGS: Squeezing more curvature out of data, in Proceedings of The 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 48, 20-22 Jun, PMLR, New York, New York, USA, 2016, pp. 1869-1878. Available at http://proceedings.mlr.press/v48/gower16.html.
[32] Hager, W. W., Minimizing a quadratic over a sphere, SIAM J. Optim., 12, 188-208 (2001) · Zbl 1058.90045
[33] Ioffe, S. and Szegedy, C., Batch normalization: Accelerating deep network training by reducing internal covariate shift, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, 2015, pp. 448-456. Available at http://jmlr.org/proceedings/papers/v37/ioffe15.html.
[34] Kawaguchi, K., Deep learning without poor local minima, in Advances in Neural Information Processing Systems, Vol. 29, D.D. Lee, M. Sugiyama, U.V. Luxburg, I. Guyon, and R. Garnett, eds., Curran Associates, Inc., 2016, pp. 586-594. Available at http://papers.nips.cc/paper/6112-deep-learning-without-poor-local-minima.pdf.
[35] Kingma, D.P. and Ba, J., Adam: A Method for Stochastic Optimization, CoRR abs/1412.6980 (2014). Available at http://arxiv.org/abs/1412.6980.
[36] Kussul, E.; Baidyk, T., Improved method of handwritten digit recognition tested on MNIST database, Image Vis. Comput., 22, 971-981 (2004)
[37] Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A., On optimization methods for deep learning, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML ’11, Bellevue, Washington, USA, June, ACM, New York, NY, USA, 2011, pp. 265-272.
[38] Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P., Gradient-based learning applied to document recognition, Proc. IEEE, 86, 2278-2324 (1998)
[39] Liu, D. C.; Nocedal, J., On the limited memory BFGS method for large scale optimization, Math. Program. B, 45, 503-528 (1989) · Zbl 0696.90048
[40] Maclaurin, D., Duvenaud, D.K., and Adams, R.P., Gradient-based Hyperparameter Optimization through Reversible Learning, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, JMLR Workshop and Conference Proceedings, Vol. 37, JMLR.org, 2015, pp. 2113-2122. Available at http://jmlr.org/proceedings/papers/v37/maclaurin15.html.
[41] Martens, J., Deep learning via Hessian-free optimization, Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 2010, pp. 735-742.
[42] Martens, J. and Sutskever, I., Learning recurrent neural networks with Hessian-Free optimization, Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, 2011, pp. 1033-1040.
[43] Martens, J. and Sutskever, I., Training deep and recurrent networks with Hessian-free optimization, in Neural Networks: Tricks of the Trade, G. Montavon, G.B. Orr and K.R. Müller, eds., Springer, 2012, pp. 479-535.
[44] McMahan, H.B. and Streeter, M.J., Delay-Tolerant Algorithms for Asynchronous Distributed Online Learning, in Advances in Neural Information Processing Systems, Vol. 27, 2014, pp. 2915-2923. Available at http://papers.nips.cc/paper/5242-delay-tolerant-algorithms-for-asynchronous-distributed-online-learning.
[45] Metel, M.R., Mini-batch stochastic gradient descent with dynamic sample sizes, ArXiv e-prints (2017).
[46] Moré, J. J.; Sorensen, D. C., Computing a trust region step, SIAM J. Sci. Statist. Comput., 4, 553-572 (1983) · Zbl 0551.65042
[47] Moré, J.J. and Sorensen, D.C., Newton’s method, in Studies in Mathematics, Volume 24. Studies in Numerical Analysis, G.H. Golub, ed., Math. Assoc. America, Washington, DC, 1984, pp. 29-82.
[48] Moritz, P., Nishihara, R., and Jordan, M., A linearly-convergent stochastic L-BFGS algorithm, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 51, 09-11 May, PMLR, Cadiz, Spain, 2016, pp. 249-258. Available at http://proceedings.mlr.press/v51/moritz16.html.
[49] Nocedal, J., Updating quasi-Newton matrices with limited storage, Math. Comput., 35, 773-782 (1980) · Zbl 0464.65037
[50] Nocedal, J.; Wright, S. J., Numerical Optimization (2006), Springer: Springer, New York · Zbl 1104.65059
[51] Nocedal, J. and Yuan, Y.X., Combining trust region and line search techniques, in Advances in Nonlinear Programming, Y. Yuan, ed., Springer, Dordrecht, The Netherlands, 1998, pp. 153-175. · Zbl 0909.90243
[52] Pearlmutter, B. A., Fast exact multiplication by the Hessian, Neural Comput., 6, 147-160 (1994)
[53] Recht, B., Re, C., Wright, S., and Niu, F., Hogwild: A lock-free approach to parallelizing stochastic gradient descent, in Advances in Neural Information Processing Systems, Vol. 24, J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger, eds., Curran Associates, Inc., 2011, pp. 693-701. Available at http://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf.
[54] Robbins, H.; Monro, S., A stochastic approximation method, Ann. Math. Statist., 22, 400-407 (1951) · Zbl 0054.05901
[55] Rojas, M.; Santos, S. A.; Sorensen, D. C., A new matrix-free algorithm for the large-scale trust-region subproblem, SIAM J. Optim., 11, 611-646 (2001) · Zbl 0994.65067
[56] Rojas, M.; Santos, S. A.; Sorensen, D. C., Algorithm 873: Lstrs: Matlab software for large-scale trust-region subproblems and regularization, ACM Trans. Math. Softw., 34, 11:1-11:28 (2008) · Zbl 1291.65177
[57] Sagun, L., Güney, V.U., and LeCun, Y., Explorations on high dimensional landscapes, CoRR abs/1412.6615 (2014). Available at http://arxiv.org/abs/1412.6615.
[58] Schraudolph, N.N., Yu, J., and Günter, S., A stochastic Quasi-Newton method for online convex optimization, in Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 2, 21-24 Mar, PMLR, 2007, pp. 436-443. Available at http://proceedings.mlr.press/v2/schraudolph07a.html.
[59] Smith, S.L., Kindermans, P.J., and Le, Q.V., Don’t decay the learning rate, increase the batch size, ArXiv e-prints (2017).
[60] Snoek, J., Larochelle, H., and Adams, R.P., Practical Bayesian optimization of machine learning algorithms, in Advances in Neural Information Processing Systems, Vol. 25, 2012, pp. 2960-2968. Available at http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.
[61] Sparks, E.R., Talwalkar, A., Franklin, M.J., Jordan, M.I., and Kraska, T., Tupaq: An efficient planner for large-scale predictive analytic queries, CoRR abs/1502.00068 (2015). Available at http://arxiv.org/abs/1502.00068.
[62] Steihaug, T., The conjugate gradient method and trust regions in large scale optimization, SIAM J. Numer. Anal., 20, 626-637 (1983) · Zbl 0518.65042
[63] Sutskever, I., Martens, J., Dahl, G.E., and Hinton, G.E., On the importance of initialization and momentum in deep learning, Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, 2013, pp. 1139-1147. Available at http://jmlr.org/proceedings/papers/v28/sutskever13.html.
[64] Toint, P.L., Towards an efficient sparsity exploiting Newton method for minimization, in Sparse Matrices and Their Uses, I.S. Duff, ed., Academic Press, London and New York, 1981, pp. 57-88.
[65] Vapnik, V., Principles of risk minimization for learning theory, in Advances in Neural Information Processing Systems, J.E. Moody, S.J. Hanson, and R.P. Lippmann, eds., Morgan Kaufmann, Denver, CO, 1992, pp. 831-838.
[66] Yektamaram, S., Optimization algorithms for machine learning designed for parallel and distributed environments, Ph.D. diss., ISE Department, Lehigh University, Bethlehem, PA, 2017.
[67] Zeiler, M.D., ADADELTA: an adaptive learning rate method, CoRR abs/1212.5701 (2012). Available at http://arxiv.org/abs/1212.5701.
[68] Zhang, S., Choromanska, A., and LeCun, Y., Deep learning with elastic averaging SGD, in Advances in Neural Information Processing Systems, Vol. 28, 2015, pp. 685-693. Available at http://papers.nips.cc/paper/5761-deep-learning-with-elastic-averaging-sgd.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.