
Stochastic proximal quasi-Newton methods for non-convex composite optimization. (English) Zbl 07111868

Summary: In this paper, we propose a generic algorithmic framework for stochastic proximal quasi-Newton (SPQN) methods to solve non-convex composite optimization problems. Stochastic second-order information is exploited to construct the proximal subproblem. Under mild conditions we show the non-asymptotic convergence of the proposed algorithm to a stationary point of the original problem and analyse its computational complexity. Moreover, we extend the proximal form of the Polyak-Łojasiewicz (PL) inequality to constrained settings, obtaining the constrained proximal PL (CP-PL) inequality, under which a linear convergence rate of the proposed algorithm is achieved. We also propose a modified self-scaling symmetric rank-one update to be used within the SPQN framework, yielding a stochastic symmetric rank-one method. Finally, we report numerical experiments that demonstrate the effectiveness of the proposed algorithm.
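
For orientation, the standard ingredients behind such a framework can be sketched as follows (an illustrative sketch based on the general SPQN literature; the notation \(g_k\), \(H_k\), \(s_k\), \(y_k\) is generic and need not match the paper's). The composite problem and the quasi-Newton proximal subproblem typically read
\[
\min_{x\in\mathbb{R}^n} F(x) := f(x) + h(x),
\qquad
x_{k+1} \in \arg\min_{y}\; \langle g_k, y - x_k\rangle + \tfrac12 (y - x_k)^\top H_k (y - x_k) + h(y),
\]
where \(f\) is smooth and possibly non-convex, \(h\) is convex and possibly non-smooth, \(g_k\) is a mini-batch gradient estimate, and \(H_k \succ 0\) is a stochastic quasi-Newton metric, e.g. generated by a (self-scaling) symmetric rank-one update
\[
H_{k+1} = H_k + \frac{(y_k - H_k s_k)(y_k - H_k s_k)^\top}{(y_k - H_k s_k)^\top s_k},
\qquad s_k = x_{k+1} - x_k,
\]
with \(y_k\) a (stochastic) gradient difference. For comparison, the unconstrained proximal PL inequality of [23] requires \(\tfrac12\,\mathcal{D}_h(x,L) \ge \mu\,(F(x) - F^*)\), where
\[
\mathcal{D}_h(x,\alpha) := -2\alpha \min_{y}\Big[\langle \nabla f(x), y - x\rangle + \tfrac{\alpha}{2}\|y - x\|^2 + h(y) - h(x)\Big];
\]
the CP-PL inequality mentioned above is the constrained analogue of this condition.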

MSC:

47N10 Applications of operator theory in optimization, convex analysis, mathematical programming, economics
65K10 Numerical optimization and variational techniques

Software:

NNLS; IMRO; SGD-QN

References:

[1] Agarwal, N.; Bullins, B.; Hazan, E., Second order stochastic optimization for machine learning in linear time, J. Mach. Learn. Res., 18, 116, 1-40 (2017) · Zbl 1441.90115
[2] Allen-Zhu, Z., Natasha 2: Faster non-convex optimization than SGD, preprint (2017). Available at arXiv:1708.08694v2.
[3] Allen-Zhu, Z., Natasha: Faster stochastic non-convex optimization via strongly non-convex parameter, preprint (2017). Available at arXiv:1702.00763.
[4] Allen-Zhu, Z. and Hazan, E., Variance reduction for faster non-convex optimization, International Conference on Machine Learning, New York, NY, 2016, pp. 699-707.
[5] Becker, S. and Fadili, J., A quasi-Newton proximal splitting method, Advances in Neural Information Processing Systems, Lake Tahoe, 2012, pp. 2618-2626.
[6] Berahas, A.S., Bollapragada, R., and Nocedal, J., An investigation of Newton-sketch and subsampled Newton methods, preprint (2017). Available at arXiv:1705.06211. · Zbl 1454.90112
[7] Bordes, A.; Bottou, L.; Gallinari, P., SGD-QN: Careful quasi-Newton stochastic gradient descent, J. Mach. Learn. Res., 10, 1737-1754 (2009) · Zbl 1235.68130
[8] Byrd, R. H.; Chin, G. M.; Nocedal, J.; Oztoprak, F., A family of second-order methods for convex \(\ell_1\)-regularized optimization, Math. Program., 159, 1-2, 435-467 (2016) · Zbl 1350.49046 · doi:10.1007/s10107-015-0965-3
[9] Byrd, R. H.; Hansen, S. L.; Nocedal, J.; Singer, Y., A stochastic quasi-Newton method for large-scale optimization, SIAM J. Optim., 26, 2, 1008-1031 (2016) · Zbl 1382.65166 · doi:10.1137/140954362
[10] Byrd, R. H.; Khalfan, H. F.; Schnabel, R. B., Analysis of a symmetric rank-one trust region method, SIAM J. Optim., 6, 4, 1025-1039 (1996) · Zbl 0923.65035 · doi:10.1137/S1052623493252985
[11] Byrd, R.H., Nocedal, J., and Oztoprak, F., An inexact successive quadratic approximation method for convex \(\ell_1\)-regularized optimization, preprint (2013). Available at arXiv:1309.3529. · Zbl 1342.49037
[12] Cauchy, A. L., Méthode générale pour la résolution des systèmes d’équations simultanées, Comptes rendus des séances de l’Académie des sciences de Paris, 25, 536-538 (1847)
[13] Cho, K., Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y., Learning phrase representations using RNN encoder-decoder for statistical machine translation, preprint (2014). Available at arXiv:1406.1078.
[14] Conn, A. R.; Gould, N. I.M.; Toint, Ph. L., Convergence of quasi-Newton matrices generated by the symmetric rank one update, Math. Program., 50, 2, 177-195 (1991) · Zbl 0737.90062 · doi:10.1007/BF01594934
[15] Curtis, F.E., A self-correcting variable-metric algorithm for stochastic optimization, International Conference on Machine Learning, New York, NY, 2016, pp. 632-641.
[16] Defazio, A., Bach, F., and Julien, S.L., Saga: A fast incremental gradient method with support for non-strongly convex composite objectives, Advances in Neural Information Processing Systems, Montreal, 2014, pp. 1646-1654.
[17] Friedman, J.; Hastie, T.; Höfling, H.; Tibshirani, R., Pathwise coordinate optimization, Ann. Appl. Stat., 1, 2, 302-332 (2007) · Zbl 1378.90064 · doi:10.1214/07-AOAS131
[18] Ghadimi, S.; Lan, G.; Zhang, H., Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization, Math. Program., 155, 1-2, 267-305 (2016) · Zbl 1332.90196 · doi:10.1007/s10107-014-0846-1
[19] Ghanbari, H. and Scheinberg, K., Proximal quasi-Newton methods for regularized convex optimization with linear and accelerated sublinear convergence rates, preprint (2016). Available at arXiv:1607.03081. · Zbl 1397.90301
[20] Goodfellow, I.; Bengio, Y.; Courville, A., Deep Learning (2016), MIT Press: MIT Press, Cambridge, MA · Zbl 1373.68009
[21] Hsieh, C.J., Sustik, M.A., Dhillon, I.S., and Ravikumar, P., Sparse inverse covariance matrix estimation using quadratic approximation, Advances in Neural Information Processing Systems, Granada, 2011, pp. 2330-2338.
[22] Johnson, R. and Zhang, T., Accelerating stochastic gradient descent using predictive variance reduction, Advances in Neural Information Processing Systems, Lake Tahoe, 2013, pp. 315-323.
[23] Karimi, H., Nutini, J., and Schmidt, M., Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, Riva del Garda, 2016, pp. 795-811.
[24] Karimi, S.; Vavasis, S., IMRO: A proximal quasi-Newton method for solving \(\ell_1\)-regularized least squares problems, SIAM J. Optim., 27, 2, 583-615 (2017) · Zbl 1365.90202 · doi:10.1137/140966587
[25] Khalfan, H. F.; Byrd, R. H.; Schnabel, R. B., A theoretical and experimental study of the symmetric rank one update, SIAM J. Optim., 3, 1, 1-24 · Zbl 0771.65029 · doi:10.1137/0803001
[26] Roosta-Khorasani, F. and Mahoney, M.W., Sub-sampled Newton methods I: Globally convergent algorithms, preprint (2016). Available at arXiv:1601.04737.
[27] Roosta-Khorasani, F. and Mahoney, M.W., Sub-sampled Newton methods II: Local convergence rates, preprint (2016). Available at arXiv:1601.04738.
[28] Kim, D.; Sra, S.; Dhillon, I. S., Tackling box-constrained optimization via a new projected quasi-Newton approach, SIAM J. Sci. Comput., 32, 6, 3548-3563 (2010) · Zbl 1220.93085 · doi:10.1137/08073812X
[29] Lee, J.; Sun, Y.; Saunders, M., Proximal Newton-type methods for minimizing composite functions, SIAM J. Optim., 24, 3, 1420-1443 (2014) · Zbl 1306.65213 · doi:10.1137/130921428
[30] Lin, H., Mairal, J., and Harchaoui, Z., A generic quasi-Newton algorithm for faster gradient-based optimization, preprint (2016). Available at arXiv:1610.00960.
[31] Liu, X., Hsieh, C., Lee, J.D., and Sun, Y., An inexact subsampled proximal Newton-type method for large-scale machine learning, preprint (2017). Available at arXiv:1708.08552.
[32] Luo, L., Chen, Z., Zhang, Z., and Li, W., A proximal stochastic quasi-Newton algorithm, preprint (2016). Available at arXiv:1602.00223.
[33] Mairal, J., Incremental majorization-minimization optimization with application to large-scale machine learning, SIAM J. Optim., 25, 2, 829-855 (2015) · Zbl 1320.90047 · doi:10.1137/140957639
[34] Mine, H.; Fukushima, M., A minimization method for the sum of a convex function and a continuously differentiable function, J. Optim. Theory Appl., 33, 1, 9-23 (1981) · Zbl 0422.90070 · doi:10.1007/BF00935173
[35] Mokhtari, A., Eisen, M., and Ribeiro, A., IQN: An incremental quasi-Newton method with local superlinear convergence rate, preprint (2017). Available at arXiv:1702.00709. · Zbl 1401.90121
[36] Mokhtari, A.; Ribeiro, A., RES: Regularized stochastic BFGS algorithm, IEEE Trans. Signal Process., 62, 23, 6089-6104 (2014) · Zbl 1394.94405 · doi:10.1109/TSP.2014.2357775
[37] Moritz, P., Nishihara, R., and Jordan, M.I., A linearly-convergent stochastic L-BFGS algorithm, Artificial Intelligence and Statistics, 2016, pp. 249-258.
[38] Nocedal, J.; Wright, S. J., Numerical Optimization (2006), Springer: Springer, New York · Zbl 1104.65059
[39] Osborne, M. R.; Sun, L. P., A new approach to symmetric rank one updating, IMA J. Numer. Anal., 19, 4, 497-507 (1999) · Zbl 0940.65071 · doi:10.1093/imanum/19.4.497
[40] Powell, M. J. D., Algorithms for nonlinear constraints that use Lagrangian functions, Math. Program., 14, 1, 224-248 (1978) · Zbl 0383.90092 · doi:10.1007/BF01588967
[41] Reddi, S.J., Hefny, A., Sra, S., Póczos, B., and Smola, A.J., Stochastic variance reduction for nonconvex optimization, International Conference on Machine Learning, New York, NY, 2016, pp. 314-323.
[42] Reddi, S.J., Sra, S., Póczos, B., and Smola, A., Fast incremental method for nonconvex optimization, preprint (2016). Available at arXiv:1603.06159.
[43] Reddi, S.J., Sra, S., Póczos, B., and Smola, A., Fast stochastic methods for nonsmooth nonconvex optimization, preprint (2016). Available at arXiv:1605.06900.
[44] Robbins, H.; Monro, S., A stochastic approximation method, Ann. Math. Stat., 22, 3, 400-407 (1951) · Zbl 0054.05901 · doi:10.1214/aoms/1177729586
[45] Rodomanov, A. and Kropotov, D., A superlinearly-convergent proximal Newton-type method for the optimization of finite sums, International Conference on Machine Learning, New York, NY, 2016, pp. 2597-2605.
[46] Schmidt, M.; Berg, E.; Friedlander, M. P.; Murphy, K., Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm, J. Mach. Learn. Res., 5, 456-463 (2009)
[47] Schmidt, M., Kim, D., and Sra, S., Projected Newton-type methods in machine learning, in Optimization for Machine Learning, S. Sra, S. Nowozin, and S.Wright, eds., MIT Press, Cambridge, MA, 2011.
[48] Schmidt, M.; Roux, N. L.; Bach, F., Minimizing finite sums with the stochastic average gradient, Math. Program., 160, 1-2, 83-112 (2017) · Zbl 1358.90073 · doi:10.1007/s10107-016-1030-6
[49] Schraudolph, N.N., Yu, J., and Günter, S., A stochastic quasi-Newton method for online convex optimization, J. Mach. Learn. Res. 2 (2007), pp. 436-443.
[50] Shalev-Shwartz, S. and Zhang, T., Proximal stochastic dual coordinate ascent, preprint (2012). Available at arXiv:1211.2717.
[51] Shalev-Shwartz, S.; Zhang, T., Stochastic dual coordinate ascent methods for regularized loss minimization, J. Mach. Learn. Res., 14, 1, 567-599 (2013) · Zbl 1307.68073
[52] Shi, Z. and Liu, R., Large scale optimization with proximal stochastic Newton-type gradient descent, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2015, pp. 691-704.
[53] Shalev-Shwartz, S.; Tewari, A., Stochastic methods for \(\ell_1\)-regularized loss minimization, J. Mach. Learn. Res., 12, 1865-1892 (2011) · Zbl 1280.62081
[54] Spellucci, P., A modified rank one update which converges q-superlinearly, Comput. Optim. Appl., 19, 4, 273-296 (2001) · Zbl 1009.90131 · doi:10.1023/A:1011259905470
[55] Sun, L. P., Updating the self-scaling symmetric rank one algorithm with limited memory for large-scale unconstrained optimization, Comput. Optim. Appl., 27, 1, 23-29 · Zbl 1045.90064
[56] Tibshirani, R., Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B, 58, 1, 267-288 (1996) · Zbl 0850.62538
[57] Wang, X.; Ma, S.; Goldfarb, D.; Liu, W., Stochastic quasi-Newton methods for nonconvex stochastic optimization, SIAM J. Optim., 27, 2, 927-956 (2017) · Zbl 1365.90182 · doi:10.1137/15M1053141
[58] Xiao, L.; Zhang, T., A proximal stochastic gradient method with progressive variance reduction, SIAM J. Optim., 24, 4, 2057-2075 (2014) · Zbl 1321.65016 · doi:10.1137/140961791
[59] Yu, X. and Tao, D., Variance-reduced proximal stochastic gradient descent for non-convex composite optimization, preprint (2016). Available at arXiv:1606.00602.
[60] Yuan, G. X.; Chang, K. W.; Hsieh, C. J.; Lin, C. J., A comparison of optimization methods and software for large-scale \(\ell_1\)-regularized linear classification, J. Mach. Learn. Res., 11, 3183-3234 (2010) · Zbl 1242.62065
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.