Multicomposite nonconvex optimization for training deep neural networks. (English) Zbl 1445.90086
90C26 Nonconvex programming, global optimization
49J52 Nonsmooth analysis
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, tensorflow.org, 2015.
[2] D.P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, MA, 1999. · Zbl 1015.90077
[3] L. Bottou, F. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Rev., 60 (2018), pp. 223-311. · Zbl 1397.65085
[4] R. Collobert and S. Bengio, Links between perceptrons, MLPs and SVMs, in Proceedings of the 21st International Conference on Machine Learning, 2004.
[5] Y. Cui, J.S. Pang, and B. Sen, Composite difference-max programs for modern statistical estimation problems, SIAM J. Optim., 28 (2018), pp. 3344-3374. · Zbl 1407.62250
[6] Y. Cui, D. Sun, and K.C. Toh, Computing the best approximation over the intersection of a polyhedral set and the doubly nonnegative cone, SIAM J. Optim., 29 (2019), pp. 2785-2813. · Zbl 1431.90109
[7] D. Davis and D. Drusvyatskiy, Stochastic model-based minimization of weakly convex functions, SIAM J. Optim., 29 (2019), pp. 207-239. · Zbl 1415.65136
[8] D. Davis, D. Drusvyatskiy, S. Kakade, and J. Lee, Stochastic subgradient method converges on tame functions, Found. Comput. Math., 20 (2020), pp. 119-154. · Zbl 1433.65141
[9] V.F. Demyanov, G. Di Pillo, and F. Facchinei, Exact penalization via Dini and Hadamard conditional derivatives, Optim. Methods Softw., 9 (1998), pp. 19-36. · Zbl 0903.90149
[10] G. Di Pillo and F. Facchinei, Exact penalty functions for nondifferentiable programming problems, in Nonsmooth Optimization and Related Topics, Springer, New York, 1989, pp. 89-107. · Zbl 0735.90061
[11] F. Facchinei and L. Lampariello, Partial penalization for the solution of generalized Nash equilibrium problems, J. Global Optim., 50 (2011), pp. 39-50. · Zbl 1236.91015
[12] F. Facchinei and J.S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003. · Zbl 1062.90002
[13] C.A. Floudas, Deterministic Global Optimization: Theory, Methods, and Applications, Nonconvex Optim. Appl. 37, Springer, New York, 2000.
[14] W. Gao, D. Goldfarb, and F. Curtis, ADMM for multiaffine constrained optimization, Optim. Methods Softw., 35 (2020), pp. 257-303. · Zbl 1428.90132
[15] S. Ghadimi and G. Lan, Stochastic first- and zeroth-order methods for nonconvex stochastic programming, SIAM J. Optim., 23 (2013), pp. 2341-2368. · Zbl 1295.90026
[16] X. Glorot, A. Bordes, and Y. Bengio, Deep sparse rectifier neural networks, in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011, pp. 315-323.
[17] G.H. Golub and C.F. van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, 2013. · Zbl 1268.65037
[18] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, Vol. 1, MIT Press, Cambridge, MA, 2016. · Zbl 1373.68009
[19] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, Maxout networks, in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 1319-1327.
[20] G. E. Hinton, S. Osindero, and Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput., 18 (2006), pp. 1527-1554. · Zbl 1106.68094
[21] R.A. Horn and C.R. Johnson, Topics in Matrix Analysis, Cambridge University Press, Cambridge, UK, 1991. · Zbl 0729.15001
[22] K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Networks, 2 (1989), pp. 359-366. · Zbl 1383.92015
[23] K. Jarrett, K. Kavukcuoglu, and Y. LeCun, What is the best multi-stage architecture for object recognition?, in Proceedings of the IEEE 12th International Conference on Computer Vision, 2009, pp. 2146-2153.
[24] D. Kingma and J. Ba, Adam: A method for stochastic optimization, in Proceedings of the International Conference on Learning Representations, Vol. 5, 2015.
[25] T. Lau, J. Zeng, B. Wu, and Y. Yao, A proximal block coordinate descent algorithm for deep neural network training, in Proceedings of the Workshop of the 6th International Conference on Learning Representations, 2018.
[26] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, 521 (2015), pp. 436-444.
[27] M. Leshno, V.Y. Lin, A. Pinkus, and S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks, 6 (1993), pp. 861-867.
[28] A.L. Maas, A.Y. Hannun, and A.Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in Proceedings of the 30th International Conference on Machine Learning, 2013.
[29] A. Nemirovsky and D. Yudin, Problem Complexity and Method Efficiency in Optimization, John Wiley & Sons, New York, 1983.
[30] Y. Nesterov, A method of solving a convex programming problem with convergence rate \(O(1/k^2)\), Soviet Math. Dokl., 27 (1983), pp. 372-376. · Zbl 0535.90071
[31] J.S. Pang and L. Qi, A globally convergent Newton method for convex \(SC^1\) minimization problems, J. Optim. Theory Appl., 85 (1995), pp. 633-648. · Zbl 0831.90095
[32] J.S. Pang, M. Razaviyayn, and A. Alvarado, Computing B-stationary points of nonsmooth DC programs, Math. Oper. Res., 42 (2016), pp. 95-118. · Zbl 1359.90106
[33] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. Devito, Z. Lin, A. Desmaison, L. Antiga, and A. Lever, Automatic Differentiation in PyTorch, https://pytorch.org, 2017.
[34] D.T. Pham and H.A. Le Thi, Convex analysis approach to DC programming: Theory, algorithm and applications, Acta Math. Vietnam., 22 (1997), pp. 289-355. · Zbl 0895.90152
[35] J.R. Quinlan, Combining instance-based and model-based learning, in Proceedings of the 10th International Conference on Machine Learning, 1993, pp. 236-243.
[36] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951), pp. 400-407. · Zbl 0054.05901
[37] K. Schäcke, On the Kronecker Product, https://www.math.uwaterloo.ca/~hwolkowi/henry/reports/kronthesisschaecke04.pdf, 2013.
[38] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks, 61 (2015), pp. 85-117.
[39] S. Scholtes, Introduction to Piecewise Differentiable Equations, Springer Briefs Optim., Springer, New York, 2012. · Zbl 06046475
[40] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein, Training neural networks without gradients: A scalable ADMM approach, in Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 2722-2731.
[41] Z. Zhang and M. Brand, Convergent block coordinate descent for training Tikhonov regularized deep neural networks, in Proceedings of the 31st Conference on Neural Information Processing Systems, 2017, pp. 1721-1730.
[42] Z. Zhang, Y. Chen, and V. Saligrama, Efficient training of very deep neural networks for supervised hashing, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1487-1495.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.