# zbMATH — the first resource for mathematics

Multicomposite nonconvex optimization for training deep neural networks. (English) Zbl 1445.90086
##### MSC:
 90C26 Nonconvex programming, global optimization
 49J52 Nonsmooth analysis
##### Software:
 TensorFlow; PyTorch
##### References:
[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, tensorflow.org, 2015.
[2] D.P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, MA, 1999. · Zbl 1015.90077
[3] L. Bottou, F. Curtis, and J. Nocedal, Optimization methods for large-scale machine learning, SIAM Rev., 60 (2018), pp. 223-311. · Zbl 1397.65085
[4] R. Collobert and S. Bengio, Links between perceptrons, MLPs and SVMs, in Proceedings of the 21st International Conference on Machine Learning, 2004.
[5] Y. Cui, J.S. Pang, and B. Sen, Composite difference-max programs for modern statistical estimation problems, SIAM J. Optim., 28 (2018), pp. 3344-3374. · Zbl 1407.62250
[6] Y. Cui, D. Sun, and K.C. Toh, Computing the best approximation over the intersection of a polyhedral set and the doubly nonnegative cone, SIAM J. Optim., 29 (2019), pp. 2785-2813. · Zbl 1431.90109
[7] D. Davis and D. Drusvyatskiy, Stochastic model-based minimization of weakly convex functions, SIAM J. Optim., 29 (2019), pp. 207-239. · Zbl 1415.65136
[8] D. Davis, D. Drusvyatskiy, S. Kakade, and J. Lee, Stochastic subgradient method converges on tame functions, Found. Comput. Math., 20 (2020), pp. 119-154. · Zbl 1433.65141
[9] V.F. Demyanov, G. Di Pillo, and F. Facchinei, Exact penalization via Dini and Hadamard conditional derivatives, Optim. Methods Softw., 9 (1998), pp. 19-36. · Zbl 0903.90149
[10] G. Di Pillo and F. Facchinei, Exact penalty functions for nondifferentiable programming problems, in Nonsmooth Optimization and Related Topics, Springer, New York, 1989, pp. 89-107. · Zbl 0735.90061
[11] F. Facchinei and L. Lampariello, Partial penalization for the solution of generalized Nash equilibrium problems, J. Global Optim., 50 (2011), pp. 39-50. · Zbl 1236.91015
[12] F. Facchinei and J.S. Pang, Finite-Dimensional Variational Inequalities and Complementarity Problems, Springer, New York, 2003. · Zbl 1062.90002
[13] C.A. Floudas, Deterministic Global Optimization: Theory, Methods, and Applications, Nonconvex Optim. Appl. 37, Springer, New York, 2000.
[14] W. Gao, D. Goldfarb, and F. Curtis, ADMM for multiaffine constrained optimization, Optim. Methods Softw., 35 (2020), pp. 257-303. · Zbl 1428.90132
[15] S. Ghadimi and G. Lan, Stochastic first- and zeroth-order methods for nonconvex stochastic programming, SIAM J. Optim., 23 (2013), pp. 2341-2368. · Zbl 1295.90026
[16] X. Glorot, A. Bordes, and Y. Bengio, Deep sparse rectifier neural networks, in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011, pp. 315-323.
[17] G.H. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, 2013. · Zbl 1268.65037
[18] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge, MA, 2016. · Zbl 1373.68009
[19] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, Maxout networks, in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 1319-1327.
[20] G.E. Hinton, S. Osindero, and Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput., 18 (2006), pp. 1527-1554. · Zbl 1106.68094
[21] R.A. Horn and C.R. Johnson, Topics in Matrix Analysis, Cambridge University Press, Cambridge, UK, 1991. · Zbl 0729.15001
[22] K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Networks, 2 (1989), pp. 359-366. · Zbl 1383.92015
[23] K. Jarrett, K. Kavukcuoglu, and Y. LeCun, What is the best multi-stage architecture for object recognition?, in Proceedings of the IEEE 12th International Conference on Computer Vision, 2009, pp. 2146-2153.
[24] D. Kingma and J. Ba, Adam: A method for stochastic optimization, in Proceedings of the 3rd International Conference on Learning Representations, 2015.
[25] T. Lau, J. Zeng, B. Wu, and Y. Yao, A proximal block coordinate descent algorithm for deep neural network training, in Proceedings of the Workshop of the 6th International Conference on Learning Representations, 2018.
[26] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature, 521 (2015), pp. 436-444.
[27] M. Leshno, V.Y. Lin, A. Pinkus, and S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks, 6 (1993), pp. 861-867.
[28] A.L. Maas, A.Y. Hannun, and A.Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in Proceedings of the 30th International Conference on Machine Learning, 2013.
[29] A. Nemirovsky and D. Yudin, Problem Complexity and Method Efficiency in Optimization, John Wiley & Sons, New York, 1983.
[30] Y. Nesterov, A method of solving a convex programming problem with convergence rate $$O(1/k^2)$$, Soviet Math. Dokl., 27 (1983), pp. 372-376. · Zbl 0535.90071
[31] J.S. Pang and L. Qi, A globally convergent Newton method for convex $$SC^1$$ minimization problems, J. Optim. Theory Appl., 85 (1995), pp. 633-648. · Zbl 0831.90095
[32] J.S. Pang, M. Razaviyayn, and A. Alvarado, Computing B-stationary points of nonsmooth DC programs, Math. Oper. Res., 42 (2017), pp. 95-118. · Zbl 1359.90106
[33] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, Automatic Differentiation in PyTorch, https://pytorch.org, 2017.
[34] T. Pham Dinh and H.A. Le Thi, Convex analysis approach to DC programming: Theory, algorithms and applications, Acta Math. Vietnam., 22 (1997), pp. 289-355. · Zbl 0895.90152
[35] J.R. Quinlan, Combining instance-based and model-based learning, in Proceedings of the 10th International Conference on Machine Learning, 1993, pp. 236-243.
[36] H. Robbins and S. Monro, A stochastic approximation method, Ann. Math. Statist., 22 (1951), pp. 400-407. · Zbl 0054.05901
[37] K. Schäcke, On the Kronecker Product, https://www.math.uwaterloo.ca/~hwolkowi/henry/reports/kronthesisschaecke04.pdf, 2013.
[38] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks, 61 (2015), pp. 85-117.
[39] S. Scholtes, Introduction to Piecewise Differentiable Equations, SpringerBriefs Optim., Springer, New York, 2012. · Zbl 06046475
[40] G. Taylor, R. Burmeister, Z. Xu, B. Singh, A. Patel, and T. Goldstein, Training neural networks without gradients: A scalable ADMM approach, in Proceedings of the 33rd International Conference on Machine Learning, 2016, pp. 2722-2731.
[41] Z. Zhang and M. Brand, Convergent block coordinate descent for training Tikhonov regularized deep neural networks, in Proceedings of the 31st Conference on Neural Information Processing Systems, 2017, pp. 1721-1730.
[42] Z. Zhang, Y. Chen, and V. Saligrama, Efficient training of very deep neural networks for supervised hashing, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1487-1495.