
Convex geometry and duality of over-parameterized neural networks. (English) Zbl 07626727

Summary: We develop a convex analytic approach to analyze finite-width two-layer ReLU networks. We first prove that an optimal solution to the regularized training problem can be characterized in terms of the extreme points of a convex set, where simple solutions are encouraged via its convex geometrical properties. We then leverage this characterization to show that an optimal set of parameters yields linear spline interpolation for regression problems involving one-dimensional or rank-one data. We also characterize the classification decision regions in terms of a kernel matrix and minimum \(\ell_1\)-norm solutions. This is in contrast to the Neural Tangent Kernel, which is unable to explain the predictions of finite-width networks. Our convex geometric characterization also provides intuitive explanations of hidden neurons as auto-encoders. In higher dimensions, we show that the training problem can be cast as a finite-dimensional convex problem with infinitely many constraints. We then apply certain convex relaxations and introduce a cutting-plane algorithm to globally optimize the network. We further analyze the exactness of the relaxations to provide conditions for convergence to a global optimum. Our analysis also shows that the optimal network parameters can be characterized by interpretable closed-form formulas in some practically relevant special cases.
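To make the semi-infinite convex formulation mentioned in the summary concrete, the following is a schematic of the weight-decay-regularized two-layer ReLU training objective and the type of dual program it admits; the notation here is illustrative and may differ from the paper's. Assume \(X \in \mathbb{R}^{n \times d}\) is the data matrix, \(y \in \mathbb{R}^n\) the targets, \(m\) the number of hidden neurons, \((\cdot)_+\) the ReLU, and \(\beta > 0\) the regularization parameter:
\[
\min_{\{u_j,\alpha_j\}_{j=1}^{m}} \; \tfrac{1}{2}\Big\| \sum_{j=1}^{m} (X u_j)_+ \,\alpha_j - y \Big\|_2^2 \;+\; \tfrac{\beta}{2} \sum_{j=1}^{m} \big(\|u_j\|_2^2 + \alpha_j^2\big),
\]
whose convex dual takes the form
\[
\max_{v \in \mathbb{R}^n} \; -\tfrac{1}{2}\|v - y\|_2^2 + \tfrac{1}{2}\|y\|_2^2 \quad \text{s.t.} \quad \big| v^\top (X u)_+ \big| \le \beta \;\; \text{for all } u \text{ with } \|u\|_2 \le 1.
\]
This is a finite-dimensional problem in \(v\) with infinitely many constraints, one for each direction \(u\) on the unit ball; a cutting-plane scheme of the kind described in the summary proceeds by iteratively adding violated constraints, each of which corresponds to a hidden neuron.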

MSC:

68T05 Learning and adaptive systems in artificial intelligence

Software:

MNIST; SDPT3; SuperMann; CVX; SPGL1
Full Text: arXiv Link
