Wide neural networks of any depth evolve as linear models under gradient descent.

*(English)* Zbl 07330523

##### MSC:

82 | Statistical mechanics, structure of matter |

##### Keywords:

machine learning

\textit{J. Lee} et al., J. Stat. Mech. Theory Exp. 2020, No. 12, Article ID 124002, 15 p. (2020; Zbl 07330523)


##### References:

[1] | Abadi M et al 2016 Tensorflow: a system for large-scale machine learning 12th USENIX Symp. on Operating Systems Design and Implementation (OSDI 16) |

[2] | Allen-Zhu Z, Li Y and Song Z 2018 On the convergence rate of training recurrent neural networks (arXiv:1810.12065) |

[3] | Allen-Zhu Z, Li Y and Song Z 2019 A convergence theory for deep learning via over-parameterization Int. Conf. on Machine Learning |

[4] | Chen M, Pennington J and Schoenholz S 2018 Dynamical isometry and a mean field theory of RNNs: gating enables signal propagation in recurrent neural networks Int. Conf. on Machine Learning |

[5] | Chizat L and Bach F 2018 On the global convergence of gradient descent for over-parameterized models using optimal transport Advances in Neural Information Processing Systems |

[6] | Chizat L, Oyallon E and Bach F 2018 On lazy training in differentiable programming (arXiv:1812.07956) |

[7] | Cho Y and Saul L K 2009 Kernel methods for deep learning Advances in Neural Information Processing Systems |

[8] | Daniely A 2017 SGD learns the conjugate kernel class of the network Advances in Neural Information Processing Systems pp 2422-30 |

[9] | Daniely A, Frostig R and Singer Y 2016 Toward deeper understanding of neural networks: the power of initialization and a dual view on expressivity Advances in Neural Information Processing Systems pp 2253-61 |

[10] | Devlin J, Chang M-W, Lee K and Toutanova K 2018 BERT: pre-training of deep bidirectional transformers for language understanding (arXiv:1810.04805) |

[11] | Dragomir S S 2003 Some Gronwall Type Inequalities and Applications (New York: Nova Science Publishers) · Zbl 1094.34001 |

[12] | Du S S, Lee J D, Li H, Wang L and Zhai X 2019 Gradient descent finds global minima of deep neural networks Int. Conf. on Machine Learning |

[13] | Frostig R, Hawkins P, Johnson M, Leary C and Maclaurin D 2018 JAX: Autograd and XLA (www.github.com/google/jax) |

[14] | Garriga-Alonso A, Aitchison L and Rasmussen C E 2019 Deep convolutional networks as shallow Gaussian processes Int. Conf. on Learning Representations |

[15] | Glorot X and Bengio Y 2010 Understanding the difficulty of training deep feedforward neural networks Int. Conf. on Artificial Intelligence and Statistics pp 249-56 |

[16] | He K, Zhang X, Ren S and Sun J 2016 Deep residual learning for image recognition Conf. on Computer Vision and Pattern Recognition pp 770-8 |

[17] | Jacot A, Gabriel F and Hongler C 2018 Neural tangent kernel: convergence and generalization in neural networks Advances in Neural Information Processing Systems pp 8571-80 |

[18] | Karras T, Aila T, Laine S and Lehtinen J 2018 Progressive growing of GANs for improved quality, stability, and variation Int. Conf. on Learning Representations |

[19] | Krizhevsky A, Sutskever I and Hinton G E 2012 Imagenet classification with deep convolutional neural networks Advances in Neural Information Processing Systems |

[20] | Lee J, Bahri Y, Novak R, Schoenholz S, Pennington J and Sohl-Dickstein J 2018 Deep neural networks as Gaussian processes Int. Conf. on Learning Representations |

[21] | Matthews A G. de G, Hron J, Turner R E and Ghahramani Z 2017 Sample-then-optimize posterior sampling for Bayesian linear models NeurIPS Workshop on Advances in Approximate Bayesian Inference (http://approximateinference.org/2017/accepted/MatthewsEtAl2017.pdf) |

[22] | Matthews A G. de G, Hron J, Rowland M, Turner R E and Ghahramani Z 2018a Gaussian process behaviour in wide deep neural networks Int. Conf. on Learning Representations vol 4 (https://openreview.net/forum?id=H1-nGgWC-) |

[23] | Matthews A G. de G, Hron J, Rowland M, Turner R E and Ghahramani Z 2018b Gaussian process behaviour in wide deep neural networks (arXiv:1804.11271) |

[24] | Mei S, Montanari A and Nguyen P-M 2018 A mean field view of the landscape of two-layer neural networks Proc. Natl Acad. Sci. 115 E7665-71 · Zbl 1416.92014 |

[25] | Neal R M 1994 Priors for infinite networks Technical Report No. CRG-TR-94-1 |

[26] | Neyshabur B, Tomioka R and Srebro N 2015 In search of the real inductive bias: on the role of implicit regularization in deep learning Int. Conf. on Learning Representations Workshop Track |

[27] | Neyshabur B, Li Z, Bhojanapalli S, LeCun Y and Srebro N 2019 The role of over-parametrization in generalization of neural networks Int. Conf. on Learning Representations |

[28] | Novak R, Bahri Y, Abolafia D A, Pennington J and Sohl-Dickstein J 2018 Sensitivity and generalization in neural networks: an empirical study Int. Conf. on Learning Representations |

[29] | Novak R et al 2019a Bayesian deep convolutional networks with many channels are Gaussian processes Int. Conf. on Learning Representations |

[30] | Novak R, Xiao L, Hron J, Lee J, Alemi A A, Sohl-Dickstein J and Schoenholz S S 2019b Neural tangents: fast and easy infinite neural networks in Python (arXiv:1912.02803) |

[31] | Park D S, Sohl-Dickstein J, Le Q V and Smith S L 2019 The effect of network width on stochastic gradient descent and generalization: an empirical study Int. Conf. on Machine Learning |

[32] | Poole B, Lahiri S, Raghu M, Sohl-Dickstein J and Ganguli S 2016 Exponential expressivity in deep neural networks through transient chaos Advances in Neural Information Processing Systems pp 3360-8 |

[33] | Qian N 1999 On the momentum term in gradient descent learning algorithms Neural Networks 12 145-51 |

[34] | Vershynin R 2010 Introduction to the non-asymptotic analysis of random matrices (arXiv:1011.3027) |

[35] | Rotskoff G M and Vanden-Eijnden E 2018 Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks Advances in Neural Information Processing Systems |

[36] | Saxe A M, McClelland J L and Ganguli S 2014 Exact solutions to the nonlinear dynamics of learning in deep linear neural networks Int. Conf. on Learning Representations |

[37] | Schoenholz S S, Gilmer J, Ganguli S and Sohl-Dickstein J 2017 Deep information propagation Int. Conf. on Learning Representations |

[38] | Sirignano J and Spiliopoulos K 2018 Mean field analysis of neural networks (arXiv:1805.01053) |

[39] | Su W, Boyd S and Candes E 2014 A differential equation for modeling Nesterov's accelerated gradient method: theory and insights Advances in Neural Information Processing Systems pp 2510-8 |

[40] | van Laarhoven T 2017 L2 regularization versus batch and weight normalization (arXiv:1706.05350) |

[41] | Williams C K I 1997 Computing with infinite networks Advances in Neural Information Processing Systems pp 295-301 |

[42] | Xiao L, Bahri Y, Sohl-Dickstein J, Schoenholz S and Pennington J 2018 Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks Int. Conf. on Machine Learning |

[43] | Yang G 2019 Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation (arXiv:1902.04760) |

[44] | Yang G and Schoenholz S 2017 Mean field residual networks: on the edge of chaos Advances in Neural Information Processing Systems pp 7103-14 |

[45] | Yang G, Pennington J, Rao V, Sohl-Dickstein J and Schoenholz S 2019 A mean field theory of batch normalization Int. Conf. on Learning Representations |

[46] | Zagoruyko S and Komodakis N 2016 Wide residual networks British Machine Vision Conf. |

[47] | Zou D, Yuan C, Zhou D and Gu Q 2020 Gradient descent optimizes over-parameterized deep ReLU networks Mach. Learn. 109 467-92 · Zbl 07205217 |

[48] | Zhang C, Bengio S and Singer Y 2019 Are all layers created equal? (arXiv:1902.01996) |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.