
Doubly infinite residual neural networks: a diffusion process approach. (English) Zbl 07415118

Summary: Modern neural networks featuring a large number of layers (depth) and units per layer (width) have achieved remarkable performance across many domains. While there exists a vast literature on the interplay between infinitely wide neural networks and Gaussian processes, little is known about the analogous interplay for infinitely deep neural networks. Neural networks with independent and identically distributed (i.i.d.) initializations exhibit undesirable forward and backward propagation properties as the number of layers increases, e.g., vanishing dependency on the input and perfectly correlated outputs for any two inputs. To overcome these drawbacks, Peluchetti and Favaro (2020) considered fully-connected residual networks (ResNets) whose parameters are initialized from distributions that shrink as the number of layers increases, thus establishing an interplay between infinitely deep ResNets and solutions to stochastic differential equations, i.e. diffusion processes, and showing that infinitely deep ResNets do not suffer from undesirable forward-propagation properties. In this paper, we review the results of Peluchetti and Favaro (2020), extending them to convolutional ResNets, and we establish analogous backward-propagation results, which relate directly to the problem of training fully-connected deep ResNets. We then investigate the more general setting of doubly infinite neural networks, where both the network's width and depth grow unboundedly. We focus on doubly infinite fully-connected ResNets, for which we consider i.i.d. initializations. Under this setting, we show that the dynamics of quantities of interest converge, at initialization, to deterministic limits. This allows us to provide analytical expressions for inference, both in the case of weakly trained and fully trained ResNets. Our results highlight the limited expressive power of doubly infinite ResNets when the unscaled network parameters are i.i.d. and the residual blocks are shallow.
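
In schematic notation (the symbols \(L\), \(x_k\), \(f\), \(W_k\), \(b_k\), \(\sigma_w\), \(\sigma_b\), \(\mu\), \(\sigma\) below are the reviewer's generic placeholders, not the authors' exact parameterization), the depth scaling behind this diffusion limit can be sketched as follows. A fully-connected ResNet with \(L\) residual blocks updates
\[ x_{k+1} = x_k + f(x_k;\, W_{k+1}, b_{k+1}), \qquad k = 0, \dots, L-1, \]
with parameters drawn from distributions that shrink with depth, e.g. \(W_k \sim \mathcal{N}(0, \sigma_w^2/L)\) and \(b_k \sim \mathcal{N}(0, \sigma_b^2/L)\). Identifying block \(k\) with time \(t = k/L\), the interpolation of \((x_k)_{k \le L}\) converges, as \(L \to \infty\), to the solution of a stochastic differential equation
\[ \mathrm{d}X_t = \mu(X_t)\,\mathrm{d}t + \sigma(X_t)\,\mathrm{d}B_t, \qquad X_0 = x_{\mathrm{input}}, \quad t \in [0,1], \]
whose drift and diffusion coefficients are determined by the residual block and the initialization variances; this non-degenerate limit is what avoids the vanishing-dependency and perfect-correlation pathologies of unscaled i.i.d. initializations.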

MSC:

68T05 Learning and adaptive systems in artificial intelligence

Software:

MNIST; torchdiffeq; Adam

References:

[1] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems 32, 2019.
[2] Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and Joern-Henrik Jacobsen. Invertible residual networks. In International Conference on Machine Learning, pages 573-582, 2019.
[3] Francesca Biagini, Yaozhong Hu, Bernt Øksendal, and Tusheng Zhang. Stochastic calculus for fractional Brownian motion and applications. Springer Science & Business Media, 2008. · Zbl 1157.60002
[4] Patrick Billingsley. Convergence of Probability Measures. Wiley-Interscience, 2nd edition, 1999. · Zbl 0944.60003
[5] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223-311, 2018. · Zbl 1397.65085
[6] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems 31, pages 6571-6583, 2018.
[7] Dami Choi, Christopher J Shallue, Zachary Nado, Jaehoon Lee, Chris J Maddison, and George E Dahl. On empirical comparisons of optimizers for deep learning. arXiv preprint
[8] Vincent Dumoulin and Francesco Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.
[9] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Brandon Tran, and Aleksander Madry. Adversarial robustness as a prior for learned representations, 2019.
[10] Stewart N Ethier and Thomas G Kurtz. Markov processes: characterization and convergence. Wiley-Interscience, 2009.
[11] Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow Gaussian processes. In International Conference on Learning Representations, 2019.
[12] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010.
[13] Arjun K Gupta and Daya K Nagar. Matrix variate distributions. Chapman and Hall/CRC, 1st edition, 1999.
[14] Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. On the impact of the activation function on deep neural networks training. In Proceedings of the 36th International Conference on Machine Learning, 2019a.
[15] Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. Training dynamics of deep networks using stochastic gradient descent via neural tangent kernel. arXiv preprint arXiv:1905.13654, 2019b.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, 2015.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016a.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630-645. Springer, 2016b.
[19] Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31, 2018. · Zbl 07765141
[20] Ioannis Karatzas and Steven Shreve. Brownian Motion and Stochastic Calculus. Springer, 2nd edition, 1999.
[21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
[22] P. E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Springer, corrected edition, 1992. · Zbl 0752.60043
[23] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[24] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
[25] Jaehoon Lee, Jascha Sohl-Dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018.
[26] Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent, 2019a.
[27] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems 32, 2019b.
[28] Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint
[29] Radford M Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.
[30] Daniel B Nelson. ARCH models as diffusion approximations. Journal of Econometrics, 45(1-2):7-38, 1990. · Zbl 0719.60089
[31] B. Øksendal. Stochastic Differential Equations: An Introduction with Applications. Springer, 6th edition, 2003. · Zbl 1025.60026
[32] Stefano Peluchetti and Stefano Favaro. Infinitely deep neural networks as diffusion processes. In Proceedings of the twenty-third international conference on artificial intelligence and statistics, 2020.
[33] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems 29, 2016.
[34] Giuseppe Da Prato and Jerzy Zabczyk. Stochastic Equations in Infinite Dimensions. 2nd edition, 2014. · Zbl 1317.60077
[35] Kenneth V Price. Differential evolution. In Handbook of optimization, pages 187-214. Springer, 2013.
[36] Philip E Protter. Stochastic integration and differential equations. Springer, 2005.
[37] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
[38] Daniel Revuz and Marc Yor. Continuous Martingales and Brownian Motion. Springer, 3rd edition, 1999. · Zbl 0917.60006
[39] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407, 1951. · Zbl 0054.05901
[40] Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. In International Conference on Learning Representations, 2017.
[41] Daniel W Stroock and SR Srinivasa Varadhan. Multidimensional diffusion processes. Springer, 2006. · Zbl 1103.60005
[42] Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for machine learning. MIT Press, Cambridge, MA, 2006. · Zbl 1177.68165
[43] Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems 30, 2017.
[44] Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 7103-7114. Curran Associates, Inc., 2017.