Hidden unit specialization in layered neural networks: ReLU vs. sigmoidal activation. (English) Zbl 07459771

Summary: By applying concepts from the statistical physics of learning, we study layered neural networks of rectified linear units (ReLU). The comparison with conventional, sigmoidal activation functions is at the center of interest. We compute typical learning curves for large shallow networks with \(K\) hidden units in matching student-teacher scenarios. The systems undergo phase transitions, i.e., sudden changes of the generalization performance via the process of hidden unit specialization at critical sizes of the training set. Surprisingly, our results show that the training behavior of ReLU networks is qualitatively different from that of networks with sigmoidal activations. In networks with \(K \geq 3\) sigmoidal hidden units, the transition is discontinuous: specialized network configurations co-exist and compete with states of poor performance even for very large training sets. In contrast, the use of ReLU activations results in continuous transitions for all \(K\). For large enough training sets, two competing, differently specialized states display similar generalization abilities, which coincide exactly in the limit \(K \to \infty\) of large hidden layers. Our findings are also confirmed in Monte Carlo simulations of the training processes.
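The matching student-teacher scenario summarized above can be illustrated with a small numerical sketch. The snippet below is not the paper's method (its learning curves follow from an analytic statistical-physics treatment, with Monte Carlo simulations of training as confirmation); it merely shows, under assumed names (`committee_output`, `generalization_error`), how one would estimate the generalization error of a student soft committee machine against a teacher of identical architecture, for ReLU and for a sigmoidal activation (here tanh, as a stand-in).

```python
import numpy as np


def committee_output(x, w, activation):
    """Output of a shallow soft committee machine.

    x: inputs of shape (..., N); w: K hidden-unit weight vectors, shape (K, N).
    The output is the unweighted sum of the K hidden-unit activations of the
    local fields w_k . x / sqrt(N).
    """
    n = x.shape[-1]
    local_fields = x @ w.T / np.sqrt(n)          # shape (..., K)
    return activation(local_fields).sum(axis=-1)


def generalization_error(w_student, w_teacher, activation, n_test=20000, seed=0):
    """Monte Carlo estimate of eps_g = (1/2) <(student - teacher)^2>
    over i.i.d. standard-Gaussian inputs (an illustrative estimator,
    not the analytic result of the paper)."""
    rng = np.random.default_rng(seed)
    n = w_student.shape[1]
    x = rng.standard_normal((n_test, n))
    s = committee_output(x, w_student, activation)
    t = committee_output(x, w_teacher, activation)
    return 0.5 * np.mean((s - t) ** 2)
```

A fully specialized student (weights identical to the teacher's) drives this error to zero, while a perturbed, unspecialized student does not; sweeping the size of a training set used to fit the student and plotting this error would trace out learning curves of the kind studied in the paper.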


82-XX Statistical mechanics, structure of matter
Full Text: DOI arXiv


[1] Hertz, J.; Krogh, A.; Palmer, R., Introduction to the Theory of Neural Computation (1991), Addison-Wesley, Reading, MA, USA
[2] Bishop, C., Neural Networks for Pattern Recognition (1995), Oxford University Press, New York, NY, USA
[3] Engel, A.; van den Broeck, C., The Statistical Mechanics of Learning (2001), Cambridge University Press, Cambridge, UK · Zbl 0984.82034
[4] Hastie, T.; Tibshirani, R.; Friedman, J., The Elements of Statistical Learning, Springer Series in Statistics (2001), Springer, New York, NY, USA · Zbl 0973.62007
[5] Bishop, C., Pattern Recognition and Machine Learning (Information Science and Statistics) (2006), Springer, Heidelberg, Germany
[6] Goodfellow, I.; Bengio, Y.; Courville, A., Deep Learning (2016), MIT Press, Cambridge, MA, USA · Zbl 1373.68009
[7] LeCun, Y.; Bengio, Y.; Hinton, G., Deep learning, Nature, 521, 436-444 (2015)
[8] Angelov, P.; Sperduti, A., Challenges in deep learning, (Verleysen, M., Proc. of the European Symposium on Artificial Neural Networks (ESANN) (2016), i6doc.com), 489-495
[9] Ramachandran, P.; Zoph, B.; Le, Q. V., Searching for activation functions (2017), ArXiv abs/1710.05941, Presented at: Sixth Intl. Conf. on Learning Representations, ICLR 2018
[10] Eger, S.; Youssef, P.; Gurevych, I., Is it time to swish? Comparing deep learning activation functions across NLP tasks, (Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (2018), Association for Computational Linguistics: Association for Computational Linguistics Brussels, Belgium), 4415-4424
[11] A.L. Maas, A.Y. Hannun, A.Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in: Proc. 30th ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
[12] Hahnloser, R.; Sarpeshkar, R.; Mahowald, M.; Douglas, R.; Seung, S., Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit, Nature, 405, 947-951 (2000)
[13] Krizhevsky, A.; Sutskever, I.; Hinton, G. E., ImageNet classification with deep convolutional neural networks, (Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS) - Volume 1 (2012), Curran Assoc. Inc., USA), 1097-1105
[14] Nair, V.; Hinton, G., Rectified linear units improve restricted Boltzmann machines, (Proc. 27th International Conference on Machine Learning (ICML) (2010), Omnipress, USA), 807-814
[15] Villmann, T.; Ravichandran, J.; Villmann, A.; Nebel, D.; Kaden, M., Investigation of activation functions for generalized learning vector quantization, (Vellido, A.; Gibert, K.; Angulo, C.; Martín Guerrero, J., Advances in Self-Organizing Maps, Learning Vector Quantization, Clustering and Data Visualization, WSOM 2019, Advances in Intelligent Systems and Computing, vol. 976 (2019), Springer, Cham), 179-188
[16] Glorot, X.; Bordes, A.; Bengio, Y., Deep sparse rectifier neural networks, (Gordon, G.; Dunson, D.; Dudík, M., Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 15 (2011), PMLR, Fort Lauderdale, FL, USA), 315-323
[17] Seung, H. S.; Sompolinsky, H.; Tishby, N., Statistical mechanics of learning from examples, Phys. Rev. A, 45, 6056-6091 (1992)
[18] Watkin, T. L.H.; Rau, A.; Biehl, M., The statistical mechanics of learning a rule, Rev. Modern Phys., 65, 2, 499-556 (1993)
[19] Kinzel, W., Phase transitions of neural networks, Phil. Mag. B, 77, 5, 1455-1477 (1998)
[20] Opper, M., Learning and generalization in a two-layer neural network: The role of the Vapnik-Chervonenkis dimension, Phys. Rev. Lett., 72, 2113-2116 (1994)
[21] Biehl, M.; Caticha, N., The statistical mechanics of on-line learning and generalization, (Arbib, M., The Handbook of Brain Theory and Neural Networks (2003), MIT Press, Cambridge, MA), 1095-1098
[22] Biehl, M.; Schwarze, H., Learning by on-line gradient descent, J. Phys. A, 28, 643-656 (1995) · Zbl 0960.68635
[23] Saad, D.; Solla, S. A., Exact solution for on-line learning in multilayer neural networks, Phys. Rev. Lett., 74, 4337-4340 (1995)
[24] Saad, D.; Solla, S., On-line learning in soft committee machines, Phys. Rev. E, 52, 4, 4225-4242 (1995)
[25] Biehl, M.; Riegler, P.; Wöhler, C., Transient dynamics of on-line learning in two-layered neural networks, J. Phys. A, 29, 4769-4780 (1996) · Zbl 0902.68156
[26] Vicente, R.; Caticha, N., Functional optimization of online algorithms in multilayer neural networks, J. Phys. A, 30, 17, L599-L605 (1997) · Zbl 0961.82506
[27] Herschkowitz, D.; Opper, M., Retarded learning: Rigorous results from statistical mechanics, Phys. Rev. Lett., 86, 2174-2177 (2001)
[28] Kang, K.; Oh, J.-H.; Kwon, C.; Park, Y., Generalization in a two-layer neural network, Phys. Rev. E, 48, 4805-4809 (1993)
[29] Biehl, M.; Ahr, M.; Schlösser, E., Statistical physics of learning: phase transitions in multilayered neural networks, (Kramer, B., Advances in Solid State Physics, Vol. 40 (2000), Vieweg), 819-826
[30] Biehl, M.; Schlösser, E.; Ahr, M., Phase transitions in soft-committee machines, Europhys. Lett., 44, 261-267 (1998)
[31] Ahr, M.; Biehl, M.; Urbanczik, R., Statistical physics and practical training of soft-committee machines, Eur. Phys. J. B, 10, 3, 583-588 (1999)
[32] Saitta, L.; Giordana, A.; Cornuéjols, A., Phase Transitions in Machine Learning (2011), Cambridge University Press, Cambridge, UK · Zbl 1246.68012
[33] Cocco, S.; Monasson, R.; Posani, L.; Rosay, S.; Tubiana, J., Statistical physics and representations in real and artificial neural networks, Phys. A, 504, 45-76 (2018)
[34] Kadmon, J.; Sompolinsky, H., Optimal architectures in a solvable model of deep networks, (Advances in Neural Information Processing Systems (NIPS 29) (2016), Curran Assoc. Inc.), 4781-4789
[35] Mehta, P.; Schwab, D., An exact mapping between the variational renormalization group and deep learning (2014), arXiv repository [stat.ML] (1410.3831)
[36] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, S. Ganguli, Deep unsupervised learning using non-equilibrium thermodynamics, in: Proc. of Machine Learning Research, Vol. 37, 2015, pp. 2256-2265.
[37] Caticha, N.; Calsaverini, R.; Vicente, R., Phase transition from egalitarian to hierarchical societies driven by competition between cognitive and social constraints (2016), arXiv repository (1608.03637)
[38] Biehl, M.; Caticha, N.; Opper, M.; Villmann, T., Statistical physics of learning and inference, (Verleysen, M., 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN) (2019), i6doc.com), 501-509
[39] Goldt, S.; Mézard, M.; Krzakala, F.; Zdeborová, L., Modelling the influence of data structure on learning in neural networks (2019), arXiv e-print arXiv:1909.11500
[40] Carleo, G.; Cirac, I.; Cranmer, K.; Daudet, L.; Schuld, M.; Tishby, N.; Vogt-Maranto, L.; Zdeborová, L., Machine learning and the physical sciences, Rev. Modern Phys., 91, Article 045002 pp. (2019)
[41] Gabrié, M., Mean-field inference methods for neural networks, J. Phys. A, 53, 22, Article 223002 pp. (2020)
[42] Bahri, Y.; Kadmon, J.; Pennington, J.; Schoenholz, S. S.; Sohl-Dickstein, J.; Ganguli, S., Statistical mechanics of deep learning, Annu. Rev. Condens. Matter Phys., 11, 1, 501-528 (2020)
[43] Aubin, B.; Maillard, A.; Barbier, J.; Krzakala, F.; Macris, N.; Zdeborová, L., The committee machine: computational to statistical gaps in learning a two-layers neural network, J. Stat. Mech. Theory Exp., 2019, 12, Article 124023 pp. (2019) · Zbl 1459.82248
[44] Straat, M.; Biehl, M., On-line learning dynamics of ReLU neural networks using statistical physics techniques, (Verleysen, M., 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN) (2019), i6doc.com), 517-522
[45] Goldt, S.; Advani, M.; Saxe, A. M.; Krzakala, F.; Zdeborová, L., Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup, (Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; Garnett, R., Advances in Neural Information Processing Systems 32 (2019), Curran Associates, Inc.), 6981-6991
[46] Dauphin, Y.; Pascanu, R.; Gulcehre, C.; Cho, K.; Ganguli, S.; Bengio, Y., Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, (Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N.; Weinberger, K., Advances in Neural Information Processing Systems (NIPS 27) (2014), Curran Assoc. Inc.), 2933-2941
[47] Saxe, A. M.; McClelland, J. L.; Ganguli, S., Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, (Bengio, Y.; LeCun, Y., 2nd International Conference on Learning Representations (ICLR), Conference Track Proceedings (2014))
[48] Baldassi, C.; Malatesta, E. M.; Zecchina, R., Properties of the geometry of solutions and capacity of multilayer neural networks with rectified linear unit activations, Phys. Rev. Lett., 123, Article 170602 pp. (2019)
[49] Urbanczik, R., Storage capacity of the fully-connected committee machine, J. Phys. A, 30, 11, L387-L392 (1997) · Zbl 0938.82513
[50] Schwarze, H.; Hertz, J., Generalization in fully connected committee machines, Europhys. Lett., 21, 7, 785-790 (1993) · Zbl 0942.68664
[51] Cybenko, G., Approximation by superpositions of a sigmoidal function, Math. Control Signals Systems, 2, 4, 303-314 (1989) · Zbl 0679.94019
[52] Hornik, K., Approximation capabilities of multilayer feedforward networks, Neural Netw., 4, 2, 251-257 (1991)
[53] Hanin, B., Universal function approximation by deep neural nets with bounded width and ReLU activations, Mathematics, 7, 10, 992 (2019)
[54] Endres, D.; Riegler, P., Learning dynamics on different timescales, J. Phys. A, 32, 49, 8655-8663 (1999) · Zbl 0955.82024
[55] Yoshida, Y.; Karakida, R.; Okada, M.; Amari, S., Statistical mechanical analysis of online learning with weight normalization in single layer perceptron, J. Phys. Soc. Japan, 86, 4, Article 044002 pp. (2017)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.