
Quantifying the generalization error in deep learning in terms of data distribution and neural network smoothness. (English) Zbl 1475.68315

Summary: The accuracy of deep learning, i.e., of deep neural networks, can be characterized by decomposing the total error into three main types: approximation error, optimization error, and generalization error. Whereas there are some satisfactory answers to the problems of approximation and optimization, much less is known about the theory of generalization. Most existing theoretical works on generalization fail to explain the performance of neural networks in practice. To derive a meaningful bound, we study the generalization error of neural networks for classification problems in terms of the data distribution and neural network smoothness. We introduce the cover complexity (CC) to measure the difficulty of learning a data set and the inverse of the modulus of continuity to quantify neural network smoothness. A quantitative bound for the expected accuracy/error is derived by considering both the CC and the neural network smoothness. Although most of the analysis is general and not specific to neural networks, we validate our theoretical assumptions and results numerically for neural networks on several image data sets. The numerical results confirm that the expected error of trained networks, scaled by the square root of the number of classes, depends linearly on the CC. We also observe a clear consistency between the test loss and the neural network smoothness during the training process. In addition, we demonstrate empirically that the network smoothness decreases as the network size increases, whereas it is insensitive to the training data set size.
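For orientation, the smoothness measure can be read against the standard modulus of continuity and its generalized inverse; the paper's exact normalization may differ from this textbook form:
\[
\omega_f(\delta) = \sup_{\|x-y\|\le \delta} \|f(x)-f(y)\|,
\qquad
\omega_f^{-1}(\varepsilon) = \sup\{\delta \ge 0 : \omega_f(\delta) \le \varepsilon\},
\]
so that a larger value of \(\omega_f^{-1}(\varepsilon)\) corresponds to a smoother network \(f\), consistent with the reported observation that smoothness decreases as the network size grows.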

MSC:

68T07 Artificial neural networks and deep learning
