
Structure-preserving deep learning. (English) Zbl 07440570

Summary: Over the past few years, deep learning has risen to the foreground as a topic of massive interest, mainly as a result of successes obtained in solving large-scale image processing tasks. Applying deep learning involves several challenging mathematical problems: most deep learning methods require the solution of hard optimisation problems, and a good understanding of the trade-off between computational effort, amount of data and model complexity is required to successfully design a deep learning approach for a given problem. Much of the progress made in deep learning has been based on heuristic exploration, but there is a growing effort to mathematically understand the structure in existing deep learning methods and to systematically design new methods that preserve certain types of structure. In this article, we review a number of these directions: some deep neural networks can be understood as discretisations of dynamical systems; neural networks can be designed to have desirable properties such as invertibility or group equivariance; and new algorithmic frameworks based on conformal Hamiltonian systems and Riemannian manifolds have been proposed to solve the optimisation problems. We conclude our review of each of these topics by discussing some open problems that we consider to be interesting directions for future research.
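As a minimal illustration of the dynamical-systems viewpoint referred to in the summary (a sketch of the standard identification, not a formula quoted from the paper): a residual block with step size $h$ and layer parameters $\theta_k$,
\[
x_{k+1} = x_k + h\, f(x_k, \theta_k),
\]
is one step of the forward Euler method applied to the ordinary differential equation $\dot{x}(t) = f\bigl(x(t), \theta(t)\bigr)$. Replacing forward Euler by a structure-preserving integrator (for instance a symplectic, reversible or invertible scheme) then yields network architectures that inherit the corresponding property by construction.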

MSC:

68T07 Artificial neural networks and deep learning
65L05 Numerical methods for initial value problems involving ordinary differential equations
