Ollivier, Yann
Online natural gradient as a Kalman filter. (English) Zbl 1447.93352
Electron. J. Stat. 12, No. 2, 2930-2961 (2018).

Summary: We cast Amari’s natural gradient in statistical learning as a specific case of Kalman filtering. Namely, applying an extended Kalman filter to estimate a fixed unknown parameter of a probabilistic model from a series of observations is rigorously equivalent to estimating this parameter via an online stochastic natural gradient descent on the log-likelihood of the observations.

In the i.i.d. case, this relation is a consequence of the “information filter” phrasing of the extended Kalman filter. In the recurrent (state space, non-i.i.d.) case, we prove that the joint Kalman filter over states and parameters is a natural gradient on top of real-time recurrent learning (RTRL), a classical algorithm to train recurrent models.

This exact algebraic correspondence provides relevant interpretations for natural gradient hyperparameters such as learning rates or the initialization and regularization of the Fisher information matrix.

Cited in 12 Documents

MSC:
93E11 Filtering in stochastic control theory

Keywords: statistical learning; natural gradient; Kalman filter; stochastic gradient descent

Software: PRMLT
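The i.i.d. correspondence is easy to see numerically. The following is a minimal sketch, not taken from the paper, assuming a linear-Gaussian observation model y_t = x_t·θ + N(0, σ²) with a fixed true parameter θ (all variable names are illustrative). In this setting the extended Kalman filter over the static “state” θ reduces to recursive least squares, and writing it in information-filter form makes the update visibly identical to an online natural-gradient step preconditioned by the inverse of the accumulated Fisher information:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, T, sigma2 = 3, 500, 0.25
theta_true = rng.normal(size=dim)

theta = np.zeros(dim)   # parameter estimate, treated as the Kalman "state"
P = np.eye(dim)         # posterior covariance; P^{-1} accumulates Fisher information

for t in range(T):
    x = rng.normal(size=dim)                                # observation Jacobian H_t
    y = x @ theta_true + rng.normal(scale=np.sqrt(sigma2))  # noisy observation

    # Information-filter update: P^{-1} <- P^{-1} + H_t^T R^{-1} H_t,
    # i.e. add the Fisher information contributed by one observation.
    J = np.linalg.inv(P) + np.outer(x, x) / sigma2
    P = np.linalg.inv(J)

    # Kalman correction written as a natural-gradient step:
    # grad is the gradient of the negative log-likelihood at theta,
    # and theta <- theta - P @ grad reproduces the EKF update exactly.
    grad = -(y - x @ theta) / sigma2 * x
    theta = theta - P @ grad

print("EKF / natural-gradient estimate:", theta)
print("true parameter:                 ", theta_true)
```

In this reading, the initialization P = I plays the role of the Fisher-matrix regularizer, and injecting process noise into P at each step would act as a fading-memory factor, keeping the effective learning rate from decaying; interpretations of this kind are what the summary above refers to.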
References:
[1] [Ama98] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Comput., 10:251–276, February 1998.
[2] [AN00] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191 of Translations of Mathematical Monographs. American Mathematical Society, Providence, RI, 2000. Translated from the 1993 Japanese original by Daishi Harada. · Zbl 0960.62005
[3] [APF00] Shun-ichi Amari, Hyeyoung Park, and Kenji Fukumizu. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12(6):1399–1409, 2000.
[4] [Ber96] Dimitri P. Bertsekas. Incremental least squares methods and the extended Kalman filter. SIAM Journal on Optimization, 6(3):807–822, 1996. · Zbl 0945.93026 · doi:10.1137/S1052623494268522
[5] [Bis06] Christopher M. Bishop. Pattern recognition and machine learning. Springer, 2006. · Zbl 1107.68072
[6] [BL03] Léon Bottou and Yann LeCun. Large scale online learning. In NIPS, volume 30, page 77, 2003.
[7] [BRD97] M. Boutayeb, H. Rafaralahy, and M. Darouach. Convergence analysis of the extended Kalman filter used as an observer for nonlinear deterministic discrete-time systems. IEEE Transactions on Automatic Control, 42(4):581–586, 1997. · Zbl 0876.93089 · doi:10.1109/9.566674
[8] [BV04] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004. · Zbl 1058.90049
[9] [dFNG00] João F. G. de Freitas, Mahesan Niranjan, and Andrew H. Gee. Hierarchical Bayesian models for regularization in sequential learning. Neural Computation, 12(4):933–953, 2000.
[10] [GA15] Mohinder S. Grewal and Angus P. Andrews. Kalman filtering: Theory and practice using MATLAB. Wiley, 4th edition, 2015. · Zbl 1322.93001
[11] [GHL87] S. Gallot, D. Hulin, and J. Lafontaine. Riemannian geometry. Universitext. Springer-Verlag, Berlin, 1987. · Zbl 0636.53001
[12] [GS15] Roger B. Grosse and Ruslan Salakhutdinov. Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix. In ICML, pages 2304–2313, 2015.
[13] [Hay01] Simon Haykin. Kalman filtering and neural networks. John Wiley & Sons, 2001.
[14] [Jaz70] Andrew H. Jazwinski. Stochastic processes and filtering theory. Academic Press, 1970. · Zbl 0313.93059
[15] [Kul97] Solomon Kullback. Information theory and statistics. Dover Publications Inc., Mineola, NY, 1997. Reprint of the second (1968) edition. · Zbl 0897.62003
[16] [LCL\(^+\)17] Yubo Li, Yongqiang Cheng, Xiang Li, Xiaoqiang Hua, and Yuliang Qin. Information geometric approach to recursive update in nonlinear filtering. Entropy, 19(2):54, 2017.
[17] [LMB07] Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems 20 (NIPS 2007), Vancouver, British Columbia, Canada, December 3-6, 2007, pages 849–856, 2007.
[18] [LS83] Lennart Ljung and Torsten Söderström. Theory and Practice of Recursive Identification. MIT Press, 1983. · Zbl 0548.93075
[19] [Mar14] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
[20] [MG15] James Martens and Roger B. Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, pages 2408–2417, 2015.
[21] [OAAH17] Yann Ollivier, Ludovic Arnold, Anne Auger, and Nikolaus Hansen. Information-geometric optimization algorithms: A unifying picture via invariance principles. Journal of Machine Learning Research, 18(18):1–65, 2017. · Zbl 1433.90196
[22] [Oll15] Yann Ollivier. Riemannian metrics for neural networks I: feedforward networks. Information and Inference, 4(2):108–153, 2015. · Zbl 1380.68337 · doi:10.1093/imaiai/iav006
[23] [Pat16] Vivak Patel. Kalman-based stochastic gradient method with stop condition and insensitivity to conditioning. SIAM Journal on Optimization, 26(4):2620–2648, 2016. · Zbl 1388.93091 · doi:10.1137/15M1048239
[24] [PB13] Razvan Pascanu and Yoshua Bengio. Natural gradient revisited. CoRR, abs/1301.3584, 2013.
[25] [PJ92] Boris T. Polyak and Anatoli B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992. · Zbl 0762.62022 · doi:10.1137/0330046
[26] [RRK\(^+\)92] Dennis W. Ruck, Steven K. Rogers, Matthew Kabrisky, Peter S. Maybeck, and Mark E. Oxley. Comparative analysis of backpropagation and the extended Kalman filter for training multilayer perceptrons. IEEE Transactions on Pattern Analysis & Machine Intelligence, (6):686–691, 1992.
[27] [Sim06] Dan Simon. Optimal state estimation: Kalman, \(H_\infty\), and nonlinear approaches. John Wiley & Sons, 2006.
[28] [ŠKT01] Miroslav Šimandl, Jakub Královec, and Petr Tichavský. Filtering, predictive, and smoothing Cramér–Rao bounds for discrete-time nonlinear dynamic systems. Automatica, 37(11):1703–1716, 2001. · Zbl 1031.93146
[29] [SW88] Sharad Singhal and Lance Wu. Training multilayer perceptrons with the extended Kalman algorithm. In NIPS, pages 133–140, 1988.
[30] [Sä13] Simo Särkkä. Bayesian filtering and smoothing. Cambridge University Press, 2013. · Zbl 1274.62021
[31] [vdV00] A. W. van der Vaart. Asymptotic statistics. Cambridge University Press, 2000. · Zbl 0910.62001
[32] [Wil92] Ronald J. Williams. Training recurrent networks using the extended Kalman filter. In International Joint Conference on Neural Networks (IJCNN), volume 4, pages 241–246. IEEE, 1992.
[33] [WN96] Eric A. Wan and Alex T. Nelson. Dual Kalman filtering methods for nonlinear prediction, smoothing and estimation. In NIPS, pages 793–799, 1996.