
Learning scalable deep kernels with recurrent structure. (English) Zbl 1434.68390

Summary: Many applications in speech, robotics, finance, and biology deal with sequential data, where ordering matters and recurrent structures are common. However, this structure cannot be easily captured by standard kernel functions. To model such structure, we propose expressive closed-form kernel functions for Gaussian processes. The resulting model, GP-LSTM, fully encapsulates the inductive biases of long short-term memory (LSTM) recurrent networks, while retaining the non-parametric probabilistic advantages of Gaussian processes. We learn the properties of the proposed kernels by optimizing the Gaussian process marginal likelihood using a new provably convergent semi-stochastic gradient procedure, and exploit the structure of these kernels for scalable training and prediction. This approach provides a practical representation for Bayesian LSTMs. We demonstrate state-of-the-art performance on several benchmarks, and thoroughly investigate a consequential autonomous driving application, where the predictive uncertainties provided by GP-LSTM are uniquely valuable.
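The construction described in the summary can be illustrated with a short sketch: an LSTM maps each input sequence to an embedding, a base kernel is applied on top of the embeddings, and the recurrent and kernel parameters are learned jointly by maximizing the Gaussian process marginal likelihood. The code below is a minimal illustration only, not the authors' implementation; the RBF base kernel, the dimensions, and the plain Adam optimizer are assumptions standing in for the paper's semi-stochastic gradient procedure and structured-kernel scalability techniques.

```python
# Minimal GP-LSTM-style sketch (illustrative assumptions, not the paper's code):
# k(x, x') = RBF(h(x), h(x')) with h an LSTM embedding, trained by maximizing
# the exact GP marginal likelihood on a small toy dataset.
import math
import torch
import torch.nn as nn


class GPLSTMKernel(nn.Module):
    def __init__(self, input_dim, hidden_dim=32, embed_dim=8):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, embed_dim)        # recurrent embedding h(x)
        self.log_lengthscale = nn.Parameter(torch.zeros(()))
        self.log_signal_var = nn.Parameter(torch.zeros(()))
        self.log_noise_var = nn.Parameter(torch.log(torch.tensor(0.1)))

    def embed(self, X):
        # X: (n, seq_len, input_dim) -> (n, embed_dim), using the final LSTM state
        out, _ = self.lstm(X)
        return self.proj(out[:, -1, :])

    def forward(self, X):
        # RBF kernel on the LSTM embeddings, plus observation noise on the diagonal
        H = self.embed(X)
        d2 = (H.unsqueeze(1) - H.unsqueeze(0)).pow(2).sum(-1)   # squared distances
        K = self.log_signal_var.exp() * torch.exp(-0.5 * d2 / self.log_lengthscale.exp() ** 2)
        return K + self.log_noise_var.exp() * torch.eye(len(X))


def neg_log_marginal_likelihood(K, y):
    # Standard GP regression objective: 0.5 * (y^T K^{-1} y + log|K| + n log 2*pi)
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(y.unsqueeze(-1), L).squeeze(-1)
    return 0.5 * (y @ alpha
                  + 2.0 * torch.log(torch.diagonal(L)).sum()
                  + len(y) * math.log(2.0 * math.pi))


# Toy usage: fit the deep recurrent kernel on random sequential data.
X = torch.randn(64, 10, 3)           # 64 sequences of length 10 with 3 features
y = torch.sin(X.sum(dim=(1, 2)))     # arbitrary scalar target per sequence
model = GPLSTMKernel(input_dim=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(50):
    opt.zero_grad()
    loss = neg_log_marginal_likelihood(model(X), y)
    loss.backward()
    opt.step()
```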

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62G05 Nonparametric estimation
Full Text: arXiv Link
