Automatic differentiation in machine learning: a survey.

*(English)* Zbl 06982909

Summary: Derivatives, mostly in the form of gradients and Hessians, are ubiquitous in machine learning. Automatic differentiation (AD), also called algorithmic differentiation or simply “autodiff”, is a family of techniques similar to but more general than backpropagation for efficiently and accurately evaluating derivatives of numeric functions expressed as computer programs. AD is a small but established field with applications in areas including computational fluid dynamics, atmospheric sciences, and engineering design optimization. Until very recently, the fields of machine learning and AD have largely been unaware of each other and, in some cases, have independently discovered each other’s results. Despite its relevance, general-purpose AD has been missing from the machine learning toolbox, a situation slowly changing with its ongoing adoption under the names “dynamic computational graphs” and “differentiable programming”. We survey the intersection of AD and machine learning, cover applications where AD has direct relevance, and address the main implementation techniques. By precisely defining the main differentiation techniques and their interrelationships, we aim to bring clarity to the usage of the terms “autodiff”, “automatic differentiation”, and “symbolic differentiation” as these are encountered more and more in machine learning settings.

##### MSC:

68T05 | Learning and adaptive systems in artificial intelligence |

##### Software:

AdaGrad; Adam; ADIC; ADIFOR; ADiJaC; ADiMat; ADOL-C; ADVI; AMPL; BinaryConnect; Caffe; Chainer; CNTK; ColPack; CompAD; Cosy; CppAD; cuDNN; DiffSharp; Edward; FADBAD++; FLUENT; ForwardDiff; GHC; GRESS; INTLAB; L-BFGS; L-BFGS-B; LBFGS-B; Lush; MAD; MXYZPTLK; NAGWare; OpenDR; PADRE2; Picture; ProbTorch; PROSE; PyMC; PyTorch; RMSprop; SLANG; SN; Stan; TAF; Tangent; TAPENADE; TensorFlow; Theano; Torch

\textit{A. G. Baydin} et al., J. Mach. Learn. Res. 18, Paper No. 153, 43 p. (2018; Zbl 06982909)

##### References:

[1] | Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016. |

[2] | D. S. Adamson and C. W. Winant. A SLANG simulation of an initially strong shock wave downstream of an infinite area change. In Proceedings of the Conference on Applications of Continuous-System Simulation Languages, pages 231–40, 1969. |

[3] | Naman Agarwal, Brian Bullins, and Elad Hazan. Second order stochastic optimization in linear time. Technical Report arXiv:1602.03943, arXiv preprint, 2016. · Zbl 1441.90115 |

[4] | R. K. Al Seyab and Y. Cao. Nonlinear system identification for predictive control using continuous time recurrent neural networks and automatic differentiation. Journal of Process Control, 18(6):568–581, 2008. doi: 10.1016/j.jprocont.2007.10.012. |

[5] | Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. arXiv preprint arXiv:1703.00443, 2017. |

[6] | Marianna S. Apostolopoulou, Dimitris G. Sotiropoulos, Ioannis E. Livieris, and Panagiotis Pintelas. A memoryless BFGS neural network training algorithm. In 7th IEEE International Conference on Industrial Informatics, INDIN 2009, pages 216–221, June 2009. doi: 10.1109/INDIN.2009.5195806. |

[7] | Andrew W Appel. Runtime tags aren’t necessary. Lisp and Symbolic Computation, 2(2):153–162, 1989. |

[8] | Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. |

[9] | Daniel Paul Barrett and Jeffrey Mark Siskind. Felzenszwalb-Baum-Welch: Event detection by changing appearance. arXiv preprint arXiv:1306.4746, 2013. |

[10] | Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012. |

[11] | Friedrich L. Bauer. Computational graphs and rounding error. SIAM Journal on Numerical Analysis, 11(1):87–96, 1974. · Zbl 0337.65028 |

[12] | Atılım Güneş Baydin, Barak A. Pearlmutter, and Jeffrey Mark Siskind. Diffsharp: An AD library for .NET languages. In 7th International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016, 2016a. Also arXiv:1611.03423. |

[13] | Atılım Güneş Baydin, Barak A. Pearlmutter, and Jeffrey Mark Siskind. Tricks from deep learning. In 7th International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016, 2016b. Also arXiv:1611.03777. |

[14] | Atılım Güneş Baydin, Robert Cornish, David Martínez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. In Sixth International Conference on Learning Representations (ICLR), Vancouver, Canada, April 30–May 3, 2018, 2018. |

[15] | L. M. Beda, L. N. Korolev, N. V. Sukkikh, and T. S. Frolova. Programs for automatic differentiation for the machine BESM (in Russian). Technical report, Institute for Precise Mechanics and Computation Techniques, Academy of Science, Moscow, USSR, 1959. |

[16] | Bradley M. Bell and James V. Burke. Algorithmic differentiation of implicit functions and optimal values. In C. H. Bischof, H. M. Bücker, P. Hovland, U. Naumann, and J. Utke, editors, Advances in Automatic Differentiation, volume 64 of Lecture Notes in Computational Science and Engineering, pages 67–77. Springer Berlin Heidelberg, 2008. doi: 10.1007/978-3-540-68942-3_7. · Zbl 1152.65434 |

[17] | Claus Bendtsen and Ole Stauning. FADBAD, a flexible C++ package for automatic differentiation. Technical Report IMM-REP-1996-17, Department of Mathematical Modelling, Technical University of Denmark, Lyngby, Denmark, 1996. |

[18] | Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013. |

[19] | Charles W. Bert and Moinuddin Malik. Differential quadrature method in computational mechanics: A review. Applied Mechanics Reviews, 49, 1996. doi: 10.1115/1.3101882. · Zbl 0857.73077 |

[20] | Martin Berz, Kyoko Makino, Khodr Shamseddine, Georg H. Hoffstätter, and Weishi Wan. COSY INFINITY and its applications in nonlinear dynamics. In M. Berz, C. Bischof, G. Corliss, and A. Griewank, editors, Computational Differentiation: Techniques, Applications, and Tools, pages 363–5. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1996. |

[21] | Christian Bischof, Alan Carle, George Corliss, Andreas Griewank, and Paul Hovland. ADIFOR 2.0: Automatic differentiation of Fortran 77 programs. Computational Science Engineering, IEEE, 3(3):18–32, 1996. doi: 10.1109/99.537089. |

[22] | Christian Bischof, Lucas Roh, and Andrew Mauer-Oats. ADIC: An extensible automatic differentiation tool for ANSI-C. Software Practice and Experience, 27(12):1427–56, 1997. |

[23] | Christian H. Bischof, H. Martin Bücker, and Bruno Lang. Automatic differentiation for computational finance. In E. J. Kontoghiorghes, B. Rustem, and S. Siokos, editors, Computational Methods in Decision-Making, Economics and Finance, volume 74 of Applied Optimization, pages 297–310. Springer US, 2002. doi: 10.1007/978-1-4757-3613-7_15. · Zbl 1045.65020 |

[24] | Christian H. Bischof, H. Martin Bücker, Arno Rasch, Emil Slusanschi, and Bruno Lang. Automatic differentiation of the general-purpose computational fluid dynamics package FLUENT. Journal of Fluids Engineering, 129(5):652–8, 2006. doi: 10.1115/1.2720475. |

[25] | Christian H. Bischof, Paul D. Hovland, and Boyana Norris. On the implementation of automatic differentiation tools. Higher-Order and Symbolic Computation, 21(3):311–31, 2008. doi: 10.1007/s10990-008-9034-4. · Zbl 1168.65324 |

[26] | V. G. Boltyanskii, R. V. Gamkrelidze, and L. S. Pontryagin. The theory of optimal processes I: The maximum principle. Izvest. Akad. Nauk S.S.S.R. Ser. Mat., 24:3–42, 1960. |

[27] | Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010. |

[28] | Léon Bottou and Yann LeCun. SN: A simulator for connectionist models. In Proceedings of NeuroNimes 88, pages 371–382, Nimes, France, 1988. URL http://leon.bottou.org/papers/bottou-lecun-88. |

[29] | Léon Bottou and Yann LeCun. Lush reference manual, 2002. URL http://lush.sourceforge.net/doc.html. |

[30] | Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016. · Zbl 1397.65085 |

[31] | Léon Bottou. Online learning and stochastic approximations. On-Line Learning in Neural Networks, 17:9, 1998. |

[32] | Claude Brezinski and M. Redivo Zaglia. Extrapolation Methods: Theory and Practice. North-Holland, 1991. · Zbl 0744.65004 |

[33] | A. E. Bryson and W. F. Denham. A steepest ascent method for solving optimum programming problems. Journal of Applied Mechanics, 29(2):247, 1962. doi: 10.1115/1.3640537. · Zbl 0112.20003 |

[34] | Arthur E. Bryson and Yu-Chi Ho. Applied Optimal Control: Optimization, Estimation, and Control. Blaisdell, Waltham, MA, 1969. |

[35] | Richard L. Burden and J. Douglas Faires. Numerical Analysis. Brooks/Cole, 2001. |

[36] | Luca Capriotti. Fast Greeks by algorithmic differentiation. Journal of Computational Finance, 14(3):3, 2011. · Zbl 1395.91491 |

[37] | Gregory R. Carmichael and Adrian Sandu. Sensitivity analysis for atmospheric chemistry models via automatic differentiation. Atmospheric Environment, 31(3):475–89, 1997. |

[38] | Bob Carpenter, Matthew D Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt. The Stan math library: Reverse-mode automatic differentiation in C++. arXiv preprint arXiv:1509.07164, 2015. |

[39] | Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Michael A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 20:1–37, 2016. |

[40] | Daniele Casanova, Robin S. Sharp, Mark Final, Bruce Christianson, and Pat Symonds. Application of automatic differentiation to race car performance optimisation. In George Corliss, Christèle Faure, Andreas Griewank, Laurent Hascoët, and Uwe Naumann, editors, Automatic Differentiation of Algorithms, pages 117–124. Springer-Verlag New York, Inc., New York, NY, USA, 2002. ISBN 0-387-95305-1. |

[41] | Isabelle Charpentier and Mohammed Ghemires. Efficient adjoint derivatives: Application to the meteorological model Meso-NH. Optimization Methods and Software, 13(1):35–63, 2000. · Zbl 0983.76067 |

[42] | Danqi Chen and Christopher Manning. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, 2014. |

[43] | Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014. |

[44] | Siddhartha Chib and Edward Greenberg. Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4):327–335, 1995. doi: 10.1080/00031305.1995.10476177. |

[45] | Bruce Christianson. Reverse accumulation and attractive fixed points. Optimization Methods and Software, 3(4):311–326, 1994. |

[46] | Bruce Christianson. A Leibniz notation for automatic differentiation. In Shaun Forth, Paul Hovland, Eric Phipps, Jean Utke, and Andrea Walther, editors, Recent Advances in Algorithmic Differentiation, volume 87 of Lecture Notes in Computational Science and Engineering, pages 1–9. Springer, Berlin, 2012. ISBN 978-3-540-68935-5. doi: 10.1007/978-3-642-30023-3_1. · Zbl 1251.65024 |

[47] | William K. Clifford. Preliminary sketch of bi-quaternions. Proceedings of the London Mathematical Society, 4:381–95, 1873. |

[48] | J Cohen and M Jeroen Molemaker. A fast double precision cfd code using cuda. Parallel Computational Fluid Dynamics: Recent Advances and Future Directions, pages 414–429, 2009. |

[49] | Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011. |

[50] | George F. Corliss. Application of differentiation arithmetic, volume 19 of Perspectives in Computing, pages 127–48. Academic Press, Boston, 1988. · Zbl 0659.65016 |

[51] | Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015. |

[52] | Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), pages 886–93, Washington, DC, USA, 2005. IEEE Computer Society. doi: 10.1109/CVPR.2005.177. |

[53] | Benjamin Dauvergne and Laurent Hascoët. The data-flow equations of checkpointing in reverse automatic differentiation. In V. N. Alexandrov, G. D. van Albada, P. M. A. Sloot, and J. Dongarra, editors, Computational Science – ICCS 2006, volume 3994 of Lecture Notes in Computer Science, pages 566–73, Dauvergne, 2006. Springer Berlin. · Zbl 1157.65334 |

[54] | John E. Dennis and Robert B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics, Philadelphia, 1996. · Zbl 0847.65038 |

[55] | L. C. Dixon. Use of automatic differentiation for calculating Hessians and Newton steps. In A. Griewank and G. F. Corliss, editors, Automatic Differentiation of Algorithms: Theory, Implementation, and Application, pages 114–125. SIAM, Philadelphia, PA, 1991. · Zbl 0782.65021 |

[56] | Simon Duane, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987. |

[57] | John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011. · Zbl 1280.68164 |

[58] | Ulf Ekström, Lucas Visscher, Radovan Bast, Andreas J. Thorvaldsen, and Kenneth Ruud. Arbitrary-order density functional response theory from automatic differentiation. Journal of Chemical Theory and Computation, 6:1971–80, 2010. doi: 10.1021/ct100117s. |

[59] | Jerry Eriksson, Mårten Gulliksson, Per Lindström, and Per Åke Wedin. Regularization tools for training large feed-forward neural networks using automatic differentiation. Optimization Methods and Software, 10(1):49–69, 1998. doi: 10.1080/10556789808805701. · Zbl 0913.68177 |

[60] | S. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Koray Kavukcuoglu, and Geoffrey E. Hinton. Attend, infer, repeat: Fast scene understanding with generative models. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3225–3233. Curran Associates, Inc., 2016. |

[61] | Jenny Rose Finkel, Alex Kleeman, and Christopher D. Manning. Efficient, feature-based, conditional random field parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL 2008), pages 959–67, 2008. |

[62] | Bengt Fornberg. Numerical differentiation of analytic functions. ACM Transactions on Mathematical Software, 7(4):512–26, 1981. doi: 10.1145/355972.355979. · Zbl 0465.65012 |

[63] | Shaun A. Forth. An efficient overloaded implementation of forward mode automatic differentiation in MATLAB. ACM Transactions on Mathematical Software, 32(2):195–222, 2006. · Zbl 1365.65053 |

[64] | Shaun A. Forth and Trevor P. Evans. Aerofoil optimisation via AD of a multigrid cell-vertex Euler flow solver. In George Corliss, Christèle Faure, Andreas Griewank, Laurent Hascoët, and Uwe Naumann, editors, Automatic Differentiation of Algorithms: From Simulation to Optimization, pages 153–160. Springer New York, New York, NY, 2002. ISBN 978-1-4613-0075-5. doi: 10.1007/978-1-4613-0075-5_17. |

[65] | Robert Fourer, David M. Gay, and Brian W. Kernighan. AMPL: A Modeling Language for Mathematical Programming. Duxbury Press, 2002. · Zbl 0701.90062 |

[66] | David M. Gay. Automatically finding and exploiting partially separable structure in nonlinear programming problems. Technical report, Bell Laboratories, Murray Hill, NJ, 1996. |

[67] | Assefaw H. Gebremedhin, Arijit Tarafdar, Alex Pothen, and Andrea Walther. Efficient computation of sparse Hessians using coloring and automatic differentiation. INFORMS Journal on Computing, 21(2):209–23, 2009. doi: 10.1287/ijoc.1080.0286. · Zbl 1243.65071 |

[68] | Assefaw H Gebremedhin, Duc Nguyen, Md Mostofa Ali Patwary, and Alex Pothen. ColPack: Software for graph coloring and related problems in scientific computing. ACM Transactions on Mathematical Software (TOMS), 40(1):1, 2013. · Zbl 1295.65144 |

[69] | Samuel Gershman and Noah Goodman. Amortized inference in probabilistic reasoning. In Proceedings of the Annual Meeting of the Cognitive Science Society, number 36, 2014. |

[70] | Ralf Giering and Thomas Kaminski. Recipes for adjoint code construction. ACM Transactions on Mathematical Software, 24:437–74, 1998. doi: 10.1145/293686.293695. · Zbl 0934.65027 |

[71] | Kevin Gimpel, Dipanjan Das, and Noah A. Smith. Distributed asynchronous online learning for natural language processing. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, CoNLL ’10, pages 213–222, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. |

[72] | Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011. |

[73] | Yoav Goldberg. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 57:345–420, 2016. · Zbl 1401.68264 |

[74] | Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org. · Zbl 1373.68009 |

[75] | Andrew D Gordon, Thomas A Henzinger, Aditya V Nori, and Sriram K Rajamani. Probabilistic programming. In Proceedings of the on Future of Software Engineering, pages 167–181. ACM, 2014. |

[76] | Johannes Grabmeier and Erich Kaltofen. Computer Algebra Handbook: Foundations, Applications, Systems. Springer, 2003. · Zbl 1017.68162 |

[77] | Markus Grabner, Thomas Pock, Tobias Gross, and Bernhard Kainz. Automatic differentiation for GPU-accelerated 2D/3D registration. In C. H. Bischof, H. M. Bücker, P. Hovland, U. Naumann, and J. Utke, editors, Advances in Automatic Differentiation, volume 64 of Lecture Notes in Computational Science and Engineering, pages 259–269. Springer Berlin Heidelberg, 2008. doi: 10.1007/978-3-540-68942-3_23. · Zbl 1147.92310 |

[78] | Will Grathwohl, Dami Choi, Yuhuai Wu, Geoff Roeder, and David Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123, 2017. |

[79] | Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014. |

[80] | Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016. |

[81] | Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. In Advances in Neural Information Processing Systems, pages 1828–1836, 2015. |

[82] | Andreas Griewank. On automatic differentiation. In M. Iri and K. Tanabe, editors, Mathematical Programming: Recent Developments and Applications, pages 83–108. Kluwer Academic Publishers, 1989. · Zbl 0696.65015 |

[83] | Andreas Griewank. A mathematical view of automatic differentiation. Acta Numerica, 12:321–98, 2003. doi: 10.1017/S0962492902000132. · Zbl 1047.65012 |

[84] | Andreas Griewank. Who invented the reverse mode of differentiation? Documenta Mathematica, Extra Volume ISMP:389–400, 2012. · Zbl 1293.65035 |

[85] | Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia, 2008. doi: 10.1137/1.9780898717761. · Zbl 1159.65026 |

[86] | Andreas Griewank, Kshitij Kulshreshtha, and Andrea Walther. On the numerical stability of algorithmic differentiation. Computing, 94(2-4):125–149, 2012. · Zbl 1238.65013 |

[87] | Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. Memory-efficient backpropagation through time. In Advances in Neural Information Processing Systems, pages 4125–4133, 2016. |

[88] | Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1737–1746, 2015. |

[89] | Gundolf Haase, Ulrich Langer, Ewald Lindner, and Wolfram Mühlhuber. Optimal sizing of industrial structural mechanics problems using AD. In Automatic Differentiation of Algorithms, pages 181–188. Springer, 2002. |

[90] | Stefan Hadjis, Firas Abuzaid, Ce Zhang, and Christopher Ré. Caffe con troll: Shallow ideas to speed up deep learning. In Proceedings of the Fourth Workshop on Data analytics in the Cloud, page 2. ACM, 2015. |

[91] | William Rowan Hamilton. Theory of conjugate functions, or algebraic couples; with a preliminary and elementary essay on algebra as the science of pure time. Transactions of the Royal Irish Academy, 17:293–422, 1837. |

[92] | Laurent Hascoët and Valérie Pascual. The Tapenade automatic differentiation tool: principles, model, and specification. ACM Transactions on Mathematical Software, 39(3), 2013. doi: 10.1145/2450153.2450158. · Zbl 1295.65026 |

[93] | Robert Hecht-Nielsen. Theory of the backpropagation neural network. In International Joint Conference on Neural Networks, IJCNN 1989, pages 593–605. IEEE, 1989. |

[94] | Ruth L. Hinkins. Parallel computation of automatic differentiation applied to magnetic field calculations. Technical report, Lawrence Berkeley Lab., CA, 1994. |

[95] | Geoffrey E. Hinton and Zoubin Ghahramani. Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 352(1358):1177–1190, 1997. |

[96] | Matthew D. Hoffman and Andrew Gelman. The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1351–1381, 2014. · Zbl 1319.60150 |

[97] | Berthold K. P. Horn. Understanding image intensities. Artificial Intelligence, 8:201–231, 1977. · Zbl 0359.68118 |

[98] | Jim E. Horwedel, Brian A. Worley, E. M. Oblow, and F. G. Pin. GRESS version 1.0 user’s manual. Technical Memorandum ORNL/TM 10835, Martin Marietta Energy Systems, Inc., Oak Ridge National Laboratory, Oak Ridge, 1988. |

[99] | Max E. Jerrell. Automatic differentiation and interval arithmetic for estimation of disequilibrium models. Computational Economics, 10(3):295–316, 1997. · Zbl 0892.90042 |

[100] | Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014. |

[101] | Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pages 2946–2954, 2016. |

[102] | Neil D Jones, Carsten K Gomard, and Peter Sestoft. Partial evaluation and automatic program generation. Peter Sestoft, 1993a. · Zbl 0875.68290 |

[103] | Simon L Peyton Jones and John Launchbury. Unboxed values as first class citizens in a non-strict functional language. In Conference on Functional Programming Languages and Computer Architecture, pages 636–666. Springer, 1991. |

[104] | SL Peyton Jones, Cordy Hall, Kevin Hammond, Will Partain, and Philip Wadler. The Glasgow Haskell compiler: a technical overview. In Proc. UK Joint Framework for Information Technology (JFIT) Technical Conference, volume 93, 1993b. |

[105] | Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In Advances in Neural Information Processing Systems, pages 190–198, 2015. |

[106] | David W. Juedes. A taxonomy of automatic differentiation tools. In A. Griewank and G. F. Corliss, editors, Automatic Differentiation of Algorithms: Theory, Implementation, and Application, pages 315–29. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1991. · Zbl 0782.65029 |

[107] | D. Kingma and J. Ba. Adam: A method for stochastic optimization. In The International Conference on Learning Representations (ICLR), San Diego, 2015. |

[108] | Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014. |

[109] | Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012. |

[110] | K. Kubo and M. Iri. PADRE2, version 1—user’s manual. Research Memorandum RMI 90-01, Department of Mathematical Engineering and Information Physics, University of Tokyo, Tokyo, 1990. |

[111] | Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1–45, 2017. · Zbl 1437.62109 |

[112] | Tejas D. Kulkarni, Pushmeet Kohli, Joshua B. Tenenbaum, and Vikash Mansinghka. Picture: A probabilistic programming language for scene perception. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. |

[113] | Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1378–1387, New York, New York, USA, 20–22 Jun 2016. PMLR. |

[114] | C. L. Lawson. Computing derivatives using W-arithmetic and U-arithmetic. Internal Computing Memorandum CM-286, Jet Propulsion Laboratory, Pasadena, CA, 1971. |

[115] | Tuan Anh Le, Atılım Güneş Baydin, and Frank Wood. Inference compilation and universal probabilistic programming. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 of Proceedings of Machine Learning Research, pages 1338–1348, Fort Lauderdale, FL, USA, 2017. PMLR. |

[116] | Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. |

[117] | Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015. |

[118] | G. W. Leibniz. Machina arithmetica in qua non additio tantum et subtractio sed et multiplicatio nullo, diviso vero paene nullo animi labore peragantur. Hannover, 1685. |

[119] | Xavier Leroy. The effectiveness of type-based unboxing. In TIC 1997: Workshop Types in Compilation, 1997. |

[120] | Seppo Linnainmaa. The representation of the cumulative rounding error of an algorithm as a taylor expansion of the local rounding errors. Master’s thesis, University of Helsinki, 1970. |

[121] | Seppo Linnainmaa. Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2):146–160, 1976. · Zbl 0332.65024 |

[122] | Matthew M. Loper and Michael J. Black. OpenDR: An approximate differentiable renderer. In European Conference on Computer Vision, pages 154–169. Springer, 2014. |

[123] | Dougal Maclaurin. Modeling, Inference and Optimization with Composable Differentiable Procedures. PhD thesis, School of Engineering and Applied Sciences, Harvard University, 2016. |

[124] | Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015. |

[125] | Oleksandr Manzyuk, Barak A. Pearlmutter, Alexey Andreyevich Radul, David R Rush, and Jeffrey Mark Siskind. Confusion of tagged perturbations in forward automatic differentiation of higher-order functions. arXiv preprint arXiv:1211.4892, 2012. |

[126] | David Q. Mayne and David H. Jacobson. Differential Dynamic Programming. American Elsevier Pub. Co., New York, 1970. · Zbl 0223.49022 |

[127] | Vladimir Mazourik. Integration of automatic differentiation into a numerical library for PC’s. In A. Griewank and G. F. Corliss, editors, Automatic Differentiation of Algorithms: Theory, Implementation, and Application, pages 315–29. Society for Industrial and Applied Mathematics, Philadelphia, PA, 1991. · Zbl 0782.65031 |

[128] | Renate Meyer, David A. Fournier, and Andreas Berg. Stochastic volatility: Bayesian computation using automatic differentiation and the extended Kalman filter. Econometrics Journal, 6(2):408–420, 2003. doi: 10.1111/1368-423X.t01-1-00116. · Zbl 1065.91533 |

[129] | L. Michelotti. MXYZPTLK: A practical, user-friendly C++ implementation of differential algebra: User’s guide. Technical Memorandum FN-535, Fermi National Accelerator Laboratory, Batavia, IL, 1990. |

[130] | Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010. |

[131] | J. D. Müller and P. Cusdin. On the performance of discrete adjoint CFD codes using automatic differentiation. International Journal for Numerical Methods in Fluids, 47(8-9):939–945, 2005. ISSN 1097-0363. doi: 10.1002/fld.885. · Zbl 1134.76431 |

[132] | Uwe Naumann. Optimal accumulation of Jacobian matrices by elimination methods on the dual computational graph. Mathematical Programming, 99(3):399–421, 2004. · Zbl 1084.68144 |

[133] | Uwe Naumann and Jan Riehme. Computing adjoints with the NAGWare Fortran 95 compiler. In H. M. Bücker, G. Corliss, P. Hovland, U. Naumann, and B. Norris, editors, Automatic Differentiation: Applications, Theory, and Implementations, Lecture Notes in Computational Science and Engineering, pages 159–69. Springer, 2005. · Zbl 1270.65090 |

[134] | Radford M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, 1993. |

[135] | Richard D. Neidinger. Automatic differentiation and APL. College Mathematics Journal, 20(3):238–51, 1989. doi: 10.2307/2686776. |

[136] | John F. Nolan. Analytical differentiation on a digital computer. Master’s thesis, Massachusetts Institute of Technology, 1953. |

[137] | J. F. Ostiguy and L. Michelotti. Mxyzptlk: An efficient, native C++ differentiation engine. In Particle Accelerator Conference (PAC 2007), pages 3489–91. IEEE, 2007. doi: 10.1109/PAC.2007.4440468. |

[138] | David B. Parker. Learning-logic: Casting the cortex of the human brain in silicon. Technical Report TR-47, Center for Computational Research in Economics and Management Science, MIT, 1985. |

[139] | Valérie Pascual and Laurent Hascoët. TAPENADE for C. In Advances in Automatic Differentiation, Lecture Notes in Computational Science and Engineering, pages 199–210. Springer, 2008. doi: 10.1007/978-3-540-68942-3_18. |

[140] | Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, Long Beach, CA, US, December 9, 2017, 2017. |

[141] | Barak A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6:147–60, 1994. doi: 10.1162/neco.1994.6.1.147. |

[142] | Barak A. Pearlmutter and Jeffrey Mark Siskind. Reverse-mode AD in a functional framework: Lambda the ultimate backpropagator. ACM Transactions on Programming Languages and Systems (TOPLAS), 30(2):1–36, March 2008. doi: 10.1145/1330017.1330018. · Zbl 1175.68104 |

[143] | Ding-Yu Peng and Donald B. Robinson. A new two-constant equation of state. Industrial and Engineering Chemistry Fundamentals, 15(1):59–64, 1976. doi: 10.1021/i160057a011. |

[144] | John Peterson. Untagged data in tagged environments: Choosing optimal representations at compile time. In Proceedings of the Fourth International Conference on Functional Programming Languages and Computer Architecture, pages 89–99. ACM, 1989. |

[145] | F. W. Pfeiffer. Automatic differentiation in PROSE. SIGNUM Newsletter, 22(1):2–8, 1987. doi: 10.1145/24680.24681. |

[146] | Thomas Pock, Michael Pock, and Horst Bischof. Algorithmic differentiation: Application to variational problems in computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7):1180–1193, 2007. doi: 10.1109/TPAMI.2007.1044. |

[147] | William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 2007. · Zbl 1132.65001 |

[148] | Louise B. Rall. Perspectives on automatic differentiation: Past, present, and future? In M. Bücker, G. Corliss, U. Naumann, P. Hovland, and B. Norris, editors, Automatic Differentiation: Applications, Theory, and Implementations, volume 50 of Lecture Notes in Computational Science and Engineering, pages 1–14. Springer Berlin Heidelberg, 2006. |

[149] | Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. MIT Press, 2006. · Zbl 1177.68165 |

[150] | J. Revels, M. Lubin, and T. Papamarkou. Forward-mode automatic differentiation in Julia. arXiv:1607.07892 [cs.MS], 2016a. URL https://arxiv.org/abs/1607.07892. |

[151] | Jarrett Revels, Miles Lubin, and Theodore Papamarkou. Forward-mode automatic differentiation in Julia. arXiv preprint arXiv:1607.07892, 2016b. |

[152] | Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014. |

[153] | Lawrence C. Rich and David R. Hill. Automatic differentiation in MATLAB. Applied Numerical Mathematics, 9:33–43, 1992. · Zbl 0753.65017 |

[154] | Daniel Ritchie, Paul Horsfall, and Noah D Goodman. Deep amortized inference for probabilistic programs. arXiv preprint arXiv:1610.05735, 2016. |

[155] | Elizabeth Rollins. Optimization of neural network feedback control systems using automatic differentiation. Master’s thesis, Department of Aeronautics and Astronautics, Massachusetts Institute of Technology, 2009. |

[156] | L. I. Rozonoer. L. S. Pontryagin’s maximum principle in the theory of optimum systems— Part II. Automat. i Telemekh., 20:1441–1458, 1959. |

[157] | David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986. · Zbl 1369.68284 |

[158] | Siegfried M. Rump. INTLAB—INTerval LABoratory. In Developments in Reliable Computing, pages 77–104. Kluwer Academic Publishers, Dordrecht, 1999. doi: 10.1007/978-94-017-1247-7_7. · Zbl 0949.65046 |

[159] | Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1218–1226, 2015. |

[160] | John Salvatier, Thomas V Wiecki, and Christopher Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55, 2016. |

[161] | Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. In International Conference on Machine Learning, pages 343–351, 2013. |

[162] | Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015. |

[163] | Nicol N. Schraudolph. Local gain adaptation in stochastic gradient descent. In Proceedings of the International Conference on Artificial Neural Networks, pages 569–74, Edinburgh, Scotland, 1999. IEE London. doi: 10.1049/cp:19991170. |

[164] | Nicol N. Schraudolph and Thore Graepel. Combining conjugate direction methods with stochastic approximation of gradients. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003. · Zbl 1013.68699 |

[165] | Frank Seide and Amit Agarwal. CNTK: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 2135–2135, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2945397. |

[166] | Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations 2017, 2017. |

[167] | Olin Shivers. Control-flow analysis of higher-order languages. PhD thesis, Carnegie Mellon University, 1991. · Zbl 1302.68072 |

[168] | Alex Shtof, Alexander Agathos, Yotam Gingold, Ariel Shamir, and Daniel Cohen-Or. Geosemantic snapping for sketch-based modeling. Computer Graphics Forum, 32(2):245–53, 2013. doi: 10.1111/cgf.12044. |

[169] | N. Siddharth, Brooks Paige, Jan-Willem van de Meent, Alban Desmaison, Noah D. Goodman, Pushmeet Kohli, Frank Wood, and Philip Torr. Learning disentangled representations with semi-supervised deep generative models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5927–5937. Curran Associates, Inc., 2017. |

[170] | Patrice Simard, Yann LeCun, John Denker, and Bernard Victorri. Transformation invariance in pattern recognition, tangent distance and tangent propagation. In G. Orr and K. Muller, editors, Neural Networks: Tricks of the Trade. Springer, 1998. |

[171] | Z. Sirkes and E. Tziperman. Finite difference of adjoint or adjoint of finite difference? Monthly Weather Review, 125(12):3373–8, 1997. doi: 10.1175/1520-0493(1997)125<3373:FDOAOA>2.0.CO;2. |

[172] | Jeffrey Mark Siskind and Barak A. Pearlmutter. Perturbation confusion and referential transparency: Correct functional implementation of forward-mode AD. In Andrew Butterfield, editor, Implementation and Application of Functional Languages—17th International Workshop, IFL’05, pages 1–9, Dublin, Ireland, 2005. Trinity College Dublin Computer Science Department Technical Report TCD-CS-2005-60. |

[173] | Jeffrey Mark Siskind and Barak A. Pearlmutter. Using polyvariant union-free flow analysis to compile a higher-order functional-programming language with a first-class derivative operator to efficient Fortran-like code. Technical Report TR-ECE-08-01, School of Electrical and Computer Engineering, Purdue University, 2008a. · Zbl 1156.68335 |

[174] | Jeffrey Mark Siskind and Barak A. Pearlmutter. Nesting forward-mode AD in a functional framework. Higher-Order and Symbolic Computation, 21(4):361–376, 2008b. · Zbl 1175.68104 |

[175] | Jeffrey Mark Siskind and Barak A. Pearlmutter. Efficient implementation of a higher-order language with built-in AD. In 7th International Conference on Algorithmic Differentiation, Christ Church Oxford, UK, September 12–15, 2016, 2016. Also arXiv:1611.03416. · Zbl 1295.65028 |

[176] | Jeffrey Mark Siskind and Barak A. Pearlmutter. Divide-and-conquer checkpointing for arbitrary programs with no user annotation. In NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, Long Beach, CA, US, December 9, 2017, 2017. Also arXiv:1708.06799. · Zbl 06949134 |

[177] | Emil I. Slusanschi and Vlad Dumitrel. ADiJaC—Automatic differentiation of Java classfiles. ACM Transactions on Mathematical Software, 43(2):9:1–9:33, September 2016. ISSN 0098-3500. doi: 10.1145/2904901. · Zbl 1391.65045 |

[178] | Bert Speelpenning. Compiling Fast Partial Derivatives of Functions Given by Algorithms. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1980. |

[179] | Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright. Optimization for Machine Learning. MIT Press, 2011. |

[180] | Filip Srajer, Zuzana Kukelova, and Andrew Fitzgibbon. A benchmark of selected algorithmic differentiation tools on some problems in machine learning and computer vision. In AD2016: The 7th International Conference on Algorithmic Differentiation, Monday 12th–Thursday 15th September 2016, Christ Church Oxford, UK: Programme and Abstracts, pages 181–184. Society for Industrial and Applied Mathematics (SIAM), 2016. · Zbl 1453.65050 |

[181] | Akshay Srinivasan and Emanuel Todorov. Graphical Newton. Technical Report arXiv:1508.00952, arXiv preprint, 2015. |

[182] | Andreas Stuhlmüller, Jacob Taylor, and Noah Goodman. Learning stochastic inverses. In Advances in Neural Information Processing Systems, pages 3048–3056, 2013. |

[183] | Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017. |

[184] | Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448, 2015. |

[185] | Gerald J. Sussman and Jack Wisdom. Structure and Interpretation of Classical Mechanics. MIT Press, 2001. doi: 10.1063/1.1457268. · Zbl 0983.70001 |

[186] | Jonathan Taylor, Richard Stebbing, Varun Ramakrishna, Cem Keskin, Jamie Shotton, Shahram Izadi, Aaron Hertzmann, and Andrew Fitzgibbon. User-specific hand modeling from monocular depth sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 644–651, 2014. |

[187] | Jeffrey P. Thomas, Earl H. Dowell, and Kenneth C. Hall. Using automatic differentiation to create a nonlinear reduced order model of a computational fluid dynamic solver. AIAA Paper, 7115:2006, 2006. |

[188] | T. Tieleman and G. Hinton. Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 2012. |

[189] | Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015. |

[190] | Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016. |

[191] | Dustin Tran, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei. Deep probabilistic programming. In International Conference on Learning Representations, 2017. |

[192] | Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew W. Fitzgibbon. Bundle adjustment—a modern synthesis. In International Workshop on Vision Algorithms, pages 298–372. Springer, 1999. |

[193] | George Tucker, Andriy Mnih, Chris J. Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2624–2633, 2017. |

[194] | Bart van Merriënboer, Alexander B. Wiltschko, and Dan Moldovan. Tangent: Automatic differentiation using source code transformation in Python. arXiv preprint arXiv:1711.02712, 2017. |

[195] | Arun Verma. An introduction to automatic differentiation. Current Science, 78(7):804–7, 2000. |

[196] | S. V. N. Vishwanathan, Nicol N. Schraudolph, Mark W. Schmidt, and Kevin P. Murphy. Accelerated training of conditional random fields with stochastic gradient methods. In Proceedings of the 23rd International Conference on Machine Learning (ICML ’06), pages 969–76, 2006. doi: 10.1145/1143844.1143966. |

[197] | Andrea Walther. Automatic differentiation of explicit Runge-Kutta methods for optimal control. Computational Optimization and Applications, 36(1):83–108, 2007. doi: 10.1007/s10589-006-0397-3. · Zbl 1278.49037 |

[198] | Andrea Walther and Andreas Griewank. Getting started with ADOL-C. In U. Naumann and O. Schenk, editors, Combinatorial Scientific Computing, chapter 7, pages 181–202. Chapman-Hall CRC Computational Science, 2012. doi: 10.1201/b11644-8. |

[199] | Robert E. Wengert. A simple automatic derivative evaluation program. Communications of the ACM, 7:463–4, 1964. · Zbl 0131.34602 |

[200] | Paul J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974. |

[201] | Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992. · Zbl 0772.68076 |

[202] | J. Willkomm and A. Vehreschild. The ADiMat handbook, 2013. URL http://adimat.sc.informatik.tu-darmstadt.de/doc/. |

[203] | David Wingate, Noah Goodman, Andreas Stuhlmüller, and Jeffrey Mark Siskind. Nonstandard interpretations of probabilistic programs for efficient inference. Advances in Neural Information Processing Systems, 23, 2011. |

[204] | Weiwei Yang, Yong Zhao, Li Yan, and Xiaoqian Chen. Application of PID controller based on BP neural network using automatic differentiation method. In F. Sun, J. Zhang, Y. Tan, J. Cao, and W. Yu, editors, Advances in Neural Networks—ISNN 2008, volume 5264 of Lecture Notes in Computer Science, pages 702–711. Springer Berlin Heidelberg, 2008. doi: 10.1007/978-3-540-87734-9_80. |

[205] | Ilker Yildirim, Tejas D. Kulkarni, Winrich A. Freiwald, and Joshua B. Tenenbaum. Efficient and robust analysis-by-synthesis in vision: A computational framework, behavioral tests, and modeling neuronal representations. In Annual Conference of the Cognitive Science Society, 2015. |

[206] | Haonan Yu and Jeffrey Mark Siskind. Grounded language learning from video described with sentences. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 53–63, Sofia, Bulgaria, 2013. Association for Computational Linguistics. |

[207] | Wojciech Zaremba, Tomas Mikolov, Armand Joulin, and Rob Fergus. Learning simple algorithms from examples. In International Conference on Machine Learning, pages 421–429, 2016. |

[208] | Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–60, 1997. · Zbl 0912.65057 |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.