Expected policy gradients for reinforcement learning. (English) Zbl 07255083
Summary: We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates (or sums) across actions when estimating the gradient, instead of relying only on the action in the sampled trajectory. For continuous action spaces, we first derive a practical result for Gaussian policies and quadratic critics and then extend it to a universal analytical method, covering a broad class of actors and critics, including Gaussian, exponential families, and policies with bounded support. For Gaussian policies, we introduce an exploration method that uses covariance proportional to the matrix exponential of the scaled Hessian of the critic with respect to the actions. For discrete action spaces, we derive a variant of EPG based on softmax policies. We also establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. Furthermore, we prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and with little computational overhead. Finally, we provide an extensive experimental evaluation of EPG and show that it outperforms existing approaches on multiple challenging control domains.
MSC:
68T05 Learning and adaptive systems in artificial intelligence