Autonomous reinforcement learning with experience replay. (English) Zbl 1296.68151

Summary: This paper considers the issues of efficiency and autonomy that are required to make reinforcement learning suitable for real-life control tasks. A real-time reinforcement learning algorithm is presented that repeatedly adjusts the control policy using previously collected samples and autonomously estimates appropriate step-sizes for the learning updates. The algorithm is based on actor-critic with experience replay, with step-sizes determined on-line by an enhanced fixed-point algorithm for on-line neural network training. An experimental study with a simulated octopus arm and a half-cheetah demonstrates that the proposed algorithm can solve difficult learning control problems autonomously within a reasonably short time.
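The core mechanism described above, an actor-critic that replays stored transitions rather than discarding them after a single update, can be illustrated with a minimal sketch. This is not the paper's algorithm: the adaptive step-size estimation (the fixed-point method of [36]) is replaced by constant learning rates, the toy 1-D environment and all names are invented for illustration, and linear features stand in for neural networks.

```python
import numpy as np

# Toy task (illustrative, not from the paper): state s in [-1, 1],
# action a nudges the state, reward = -s^2, so the optimum is s = 0.
rng = np.random.default_rng(0)

def features(s):
    # Polynomial features for linear function approximation.
    return np.array([1.0, s, s * s])

theta = np.zeros(3)   # actor parameters: mean action mu(s) = theta . features(s)
w = np.zeros(3)       # critic parameters: value estimate V(s) = w . features(s)
sigma = 0.3           # fixed Gaussian exploration noise
alpha_w, alpha_theta, gamma = 0.05, 0.01, 0.95  # constant rates (the paper
                                                # adapts these on-line instead)
buffer = []           # experience replay buffer of (s, a, r, s') tuples

s = 0.8
for step in range(3000):
    mu = float(theta @ features(s))
    a = mu + sigma * rng.standard_normal()
    s_next = float(np.clip(s + 0.5 * a, -1.0, 1.0))
    r = -s_next ** 2
    buffer.append((s, a, r, s_next))

    # Replay a small batch of past transitions at every control step.
    idx = rng.choice(len(buffer), size=min(8, len(buffer)), replace=False)
    for i in idx:
        bs, ba, br, bs2 = buffer[i]
        phi, phi2 = features(bs), features(bs2)
        delta = br + gamma * (w @ phi2) - (w @ phi)   # TD error
        w += alpha_w * delta * phi                    # critic update
        mu_b = float(theta @ phi)
        # Actor update: TD error times grad of log pi for a Gaussian policy.
        theta += alpha_theta * delta * (ba - mu_b) / sigma ** 2 * phi

    # Occasionally reset near-goal states to keep exploring.
    s = s_next if abs(s_next) > 0.05 else float(rng.uniform(-1, 1))
```

After training, the learned mean action at a positive state should be negative, i.e. the policy pushes the state toward the reward peak at zero. The replay loop is what distinguishes this from a plain one-sample actor-critic: each real transition is reused many times, which is the sample-efficiency argument the paper builds on.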


68T05 Learning and adaptive systems in artificial intelligence
68T40 Artificial intelligence for robotics
Full Text: DOI


[1] Abbeel, P.; Ng, A. Y., Exploration and apprenticeship learning in reinforcement learning, (Proc. of the 22nd ICML (2005), ACM), 1-8
[2] Adam, S.; Busoniu, L.; Babuska, R., Experience replay for real-time reinforcement learning control, IEEE Transactions on Systems, Man, and Cybernetics, Part C, 42, 2, 201-212 (2012)
[3] Barto, A. G.; Sutton, R. S.; Anderson, C. W., Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics, 13, 834-846 (1983)
[4] Behera, L.; Kumar, S.; Patnaik, A., On adaptive learning rate that guarantees convergence in feedforward networks, IEEE Transactions on Neural Networks, 17, 5, 1116-1125 (2006)
[5] Bhatnagar, S.; Sutton, R.; Ghavamzadeh, M.; Lee, M., Natural actor-critic algorithms, Automatica, 45, 2471-2482 (2009) · Zbl 1183.93130
[6] Chiaverini, S.; Oriolo, G.; Walker, I. D., Kinematically redundant manipulators, (Springer handbook of robotics (2008)), 245-268
[7] Cichosz, P., An analysis of experience replay in temporal difference learning, Cybernetics and Systems, 30, 341-363 (1999) · Zbl 1005.68131
[9] George, A. P.; Powell, W. B., Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming, Machine Learning, 65, 1, 167-198 (2006) · Zbl 1475.90122
[10] Hachiya, H.; Peters, J.; Sugiyama, M., Reward-weighted regression with sample reuse for direct policy search in reinforcement learning, Neural Computation, 23, 11, 2798-2832 (2011) · Zbl 1237.68147
[11] Jacobs, R. A., Increased rates of convergence through learning rate adaptation, Neural Networks, 1, 4, 295-308 (1988)
[12] Kathirvalavakumar, T.; Subavathi, S. J., Neighborhood based modified backpropagation algorithm using adaptive learning parameters for training feedforward neural networks, Neurocomputing, 72, 3915-3921 (2009)
[14] Kober, J.; Peters, J., Policy search for motor primitives in robotics, Machine Learning, 84, 1-2, 171-203 (2011) · Zbl 1237.68229
[15] Konda, V.; Tsitsiklis, J., Actor-critic algorithms, SIAM Journal on Control and Optimization, 42, 4, 1143-1166 (2003) · Zbl 1049.93095
[18] Kushner, H. J.; Yin, G., Stochastic approximation algorithms and applications (1997), Springer-Verlag · Zbl 0914.60006
[20] Noda, I., Recursive adaptation of stepsize parameter for non-stationary environments, (Principles of practice in multi-agent systems (2009)), 525-533
[23] Peters, J.; Vijayakumar, S.; Schaal, S., Natural actor-critic, (Proc. of ECML (2005), Springer-Verlag: Springer-Verlag Berlin Heidelberg), 280-291
[24] Rubinstein, R. Y., Simulation and the Monte Carlo method (1981), John Wiley & Sons, Inc. · Zbl 0529.68076
[25] Rumelhart, D. E.; Hinton, G. E.; Williams, R. J., Learning representations by back-propagating errors, (Neurocomputing: foundations of research (1988), MIT Press), 696-699
[29] Sutton, R. S., Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, (Proc. of the 7th ICML (1990), Morgan Kaufmann), 216-224
[32] Sutton, R. S.; Barto, A. G., Reinforcement learning: an introduction (1998), MIT Press
[33] Sutton, R. S.; McAllester, D.; Singh, S.; Mansour, Y., Policy gradient methods for reinforcement learning with function approximation, (Advances in NIPS, vol. 12 (2000), MIT Press), 1057-1063
[34] Wawrzyński, P., Real-time reinforcement learning by sequential actor-critics and experience replay, Neural Networks, 22, 1484-1497 (2009) · Zbl 1396.68107
[36] Wawrzyński, P.; Papis, B., Fixed point method for autonomous on-line neural network training, Neurocomputing, 74, 2893-2905 (2011)
[37] Williams, R. J., Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning, 8, 229-256 (1992) · Zbl 0772.68076
[38] Woolley, B. G.; Stanley, K. O., Evolving a single scalable controller for an octopus arm with a variable number of segments, (Proceedings of the 11th international conference on parallel problem solving from nature, PPSN-2010 (2010), Springer)
[39] Yekutieli, Y.; Sagiv-Zohar, R.; Aharonov, R.; Engel, Y.; Hochner, B.; Flash, T., Dynamic model of the octopus arm. I. Biomechanics of the octopus reaching movement, Journal of Neurophysiology, 94, 2, 1443-1458 (2005)