Dulac-Arnold, Gabriel; Levine, Nir; Mankowitz, Daniel J.; Li, Jerry; Paduraru, Cosmin; Gowal, Sven; Hester, Todd
Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. (English) Zbl 07465677
Mach. Learn. 110, No. 9, 2419-2468 (2021).

Summary: Reinforcement learning (RL) has proven its worth in a series of artificial domains and is beginning to show some successes in real-world scenarios. However, many of the research advances in RL are hard to leverage in real-world systems because they rely on assumptions that are rarely satisfied in practice. In this work, we identify and formalize a series of independent challenges that embody the difficulties that must be addressed for RL to be commonly deployed in real-world systems. For each challenge, we define it formally in the context of a Markov Decision Process, analyze its effects on state-of-the-art learning algorithms, and present existing attempts at tackling it. We believe that an approach addressing our set of proposed challenges would be readily deployable in a large number of real-world problems. Our proposed challenges are implemented in a suite of continuous control environments called realworldrl-suite, which we propose as an open-source benchmark.

Cited in 4 documents.

MSC: 68T05 Learning and adaptive systems in artificial intelligence
Keywords: reinforcement learning; real-world; applied reinforcement learning
Software: Safety Gym; IMPALA; Horizon; QT-Opt; TEXPLORE; POMDP

Cite: G. Dulac-Arnold et al., Mach. Learn. 110, No. 9, 2419-2468 (2021; Zbl 07465677)
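As a concrete illustration of how such challenges can be layered onto a standard continuous-control MDP, the sketch below wraps a generic environment with observation noise and action delay, two of the challenge categories the paper formalizes. All class names, parameters, and the reset/step interface here are assumptions made for illustration only; this is a minimal sketch and not the realworldrl-suite API.

```python
# Illustrative sketch only: minimal wrappers emulating two real-world
# challenge categories (sensor noise and system delay) on top of a generic
# environment exposing reset() and step(action) -> (obs, reward, done).
# The interface and names are assumptions, not the realworldrl-suite API.
import collections
import numpy as np


class NoisyObservationWrapper:
    """Adds Gaussian noise to observations (sensor-noise challenge)."""

    def __init__(self, env, noise_std=0.1, seed=0):
        self.env = env
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def reset(self):
        return self._noisy(self.env.reset())

    def step(self, action):
        obs, reward, done = self.env.step(action)
        return self._noisy(obs), reward, done

    def _noisy(self, obs):
        # Perturb every observation component with i.i.d. Gaussian noise.
        return obs + self.rng.normal(0.0, self.noise_std, size=np.shape(obs))


class DelayedActionWrapper:
    """Applies each action only after a fixed delay (system-delay challenge)."""

    def __init__(self, env, delay=2, default_action=None):
        self.env = env
        self.delay = delay
        self.default_action = default_action
        self.queue = collections.deque()

    def reset(self):
        # Pre-fill the queue so the first `delay` steps execute a default action.
        self.queue = collections.deque([self.default_action] * self.delay)
        return self.env.reset()

    def step(self, action):
        self.queue.append(action)
        return self.env.step(self.queue.popleft())
```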