Deep reinforcement learning with temporal logics. (English) Zbl 1455.68190

Bertrand, Nathalie (ed.) et al., Formal modeling and analysis of timed systems. 18th international conference, FORMATS 2020, Vienna, Austria, September 1–3, 2020. Proceedings. Cham: Springer. Lect. Notes Comput. Sci. 12288, 1-22 (2020).
Summary: The combination of data-driven learning methods with formal reasoning has seen a surge of interest, as either area has the potential to bolstering the other. For instance, formal methods promise to expand the use of state-of-the-art learning approaches in the direction of certification and sample efficiency. In this work, we propose a deep Reinforcement Learning (RL) method for policy synthesis in continuous-state/action unknown environments, under requirements expressed in Linear Temporal Logic (LTL). We show that this combination lifts the applicability of deep RL to complex temporal and memory-dependent policy synthesis goals. We express an LTL specification as a Limit Deterministic Büchi Automaton (LDBA) and synchronise it on-the-fly with the agent/environment. The LDBA in practice monitors the environment, acting as a modular reward machine for the agent: accordingly, a modular Deep Deterministic Policy Gradient (DDPG) architecture is proposed to generate a low-level control policy that maximises the probability of the given LTL formula. We evaluate our framework in a cart-pole example and in a Mars rover experiment, where we achieve near-perfect success rates, while baselines based on standard RL are shown to fail in practice.
For the entire collection see [Zbl 1455.68021].


68T07 Artificial neural networks and deep learning
03B44 Temporal logic
03D05 Automata and formal grammars in connection with logical questions
68T27 Logic in artificial intelligence
Full Text: DOI


[1] Abate, A.; Katoen, JP; Lygeros, J.; Prandini, M., Approximate model checking of stochastic hybrid systems, Eur. J. Control, 16, 6, 624-641 (2010) · Zbl 1216.93091
[2] Andreas, J., Klein, D., Levine, S.: Modular multitask reinforcement learning with policy sketches. In: ICML, pp. 166-175 (2017)
[3] Baier, C.; Katoen, JP, Principles of Model Checking (2008), Cambridge: MIT Press, Cambridge · Zbl 1179.68076
[4] Belzner, L., Wirsing, M.: Synthesizing safe policies under probabilistic constraints with reinforcement learning and Bayesian model checking. arXiv preprint arXiv:2005.03898 (2020)
[5] Bertsekas, DP; Shreve, S., Stochastic Optimal Control: The Discrete-Time Case (2004), USA: Athena Scientific, USA · Zbl 0471.93002
[6] Bertsekas, DP; Tsitsiklis, JN, Neuro-Dynamic Programming (1996), USA: Athena Scientific, USA · Zbl 0924.68163
[7] Bozkurt, A.K., Wang, Y., Zavlanos, M.M., Pajic, M.: Control synthesis from linear temporal logic specifications using model-free reinforcement learning. arXiv preprint arXiv:1909.07299 (2019)
[8] Brázdil, T.; Cassez, F.; Raskin, J-F, Verification of Markov decision processes using learning algorithms, Automated Technology for Verification and Analysis, 98-114 (2014), Cham: Springer, Cham · Zbl 1448.68290
[9] Daniel, C., Neumann, G., Peters, J.: Hierarchical relative entropy policy search. In: Artificial Intelligence and Statistics, pp. 273-281 (2012)
[10] De Giacomo, G., Favorito, M., Iocchi, L., Patrizi, F.: Imitation learning over heterogeneous agents with restraining bolts. In: Proceedings of the International Conference on Automated Planning and Scheduling, vol. 30, pp. 517-521 (2020)
[11] De Giacomo, G., Iocchi, L., Favorito, M., Patrizi, F.: Foundations for restraining bolts: reinforcement learning with LTLf/LDLf restraining specifications. In: ICAPS, vol. 29, pp. 128-136 (2019)
[12] Fu, J., Topcu, U.: Probably approximately correct MDP learning and control with temporal logic constraints. In: Robotics: Science and Systems (2014)
[13] Fulton, N.: Verifiably safe autonomy for cyber-physical systems. Ph.D. thesis, Carnegie Mellon University Pittsburgh, PA (2018)
[14] Fulton, N., Platzer, A.: Safe reinforcement learning via formal methods: toward safe control through proof and learning. In: Proceedings of the AAAI Conference on Artificial Intelligence (2018)
[15] Fulton, N.; Platzer, A.; Vojnar, T.; Zhang, L., Verifiably safe off-model reinforcement learning, Tools and Algorithms for the Construction and Analysis of Systems, 413-430 (2019), Cham: Springer, Cham
[16] Gunter, E.: From natural language to linear temporal logic: aspects of specifying embedded systems in LTL. In: Monterey Workshop on Software Engineering for Embedded Systems: From Requirements to Implementation (2003)
[17] Hahn, EM; Perez, M.; Schewe, S.; Somenzi, F.; Trivedi, A.; Wojtczak, D.; Vojnar, T.; Zhang, L., Omega-regular objectives in model-free reinforcement learning, Tools and Algorithms for the Construction and Analysis of Systems, 395-412 (2019), Cham: Springer, Cham
[18] Hasanbeig, M., Abate, A., Kroening, D.: Logically-constrained reinforcement learning. arXiv preprint arXiv:1801.08099 (2018) · Zbl 1455.68190
[19] Hasanbeig, M., Abate, A., Kroening, D.: Certified reinforcement learning with logic guidance. arXiv preprint arXiv:1902.00778 (2019) · Zbl 1455.68190
[20] Hasanbeig, M., Abate, A., Kroening, D.: Logically-constrained neural fitted Q-iteration. In: AAMAS, pp. 2012-2014 (2019)
[21] Hasanbeig, M., Abate, A., Kroening, D.: Cautious reinforcement learning with logical constraints. In: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pp. 483-491. International Foundation for Autonomous Agents and Multiagent Systems (2020) · Zbl 1455.68190
[22] Hasanbeig, M., Kantaros, Y., Abate, A., Kroening, D., Pappas, G.J., Lee, I.: Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees. In: Proceedings of the 58th Conference on Decision and Control, pp. 5338-5343. IEEE (2019)
[23] Hasanbeig, M., Yogananda Jeppu, N., Abate, A., Melham, T., Kroening, D.: Deepsynth: program synthesis for automatic task segmentation in deep reinforcement learning. arXiv preprint arXiv:1911.10244 (2019)
[24] Huang, C., Xu, S., Wang, Z., Lan, S., Li, W., Zhu, Q.: Opportunistic intermittent control with safety guarantees for autonomous systems. arXiv preprint arXiv:2005.03726 (2020)
[25] Hunt, N., Fulton, N., Magliacane, S., Hoang, N., Das, S., Solar-Lezama, A.: Verifiably safe exploration for end-to-end reinforcement learning. arXiv preprint arXiv:2007.01223 (2020)
[26] Jansen, N., Könighofer, B., Junges, S., Bloem, R.: Shielded decision-making in MDPs. arXiv preprint arXiv:1807.06096 (2018)
[27] Junges, S.; Jansen, N.; Dehnert, C.; Topcu, U.; Katoen, J-P; Chechik, M.; Raskin, J-F, Safety-constrained reinforcement learning for MDPs, Tools and Algorithms for the Construction and Analysis of Systems, 130-146 (2016), Heidelberg: Springer, Heidelberg
[28] Kazemi, M., Soudjani, S.: Formal policy synthesis for continuous-space systems via reinforcement learning. arXiv preprint arXiv:2005.01319 (2020)
[29] Kearns, M.; Singh, S., Near-optimal reinforcement learning in polynomial time, Mach. Learn., 49, 2-3, 209-232 (2002) · Zbl 1014.68071
[30] Kulkarni, T.D., Narasimhan, K., Saeedi, A., Tenenbaum, J.: Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In: NIPS, pp. 3675-3683 (2016)
[31] Lavaei, A., Somenzi, F., Soudjani, S., Trivedi, A., Zamani, M.: Formal controller synthesis for continuous-space MDPs via model-free reinforcement learning. In: 2020 ACM/IEEE 11th International Conference on Cyber-Physical Systems (ICCPS), pp. 98-107. IEEE (2020)
[32] Levy, A., Konidaris, G., Platt, R., Saenko, K.: Learning multi-level hierarchies with hindsight. In: International Conference on Learning Representations (ICLR) (2019)
[33] Li, X., Ma, Y., Belta, C.: A policy search method for temporal logic specified reinforcement learning tasks. In: ACC, pp. 240-245 (2018)
[34] Lillicrap, T.P., et al.: Continuous control with deep reinforcement learning. arXiv:1509.02971 (2015)
[35] McEwen, AS, Recurring slope lineae in equatorial regions of Mars, Nat. Geosci., 7, 1, 53-58 (2014)
[36] Mnih, V., et al.: Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
[37] Newell, RG; Pizer, WA, Discounting the distant future: how much do uncertain rates increase valuations?, J. Environ. Econ. Manag., 46, 1, 52-71 (2003) · Zbl 1041.91502
[38] Nikora, A.P., Balcom, G.: Automated identification of LTL patterns in natural language requirements. In: ISSRE, pp. 185-194 (2009)
[39] Oura, R.; Sakakibara, A.; Ushio, T., Reinforcement learning of control policy for linear temporal logic specifications using limit-deterministic generalized Büchi automata, IEEE Control Syst. Lett., 4, 3, 761-766 (2020)
[40] Pitis, S.: Rethinking the discount factor in reinforcement learning: a decision theoretic approach. arXiv preprint arXiv:1902.02893 (2019)
[41] Pnueli, A.: The temporal logic of programs. In: Foundations of Computer Science, pp. 46-57 (1977)
[42] Polymenakos, K., Abate, A., Roberts, S.: Safe policy search using Gaussian process models. In: Proceedings of AAMAS, pp. 1565-1573 (2019)
[43] Precup, D.: Temporal abstraction in reinforcement learning. Ph.D. thesis, University of Massachusetts Amherst (2001)
[44] Sadigh, D., Kim, E.S., Coogan, S., Sastry, S.S., Seshia, S.A.: A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications. In: CDC, pp. 1091-1096 (2014)
[45] Sickert, S.; Esparza, J.; Jaax, S.; Křetínský, J.; Chaudhuri, S.; Farzan, A., Limit-deterministic Büchi automata for linear temporal logic, Computer Aided Verification, 312-332 (2016), Cham: Springer, Cham · Zbl 1411.68054
[46] Sickert, S.; Křetínský, J.; Artho, C.; Legay, A.; Peled, D., MoChiBA: probabilistic LTL model checking using limit-deterministic Büchi automata, Automated Technology for Verification and Analysis, 130-137 (2016), Cham: Springer, Cham
[47] Silver, D., Lever, G., Heess, N., Thomas Degris, D.W., Riedmiller, M.: Deterministic policy gradient algorithms. In: ICML (2014)
[48] Squyres, SW, Exploration of Victoria crater by the Mars rover opportunity, Science, 324, 5930, 1058-1061 (2009)
[49] Sutton, RS; Barto, AG, Reinforcement Learning: An Introduction (1998), Cambridge: MIT Press, Cambridge · Zbl 1407.68009
[50] Tassa, Y., et al.: Deepmind control suite. arXiv preprint arXiv:1801.00690 (2018)
[51] Toro Icarte, R., Klassen, T.Q., Valenzano, R., McIlraith, S.A.: Teaching multiple tasks to an RL agent using LTL. In: AAMA, pp. 452-461 (2018)
[52] Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., Agapiou, J., et al.: Strategic attentive writer for learning macro-actions. In: NIPS, pp. 3486-3494 (2016)
[53] Watkins, CJ; Dayan, P., Q-learning, Mach. Learn., 8, 3-4, 279-292 (1992) · Zbl 0773.68062
[54] Wei, Q.; Guo, X., Markov decision processes with state-dependent discount factors and unbounded rewards/costs, Oper. Res. Lett., 39, 5, 369-374 (2011) · Zbl 1235.90178
[55] Yan, R., Cheng, C.H., Chai, Y.: Formal consistency checking over specifications in natural languages. In: DATE, pp. 1677-1682 (2015)
[56] Yoshida, N., Uchibe, E., Doya, K.: Reinforcement learning with state-dependent discount factor. In: 2013 IEEE 3rd Joint International Conference on Development and Learning and Epigenetic Robotics (ICDL), pp. 1-6. IEEE (2013)
[57] Yuan, L.Z., Hasanbeig, M., Abate, A., Kroening, D.: Modular deep reinforcement learning with temporal logic specifications. arXiv preprint arXiv:1909.11591 (2019) · Zbl 1455.68190
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.