
Task-aware verifiable RNN-based policies for partially observable Markov decision processes. (English) Zbl 1522.68448

Summary: Partially observable Markov decision processes (POMDPs) are models for sequential decision-making under uncertainty and incomplete information. Machine learning methods typically train recurrent neural networks (RNNs) as effective representations of POMDP policies, since they can efficiently process sequential data. However, it is hard to verify whether a POMDP driven by such an RNN-based policy satisfies safety constraints, for instance given by temporal logic specifications. We propose a novel method that combines techniques from machine learning with the field of formal methods: training an RNN-based policy and then automatically extracting a so-called finite-state controller (FSC) from the RNN. Such FSCs offer a convenient way to verify temporal logic constraints: applied to a POMDP, an FSC induces a Markov chain, and probabilistic verification methods can efficiently check whether this induced Markov chain satisfies a temporal logic specification. If the Markov chain does not satisfy the specification, the verification yields, as a byproduct, diagnostic information about the states of the POMDP that are critical for the specification. The method exploits this diagnostic information either to adjust the complexity of the extracted FSC or to improve the policy by focused retraining of the RNN. The method synthesizes policies that satisfy temporal logic specifications for POMDPs with up to millions of states, three orders of magnitude larger than those handled by comparable approaches.
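
The verification step described above, where an FSC applied to a POMDP induces a finite Markov chain whose satisfaction of a specification can be checked, can be illustrated with a small self-contained sketch. The toy POMDP, the two-node FSC, and all identifiers below are illustrative assumptions and not the paper's models or code; in the actual method, the FSC is extracted from a trained RNN (e.g., by clustering its hidden states) and the induced chain is checked with a probabilistic model checker such as Storm against specifications richer than plain reachability.

# Minimal sketch (assumed, illustrative): build the Markov chain induced by an FSC
# on a toy POMDP and check a reachability threshold specification P<=0.2 [F bad].

# Toy POMDP: observation function and transition probabilities P(s' | s, a).
OBS = {"s0": "o0", "s1": "o1", "bad": "o_end", "goal": "o_end"}
TRANS = {
    ("s0", "a"): {"s1": 0.9, "bad": 0.1},
    ("s0", "b"): {"s0": 1.0},
    ("s1", "a"): {"goal": 0.8, "bad": 0.2},
    ("s1", "b"): {"s0": 1.0},
}

# Hand-written two-node FSC (standing in for one extracted from an RNN): both the
# chosen action and the next memory node depend only on (memory node, observation).
FSC_ACTION = {("m0", "o0"): "a", ("m0", "o1"): "a", ("m0", "o_end"): "a",
              ("m1", "o0"): "b", ("m1", "o1"): "a", ("m1", "o_end"): "a"}
FSC_UPDATE = {(n, o): "m1" if o == "o1" else "m0" for (n, o) in FSC_ACTION}

def prob_reach_bad(bad="bad", safe="goal", sweeps=1000):
    """Probability of eventually reaching `bad` in the induced Markov chain,
    computed per product state (POMDP state, memory node) by fixed-point iteration."""
    product = [(s, n) for s in OBS for n in ("m0", "m1")]
    p = {sn: 1.0 if sn[0] == bad else 0.0 for sn in product}
    for _ in range(sweeps):
        for (s, n) in product:
            if s in (bad, safe):
                continue  # absorbing: probability stays at 1 (bad) or 0 (goal)
            o = OBS[s]
            act, n_next = FSC_ACTION[(n, o)], FSC_UPDATE[(n, o)]
            p[(s, n)] = sum(pr * p[(s2, n_next)] for s2, pr in TRANS[(s, act)].items())
    return p

probs = prob_reach_bad()
print("P(eventually bad) from (s0, m0):", probs[("s0", "m0")])  # 0.28 for this FSC
print("spec 'P <= 0.2 [F bad]' satisfied:", probs[("s0", "m0")] <= 0.2)

Under this illustrative controller the induced chain violates the threshold specification, and the product states carrying high reachability probability are exactly the kind of diagnostic information that the proposed method feeds back into adjusting the FSC's size or focusing the retraining of the RNN.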

MSC:

68T05 Learning and adaptive systems in artificial intelligence
90C40 Markov and semi-Markov decision processes

Software:

Adam; Reluplex; Storm; POMDP
