Learning partial policies to speedup MDP tree search via reduction to i.i.d. learning. (English) Zbl 1434.68446

Summary: A popular approach for online decision-making in large MDPs is time-bounded tree search. The effectiveness of tree search, however, is largely influenced by the action branching factor, which limits the search depth given a time bound. An obvious way to reduce action branching is to consider only a subset of potentially good actions at each state as specified by a provided partial policy. In this work, we consider offline learning of such partial policies with the goal of speeding up search without significantly reducing decision-making quality. Our first contribution consists of reducing the learning problem to set learning. We give a reduction-style analysis of three such algorithms, each making different assumptions, which relates the set learning objectives to the sub-optimality of search using the learned partial policies. Our second contribution is to describe concrete implementations of the algorithms within the popular framework of Monte-Carlo tree search. Finally, the third contribution is to evaluate the learning algorithms on two challenging MDPs with large action branching factors. The results show that the learned partial policies can significantly improve the anytime performance of Monte-Carlo tree search.


68T05 Learning and adaptive systems in artificial intelligence
68W27 Online algorithms; streaming algorithms
90C40 Markov and semi-Markov decision processes
91B06 Decision theory
Full Text: Link


[1] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning Journal, 47(2-3):235–256, 2002. 32 · Zbl 1012.68093
[2] J Andrew Bagnell, Sham M Kakade, Jeff G Schneider, and Andrew Y Ng. Policy search by dynamic programming. In Advances in neural information processing systems, page None, 2003.
[3] Radha-Krishna Balla and Alan Fern. Uct for tactical assault planning in real-time strategy games. In International Joint Conference on Artificial Intelligence, pages 40–45, 2009.
[4] Andrew G Barto, Steven J Bradtke, and Satinder P Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81–138, 1995.
[5] Ronald Bjarnason, Alan Fern, and Prasad Tadepalli. Lower Bounding Klondike Solitaire with Monte-Carlo Planning. In International Conference on Automated Planning and Scheduling, 2009.
[6] Bonet Blai and Geffner Hector. Action Selection for MDPs: Anytime AO* Versus UCT. In AAAI Conference on Artificial Intelligence, 2012. · Zbl 1270.68012
[7] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. Computational Intelligence and AI in Games, IEEE Transactions on, 4(1):1–43, 2012.
[8] Vadim Bulitko and Greg Lee. Learning in Real-Time Search: A Unifying Framework. Journal of Artificial Intelligence Research, 25:119–157, 2006. · Zbl 1182.68160
[9] Guillaume Chaslot, Mark Winands, Jaap H van den Herik, Jos Uiterwijk, and Bruno Bouzy. Progressive strategies for monte-carlo tree search. In Joint Conference on Information Sciences, pages 655–661, 2007. · Zbl 1198.68225
[10] Adrien Cou¨etoux, Jean-Baptiste Hoock, Nataliya Sokolovska, Olivier Teytaud, and Nicolas Bonnard. Continuous upper confidence trees. In Learning and Intelligent Optimization, pages 433– 445. 2011a.
[11] Adrien Cou¨etoux, Mario Milone, Matyas Brendel, Hassen Doghmen, Michele Sebag, Olivier Teytaud, et al. Continuous rapid action value estimates. In The 3rd Asian Conference on Machine Learning, volume 20, pages 19–31. JMLR, 2011b.
[12] Debadeepta Dey, Tian Yu Liu, Martial Hebert, and J Andrew Bagnell. Contextual sequence prediction with application to control library optimization. Proceedings of robotics: Science and systems VIII, 2013.
[13] Hilmar Finnsson. Simulation-Based General Game Playing. PhD thesis, Reykjavik University, 2012.
[14] Hilmar Finnsson and Yngvi Bj¨ornsson. Learning Simulation Control in General Game-Playing Agents. In AAAI Conference on Artificial Intelligence, pages 954–959, 2010.
[15] Romaric Gaudel and Michele Sebag. Feature selection as a one-player game. In International Conference on Machine Learning, pages 359–366, 2010.
[16] Sylvain Gelly and David Silver. Combining online and offline knowledge in UCT. In International Conference on Machine Learning, pages 273–280, 2007. 33
[17] Sylvain Gelly, Yizao Wang, R´emi Munos, and Olivier Teytaud. Modification of UCT with patterns in Monte-Carlo Go. Technical report, INRIA, 2006.
[18] Sylvain Gelly, Levente Kocsis, Marc Schoenauer, Michele Sebag, David Silver, Csaba Szepesv´ari, and Olivier Teytaud. The grand challenge of computer go: Monte carlo tree search and extensions. Communications of the ACM, 55(3):106–113, 2012.
[19] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Advances in Neural Information Processing Systems, 2014.
[20] Todd Hester and Peter Stone. TEXPLORE: real-time sample-efficient reinforcement learning for robots. Machine Learning Journal, 90(3):385–429, 2013.
[21] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.
[22] Michael Kearns, Yishay Mansour, and Andrew Y Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning Journal, 49(2-3):193–208, 2002. · Zbl 1014.68150
[23] Thomas Keller and Malte Helmert. Trial-based heuristic tree search for finite horizon mdps. In International Conference on Automated Planning and Scheduling, 2013.
[24] Levente Kocsis and Csaba Szepesv´ari. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. 2006.
[25] Richard E Korf. Depth-first iterative-deepening: An optimal admissible tree search. Artificial intelligence, 27(1):97–109, 1985. · Zbl 0573.68030
[26] John Langford. Vowpal Wabbit. URL https://github. com/JohnLangford/vowpal wabbit/wiki, 2011.
[27] Jean M´ehat and Tristan Cazenave. Combining uct and nested monte carlo search for single-player general game playing. Computational Intelligence and AI in Games, IEEE Transactions on, 2(4): 271–277, 2010.
[28] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[29] Jervis Pinto and Alan Fern. Learning partial policies to speedup mdp tree search. In Conference on Uncertainty in Artificial Intelligence, 2014. · Zbl 1434.68446
[30] D. Pomerleau. ALVINN: an autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, 1989.
[31] Alexander Reinefeld and T. Anthony Marsland. Enhanced iterative-deepening search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(7):701–710, 1994.
[32] St´ephane Ross and Drew Bagnell. Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics, pages 661–668, 2010. 34
[33] Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive noregret learning. arXiv preprint arXiv:1406.5979, 2014.
[34] St´ephane Ross, Geoffrey J Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.
[35] Stephane Ross, Jiaji Zhou, Yisong Yue, Debadeepta Dey, and Drew Bagnell. Learning policies for contextual submodular prediction. In Proceedings of The 30th International Conference on Machine Learning, pages 1364–1372, 2013.
[36] Claude Sammut, Scott Hurst, Dana Kedzier, and Donald Michie. Learning to fly. In International Workshop on Machine Learning, 1992.
[37] David Silver and Gerald Tesauro. Monte-Carlo simulation balancing. In International Conference on Machine Learning, pages 945–952, 2009.
[38] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[39] Jonathan Sorg, Satinder P Singh, and Richard L Lewis. Optimal Rewards versus Leaf-Evaluation Heuristics in Planning Agents. In AAAI Conference on Artificial Intelligence, 2011.
[40] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. Cambridge Univ Press, 1998. · Zbl 0465.68041
[41] Umar Syed and Robert E Schapire. A Reduction from Apprenticeship Learning to Classification. In Advances in Neural Information Processing Systems, pages 2253–2261, 2010.
[42] Joel Veness, David Silver, Alan Blair, and William W Cohen. Bootstrapping from game tree search. In Advances in Neural Information Processing Systems, pages 1937–1945, 2009.
[43] Joel Veness, Kee Siong Ng, Marcus Hutter, William Uther, and David Silver. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40(1):95–142, 2011. · Zbl 1214.68302
[44] Thomas J Walsh, Sergiu Goschin, and Michael L Littman. Integrating Sample-Based Planning and Model-Based Reinforcement Learning. In AAAI Conference on Artificial Intelligence, 2010.
[45] Yuehua Xu, Alan Fern, and Sungwook Yoon. Learning linear ranking functions for beam search with application to planning. The Journal of Machine Learning Research, 10:1571–1610, December 2009. · Zbl 1235.68210
[46] Sungwook Yoon, Alan Fern, and Robert Givan. Learning control knowledge for forward search planning. The Journal of Machine Learning Research, 9:683–718, 2008. · Zbl 1225.68246
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.