## Adaptive playouts for online learning of policies during Monte Carlo tree search

*(English)* Zbl 1370.68260

Summary: Monte Carlo Tree Search evaluates positions with the help of a playout policy. If the playout policy misevaluates a position, there are cases where the tree search has difficulty finding the correct move because of the large search space. This paper explores adaptive playout policies, which improve the playout policy during a tree search. With the help of policy-gradient reinforcement learning techniques, we optimize the playout policy to give better evaluations. We tested the algorithm in Computer Go and measured an increase in playing strength of more than 100 Elo. The resulting program was able to handle difficult test cases which are known to pose a problem for Monte Carlo Tree Search.
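The summary's core idea, adapting a softmax playout policy online with a policy-gradient update driven by playout outcomes, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the single-feature move encoding, the learning rate, and the ±1 reward scheme are simplifications chosen for brevity.

```python
import math
import random

class AdaptivePlayoutPolicy:
    """Softmax playout policy over move features, adapted online with a
    REINFORCE-style policy-gradient update (simplified sketch)."""

    def __init__(self, n_features, alpha=0.1, seed=0):
        self.w = [0.0] * n_features   # one weight per move feature
        self.alpha = alpha            # learning rate
        self.rng = random.Random(seed)

    def probs(self, candidates):
        # Each candidate move is described by a single feature id here;
        # the real policy would sum weights of several features per move.
        exps = [math.exp(self.w[f]) for f in candidates]
        z = sum(exps)
        return [e / z for e in exps]

    def sample(self, candidates):
        """Pick a playout move proportionally to its softmax probability."""
        p = self.probs(candidates)
        r, acc = self.rng.random(), 0.0
        for f, pi in zip(candidates, p):
            acc += pi
            if r <= acc:
                return f
        return candidates[-1]

    def update(self, chosen, candidates, reward):
        """Policy-gradient step after a playout: reward is e.g. +1 for a
        win and -1 for a loss from the playout player's perspective."""
        p = self.probs(candidates)
        for f, pi in zip(candidates, p):
            # Gradient of log softmax: 1[f == chosen] - p(f)
            grad = (1.0 if f == chosen else 0.0) - pi
            self.w[f] += self.alpha * reward * grad

# Toy usage: if playouts choosing feature 0 keep winning, its weight
# (and hence its selection probability) grows during the search.
policy = AdaptivePlayoutPolicy(n_features=2)
for _ in range(200):
    policy.update(chosen=0, candidates=[0, 1], reward=1.0)
```

In the paper the updates happen during the playouts of a single tree search, so the policy specializes to the position currently being searched rather than being trained once offline.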

### MSC:

- 68T20 Problem solving in the context of artificial intelligence (heuristics, search strategies, etc.)
- 68T05 Learning and adaptive systems in artificial intelligence
- 91A46 Combinatorial games

### Software:

PACHI

\textit{T. Graf} and \textit{M. Platzner}, Theor. Comput. Sci. 644, 53--62 (2016; Zbl 1370.68260)


### References:

[1] Baier, H.; Drake, P. D., The power of forgetting: improving the last-good-reply policy in Monte Carlo Go, IEEE Trans. Computat. Intell. AI Games, 2, 4, 303-309, (2010)

[2] Baudiš, Petr, Effect of LGRF on the playing strength against GNU Go, (2012), visited on 09.03.2015

[3] Baudiš, Petr; Gailly, Jean-loup, PACHI: state of the art open source Go program, (van den Herik, H. Jaap; Plaat, Aske, Advances in Computer Games, Lecture Notes in Computer Science, vol. 7168, (2012), Springer Berlin, Heidelberg), 24-38

[4] Bottou, Léon, Stochastic gradient tricks, (Montavon, Grégoire; Orr, Genevieve B.; Müller, Klaus-Robert, Neural Networks: Tricks of the Trade, Reloaded, Lecture Notes in Computer Science, vol. 7700, (2012), Springer), 430-445

[5] Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; Colton, S., A survey of Monte Carlo tree search methods, IEEE Trans. Computat. Intell. AI Games, 4, 1, 1-43, (2012)

[6] Chaslot, G.; Winands, M.; Uiterwijk, J.; van den Herik, H. J.; Bouzy, B., Progressive strategies for Monte-Carlo tree search, New Math. Nat. Comput., 4, 3, 343-357, (2008) · Zbl 1198.68225

[7] Coquelin, Pierre-Arnaud; Munos, Rémi, Bandit algorithms for tree search, (Uncertainty in Artificial Intelligence, (2007), Vancouver, Canada)

[8] Drake, Peter, The last-good-reply policy for Monte-Carlo Go, ICGA Journal, 32, 4, 221-227, (2009)

[9] Gelly, Sylvain; Silver, David, Combining online and offline knowledge in UCT, (Proceedings of the 24th International Conference on Machine Learning, ICML '07, (2007), ACM New York, NY, USA), 273-280

[10] Graf, T.; Platzner, M., Common fate graph patterns in Monte Carlo tree search for computer Go, (2014 IEEE Conference on Computational Intelligence and Games (CIG), (August 2014)), 1-8

[11] Graf, Tobias; Platzner, Marco, Adaptive playouts in Monte-Carlo tree search with policy-gradient reinforcement learning, (Plaat, Aske; van den Herik, Jaap; Kosters, Walter, Advances in Computer Games, Lecture Notes in Computer Science, vol. 9525, (2015), Springer International Publishing), 1-11

[12] Greensmith, Evan; Bartlett, Peter L.; Baxter, Jonathan, Variance reduction techniques for gradient estimates in reinforcement learning, J. Mach. Learn. Res., 1471-1530, (2001) · Zbl 1222.68207

[13] Huang, Shih-Chieh; Coulom, Rémi; Lin, Shun-Shii, Monte-Carlo simulation balancing in practice, (Proceedings of the 7th International Conference on Computers and Games, CG'10, (2011), Springer-Verlag Berlin, Heidelberg), 81-92 · Zbl 1316.68138

[14] Huang, Shih-Chieh; Müller, Martin, Investigating the limits of Monte-Carlo tree search methods in computer Go, (van den Herik, H. Jaap; Iida, Hiroyuki; Plaat, Aske, Computers and Games, Lecture Notes in Computer Science, (2014), Springer International Publishing), 39-48 · Zbl 1437.91008

[15] Ikeda, Kokolo; Viennot, Simon, Efficiency of static knowledge bias in Monte-Carlo tree search, (Computers and Games, (2013)) · Zbl 1437.91009

[16] Kocsis, Levente; Szepesvári, Csaba, Bandit based Monte-Carlo planning, (ECML-06, Lecture Notes in Computer Science, vol. 4212, (2006), Springer), 282-293

[17] Lucas, Simon M.; Samothrakis, Spyridon; Perez, Diego, Fast evolutionary adaptation for Monte Carlo tree search, (Applications of Evolutionary Computation, Lecture Notes in Computer Science, (2014), Springer Berlin, Heidelberg), 349-360

[18] Perez, D.; Samothrakis, S.; Lucas, S., Knowledge-based fast evolutionary MCTS for general video game playing, (2014 IEEE Conference on Computational Intelligence and Games (CIG), (August 2014)), 1-8

[19] Silver, David, Reinforcement learning and simulation-based search in computer Go, (2009), University of Alberta, PhD thesis

[20] Silver, David; Sutton, Richard S.; Müller, Martin, Sample-based learning and search with permanent and transient memories, (Proceedings of the 25th International Conference on Machine Learning, ICML '08, (2008)), 968-975

[21] Szepesvári, Csaba, Algorithms for reinforcement learning, (2010), Morgan and Claypool · Zbl 1205.68320

[22] Weaver, Lex; Tao, Nigel, The optimal reward baseline for gradient-based reinforcement learning, (Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, (2001), Morgan Kaufmann Publishers Inc. San Francisco, CA, USA), 538-545

[23] Williams, Ronald J., Simple statistical gradient-following algorithms for connectionist reinforcement learning, Mach. Learn., 8, 229-256, (1992) · Zbl 0772.68076
