
The factored policy-gradient planner. (English) Zbl 1192.68636

Summary: We present an anytime concurrent probabilistic temporal planner (CPTP) that handles continuous and discrete uncertainties and metric functions. Rather than relying on dynamic programming, our approach builds on methods from stochastic local policy search: we optimise a parameterised policy using gradient ascent. The flexibility of this policy-gradient approach, combined with its low memory use, its use of function approximation methods, and the factorisation of the policy, allows us to tackle complex domains. The resulting factored policy-gradient (FPG) planner can optimise the number of steps to the goal, the probability of success, or a combination of both. We compare the FPG planner to other planners on CPTP domains, and on the simpler but better-studied non-concurrent, non-temporal probabilistic planning (PP) domains. We also present FPG-ipc, the PP version of the planner, which was successful in the probabilistic track of the fifth international planning competition.
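The summary's core mechanism, gradient ascent on a factored parameterised policy, can be illustrated with a small REINFORCE-style sketch. Everything below (the toy chain domain, one-hot features, step sizes, and baseline) is an illustrative assumption rather than the paper's actual algorithm or domains; the FPG planner additionally gives each action its own function approximator deciding whether to start that action, which is what enables concurrency.

```python
# Minimal sketch (not the paper's implementation): REINFORCE-style gradient
# ascent on a policy whose per-action scores come from independent linear
# function approximators -- the simplest form of a "factored" policy.
# The chain MDP, features, and step sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 6                                     # chain states 0..N, goal at state N
ACTIONS = (-1, +1)                        # move left / move right
theta = np.zeros((len(ACTIONS), N + 1))   # one weight vector per action

def features(s):
    x = np.zeros(N + 1)
    x[s] = 1.0                            # one-hot encoding of the state
    return x

def policy(s):
    scores = theta @ features(s)          # each action scored independently
    p = np.exp(scores - scores.max())     # softmax over per-action scores
    return p / p.sum()

baseline = 0.0                            # running baseline (variance reduction)
for episode in range(3000):
    s, traj = 0, []
    for _ in range(100):                  # episode cutoff
        a = int(rng.choice(len(ACTIONS), p=policy(s)))
        traj.append((s, a))
        s = min(max(s + ACTIONS[a], 0), N)
        if s == N:
            break
    R = -len(traj)                        # optimise (negative) steps to goal
    grad = np.zeros_like(theta)
    for s_t, a_t in traj:                 # grad log pi(a|s) for a softmax:
        p_t, x = policy(s_t), features(s_t)
        grad[a_t] += x                    #   indicator term for the chosen action
        grad -= np.outer(p_t, x)          #   minus the expected-feature term
    baseline += 0.05 * (R - baseline)
    theta += 0.001 * (R - baseline) * grad

print(policy(0))                          # should now strongly prefer "right"
```

The per-action rows of `theta` are the simplest instance of the factorisation: each action's score depends only on its own parameters, so the gradient decomposes across actions and memory grows only linearly in the number of actions and features, which is what lets this family of methods scale to large domains.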

MSC:

68T20 Problem solving in the context of artificial intelligence (heuristics, search strategies, etc.)

Software:

PDDL; Graphplan
