zbMATH — the first resource for mathematics

Solving factored MDPs using non-homogeneous partitions. (English) Zbl 1082.68804
Summary: We present an algorithm for aggregating states when solving large MDPs (represented as factored MDPs) using search by successive refinement in the space of non-homogeneous partitions. Homogeneity is defined in terms of stochastic bisimulation and reward equivalence within blocks of a partition. Since homogeneous partitions that define equivalent reduced-state-space MDPs can have a large number of blocks, we relax the requirement of homogeneity. The algorithm constructs approximate aggregate MDPs from non-homogeneous partitions, solves the aggregate MDPs exactly, and then uses the resulting value functions as part of a heuristic for refining the current best non-homogeneous partition. We outline the theory motivating the use of this heuristic and present empirical results. In addition to investigating more exhaustive local search methods, we explore the use of techniques derived from research on discretizing continuous state spaces. Finally, we compare the results from our algorithms, which search in the space of non-homogeneous partitions, with exact and approximate algorithms that represent homogeneous and approximately homogeneous partitions as decision trees or algebraic decision diagrams.
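The core loop described above (build an approximate aggregate MDP from a partition, solve it exactly, lift the value function back to the original states) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the uniform within-block averaging, the hand-chosen partition, and all function names are assumptions for the example, and the averaging is exact only when the partition is homogeneous.

```python
import numpy as np

def aggregate_mdp(P, R, blocks):
    """Build an approximate aggregate MDP from a (possibly non-homogeneous)
    partition. P has shape (A, S, S), R has shape (S,); blocks is a list of
    lists of state indices. Within-block dynamics and rewards are averaged
    uniformly, which is exact only for a homogeneous partition."""
    A, S, _ = P.shape
    K = len(blocks)
    P_agg = np.zeros((A, K, K))
    R_agg = np.zeros(K)
    for i, Bi in enumerate(blocks):
        R_agg[i] = np.mean([R[s] for s in Bi])
        for a in range(A):
            for j, Bj in enumerate(blocks):
                # average over source states in Bi; sum over destinations in Bj
                P_agg[a, i, j] = np.mean([P[a, s, list(Bj)].sum() for s in Bi])
    return P_agg, R_agg

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Standard value iteration; returns the optimal value function."""
    A, S, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R[None, :] + gamma * (P @ V)   # Q-values, shape (A, S)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Tiny 4-state, 2-action MDP; the 2-block partition is chosen by hand here,
# whereas the paper searches over partitions by successive refinement.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(2, 4))  # (A=2, S=4, S=4), rows sum to 1
R = np.array([0.0, 0.0, 1.0, 1.0])
blocks = [[0, 1], [2, 3]]

P_agg, R_agg = aggregate_mdp(P, R, blocks)
V_agg = value_iteration(P_agg, R_agg)       # solve the aggregate MDP exactly

# Lift aggregate values back to the original states; these lifted values
# serve as the heuristic guiding refinement of the current partition.
V_lift = np.empty(4)
for i, B in enumerate(blocks):
    V_lift[list(B)] = V_agg[i]
```

A refinement step would then split the block whose states disagree most under the lifted value function (or under one-step lookahead), re-aggregate, and re-solve.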

68T20 Problem solving in the context of artificial intelligence (heuristics, search strategies, etc.)
68T37 Reasoning under uncertainty in the context of artificial intelligence
Full Text: DOI