
Stochastic dynamic programming with factored representations. (English) Zbl 0948.68167

Summary: Markov Decision Processes (MDPs) have proven to be popular models for decision-theoretic planning, but standard dynamic programming algorithms for solving MDPs rely on explicit, state-based specifications and computations. To alleviate the combinatorial problems associated with such methods, we propose new representational and computational techniques for MDPs that exploit certain types of problem structure. We use dynamic Bayesian networks (with decision trees representing the local families of conditional probability distributions) to represent stochastic actions in an MDP, together with a decision-tree representation of rewards. Based on this representation, we develop versions of standard dynamic programming algorithms that directly manipulate decision-tree representations of policies and value functions. This generally obviates the need for state-by-state computation, aggregating states at the leaves of these trees and requiring computations only for each aggregate state. The key to these algorithms is a decision-theoretic generalization of classic regression analysis, in which we determine the features relevant to predicting expected value. We demonstrate the method empirically on several planning problems, showing significant savings for certain types of domains. We also identify certain classes of problems for which this technique fails to perform well and suggest extensions and related ideas that may prove useful in such circumstances. We also briefly describe an approximation scheme based on this approach.
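
To make the aggregation idea concrete, the following is a minimal Python sketch of one structured Bellman backup, assuming a toy two-variable "coffee delivery" action; it is not code from the paper. Reward, value function, and the action's per-variable conditional probability tables are all decision trees over boolean variables, and the backup is computed once per assignment of the relevant variables rather than once per state. The variable and function names (evaluate, relevant_vars, q_backup) and the simplification of returning a table over abstract states instead of a merged Q-tree are illustrative assumptions.

from itertools import product

# Decision trees as nested tuples: ('var', true_subtree, false_subtree), or a numeric leaf.
def evaluate(tree, state):
    """Evaluate a decision tree at a dict of boolean variable values."""
    while isinstance(tree, tuple):
        var, t_branch, f_branch = tree
        tree = t_branch if state[var] else f_branch
    return tree

def tree_vars(tree):
    """Set of variables tested anywhere in a decision tree."""
    if not isinstance(tree, tuple):
        return set()
    var, t, f = tree
    return {var} | tree_vars(t) | tree_vars(f)

# A factored action: for each post-action variable X', a tree over pre-action
# variables giving Pr(X' = true).  (Toy 'deliver coffee' action, assumed.)
ACTION = {
    "user_has_coffee": ("robot_has_coffee",
                        ("user_has_coffee", 1.0, 0.8),   # delivery succeeds with prob. 0.8
                        ("user_has_coffee", 1.0, 0.0)),  # cannot deliver without coffee
    "robot_has_coffee": ("robot_has_coffee", 0.1, 0.0),  # robot usually hands its coffee over
}
REWARD_TREE = ("user_has_coffee", 10.0, 0.0)
GAMMA = 0.9

def relevant_vars(value_tree, reward_tree, action):
    """Variables needed for the backup: those tested by the reward and value trees,
    plus every pre-action variable mentioned by the CPT trees of the value tree's tests."""
    rel = tree_vars(value_tree) | tree_vars(reward_tree)
    for v in tree_vars(value_tree):
        rel |= tree_vars(action[v])
    return sorted(rel)

def q_backup(value_tree, reward_tree, action, gamma):
    """One Bellman backup per *abstract* state: enumerate assignments to the
    relevant variables only, never the full state space."""
    rel = relevant_vars(value_tree, reward_tree, action)
    vvars = sorted(tree_vars(value_tree))
    q = {}
    for bits in product([True, False], repeat=len(rel)):
        s = dict(zip(rel, bits))
        exp_v = 0.0
        # Expected next value: sum over joint outcomes of the variables the value tree tests.
        for outcome in product([True, False], repeat=len(vvars)):
            nxt = dict(zip(vvars, outcome))
            p = 1.0
            for var, val in nxt.items():
                p_true = evaluate(action[var], s)
                p *= p_true if val else (1.0 - p_true)
            exp_v += p * evaluate(value_tree, nxt)
        q[tuple(sorted(s.items()))] = evaluate(reward_tree, s) + gamma * exp_v
    return q

if __name__ == "__main__":
    v0 = REWARD_TREE   # start structured value iteration with V0 = R
    for abstract_state, qval in q_backup(v0, REWARD_TREE, ACTION, GAMMA).items():
        print(dict(abstract_state), round(qval, 3))

The sketch only demonstrates why restricting attention to relevant variables suffices; the algorithms summarized above go further and build the Q-function and value function directly as decision trees, merging the CPT trees during decision-theoretic regression so that the abstract states emerge as the leaves of the resulting trees.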

MSC:

68T20 Problem solving in the context of artificial intelligence (heuristics, search strategies, etc.)

Software:

UCPOP

References:

[1] Atkeson, C. G.; Moore, A. W.; Schaal, S., Locally weighted learning for control, Artificial Intelligence Review, Vol. 11, 75-113 (1997)
[2] Bahar, R. I.; Frohm, E. A.; Gaona, C. M.; Hachtel, G. D.; Macii, E.; Pardo, A.; Somenzi, F., Algebraic decision diagrams and their applications, (Proc. International Conference on Computer-Aided Design (1993)), 188-191
[3] Bellman, R. E., Dynamic Programming (1957), Princeton University Press: Princeton University Press Princeton, NJ
[4] Bertsekas, D. P.; Castanon, D. A., Adaptive aggregation for infinite horizon dynamic programming, IEEE Trans. Automat. Control, Vol. 34, 589-598 (1989) · Zbl 0675.90089
[5] Bertsekas, D. P., Dynamic Programming: Deterministic and Stochastic Models (1987), Prentice-Hall: Prentice-Hall Englewood Cliffs, NJ · Zbl 0649.93001
[6] Bertsekas, D. P.; Tsitsiklis, J. N., Neuro-Dynamic Programming (1996), Athena: Athena Belmont, MA · Zbl 0924.68163
[7] Bohanec, M.; Bratko, I., Trading accuracy for simplicity in decision trees, Machine Learning, Vol. 15, 223-250 (1994) · Zbl 0811.68112
[8] Boutilier, C., Correlated action effects in decision theoretic regression, (Proc. 13th Conference on Uncertainty in Artificial Intelligence, Providence, RI (1997)), 30-37
[9] Boutilier, C.; Brafman, R. I.; Geib, C., Prioritized goal decomposition of Markov decision processes: Toward a synthesis of classical and decision theoretic planning, (Proc. IJCAI-97, Nagoya, Japan (1997)), 1156-1162
[10] Boutilier, C.; Brafman, R. I.; Geib, C., Structured reachability analysis for Markov decision processes, (Proc. 14th Conference on Uncertainty in Artificial Intelligence, Madison, WI (1998)), 24-32
[11] Boutilier, C.; Dean, T.; Hanks, S., Decision theoretic planning: Structural assumptions and computational leverage, J. Artificial Intelligence Res., Vol. 11, 1-94 (1999) · Zbl 0918.68110
[12] Boutilier, C.; Dearden, R., Using abstractions for decision-theoretic planning with time constraints, (Proc. AAAI-94, Seattle, WA (1994)), 1016-1022
[13] Boutilier, C.; Dearden, R., Approximating value trees in structured dynamic programming, (Proc. 13th International Conference on Machine Learning, Bari, Italy (1996)), 54-62
[14] Boutilier, C.; Friedman, N.; Goldszmidt, M.; Koller, D., Context-specific independence in Bayesian networks, (Proc. 12th Conference on Uncertainty in Artificial Intelligence, Portland, OR (1996)), 115-123
[15] Boutilier, C.; Goldszmidt, M., The frame problem and Bayesian network action representations, (Proc. 11th Biennial Canadian Conference on Artificial Intelligence, Toronto, Ontario (1996)), 69-83
[16] Boutilier, C.; Poole, D., Computing optimal policies for partially observable decision processes using compact representations, (Proc. AAAI-96, Portland, OR (1996)), 1168-1175
[17] Boutilier, C.; Puterman, M. L., Process-oriented planning and average-reward optimality, (Proc. IJCAI-95, Montreal, Quebec (1995)), 1096-1103
[18] Boyan, J. A.; Moore, A. W., Generalization in reinforcement learning: Safely approximating the value function, (Tesauro, G.; Touretzky, D. S.; Leen, T. K., Advances in Neural Information Processing Systems 7 (1995), MIT Press: MIT Press Cambridge, MA)
[19] Bryant, R. E., Graph-based algorithms for Boolean function manipulation, IEEE Trans. Comput., Vol. C-35, 8, 677-691 (1986) · Zbl 0593.94022
[20] Burch, J. R.; Clarke, E. M.; McMillan, K. L.; Dill, D. L.; Hwang, L. J., Symbolic model checking: \(10^{20}\) states and beyond, (Proc. Conference on Logic in Computer Science (1990)), 428-439 · Zbl 0753.68066
[21] Cassandra, A. R.; Kaelbling, L. P.; Littman, M. L., Acting optimally in partially observable stochastic domains, (Proc. AAAI-94, Seattle, WA (1994)), 1023-1028
[22] Chapman, D., Planning for conjunctive goals, Artificial Intelligence, Vol. 32, 3, 333-377 (1987) · Zbl 0642.68171
[23] Chapman, D.; Kaelbling, L. P., Input generalization in delayed reinforcement learning: An algorithm and performance comparisons, (Proc. IJCAI-91, Sydney, Australia (1991)), 726-731 · Zbl 0748.68047
[24] Clarke, E. M.; Emerson, E. A.; Sistla, A. P., Automatic verification of finite state concurrent systems using temporal logic specifications: A practical approach, (Proc. 10th ACM Symposium on Principles of Programming Languages, Austin, TX (1983)), 117-126 · Zbl 0591.68027
[25] Darwiche, A.; Goldszmidt, M., Action networks: A framework for reasoning about actions and change under uncertainty, (Proc. 10th Conference on Uncertainty in Artificial Intelligence, Seattle, WA (1994)), 136-144
[26] Dean, T.; Givan, R., Model minimization in Markov decision processes, (Proc. AAAI-97, Providence, RI (1997)), 106-111
[27] Dean, T.; Givan, R.; Leach, S., Model reduction techniques for computing approximately optimal solutions for Markov decision processes, (Proc. 13th Conference on Uncertainty in Artificial Intelligence, Providence, RI (1997)), 124-131
[28] Dean, T.; Kaelbling, L. P.; Kirman, J.; Nicholson, A., Planning with deadlines in stochastic domains, (Proc. AAAI-93, Washington, DC (1993)), 574-579
[29] Dean, T.; Kanazawa, K., A model for reasoning about persistence and causation, Computational Intelligence, Vol. 5, 3, 142-150 (1989)
[30] Dearden, R.; Boutilier, C., Abstraction and approximate decision theoretic planning, Artificial Intelligence, Vol. 89, 219-283 (1997) · Zbl 1042.68669
[31] Dietterich, T. G., The MAXQ method for hierarchical reinforcement learning, (Proc. 15th International Conference on Machine Learning, Madison, WI (1998)), 118-126 · Zbl 0963.68085
[32] Dietterich, T. G.; Flann, N. S., Explanation-based learning and reinforcement learning: A unified approach, (Proc. 12th International Conference on Machine Learning, Lake Tahoe, CA (1995)), 176-184
[33] Dietterich, T. G.; Flann, N. S., Explanation-based learning and reinforcement learning: A unified view, Machine Learning, Vol. 28, 2, 169-210 (1997)
[34] Dijkstra, E. W., Guarded commands, nondeterminacy and formal derivation of programs, Comm. ACM, Vol. 18, 8, 453-457 (1975) · Zbl 0308.68017
[35] Feldman, J. A.; Sproull, R. F., Decision theory and artificial intelligence II: The hungry monkey, Cognitive Sci., Vol. 1, 158-192 (1977)
[36] Fikes, R. E.; Nilsson, N. J., STRIPS: A new approach to the application of theorem proving to problem solving, Artificial Intelligence, Vol. 2, 189-208 (1971) · Zbl 0234.68036
[37] Gábor, Z.; Kalmár, Z.; Szepesvári, C., Multi-criteria reinforcement learning, (Proc. 15th International Conference on Machine Learning, Madison, WI (1998)), 197-205
[38] Geiger, D.; Heckerman, D., Advances in probabilistic reasoning, (Proc. 7th Conference on Uncertainty in Artificial Intelligence, Los Angeles, CA (1991)), 118-126
[39] Givan, R.; Dean, T., Model minimization, regression, and propositional STRIPS planning, (Proc. IJCAI-97, Nagoya, Japan (1997)), 1163-1168
[40] Hanks, S.; McDermott, D. V., Modeling a dynamic and uncertain world I: Symbolic and probabilistic reasoning about change, Artificial Intelligence, Vol. 66, 1-55 (1994) · Zbl 0803.68121
[41] Hanks, S. J., Projecting Plans for Uncertain Worlds, Ph.D. Thesis (1990), Yale University: Yale University New Haven, CT
[42] Hartmanis, J.; Stearns, R. E., Algebraic Structure Theory of Sequential Machines (1966), Prentice-Hall: Prentice-Hall Englewood Cliffs, NJ · Zbl 0154.41701
[43] Hoey, J.; St-Aubin, R.; Hu, A.; Boutilier, C., SPUDD: Stochastic planning using decision diagrams, (Proc. 15th Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden (1999)), 279-288
[44] Howard, R. A., Dynamic Programming and Markov Processes (1960), MIT Press: MIT Press Cambridge, MA · Zbl 0091.16001
[45] (Howard, R. A.; Matheson, J. E., Readings on the Principles and Applications of Decision Analysis (1984), Strategic Decision Group: Strategic Decision Group Menlo Park, CA)
[46] Hyafil, L.; Rivest, R. L., Constructing optimal binary decision trees is NP-complete, Inform. Process. Lett., Vol. 5, 15-17 (1976) · Zbl 0333.68029
[47] Kaelbling, L. P., Hierarchical reinforcement learning: Preliminary results, (Proc. 10th International Conference on Machine Learning, Amherst, MA (1993)), 167-173 · Zbl 0994.68154
[48] Kaelbling, L. P.; Littman, M. L.; Moore, A. W., Reinforcement learning: A survey, J. Artificial Intelligence Res., Vol. 4, 237-285 (1996)
[49] Keeney, R. L.; Raiffa, H., Decisions with Multiple Objectives: Preferences and Value Trade-offs (1976), Wiley: Wiley New York · Zbl 0488.90001
[50] Kushmerick, N.; Hanks, S.; Weld, D., An algorithm for probabilistic planning, Artificial Intelligence, Vol. 76, 239-286 (1995)
[51] Lee, D.; Yannakakis, M., Online minimization of transition systems, (Proc. 24th Annual ACM Symposium on the Theory of Computing (STOC-92), Victoria, BC (1992)), 264-274
[52] Littman, M. L., Algorithms for sequential decision making, Ph.D. Thesis CS-96-09 (1996), Brown University, Department of Computer Science: Brown University, Department of Computer Science Providence, RI
[53] Lovejoy, W. S., A survey of algorithmic methods for partially observed Markov decision processes, Ann. Oper. Res., Vol. 28, 47-66 (1991) · Zbl 0717.90086
[54] McAllester, D.; Rosenblitt, D., Systematic nonlinear planning, (Proc. AAAI-91, Anaheim, CA (1991)), 634-639
[55] McCarthy, J.; Hayes, P. J., Some philosophical problems from the standpoint of artificial intelligence, (Meltzer, B.; Michie, D., Machine Intelligence 4 (1969), Edinburgh University Press: Edinburgh University Press Edinburgh), 463-502 · Zbl 0226.68044
[56] Meuleau, N.; Hauskrecht, M.; Kim, K.-E.; Peshkin, L.; Kaelbling, L. P.; Dean, T.; Boutilier, C., Solving very large weakly coupled Markov decision processes, (Proc. AAAI-98, Madison, WI (1998)), 165-172
[57] Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference (1988), Morgan Kaufmann: Morgan Kaufmann San Mateo, CA
[58] Penberthy, J. S.; Weld, D. S., UCPOP: A sound, complete, partial order planner for ADL, (Proc. 3rd International Conference on Principles of Knowledge Representation and Reasoning (KR-92), Cambridge, MA (1992)), 103-114
[59] Poole, D., Probabilistic Horn abduction and Bayesian networks, Artificial Intelligence, Vol. 64, 1, 81-129 (1993) · Zbl 0792.68176
[60] Poole, D., The independent choice logic for modelling multiple agents under uncertainty, Artificial Intelligence, Vol. 94, 1-2, 7-56 (1997) · Zbl 0902.03017
[61] Precup, D.; Sutton, R. S.; Singh, S., Theoretical results on reinforcement learning with temporally abstract behaviors, (Proc. 10th European Conference on Machine Learning, Chemnitz, Germany (1998)), 382-393
[62] Puterman, M. L., Markov Decision Processes: Discrete Stochastic Dynamic Programming (1994), Wiley: Wiley New York · Zbl 0829.90134
[63] Puterman, M. L.; Shin, M. C., Modified policy iteration algorithms for discounted Markov decision problems, Management Science, Vol. 24, 1127-1137 (1978) · Zbl 0391.90093
[64] Quinlan, J. R., C4.5: Programs for Machine Learning (1993), Morgan Kaufmann: Morgan Kaufmann San Mateo, CA
[65] Rivest, R. L., Learning decision lists, Machine Learning, Vol. 2, 229-246 (1987)
[66] Sacerdoti, E. D., The nonlinear nature of plans, (Proc. IJCAI-75, Tbilisi, Georgia (1975)), 206-214
[67] Schoppers, M. J., Universal plans for reactive robots in unpredictable environments, (Proc. IJCAI-87, Milan, Italy (1987)), 1039-1046
[68] Schweitzer, P. L.; Puterman, M. L.; Kindle, K. W., Iterative aggregation-disaggregation procedures for discounted semi-Markov reward processes, Oper. Res., Vol. 33, 589-605 (1985) · Zbl 0571.90094
[69] Shachter, R. D., Evaluating influence diagrams, Oper. Res., Vol. 34, 6, 871-882 (1986)
[70] Shimony, S. E., The role of relevance in explanation I: Irrelevance as statistical independence, Internat. J. Approx. Reason., Vol. 8, 4, 281-324 (1993) · Zbl 0776.68104
[71] Singh, S. P.; Cohn, D., How to dynamically merge Markov decision processes, (Advances in Neural Information Processing Systems 10 (1998), MIT Press: MIT Press Cambridge, MA), 1057-1063
[72] Singh, S. P., Transfer of learning by composing solutions of elemental sequential tasks, Machine Learning, Vol. 8, 323-339 (1992) · Zbl 0772.68073
[73] Smallwood, R. D.; Sondik, E. J., The optimal control of partially observable Markov processes over a finite horizon, Oper. Res., Vol. 21, 1071-1088 (1973) · Zbl 0275.93059
[74] Smith, J. E.; Holtzman, S.; Matheson, J. E., Structuring conditional relationships in influence diagrams, Oper. Res., Vol. 41, 2, 280-297 (1993)
[75] Sondik, E. J., The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs, Oper. Res., Vol. 26, 282-304 (1978) · Zbl 0379.60067
[76] Sutton, R. S., Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, (Proc. 7th International Conference on Machine Learning, Austin, TX (1990)), 216-224
[77] Sutton, R. S.; Barto, A. G., Reinforcement Learning: An Introduction (1998), MIT Press: MIT Press Cambridge, MA
[78] Tash, J.; Russell, S., Control strategies for a stochastic planner, (Proc. AAAI-94, Seattle, WA (1994)), 1079-1085
[79] Tatman, J. A.; Shachter, R. D., Dynamic programming and influence diagrams, IEEE Trans. Systems Man Cybernet., Vol. 20, 2, 365-379 (1990) · Zbl 0715.90094
[80] Tesauro, G. J., TD-Gammon, a self-teaching backgammon program, achieves master-level play, Neural Computation, Vol. 6, 215-219 (1994)
[81] Tsitsiklis, J. N.; Van Roy, B., Feature-based methods for large scale dynamic programming, Machine Learning, Vol. 22, 59-94 (1996) · Zbl 1099.90586
[82] Utgoff, P. E., Decision tree induction based on efficient tree restructuring, Technical Report 95-18 (1995), University of Massachusetts
[83] Waldinger, R., Achieving several goals simultaneously, (Elcock, E.; Michie, D., Machine Intelligence 8: Machine Representations of Knowledge (1977), Ellis Horwood: Ellis Horwood Chichester, England), 94-136
[84] Watkins, C. J.C. H.; Dayan, P., Q-learning, Machine Learning, Vol. 8, 279-292 (1992) · Zbl 0773.68062