##
**Planning and acting in partially observable stochastic domains.**
*(English)*
Zbl 0908.68165

Summary: In this paper, we bring techniques from operations research to bear on the problem of choosing optimal actions in partially observable stochastic domains. We begin by introducing the theory of Markov decision processes (mdps) and partially observable mdps (pomdps). We then outline a novel algorithm for solving pomdps off line and show how, in some cases, a finite-memory controller can be extracted from the solution to a pomdp. We conclude with a discussion of how our approach relates to previous work, the complexity of finding exact solutions to pomdps, and of some possibilities for finding approximate solutions.

### MSC:

68T20 | Problem solving in the context of artificial intelligence (heuristics, search strategies, etc.) |

PDF
BibTeX
XML
Cite

\textit{L. P. Kaelbling} et al., Artif. Intell. 101, No. 1--2, 99--134 (1998; Zbl 0908.68165)

Full Text:
DOI

### References:

[1] | Aström, K. J., Optimal control of Markov decision processes with incomplete state estimation, J. Math. Anal. Appl., 10, 174-205 (1995) · Zbl 0137.35803 |

[2] | Bacchus, F.; Boutilier, C.; Grove, A., Rewarding behaviors, (Proceedings Thirteenth National Conference on Artificial Intelligence (AAAI-96). Proceedings Thirteenth National Conference on Artificial Intelligence (AAAI-96), Portland, OR (1996), AAAI Press/MIT Press: AAAI Press/MIT Press Menlo Park, CA), 1160-1167 |

[3] | Bertsekas, D. P., (Dynamic Programming and Optimal Control, Vols. 1 and 2 (1995), Athena Scientific: Athena Scientific Belmont, MA) · Zbl 0904.90170 |

[4] | Blum, A. L.; Furst, M. L., Fast planning through planning graph analysis, Artificial Intelligence, 90, 1-2, 279-298 (1997) · Zbl 1017.68533 |

[5] | Blythe, J., Planning with external events, (Proceedings Tenth Conference on Uncertainty in Artificial Intelligence (UAI-94). Proceedings Tenth Conference on Uncertainty in Artificial Intelligence (UAI-94), Seattle, WA (1994)), 94-101 |

[6] | Boutilier, C.; Poole, D., Computing optimal policies for partially observable decision processes using compact representations, (Proceedings Thirteenth National Conference on Artificial Intelligence (AAAI-96). Proceedings Thirteenth National Conference on Artificial Intelligence (AAAI-96), Portland, OR (1996), AAAI Press/MIT Press: AAAI Press/MIT Press Menlo Park, CA), 1168-1175 |

[7] | Cassandra, A.; Littman, M. L.; Zhang, N. L., Incremental Pruning: a simple, fast, exact method for partially observable Markov decision processes, (Proceedings Thirteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-97) (1997), Morgan Kaufmann: Morgan Kaufmann San Francisco, CA), 54-61 |

[8] | Cassandra, A. R.; Kaelbling, L. P.; Littman, M. L., Acting optimally in partially observable stochastic domains, (Proceedings Twelfth National Conference on Artificial Intelligence (AAAI-94). Proceedings Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, WA (1994)), 1023-1028 |

[9] | Cassandra, A. R., Exact and approximate algorithms for partially observable Markov decision problems, (Ph.D. Thesis (1998), Department of Computer Science, Brown University: Department of Computer Science, Brown University Providence, RI) |

[10] | Cheng, H.-T., Algorithms for partially observable Markov decision processes, (Ph.D. Thesis (1988), University of British Columbia: University of British Columbia Vancouver, BC) |

[11] | Chrisman, L., Reinforcement learning with perceptual aliasing: The perceptual distinctions approach, (Proceedings Tenth National Conference on Artificial Intelligence (AAAI-92). Proceedings Tenth National Conference on Artificial Intelligence (AAAI-92), San Jose, CA (1992), AAAI Press: AAAI Press San Jose, CA), 183-188 |

[12] | Condon, A., The complexity of stochastic games, Inform. and Comput., 96, 2, 203-224 (1992) · Zbl 0756.90103 |

[13] | Dean, T.; Kaelbling, L. P.; Kirman, J.; Nicholson, A., Planning under time constraints in stochastic domains, Artificial Intelligence, 76, 1-2, 35-74 (1995) |

[14] | Draper, D.; Hanks, S.; Weld, D., Probabilistic planning with information gathering and contingent execution, (Technical Report 93-12-04 (1993), University of Washington: University of Washington Seattle, WA) |

[15] | Drummond, M.; Bresina, J., Anytime synthetic projection: maximizing the probability of goal satisfaction, (Proceedings Eighth National Conference on Artificial Intelligence (AAAI-90). Proceedings Eighth National Conference on Artificial Intelligence (AAAI-90), Boston, MA (1990), Morgan Kaufmann: Morgan Kaufmann San Francisco, CA), 138-144 |

[16] | Eagle, J. N., The optimal search for a moving target when the search path is constrained, Oper. Res., 32, 5, 1107-1115 (1984) · Zbl 0562.90035 |

[17] | Fernández-Gaucherand, E.; Arapostathis, A.; Marcus, S. I., On the average cost optimality equation and the structure of optimal policies for partially observable Markov processes, Ann. Oper. Res., 29, 471-512 (1991) |

[18] | Goldman, R. P.; Boddy, M. S., Conditional linear planning, (Hammond, K., The Second International Conference on Artificial Intelligence Planning Systems (1994), AAAI Press/MIT Press: AAAI Press/MIT Press Menlo Park, CA), 80-85 |

[19] | Goldman, R. P.; Boddy, M. S., Epsilon-safe planning, (Proceedings 10th Conference on Uncertainty in Artificial Intelligence (UAI-94). Proceedings 10th Conference on Uncertainty in Artificial Intelligence (UAI-94), Seattle, WA (1994)), 253-261 |

[20] | Goldman, R. P.; Boddy, M. S., Representing uncertainty in simple planners, (Proceedings 4th International Conference on Principles of Knowledge Representation and Reasoning (KR-94). Proceedings 4th International Conference on Principles of Knowledge Representation and Reasoning (KR-94), Bonn, Germany (1994)), 238-245 |

[21] | Haddawy, P.; Hanks, S., Utility models for goal-directed decision-theoretic planners, (Technical Report 93-06-04 (1993), Department of Computer Science and Engineering, University of Washington) |

[22] | Hansen, E. A., Cost-effective sensing during plan execution, (Proceedings Twelfth National Conference on Artificial Intelligence (AAAI-94). Proceedings Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, WA (1994), AAAI Press/MIT Press: AAAI Press/MIT Press Menlo Park, CA), 1029-1035 |

[23] | Hansen, E. A., An improved policy iteration algorithm for partially observable MDPs, Advances in Neural Information Processing Systems, 10 (1998) |

[24] | Howard, R. A., Dynamic Programming and Markov Processes (1960), MIT Press: MIT Press Cambridge, MA · Zbl 0091.16001 |

[25] | Howard, R. A., Information value theory, IEEE Trans. Systems Science and Cybernetics SSC-2, 1, 22-26 (1966) |

[26] | Kaiman, R. E., A new approach to linear filtering and prediction problems, Trans, American Society of Mechanical Engineers, Journal of Basic Engineering, 82, 35-45 (1960) |

[27] | Koenig, S., Optimal probabilistic and decision-theoretic planning using Markovian decision theory, Technical Report UCB/CSD 92/685 (1992), Berkeley, CA |

[28] | Koenig, S.; Simmons, R. G., Risk-sensitive planning with probabilistic decision graphs, (Proceedings 4th International Conference on Principles of Knowledge Representation and Reasoning (KR-94). Proceedings 4th International Conference on Principles of Knowledge Representation and Reasoning (KR-94), Bonn, Germany (1994)), 363-373 |

[29] | Koza, J. R., Genetic Programming: On the Programming of Computers by Means of Natural Selection (1992), MIT Press: MIT Press Cambridge, MA · Zbl 0850.68161 |

[30] | Kushmerick, N.; Hanks, S.; Weld, D. S., An algorithm for probabilistic planning, Artificial Intelligence, 76, 1-2, 239-286 (1995) |

[31] | Lin, S.-H.; Dean, T., Generating optimal policies for high-level plans with conditional branches and loops, (Proceedings Third European Workshop on Planning (1995)), 205-218 |

[32] | Littman, M. L., Memoryless policies: theoretical limitations and practical results, (Cliff, D.; Husbands, P.; Meyer, J.-A.; Wilson, S. W., From Animals to Animals 3: Proceedings Third International Conference on Simulation of Adaptive Behavior (1994), MIT Press: MIT Press Cambridge, MA) |

[33] | Littman, M. L.; Cassandra, A. R.; Kaelbling, L. P., Learning policies for partially observable environments: scaling up, (Prieditis, A.; Russell, S., Proceedings Twelfth International Conference on Machine Learning (1995), Morgan Kaufmann: Morgan Kaufmann San Francisco, CA). (Huhns, M. H.; Singh, M. P., Readings in Agents (1998), Morgan Kaufmann: Morgan Kaufmann San Francisco, CA), 362-370, Reprinted in: |

[34] | Littman, M. L.; Cassandra, A. R.; Kaelbling, L. P., Efficient dynamic-programming updates in partially observable Markov decision processes, (Technical Report CS-95-19 (1996), Brown University: Brown University Providence, RI) |

[36] | Lovejoy, W. S., A survey of algorithmic methods for partially observable Markov decision processes, Ann. Oper. Res., 28, 1, 47-65 (1991) · Zbl 0717.90086 |

[37] | Majercik, S. M.; Littman, M. L., MAXPLAN: a new approach to probabilistic planning, (Technical Report CS-1998-01 (1998), Department of Computer Science, Duke University: Department of Computer Science, Duke University Durham, NC), submitted for review |

[38] | Mansell, T. M., A method for planning given uncertain and incomplete information, (Proceedings 9th Conference on Uncertainty in Artificial Intelligence (UAI-93) (1993), Morgan Kaufmann: Morgan Kaufmann San Mateo, CA), 350-358 |

[39] | McAllester, D.; Rosenblitt, D., Systematic nonlinear planning, (Proceedings 9th National Conference on Artificial Intelligence (AAAI-91). Proceedings 9th National Conference on Artificial Intelligence (AAAI-91), Anaheim, CA (1991)), 634-639 |

[40] | McCallum, R. A., Overcoming incomplete perception with utile distinction memory, (Proceedings Tenth International Conference on Machine Learning (1993), Morgan Kaufmann: Morgan Kaufmann Amherst, MA), 190-196 |

[41] | McCallum, R. A., Instance-based utile distinctions for reinforcement learning with hidden state, (Proceedings Twelfth International Conference on Machine Learning (1995), Morgan Kaufmann: Morgan Kaufmann San Francisco, CA), 387-395 |

[42] | Monahan, G. E., A survey of partially observable Markov decision processes: theory, models, and algorithms, Management Science, 28, 1, 1-16 (1982) · Zbl 0486.90084 |

[43] | Moore, R. C., A formal theory of knowledge and action, (Hobbs, J. R.; Moore, R. C., Formal Theories of the Commonsense World (1985), Ablex Publishing: Ablex Publishing Norwood, NJ), 319-358 |

[44] | Morgenstern, L., Knowledge preconditions for actions and plans, (Proceedings 10th International Joint Conference on Artificial Intelligence (IJCAI-87). Proceedings 10th International Joint Conference on Artificial Intelligence (IJCAI-87), Milan, Italy (1987)), 867-874 |

[45] | Penberthy, J. S.; Weld, D., UCPOP: a sound, complete, partial order planner for ADL, (Proceedings Third International Conference on Principles of Knowledge Representation and Reasoning (KR-92). Proceedings Third International Conference on Principles of Knowledge Representation and Reasoning (KR-92), Cambridge, MA (1992)), 103-114 |

[46] | Peot, M. A.; Smith, D. E., Conditional nonlinear planning, (Proceedings First International Conference on Artificial Intelligence Planning Systems (1992)), 189-197 |

[47] | Platzman, L. K., A feasible computational approach to infinite-horizon partially-observed Markov decision problems, (Technical Report (1981), Georgia Institute of Technology: Georgia Institute of Technology Atlanta, GA) |

[48] | Pryor, L.; Collins, G., Planning for contingencies: a decision-based approach, J. Artif. Intell. Res., 4, 287-339 (1996) |

[49] | Puterman, M. L., Markov Decision Processes—Discrete Stochastic Dynamic Programming (1994), Wiley: Wiley New York, NY · Zbl 0829.90134 |

[50] | Rabiner, L. R., A tutorial on hidden Markov models and selected applications in speech recognition, (Proc. IEEE, 77 (1989)), 257-286, (2) |

[51] | Sawaki, K.; Ichikawa, A., Optimal control for partially observable Markov decision processes over an infinite horizon, J. Oper. Res. Soc. Japan, 21, 1, 1-14 (1978) · Zbl 0386.49008 |

[52] | Schert, R. B.; Levesque, H. J., The frame problem and knowledge-producing actions, (Proceedings 11th National Conference on Artificial Intelligence (AAAI-93). Proceedings 11th National Conference on Artificial Intelligence (AAAI-93), Washington, DC (1993)), 689-697 |

[53] | Schoppers, M. J., Universal plans for reactive robots in unpredictable environments, (Proceedings Tenth International Joint Conference on Artificial Intelligence (IJCAI-87). Proceedings Tenth International Joint Conference on Artificial Intelligence (IJCAI-87), Milan, Italy (1987)), 1039-1046 |

[54] | Schrijver, A., Theory of Linear and Integer Programming (1986), Wiley-Interscience: Wiley-Interscience New York, NY · Zbl 0665.90063 |

[55] | Singh, S. P.; Jaakkola, T.; Jordan, M. I., Model-free reinforcement learning for non-Markovian decision problems, (Proceedings Eleventh International Conference on Machine Learning (1994), Morgan Kaufmann: Morgan Kaufmann San Francisco, CA), 284-292 |

[56] | Smallwood, R. D.; Sondik, E. J., The optimal control of partially observable Markov processes over a finite horizon, Oper. Res., 21, 1071-1088 (1973) · Zbl 0275.93059 |

[57] | Smith, D. E.; Williamson, M., Representation and evaluation of plans with loops, (Working Notes for the 1995 Stanford Spring Symposium on Extended Theories of Action (1995)) |

[58] | Sondik, E., The optimal control of partially observable Markov processes, (Ph.D. Thesis (1971), Stanford University) · Zbl 0379.60067 |

[59] | Sondik, E. J., The optimal control of partially observable Markov processes over the infinite horizon: discounted costs, Oper. Res., 26, 2, 282-304 (1978) · Zbl 0379.60067 |

[60] | Stolcke, A.; Omohundro, S., Hidden Markov model induction by Bayesian model merging, (Hanson, S. J.; Cowan, J. D.; Giles, C. L., Advances in Neural Information Processing Systems, 5 (1993), Morgan Kaufmann: Morgan Kaufmann San Mateo, CA), 11-18 |

[61] | Tash, J.; Russell, S., Control strategies for a stochastic planner, (Proceedings 12th National Conference on Artificial Intelligence (AAAI-94). Proceedings 12th National Conference on Artificial Intelligence (AAAI-94), Seattle, WA (1994)), 1079-1085 |

[62] | Tseng, P., Solving \(H\)-horizon, stationary Markov decision problems in time proportional to log \((H)\), Oper. Res. Lett., 9, 5, 287-297 (1990) · Zbl 0717.90090 |

[63] | White, C. C.; Harrington, D., Application of Jensen’s inequality for adaptive suboptimal design, J. Optim. Theory Appl., 32, 1, 89-99 (1980) · Zbl 0416.90075 |

[64] | White, C. C., Partially observed Markov decision processes: a survey, Ann. Oper. Res., 32 (1991) · Zbl 0727.90089 |

[65] | White, C. C.; Scherer, W. T., Solution procedures for partially observed Markov decision processes, Oper. Res., 37, 5, 791-797 (1989) · Zbl 0684.90099 |

[66] | Williams, R. J.; Baird, L. C., Tight performance bounds on greedy policies based on imperfect value functions, (Technical Report NU-CCS-93-14 (1993), Northeastern University, College of Computer Science: Northeastern University, College of Computer Science Boston, MA) |

[67] | Zhang, N. L.; Liu, W., Planning in stochastic domains: problem characteristics and approximation, (Technical Report HKUST-CS96-31 (1996), Department of Computer Science, Hong Kong University of Science and Technology) |

[68] | Zhao, J.; Schmidhuber, J. H., Incremental self-improvement for life-time multi-agent reinforcement learning, (Maes, P.; Mataric, M. J.; Meyer, J.-A.; Pollack, J.; Wilson, S. W., From Animals to Animals: Proceedings Fourth International Conference on Simulation of Adaptive Behavior (1996), MIT Press: MIT Press Cambridge, MA), 516-525 |

[69] | Zwick, U.; Paterson, M., The complexity of mean payoff games on graphs, Theoret. Comput. Sci., 158, 1-2, 343-359 (1996) · Zbl 0871.68138 |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.