Planning and acting in partially observable stochastic domains.

*(English)* Zbl 0908.68165

Summary: In this paper, we bring techniques from operations research to bear on the problem of choosing optimal actions in partially observable stochastic domains. We begin by introducing the theory of Markov decision processes (MDPs) and partially observable MDPs (POMDPs). We then outline a novel algorithm for solving POMDPs off-line and show how, in some cases, a finite-memory controller can be extracted from the solution to a POMDP. We conclude with a discussion of how our approach relates to previous work, the complexity of finding exact solutions to POMDPs, and some possibilities for finding approximate solutions.
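The approach summarized above rests on tracking a belief state, i.e. a probability distribution over the underlying states, which is updated by Bayes' rule after each action and observation. The following is a minimal illustrative sketch of that update (the function name, the toy transition/observation tables, and all numbers are hypothetical, not from the paper, which develops an exact off-line solution algorithm on top of this belief-MDP view):

```python
import numpy as np

def belief_update(b, T, O, a, o):
    """Bayesian belief-state update for a POMDP.

    b       : current belief over states, shape (S,)
    T[a]    : transition matrix, T[a][s, s'] = P(s' | s, a)
    O[a]    : observation matrix, O[a][s', o] = P(o | s', a)
    Returns the updated belief b'(s') ∝ O(o|s',a) * sum_s T(s'|s,a) b(s).
    """
    predicted = b @ T[a]               # predict: sum_s T(s'|s,a) b(s)
    unnorm = predicted * O[a][:, o]    # weight by observation likelihood
    return unnorm / unnorm.sum()       # renormalize to a distribution

# Toy two-state, one-action, two-observation example (made-up numbers)
T = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]])}
O = {0: np.array([[0.7, 0.3],
                  [0.1, 0.9]])}
b = np.array([0.5, 0.5])
b_new = belief_update(b, T, O, a=0, o=0)
print(b_new)   # posterior belief after acting and observing
```

Because this update is a sufficient statistic for the history, the POMDP becomes a fully observable MDP over the (continuous) belief simplex, which is the setting in which the paper's exact algorithm operates.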

##### MSC:

68T20 | Problem solving in the context of artificial intelligence (heuristics, search strategies, etc.) |


*L. P. Kaelbling* et al., Artif. Intell. 101, No. 1--2, 99--134 (1998; Zbl 0908.68165)


##### References:

[1] | Åström, K.J., Optimal control of Markov decision processes with incomplete state estimation, J. math. anal. appl., 10, 174-205, (1965) · Zbl 0137.35803 |

[2] | Bacchus, F.; Boutilier, C.; Grove, A., Rewarding behaviors, (), 1160-1167 |

[3] | Bertsekas, D.P., () |

[4] | Blum, A.L.; Furst, M.L., Fast planning through planning graph analysis, Artificial intelligence, 90, 1-2, 279-298, (1997) · Zbl 1017.68533 |

[5] | Blythe, J., Planning with external events, (), 94-101 |

[6] | Boutilier, C.; Poole, D., Computing optimal policies for partially observable decision processes using compact representations, (), 1168-1175 |

[7] | Cassandra, A.; Littman, M.L.; Zhang, N.L., Incremental pruning: a simple, fast, exact method for partially observable Markov decision processes, (), 54-61 |

[8] | Cassandra, A.R.; Kaelbling, L.P.; Littman, M.L., Acting optimally in partially observable stochastic domains, (), 1023-1028 |

[9] | Cassandra, A.R., Exact and approximate algorithms for partially observable Markov decision problems, () |

[10] | Cheng, H.-T., Algorithms for partially observable Markov decision processes, () |

[11] | Chrisman, L., Reinforcement learning with perceptual aliasing: the perceptual distinctions approach, (), 183-188 |

[12] | Condon, A., The complexity of stochastic games, Inform. and comput., 96, 2, 203-224, (1992) · Zbl 0756.90103 |

[13] | Dean, T.; Kaelbling, L.P.; Kirman, J.; Nicholson, A., Planning under time constraints in stochastic domains, Artificial intelligence, 76, 1-2, 35-74, (1995) |

[14] | Draper, D.; Hanks, S.; Weld, D., Probabilistic planning with information gathering and contingent execution, () |

[15] | Drummond, M.; Bresina, J., Anytime synthetic projection: maximizing the probability of goal satisfaction, (), 138-144 |

[16] | Eagle, J.N., The optimal search for a moving target when the search path is constrained, Oper. res., 32, 5, 1107-1115, (1984) · Zbl 0562.90035 |

[17] | Fernández-Gaucherand, E.; Arapostathis, A.; Marcus, S.I., On the average cost optimality equation and the structure of optimal policies for partially observable Markov processes, Ann. oper. res., 29, 471-512, (1991) |

[18] | Goldman, R.P.; Boddy, M.S., Conditional linear planning, (), 80-85 |

[19] | Goldman, R.P.; Boddy, M.S., Epsilon-safe planning, (), 253-261 |

[20] | Goldman, R.P.; Boddy, M.S., Representing uncertainty in simple planners, (), 238-245 |

[21] | Haddawy, P.; Hanks, S., Utility models for goal-directed decision-theoretic planners, () |

[22] | Hansen, E.A., Cost-effective sensing during plan execution, (), 1029-1035 |

[23] | Hansen, E.A., An improved policy iteration algorithm for partially observable mdps, Advances in neural information processing systems, 10, (1998) |

[24] | Howard, R.A., Dynamic programming and Markov processes, (1960), MIT Press Cambridge, MA · Zbl 0091.16001 |

[25] | Howard, R.A., Information value theory, IEEE trans. systems science and cybernetics SSC-2, 1, 22-26, (1966) |

[26] | Kalman, R.E., A new approach to linear filtering and prediction problems, Trans. American society of mechanical engineers, journal of basic engineering, 82, 35-45, (1960) |

[27] | Koenig, S., Optimal probabilistic and decision-theoretic planning using Markovian decision theory, Technical report UCB/CSD 92/685, (1992), Berkeley, CA |

[28] | Koenig, S.; Simmons, R.G., Risk-sensitive planning with probabilistic decision graphs, (), 363-373 |

[29] | Koza, J.R., Genetic programming: on the programming of computers by means of natural selection, (1992), MIT Press Cambridge, MA · Zbl 0850.68161 |

[30] | Kushmerick, N.; Hanks, S.; Weld, D.S., An algorithm for probabilistic planning, Artificial intelligence, 76, 1-2, 239-286, (1995) |

[31] | Lin, S.-H.; Dean, T., Generating optimal policies for high-level plans with conditional branches and loops, (), 205-218 |

[32] | Littman, M.L., Memoryless policies: theoretical limitations and practical results, () |

[33] | Littman, M.L.; Cassandra, A.R.; Kaelbling, L.P., Learning policies for partially observable environments: scaling up, (), 362-370, Reprinted in: |

[34] | Littman, M.L.; Cassandra, A.R.; Kaelbling, L.P., Efficient dynamic-programming updates in partially observable Markov decision processes, () |

[35] | also Technical Report CS-96-09 |

[36] | Lovejoy, W.S., A survey of algorithmic methods for partially observable Markov decision processes, Ann. oper. res., 28, 1, 47-65, (1991) · Zbl 0717.90086 |

[37] | Majercik, S.M.; Littman, M.L., MAXPLAN: a new approach to probabilistic planning, (), submitted for review |

[38] | Mansell, T.M., A method for planning given uncertain and incomplete information, (), 350-358 |

[39] | McAllester, D.; Rosenblitt, D., Systematic nonlinear planning, (), 634-639 |

[40] | McCallum, R.A., Overcoming incomplete perception with utile distinction memory, (), 190-196 |

[41] | McCallum, R.A., Instance-based utile distinctions for reinforcement learning with hidden state, (), 387-395 |

[42] | Monahan, G.E., A survey of partially observable Markov decision processes: theory, models, and algorithms, Management science, 28, 1, 1-16, (1982) · Zbl 0486.90084 |

[43] | Moore, R.C., A formal theory of knowledge and action, (), 319-358 |

[44] | Morgenstern, L., Knowledge preconditions for actions and plans, (), 867-874 |

[45] | Penberthy, J.S.; Weld, D., UCPOP: a sound, complete, partial order planner for ADL, (), 103-114 |

[46] | Peot, M.A.; Smith, D.E., Conditional nonlinear planning, (), 189-197 |

[47] | Platzman, L.K., A feasible computational approach to infinite-horizon partially-observed Markov decision problems, () |

[48] | Pryor, L.; Collins, G., Planning for contingencies: a decision-based approach, J. artif. intell. res., 4, 287-339, (1996) |

[49] | Puterman, M.L., Markov decision processes—discrete stochastic dynamic programming, (1994), Wiley New York, NY · Zbl 0829.90134 |

[50] | Rabiner, L.R., A tutorial on hidden Markov models and selected applications in speech recognition, (), 257-286, (2) |

[51] | Sawaki, K.; Ichikawa, A., Optimal control for partially observable Markov decision processes over an infinite horizon, J. oper. res. soc. Japan, 21, 1, 1-14, (1978) · Zbl 0386.49008 |

[52] | Scherl, R.B.; Levesque, H.J., The frame problem and knowledge-producing actions, (), 689-697 |

[53] | Schoppers, M.J., Universal plans for reactive robots in unpredictable environments, (), 1039-1046 |

[54] | Schrijver, A., Theory of linear and integer programming, (1986), Wiley-Interscience New York, NY · Zbl 0665.90063 |

[55] | Singh, S.P.; Jaakkola, T.; Jordan, M.I., Model-free reinforcement learning for non-Markovian decision problems, (), 284-292 |

[56] | Smallwood, R.D.; Sondik, E.J., The optimal control of partially observable Markov processes over a finite horizon, Oper. res., 21, 1071-1088, (1973) · Zbl 0275.93059 |

[57] | Smith, D.E.; Williamson, M., Representation and evaluation of plans with loops, () |

[58] | Sondik, E., The optimal control of partially observable Markov processes, () · Zbl 0379.60067 |

[59] | Sondik, E.J., The optimal control of partially observable Markov processes over the infinite horizon: discounted costs, Oper. res., 26, 2, 282-304, (1978) · Zbl 0379.60067 |

[60] | Stolcke, A.; Omohundro, S., Hidden Markov model induction by Bayesian model merging, (), 11-18 |

[61] | Tash, J.; Russell, S., Control strategies for a stochastic planner, (), 1079-1085 |

[62] | Tseng, P., Solving H-horizon, stationary Markov decision problems in time proportional to log(H), Oper. res. lett., 9, 5, 287-297, (1990) · Zbl 0717.90090 |

[63] | White, C.C.; Harrington, D., Application of Jensen’s inequality for adaptive suboptimal design, J. optim. theory appl., 32, 1, 89-99, (1980) · Zbl 0416.90075 |

[64] | White, C.C., Partially observed Markov decision processes: a survey, Ann. oper. res., 32, (1991) · Zbl 0727.90089 |

[65] | White, C.C.; Scherer, W.T., Solution procedures for partially observed Markov decision processes, Oper. res., 37, 5, 791-797, (1989) · Zbl 0684.90099 |

[66] | Williams, R.J.; Baird, L.C., Tight performance bounds on greedy policies based on imperfect value functions, () |

[67] | Zhang, N.L.; Liu, W., Planning in stochastic domains: problem characteristics and approximation, () |

[68] | Zhao, J.; Schmidhuber, J.H., Incremental self-improvement for life-time multi-agent reinforcement learning, (), 516-525 |

[69] | Zwick, U.; Paterson, M., The complexity of mean payoff games on graphs, Theoret. comput. sci., 158, 1-2, 343-359, (1996) · Zbl 0871.68138 |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.