zbMATH — the first resource for mathematics

Learning algorithms for Markov decision processes. (English) Zbl 0631.90085
The paper presents a finite Markov decision chain with unknown random one-stage rewards and an unknown transition law. At each stage the state of the process is observed and, after a decision is chosen, the actual reward is observed as well. From these observations the estimates of the transition law and of the reward distributions are updated. An optimal decision for the next stage is then determined by one successive-approximation step with the updated parameters. This decision is not applied directly, however; instead it is used to update a randomized decision rule by a kind of relaxation. It is shown that, under some reasonable conditions, the randomized policy thus obtained is average optimal for the true parameters, and that the estimates converge a.s. to the true transition law and reward distributions.
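The scheme described above can be sketched in code. The following is a minimal illustration, not the paper's algorithm: all names, the discounted one-step update (the paper treats the average-reward criterion), the step size 1/(t+2), and the one-sample initialization are this sketch's own assumptions.

```python
import random

def adaptive_mdp_sketch(n_states, n_actions, step, n_stages, beta=0.9, seed=0):
    """Illustrative sketch of an adaptive scheme in the spirit of the reviewed
    paper (details are this sketch's choices, not the paper's): maintain
    empirical estimates of the transition law and mean rewards, perform one
    successive-approximation (value-iteration) step per stage with the current
    estimates, and relax the randomized decision rule toward the greedy
    decision instead of applying it directly.
    `step(s, a, rng)` returns (reward, next_state) from the true, unknown MDP.
    """
    rng = random.Random(seed)
    # transition counts with a uniform prior of 1 so estimates are proper
    counts = [[[1] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    r_sum = [[0.0] * n_actions for _ in range(n_states)]
    r_cnt = [[0] * n_actions for _ in range(n_states)]
    # initialize estimates with one sample per (s, a) -- a convenience of
    # this sketch, not part of the reviewed scheme
    for s in range(n_states):
        for a in range(n_actions):
            r, s2 = step(s, a, rng)
            r_sum[s][a] += r; r_cnt[s][a] += 1; counts[s][a][s2] += 1
    pi = [[1.0 / n_actions] * n_actions for _ in range(n_states)]  # randomized rule
    v = [0.0] * n_states
    s = 0
    for t in range(n_stages):
        # act according to the current randomized decision rule, observe
        a = rng.choices(range(n_actions), weights=pi[s])[0]
        r, s2 = step(s, a, rng)
        r_sum[s][a] += r; r_cnt[s][a] += 1; counts[s][a][s2] += 1
        # one successive-approximation step with the updated estimates
        q = [[0.0] * n_actions for _ in range(n_states)]
        for x in range(n_states):
            for b in range(n_actions):
                tot = sum(counts[x][b])
                r_hat = r_sum[x][b] / max(r_cnt[x][b], 1)
                q[x][b] = r_hat + beta * sum(
                    counts[x][b][y] / tot * v[y] for y in range(n_states))
        v = [max(q[x]) for x in range(n_states)]
        greedy = [max(range(n_actions), key=lambda b: q[x][b])
                  for x in range(n_states)]
        # relaxation: move the randomized rule toward the greedy decision
        eta = 1.0 / (t + 2)
        for x in range(n_states):
            for b in range(n_actions):
                target = 1.0 if b == greedy[x] else 0.0
                pi[x][b] = (1 - eta) * pi[x][b] + eta * target
        s = s2
    return pi, counts
```

With a diminishing relaxation step size, the randomized rule concentrates on the greedy decision while the empirical transition and reward estimates are driven to the truth by the continued (but vanishing) randomization, mirroring the convergence statement of the paper.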
The article is related to papers by A. Federgruen and P. J. Schweitzer [J. Optimization Theory Appl. 34, 207-241 (1981; Zbl 0457.90083)], R. S. Acosta-Abreu and O. Hernandez-Lerma [Control Cybern. 14, 313-322 (1985; Zbl 0606.90130)], and the reviewer [see the preceding review].
Reviewer: G.Hübner

90C40 Markov and semi-Markov decision processes
Full Text: DOI