
Policy iteration for continuous-time average reward Markov decision processes in Polish spaces. (English) Zbl 1192.90243

Summary: We study the policy iteration algorithm (PIA) for continuous-time jump Markov decision processes in general state and action spaces. The corresponding transition rates are allowed to be unbounded, and the reward rates may have neither upper nor lower bounds. The optimality criterion we are concerned with is the expected average reward. We propose a set of conditions under which we first establish the average reward optimality equation and present the PIA. Then, under two slightly different sets of conditions, we show that the PIA yields the optimal (maximum) reward, an average optimal stationary policy, and a solution to the average reward optimality equation.
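For orientation, the following is a schematic sketch (not taken from the paper; the notation ρ for the optimal gain, h for the bias function, r(x,a) for the reward rate, A(x) for the admissible actions, and q(dy | x,a) for the conservative transition rate kernel on the state space S is assumed here) of the average reward optimality equation and the two PIA steps in this continuous-time setting; the paper's precise hypotheses and formulation may differ.

\[
\rho \;=\; \sup_{a \in A(x)} \Big[\, r(x,a) + \int_S h(y)\, q(dy \mid x,a) \,\Big], \qquad x \in S .
\]

Given a stationary policy \(f_n\), the evaluation step solves the Poisson-type equation
\[
\rho_n \;=\; r\big(x, f_n(x)\big) + \int_S h_n(y)\, q\big(dy \mid x, f_n(x)\big), \qquad x \in S,
\]
and the improvement step selects
\[
f_{n+1}(x) \;\in\; \arg\max_{a \in A(x)} \Big[\, r(x,a) + \int_S h_n(y)\, q(dy \mid x,a) \,\Big], \qquad x \in S .
\]

Under conditions of the kind described in the summary, the gains \(\rho_n\) and policies \(f_n\) produced by these two alternating steps are the objects whose convergence to the optimal average reward and to an average optimal stationary policy is established.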

MSC:

90C40 Markov and semi-Markov decision processes
60J25 Continuous-time Markov processes on general state spaces
