## Continuous-time Markov decision processes. Theory and applications.(English)Zbl 1209.90002

Stochastic Modelling and Applied Probability 62. Berlin: Springer (ISBN 978-3-642-02546-4/hbk; 978-3-642-26072-8/pbk). xvii, 231 p. (2009).
A continuous-time Markov decision process is a mathematical model of a controlled stochastic process with piecewise constant trajectories, when a selected action can be changed any time. If the model is controlled by a stationary policy, that is, an action is a function of the current state, the underlying process is a continuous-time jump Markov process. Sometimes, this class of controlled systems is called a continuous-time jump Markov decision process to underline the discrete nature of the trajectories, but the name continuous-time Markov decision process is often used for this class of controlled processes in spite of the fact that, in a broader sense, a continuous-time Markov process may not have piecewise-constant trajectories and certain controlled Markovian models, such as controlled diffusion, do not belong to this class of models.
The first studies of continuous-time Markov decision processes were conducted in the fifties and sixties by R. Bellman [Dynamic Programming. Princeton, New Jersey: Princeton University Press (1957; Zbl 0077.13605)], R. A. Howard [Dynamic Programming and Markov Processes. New York-London: John Wiley & Sons, Inc. and the Technology Press of the Massachusetts Institute of Technology (1960; Zbl 0091.16001)], L. E. Zachrisson [Ann. Math. Stud. 52, 211–253 (1964; Zbl 0126.36507)], V. V. Rykov [Theor. Probab. Appl. 11, 302-311 (1966); translation from Teor. Veroyatn. Primen. 11, 343–351 (1966; Zbl 0262.60063)], and A. Martin-Löf [Oper. Res. 15, 872–881 (1967; Zbl 0149.38104)]. Systematic studies of continuous-time Markov decision processes started with the papers by B. L. Miller [SIAM J. Control 6, 266–280 (1968; Zbl 0162.23302); J. Math. Anal. Appl. 22, 552–569 (1968; Zbl 0157.50301)].
Miller’s results on models with finite state sets were quickly extended to infinite state spaces under the condition that the jump rate is uniformly bounded. The results were similar to the corresponding results for discrete-time Markov Decision Processes. This striking similarity was explained by S. A. Lippman [Oper. Res. 23, 667–710 (1975; Zbl 0312.60048)], P. Kakumanu [Naval Res. Logist. Quart. 24, 431–439 (1977; Zbl 0371.90121)] and R. F. Serfozo [Oper. Res. 27, 616–620 (1979; Zbl 0413.90079)], who showed that continuous-time models with bounded jump rates can be transformed into equivalent discrete-time models. There are several observations that lead to this conclusion. One of the most popular methods of such a reduction is uniformization. This method is applicable only to models with bounded jump rates. Other methods, such as the equivalence of the corresponding optimality equations and transformations of discounted models, can be applied under some conditions to models with unbounded jump rates.
Though the discovery of uniformization and other reduction methods reduced the interest to the topic of continuous-time Markov decision processes, several publications established the results that did not follow from the known reduction even for the case of bounded jump rates. Examples include the papers by M. R. Lembersky [Ann. Stat. 2, 159–169 (1974; Zbl 0272.90083)] on finite-horizon models, A. A. Yushkevich and E. A. Feinberg [Teor. Veroyatn. Primen. 24, 155–160 (1979; Zbl 0399.93068)] on canonical policies, and E. A. Feinberg [Math. Oper. Res. 29, No. 3, 492–524 (2004; Zbl 1082.90126)], where another reduction scheme to discrete-time was introduced and the equivalence of different classes of randomized policies was established and applied to constrained problems.
A continuous-time Markov decision process is defined by the state and action sets, transition intensities, and rewards. After the model is defined, an important question is how to define stochastic processes associated with the initial states and selected strategies. In discrete time, this can be done either by using the Ionescu Tulcea theorem or by the Kolmogorov extension theorem. In continuous time the situation is more complicated and the first publications on this topic dealt only with Markov policies. In this case the underlying stochastic process is a nonstationary Markov processes defined by Kolmogorov’s equations. A. A. Yushkevich [Theory Probab. Appl. 22 (1977), 215–235 (1978); translation from Teor. Veroyatn. Primen 22, 222–241 (1977; Zbl 0379.93052)] introduced general past-dependent policies by using the Ionescu Tulcea theorem, and M. Yu. Kitaev [Theory Probab. Appl. 30, 272–288 (1986; Zbl 0586.90093)] provided an elegant way to define stochastic processes by using J. Jacod’s theorem [Z. Wahrscheinlichkeitstheor. Verw. Geb. 31, 235–253 (1975; Zbl 0302.60032)] that states that a jump process on its canonical space of trajectories is defined by its compensator.
The book under review focuses on problems with unbounded jump rates with the exception of Chapter 3, which deals with finite state and action sets. The current theory of continuous-time Markov decision processes with unbounded jump rates has been mostly developed by the authors of the book and their coauthors. The only earlier publication on this topic I am familiar with is [J. Bather, Adv. Appl. Probab. 8, 144–158 (1976; Zbl 0342.60045)].
The book consists of 12 chapters. Chapter 1 provides an introduction and a summary of the book. Chapter 2 introduces major definitions. Chapter 3 is entitled “Average Optimality for Finite Models” and this usually means that the chapter addresses average rewards per unit time. In fact, this chapter deals with the concepts of more sensitive $$n$$-bias optimal policies introduced in discrete time by Blackwell and Veinott. Thus the results of this chapter, except Section 3.5.1, which deals with the case $$n=-1$$ corresponding to average optimality, are applicable to more than just average optimality. The algorithms provided in this section are strikingly close to the algorithms in discrete time, and I cannot imagine that they cannot be deduced from discrete time. As the authors explain in the notes on pp. 52–53, the advantage of their presentation being based entirely on continuous time is that “... [the] proofs are direct and simple. Moreover, the approach is self-contained in the sense that it does not require results from other problems, for instance, from discount discrete-time or continuous-time MDPs.”
The remainder of the book deals with denumerable state spaces. Chapters 4 and 5 deal with discount and average optimality for nonnegative costs, respectively. Chapters 6 and 7 deal with discount and average optimality for unbounded rewards. Chapter 8 presents the results on the average pathwise optimality. Chapter 9 describes results on advanced optimality criteria, some of which were presented in Chapter 3 for problems with finite state and action sets. Chapter 10 presents results on variance minimization, and Chapters 11 and 12 describe constrained optimization for discount and average criteria.
As far as I know, this is the first monograph on continuous-time Markov decision processes. The only earlier book I can recall is [Lectures on continuous-time Markov control processes. Aportaciones Matematicas, Textos. 3 (Nivel avanzado). México: Sociedad Matematica Mexicana (1994; Zbl 0866.93102)] by the second author, O. Hernándes-Lerma, who is one of the leading experts worldwide on Markov decision processes and the author of several very successful books, including the book on adaptive control and three books with Jean Bernard Lasserre. However, these lectures are relatively short. They cover other topics, including controlled diffusion, in addition to continuous-time jump Markov decision processes, and almost all the material presented in the book under review was developed after 1994.
However, some of the important results, topics and concepts on continuous-time Markov decision processes are not included in this book. For example, the most general class of policies considered is the set of randomized Markov policies. In addition to Kitaev’s and Feinberg’s papers mentioned above, the construction of general policies via Jacod’s theorem can be found in the book by M. Yu. Kitaev and V. V. Rykov [Controlled Queueing Systems. Boca Raton, FL: CRC Press (1995; Zbl 0876.60077)]. I did not find in the book descriptions of the main reduction techniques to discrete time, including uniformization. It is mentioned that uniformization is not applicable to the reduction of problems with unbounded jump rates to discrete time. However, there are other reduction techniques.
Constrained Markov decision processes are presented only for two criteria and the answer is provided in terms of so-called randomized policies, which are known under the name of relaxed controls in control theory and can be implemented only when action sets are convex. Solutions in the form of implementable randomized policies, which are described in my paper cited above [loc. cit. (2004; Zbl 1082.90126)], are not provided in the book. Having said this, I would like to add that the book covers many interesting topics and results.
This is an important book written by leading experts on a mathematically rich topic which has many applications to engineering, business, and biological problems. The area of continuous-time Markov decision processes is not completely developed and I am sure that a significant amount of research will be conducted in the future. I think scholars and students interested in developing the theory of continuous-time Markov decision processes or working on their applications should have this book.

### MSC:

 90-02 Research exposition (monographs, survey articles) pertaining to operations research and mathematical programming 90C40 Markov and semi-Markov decision processes