##
**Adaptive Markov control processes.**
*(English)*
Zbl 0698.90053

Applied Mathematical Sciences, 79. New York, NY etc.: Springer-Verlag. XIV, 148 p. DM 78.00 (1989).

The purpose of this book is to present some recent developments in the theory of adaptive controlled Markov processes (CMP’s), also known as Markov decision processes or Markov dynamic programs; that is, CMP’s that depend on unknown parameters. Thus, at each decision time, the controller or decision-maker must estimate the true parameter values and then adapt the control actions to the estimated values.

The material is divided into six chapters. The objective of the first chapter is to introduce stochastic control processes; a brief description of some applications is also provided. General control systems, such as non-stationary CMP’s and semi-Markov control models, are briefly discussed.

Two notions of optimality are considered: one is the standard concept of discount optimality, while the other is an asymptotic definition introduced by Schäl to study adaptive control problems in the discounted case.

Section 2.3 of the second chapter relates asymptotic discount optimality to a function that measures the “discrepancy” between an optimal action in state \(x\) and any other action \(a\in A(x)\), where \(A(x)\) denotes the set of admissible controls in state \(x\).
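
As a hedged illustration (the notation is assumed here, not quoted from the review): with discount factor \(\beta\in(0,1)\), one-step reward \(r\), transition law \(q\) and optimal discounted value function \(v^*\), such a discrepancy function can be written as

\[
\phi(x,a) \;=\; r(x,a) + \beta\int_X v^*(y)\,q(dy\mid x,a) \;-\; v^*(x), \qquad a\in A(x).
\]

Under the dynamic programming equation \(v^*(x)=\max_{a\in A(x)}\bigl[r(x,a)+\beta\int_X v^*(y)\,q(dy\mid x,a)\bigr]\), one has \(\phi(x,a)\le 0\), with equality exactly when \(a\) attains the maximum. Roughly speaking, a policy is then asymptotically discount optimal when \(\phi(x_t,a_t)\to 0\) along the state-action trajectory it generates.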

In Section 2.4 the author develops a nonstationary value iteration (NVI) procedure to approximate dynamic programs and to obtain asymptotically discount optimal (ADO) policies. A finite-state approximation scheme for denumerable state controlled Markov processes is also given.
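
The idea behind nonstationary value iteration can be sketched as follows. This is a minimal illustration, assuming finite state and action sets and a hypothetical sequence of models \((r_t, q_t)\) converging to a limit model \((r, q)\). The function names and the toy model are illustrative assumptions, not the book's construction.

```python
def nvi_step(v, r, q, beta):
    """One value-iteration step for a finite model.

    v[x]       : current value estimate for state x
    r[x][a]    : one-step reward
    q[x][a][y] : transition probability from x to y under action a
    Returns the updated values and a maximizing action per state.
    """
    n = len(v)
    new_v, policy = [], []
    for x in range(n):
        values = [
            r[x][a] + beta * sum(q[x][a][y] * v[y] for y in range(n))
            for a in range(len(r[x]))
        ]
        best = max(range(len(values)), key=values.__getitem__)
        new_v.append(values[best])
        policy.append(best)
    return new_v, policy


def nonstationary_value_iteration(models, beta, n_states):
    """Apply nvi_step with a different model (r_t, q_t) at each step t."""
    v, policy = [0.0] * n_states, [0] * n_states
    for r_t, q_t in models:
        v, policy = nvi_step(v, r_t, q_t, beta)
    return v, policy


# Toy limit model (assumed for illustration): 2 states, 2 actions.
R = [[1.0, 0.0], [0.0, 2.0]]
Q = [[[0.9, 0.1], [0.1, 0.9]],
     [[0.8, 0.2], [0.2, 0.8]]]
BETA = 0.9


def perturbed_models(steps):
    """Models whose rewards converge geometrically to R; transitions stay at Q."""
    for t in range(steps):
        eps = 0.5 ** t
        yield [[r_xa + eps for r_xa in row] for row in R], Q


v_nvi, policy_nvi = nonstationary_value_iteration(perturbed_models(400), BETA, 2)
v_limit, policy_limit = nonstationary_value_iteration(((R, Q) for _ in range(400)), BETA, 2)
```

In this sketch the values computed from the converging models end up close to those of ordinary value iteration on the limit model, which is the behaviour an NVI scheme is designed to deliver.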

The following two sections study adaptive control problems, that is, Markov control models (MCM’s) \((X, A, q(\theta), r(\theta))\), where the state space \(X\) and the control set \(A\) are Borel spaces, \(q(\theta)\) is the transition law and \(r(\theta)\) is a one-step reward function, both depending on an unknown parameter \(\theta\). Section 2.6 treats the particular case in which the unknown parameter is the distribution of the disturbance process \(\{\xi_t\}\) in a discrete-time system \(x_{t+1}=F(x_t,a_t,\xi_t)\).

The objective of the third chapter is to give a unified presentation of several results on the approximation and adaptive control of average-reward controlled Markov processes. Sections 3.2-3.4 establish optimality conditions that assume the existence of a bounded solution to the optimality equation, several sufficient (ergodicity) conditions for the existence of such a solution, and several uniform approximations to the optimal value function. Sections 3.5 and 3.6 give conditions for the convergence of a sequence of MCM’s to a limit MCM; uniform approximations to the optimal value function of the limit MCM, as well as optimal policies, are also provided.

Section 4.2 introduces the concept of a partially observable (Markov) control model (PO-CM, for short) and then defines the partially observable (PO) control problem. An important difference between the PO control problem and the standard “completely observable” (CO) problem of the previous chapters is that policies in the former are defined in terms of the “observable” histories, not in terms of the (unobservable) state process \(\{x_t\}\). Section 4.3 shows how the PO control problem can be transformed into a CO control problem in which the new “state” process \(z_t\) is the conditional (or a posteriori) distribution of the unobservable state \(x_t\). Using the equivalence of the two problems, optimality conditions for the PO control problem are formulated in terms of the new CO problem. Adaptive policies for PO-CM’s depending on unknown parameters are then considered, in two steps. First, the PO-CM is transformed into a new completely observable (CO) control model, and conditions are imposed so that this CO-CM satisfies the usual compactness and continuity conditions. Second, the results on adaptive control from Chapter 2 are applied to the CO-CM.
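
For intuition, the transformed “state” \(z_t\) can be computed by a Bayes update when everything is finite. The sketch below is only an illustration under assumptions not stated in the review: finite state, action and observation sets, and an observation that depends only on the new state. The names `belief_update`, `P` and `Q` are hypothetical.

```python
def belief_update(z, a, y, P, Q):
    """Bayes update of the posterior distribution over the hidden state.

    z[x]        : current belief (probability that the hidden state is x)
    a           : action just applied
    y           : observation just received
    P[a][x][x2] : probability of moving from x to x2 under action a
    Q[x2][y]    : probability of observing y when the new state is x2
    """
    n = len(z)
    # Prediction step: push the belief through the transition law.
    predicted = [sum(z[x] * P[a][x][x2] for x in range(n)) for x2 in range(n)]
    # Correction step: weight by the observation likelihood and normalize.
    unnormalized = [predicted[x2] * Q[x2][y] for x2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return [u / total for u in unnormalized]


# Toy example (assumed): two hidden states, one action that leaves the state unchanged.
P = [[[1.0, 0.0], [0.0, 1.0]]]
Q = [[0.8, 0.2], [0.3, 0.7]]
z1 = belief_update([0.5, 0.5], a=0, y=0, P=P, Q=Q)
```

Because the new belief depends only on the previous belief, the action and the observation, the belief process is itself a completely observable controlled Markov process, which is what makes the reduction to a CO problem possible.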

Chapter 5 presents a statistical method for obtaining a sequence of “strongly consistent” estimators of \(\theta^*\), the “true” (unknown) parameter value. The concept of a contrast function is introduced, and examples illustrate how the minimum contrast method, under suitable “identifiability” conditions, includes some commonly used statistical parameter-estimation methods. Minimum contrast estimators (MCE’s) are defined, and sufficient conditions for their strong consistency are given.
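
To make the idea concrete, here is a minimal sketch in which the contrast is a squared prediction error, so that the minimum contrast estimator reduces to least squares. The scalar model \(x_{t+1}=\theta x_t+\xi_t\) (with no control), the Gaussian noise, the grid over a compact parameter set and all function names are assumptions chosen for illustration. They are not taken from the book.

```python
import random


def contrast(theta, x, y):
    """Squared one-step prediction error for the assumed model y = theta * x + noise."""
    return (y - theta * x) ** 2


def minimum_contrast_estimate(pairs, grid):
    """Return the grid point minimizing the accumulated contrast over observed (x, y) pairs."""
    return min(grid, key=lambda th: sum(contrast(th, x, y) for x, y in pairs))


def simulate(theta_star, n, seed=0):
    """Generate n consecutive (x_t, x_{t+1}) pairs from the assumed linear model."""
    rng = random.Random(seed)
    x, pairs = 0.0, []
    for _ in range(n):
        y = theta_star * x + rng.gauss(0.0, 1.0)
        pairs.append((x, y))
        x = y
    return pairs


# Compact parameter set [-1, 1], discretized with step 0.01.
grid = [i / 100.0 for i in range(-100, 101)]
pairs = simulate(theta_star=0.5, n=2000)
theta_hat = minimum_contrast_estimate(pairs, grid)
```

With more observations the minimizer should settle near the true value, which is the behaviour that strong consistency formalizes.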

The last chapter of the book considers the standard state-space discretization and shows how it can be extended to yield recursive approximations to adaptive and non-adaptive Markov control problems with a discounted reward criterion.

The prerequisite for this book is a knowledge of real analysis and probability theory, but no previous knowledge of control or decision processes is required. The presentation is meant to be self-contained in the sense that, whenever a result from analysis or probability is used, it is usually stated in full and references are supplied for further discussion, if necessary. Several appendices are provided for this purpose.

Reviewer: G.Dimitriu

### MSC:

90C40 | Markov and semi-Markov decision processes |

90-02 | Research exposition (monographs, survey articles) pertaining to operations research and mathematical programming |

93-02 | Research exposition (monographs, survey articles) pertaining to systems and control theory |

60Jxx | Markov processes |

60-02 | Research exposition (monographs, survey articles) pertaining to probability theory |

93C40 | Adaptive control/observation systems |