## **Variance-penalized Markov decision processes.**
*(English)*
Zbl 0676.90096

For a finite state, finite action, discrete time Markov decision process: \(E\) is the set of states; \(A(i)\) is the set of actions for \(i\in E\); \(r_{ia}\) is the immediate expected reward if action \(a\in A(i)\) is taken in state \(i\); \(p_{iaj}\) is the transition probability from state \(i\) to state \(j\) given action \(a\in A(i)\); \(\beta\) is the initial state probability vector; \(X(\pi)\) is the set of limit points of state-action probability vectors for a general policy \(\pi\); \(C_1\) is the set of policies \(\pi\) for which \(X(\pi)\) is a singleton; \(x_{ia}(\beta,\pi)\) is the limiting probability of being in state \(i\) and taking action \(a\in A(i)\), given \(\pi\in C_1\) and \(\beta\).

The set \(X(\beta)=\{x_{ia}(\beta,\pi)\}\) is the set of feasible solutions to the linear constraints: \[ \sum_{i,a}(\delta_{ij}-p_{iaj})x_{ia}=0,\quad j\in E, \]

\[ \sum_{a}x_{ja}+\sum_{i,a}(\delta_{ij}-p_{iaj})y_{ia}=\beta_j,\quad j\in E, \]

\[ x_{ia},y_{ia}\geq 0,\quad i\in E,\quad a\in A(i). \] The paper deals with problems in which certain functions of the variance of the rewards are to be optimized over \(X(\beta)\), for the average reward and discounted reward, infinite horizon processes. For the average reward case, one formulation is: \[ \max_{\pi\in C_1}[\Phi(\beta,\pi)-\lambda V(\pi)], \] where \(\Phi(\beta,\pi)\) is the long run average expected reward and \(V(\pi)\) is the long run variance of the reward at any time about \(\Phi(\beta,\pi)\), given \(\beta\) and \(\pi\).
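The reduction from a maximization over policies to a quadratic program in \(x\) rests on a standard variance expansion, spelled out here for clarity: given \(\pi\in C_1\), the limiting reward distribution puts mass \(x_{ia}\) on the value \(r_{ia}\), so \[ \Phi(\beta,\pi)=\sum_{i,a}r_{ia}x_{ia},\qquad V(\pi)=\sum_{i,a}r^2_{ia}x_{ia}-\Big(\sum_{i,a}r_{ia}x_{ia}\Big)^2. \] Substituting these into the variance-penalized objective \(\Phi(\beta,\pi)-\lambda V(\pi)\) gives the equivalent problem in \(x\).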

Solving the above problem is equivalent to solving the problem: \[ \max_{x\in X(\beta)}\Big[\sum_{i,a}r_{ia}x_{ia}-\lambda\sum_{i,a}r^2_{ia}x_{ia}+\lambda\Big(\sum_{i,a}r_{ia}x_{ia}\Big)^2\Big]. \] Theorem 2.2 proves this equivalence and, in addition, proves that solutions to the optimization problem are \((\Phi(\beta,\pi),V(\pi))\)-efficient.
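As a concrete illustration of the quadratic objective (the two-state MDP below, its rewards, transition probabilities, and the penalty weight \(\lambda\) are invented for this sketch, not taken from the paper), one can evaluate \(\sum r_{ia}x_{ia}-\lambda\sum r^2_{ia}x_{ia}+\lambda(\sum r_{ia}x_{ia})^2\) over the stationary deterministic policies, obtaining each policy's limiting state-action probabilities \(x_{ia}\) from the stationary distribution of the Markov chain it induces:

```python
import itertools

import numpy as np

# Toy 2-state, 2-action MDP; all numbers are illustrative, not from the paper.
# P[(i, a)] is the row of transition probabilities p_{iaj} out of state i
# under action a; r[(i, a)] is the immediate expected reward r_{ia}.
P = {
    (0, 0): np.array([0.9, 0.1]),
    (0, 1): np.array([0.2, 0.8]),
    (1, 0): np.array([0.1, 0.9]),
    (1, 1): np.array([0.7, 0.3]),
}
r = {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 2.0, (1, 1): 0.5}
lam = 0.5  # variance-penalty weight lambda


def objective(policy):
    """Evaluate sum r x - lam * sum r^2 x + lam * (sum r x)^2 for a
    stationary deterministic policy (action in state 0, action in state 1)."""
    # Transition matrix of the chain induced by the policy.
    Ppi = np.vstack([P[(i, policy[i])] for i in range(2)])
    # Stationary distribution: left eigenvector of Ppi for eigenvalue 1.
    eigvals, eigvecs = np.linalg.eig(Ppi.T)
    v = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    stat = v / v.sum()
    # Limiting state-action probabilities x_{ia} (zero off the policy).
    x = {(i, policy[i]): stat[i] for i in range(2)}
    phi = sum(r[k] * x[k] for k in x)           # long run average reward
    second = sum(r[k] ** 2 * x[k] for k in x)   # second moment of the reward
    return phi - lam * second + lam * phi ** 2  # = phi - lam * variance


best = max(itertools.product([0, 1], repeat=2), key=objective)
print("best policy:", best, "objective:", objective(best))
```

For this unichain example the four deterministic policies exhaust the extreme points of \(X(\beta)\), so brute-force enumeration suffices; the point of the paper's formulation is that the same objective can be optimized over the polytope \(X(\beta)\) directly.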

Other results are proved for the discounted problem, using several definitions of variance.


Reviewer: D.J.White

### MSC:

90C40 | Markov and semi-Markov decision processes |