Algorithmic aspects of mean-variance optimization in Markov decision processes. (English) Zbl 1317.90318

Summary: We consider finite horizon Markov decision processes under performance measures that involve both the mean and the variance of the cumulative reward. We show that either randomized or history-based policies can improve performance. We prove that computing a policy that maximizes the mean reward under a variance constraint is NP-hard in some cases and strongly NP-hard in others. Finally, we offer pseudopolynomial exact and approximation algorithms.
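The pseudopolynomial approach rests on a standard idea: when rewards are integers, one can track the exact distribution of the accumulated reward by augmenting the state with the reward earned so far, at a cost polynomial in the number of achievable reward values. As a minimal illustrative sketch (not the authors' algorithm), the following evaluates the mean and variance of the cumulative reward of a fixed policy this way; the data-structure conventions (`P`, `R`, `policy`) are assumptions for the example.

```python
from collections import defaultdict

def reward_distribution(P, R, policy, s0, T):
    """Exact distribution of the T-step cumulative reward under a fixed policy.

    P[s][a]      : dict mapping next state -> transition probability
    R[s][a]      : integer one-step reward
    policy[t][s] : action taken at time t in state s
    """
    # Joint distribution over (state, accumulated reward) pairs.
    dist = {(s0, 0): 1.0}
    for t in range(T):
        nxt = defaultdict(float)
        for (s, w), p in dist.items():
            a = policy[t][s]
            r = R[s][a]
            for s2, q in P[s][a].items():
                nxt[(s2, w + r)] += p * q
        dist = nxt
    # Marginalize out the state, keeping the reward distribution.
    out = defaultdict(float)
    for (s, w), p in dist.items():
        out[w] += p
    return dict(out)

def mean_variance(wdist):
    m = sum(w * p for w, p in wdist.items())
    v = sum((w - m) ** 2 * p for w, p in wdist.items())
    return m, v
```

For example, a three-state chain where the first transition leads with equal probability to an absorbing rewarding state or an absorbing non-rewarding state yields, over a two-step horizon, total reward 1 or 0 with probability 1/2 each, hence mean 0.5 and variance 0.25. The augmented state space has size |S| times the number of reachable reward sums, which is polynomial in the magnitude of the rewards but not in their bit length — the sense in which such algorithms are pseudopolynomial.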


90C40 Markov and semi-Markov decision processes
90C39 Dynamic programming
90C60 Abstract computational complexity for mathematical programming problems


[1] Altman, E., Constrained Markov decision processes, (1999), Chapman and Hall · Zbl 0963.90068
[2] Arlotto, A., Gans, N., & Steele, M. J. (2013). Markov decision problems where means bound variances (Tech. Rep). <https://faculty.fuqua.duke.edu/~aa249/ArlottoGansSteele-MDPsWhereMeansBoundVariances.pdf>.
[3] Artzner, P.; Delbaen, F.; Eber, J.; Heath, D., Coherent measures of risk, Mathematical Finance, 9, 3, 203-228, (1999) · Zbl 0980.91042
[4] Bertsekas, D. (1995). Dynamic programming and optimal control. Athena Scientific. · Zbl 0904.90170
[5] Chung, K.; Sobel, M., Discounted MDP’s: distribution functions and exponential utility maximization, SIAM Journal on Control and Optimization, 25, 1, 49-62, (1987) · Zbl 0617.90085
[6] Collins, E. J., Finite-horizon variance penalised Markov decision processes, Operations-Research-Spektrum, 19, 35-39, (1997) · Zbl 0894.90161
[7] Filar, J.; Kallenberg, L. C.M.; Lee, H. M., Variance-penalised Markov decision processes, Mathematics of Operations Research, 14, 147-161, (1989) · Zbl 0676.90096
[8] Garey, M. R.; Johnson, D. S., Computers and intractability: A guide to the theory of NP-completeness, (1979), W.H. Freeman New York · Zbl 0411.68039
[9] Gittins, J. C., Bandit processes and dynamic allocation indices, Journal of the Royal Statistical Society. Series B (Methodological), 41, 2, 148-177, (1979) · Zbl 0411.62055
[10] Guo, X.; Ye, L.; Yin, G., A mean-variance optimization problem for discounted Markov decision processes, European Journal of Operational Research, 220, 423-429, (2012) · Zbl 1253.90214
[11] Huang, Y.; Kallenberg, L. C.M., On finding optimal policies for Markov decision chains: A unifying framework for mean-variance tradeoffs, Mathematics of Operations Research, 19, 434-448, (1994) · Zbl 0842.90120
[12] Iyengar, G., Robust dynamic programming, Mathematics of Operations Research, 30, 257-280, (2005) · Zbl 1082.90123
[13] Kawai, H., A variance minimisation problem for a Markov decision process, European Journal of Operational Research, 31, 140-145, (1987) · Zbl 0623.90087
[14] Le Tallec, Y. (2007). Robust, risk-sensitive, and data-driven control of Markov decision processes. Unpublished doctoral dissertation, Operations Research Center, MIT, Cambridge, MA.
[15] Liu, Y., & Koenig, S. (2005). Risk-sensitive planning with one-switch utility functions: Value iteration. In Proceedings of the twentieth AAAI conference on artificial intelligence (pp. 993-999).
[16] Liu, Y., & Koenig, S. (2006). Functional value iteration for decision-theoretic planning with general utility functions. In Proceedings of the twenty-first AAAI conference on artificial intelligence (pp. 1186-1193).
[17] Luenberger, D., Investment science, (1997), Oxford University Press
[18] Meyn, S. P., Control techniques for complex networks, (2008), Cambridge University Press New York NY · Zbl 1139.91002
[19] Nilim, A.; El Ghaoui, L., Robust Markov decision processes with uncertain transition matrices, Operations Research, 53, 5, 780-798, (2005) · Zbl 1165.90674
[20] Papadimitriou, C.H., & Yannakakis, M. (2000). On the approximability of trade-offs and optimal access of web sources. In Proceedings of the 41st symposium on foundations of computer science (pp. 86-92). Washington, DC, USA.
[21] Riedel, F., Dynamic coherent risk measures, Stochastic Processes and their Applications, 112, 185-200, (2004) · Zbl 1114.91055
[22] Shapley, L., Stochastic games, Proceedings of the National Academy of Sciences, 39, 1095-1100, (1953) · Zbl 0051.35805
[23] Sobel, M., The variance of discounted Markov decision processes, Journal of Applied Probability, 19, 794-802, (1982) · Zbl 0503.90091
[24] Sutton, R. S.; Barto, A. G., Reinforcement learning: an introduction, (1998), MIT Press
[25] Tamar, A., Di-Castro, D., & Mannor, S. (2012). Policy gradients with variance related risk criteria. In International conference on machine learning.
[26] White, D. J., Computational approaches to variance-penalised Markov decision processes, Operations-Research-Spektrum, 14, 79-83, (1992) · Zbl 0768.90087
[27] Whitt, W., Approximation of dynamic programs - I, Mathematics of Operations Research, 3, 231-243, (1978) · Zbl 0393.90094
[28] Whitt, W., Approximation of dynamic programs - II, Mathematics of Operations Research, 4, 179-185, (1979) · Zbl 0408.90082
[29] Whittle, P., Restless bandits: activity allocation in a changing world, Journal of Applied Probability, 25, 287-298, (1988) · Zbl 0664.90043