Yang, Wenhao; Zhang, Liangyu; Zhang, Zhihua
Toward theoretical understandings of robust Markov decision processes: sample complexity and asymptotics. (English) Zbl 1539.68286
Ann. Stat. 50, No. 6, 3223-3248 (2022).

Summary: In this paper, we study the nonasymptotic and asymptotic performance of the optimal robust policy and value function of robust Markov decision processes (MDPs), where the optimal robust policy and value function are estimated from a generative model. While prior work on the nonasymptotic performance of robust MDPs is restricted to the setting of a KL uncertainty set and the \((s,a)\)-rectangularity assumption, we improve those results and also consider other uncertainty sets, including \(L_1\) and \(\chi^2\) balls. Our results show that when the uncertainty sets are \((s,a)\)-rectangular, the sample complexity is about \(\widetilde{\mathcal{O}}\bigl(\frac{|\mathcal{S}|^2|\mathcal{A}|}{\varepsilon^2\rho^2(1-\gamma)^4}\bigr)\). In addition, we extend our results from the \((s,a)\)-rectangularity assumption to the \(s\)-rectangularity assumption. In this scenario, the sample complexity varies with the choice of uncertainty set and is generally larger than in the \((s,a)\)-rectangular case. Moreover, we show, both theoretically and empirically, that the optimal robust value function is asymptotically normal at the typical \(\sqrt{n}\) rate under both the \((s,a)\)- and \(s\)-rectangularity assumptions.

Cited in 1 Document

MSC:
68T05 Learning and adaptive systems in artificial intelligence
62C05 General considerations in statistical decision theory
62G05 Nonparametric estimation
90C17 Robustness in mathematical programming
90C40 Markov and semi-Markov decision processes

Keywords: model-based reinforcement learning; robust MDPs; distributional robustness; \(f\)-divergence set

Software: ElemStatLearn
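To make the rectangularity setting concrete, the sketch below is a minimal illustration (not code from the paper): it estimates an empirical transition kernel from generative-model samples and runs robust value iteration with an \((s,a)\)-rectangular \(L_1\) uncertainty ball of radius \(\rho\), using the standard greedy solution of the inner worst-case problem. All names (worst_case_l1, robust_value_iteration, the toy MDP sizes) are our own illustrative choices; \(\rho\) plays the role of the uncertainty-set radius that appears in the sample-complexity bound quoted in the summary.

```python
# Minimal sketch (assumptions, not the authors' code): robust value iteration
# with an (s,a)-rectangular L1 uncertainty ball of radius rho centred at the
# empirical transition probabilities estimated from a generative model.
import numpy as np

def worst_case_l1(p_hat, v, rho):
    """Greedy solution of  min { q @ v : q in simplex, ||q - p_hat||_1 <= rho }."""
    q = p_hat.astype(float).copy()
    order = np.argsort(v)                 # next states sorted by value, ascending
    lo = order[0]                         # lowest-value next state
    budget = min(rho / 2.0, 1.0 - q[lo])  # mass that may be shifted onto it
    q[lo] += budget
    for s in order[::-1]:                 # remove the same mass from high-value states
        if budget <= 0:
            break
        if s == lo:
            continue
        take = min(budget, q[s])
        q[s] -= take
        budget -= take
    return q

def robust_value_iteration(P_hat, R, rho, gamma, iters=500):
    """Iterate V(s) <- max_a [ R(s,a) + gamma * min_{q in U(s,a)} q @ V ]."""
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                q = worst_case_l1(P_hat[s, a], V, rho)
                Q[s, a] = R[s, a] + gamma * (q @ V)
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

# Toy usage: empirical kernel from n generative-model samples per (s, a).
rng = np.random.default_rng(0)
S, A, n, gamma, rho = 5, 2, 1000, 0.9, 0.1
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # unknown in practice
R = rng.uniform(size=(S, A))
counts = np.array([[rng.multinomial(n, P_true[s, a]) for a in range(A)] for s in range(S)])
P_hat = counts / n
V_rob, pi_rob = robust_value_iteration(P_hat, R, rho, gamma)
print(np.round(V_rob, 3), pi_rob)
```

The same template would apply to the other \(f\)-divergence balls discussed in the summary by swapping the inner worst-case subroutine; for KL or \(\chi^2\) balls the inner minimization is typically handled through its convex dual rather than the greedy rule used here.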
References:
[1] AGARWAL, A., KAKADE, S. and YANG, L. F. (2020). Model-based reinforcement learning with a generative model is minimax optimal. In Proceedings of Thirty Third Conference on Learning Theory 67-83.
[2] AGARWAL, R., SCHUURMANS, D. and NOROUZI, M. (2020). An optimistic perspective on offline reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning 104-114.
[3] BEHZADIAN, B., RUSSEL, R. H., PETRIK, M. and HO, C. P. (2021). Optimizing percentile criterion using robust MDPs. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics 1009-1017.
[4] BEN-TAL, A., DEN HERTOG, D., DE WAEGENAERE, A., MELENBERG, B. and RENNEN, G. (2013). Robust solutions of optimization problems affected by uncertain probabilities. Manage. Sci. 59 341-357.
[5] BERTSEKAS, D. P. and TSITSIKLIS, J. N. (1995). Neuro-dynamic programming: An overview. In Proceedings of 1995 34th IEEE Conference on Decision and Control 1 560-564.
[6] Bertsimas, D., Gupta, V. and Kallus, N. (2018). Data-driven robust optimization. Math. Program. 167 235-292. · Zbl 1397.90298 · doi:10.1007/s10107-017-1125-8
[7] Blanchet, J. and Murthy, K. (2019). Quantifying distributional model risk via optimal transport. Math. Oper. Res. 44 565-600. · Zbl 1434.60113 · doi:10.1287/moor.2018.0936
[8] CHEN, J. and JIANG, N. (2019). Information-theoretic considerations in batch reinforcement learning. In Proceedings of the 36th International Conference on Machine Learning 1042-1051.
[9] DAI, B., NACHUM, O., CHOW, Y., LI, L., SZEPESVÁRI, C. and SCHUURMANS, D. (2020). CoinDICE: Off-policy confidence interval estimation. Adv. Neural Inf. Process. Syst. 33 9398-9411.
[10] DANN, C., NEUMANN, G. and PETERS, J. (2014). Policy evaluation with temporal differences: A survey and comparison. J. Mach. Learn. Res. 15 809-883. · Zbl 1317.68150
[11] Delage, E. and Ye, Y. (2010). Distributionally robust optimization under moment uncertainty with application to data-driven problems. Oper. Res. 58 595-612. · Zbl 1228.90064 · doi:10.1287/opre.1090.0741
[12] DERMAN, E. and MANNOR, S. (2020). Distributional robustness and regularization in reinforcement learning. ArXiv preprint. Available at arXiv:2003.02894.
[13] DUAN, Y., JIA, Z. and WANG, M. (2020). Minimax-optimal off-policy evaluation with linear function approximation. In Proceedings of the 37th International Conference on Machine Learning 2701-2709.
[14] DUAN, Y., JIN, C. and LI, Z. (2021). Risk bounds and Rademacher complexity in batch reinforcement learning. In Proceedings of the 38th International Conference on Machine Learning 2892-2902.
[15] DUCHI, J. and NAMKOONG, H. (2019). Variance-based regularization with convex objectives. J. Mach. Learn. Res. 20 Paper No. 68. · Zbl 1489.62193
[16] DUCHI, J. C., GLYNN, P. W. and NAMKOONG, H. (2021). Statistics of robust optimization: A generalized empirical likelihood approach. Math. Oper. Res. 46 946-969. · Zbl 1473.62292 · doi:10.1287/moor.2020.1085
[17] DUCHI, J. C. and NAMKOONG, H. (2021). Learning models with uniform performance via distributionally robust optimization. Ann. Statist. 49 1378-1406. · Zbl 1473.62019 · doi:10.1214/20-aos2004
[18] DUDÍK, M., ERHAN, D., LANGFORD, J. and LI, L. (2014). Doubly robust policy evaluation and optimization. Statist. Sci. 29 485-511. · Zbl 1331.62059 · doi:10.1214/14-STS500
[19] EPSTEIN, L. G. and SCHNEIDER, M. (2003). Recursive multiple-priors. J. Econom. Theory 113 1-31. · Zbl 1107.91360 · doi:10.1016/S0022-0531(03)00097-8
[20] FARAJTABAR, M., CHOW, Y. and GHAVAMZADEH, M. (2018). More robust doubly robust off-policy evaluation. In Proceedings of the 35th International Conference on Machine Learning 1447-1456.
[21] FUJIMOTO, S., MEGER, D. and PRECUP, D. (2019). Off-policy deep reinforcement learning without exploration. In Proceedings of the 36th International Conference on Machine Learning 2052-2062.
[22] GAO, R. and KLEYWEGT, A. J. (2016). Distributionally robust stochastic optimization with Wasserstein distance. ArXiv preprint. Available at arXiv:1604.02199.
[23] GHAVAMZADEH, M., PETRIK, M. and CHOW, Y. (2016). Safe policy improvement by minimizing robust baseline regret. Adv. Neural Inf. Process. Syst. 29 2298-2306.
[24] GHESHLAGHI AZAR, M., MUNOS, R. and KAPPEN, H. J. (2013). Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Mach. Learn. 91 325-349. · Zbl 1295.68180 · doi:10.1007/s10994-013-5368-1
[25] GOYAL, V. and GRAND-CLEMENT, J. (2022). Robust Markov decision processes: Beyond rectangularity. Math. Oper. Res.
[26] GRÜNEWÄLDER, S., LEVER, G., BALDASSARRE, L., PONTIL, M. and GRETTON, A. (2012). Modelling transition dynamics in MDPs with RKHS embeddings. In Proceedings of the 29th International Conference on Machine Learning 1603-1610.
[27] HAARNOJA, T., ZHOU, A., ABBEEL, P. and LEVINE, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning 1861-1870.
[28] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer Series in Statistics. Springer, New York. · Zbl 1273.62005 · doi:10.1007/978-0-387-84858-7
[29] Hirano, K., Imbens, G. W. and Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71 1161-1189. · Zbl 1152.62328 · doi:10.1111/1468-0262.00442
[30] HO, C. P., PETRIK, M. and WIESEMANN, W. (2018). Fast Bellman updates for robust MDPs. In Proceedings of the 35th International Conference on Machine Learning 1979-1988.
[31] HO, C. P., PETRIK, M. and WIESEMANN, W. (2021). Partial policy iteration for \(L_1\)-robust Markov decision processes. J. Mach. Learn. Res. 22 Paper No. 275. · Zbl 07626790
[32] Iyengar, G. N. (2005). Robust dynamic programming. Math. Oper. Res. 30 257-280. · Zbl 1082.90123 · doi:10.1287/moor.1040.0129
[33] JIANG, N. and LI, L. (2016). Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning 652-661.
[34] JIN, C., ALLEN-ZHU, Z., BUBECK, S. and JORDAN, M. I. (2018). Is Q-learning provably efficient? Adv. Neural Inf. Process. Syst. 31.
[35] JIN, C., YANG, Z., WANG, Z. and JORDAN, M. I. (2020). Provably efficient reinforcement learning with linear function approximation. In Proceedings of Thirty Third Conference on Learning Theory 2137-2143.
[36] JIN, Y., YANG, Z. and WANG, Z. (2021). Is pessimism provably efficient for offline RL? In Proceedings of the 38th International Conference on Machine Learning 5084-5096.
[37] JONG, N. K. and STONE, P. (2007). Model-based function approximation in reinforcement learning. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems 1-8.
[38] KALLUS, N. and UEHARA, M. (2020). Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. J. Mach. Learn. Res. 21 Paper No. 167. · Zbl 1525.68113
[39] KAUFMAN, D. L. and SCHAEFER, A. J. (2013). Robust modified policy iteration. INFORMS J. Comput. 25 396-410. · doi:10.1287/ijoc.1120.0509
[40] Lam, H. (2016). Robust sensitivity analysis for stochastic systems. Math. Oper. Res. 41 1248-1275. · Zbl 1361.65008 · doi:10.1287/moor.2015.0776
[41] LAZARIC, A., GHAVAMZADEH, M. and MUNOS, R. (2012). Finite-sample analysis of least-squares policy iteration. J. Mach. Learn. Res. 13 3041-3074. · Zbl 1433.68361
[42] LE, H., VOLOSHIN, C. and YUE, Y. (2019). Batch policy learning under constraints. In Proceedings of the 36th International Conference on Machine Learning 3703-3712.
[43] LEE, J. and RAGINSKY, M. (2018). Minimax statistical learning with Wasserstein distances. Adv. Neural Inf. Process. Syst. 31.
[44] LI, L., MUNOS, R. and SZEPESVÁRI, C. (2015). Toward minimax off-policy value estimation. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics 608-616.
[45] LI, X., YANG, W., ZHANG, Z. and JORDAN, M. I. (2021). Polyak-Ruppert averaged Q-learning is statistically efficient. ArXiv preprint. Available at arXiv:2112.14582.
[46] LILLICRAP, T. P., HUNT, J. J., PRITZEL, A., HEESS, N., EREZ, T., TASSA, Y., SILVER, D. and WIERSTRA, D. (2015). Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations.
[47] LIM, S. H., XU, H. and MANNOR, S. (2013). Reinforcement learning in robust Markov decision processes. Adv. Neural Inf. Process. Syst. 26 701-709.
[48] LIU, Q., LI, L., TANG, Z. and ZHOU, D. (2018). Breaking the curse of horizon: Infinite-horizon off-policy estimation. Adv. Neural Inf. Process. Syst. 31.
[49] MANNOR, S., MEBEL, O. and XU, H. (2012). Lightning does not strike twice: Robust MDPs with coupled uncertainty. In Proceedings of the 29th International Conference on Machine Learning 451-458.
[50] MANNOR, S., SIMESTER, D., SUN, P. and TSITSIKLIS, J. N. (2004). Bias and variance in value function estimation. In Proceedings of the 21st International Conference on Machine Learning 72.
[51] MNIH, V., BADIA, A. P., MIRZA, M., GRAVES, A., LILLICRAP, T., HARLEY, T., SILVER, D. and KAVUKCUOGLU, K. (2016). Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning 1928-1937.
[52] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K. et al. (2015). Human-level control through deep reinforcement learning. Nature 518 529.
[53] MOHRI, M., ROSTAMIZADEH, A. and TALWALKAR, A. (2018). Foundations of Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA. · Zbl 1407.68007
[54] MUNOS, R. and SZEPESVÁRI, C. (2008). Finite-time bounds for fitted value iteration. J. Mach. Learn. Res. 9 815-857. · Zbl 1225.68203
[55] Nilim, A. and El Ghaoui, L. (2005). Robust control of Markov decision processes with uncertain transition matrices. Oper. Res. 53 780-798. · Zbl 1165.90674 · doi:10.1287/opre.1050.0216
[56] PANAGANTI, K. and KALATHIL, D. (2022). Sample complexity of robust reinforcement learning with a generative model. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics 9582-9602.
[57] PENG, X. B., ANDRYCHOWICZ, M., ZAREMBA, W. and ABBEEL, P. (2018). Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA) 3803-3810.
[58] PETRIK, M. (2012). Approximate dynamic programming by minimizing distributionally robust bounds. In Proceedings of the 29th International Conference on Machine Learning 1595-1602.
[59] PETRIK, M. and RUSSEL, R. H. (2019). Beyond confidence regions: Tight Bayesian ambiguity sets for robust MDPs. Adv. Neural Inf. Process. Syst. 32.
[60] PETRIK, M. and SUBRAMANIAN, D. (2014). RAAM: The benefits of robustness in approximating aggregated MDPs in reinforcement learning. Adv. Neural Inf. Process. Syst. 27.
[61] PUTERMAN, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley, New York. · Zbl 0829.90134
[62] QI, Z. and LIAO, P. (2020). Robust batch policy learning in Markov decision processes. ArXiv preprint. Available at arXiv:2011.04185.
[63] SASON, I. and VERDÚ, S. (2016). \(f\)-divergence inequalities. IEEE Trans. Inf. Theory 62 5973-6006. · Zbl 1359.94363 · doi:10.1109/TIT.2016.2603151
[64] Shapiro, A. (2017). Distributionally robust stochastic programming. SIAM J. Optim. 27 2258-2275. · Zbl 1373.90089 · doi:10.1137/16M1058297
[65] SI, N., ZHANG, F., ZHOU, Z. and BLANCHET, J. (2020). Distributionally robust policy evaluation and learning in offline contextual bandits. In Proceedings of the 37th International Conference on Machine Learning 8884-8894.
[66] SIDFORD, A., WANG, M., WU, X., YANG, L. F. and YE, Y. (2018). Near-optimal time and sample complexities for solving Markov decision processes with a generative model. Adv. Neural Inf. Process. Syst. 31.
[67] SILVER, D., HUANG, A., MADDISON, C. J., GUEZ, A., SIFRE, L., VAN DEN DRIESSCHE, G., SCHRITTWIESER, J., ANTONOGLOU, I., PANNEERSHELVAM, V. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529 484-489.
[68] SMIRNOVA, E., DOHMATOB, E. and MARY, J. (2019). Distributionally robust reinforcement learning. ArXiv preprint. Available at arXiv:1902.08708.
[69] SWAMINATHAN, A. and JOACHIMS, T. (2015). The self-normalized estimator for counterfactual learning. Adv. Neural Inf. Process. Syst. 28.
[70] THOMAS, P. and BRUNSKILL, E. (2016). Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning 2139-2148.
[71] van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics 3. Cambridge Univ. Press, Cambridge. · Zbl 0910.62001 · doi:10.1017/CBO9780511802256
[72] Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics 48. Cambridge Univ. Press, Cambridge. · Zbl 1457.62011 · doi:10.1017/9781108627771
[73] WANG, R., FOSTER, D. and KAKADE, S. M. (2020). What are the statistical limits of offline RL with linear function approximation? In Proceedings of the 9th International Conference on Learning Representations.
[74] WIESEMANN, W., KUHN, D. and RUSTEM, B. (2013). Robust Markov decision processes. Math. Oper. Res. 38 153-183. · Zbl 1291.90295 · doi:10.1287/moor.1120.0566
[75] Wozabal, D. (2012). A framework for optimization under ambiguity. Ann. Oper. Res. 193 21-47. · Zbl 1255.91454 · doi:10.1007/s10479-010-0812-0
[76] XIAO, C., WU, Y., MEI, J., DAI, B., LATTIMORE, T., LI, L., SZEPESVARI, C. and SCHUURMANS, D. (2021). On the optimality of batch policy optimization algorithms. In Proceedings of the 38th International Conference on Machine Learning 11362-11371.
[77] XIE, T. and JIANG, N. (2021). Batch value-function approximation with only realizability. In Proceedings of the 38th International Conference on Machine Learning 11404-11413.
[78] XIE, T., MA, Y. and WANG, Y.-X. (2019). Toward optimal off-policy evaluation for reinforcement learning with marginalized importance sampling. Adv. Neural Inf. Process. Syst. 32.
[79] XU, H. and MANNOR, S. (2006). The robustness-performance tradeoff in Markov decision processes. Adv. Neural Inf. Process. Syst. 19.
[80] XU, H. and MANNOR, S. (2009). Parametric regret in uncertain Markov decision processes. In Proceedings of the 48th IEEE Conference on Decision and Control (CDC) Held Jointly with 2009 28th Chinese Control Conference 3606-3613.
[81] YANG, W., ZHANG, L. and ZHANG, Z. (2022). Supplement to "Toward theoretical understandings of robust Markov decision processes: Sample complexity and asymptotics." https://doi.org/10.1214/22-AOS2225SUPP
[82] YIN, M., BAI, Y. and WANG, Y.-X. (2021). Near-optimal provable uniform convergence in offline policy evaluation for reinforcement learning. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics 1567-1575.
[83] YIN, M. and WANG, Y.-X. (2020). Asymptotically efficient off-policy evaluation for tabular reinforcement learning. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics 3948-3958.
[84] ZHAO, W., QUERALTA, J. P. and WESTERLUND, T. (2020). Sim-to-real transfer in deep reinforcement learning for robotics: A survey. In 2020 IEEE Symposium Series on Computational Intelligence (SSCI) 737-744.
[85] ZHOU, Z., BAI, Q., ZHOU, Z., QIU, L., BLANCHET, J. and GLYNN, P. (2021). Finite-sample regret bound for distributionally robust offline tabular reinforcement learning. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics 3331-3339.

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases, these data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.