Q-learning with censored data. (English) Zbl 1246.62206

Summary: We develop a methodology for a multistage decision problem with a flexible number of stages in which the rewards are survival times that are subject to censoring. We present a novel Q-learning algorithm that is adjusted for censored data and allows a flexible number of stages. We provide finite-sample bounds on the generalization error of the policy learned by the algorithm, and show that when the optimal Q-function belongs to the approximation space, the expected survival time for policies obtained by the algorithm converges to that of the optimal policy. We simulate a multistage clinical trial with a flexible number of stages and apply the proposed censored-Q-learning algorithm to find individualized treatment regimens. The methodology presented in this paper has implications for the design of personalized medicine trials in cancer and in other life-threatening diseases.


62P10 Applications of statistics to biology and medical sciences; meta analysis
92C50 Medical applications (general)
62N01 Censored data models
65C60 Computational problems in statistics (MSC2010)


[1] Anthony, M. and Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge Univ. Press, Cambridge. · Zbl 0968.68126
[2] Bellman, R. (1957). Dynamic Programming. Princeton Univ. Press, Princeton, NJ. · Zbl 0077.13605
[3] Biganzoli, E., Boracchi, P., Mariani, L. and Marubini, E. (1998). Feed forward neural networks for the analysis of censored survival data: A partial logistic regression approach. Stat. Med. 17 1169-1186.
[4] Bitouzé, D., Laurent, B. and Massart, P. (1999). A Dvoretzky-Kiefer-Wolfowitz type inequality for the Kaplan-Meier estimator. Ann. Inst. Henri Poincaré Probab. Stat. 35 735-763. · Zbl 1054.62589
[5] Chen, P.-Y. and Tsiatis, A. A. (2001). Causal inference on the difference of the restricted mean lifetime between two groups. Biometrics 57 1030-1038. · Zbl 1209.62267
[6] Goldberg, Y. and Kosorok, M. R. (2012). Supplement to “Q-learning with censored data.” · Zbl 1246.62206
[7] Goldberg, Y. and Kosorok, M. R. (2012). Support vector regression for right censored data. Unpublished manuscript. Available at . · Zbl 1456.62234
[8] Karrison, T. G. (1997). Use of Irwin’s restricted mean as an index for comparing survival in different treatment groups-interpretation and power considerations. Control Clin. Trials 18 151-167.
[9] Kosorok, M. R. (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer, New York. · Zbl 1180.62137
[10] Krzakowski, M., Ramlau, R., Jassem, J., Szczesna, A., Zatloukal, P., Pawel, J. V., Sun, X., Bennouna, J., Santoro, A., Biesma, B., Delgado, F. M., Salhi, Y., Vaissiere, N., Hansen, O., Tan, E.-H., Quoix, E., Garrido, P. and Douillard, J.-Y. (2010). Phase III trial comparing vinflunine with docetaxel in second-line advanced non-small-cell lung cancer previously treated with platinum-containing chemotherapy. J. Clin. Oncol. 28 2167-2173.
[11] Laber, E., Qian, M., Lizotte, D. J. and Murphy, S. A. (2010). Statistical inference in dynamic treatment regimes. Available at .
[12] Lavori, P. W. and Dawson, R. (2004). Dynamic treatment regimes: Practical design considerations. Clin. Trials 1 9-20.
[13] Lunceford, J. K., Davidian, M. and Tsiatis, A. A. (2002). Estimation of survival distributions of treatment policies in two-stage randomization designs in clinical trials. Biometrics 58 48-57. · Zbl 1209.62307
[14] Miyahara, S. and Wahed, A. S. (2010). Weighted Kaplan-Meier estimators for two-stage treatment regimes. Stat. Med. 29 2581-2591.
[15] Moodie, E. E. M., Richardson, T. S. and Stephens, D. A. (2007). Demystifying optimal dynamic treatment regimes. Biometrics 63 447-455. · Zbl 1137.62077
[16] Murphy, S. A. (2003). Optimal dynamic treatment regimes. J. R. Stat. Soc. Ser. B Stat. Methodol. 65 331-366. · Zbl 1065.62006
[17] Murphy, S. A. (2005a). An experimental design for the development of adaptive treatment strategies. Stat. Med. 24 1455-1481.
[18] Murphy, S. A. (2005b). A generalization error for Q-learning. J. Mach. Learn. Res. 6 1073-1097 (electronic). · Zbl 1222.68271
[19] Murphy, S. A., Oslin, D. W., Rush, A. J., Zhu, J. and MCATS (2007). Methodological challenges in constructing effective treatment sequences for chronic psychiatric disorders. Neuropsychopharmacology 32 257-262.
[20] Orellana, L., Rotnitzky, A. and Robins, J. M. (2010). Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, Part I: Main content. Int. J. Biostat. 6 Art. 8, 49.
[21] Robins, J. M. (1999). Association, causation, and marginal structural models. Synthese 121 151-179. · Zbl 1078.62523
[22] Robins, J. M. (2004). Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium in Biostatistics (D. Lin and P. J. Heagerty, eds.) 189-326. Springer, New York. · Zbl 1279.62024
[23] Robins, J., Orellana, L. and Rotnitzky, A. (2008). Estimation and extrapolation of optimal treatment and testing strategies. Stat. Med. 27 4678-4721.
[24] Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. J. Amer. Statist. Assoc. 89 846-866. · Zbl 0815.62043
[25] Satten, G. A. and Datta, S. (2001). The Kaplan-Meier estimator as an inverse-probability-of-censoring weighted average. Amer. Statist. 55 207-210. · Zbl 1182.62191
[26] Shim, J. and Hwang, C. (2009). Support vector censored quantile regression under random censoring. Comput. Statist. Data Anal. 53 912-919. · Zbl 1452.62122
[27] Shivaswamy, P. K., Chu, W. and Jansche, M. (2007). A support vector approach to censored targets. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM 2007), Omaha, Nebraska, USA 655-660. IEEE Computer Society.
[28] Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer, New York. · Zbl 1203.68171
[29] Stinchcombe, T. E. and Socinski, M. A. (2008). Considerations for second-line therapy of non-small cell lung cancer. Oncologist 13 28-36.
[30] Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
[31] Thall, P. F., Wooten, L. H., Logothetis, C. J., Millikan, R. E. and Tannir, N. M. (2007). Bayesian and frequentist two-stage treatment strategies based on sequential failure times subject to interval censoring. Stat. Med. 26 4687-4702.
[32] Tsitsiklis, J. N. and van Roy, B. (1996). Feature-based methods for large scale dynamic programming. Machine Learning 22 59-94. · Zbl 1099.90586
[33] van der Laan, M. J. and Petersen, M. L. (2007). Causal effect models for realistic individualized treatment and intention to treat rules. Int. J. Biostat. 3 Art. 3, 54. · Zbl 1165.62357
[34] van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York. · Zbl 0862.60002
[35] Vapnik, V. N. (1999). The Nature of Statistical Learning Theory, 2nd ed. Springer, New York. · Zbl 0928.68093
[36] Wahed, A. S. (2009). Estimation of survival quantiles in two-stage randomization designs. J. Statist. Plann. Inference 139 2064-2075. · Zbl 1159.62084
[37] Wahed, A. S. and Tsiatis, A. A. (2006). Semiparametric efficient estimation of survival distributions in two-stage randomisation designs in clinical trials with censored data. Biometrika 93 163-177. · Zbl 1152.62397
[38] Watkins, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. thesis, Cambridge Univ.
[39] Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning 8 279-292. · Zbl 0773.68062
[40] Wellner, J. A. (2007). On an exponential bound for the Kaplan-Meier estimator. Lifetime Data Anal. 13 481-496. · Zbl 1331.62258
[41] Zhao, Y., Kosorok, M. R. and Zeng, D. (2009). Reinforcement learning design for cancer clinical trials. Stat. Med. 28 3294-3315.
[42] Zhao, Y., Zeng, D., Socinski, M. A. and Kosorok, M. R. (2011). Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer. Biometrics 67 1422-1433. · Zbl 1274.62922
[43] Zucker, D. M. (1998). Restricted mean life with covariates: Modification and extension of a useful survival analysis method. J. Amer. Statist. Assoc. 93 702-709. · Zbl 1130.62362