Estimation of causal effects with multiple treatments: a review and new ideas. (English) Zbl 1442.62021

Summary: The propensity score is a common tool for estimating the causal effect of a binary treatment in observational data. In this setting, matching, subclassification, imputation, or inverse probability weighting on the propensity score can reduce the initial covariate bias between the treatment and control groups. With more than two treatment options, however, estimation of causal effects requires additional assumptions and techniques, the implementations of which have varied across disciplines. This paper reviews current methods, and it identifies and contrasts the treatment effects that each one estimates. Additionally, we propose possible matching techniques for use with multiple, nominal categorical treatments, and use simulations to show how such algorithms can yield improved covariate similarity between those in the matched sets, relative to the pre-matched cohort. In sum, this manuscript provides a synopsis of how to notate and use causal methods for categorical treatments.


62A01 Foundations and philosophical topics in statistics
62P10 Applications of statistics to biology and medical sciences; meta analysis
62-02 Research exposition (monographs, survey articles) pertaining to statistics

