Oracle, multiple robust and multipurpose calibration in a missing response problem. (English) Zbl 1331.62070

Summary: In the presence of a missing response, reweighting the complete-case subsample by the inverse of the nonmissing probability is both intuitive and easy to implement. When the population totals of some auxiliary variables are known and the inclusion probabilities are known by design, survey statisticians have developed calibration methods that improve the efficiency of inverse probability weighting estimators, and these methods can be applied to missing data analysis. Model-based calibration has been proposed in the survey sampling literature, where multidimensional auxiliary variables are first summarized into a predictor function from a working regression model. Usually, one working model is proposed for each parameter of interest, which results in different sets of calibration weights for estimating different parameters. This paper considers calibration using multiple working regression models for estimating a single parameter or multiple parameters. Contrary to a common belief that overfitting hurts efficiency, we present three rather unexpected results. First, when the missing probability is correctly specified and multiple working regression models for the conditional mean are posited, calibration enjoys an oracle property: the same semiparametric efficiency bound is attained as if the true outcome model were known in advance. Second, when the missing data mechanism is misspecified, the calibration estimator can still be consistent when any one of the outcome regression models is correctly specified. Third, a common set of calibration weights can be used to improve efficiency in estimating multiple parameters of interest and can simultaneously attain semiparametric efficiency bounds for all parameters of interest.
We connect a wide class of calibration estimators, constructed from generalized empirical likelihood, to many existing estimators in biostatistics, econometrics and survey sampling, and we perform simulation studies showing that the finite-sample properties of calibration estimators conform well with the theoretical results.
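
As a rough illustration of calibration with multiple working regression models, the following Python sketch estimates a population mean from simulated data with a missing response. It is a minimal sketch under stated assumptions, not the authors' implementation: the data-generating model, the two working models, the helper name `calibrate`, and the choice of exponential tilting (one member of the generalized empirical likelihood family) are all illustrative, and the nonmissing probability `pi` is assumed known.

```python
import numpy as np

def calibrate(G_obs, d, target, tol=1e-9, max_iter=50):
    """Exponential-tilting calibration (hypothetical helper).

    Returns weights w_i = d_i * exp(lam' g_i) on the complete cases, with lam
    found by plain Newton iterations so that sum_i w_i g_i equals `target`.
    """
    lam = np.zeros(G_obs.shape[1])
    for _ in range(max_iter):
        w = d * np.exp(G_obs @ lam)
        grad = G_obs.T @ w - target              # calibration constraint residual
        if np.max(np.abs(grad)) < tol:
            break
        hess = (G_obs * w[:, None]).T @ G_obs    # sum_i w_i g_i g_i'
        lam -= np.linalg.solve(hess, grad)
    return d * np.exp(G_obs @ lam)

# Simulated data (assumed model): the true mean of Y is 1 + 0 + 1 = 2.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
Y = 1 + 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(size=n)
pi = 1 / (1 + np.exp(-(0.5 + X[:, 0])))          # nonmissing probability, assumed known
R = rng.random(n) < pi                           # True where the response is observed

# Two working outcome models fitted on complete cases: linear and quadratic.
ones = np.ones(n)
A1 = np.column_stack([ones, X])
A2 = np.column_stack([ones, X, X ** 2])
m1 = A1 @ np.linalg.lstsq(A1[R], Y[R], rcond=None)[0]
m2 = A2 @ np.linalg.lstsq(A2[R], Y[R], rcond=None)[0]

# Calibrate inverse probability weights so that the weighted complete-case
# averages of (1, m1, m2) match their full-sample averages.
G = np.column_stack([ones, m1, m2])
target = G.mean(axis=0)
d = 1 / (n * pi[R])                              # inverse probability weights
w = calibrate(G[R], d, target)

est = w @ Y[R]                                   # calibrated estimate of E[Y]
print("calibrated estimate of the mean:", est)
```

Because the weights `w` depend only on the covariates and the working-model predictions, the same weights could in principle be reused for other functionals of the response, which is the multipurpose idea described in the summary; whether the efficiency gain carries over depends on which working models are included.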


62D05 Sampling theory, sample surveys


[1] Bang, H. and Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics 61 962-972. · Zbl 1087.62121 · doi:10.1111/j.1541-0420.2005.00377.x
[2] Breslow, N. E., Lumley, T., Ballantyne, C. M., Chambless, L. E. and Kulich, M. (2009). Improved Horvitz-Thompson estimation of model parameters from two-phase stratified samples: Applications in epidemiology. Statistics in Biosciences 1 32-49.
[3] Cassel, C. M., Särndal, C. E. and Wretman, J. H. (1976). Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika 63 615-620. · Zbl 0344.62011 · doi:10.1093/biomet/63.3.615
[4] Chan, K. C. G. (2012). Uniform improvement of empirical likelihood for missing response problem. Electron. J. Stat. 6 289-302. · Zbl 1334.62033 · doi:10.1214/12-EJS673
[5] Chan, K. C. G. (2013). A simple multiply robust estimator for missing response problem. Stat 2 143-149.
[6] Chan, K. C. G. and Yam, S. C. P. (2014). Supplement to “Oracle, Multiple Robust and Multipurpose Calibration in a Missing Response Problem.” · Zbl 1331.62070 · doi:10.1214/13-STS461
[7] Chaussé, P. (2010). Computing generalized method of moments and generalized empirical likelihood with R. Journal of Statistical Software 34 1-35.
[8] Chen, J. and Sitter, R. R. (1999). A pseudo empirical likelihood approach to the effective use of auxiliary information in complex surveys. Statist. Sinica 9 385-406. · Zbl 0930.62005
[9] Chen, J., Sitter, R. R. and Wu, C. (2002). Using empirical likelihood methods to obtain range restricted weights in regression estimators for surveys. Biometrika 89 230-237. · Zbl 0997.62008 · doi:10.1093/biomet/89.1.230
[10] Cressie, N. and Read, T. R. C. (1984). Multinomial goodness-of-fit tests. J. Roy. Statist. Soc. Ser. B 46 440-464. · Zbl 0571.62017
[11] Deming, W. E. and Stephan, F. F. (1940). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Statist. 11 427-444. · Zbl 0024.05502 · doi:10.1214/aoms/1177731829
[12] Deville, J.-C. and Särndal, C.-E. (1992). Calibration estimators in survey sampling. J. Amer. Statist. Assoc. 87 376-382. · Zbl 0760.62010 · doi:10.2307/2290268
[13] Deville, J. C., Särndal, C. E. and Sautory, O. (1993). Generalized raking procedures in survey sampling. J. Amer. Statist. Assoc. 88 1013-1020. · Zbl 0794.62005 · doi:10.2307/2290793
[14] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348-1360. · Zbl 1073.62547 · doi:10.1198/016214501753382273
[15] Graham, B. S., De Xavier Pinto, C. C. and Egel, D. (2012). Inverse probability tilting for moment condition model with missing data. Rev. Econ. Stud. 79 1053-1079. · doi:10.1093/restud/rdr047
[16] Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica 66 315-331. · Zbl 1055.62572 · doi:10.2307/2998560
[17] Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis 20 25-46.
[18] Han, P. and Wang, L. (2013). Estimation with missing data: Beyond double robustness. Biometrika 100 417-430. · Zbl 1284.62260 · doi:10.1093/biomet/ass087
[19] Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50 1029-1054. · Zbl 0502.62098 · doi:10.2307/1912775
[20] Hansen, L. P., Heaton, J. and Yaron, A. (1996). Finite-sample properties of some alternative GMM estimators. J. Bus. Econom. Statist. 14 262-280.
[21] Hellerstein, J. K. and Imbens, G. W. (1999). Imposing moment restrictions from auxiliary data by weighting. Rev. Econ. Statist. 81 1-14.
[22] Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. 47 663-685. · Zbl 0047.38301 · doi:10.2307/2280784
[23] Imbens, G. W., Spady, R. H. and Johnson, P. (1998). Information-theoretic approaches to inference in moment condition models. Econometrica 66 333-357. · Zbl 1055.62512 · doi:10.2307/2998561
[24] Kang, J. D. Y. and Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statist. Sci. 22 523-539. · Zbl 1246.62073 · doi:10.1214/07-STS227
[25] Kim, J. K. (2009). Calibration estimation using empirical likelihood in survey sampling. Statist. Sinica 19 145-157. · Zbl 1153.62006
[26] Kitamura, Y. and Stutzer, M. (1997). An information-theoretic alternative to generalized method of moments estimation. Econometrica 65 861-874. · Zbl 0894.62011 · doi:10.2307/2171942
[27] Kott, P. S. and Chang, T. (2010). Using calibration weighting to adjust for nonignorable unit nonresponse. J. Amer. Statist. Assoc. 105 1265-1275. · Zbl 1390.62011 · doi:10.1198/jasa.2010.tm09016
[28] Lehmann, E. L. and Casella, G. (1998). Theory of Point Estimation, 2nd ed. Springer, New York. · Zbl 0916.62017
[29] Lindsay, B. G. and Qu, A. (2003). Inference functions and quadratic score tests. Statist. Sci. 18 394-410. · Zbl 1055.62047 · doi:10.1214/ss/1076102427
[30] Lumley, T., Shaw, P. A. and Dai, J. Y. (2011). Connections between survey calibration estimators and semiparametric models for incomplete data. Internat. Statist. Rev. 79 200-220. · Zbl 1422.62048
[31] McCaffrey, D. F., Ridgeway, G. and Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods 9 403-425.
[32] Newey, W. K. and McFadden, D. (1994). Large sample estimation and hypothesis testing. In Handbook of Econometrics, Vol. IV. Handbooks in Econom. 2 2111-2245. North-Holland, Amsterdam. · doi:10.1016/S1573-4412(05)80005-4
[33] Newey, W. K. and Smith, R. J. (2004). Higher order properties of GMM and generalized empirical likelihood estimators. Econometrica 72 219-255. · Zbl 1151.62313 · doi:10.1111/j.1468-0262.2004.00482.x
[34] Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75 237-249. · Zbl 0641.62032 · doi:10.1093/biomet/75.2.237
[35] Qin, J. and Lawless, J. (1994). Empirical likelihood and general estimating equations. Ann. Statist. 22 300-325. · Zbl 0799.62049 · doi:10.1214/aos/1176325370
[36] Qin, J. and Zhang, B. (2007). Empirical-likelihood-based inference in missing response problems and its application in observational studies. J. R. Stat. Soc. Ser. B Stat. Methodol. 69 101-122. · doi:10.1111/j.1467-9868.2007.00579.x
[37] Ridgeway, G. and McCaffrey, D. F. (2007). Comment: Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statist. Sci. 22 540-543. · Zbl 1246.62075 · doi:10.1214/07-STS227C
[38] Robins, J. M. and Rotnitzky, A. (1995). Semiparametric efficiency in multivariate regression models with missing data. J. Amer. Statist. Assoc. 90 122-129. · Zbl 0818.62043 · doi:10.2307/2291135
[39] Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. J. Amer. Statist. Assoc. 89 846-866. · Zbl 0815.62043 · doi:10.2307/2290910
[40] Saegusa, T. and Wellner, J. A. (2013). Weighted likelihood estimation under two-phase sampling. Ann. Statist. 41 269-295. · Zbl 1347.62033 · doi:10.1214/12-AOS1073
[41] Scharfstein, D. O., Rotnitzky, A. and Robins, J. M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. J. Amer. Statist. Assoc. 94 1096-1146. · Zbl 1072.62644 · doi:10.2307/2669923
[42] Tan, Z. (2006). A distributional approach for causal inference using propensity scores. J. Amer. Statist. Assoc. 101 1619-1637. · Zbl 1171.62320 · doi:10.1198/016214506000000023
[43] Théberge, A. (1999). Extensions of calibration estimators in survey sampling. J. Amer. Statist. Assoc. 94 635-644. · Zbl 0997.62009 · doi:10.2307/2670183
[44] White, H. (1982). Maximum likelihood estimation of misspecified models. Econometrica 50 1-25. · Zbl 0478.62088 · doi:10.2307/1912526
[45] Wu, C. and Sitter, R. R. (2001). A model-calibration approach to using complete auxiliary information from survey data. J. Amer. Statist. Assoc. 96 185-193. · Zbl 1015.62005 · doi:10.1198/016214501750333054
[46] Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418-1429. · Zbl 1171.62326 · doi:10.1198/016214506000000735