Fractional imputation in survey sampling: a comparative review. (English) Zbl 1442.62032

Summary: Fractional imputation (FI) is a relatively new method of imputation for handling item nonresponse in survey sampling. In FI, several imputed values with their fractional weights are created for each record with missing items. Each fractional weight represents the conditional probability of the imputed value given the observed data, and the parameters in the conditional probabilities are often computed by an iterative method such as the EM algorithm. The underlying model for FI can be fully parametric, semiparametric or nonparametric, depending on the plausibility of assumptions and the data structure.
In this paper, we give an overview of FI, introduce key ideas and methods to readers who are new to the FI literature, and highlight some new developments. We also provide guidance on practical implementation of FI and valid inferential tools after imputation. We compare the empirical performance of FI with that of multiple imputation using a pseudo finite population generated from a sample from the Monthly Retail Trade Survey conducted by the US Census Bureau.
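
The mechanism described in the summary (several imputed values per record, each carrying a fractional weight equal to a conditional probability, with parameters fitted by EM) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a binary study item `y`, a single categorical covariate `x`, a missing-at-random response mechanism, and a fully parametric model P(y = 1 | x) = p[x]. The function name `fractional_imputation` and the record layout `(x, y, w)` are hypothetical choices made for illustration.

```python
from collections import defaultdict


def fractional_imputation(records, n_iter=50):
    """Fractional imputation for a binary item y with a categorical covariate x.

    records: list of (x, y, w), where y is 0, 1, or None (missing) and w is the
             survey weight. This layout is an assumption made for this sketch.
    Returns (p, imputed): p[x] estimates P(y = 1 | x); imputed is a list of
    (x, y_value, weight), where each missing record is replaced by two rows
    whose weights are w times the fractional weights.
    """
    cells = {x for x, _, _ in records}
    p = {x: 0.5 for x in cells}  # starting values for the EM iterations

    for _ in range(n_iter):
        num = defaultdict(float)
        den = defaultdict(float)
        for x, y, w in records:
            # E-step: a missing y contributes its conditional expectation p[x],
            # which is the fractional weight attached to the imputed value 1.
            num[x] += w * (p[x] if y is None else y)
            den[x] += w
        # M-step: weighted proportion of ones within each covariate cell.
        p = {x: num[x] / den[x] for x in cells}

    imputed = []
    for x, y, w in records:
        if y is None:
            # Two imputed values; their fractional weights sum to one.
            imputed.append((x, 1, w * p[x]))
            imputed.append((x, 0, w * (1.0 - p[x])))
        else:
            imputed.append((x, y, w))
    return p, imputed


def weighted_mean(imputed):
    """Weighted mean of y over the fractionally imputed data set."""
    total = sum(w for _, _, w in imputed)
    return sum(w * y for _, y, w in imputed) / total


# Toy data (invented for illustration): ten records with unit weights.
toy = [
    ("a", 1, 1.0), ("a", 1, 1.0), ("a", 0, 1.0), ("a", None, 1.0),
    ("b", 0, 1.0), ("b", 0, 1.0), ("b", 1, 1.0), ("b", 0, 1.0),
    ("b", None, 1.0), ("b", None, 1.0),
]
p_hat, fi_data = fractional_imputation(toy)
fi_mean = weighted_mean(fi_data)
```

In this toy setting the EM iterations converge to the respondent proportion within each covariate cell, so the sketch mainly shows how the fractional weights are constructed and carried into a weighted estimator. A cell with no observed `y` keeps its starting value, because the model is not identified there.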


62D05 Sampling theory, sample surveys
62P25 Applications of statistics to social sciences

