General-purpose imputation of planned missing data in social surveys: different strategies and their effect on correlations. (English) Zbl 07577515

Summary: Planned missing survey data, for example stemming from split questionnaire designs, are becoming increasingly common in survey research, making imputation indispensable to obtain reasonably analyzable data. However, these data can be difficult to impute due to low correlations, many predictors, and limited sample sizes to support imputation models. This paper presents findings from a Monte Carlo simulation, in which we investigate the accuracy of correlations after multiple imputation using different imputation methods and predictor set specifications based on data from the German Internet Panel (GIP). The results show that strategies that simplify the imputation exercise (such as predictive mean matching with dimensionality reduction or restricted predictor sets, linear regression models, or the multivariate normal model without transformation) perform well, while especially generalized linear models for categorical data, classification trees, and imputation models with many predictor variables lead to strong biases.

MSC:

62D10 Missing data
65C05 Monte Carlo methods
62P25 Applications of statistics to social sciences

References:

[1] ADIGÜZEL, F. and WEDEL, M. (2008). Split questionnaire design for massive surveys. Journal of Marketing Research 45 608-617.
[2] ALLISON, P. D. (2005). Imputation of Categorical Variables with PROC MI. In Proceedings of the SAS Users Group International (SUGI) 30 113-30. SAS Institute, Cary.
[3] AKANDE, O., LI, F. and REITER, J. (2017). An Empirical Comparison of Multiple Imputation Methods for Categorical Data. The American Statistician 71 162-170. · Zbl 07671795
[4] AXENFELD, J. B., BRUCH, C. and WOLF, C. (2022). Code and Data Availability. Supplement to “General-purpose imputation of planned missing data in social surveys: Different strategies and their effect on correlations.”
[5] AXENFELD, J. B., BLOM, A.G., BRUCH, C. and WOLF, C. (2022). Split Questionnaire Designs for Online Surveys: The Impact of Module Construction on Imputation Quality. Journal of Survey Statistics and Methodology. https://doi.org/10.1093/jssam/smab055
[6] BAHRAMI, S., ASSMANN, C., MEINFELDER, F. and RÄSSLER, S. (2014). A split questionnaire survey design for data with block structure correlation matrix. In Improving Survey Methods: Lessons from Recent Research, (U. ENGEL, B. JANN, P. LYNN, A. SCHERPENZEEL and P. STURGIS, eds.) 368-380. Routledge, New York.
[7] BARTLETT, J. W., SEAMAN, S. R., WHITE, I. R. and CARPENTER, J. R. (2015). Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model. Statistical Methods in Medical Research 24 462-487.
[8] BELLMAN, R. E. (1961). Adaptive control processes: a guided tour. Princeton University Press, Princeton.
[9] BLOM, A. G., BOSSERT, D., FUNKE, F., GEBHARD, F., HOLTHAUSEN, A. and KRIEGER, U.; SFB 884 “POLITICAL ECONOMY OF REFORMS” UNIVERSITÄT MANNHEIM (2016). German Internet Panel, Wave 1 - Core Study (September 2012). GESIS Data Archive, Cologne. ZA5866 Data file Version 2.0.0. https://doi.org/10.4232/1.12607.
[10] BLOM, A. G., BOSSERT, D., GEBHARD, F., FUNKE, F., HOLTHAUSEN, A. and KRIEGER, U.; SFB 884 “POLITICAL ECONOMY OF REFORMS” UNIVERSITÄT MANNHEIM (2016). German Internet Panel, Wave 13 - Core Study (September 2014). GESIS Data Archive, Cologne. ZA5924 Data file Version 2.0.0. https://doi.org/10.4232/1.12619.
[11] BLOM, A. G., FIKEL, M., FRIEDEL, S., HÖHNE, J. K., KRIEGER, U., RETTIG, T. and WENZ, A.; SFB 884 “POLITICAL ECONOMY OF REFORMS”, UNIVERSITÄT MANNHEIM (2019). German Internet Panel, Wave 37 - Core Study (September 2018). GESIS Data Archive, Cologne. ZA6957 Data file Version 1.0.0. https://doi.org/10.4232/1.13390.
[12] BLOM, A. G., FIKEL, M., FRIEDEL, S., HÖHNE, J. K., KRIEGER, U., RETTIG, T. and WENZ, A.; SFB 884 “POLITICAL ECONOMY OF REFORMS”, UNIVERSITÄT MANNHEIM (2019). German Internet Panel, Wave 38 (November 2018). GESIS Data Archive, Cologne. ZA6958 Data file Version 1.0.0. https://doi.org/10.4232/1.13391.
[13] BLOM, A. G., GATHMANN, C. and KRIEGER, U. (2015). Setting up an online panel representative of the general population: The German Internet Panel. Field Methods 27 391-408.
[14] BLOM, A. G., HERZING, J. M. E., CORNESSE, C., SAKSHAUG, J. W., KRIEGER, U. and BOSSERT, D. (2017). Does the recruitment of offline households increase the sample representativeness of probability-based online panels? Evidence from the German Internet Panel. Social Science Computer Review 35 498-520.
[15] BODNER, T. E. (2008). What improves with increased missing data imputations? Structural Equation Modeling: A Multidisciplinary Journal 15 651-675.
[16] BRAND, J. P. L. (1999). Development, implementation and evaluation of multiple imputation strategies for the statistical analysis of incomplete data sets. Erasmus University Rotterdam, Rotterdam.
[17] BREIMAN, L., FRIEDMAN, J. H., OLSHEN, R. A. and STONE, C. J. (1984). Classification and regression trees. Wadsworth & Brooks/Cole Advanced Books & Software, Monterey. · Zbl 0541.62042
[18] BURGETTE, L. F. and REITER, J. P. (2010). Multiple Imputation for Missing Data via Sequential Regression Trees. American Journal of Epidemiology 172 1070-1076.
[19] CORNESSE, C., FELDERER, B., FIKEL, M., KRIEGER, U. and BLOM, A. G. (2021). Recruiting a probability-based online panel via postal mail: experimental evidence. Social Science Computer Review. doi:10.1177/08944393211006059
[20] DE JONG, S. (1993). SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems 18 251-263.
[21] DOOVE, L. L., VAN BUUREN, S. and DUSSELDORP, E. (2014). Recursive partitioning for missing data imputation in the presence of interaction effects. Computational Statistics & Data Analysis 72 92-104. · Zbl 1506.62056
[22] GALESIC, M. and BOSNJAK, M. (2009). Effects of questionnaire length on participation and indicators of response quality in a web survey. Public Opinion Quarterly 73 349-360.
[23] GRAHAM, J. W., HOFER, S. M. and MACKINNON, D. P. (1996). Maximizing the usefulness of data obtained with planned missing value patterns: An application of maximum likelihood procedures. Multivariate Behavioral Research 31 197-218.
[24] GRAHAM, J. W., OLCHOWSKI, A. E. and GILREATH, T. D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science 8 206-213.
[25] HONAKER, J. and KING, G. (2010). What to do about missing values in time-series cross-section data. American Journal of Political Science 54 561-581.
[26] HONAKER, J., KING, G. and BLACKWELL, M. (2011). Amelia II: A Program for Missing Data. Journal of Statistical Software 45 1-47.
[27] HORTON, N. J., LIPSITZ, S. R. and PARZEN, M. (2003). A potential for bias when rounding in multiple imputation. The American Statistician 57 229-232. · Zbl 1182.62002
[28] IMBRIANO, P. M. and RAGHUNATHAN, T. E. (2020). Three-Form Split Questionnaire Design for Panel Surveys. Journal of Official Statistics 36 827-854.
[29] KLEINKE, K. (2018). Multiple imputation by predictive mean matching when sample size is small. Methodology 14 3-15.
[30] KOLLER-MEINFELDER, F. (2009). Analysis of incomplete survey data-multiple imputation via Bayesian bootstrap predictive mean matching. University of Bamberg, Bamberg.
[31] LEE, K. J. and CARLIN, J. B. (2010). Multiple imputation in the presence of non-normal data. Statistics in Medicine 171 624-632.
[32] LITTLE, R. J. A. (1988). Missing-Data Adjustments in Large Surveys. Journal of Business & Economic Statistics 6 287-296.
[33] LONG, J. S. (1997). Regression models for categorical and limited dependent variables. Sage, Thousand Oaks. · Zbl 0911.62055
[34] LUIJKX, R., JÓNSDÓTTIR, G. A., GUMMER, T., ERNST STÄHLI, M., FREDRIKSEN, M., REESKENS, T., KETOLA, K., BRISLINGER, E., CHRISTMANN, P., GUNNARSSON, S. Þ., BRAGI, Á., HJALTASON, D. J., LOMAZZI, V., MAINERI, A. M., MILBERT, P., OCHSNER, M., POLLIEN, A., SAPIN, M., SOLANES, I., VERHOEVEN, S. and WOLF, C. (2021). The European Values Study 2017: On the way to the future using mixed-modes. European Sociological Review 37 330-346.
[35] MEVIK, B.-H. and WEHRENS, R. (2007). The pls Package: Principal Component and Partial Least Squares Regression in R. Journal of Statistical Software 18(2) 1-24.
[36] MICROSOFT and WESTON, S. (2020). foreach: Provides Foreach Looping Construct. R package version 1.5.0.
[37] MORRIS, T. P., WHITE, I. R. and ROYSTON, P. (2014). Tuning multiple imputation by predictive mean matching and local residual draws. BMC Medical Research Methodology 14 1-13.
[38] MUNGER, G. F. and LOYD, B. H. (1988). The use of multiple matrix sampling for survey research. The Journal of Experimental Education 56 187-191.
[39] NICOLETTI, C. and PERACCHI, F. (2006). The effects of income imputation on microanalyses: evidence from the European Community Household Panel. Journal of the Royal Statistical Society: Series A (Statistics in Society) 169 625-646.
[40] OECD (2014). PISA 2012 Technical Report. OECD, Paris.
[41] PEYTCHEV, A. and PEYTCHEVA, E. (2017). Reduction of measurement error due to survey length: Evaluation of the split questionnaire design approach. Survey Research Methods 11 361-368.
[42] R CORE TEAM (2021). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna.
[43] RAGHUNATHAN, T. E. and GRIZZLE, J. E. (1995). A split questionnaire survey design. Journal of the American Statistical Association 90 54-63. · Zbl 0925.62046
[44] RÄSSLER, S., KOLLER, F. and MÄENPÄÄ, C. (2002). A split questionnaire survey design applied to German media and consumer surveys. In Friedrich-Alexander University Erlangen-Nuremberg, Chair of Statistics and Econometrics Discussion Papers [online], available at https://www.statistik.rw.fau.de/files/2016/03/d0042b.pdf.
[45] ROBITZSCH, A. and GRUND, S. (2021). miceadds: Some Additional Multiple Imputation Functions, Especially for ‘mice’. R package version 3.11-6.
[46] RUBIN, D. B. (1986). Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations. Journal of Business & Economic Statistics 4 87-94.
[47] RUBIN, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, New York.
[48] SCHAFER, J. L. and OLSEN, M. K. (1998). Multiple imputation for multivariate missing-data problems: A data analyst’s perspective. Multivariate Behavioral Research 33 545-571.
[49] SCHAFER, J. L. (1999). NORM users guide (version 2). The Methodology Center, The Pennsylvania State University, University Park.
[50] SEAMAN, S. R., BARTLETT, J. W. and WHITE, I. R. (2012). Multiple imputation of missing covariates with non-linear effects and interactions: an evaluation of statistical methods. BMC Medical Research Methodology 12 1-13.
[51] SLADE, E. and NAYLOR, M. G. (2020). A fair comparison of tree-based and parametric methods in multiple imputation by chained equations. Statistics in Medicine 39 1156-1166.
[52] SHAH, A. D., BARTLETT, J. W., CARPENTER, J., NICHOLAS, O. and HEMINGWAY, H. (2014). Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study. American Journal of Epidemiology 179 764-774.
[53] SHOEMAKER, D. M. (1973). Principles and Procedures of Multiple Matrix Sampling. Ballinger, Cambridge, MA.
[54] SIDDIQUE, J. and BELIN, T. R. (2008). Multiple imputation using an iterative hot-deck with distance-based donor selection. Statistics in Medicine 27 83-102.
[55] SIGNORELL, A., AHO, K., ALFONS, A., ANDEREGG, N., ARAGON, T., ARACHCHIGE, C., ARPPE, A., BADDELEY, A., BARTON, K., BOLKER, B., BORCHERS, H. W., CAEIRO, F., CHAMPELY, S., CHESSEL, D., CHHAY, L., COOPER, N., CUMMINS, C., DEWEY, M., DORAN, H. C., DRAY, S., DUPONT, C., EDDELBUETTEL, D., EKSTROM, C., ELFF, M., ENOS, J., FAREBROTHER, R. W., FOX, J., FRANCOIS, R., FRIENDLY, M., GALILI, T., GAMER, M., GASTWIRTH, J. L., GEGZNA, V., GEL, Y. R., GRABER, S., GROSS, J., GROTHENDIECK, G., HARRELL JR, F. E., HEIBERGER, R., HOEHLE, M., HOFFMANN, C. W., HOJSGAARD, S., HOTHORN, T., HUERZELER, M., HUI, W. W., HURD, P., HYNDMAN, R. J., JACKSON, C., KOHL, M., KORPELA, M., KUHN, M., LABES, D., LEISCH, F., LEMON, J., LI, D., MAECHLER, M., MAGNUSSON, A., MAINWARING, B., MALTER, D., MARSAGLIA, G., MARSAGLIA, J., MATEI, A., MEYER, D., MIAO, W., MILLO, G., MIN, Y., MITCHELL, D., MUELLER, F., NAEPFLIN, M., NAVARRO, D., NILSSON, H., NORDHAUSEN, K., OGLE, D., OOI, H., PARSONS, N., PAVOINE, S., PLATE, T., PRENDERGAST, L., RAPOLD, R., REVELLE, W., RINKER, T., RIPLEY, B. D., RODRIGUEZ, C., RUSSELL, N., SABBE, N., SCHERER, R., SESHAN, V. E., SMITHSON, M., SNOW, G., SOETAERT, K., STAHEL, W. A., STEPHENSON, A., STEVENSON, M., STUBNER, R., TEMPL, M., TEMPLE LANG, D., THERNEAU, T., TILLE, Y., TORGO, L., TRAPLETTI, A., ULRICH, J., USHEY, K., VANDERWAL, J., VENABLES, B., VERZANI, J., VILLACORTA IGLESIAS, P. J., WARNES, G. R., WELLEK, S., WICKHAM, H., WILCOX, R. R., WOLF, P., WOLLSCHLAEGER, D., WOOD, J., WU, Y., YEE, T. and ZEILEIS, A. (2020). DescTools: Tools for descriptive statistics. R package version 0.99.36.
[56] THOMAS, N., RAGHUNATHAN, T. E., SCHENKER, N., KATZOFF, M. J. and JOHNSON, C. L. (2006). An evaluation of matrix sampling methods using data from the National Health and Nutrition Examination Survey. Survey Methodology 32 217-231.
[57] VAN BELLE, G. (2002). Statistical Rules of Thumb. John Wiley & Sons, New York. · Zbl 1011.62003
[58] VAN BUUREN, S. (2018). Flexible Imputation of Missing Data. CRC press, Boca Raton, 2nd Edition. · Zbl 1416.62030
[59] VAN BUUREN, S., BOSHUIZEN, H. C. and KNOOK, D. L. (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18 681-694.
[60] VAN BUUREN, S., BRAND, J. P., GROOTHUIS-OUDSHOORN, C. G. and RUBIN, D. B. (2006). Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation 76 1049-1064. · Zbl 1144.62332
[61] VAN BUUREN, S. and GROOTHUIS-OUDSHOORN, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software 45(3) 1-67.
[62] VENABLES, W. N. and RIPLEY, B. D. (2002). Modern Applied Statistics with S. Springer, New York. · Zbl 1006.62003
[63] VON HIPPEL, P. T. (2009). How to impute interactions, squares, and other transformed variables. Sociological Methodology 39 265-291.
[64] VON HIPPEL, P. T. (2013). Should a normal imputation model be modified to impute skewed variables? Sociological Methods & Research 42 105-138.
[65] VON HIPPEL, P. T. (2020). How many imputations do you need? A two-stage calculation using a quadratic rule. Sociological Methods & Research 49 699-718.
[66] WESTON, S. (2017). doMPI: foreach parallel adaptor for the Rmpi package. R package version 0.2.2.
[67] WICKHAM, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer, New York. · Zbl 1397.62006
[68] WICKHAM, H. and HENRY, L. (2019). tidyr: Easily Tidy Data with ‘spread()’ and ‘gather()’ Functions. R package version 0.8.3.
[69] WICKHAM, H. and MILLER, E. (2019). haven: Import and Export ‘SPSS’, ‘Stata’ and ‘SAS’ Files. R package version 2.1.1.
[70] WHITE, I. R., ROYSTON, P. and WOOD, A. M. (2011). Multiple imputation using chained equations: issues and guidance for practice. Statistics in Medicine 30 377-399.
[71] WU, H. and LEUNG, S. O. (2017). Can Likert scales be treated as interval scales? A simulation study. Journal of Social Service Research 43 527-532.
[72] WU, W., JIA, F. and ENDERS, C. (2015). A comparison of imputation strategies for ordinal missing data on Likert scale variables. Multivariate Behavioral Research 50 484-503.
[73] YU, H. (2002). Rmpi: Parallel statistical computing in R. R News 2(2) 10-14.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.