Pretest estimation in combining probability and non-probability samples. (English) Zbl 07725162

Summary: Multiple heterogeneous data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we develop a unified test-and-pool framework for general parameter estimation that combines gold-standard probability samples with non-probability samples. We focus on the case where the study variable is observed in both datasets for estimating the target parameters, and each dataset also contains auxiliary variables. Utilizing the probability design, we conduct a pretest to determine whether the non-probability data are comparable with the probability data and, on that basis, decide whether to leverage the non-probability data in a pooled analysis. When the probability and non-probability data are comparable, our approach combines both datasets for efficient estimation. Otherwise, we retain only the probability data for estimation. We also characterize the asymptotic distribution of the proposed test-and-pool estimator under a local alternative and provide a data-adaptive procedure to select the critical tuning parameters that target the smallest mean squared error of the test-and-pool estimator. Lastly, to deal with the non-regularity of the test-and-pool estimator, we construct a robust confidence interval with good finite-sample coverage.
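
The pretest-then-pool logic described in the summary can be illustrated with a deliberately simplified sketch for a population mean. This is not the authors' estimator: the function names (`weighted_mean_and_var`, `pool_after_pretest`), the Hájek-type weighted mean with a with-replacement variance approximation, the unadjusted non-probability mean, the inverse-variance pooling rule, and the fixed significance level `alpha` are all assumptions made for illustration. In particular, the paper selects its tuning parameters data-adaptively, which this sketch does not attempt.

```python
from statistics import NormalDist, fmean, variance


def weighted_mean_and_var(y, w):
    """Hajek-type weighted mean with a linearization variance estimate.

    Assumes a with-replacement approximation to the sampling design
    (an illustrative simplification, not taken from the paper).
    """
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    n = len(y)
    # Linearized residuals; they sum to zero by construction.
    z = [wi * (yi - mean) / total_w for wi, yi in zip(w, y)]
    var = n / (n - 1) * sum(zi ** 2 for zi in z)
    return mean, var


def pool_after_pretest(y_prob, w_prob, y_nonprob, alpha=0.05):
    """Return (estimate, pooled_flag) for a population mean.

    Hypothetical helper: pretests comparability with a Wald statistic,
    then pools by inverse-variance weighting only if the test does not reject.
    """
    mean_p, var_p = weighted_mean_and_var(y_prob, w_prob)
    mean_np = fmean(y_nonprob)
    var_np = variance(y_nonprob) / len(y_nonprob)

    # Wald statistic for H0: both samples estimate the same mean.
    wald = (mean_np - mean_p) ** 2 / (var_p + var_np)
    # Chi-square(1) critical value, obtained as a squared normal quantile.
    critical = NormalDist().inv_cdf(1 - alpha / 2) ** 2

    if wald <= critical:
        # Comparable: inverse-variance weighted combination of the two means.
        weight_p = var_np / (var_p + var_np)
        return weight_p * mean_p + (1 - weight_p) * mean_np, True
    # Not comparable: fall back to the probability sample alone.
    return mean_p, False


if __name__ == "__main__":
    y_prob = [1.0, 2.0, 3.0, 4.0, 5.0]
    w_prob = [10.0] * 5
    print(pool_after_pretest(y_prob, w_prob, [2.0, 3.0, 4.0, 3.0, 3.0, 3.0]))
    print(pool_after_pretest(y_prob, w_prob, [10.0, 11.0, 12.0, 10.0, 11.0, 12.0]))
```

Because the final estimator switches between two forms depending on the pretest outcome, its sampling distribution is non-regular, which is why the summary calls for a dedicated robust confidence interval rather than a standard Wald interval.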

MSC:

62D05 Sampling theory, sample surveys
62E20 Asymptotic distribution theory in statistics
62F03 Parametric hypothesis testing
62F35 Robustness and adaptive procedures (parametric inference)

Software:

qLearn
