
The correlation-assisted missing data estimator. (English) Zbl 07625194

Summary: We introduce a novel approach to estimation problems in settings with missing data. Our proposal – the Correlation-Assisted Missing data (CAM) estimator – works by exploiting the relationship between the observations with missing features and those without missing features in order to obtain improved prediction accuracy. In particular, our theoretical results elucidate general conditions under which the proposed CAM estimator has lower mean squared error than the widely used complete-case approach in a range of estimation problems. We showcase in detail how the CAM estimator can be applied to \(U\)-Statistics to obtain an unbiased, asymptotically Gaussian estimator that has lower variance than the complete-case \(U\)-Statistic. Further, in nonparametric density estimation and regression problems, we construct our CAM estimator using kernel functions, and show it has lower asymptotic mean squared error than the corresponding complete-case kernel estimator. We also include practical demonstrations throughout the paper using simulated data and the Terneuzen birth cohort and Brandsma datasets available from CRAN.
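To illustrate the idea of borrowing strength from the incompletely observed cases, here is a minimal Python sketch for the simplest problem, estimating a mean. It is not the paper's definition of the CAM estimator. It assumes a control-variate (regression-type) correction, data missing completely at random, and a covariate `x` that is always observed while `y` is observed only for some cases. The function names, the plug-in coefficient `gamma`, and the simulation settings are all hypothetical choices made for this illustration.

```python
import numpy as np


def complete_case_mean(y, observed):
    """Complete-case estimate of E[Y]: average y over fully observed cases only."""
    return y[observed].mean()


def correlation_assisted_mean(x, y, observed):
    """Illustrative correlation-assisted estimate of E[Y] (an assumed form).

    x        : covariate observed for every case
    y        : response; entries where `observed` is False are ignored
    observed : boolean mask marking the complete cases
    """
    x_cc, y_cc = x[observed], y[observed]
    # Plug-in coefficient Cov(X, Y) / Var(X), estimated from complete cases.
    gamma = np.cov(x_cc, y_cc)[0, 1] / np.var(x_cc, ddof=1)
    # Shift the complete-case mean by how far the complete-case mean of x
    # sits from the mean of x over all cases.
    return y_cc.mean() - gamma * (x_cc.mean() - x.mean())


def simulate_mse(n_rep=200, n=500, p_obs=0.3, seed=0):
    """Monte Carlo mean squared error of both estimators when E[Y] = 0."""
    rng = np.random.default_rng(seed)
    err_cc, err_cam = [], []
    for _ in range(n_rep):
        x = rng.normal(size=n)
        y = x + 0.5 * rng.normal(size=n)   # y is correlated with x, E[Y] = 0
        observed = rng.random(n) < p_obs   # missing completely at random
        err_cc.append(complete_case_mean(y, observed) ** 2)
        err_cam.append(correlation_assisted_mean(x, y, observed) ** 2)
    return float(np.mean(err_cc)), float(np.mean(err_cam))


if __name__ == "__main__":
    mse_cc, mse_cam = simulate_mse()
    print(f"complete-case MSE: {mse_cc:.5f}")
    print(f"correlation-assisted MSE: {mse_cam:.5f}")
```

In this toy setting the correction term has mean close to zero under the missing-completely-at-random assumption, so it mainly removes the part of the complete-case error that `x` can explain. The stronger the correlation between `x` and `y`, the larger the reduction one would expect.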

MSC:

68T05 Learning and adaptive systems in artificial intelligence

Software:

MICE; KernSmooth
