Imputation for multisource data with comparison and assessment techniques. (English) Zbl 1411.62362

Summary: Missing data are a prevalent issue in analyses involving data collection. The problem is exacerbated in multisource analysis, where data from multiple sensors are combined to arrive at a single conclusion: missingness is more likely to occur and can lead to discarding a large amount of the collected data. However, information from the observed sensors can be leveraged to estimate the values that were not observed. We propose two methods for imputation of multisource data, both of which exploit potential correlation between data from different sensors, one through ridge regression and one through a state-space model. These methods, along with the common median imputation, are applied to data collected from a variety of sensors monitoring an experimental facility. The performance of the imputation methods is compared using the mean absolute deviation; rather than using this metric solely to rank the methods, we also propose an approach for identifying significant differences between them. The imputation techniques are also assessed by their ability to produce appropriate confidence intervals, judged by coverage and length, around the imputed values. Finally, the performance of the imputed datasets is compared with that of a marginalized dataset through a weighted k-means clustering. In general, we found that imputation through a dynamic linear model tended to be the most accurate and to produce the most precise confidence intervals, and that imputing the missing values and down-weighting them relative to observed values in the analysis led to the most accurate performance.
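The ridge-regression imputation idea described in the summary can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the three "sensor" columns, the regularization parameter `lam`, and the helper `ridge_impute` are all assumptions made for the example. A target sensor's missing readings are predicted from the correlated sensors via the closed-form ridge solution, and the result is compared to median imputation by mean absolute deviation against the held-out truth.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic readings from three correlated "sensors" (columns):
# each is a noisy copy of a shared underlying signal.
n = 200
base = rng.normal(size=n)
X = np.column_stack([base + 0.1 * rng.normal(size=n) for _ in range(3)])

# Hide ~20% of sensor 0's readings to simulate missingness.
mask = rng.random(n) < 0.2
truth = X[mask, 0].copy()

def ridge_impute(X, mask, target=0, lam=1.0):
    """Impute missing entries of column `target` via ridge regression
    on the remaining (fully observed) sensor columns."""
    others = np.delete(X, target, axis=1)
    A, y = others[~mask], X[~mask, target]
    # Closed-form ridge estimate: (A'A + lam I)^{-1} A'y
    w = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
    return others[mask] @ w

imputed = ridge_impute(X, mask)
median_fill = np.full(mask.sum(), np.median(X[~mask, 0]))

# Compare imputations by mean absolute deviation from the held-out truth.
mad_ridge = np.mean(np.abs(imputed - truth))
mad_median = np.mean(np.abs(median_fill - truth))
print(mad_ridge, mad_median)  # strong cross-sensor correlation should favor ridge
```

Because the sensors here share most of their signal, the regression-based fill tracks the hidden values closely, while the median ignores the cross-sensor information entirely; this is the correlation-exploiting behavior the proposed methods are built around.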


62P30 Applications of statistics in engineering and industry; control charts
62H30 Classification and discrimination; cluster analysis (statistical aspects)


TCPDUMP; glmnet; impute
Full Text: DOI Link


[1] Marlin BM. Missing data problems in machine learning. Ph.D. Thesis; 2008.
[2] Wagstaff K. Clustering with Missing Values: No Imputation Required. Springer: Berlin/Heidelberg, 2004.
[3] Troyanskaya O, Cantor M, Sherlock G, et al. Missing value estimation methods for DNA microarrays. Bioinformatics. 2001;17(6):520‐525.
[4] Batista GEAPA, Monard MC. An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence. 2003;17(5‐6):519‐533.
[5] Hall DL, Llinas J. An introduction to multisensor data fusion. Proceedings of the IEEE. 1997;85(1):6‐23.
[6] Liggins II M, Hall D, Llinas J. Handbook of Multisensor Data Fusion: Theory and Practice. CRC Press: Boca Raton, FL, 2008.
[7] Xiang S, Yuan L, Fan W, Wang Y, Thompson PM, Ye J. Bi‐level multi‐source learning for heterogeneous block‐wise missing data. NeuroImage. 2014;102:192‐206.
[8] Dasarathy BV. Sensor fusion potential exploitation‐innovative architectures and illustrative applications. Proceedings of the IEEE. 1997;85(1):24‐38.
[9] Moravec HP. Sensor fusion in certainty grids for mobile robots. AI Magazine. 1988;9(2):61‐74.
[10] Moshenberg S, Lerner U, Fishbain B. Spectral methods for imputation of missing air quality data. Environmental Systems Research. 2015;4(1):1‐13.
[11] Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association. 1987;82(398):528‐540. · Zbl 0619.62029
[12] Hoff PD. A First Course in Bayesian Statistical Methods. Springer Science & Business Media: New York, 2009. · Zbl 1213.62044
[13] Kong A, Liu JS, Wong WH. Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association. 1994;89(425):278‐288. · Zbl 0800.62166
[14] Donders ART, van der Heijden GJMG, Stijnen T, Moons KGM. Review: A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology. 2006;59(10):1087‐1091.
[15] Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 1970;12(1):55‐67. · Zbl 0202.17205
[16] Petris G, Petrone S, Campagnoli P. Dynamic Linear Models. Springer: New York, 2009. · Zbl 1176.62088
[17] Reilly PM, Patino‐Leal H. A Bayesian study of the error‐in‐variables model. Technometrics. 1981;23(3):221‐231. · Zbl 0467.62028
[18] Li Q, Lin N, et al. The Bayesian elastic net. Bayesian Analysis. 2010;5(1):151‐170. · Zbl 1330.65026
[19] Stevens GN, Buren KLV, Hemez FM. DARHT multi‐intelligence seismic and acoustic data analysis. LA‐UR‐16‐25378, Los Alamos National Laboratory; 2015.
[20] Jacobson V, Leres C, McCanne S. TCPDUMP user commands. Available from: http://www.tcpdump.org/tcpdump_man.html [accessed 7 June 2015]; 2015.
[21] Mockapetris P. Domain names ‐ implementation and specification. RFC 1035, RFC Editor. Available from: https://tools.ietf.org/html/rfc1035 [accessed 29 July 2016]; November 1987.
[22] Stauffer C, Grimson WEL. Adaptive background mixture models for real‐time tracking. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2. IEEE; 1999; Fort Collins, Colorado.
[23] KaewTraKulPong P, Bowden R. An improved adaptive background mixture model for real‐time tracking with shadow detection. Video‐Based Surveillance Systems. Springer: New York, 2002:135‐144.
[24] Junninen H, Niska H, Tuppurainen K, Ruuskanen J, Kolehmainen M. Methods for imputation of missing values in air quality data sets. Atmospheric Environment. 2004;38(18):2895‐2907.
[25] Gondara L. Random forest with random projection to impute missing gene expression data. In: 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA). IEEE; 2015; Miami, Florida:1251‐1256.
[26] Zhang S, Zhang J, Zhu X, Qin Y, Zhang C. Missing value imputation based on data clustering. Transactions on Computational Science I. Springer: Berlin, Heidelberg, 2008:128‐138.
[27] James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning, Vol. 112. Springer: New York, 2013. · Zbl 1281.62147
[28] Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning, Springer series in statistics, vol. 2. Springer: Berlin, 2008. · Zbl 0973.62007
[29] Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33(1):1‐22.
[30] Efron B, Tibshirani RJ. An Introduction to the Bootstrap. CRC Press: Boca Raton, FL, 1994.
[31] Montgomery DC. Design and Analysis of Experiments. 5th ed. John Wiley & Sons: New York, 2007.
[32] Becker TE, Billings RS, Eveleth DM, Gilbert NL. Foci and bases of employee commitment: Implications for job performance. Academy of Management Journal. 1996;39(2):464‐482.
[33] Jing L, Ng MK, Huang JZ. An entropy weighting k‐means algorithm for subspace clustering of high‐dimensional sparse data. IEEE Transactions on Knowledge and Data Engineering. 2007;19(8):1026‐1041.
[34] Lloyd SP. Least squares quantization in PCM. Bell Telephone Labs technical note, 1957; published in IEEE Transactions on Information Theory. 1982;28:128‐137.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.