Nearest neighbour approach in the least-squares data imputation algorithms. (English) Zbl 1084.62043

Summary: Imputation of missing data is of interest in many areas, such as survey data editing, maintenance of medical documentation, and DNA microarray data analysis. This paper is devoted to an experimental analysis of a set of imputation methods developed within the so-called least-squares approximation approach, a nonparametric, computationally effective multidimensional technique. First, we review global methods for least-squares data imputation. Then we propose extensions of these algorithms based on the nearest neighbours approach. An experimental study of the algorithms on generated data sets is conducted. It appears that straight algorithms may work rather well on data of simple structure and/or with a small number of missing entries. However, in more complex cases, the only winner within the least-squares approximation approach is a method, INI, proposed in this paper as a combination of global and local imputation algorithms.
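To make the distinction between global and nearest-neighbour (local) least-squares imputation concrete, the following is a minimal sketch and not the algorithms evaluated in the paper (in particular, it is not INI). It assumes a rank-one model fitted by alternating least squares over the observed entries, and a local variant that refits the same model on the target row plus its closest rows. The function names `rank_one_impute`, `distance`, and `nn_rank_one_impute`, the distance measure, and the parameters `n_iter` and `n_neighbours` are all hypothetical choices made for illustration.

```python
from typing import List, Optional

Row = List[Optional[float]]  # None marks a missing entry (illustrative convention)


def rank_one_impute(rows: List[Row], n_iter: int = 100) -> List[List[float]]:
    """Global sketch: fit x[i][k] ~ z[i] * c[k] by alternating least squares
    over the observed entries only, then fill the missing entries."""
    n, m = len(rows), len(rows[0])
    z = [1.0] * n  # row scores
    c = [1.0] * m  # column loadings
    for _ in range(n_iter):
        # Update each row score with the loadings held fixed.
        for i in range(n):
            num = sum(rows[i][k] * c[k] for k in range(m) if rows[i][k] is not None)
            den = sum(c[k] ** 2 for k in range(m) if rows[i][k] is not None)
            if den > 0:  # keep the previous value if nothing is observed
                z[i] = num / den
        # Update each loading with the row scores held fixed.
        for k in range(m):
            num = sum(rows[i][k] * z[i] for i in range(n) if rows[i][k] is not None)
            den = sum(z[i] ** 2 for i in range(n) if rows[i][k] is not None)
            if den > 0:
                c[k] = num / den
    return [
        [rows[i][k] if rows[i][k] is not None else z[i] * c[k] for k in range(m)]
        for i in range(n)
    ]


def distance(a: Row, b: Row) -> float:
    """Mean squared difference over the columns observed in both rows."""
    common = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
    return sum(common) / len(common) if common else float("inf")


def nn_rank_one_impute(
    rows: List[Row], n_neighbours: int = 2, n_iter: int = 100
) -> List[List[Optional[float]]]:
    """Local sketch: for each incomplete row, refit the rank-one model on that
    row together with its n_neighbours closest rows, and fill from that fit."""
    filled = [list(r) for r in rows]
    for i, target in enumerate(rows):
        if None not in target:
            continue  # complete rows are left untouched
        others = sorted(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: distance(target, rows[j]),
        )
        local = [target] + [rows[j] for j in others[:n_neighbours]]
        filled[i] = rank_one_impute(local, n_iter)[0]
    return filled


# Made-up data: the first three rows follow one pattern, the last two another.
data = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 6.0, None],
    [10.0, -10.0, 5.0],
    [-8.0, 8.0, -4.0],
]
print(nn_rank_one_impute(data)[2])
```

In this made-up example the two nearest rows to the incomplete one are exact multiples of (1, 2, 3), so the local fit recovers a value very close to 9 for the missing entry. A single global rank-one fit over all five rows has to compromise between the two patterns. That is the kind of situation in which the summary reports that local methods and their combination with global ones pay off.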


62G99 Nonparametric inference
65C60 Computational problems in statistics (MSC2010)



