Missing-values adjustment for mixed-type data. (English) Zbl 1229.62039
Summary: We propose a new method of single imputation, reconstruction, and estimation of nonreported, incorrect, implausible, or excluded values in more than one field of the record. In particular, we will be concerned with data sets involving a mixture of numeric, ordinal, binary, and categorical variables. Our technique is a variation of the popular nearest neighbor hot deck imputation (NNHDI) where “nearest” is defined in terms of a global distance obtained as a convex combination of the distance matrices computed for the various types of variables. We address the problem of proper weighting of the partial distance matrices in order to reflect their significance, reliability, and statistical adequacy. Performance of several weighting schemes is compared under a variety of settings in coordination with imputation of the least power mean of the Box-Cox transformation applied to the values of the donors. Through analysis of simulated and actual data sets, we will show that this approach is appropriate. Our main contribution has been to demonstrate that mixed data may optimally be combined to allow the accurate reconstruction of missing values in the target variable even when some data are absent from the other fields of the record.
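The core idea of the summary, a global distance formed as a convex combination of per-type distance matrices and then used to pick donors, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the toy data, the equal weights, the range-normalised and simple-mismatch partial distances, and the choice of k are all assumptions made for the example. The power mean with a fixed exponent p only stands in for the paper's "least power mean of the Box-Cox transformation", which is not reproduced here, and the sketch assumes the auxiliary fields are fully observed.

```python
import numpy as np

def numeric_distance(x):
    """Range-normalised absolute differences for one numeric column."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    d = np.abs(x[:, None] - x[None, :])
    return d / span if span > 0 else d

def categorical_distance(c):
    """Simple mismatch distance: 0 if the categories agree, 1 otherwise."""
    c = np.asarray(c)
    return (c[:, None] != c[None, :]).astype(float)

def global_distance(partials, weights):
    """Convex combination of partial distance matrices (weights are assumed)."""
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(wi * di for wi, di in zip(w, partials))

def power_mean(values, p):
    """Power mean of positive values; p = 0 gives the geometric mean."""
    v = np.asarray(values, dtype=float)
    if p == 0:
        return float(np.exp(np.mean(np.log(v))))
    return float(np.mean(v ** p) ** (1.0 / p))

def impute_nn_hot_deck(target, dist, k=2, p=1.0):
    """Fill NaNs in target with the power mean of the k nearest donors."""
    y = np.asarray(target, dtype=float).copy()
    donors = np.flatnonzero(~np.isnan(y))  # records with an observed target
    for i in np.flatnonzero(np.isnan(y)):
        order = np.argsort(dist[i, donors], kind="stable")
        nearest = donors[order[:k]]
        y[i] = power_mean(y[nearest], p)
    return y

# Hypothetical toy records: one numeric field, one categorical field,
# and a target variable with a single missing value.
age = [25, 30, 45, 50, 28]
sex = ["f", "m", "f", "m", "f"]
income = [20.0, 30.0, 50.0, 60.0, np.nan]

D = global_distance(
    [numeric_distance(age), categorical_distance(sex)],
    weights=[0.5, 0.5],  # equal weights, chosen only for illustration
)
imputed = impute_nn_hot_deck(income, D, k=2, p=1.0)
```

Changing the weights shifts which records count as "nearest", which is why the weighting of the partial matrices matters for the quality of the imputed value.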

62G05 Nonparametric estimation
65C60 Computational problems in statistics (MSC2010)
62H99 Multivariate analysis
Software: R; normalp; UCI-ml