zbMATH — the first resource for mathematics

Imputation of missing values for compositional data using classical and robust methods. (English) Zbl 1284.62049
Summary: New imputation algorithms for estimating missing values in compositional data are introduced. The first proposal uses the k-nearest neighbor procedure based on the Aitchison distance, a distance measure designed specifically for compositional data. It is important to adjust the estimated missing values to the overall size of the compositional parts of the neighbors. The second proposal is an iterative model-based imputation technique that starts from the result of the proposed k-nearest neighbor procedure. The method is based on iterative regressions, thereby accounting for the whole multivariate data information. The regressions have to be performed in a transformed space, and depending on the data quality, classical or robust regression techniques can be employed. The proposed methods are tested on a real data set and on simulated data sets. The results show that the proposed methods outperform standard imputation methods. In the presence of outliers, the model-based method with robust regressions is preferable.
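
The first proposal in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `aitchison_distance` and `knn_impute` are hypothetical, the distance is computed via centered log-ratios on the parts observed in both rows, and the "adjustment to the overall size" is assumed here to be the ratio of the sums of those common parts, with the median taken over the k adjusted neighbor values. The summary does not specify these details, so treat them as assumptions.

```python
import numpy as np

def aitchison_distance(x, y):
    """Aitchison distance between two strictly positive compositions,
    computed as the Euclidean distance of their centered log-ratios."""
    lx, ly = np.log(x), np.log(y)
    return float(np.sqrt(np.sum(((lx - lx.mean()) - (ly - ly.mean())) ** 2)))

def knn_impute(X, k=3):
    """Fill NaN cells from the k nearest rows (hypothetical sketch).

    Assumptions, not taken from the summary: neighbors are ranked by the
    Aitchison distance on the parts observed in both rows, each neighbor
    value is rescaled by the ratio of the sums of those common parts,
    and the median of the rescaled values is used as the imputation.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    missing = np.isnan(X)
    for i, j in zip(*np.nonzero(missing)):
        candidates = []
        for l in range(X.shape[0]):
            if l == i or missing[l, j]:
                continue  # a neighbor must have part j observed
            common = ~missing[i] & ~missing[l]
            if common.sum() < 2:
                continue  # need at least two shared parts for a log-ratio
            d = aitchison_distance(X[i, common], X[l, common])
            # Rescale the neighbor to the size of row i on the shared parts.
            scale = X[i, common].sum() / X[l, common].sum()
            candidates.append((d, scale * X[l, j]))
        if candidates:
            candidates.sort(key=lambda t: t[0])
            out[i, j] = np.median([v for _, v in candidates[:k]])
    return out

X = np.array([
    [1.0, 2.0, np.nan],
    [2.0, 4.0, 6.0],
    [10.0, 20.0, 30.0],
    [1.0, 1.0, 1.0],
])
X_imputed = knn_impute(X, k=2)
```

In this toy input, rows 2 and 3 have the same proportions as row 1 on the two observed parts, so they are its nearest neighbors, and after rescaling both suggest a value of about 3 for the missing part. The second, model-based proposal would start from such a result and refine it by iterative regressions in a transformed space; that step is not sketched here.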

MSC:
62-07 Data analysis (statistics) (MSC2010)
62F35 Robustness and adaptive procedures (parametric inference)