# zbMATH — the first resource for mathematics

Exploratory tools for outlier detection in compositional data with structural zeros. (English) Zbl 07282066
Summary: The analysis of compositional data using the log-ratio approach is based on ratios between the compositional parts. Zeros in the parts thus cause serious difficulties for the analysis. This is a particular problem in the case of structural zeros, which cannot simply be replaced by a non-zero value, as is done, e.g., for values below the detection limit or for missing values. Instead, structural zeros have to be incorporated into further statistical processing. The focus is on exploratory tools for identifying outliers in compositional data sets with structural zeros. For this purpose, Mahalanobis distances are estimated, computed either directly for subcompositions determined by their zero patterns, or by first using imputation to improve the efficiency of the estimates and then proceeding to the subcompositional and subgroup level. For this approach, new theory is formulated that makes it possible to estimate covariances for imputed compositional data and to apply estimations on subgroups using parts of this covariance matrix. Moreover, the zero pattern structure is analyzed using principal component analysis for binary data to achieve a comprehensive view of the overall multivariate data structure. The proposed tools are applied to larger compositional data sets from official statistics, where the need for an appropriate treatment of zeros is obvious.
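The first of the two strategies in the summary (Mahalanobis distances computed directly on the subcompositions defined by each zero pattern) can be illustrated with a minimal sketch. It is not the authors' implementation and does not use robCompositions: the function names `ilr` and `mahalanobis_by_zero_pattern` are hypothetical, the isometric log-ratio basis is one arbitrary orthonormal choice, and the classical mean and covariance stand in for whatever (possibly robust) estimators the paper applies.

```python
import numpy as np


def ilr(parts):
    """Map strictly positive compositions (one per row) to ilr coordinates.

    Assumption: a Helmert-type orthonormal basis is used; any orthonormal
    basis of the clr hyperplane gives the same Mahalanobis distances.
    """
    d = parts.shape[1]
    log_parts = np.log(parts)
    clr = log_parts - log_parts.mean(axis=1, keepdims=True)
    basis = np.zeros((d, d - 1))
    for j in range(1, d):
        basis[:j, j - 1] = 1.0 / j
        basis[j, j - 1] = -1.0
        basis[:, j - 1] *= np.sqrt(j / (j + 1.0))  # normalise to unit length
    return clr @ basis


def mahalanobis_by_zero_pattern(data):
    """Mahalanobis distance of each row within its own zero-pattern group.

    Rows whose group is too small for an invertible covariance estimate
    (or has fewer than two non-zero parts) are returned as NaN.
    """
    data = np.asarray(data, dtype=float)
    distances = np.full(data.shape[0], np.nan)
    patterns = (data == 0).astype(np.uint8)  # 1 marks a zero part
    for pattern in np.unique(patterns, axis=0):
        rows = np.where((patterns == pattern).all(axis=1))[0]
        cols = np.where(pattern == 0)[0]  # parts observed in this group
        n_coords = cols.size - 1
        if n_coords < 1 or rows.size <= n_coords + 1:
            continue
        z = ilr(data[np.ix_(rows, cols)])
        diff = z - z.mean(axis=0)
        # Classical covariance as a simplifying assumption (not robust).
        cov = np.atleast_2d(np.cov(z, rowvar=False))
        d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
        distances[rows] = np.sqrt(d2)
    return distances


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    demo = rng.uniform(0.1, 1.0, size=(40, 4))
    demo[25:, 3] = 0.0  # second group: last part structurally zero
    print(mahalanobis_by_zero_pattern(demo).round(2))
```

Large distances within a group flag candidate outliers relative to observations sharing the same zero pattern. Small groups yield no estimate in this sketch, which is the situation where the imputation-based variant described in the summary would come into play.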
##### MSC:
97K80 Applied statistics (educational aspects); 97K70 Foundations and methodology of statistics (educational aspects); 97K40 Descriptive statistics (educational aspects)
##### Software:
GitHub; impute; laeken; R; robCompositions; simPop
##### References:
[1] J. Aitchison, The Statistical Analysis of Compositional Data, Chapman & Hall, London, 1986. · Zbl 0688.62004
[2] J. Aitchison and M. Greenacre, Biplots of compositional data, J. Appl. Stat. 51 (2002), pp. 375-392. · Zbl 1111.62300
[3] J. Aitchison and J. Kay, Possible solutions of some essential zero problems in compositional data analysis. pp. 1-6. Available at http://ima.udg.edu/Activitats/CoDaWork03/paper_Aitchison_and_Kay.pdf.
[4] A. Alfons and M. Templ, Estimation of social exclusion indicators from complex surveys: The R package laeken, J. Statist. Softw. 54 (2013), pp. 1-25.
[5] A. Alfons, S. Kraft, M. Templ, and P. Filzmoser, Simulation of close-to-reality population data for household surveys with application to EU-SILC, Statist. Methods Appl. 20 (2011), pp. 383-407. · Zbl 1237.91178
[6] J. Bacon-Shone, Discrete and continuous compositions, in CoDaWork’08, Universitat de Girona. Departament d’Informática i Matemática Aplicada, 2008, p. 11.
[7] A. Butler and C. Glasbey, A latent Gaussian model for compositional data with zeros, J. Appl. Stat. 57 (2008), pp. 505-520.
[8] F. Chebana and T. Ouarda, Depth-based multivariate descriptive statistics with hydrological applications, J. Geophys. Res: Atmos. 116 (2011), pp. 1-19.
[9] X. Dang and R. Serfling, Nonparametric depth-based multivariate outlier identifiers, and masking robustness properties, J. Stat. Plan. Inference 140 (2010), pp. 198-213. · Zbl 1191.62084
[10] J. de Leeuw, Principal component analysis of binary data by iterated singular value decomposition, Comput. Stat. Data Anal. 50 (2006), pp. 21-39. · Zbl 1429.62218
[11] O. Dupriez, Building a household consumption database for the calculation of poverty ppps, Technical note, World Bank, 2007, Available at http://siteresources.worldbank.org/ICPINT/Resources/270056-1195253046582/Dupriez_BuildingaHHCdatabasefortheCalculationofPovertyPPPs_Mar07.pdf.
[12] JJ. Egozcue, Reply to ‘On the Harker variation diagrams; …’ by J.A. Cortés, Math. Geosci. 41 (2009), pp. 829-834. · Zbl 1178.86018
[13] JJ. Egozcue and V. Pawlowsky-Glahn, Groups of parts and their balances in compositional data analysis, Math. Geol. 37 (2005), pp. 795-828. · Zbl 1177.86018
[14] J. Egozcue and V. Pawlowsky-Glahn, Compositional Data Analysis in the Geosciences: From theory to Practice, chap. Simplicial geometry for compositional data, Geological Society, London, 2006, pp. 145-160, special Publications 264. · Zbl 1156.86307
[15] JJ. Egozcue, V. Pawlowsky-Glahn, G. Mateu-Figueras, and C. Barceló-Vidal, Isometric logratio transformations for compositional data analysis, Math. Geol. 35 (2003), pp. 279-300. · Zbl 1302.86024
[16] J. Egozcue, V. Pawlowsky-Glahn, G. Mateu-Figueras, and C. Barceló-Vidal, Compositional Data Analysis: Theory and Applications, Elem. Simplicial Linear Algebra Geometry. Wiley, Chichester, 2011, 139-145.
[17] Eurostat, Description of target variables: Cross-sectional and longitudinal, EU-SILC 065/04, Unit E-2: Living conditions, Directorate E: Social and regional statistics and geographical information system, Eurostat, Luxembourg, 2004.
[18] P. Filzmoser and K. Hron, Outlier detection for compositional data using robust methods, Math. Geosci. 40 (2008), pp. 233-248. · Zbl 1135.62040
[19] P. Filzmoser, K. Hron, and C. Reimann, Principal component analysis for compositional data with outliers, Environmetrics 20 (2009), pp. 621-632.
[20] P. Filzmoser, K. Hron, and C. Reimann, Interpretation of multivariate outliers for compositional data, Comput. Geosci. 39 (2012), pp. 77-85.
[21] JM. Fry, TR. Fry, and KR. McLaren, Compositional data analysis and zeros in micro data, Appl. Econom. 32 (2000), pp. 953-959, Available at http://www.tandfonline.com/doi/abs/10.1080/000368400322002.
[22] K.R. Gabriel, The biplot – graphic display of matrices with application to principal component analysis, Biometrika 58 (1971), pp. 453-467. · Zbl 0228.62034
[23] J. Guilford, Psychometric Methods, McGraw-Hill series in psychology, McGraw-Hill, New York City, 1954.
[24] K. Hron, M. Templ, and P. Filzmoser, Imputation of missing values for compositional data using classical and robust methods, Comput. Statist. Data Anal. 54 (2010), pp. 3095-3107. · Zbl 1284.62049
[25] S. Lee, JZ. Huang, and J. Hu, Sparse logistic principal components analysis for binary data, Ann. Appl. Stat. 4 (2010), pp. 1579-1601, Available at http://dx.doi.org/10.1214/10-AOAS327. · Zbl 1202.62084
[26] JA. Martín-Fernández, C. Barceló-Vidal, and V. Pawlowsky-Glahn, Dealing with zeros and missing values in compositional data sets using nonparametric imputation, Math. Geol. 35 (2003), pp. 253-278. · Zbl 1302.86027
[27] J. Martín-Fernández, J. Palarea-Albaladejo, and R. Olea, Compositional Data Analysis: Theory and Applications, Dealing with Zeros, Wiley, Chichester, 2011, 43-58.
[28] JA. Martín-Fernández, K. Hron, M. Templ, P. Filzmoser, and J. Palarea-Albaladejo, Model-based replacement of rounded zeros in compositional data: Classical and robust approaches, Comput. Statist. Data Anal. C 56 (2012), pp. 2688-2704. · Zbl 1255.62116
[29] J. Martín-Fernández, K. Hron, M. Templ, P. Filzmoser, and J. Palarea-Albaladejo, Bayesian-multiplicative treatment of count zeros in compositional data sets, Stat. Model. 15 (2015), doi:10.1177/1471082X14535524. · Zbl 1255.62116
[30] B. Meindl, M. Templ, A. Alfons, and A. Kowarik, simPop: Simulation of Synthetic Populations for Survey Data Considering Auxiliary Information, 2015, Available at http://CRAN.R-project.org/package=simPop, R package version 0.2.9.
[31] V. Pawlowsky-Glahn and A. Buccianti, Compositional Data Analysis: Theory and Applications, Wiley, Chichester, 2011. · Zbl 1103.62111
[32] V. Pawlowsky-Glahn, J. Egozcue, and R. Tolosana-Delgado, Modeling and Analysis of Compositional Data, Wiley, Chichester, 2015.
[33] P. Rousseeuw and K. von Driessen, A fast algorithm for the minimum covariance determinant estimator, Technometrics 41 (1999), pp. 212-223.
[34] JL. Scealy and AH. Welsh, Regression for compositional data by using distributions defined on the hypersphere, J. R. Stat. Soc. Ser. B. Stat. Methodol. 73 (2011), pp. 351-375. · Zbl 1411.62179
[35] C. Stewart and C. Field, Managing the essential zeros in quantitative fatty acid signature analysis, J. Agric. Biol. Environ. Stat. 16 (2010), pp. 45-69. · Zbl 1306.62237
[36] F. Tang and H. Tao, Binary principal component analysis, In Proc. British Machine Vision Conference, Volume I, 2006, pp. 377-386.
[37] M. Templ, A. Alfons, and P. Filzmoser, Exploring incomplete data using visualization techniques, Adv. Data Anal. Classif. 6 (2012), pp. 29-47.
[38] M. Templ, K. Hron, and P. Filzmoser, robCompositions: An R-package for robust statistical analysis of compositional data, in Compositional Data Analysis: Theory and Applications, V. Pawlowsky-Glahn and A. Buccianti, eds., Wiley, Chichester, 2011, pp. 341-355.
[39] M. Templ, K. Hron, and P. Filzmoser, Robust Estimation for Compositional Data, 2015, Available at https://github.com/matthias-da/robCompositions, R package version 1.9.2. · Zbl 1304.65033
[40] V. Todorov, M. Templ, and P. Filzmoser, Detection of multivariate outliers in business survey data with incomplete information, Adv. Data Anal. Classif. 5 (2011), pp. 37-56.
[41] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and RB. Altman, Missing value estimation methods for dna microarrays, Bioinformatics 17 (2001), pp. 520-525.
[42] K. van den Boogaart and R. Tolosana-Delgado, Analyzing Compositional Data with R, Springer, Heidelberg, 2013. · Zbl 1276.62011
[43] H. Wang, Q. Liu, HMK. Mok, L. Fu, and W. Man Tse, A hyperspherical transformation forecasting model for compositional data, Eur. J. Oper. Res. 179 (2007), pp. 459-468. · Zbl 1114.90049
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.