Dealing with distances and transformations for fuzzy \(c\)-means clustering of compositional data. (English) Zbl 1360.62347

Summary: Clustering techniques are based upon a dissimilarity or distance measure between objects and clusters. This paper focuses on the simplex space, whose elements – compositions – are subject to non-negativity and constant-sum constraints. Any data analysis involving compositions should fulfill two main principles: scale invariance and subcompositional coherence. Among fuzzy clustering methods, the fuzzy \(c\)-means (FCM) algorithm is broadly applied in a variety of fields, but it is not well-behaved when dealing with compositions. Here, the adequacy of different dissimilarities in the simplex, together with the behavior of the common log-ratio transformations, is discussed on the basis of compositional principles. As a result, a well-founded strategy for FCM clustering of compositions is suggested. Theoretical findings are accompanied by numerical evidence, and a detailed account of our proposal is provided. Finally, a case study is illustrated using a nutritional data set known in the clustering literature.
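The scale-invariance principle named in the summary can be illustrated with a minimal sketch. It assumes the centered log-ratio (clr) transformation and the Aitchison distance (the Euclidean distance between clr coordinates) as one example of a log-ratio approach; it is not necessarily the strategy the paper recommends. The function names and the two sample compositions are hypothetical and chosen only for illustration.

```python
import math


def clr(x):
    """Centered log-ratio transform of a strictly positive composition."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def euclidean(a, b):
    """Ordinary Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def aitchison_distance(x, y):
    """Euclidean distance computed on clr coordinates."""
    return euclidean(clr(x), clr(y))


# Two illustrative three-part compositions (made-up values).
x = [0.1, 0.3, 0.6]
y = [0.2, 0.5, 0.3]

# The same composition as x, expressed in percent instead of proportions.
x_scaled = [100.0 * v for v in x]

d_raw = euclidean(x, y)
d_raw_scaled = euclidean(x_scaled, y)            # changes with the scale
d_ait = aitchison_distance(x, y)
d_ait_scaled = aitchison_distance(x_scaled, y)   # unchanged by the scale

print("raw Euclidean:", d_raw, d_raw_scaled)
print("Aitchison:    ", d_ait, d_ait_scaled)
```

Under these assumptions, rescaling a composition changes the raw Euclidean distance but leaves the Aitchison distance unchanged up to floating-point error, which is the behavior a scale-invariant dissimilarity should show. The sketch requires strictly positive parts, since the logarithm of zero is undefined.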


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H86 Multivariate analysis and fuzziness
86A32 Geostatistics

