zbMATH — the first resource for mathematics

Hierarchical clustering of continuous variables based on the empirical copula process and permutation linkages. (English) Zbl 1284.62380
Summary: The agglomerative hierarchical clustering of continuous variables is studied in the framework of the likelihood linkage analysis method proposed by Lerman. The similarity between variables is defined from the process comparing the empirical copula with the independence copula, in the spirit of the test of independence proposed by Deheuvels. Unlike more classical similarity coefficients for variables based on rank statistics, the comparison measure considered in this work can also be sensitive to non-monotonic dependencies. As aggregation criteria, besides classical linkages, permutation-based linkages related to procedures for combining dependent \(p\)-values are considered. The performance of the corresponding clustering algorithms is compared in an extensive simulation study. In order to guide the choice of a partition, a natural probabilistic selection strategy, related to the use of the gap statistic in object clustering, is proposed and empirically compared with classical ordinal approaches. The resulting variable clustering procedure can equivalently be regarded as a potentially less computationally expensive alternative to more powerful tests of multivariate independence.
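
To illustrate the comparison measure described in the summary, here is a minimal Python sketch of a Cramér-von Mises-type statistic that compares the empirical copula of two variables with the independence copula \(C(u,v)=uv\). It is not the authors' implementation. The function names, the Riemann-sum approximation on the rank grid, and the Monte Carlo permutation \(p\)-value are assumptions made for illustration, and the paper's exact statistic and \(p\)-value computation may differ.

```python
import numpy as np


def empirical_copula_grid(x, y):
    """Empirical copula C_n(j/n, k/n), j, k = 1..n, built from ranks (assumes no ties)."""
    n = len(x)
    r = np.argsort(np.argsort(x))  # 0-based ranks of x
    s = np.argsort(np.argsort(y))  # 0-based ranks of y
    counts = np.zeros((n, n))
    counts[r, s] = 1.0  # one point per observation on the rank grid
    # The double cumulative sum counts observations with both ranks below the grid point.
    return counts.cumsum(axis=0).cumsum(axis=1) / n


def independence_statistic(x, y):
    """Grid approximation of n * integral of (C_n(u, v) - u * v)^2 over the unit square."""
    n = len(x)
    grid = np.arange(1, n + 1) / n
    diff = empirical_copula_grid(x, y) - np.outer(grid, grid)
    return n * np.mean(diff ** 2)


def permutation_p_value(x, y, n_perm=199, seed=0):
    """Monte Carlo p-value: permuting y breaks any dependence with x (illustrative choice)."""
    rng = np.random.default_rng(seed)
    observed = independence_statistic(x, y)
    exceed = sum(
        independence_statistic(x, rng.permutation(y)) >= observed
        for _ in range(n_perm)
    )
    return (1 + exceed) / (1 + n_perm)


# Hypothetical data: y depends on x in a non-monotonic way, z is independent of x.
rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, size=200)
y_nonmonotone = x ** 2 + 0.05 * rng.normal(size=200)
z_independent = rng.uniform(size=200)

print(independence_statistic(x, y_nonmonotone))
print(independence_statistic(x, z_independent))
print(permutation_p_value(x, y_nonmonotone))
```

In this synthetic example the relation between `x` and `y_nonmonotone` is symmetric about zero, so a rank correlation such as Spearman's rho would be close to zero, whereas the copula-based statistic is expected to be clearly larger than for the independent pair. Within likelihood linkage analysis, a probabilistic similarity between two variables could then be taken as one minus such a \(p\)-value.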

MSC:
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62-07 Data analysis (statistics) (MSC2010)
References:
[1] Beran, R.; Bilodeau, M.; Lafaye de Micheaux, P., Nonparametric tests of independence between random vectors, Journal of multivariate analysis, 98, 9, 1805-1824, (2007) · Zbl 1130.62040
[2] Bruynooghe, M., Classification ascendante hiérarchique de grands ensembles de données: un algorithme rapide fondé sur la construction de voisinages réductibles, Les cahiers de l’analyse de données, III, 7-33, (1978)
[3] Deheuvels, P., A non parametric test for independence, Publications de l’institut de statistique de l’université de Paris, 26, 29-50, (1981) · Zbl 0478.62029
[4] Edgington, E.S., An additive method for combining probability values from independent experiments, The journal of psychology, 80, 351-363, (1972)
[5] Embrechts, P.; McNeil, A.J.; Straumann, D., Correlation and dependence in risk management: properties and pitfalls, (), 176-223
[6] Feller, W., ()
[7] Fisher, R.A., Statistical methods for research workers, (1932), Oliver and Boyd London · JFM 58.1161.04
[8] Fredericks, G.A.; Nelsen, R.B., On the relationship between Spearman’s rho and Kendall’s tau for pairs of continuous variables, Journal of statistical planning and inference, 137, 2143-2150, (2007) · Zbl 1120.62045
[9] Genest, C.; Rémillard, B., Tests of independence and randomness based on the empirical copula process, Test, 13, 2, 335-369, (2004) · Zbl 1069.62039
[10] Genest, C.; Verret, F., Locally most powerful rank tests of independence for copula models, Nonparametric statistics, 17, 5, 521-539, (2005) · Zbl 1065.62081
[11] Genest, C.; Quessy, J.-F.; Rémillard, B., Local efficiency of a Cramér-von Mises test of independence, Journal of multivariate analysis, 97, 274-294, (2006) · Zbl 1079.62048
[12] Genest, C.; Quessy, J.-F.; Rémillard, B., Asymptotic local efficiency of Cramér-von Mises tests for multivariate independence, The annals of statistics, 35, 166-191, (2007) · Zbl 1114.62058
[13] Hansen, P.; Jaumard, B., Cluster analysis and mathematical programming, Mathematical programming, 79, 191-215, (1997) · Zbl 0887.90182
[14] Harrell, F.E., R package Hmisc, (2007), URL http://biostat.mc.vanderbilt.edu/s/Hmisc. R package version 3.2-1
[15] Hoeffding, W., A non-parametric test of independence, Ann. math. stat., 19, 546-557, (1948) · Zbl 0032.42001
[16] Joe, H., Relative entropy measures of multivariate dependence, Journal of the American statistical association, 84, 157-164, (1989) · Zbl 0677.62054
[17] Kojadinovic, I., Agglomerative hierarchical clustering of continuous variables based on mutual information, Computational statistics and data analysis, 46, 269-294, (2004) · Zbl 1429.62251
[18] Kojadinovic, I.; Holmes, M., Tests of independence among continuous random vectors based on Cramér-von Mises functionals of the empirical copula process, Journal of multivariate analysis, 100, 6, 1137-1154, (2009) · Zbl 1159.62033
[19] Kojadinovic, I.; Lerman, I.C.; Peter, P., Hclust: hierarchical clustering of variables or objects based on the likelihood linkage analysis method, (2009), R package version 0.2-2
[20] Lerman, I.C., Classification et analyse ordinale de données, (1981), Dunod Paris · Zbl 0485.62051
[21] Lerman, I.C., Foundations of the likelihood linkage analysis classification method, Applied stochastic models and data analysis, 7, 63-76, (1991) · Zbl 0800.62320
[22] Lerman, I.C., Likelihood linkage analysis classification method: an example treated by hand, Biochimie, 75, 379-397, (1993)
[23] Loughin, T.M., A systematic comparison of methods for combining \(p\)-values from independent tests, Computational statistics and data analysis, 47, 467-485, (2004) · Zbl 1430.62048
[24] Milligan, G.W.; Cooper, M.C., An examination of procedures for determining the number of clusters in a data set, Psychometrika, 50, 2, 159-179, (1985)
[25] Murtagh, F., A survey of recent advances in hierarchical clustering algorithms, Computer journal, 26, 354-359, (1983) · Zbl 0523.68030
[26] Pesarin, F., Multivariate permutation tests with applications in biostatistics, (2001), Wiley
[27] Plasse, M.; Niang, N.; Saporta, G.; Villeminot, A.; Leblond, L., Combined use of association rules mining and clustering methods to find relevant links between binary rare attributes in a large data set, Computational statistics and data analysis, 52, 1, 596-613, (2007) · Zbl 1452.62460
[28] R Development Core Team, R: A language and environment for statistical computing, (2009), R Foundation for Statistical Computing, Vienna, Austria, ISBN: 3-900051-07-0, URL http://www.R-project.org
[29] Rényi, A., On measures of dependence, Acta mathematica academiae scientiarum hungaricae, 10, 441-451, (1959) · Zbl 0091.14403
[30] Sahmer, K.; Vigneau, E.; Qannari, E.M., A cluster approach to analyze preference data: choice of the number of clusters, Food quality and preference, 17, 257-265, (2006)
[31] Sarle, W.S., SAS/STAT user’s guide: the VARCLUS procedure, (1990), SAS Institute, Inc Cary, NC, USA
[32] Schweizer, B.; Wolff, E.F., On nonparametric measures of dependence for random variables, The annals of statistics, 9, 4, 879-885, (1981) · Zbl 0468.62012
[33] Sklar, A., Fonctions de répartition à \(n\) dimensions et leurs marges, Publications de l’institut de statistique de l’université de Paris, 8, 229-231, (1959) · Zbl 0100.14202
[34] Tibshirani, R.; Walther, G.; Hastie, T., Estimating the number of clusters in a data set via the gap statistic, Journal of the royal statistical society B, 63, 411-423, (2001) · Zbl 0979.62046
[35] Tippett, L.H.C., The method of statistics, (1931), Williams and Norgate London
[36] Vigneau, E.; Qannari, E.M., Clustering of variables around latent component: application to sensory analysis, Commun. statist. simulation comput., 32, 4, 1131-1150, (2003) · Zbl 1100.62582
[37] Yan, J.; Kojadinovic, I., Copula: multivariate dependence with copulas, (2009), R package version 0.8-8
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.