Cross-study replicability in cluster analysis. (English) Zbl 07708433

Summary: In cancer research, clustering techniques are widely used for exploratory analyses, playing a critical role in the identification of novel cancer subtypes and patient management. As data collected by multiple research groups grows, it is increasingly feasible to investigate the replicability of clustering procedures, that is, their ability to consistently recover biologically meaningful clusters across several data sets. In this paper, we review methods for replicability of clustering analyses, and discuss a novel framework for evaluating cross-study clustering replicability, useful when two or more studies are available. Our approach can be applied to any clustering algorithm and can employ different measures of similarity between partitions to quantify replicability, globally (i.e., for the whole sample) as well as locally (i.e., for individual clusters). Using experiments on synthetic and real gene expression data, we illustrate the usefulness of our procedure to evaluate if the same clusters are identified consistently across a collection of data sets.
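To make the idea of a "measure of similarity between partitions" concrete, the following is a minimal sketch and not the authors' procedure. It assumes two labelings of the same samples are already available, for example one from clustering a study directly and one from applying a clustering learned on another study. It computes a global score (the adjusted Rand index of Hubert and Arabie, 1985) and a per-cluster score (best Jaccard overlap), chosen here only as common examples of global and local measures. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Global agreement between two partitions of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same samples")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two samples")
    # Contingency counts: number of samples shared by each pair of clusters.
    cells = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (e.g. both partitions have a single cluster).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def clusterwise_jaccard(labels_a, labels_b):
    """Local agreement: for each cluster of A, its best Jaccard match in B."""
    members_a, members_b = {}, {}
    for i, (a, b) in enumerate(zip(labels_a, labels_b)):
        members_a.setdefault(a, set()).add(i)
        members_b.setdefault(b, set()).add(i)
    return {
        a: max(len(s & t) / len(s | t) for t in members_b.values())
        for a, s in members_a.items()
    }


# Hypothetical labels for six samples of one study:
# `native` from clustering the study itself,
# `transferred` from a clustering learned on a different study.
native = [0, 0, 0, 1, 1, 1]
transferred = [1, 1, 0, 0, 0, 0]

global_score = adjusted_rand_index(native, transferred)
local_scores = clusterwise_jaccard(native, transferred)
print(f"global ARI: {global_score:.3f}")
for cluster, score in sorted(local_scores.items()):
    print(f"cluster {cluster}: best Jaccard = {score:.3f}")
```

Both scores are invariant to relabeling of clusters, which matters because cluster labels produced in different studies have no common meaning. A global score near 1 suggests the two partitions largely coincide, while the per-cluster scores indicate which individual clusters are recovered and which are not.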

MSC:

62-XX Statistics
