zbMATH — the first resource for mathematics

Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis. (English) Zbl 1238.68124
Summary: The advent of high throughput technologies, in particular microarrays, for biological research has revived interest in clustering, resulting in a plethora of new clustering algorithms. However, model selection, i.e., the identification of the correct number of clusters in a dataset, has received relatively little attention. Indeed, although central for statistics, its difficulty is also well known. Fortunately, a few novel techniques for model selection, representing a sharp departure from previous ones in statistics, have been proposed and gained prominence for microarray data analysis. Among those, the stability-based methods are the most robust and best performing in terms of prediction, but the slowest in terms of time. It is very unfortunate that as fascinating and classic an area of statistics as model selection, with important practical applications, has received very little attention in terms of algorithmic design and engineering. In this paper, in order to partially fill this gap, we make the following contributions: (A) the first general algorithmic paradigm for stability-based methods for model selection; (B) reductions showing that all of the known methods in this class are an instance of the proposed paradigm; (C) a novel algorithmic paradigm for the class of stability-based methods for cluster validity, i.e., methods assessing how statistically significant is a given clustering solution; (D) a general algorithmic paradigm that describes heuristic and very effective speed-ups known in the literature for stability-based model selection methods.
Since the performance evaluation of model selection algorithms is mainly experimental, we offer, for completeness and without even attempting to be exhaustive, a representative synopsis of known experimental benchmarking results that highlight the ability of stability-based methods for model selection and the computational resources that they require for the task. As a whole, the contributions of this paper generalize in several respects reference methodologies in statistics and show that algorithmic approaches can yield deep methodological insights into this area, in addition to practical computational procedures.

68T05 Learning and adaptive systems in artificial intelligence
68P05 Data structures
92B05 General biology and biomathematics
Full Text: DOI
[1] NCI 60 Cancer Microarray Project. http://genome-www.stanford.edu/NCI60.
[2] Validation Work Bench: Valworkbench web page. http://www.math.unipa.it/ raffaele/valworkbench/.
[3] Achlioptas, D., Database-friendly random projections: Johnson-Lindenstrauss with binary coins, Journal of computer and system sciences, 66, 671-687, (2003) · Zbl 1054.68040
[4] Ailon, N.; Chazelle, B., Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform, (), 557-563 · Zbl 1301.68232
[5] Alizadeh, A.A.; Eisen, M.B.; Davis, R.E.; Ma, C.; Lossos, I.S.; Rosenwald, A.; Boldrick, J.C; Sabet, H.; Tran, T.; Yu, X.; Powell, J.I; Yang, L.; Marti, G.E.; Moore, T.; Hudson, J.; Lu, L.; Lewis, D.B.; Tibshirani, R.; Sherlock, G.; Chan, W.C.; Greiner, T.C.; Weisenburger, D.D.; Armitage, J.O.; Warnke, R.; Levy, R.; Wilson, W.; Grever, M.R.; Byrd, J.C; Botstein, D.; Brown, P.O.; Staudt, L.M., Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling, Nature, 403, 503-511, (2000)
[6] Alon, N., Problems and results in extremal combinatorics—II, Discrete mathematics, 308, 4460-4472, (2008) · Zbl 1152.05053
[7] Alon, U.; Barkai, N.; Notterman, D.A.; Gish, K.; Ybarra, S.; Mack, D.; Levine, A.J., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proceedings of the national Academy of sciences of the united states of America, 96, 6745-6750, (1999)
[8] Andreopoulos, B.; An, A.; Wang, X.; Schroeder, M., A roadmap of clustering algorithms: finding a match for a biomedical application, Briefings in bioinformatics, 10, 3, 297-314, (2009)
[9] Ben-David, S.; von Luxburg, U.; Pál, D., A sober look at clustering stability, (), 5 · Zbl 1143.68520
[10] A. Ben-Hur, A. Elisseeff, I. Guyon, A stability based method for discovering structure in clustering data, in: Seventh Pacific Symposium on Biocomputing, ISCB, 2002, pp. 6-17.
[11] Benesty, J.; Morgan, D.; Sondhi, M., A better understanding and an improved solution to the problems of stereophonic acoustic echo cancellation, (), 303
[12] Bertoni, A.; Valentini, G., Randomized maps for assessing the reliability of patients clusters in DNA microarray data analyses, Artificial intelligence in medicine, 37, 85-109, (2006)
[13] Bertoni, A.; Valentini, G., Model order selection for bio-molecular data clustering, BMC bioinformatics, 8, (2007)
[14] A. Bhattacharya, P. Kar, M. Pal, On low distortion embeddings of statistical distance measures into low dimensional spaces, in: DEXA, 2009, pp. 164-172.
[15] Bingham, E.; Mannila, H., Random projection in dimensionality reduction: applications to image and text data, (), 245-250
[16] Bittner, M.; Meltzer, P.; Chen, Y.; Jiang, Y.; Seftor, E.; Hendrix, M.; Radmacher, M.; Simon, R.; Yakhini, Z.; Ben-Dor, A.; Sampas, N.; Dougherty, E.; Wang, E.; Marincola, F.; Gooden, F, C.; Lueders, J.; Glatfelter, A.; Pollock, P.; Carpten, J.; Gillanders, E.; Leja, D.; Dietrich, K.; Beaudry, C.; Berens, M.; Alberts, D.; Sondak, V., Molecular classification of cutaneous malignant melanoma by gene expression profiling, Nature, 406, 536-540, (2000)
[17] Bock, H.H., On some significance tests in cluster analysis, Journal of classification, 2, 77-108, (1985) · Zbl 0587.62048
[18] Breckenridge, J.N., Replicating cluster analysis: method, consistency, and validity, Multivariate behavioral research, 24, 2, 147-161, (1989)
[19] Breiman, L., Bagging predictors, Machine learning, 24, 123-140, (1996) · Zbl 0858.68080
[20] Brunet, J.-P.; Tamayo, P.; Golub, T.R.; Mesirov, J.P., Metagenes and molecular pattern discovery using matrix factorization, Proceedings of the national Academy of sciences of the united states of America, 101, 4164-4169, (2004)
[21] Cormode, G.; Datar, M.; Indyk, P.; Muthukrishnan, S., Comparing data streams using Hamming norms (how to zero in), IEEE transactions on knowledge and data engineering, 15, 529-540, (2003)
[22] Dasgupta, S.; Gupta, A., An elementary proof of a theorem of Johnson and Lindenstrauss, Random structures & algorithms, 22, 60-65, (2003) · Zbl 1018.51010
[23] D’haeseleer, P., How does gene expression cluster work?, Nature biotechnology, 23, 1499-1501, (2006)
[24] Di Gesú, V.; Giancarlo, R.; Lo Bosco, G.; Raimondi, A.; Scaturro, D., Genclust: a genetic algorithm for clustering gene expression data, BMC bioinformatics, 6, 289, (2005)
[25] Dudoit, S.; Fridlyand, J., A prediction-based resampling method for estimating the number of clusters in a dataset, Genome biology, 3, (2002)
[26] Dudoit, S.; Fridlyand, J., Bagging to improve the accuracy of a clustering procedure, Bioinformatics, 19, 9, 1090-1099, (2003)
[27] Dudoit, S.; Fridlyand, J.; Speed, T.P., Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American statistical association, 97, 457, 77-87, (2002) · Zbl 1073.62576
[28] Efron, B.; Tibshirani, R.J., An introduction to the bootstrap, (1993), Chapman & Hall London · Zbl 0835.62038
[29] Fisher, D.; Hoffman, P., The adjusted rand statistic: a SAS macro, Psychometrika, 53, 417-423, (1988) · Zbl 0726.62094
[30] Fowlkes, E.B.; Mallows, C.L., A method for comparing two hierarchical clusterings, Journal of the American statistical association, 78, 553-584, (1983) · Zbl 0545.62042
[31] Giancarlo, R.; Scaturro, D.; Utro, F., Computational cluster validation for microarray data analysis: experimental assessment of clest, consensus clustering, figure of merit, gap statistics and model explorer, BMC bioinformatics, 9, 462, (2008)
[32] Giancarlo, R.; Utro, F., Speeding up the consensus clustering methodology for microarray data analysis, Algorithms for molecular biology, 6, 1, (2011)
[33] Golub, T.R.; Slonim, D.K.; Tamayo, P.; Huard, C.; Gaasenbeeck, M.; Mesirov, J.P.; Coller, H.; Loh, M.L.; Downing, J.R.; Caligiuri, M.A.; Bloomfield, C.D.; Lander, E.S., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286, 531, 531-537, (1999), 5439
[34] Gordon, A.D., Clustering algorithms and cluster validation, (), 503-518
[35] Gordon, A.D., Null models in cluster validation, (), 32-44
[36] Handl, J.; Knowles, J.; Kell, D.B., Computational cluster validation in post-genomic data analysis, Bioinformatics, 21, 15, 3201-3212, (2005)
[37] Hansen, M.H.; Hurwitz, W.N.; Madow, W.G., Sample survey methods and theory methods and applications, vol. 1, (1993), Wiley · Zbl 0052.14801
[38] Hastie, T.; Tibshirani, R.; Friedman, J., The elements of statistical learning, (2003), Springer
[39] Hubert, L.; Arabie, P., Comparing partitions, Journal of classification, 2, 193-218, (1985)
[40] Indyk, P.; Motwani, R., Approximate nearest neighbors: towards removing the curse of dimensionality, (), 604-613 · Zbl 1029.68541
[41] Jain, A.K.; Dubes, R.C., Algorithms for clustering data, (1988), Prentice-Hall Englewood Cliffs · Zbl 0665.62061
[42] Johnson, W.B.; Lindenstrauss, J., Extensions of Lipschitz mappings into a Hilbert space, Contemporary mathematics, 26, 189-206, (1984) · Zbl 0539.46017
[43] W.B. Johnson, A. Naor, The Johnson-Lindenstrauss lemma almost characterizes Hilbert space, but not quite, in: SODA, 2009, pp. 885-891. · Zbl 1423.46020
[44] Harper, C.W., Groupings by locality in community ecology and paleoecology: tests of significance, Lethaia, 11, 251-257, (1978)
[45] Kaufman, L.; Rousseeuw, P.J., Finding groups in data: an introduction to cluster analysis, (1990), Wiley New York · Zbl 1345.62009
[46] Kerr, M.K.; Churchill, G.A., Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments, Pnas, 98, 8961-8965, (2000) · Zbl 1047.62110
[47] Kraus, J.; Kestler, H., A highly efficient multi-core algorithm for clustering extremely large datasets, BMC bioinformatics, 11, (2010)
[48] Lee, D.D.; Seung, H.S., Learning the parts of objects by non-negative matrix factorization, Nature, 401, 788-791, (1999) · Zbl 1369.68285
[49] D.D. Lee, H.S. Seung, Algorithms for non-negative matrix factorization, in: NIPS, 2000, pp. 556-562.
[50] Levine, E.; Domany, E., Resampling method for unsupervised estimation of cluster validity, Neural computation, 13, 2573-2593, (2001) · Zbl 0993.68113
[51] McShane, L.M.; Radmacher, M.D.; Freidlin, B.; Yu, R.; Li, M.-C.; Simon, R., Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data, Bioinformatics, 18, 1462-1469, (2002)
[52] Mehta, T.; Tanik, M.; Allison, D.B., Towards sound epistemological foundations of statistical methods for high-dimensional biology, Nature genetics, 36, 943-947, (2004)
[53] Monti, S.; Tamayo, P.; Mesirov, J.; Golub, T., Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data, Machine learning, 52, 91-118, (2003) · Zbl 1039.68103
[54] Papadimitriou, C.H.; Steiglitz, K., Combinatorial optimization: algorithms and complexity, (1982), Prentice-Hall · Zbl 0503.90060
[55] Perou, C.M.; Jeffrey, S.S.; van de Rijn, M.; Rees, C.A.; Eisen, M.B.; Ross, D.T.; Pergamenschikov, A.; Williams, C.F.; Zhu, S.X.; Lee, J.C.F.; Lashkari, D.; Shalon, D.; Brown, P.O.; Botstein, D., Distinctive gene expression patterns in human mammary epithelial cells and breast cancers, Proceedings of the national Academy of sciences of the united states of America, 96, 9212-9217, (1999)
[56] Pollack, J.R.; Perou, C.M.; Alizadeh, A.A.; Eisen, M.B.; Pergamenschikov, A.; Williams, C.F.; Jeffrey, S.S.; Botstein, D.; Brown, P.O., Genome-wide analysis of DNA copy-number changes using cdna microarrays, Nature genetics, 23, 41-46, (1999)
[57] Giancarlo, R.; Scaturro, D.; Utro, F., Statistical indices for computational and data driven class discovery in microarray data, (), 295-335
[58] Raviv, Y.; Intrator, N., Bootstrapping with noise: an effective regularization technique, Connection science, 8, 355-372, (1996)
[59] Van Rijsbergen, C., Information retrieval, (1979), Butterworths London · Zbl 1095.68030
[60] Ross, D.T.; Scherf, U.; Eisen, M.B.; Perou, C.M.; Spellman, P.; Iyer, V.; Jeffrey, S.S.; van de Rijn, M.; Walthama, M.; Pergamenschikov, A.; Lee, J.C.F.; Lashkari, D.; Shalon, D.; Myers, T.G.; Weistein, J.N.; Botstein, D.; Brown, P.O., Systematic variation in gene expression patterns in human cancer cell lines, Nature genetics, 24, 227-235, (2000)
[61] V. Roth, T. Lange, M. Braun, J. Buhmann, A resampling approach to cluster validation, in: Proceedings 15th Symposium in Computational Statistics, 2002, pp. 123-128.
[62] S. Salvador, P. Chan, Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms, in: 16th IEEE International Conference on Tools with Artificial Intelligence, 2004, ICTAI 2004, 2004, pp. 576-584.
[63] W.S. Sarle, Cubic clustering criterion, Technical report, SAS, 1983.
[64] Smolkin, M.; Ghosh, D., Cluster stability scores for microarray data in cancer studies, BMC bioinformatics, 4, (2003)
[65] Strauss, R.E., Statistical significance of species clusters in association analysis, Ecology, 63, 634-639, (1978)
[66] Tibshirani, R.; Walther, G.; Hastie, T., Estimating the number of clusters in a dataset via the gap statistics, Journal royal statistical society B., 2, 411-423, (2001) · Zbl 0979.62046
[67] F. Utro, Algorithms for internal validation clustering measures in the Post-genomic era, Doctoral Dissertation, University of Palermo. http://arxiv.org/abs/1102.2915v1, 2011.
[68] Valentini, G., Mosclust: a software library for discovering significant structures in bio-molecular data, Bioinformatics, 23, 387-389, (2007)
[69] Vassiliou, A.; Ignatiades, L.; Karydis, M., Clustering of transect phytoplankton collections with a quick randomization algorithm, Journal of experimental marine biology and ecology, 130, 135-145, (1989)
[70] Wolfinger, R.D.; Gibson, G.; Wolfinger, E.D.; Bennet, L.; Hamadeh, H.; Bushel, C.A.; Paules, R.S., Assessing gene significance from cdna microarray expression data via mixed models, Journal of computational biology, 625-637, (2001)
[71] K.Y. Yeung, Cluster analysis of gene expression data, Ph.D. Thesis, University of Washington, 2001.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.