
Shrinkage clustering: a fast and size-constrained algorithm for biomedical applications. (English) Zbl 1443.92100

Schwartz, Russell (ed.) et al., 17th international workshop on algorithms in bioinformatics, WABI 2017, Boston, MA, USA, August 21–23, 2017. Proceedings. Wadern: Schloss Dagstuhl – Leibniz Zentrum für Informatik. LIPIcs – Leibniz Int. Proc. Inform. 88, Article 11, 13 p. (2017).
Summary: Motivation: Many common clustering algorithms require a two-step process that limits their efficiency. They must be run repeatedly and combined with a model selection criterion in order to determine both the number of clusters present in the data and the corresponding cluster memberships. As biomedical datasets grow in size and prevalence, there is an increasing need for new methods that are easier to implement and more computationally efficient. In addition, it is often essential to obtain clusters of sufficient sample size for the clustering result to be meaningful and interpretable in subsequent analysis.
Results: We introduce shrinkage clustering, a novel clustering algorithm based on matrix factorization that simultaneously finds the optimal number of clusters while partitioning the data. We report its performance across multiple simulated and real datasets, and demonstrate its accuracy and speed in applications to subtyping cancer and brain tissues. In addition, the algorithm offers a straightforward solution to clustering with cluster size constraints. Given its ease of implementation, computational efficiency and extensible structure, we believe shrinkage clustering can be applied broadly to biomedical clustering tasks, especially those involving large datasets.
For the entire collection see [Zbl 1372.68022].

MSC:

92C50 Medical applications (general)
62P10 Applications of statistics to biology and medical sciences; meta analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
