Seo, Beomseok; Lin, Lin; Li, Jia
Block-wise variable selection for clustering via latent states of mixture models. (English) Zbl 07546465
J. Comput. Graph. Stat. 31, No. 1, 138-150 (2022).

Summary: Mixture modeling is a major paradigm for clustering in statistics. In this article, we develop a new block-wise variable selection method for clustering by exploiting the latent states of the hidden Markov model on variable blocks or of the Gaussian mixture model. The variable blocks are formed by a depth-first search on a dendrogram built from the mutual information between every pair of variables. We demonstrate that the latent states of the variable blocks, together with the mixture model parameters, represent the original data effectively and much more compactly. We therefore cluster the data using the latent states and select variables according to the relationship between the states and the clusters. Because true class labels are unknown in the unsupervised setting, we first generate more refined clusters, called semi-clusters, for variable selection, and then determine the final clusters from the dimension-reduced data. Experiments on simulated and real data show that the new method is highly competitive in clustering accuracy with several widely used methods. Supplementary materials for this article are available online.

MSC: 62-XX Statistics

Keywords: feature selection; semi-clusters; unsupervised learning; variable blocks

Software: clustvarsel; Silhouettes; VarSelLCM; dynamicTreeCut; sparcl; vscc; wskm

Cite: \textit{B. Seo} et al., J. Comput. Graph. Stat. 31, No. 1, 138--150 (2022; Zbl 07546465)
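The block-forming step in the summary — grouping variables by cutting a dendrogram built from pairwise mutual information — can be sketched as follows. This is an illustrative approximation, not the authors' implementation: it discretizes each variable to estimate mutual information from a contingency table, and it cuts the tree at a fixed block count rather than applying the paper's depth-first-search rule; the names `variable_blocks`, `n_bins`, and `n_blocks` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def _mutual_info(x, y):
    """Plug-in mutual information of two integer-coded variables."""
    _, xc = np.unique(x, return_inverse=True)
    _, yc = np.unique(y, return_inverse=True)
    joint = np.zeros((xc.max() + 1, yc.max() + 1))
    np.add.at(joint, (xc, yc), 1.0)          # joint contingency table
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def variable_blocks(X, n_bins=5, n_blocks=2):
    """Group the columns of X into variable blocks by cutting a
    dendrogram built from pairwise mutual information (sketch only;
    the paper cuts the tree with a depth-first-search rule)."""
    n, p = X.shape
    # Discretize each variable so MI can be read off a contingency table.
    binned = np.column_stack([
        np.digitize(X[:, j],
                    np.histogram_bin_edges(X[:, j], bins=n_bins)[1:-1])
        for j in range(p)])
    # Pairwise MI, turned into a distance: high MI -> small distance.
    mi = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            mi[i, j] = mi[j, i] = _mutual_info(binned[:, i], binned[:, j])
    dist = mi.max() - mi
    np.fill_diagonal(dist, 0.0)
    # Average-linkage dendrogram on the MI-derived distances,
    # cut into the requested number of variable blocks.
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_blocks, criterion="maxclust")
```

On data where two groups of variables are internally dependent but mutually independent, the cut recovers the two groups as separate blocks.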