Variable selection for clustering and classification. (English) Zbl 1360.62310

Summary: As data sets continue to grow in size and complexity, effective and efficient techniques are needed to target important features in the variable space. Many of the variable selection techniques that are commonly used alongside clustering algorithms are based upon determining the best variable subspace according to model fitting in a stepwise manner. These techniques are often computationally intensive and can require extended periods of time to run; in fact, some are prohibitively computationally expensive for high-dimensional data. In this paper, a novel variable selection technique is introduced for use in clustering and classification analyses that is both intuitive and computationally efficient. We focus largely on applications in mixture model-based learning, but the technique could be adapted for use with various other clustering/classification methods. Our approach is illustrated on both simulated and real data, highlighted by contrasting its performance with that of other comparable variable selection techniques on the real data sets.


62H30 Classification and discrimination; cluster analysis (statistical aspects)
65C60 Computational problems in statistics (MSC2010)

