
Variable selection for clustering and classification. (English) Zbl 1360.62310

Summary: As data sets continue to grow in size and complexity, effective and efficient techniques are needed to target important features in the variable space. Many of the variable selection techniques that are commonly used alongside clustering algorithms are based upon determining the best variable subspace according to model fitting in a stepwise manner. These techniques are often computationally intensive and can require extended periods of time to run; in fact, some are prohibitively computationally expensive for high-dimensional data. In this paper, a novel variable selection technique is introduced for use in clustering and classification analyses that is both intuitive and computationally efficient. We focus largely on applications in mixture model-based learning, but the technique could be adapted for use with various other clustering/classification methods. Our approach is illustrated on both simulated and real data, highlighted by contrasting its performance with that of other comparable variable selection techniques on the real data sets.

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
65C60 Computational problems in statistics (MSC2010)

References:

[1] ANDREWS, J.L., and MCNICHOLAS, P.D. (2011a), “Extending Mixtures of Multivariate T-Factor Analyzers”, Statistics and Computing, 21(3), 361-373. · Zbl 1255.62171
[2] ANDREWS, J.L., and MCNICHOLAS, P.D. (2011b), “Mixtures of Modified T-Factor Analyzers for Model-Based Clustering, Classification, and Discriminant Analysis”, Journal of Statistical Planning and Inference, 141(4), 1479-1486. · Zbl 1204.62098
[3] BIERNACKI, C., CELEUX, G., and GOVAERT, G. (2000), “Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7), 719-725.
[4] BIERNACKI, C., CELEUX, G., GOVAERT, G., and LANGROGNET, F. (2006), “Model-Based Cluster and Discriminant Analysis with the MIXMOD Software”, Computational Statistics and Data Analysis, 51(2), 587-600. · Zbl 1157.62431
[5] BOUVEYRON, C., and BRUNET, C. (2012), “Simultaneous Model-Based Clustering and Visualization in the Fisher Discriminative Subspace”, Statistics and Computing, 22(1), 301-324. · Zbl 1322.62162
[6] DEAN, N., and RAFTERY, A.E. (2006), The Clustvarsel Package, R package version 0.2-4.
[7] FRALEY, C., and RAFTERY, A.E. (2002), “Model-Based Clustering, Discriminant Analysis, and Density Estimation”, Journal of the American Statistical Association, 97(458), 611-631. · Zbl 1073.62545
[8] FRALEY, C., and RAFTERY, A.E. (2003), “Enhanced Software for Model-Based Clustering, Density Estimation, and Discriminant Analysis: MCLUST”, Journal of Classification, 20, 263-286. · Zbl 1055.62071
[9] FRALEY, C., and RAFTERY, A.E. (2006, revised 2009), “MCLUST: Version 3 for R: Normal Mixture Modeling and Model-Based Clustering”, Technical Report 504, University of Washington, Department of Statistics.
[10] GHAHRAMANI, Z., and HINTON, G.E. (1997), “The EM Algorithm for Factor Analyzers”, Technical Report CRG-TR-96-1, University of Toronto, Toronto.
[11] HUBERT, L., and ARABIE, P. (1985), “Comparing Partitions”, Journal of Classification, 2, 193-218.
[12] HURLEY, C. (2012), gclus: Clustering Graphics, R package version 1.3.1, http://CRAN.R-project.org/package=gclus.
[13] KASS, R.E., and RAFTERY, A.E. (1995), “Bayes Factors”, Journal of the American Statistical Association, 90, 773-795. · Zbl 0846.62028
[14] MAUGIS, C., CELEUX, G., and MARTIN-MAGNIETTE, M.-L. (2009), “Variable Selection for Clustering with Gaussian Mixture Models”, Biometrics, 65(3), 701-709. · Zbl 1172.62021
[15] MCLACHLAN, G.J., BEAN, R.W., and PEEL, D. (2002), “A Mixture Model-Based Approach to the Clustering of Microarray Expression Data”, Bioinformatics, 18(3), 413-422.
[16] MCLACHLAN, G.J., and PEEL, D. (2000), “Mixtures of Factor Analyzers”, in Proceedings of the Seventeenth International Conference on Machine Learning, San Francisco: Morgan Kaufmann, pp. 599-606.
[17] MCNICHOLAS, P.D., and MURPHY, T.B. (2008), “Parsimonious Gaussian Mixture Models”, Statistics and Computing, 18, 285-296.
[18] MCNICHOLAS, P.D., and MURPHY, T.B. (2010), “Model-Based Clustering of Microarray Expression Data Via Latent Gaussian Mixture Models”, Bioinformatics, 26(21), 2705-2712. · Zbl 1203.82150
[19] MONTANARI, A., and VIROLI, C. (2010), “Heteroscedastic Factor Mixture Analysis”, Statistical Modelling, 10(4), 441-460. · Zbl 07256833
[20] QIU, W., and JOE, H. (2006), “Generation of Random Clusters with Specified Degree of Separation”, Journal of Classification, 23, 315-334. · Zbl 1336.62189
[21] R DEVELOPMENT CORE TEAM (2012), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing.
[22] RAFTERY, A.E., and DEAN, N. (2006), “Variable Selection for Model-Based Clustering”, Journal of the American Statistical Association, 101(473), 168-178. · Zbl 1118.62339
[23] RAND, W.M. (1971), “Objective Criteria for the Evaluation of Clustering Methods”, Journal of the American Statistical Association, 66, 846-850.
[24] SCHWARZ, G. (1978), “Estimating the Dimension of a Model”, The Annals of Statistics, 6(2), 461-464. · Zbl 0379.62005
[25] SCRUCCA, L. (2010), “Dimension Reduction for Model-Based Clustering”, Statistics and Computing, 20(4), 471-484.
[26] STREULI, H. (1973), “Der Heutige Stand der Kaffeechemie”, in Association Scientifique Internationale pour le Café, 6th International Colloquium on Coffee Chemistry, Bogotá, Colombia, pp. 61-72.
[27] TIPPING, M.E., and BISHOP, C.M. (1999), “Mixtures of Probabilistic Principal Component Analysers”, Neural Computation, 11(2), 443-482.
[28] VENABLES, W.N., and RIPLEY, B.D. (2002), Modern Applied Statistics with S (4th ed.), New York: Springer. · Zbl 1006.62003
[29] VIROLI, C. (2010), “Dimensionally Reduced Model-Based Clustering Through Mixtures of Factor Mixture Analyzers”, Journal of Classification, 27(3), 363-388. · Zbl 1337.62141
[30] WITTEN, D., and TIBSHIRANI, R. (2010), “A Framework for Feature Selection in Clustering”, Journal of the American Statistical Association, 105(490), 713-726. · Zbl 1392.62194
[31] WITTEN, D.M., and TIBSHIRANI, R. (2011), sparcl: Perform Sparse Hierarchical Clustering and Sparse K-means Clustering, R package version 1.0.2.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.