A simple model-based approach to variable selection in classification and clustering. (English. French summary) Zbl 1328.62388

Summary: Clustering and classification of replicated data are often performed either with classical techniques that inappropriately treat the data as unreplicated or with complex modern ones that are computationally demanding. In this paper, we introduce a simple approach based on a “spike-and-slab” mixture model that is fast and automatic, allows classification, clustering and variable selection in a single framework, and can handle replicated or unreplicated data. Simulation shows that our approach compares well with other recently proposed methods. The ideas are illustrated by application to microarray and metabolomic data.

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62F15 Bayesian inference
