Variable selection in model-based discriminant analysis. (English) Zbl 1219.62103

Summary: A general methodology for selecting predictors for Gaussian generative classification models is presented. The problem is regarded as a model selection problem. Three possible roles are considered for each predictor: a variable is either relevant for classification or irrelevant, and an irrelevant variable is either linearly dependent on a subset of the relevant predictors or independent of them. This variable selection model was inspired by previous work on variable selection in model-based clustering. A BIC-like model selection criterion is proposed. It is optimized through two embedded forward stepwise variable selection algorithms, one for classification and one for linear regression. The model identifiability and the consistency of the variable selection criterion are proved. Numerical experiments on simulated and real data sets illustrate the value of this variable selection methodology. In particular, it is shown that this well-grounded variable selection model can be very useful for improving the classification performance of quadratic discriminant analysis in a high-dimensional context.
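The summary does not spell out the criterion, so the following Python sketch only illustrates the general idea under strong simplifying assumptions: every non-selected variable is regressed on all selected variables (no separate regressor subset and no independent group), a single greedy forward pass is used, and BIC is taken as 2 log-likelihood minus (number of parameters) times log n. All function names (`bic_classification`, `bic_regression`, `criterion`, `forward_select`, `simulate`) are hypothetical and are not taken from the paper or from the Mixmod or mclust packages.

```python
import numpy as np

def gaussian_loglik(X, mean, cov):
    """Sum of multivariate normal log-densities over the rows of X."""
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return -0.5 * np.sum(d * np.log(2 * np.pi) + logdet + maha)

def bic_classification(X, z):
    """BIC of X given labels z, with a free mean and covariance per class.
    Class proportions are omitted: they do not depend on the variable subset."""
    n, d = X.shape
    loglik, n_params = 0.0, 0
    for k in np.unique(z):
        Xk = X[z == k]
        cov = np.atleast_2d(np.cov(Xk, rowvar=False, bias=True))
        loglik += gaussian_loglik(Xk, Xk.mean(axis=0), cov)
        n_params += d + d * (d + 1) // 2
    return 2 * loglik - n_params * np.log(n)

def bic_regression(Y, X):
    """BIC of a multivariate linear regression of Y on X (with intercept)."""
    n, q = Y.shape
    if q == 0:
        return 0.0
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    resid = Y - A @ coef
    cov = resid.T @ resid / n  # maximum likelihood residual covariance
    loglik = gaussian_loglik(resid, np.zeros(q), cov)
    n_params = q * A.shape[1] + q * (q + 1) // 2
    return 2 * loglik - n_params * np.log(n)

def criterion(X, z, selected):
    """Simplified criterion: classification BIC on the selected variables
    plus regression BIC of all remaining variables on the selected ones."""
    rest = [j for j in range(X.shape[1]) if j not in selected]
    bic_sel = bic_classification(X[:, selected], z) if selected else 0.0
    return bic_sel + bic_regression(X[:, rest], X[:, selected])

def forward_select(X, z):
    """Greedy forward search: add the variable that most increases the
    criterion, and stop as soon as no addition improves it."""
    selected, best = [], criterion(X, z, [])
    while len(selected) < X.shape[1]:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        scores = {j: criterion(X, z, selected + [j]) for j in candidates}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best:
            break
        selected.append(j_best)
        best = scores[j_best]
    return sorted(selected), best

def simulate(n=300, seed=0):
    """Toy data: columns 0 and 1 separate the classes, column 2 depends
    linearly on column 0, and column 3 is pure noise."""
    rng = np.random.default_rng(seed)
    z = np.repeat([0, 1], n // 2)
    x0 = rng.normal(3.0 * z, 1.0)
    x1 = rng.normal(-3.0 * z, 1.0)
    x2 = 0.5 * x0 + rng.normal(size=n)
    x3 = rng.normal(size=n)
    return np.column_stack([x0, x1, x2, x3]), z

if __name__ == "__main__":
    X, z = simulate()
    print(forward_select(X, z))
```

On the toy data, column 2 differs between classes only through column 0. Once column 0 is selected, the regression term should therefore account for column 2 more cheaply than a class-specific model would, so the search is expected to leave it out. This is the role that the summary assigns to irrelevant variables that depend linearly on relevant predictors.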


62H30 Classification and discrimination; cluster analysis (statistical aspects)
65C60 Computational problems in statistics (MSC2010)


Mixmod; PRMLT; mclust


[1] Banfield, J.D.; Raftery, A.E., Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 803-821, (1993) · Zbl 0794.62034
[2] Bensmail, H.; Celeux, G., Regularized Gaussian discriminant analysis through eigenvalue decomposition, Journal of the American statistical association, 91, 1743-1748, (1996) · Zbl 0885.62068
[3] Biernacki, C.; Celeux, G.; Govaert, G.; Langrognet, F., Model-based cluster and discriminant analysis with the MIXMOD software, Computational statistics and data analysis, 51, 587-600, (2006) · Zbl 1157.62431
[4] Bishop, C.M., Pattern recognition and machine learning, (2006), Springer New York · Zbl 1107.68072
[5] Celeux, G.; Govaert, G., Gaussian parsimonious clustering models, Pattern recognition, 28, 781-793, (1995)
[6] Fraley, C.; Raftery, A.E., Enhanced software for model-based clustering, density estimation, and discriminant analysis: MCLUST, Journal of classification, 20, 263-286, (2003) · Zbl 1055.62071
[7] Golub, T.R.; Slonim, D.K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.P.; Coller, H.; Loh, M.L.; Downing, J.R.; Caligiuri, M.A.; Bloomfield, C.D.; Lander, E.S., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286, 531-537, (1999)
[8] Guyon, I.; Elisseeff, A., An introduction to variable and feature selection, Journal of machine learning research, 3, 1157-1182, (2003) · Zbl 1102.68556
[9] Hastie, T.; Tibshirani, R.; Friedman, J., The elements of statistical learning, (2009), Springer New York
[10] Krishnapuram, B.; Carin, L.; Hartemink, A., Gene expression analysis: joint feature selection and classifier design, ()
[11] Mary-Huard, T.; Robin, S., Tailored aggregation for classification, IEEE transactions on pattern analysis and machine intelligence, 31, 2098-2105, (2009)
[12] Mary-Huard, T.; Robin, S.; Daudin, J.J., A penalized criterion for variable selection in classification, Journal of multivariate analysis, 98, 695-705, (2007) · Zbl 1118.62066
[13] Maugis, C.; Celeux, G.; Martin-Magniette, M.L., Variable selection for clustering with Gaussian mixture models, Biometrics, 65, 701-709, (2009) · Zbl 1172.62021
[14] Maugis, C.; Celeux, G.; Martin-Magniette, M.L., Variable selection in model-based clustering: A general variable role modeling, Computational statistics and data analysis, 53, 3872-3882, (2009) · Zbl 1453.62154
[15] C. Maugis, G. Celeux, M.L. Martin-Magniette, Variable selection in model-based discriminant analysis, Technical Report RR-7290, INRIA, 2010. · Zbl 1219.62103
[16] McLachlan, G., Discriminant analysis and statistical pattern recognition, (1992), Wiley-Interscience New York
[17] Murphy, B.T.; Raftery, A.E.; Dean, N., Variable selection and updating in model-based discriminant analysis for high-dimensional data with food authenticity applications, Annals of applied statistics, 4, 396-421, (2010) · Zbl 1189.62105
[18] Raftery, A.E.; Dean, N., Variable selection for model-based clustering, Journal of the American statistical association, 101, 168-178, (2006) · Zbl 1118.62339
[19] Schwarz, G., Estimating the dimension of a model, The annals of statistics, 6, 461-464, (1978) · Zbl 0379.62005
[20] Su, Y.; Murali, T.; Pavlovic, V.; Schaffer, M.; Kasif, S., RankGene: identification of diagnostic genes based on expression data, Bioinformatics, 19, 1578-1579, (2003)
[21] Yang, A.J.; Xin-Yuan, S., Bayesian variable selection for disease classification using gene expression data, Bioinformatics, 26, 215-222, (2010)
[22] Young, D.M.; Odell, P.L., Feature-subset selection for statistical classification problems involving unequal covariance matrices, Communications in statistics – theory and methods, 15, 137-157, (1986) · Zbl 0607.62073
[23] Q. Zhang, H. Wang, A BIC criterion for Gaussian mixture model selection with application in discriminant analysis, Technical Report, Guanghua School of Management, Peking University, 2008.