
Model-based clustering, classification, and discriminant analysis via mixtures of multivariate \(t\)-distributions. (English) Zbl 1252.62062

Summary: The last decade has seen an explosion of work on the use of mixture models for clustering. The use of the Gaussian mixture model has been common practice, with constraints sometimes imposed upon the component covariance matrices to give families of mixture models. Similar approaches have also been applied, albeit with less fecundity, to classification and discriminant analysis. We begin with an introduction to model-based clustering and a succinct account of the state of the art. We then put forth a novel family of mixture models wherein each component is modeled using a multivariate \(t\)-distribution with an eigen-decomposed covariance structure. This family, which is largely a \(t\)-analogue of the well-known MCLUST family, is known as the \(t\)EIGEN family. The efficacy of this family for clustering, classification, and discriminant analysis is illustrated with both real and simulated data. The performance of this family is compared to its Gaussian counterpart on three real data sets.
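
To make the phrase "multivariate \(t\)-distribution with an eigen-decomposed covariance structure" concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard MCLUST-style decomposition \(\Sigma_g = \lambda_g D_g A_g D_g^{\top}\), with \(\lambda_g\) a volume constant, \(D_g\) an orthogonal matrix of eigenvectors, and \(A_g\) a diagonal matrix with determinant 1. The function names, the dictionary layout, and all parameter values are hypothetical and chosen only for illustration. The sketch evaluates the component log-density directly from the decomposition and computes posterior component memberships for one observation; it does not fit the model.

```python
import math


def mvt_logpdf_eigen(x, mu, lam, a, D, nu):
    """Log density of a p-variate t-distribution whose scale matrix is
    Sigma = lam * D * diag(a) * D^T. Assumes the columns of D are
    orthonormal and the entries of a are positive with product 1."""
    p = len(x)
    resid = [xi - mi for xi, mi in zip(x, mu)]
    # Squared Mahalanobis distance: project the residual onto each
    # eigenvector (column j of D) and divide by the eigenvalue lam * a[j].
    delta = 0.0
    for j in range(p):
        proj = sum(D[i][j] * resid[i] for i in range(p))
        delta += proj * proj / (lam * a[j])
    log_det = p * math.log(lam)  # det(diag(a)) = 1, so |Sigma| = lam ** p
    const = (math.lgamma((nu + p) / 2.0) - math.lgamma(nu / 2.0)
             - 0.5 * p * math.log(math.pi * nu) - 0.5 * log_det)
    return const - 0.5 * (nu + p) * math.log1p(delta / nu)


def posterior_memberships(x, components):
    """Posterior probability that x belongs to each mixture component.
    Each component is a dict with keys pi, mu, lam, a, D, nu."""
    log_w = [
        math.log(c["pi"])
        + mvt_logpdf_eigen(x, c["mu"], c["lam"], c["a"], c["D"], c["nu"])
        for c in components
    ]
    top = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


# Illustrative two-component mixture in two dimensions (made-up values).
theta = math.pi / 4
D = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]
components = [
    {"pi": 0.5, "mu": [0.0, 0.0], "lam": 1.0, "a": [2.0, 0.5], "D": D, "nu": 5.0},
    {"pi": 0.5, "mu": [4.0, 4.0], "lam": 1.0, "a": [2.0, 0.5], "D": D, "nu": 5.0},
]
print(posterior_memberships([0.2, -0.1], components))
```

In this example both components share the same volume, shape, orientation, and degrees of freedom, which corresponds to one fully constrained member of such a family. Letting `lam`, `a`, `D`, or `nu` differ between components gives less constrained members.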

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H10 Multivariate distribution of statistics
65C60 Computational problems in statistics (MSC2010)

Software:

mclust; S-PLUS; PGMM; R
