
Issues of robustness and high dimensionality in cluster analysis. (English) Zbl 1437.62011

Rizzi, Alfredo (ed.) et al., COMPSTAT. Proceedings in computational statistics. 17th symposium held in Rome, Italy, August 28 – September 1, 2006. With CD-Rom. Heidelberg: Physica-Verlag. 3-15 (2006).
Summary: Finite mixture models are being increasingly used to model the distributions of a wide variety of random phenomena. While normal mixture models are often used to cluster data sets of continuous multivariate data, a more robust clustering can be obtained by considering the \(t\) mixture model-based approach. Mixtures of factor analyzers enable model-based density estimation to be undertaken for high-dimensional data where the number of observations \(n\) is not very large relative to their dimension \(p\). As the approach using the multivariate normal family of distributions is sensitive to outliers, it is more robust to adopt the multivariate \(t\) family for the component error and factor distributions. The computational aspects associated with robustness and high dimensionality in these approaches to cluster analysis are discussed and illustrated.
For the entire collection see [Zbl 1097.62502].
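
The two ideas in the summary can be illustrated with a minimal sketch, which is not the implementation used in the paper. It assumes a single multivariate \(t\) component with fixed degrees of freedom `nu`, fitted by the standard EM weights \(u_j = (\nu + p)/(\nu + d_j^2)\), where \(d_j^2\) is the squared Mahalanobis distance of observation \(j\). It also assumes the usual factor-analytic covariance \(BB^T + D\) with \(q\) factors when counting parameters. The function names `t_weights`, `fit_t_location_scatter` and `covariance_param_counts` are hypothetical, chosen only for this illustration.

```python
import numpy as np


def t_weights(X, mu, Sigma, nu):
    """EM weights u_j = (nu + p) / (nu + d_j^2) for one multivariate t component.

    Observations far from mu (large Mahalanobis distance d_j^2) get small weights.
    """
    p = X.shape[1]
    diff = X - mu
    d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(Sigma), diff)
    return (nu + p) / (nu + d2)


def fit_t_location_scatter(X, nu=4.0, n_iter=100):
    """Fit location and scatter of a single multivariate t with fixed nu by EM.

    Returns the estimates and the weights from the last E-step.
    """
    n = X.shape[0]
    mu = X.mean(axis=0)               # start from the non-robust sample estimates
    Sigma = np.cov(X, rowvar=False)
    for _ in range(n_iter):
        u = t_weights(X, mu, Sigma, nu)                  # E-step
        mu = (u[:, None] * X).sum(axis=0) / u.sum()      # weighted mean
        diff = X - mu
        Sigma = (u[:, None] * diff).T @ diff / n         # weighted scatter
    return mu, Sigma, u


def covariance_param_counts(p, q):
    """Free parameters in an unrestricted p x p covariance matrix versus
    a factor-analytic covariance B B^T + D with q factors."""
    full = p * (p + 1) // 2
    factor = p * q + p - q * (q - 1) // 2
    return full, factor


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 200 standard normal points plus 10 gross outliers (illustrative data only)
    X = np.vstack([rng.standard_normal((200, 2)), np.full((10, 2), 20.0)])
    mu_t, Sigma_t, u = fit_t_location_scatter(X)
    print("sample mean:", X.mean(axis=0))
    print("t-based location:", mu_t)
    print("mean weight, inliers vs outliers:", u[:200].mean(), u[200:].mean())
    print("covariance parameters (full, factor) for p=50, q=2:",
          covariance_param_counts(50, 2))
```

In a \(t\) mixture the same weights appear within each component, multiplied by the posterior component probabilities. The parameter count shows why a factor-analytic covariance is attractive when \(n\) is not large relative to \(p\).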

MSC:

62-08 Computational methods for problems pertaining to statistics
62H30 Classification and discrimination; cluster analysis (statistical aspects)

References:

[1] Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821 (1993) · Zbl 0794.62034 · doi:10.2307/2532201
[2] Campbell, N.A.: Mixture models and atypical values. Math. Geol., 16, 465-477 (1984) · doi:10.1007/BF01886327
[3] Chang, W.C.: On using principal components before separating a mixture of two multivariate normal distributions. Appl. Stat., 32, 267-275 (1983) · Zbl 0538.62050 · doi:10.2307/2347949
[4] Coleman, D., Dong, X., Hardin, J., Rocke, D.M., Woodruff, D.L.: Some computational issues in cluster analysis with no a priori metric. Comp. Stat. Data Anal., 31, 1-11 (1999) · Zbl 0942.62068 · doi:10.1016/S0167-9473(99)00009-2
[5] Davies, P.L., Gather, U.: Breakdown and groups (with discussion). Ann. Stat., 33, 977-1035 (2005) · Zbl 1077.62041 · doi:10.1214/009053604000001138
[6] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. B, 39, 1-38 (1977) · Zbl 0364.62022
[7] Donoho, D.L., Huber, P.J.: The notion of breakdown point. In: Bickel, P.J., Doksum, K.A., Hodges, J.L. (eds) A Festschrift for Erich L. Lehmann. Wadsworth, Belmont, CA (1983) · Zbl 0523.62032
[8] Fokoué, E., Titterington, D.M.: Mixtures of factor analyzers. Bayesian estimation and inference by stochastic simulation. Mach. Learn., 50, 73-94 (2002) · Zbl 1033.68085 · doi:10.1023/A:1020297828025
[9] Ghahramani, Z., Hinton, G.E.: The EM algorithm for mixtures of factor analyzers. Technical Report, University of Toronto (1997)
[10] Hadi, A.S., Luceño, A.: Maximum trimmed likelihood estimators: a unified approach, examples, and algorithms. Comp. Stat. Data Anal., 25, 251-272 (1997) · Zbl 0900.62119 · doi:10.1016/S0167-9473(97)00011-X
[11] Hampel, F.R.: A general qualitative definition of robustness. Ann. Math. Stat., 42, 1887-1896 (1971) · Zbl 0229.62041
[12] Hartigan, J.A.: Statistical theory in clustering. J. Classif., 2, 63-76 (1985) · Zbl 0575.62058 · doi:10.1007/BF01908064
[13] Hennig, C.: Breakdown points for maximum likelihood estimators of location-scale mixtures. Ann. Stat., 32, 1313-1340 (2004) · Zbl 1047.62063 · doi:10.1214/009053604000000571
[14] Hinton, G.E., Dayan, P., Revow, M.: Modeling the manifolds of images of handwritten digits. IEEE Trans. Neur. Networks, 8, 65-73 (1997)
[15] Huber, P.J.: Robust Statistics. Wiley, New York (1981) · Zbl 0536.62025
[16] Kent, J.T., Tyler, D.E., Vardi, Y.: A curious likelihood identity for the multivariate t-distribution. Comm. Stat. Sim Comp., 23, 441-453 (1994) · Zbl 0825.62035 · doi:10.1080/03610919408813180
[17] Kotz, S., Nadarajah, S.: Multivariate t distributions and their applications. Cambridge University Press, New York (2004) · Zbl 1100.62059
[18] Lawley, D.N., Maxwell, A.E.: Factor Analysis as a Statistical Method. Butterworths, London (1971) · Zbl 0251.62042
[19] Little, R.J.A., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley, New York (1987) · Zbl 0665.62004
[20] Liu, C.: ML estimation of the multivariate t distribution and the EM algorithm. J. Multiv. Anal., 63, 296-312 (1997) · Zbl 0884.62059 · doi:10.1006/jmva.1997.1703
[21] Liu, C., Rubin, D.B.: The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika, 81, 633-648 (1994) · Zbl 0812.62028 · doi:10.1093/biomet/81.4.633
[22] Liu, C., Rubin, D.B.: ML estimation of the t distribution using EM and its extensions, ECM and ECME. Statistica Sinica, 5, 19-39 (1995) · Zbl 0824.62047
[23] Liu, C., Rubin, D.B., Wu, Y.N.: Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika, 85, 755-770 (1998) · Zbl 0921.62071 · doi:10.1093/biomet/85.4.755
[24] Markatou, M.: Mixture models, robustness and the weighted likelihood methodology. Biometrics, 56, 483-486 (2000) · Zbl 1060.62511
[25] Markatou, M., Basu, A., Lindsay, B.G.: Weighted likelihood equations with bootstrap root search. J. Amer. Stat. Assoc., 93, 740-750 (1998) · Zbl 0918.62046 · doi:10.2307/2670124
[26] McLachlan, G.J., Basford, K.E.: Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York (1988) · Zbl 0697.62050
[27] McLachlan, G.J., Peel, D.: Robust cluster analysis via mixtures of multivariate t distributions. Lec. Notes Comput. Sci., 1451, 658-666 (1998) · doi:10.1007/BFb0033290
[28] McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, New York (2000) · Zbl 0963.62061
[29] McLachlan, G.J., Peel, D.: Mixtures of factor analyzers. In: Langley, P. (ed) Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco (2000) · Zbl 1256.62036
[30] McLachlan, G.J., Bean, R.W., Ben-Tovim Jones, L.: Extension of mixture of factor analyzers model to incorporate the multivariate t distribution. To appear in Comp. Stat. Data Anal. (2006) · Zbl 1445.62053
[31] McLachlan, G.J., Ng, S.-K., Bean, R.W.: Robust cluster analysis via mixture models. To appear in Aust. J. Stat. (2006)
[32] McLachlan, G.J., Peel, D., Bean, R.: Modelling high-dimensional data by mixtures of factor analyzers. Comp. Stat. Data Anal., 41, 379-388 (2003) · Zbl 1256.62036 · doi:10.1016/S0167-9473(02)00183-4
[33] Meng, X.L., van Dyk, D.: The EM algorithm—an old folk song sung to a fast new tune (with discussion). J. R. Stat. Soc. B, 59, 511-567 (1997) · Zbl 1090.62518 · doi:10.1111/1467-9868.00082
[34] Meng, X.L., Rubin, D.B.: Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika, 80, 267-278 (1993) · Zbl 0778.62022 · doi:10.1093/biomet/80.2.267
[35] Müller, C.H., Neykov, N.: Breakdown points of trimmed likelihood estimators and related estimators in generalized linear models. J. Stat. Plann. Infer., 116, 503-519 (2004) · Zbl 1178.62074 · doi:10.1016/S0378-3758(02)00265-3
[36] Neykov, N., Filzmoser, P., Dimova, R., Neytchev, P.: Compstat 2004, Proceedings Computational Statistics. Physica-Verlag, Vienna (2004)
[37] Peel, D., McLachlan, G.J.: Robust mixture modelling using the t distribution. Stat. Comput., 10, 335-344 (2000) · doi:10.1023/A:1008981510081
[38] Rocke, D.M.: Robustness properties of S-estimators of multivariate location and shape in high dimension. Ann. Stat., 24, 1327-1345 (1996) · Zbl 0862.62049 · doi:10.1214/aos/1032526972
[39] Rocke, D.M., Woodruff, D.L.: Identification of outliers in multivariate data. J. Amer. Stat. Assoc., 91, 1047-1061 (1996) · Zbl 0882.62049 · doi:10.2307/2291724
[40] Rocke, D.M., Woodruff, D.L.: Robust estimation of multivariate location and shape. J. Stat. Plann. Infer., 57, 245-255 (1997) · Zbl 0900.62281 · doi:10.1016/S0378-3758(96)00047-X
[41] Rubin, D.B.: Iteratively reweighted least squares. In: Kotz, S., Johnson, N.L., and Read, C.B. (eds) Encyclopedia of Statistical Sciences, Vol. 4. Wiley, New York (1983)
[42] Tibshirani, R., Knight, K.: Model search by bootstrap “bumping”. J. Comp. Graph. Stat., 8, 671-686 (1999) · doi:10.2307/1390820
[43] Tipping, M.E., Bishop, C.M.: Mixtures of probabilistic principal component analysers. Technical Report, Neural Computing Research Group, Aston University (1997)
[44] Vandev, D.L., Neykov, N.: About regression estimators with high breakdown point. Statistics, 32, 111-129 (1998) · Zbl 1077.62513
[45] Woodruff, D.L., Rocke, D.M.: Heuristic search algorithms for the minimum volume ellipsoid. J. Comp. Graph. Stat., 2, 69-95 (1993) · doi:10.2307/1390956
[46] Woodruff, D.L., Rocke, D.M.: Computable robust estimation of multivariate location and shape using compound estimators. J. Amer. Stat. Assoc., 89, 888- · Zbl 0825.62485 · doi:10.2307/2290913