High-dimensional data clustering. (English) Zbl 1452.62433

Summary: Clustering in high-dimensional spaces is a difficult problem that recurs in many domains, for example in image analysis. The difficulty stems from the fact that high-dimensional data usually live in different low-dimensional subspaces hidden in the original space. A family of Gaussian mixture models designed for high-dimensional data, which combines the ideas of subspace clustering and parsimonious modeling, is presented. These models give rise to a clustering method based on the expectation-maximization (EM) algorithm, called high-dimensional data clustering (HDDC). In order to fit the data correctly, HDDC estimates the specific subspace and the intrinsic dimension of each group. Experiments on artificial and real data sets show that HDDC outperforms existing methods for clustering high-dimensional data.
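The idea described above can be illustrated with a minimal, hypothetical NumPy sketch of an HDDC-style EM: each cluster's covariance is rebuilt from a few leading eigenvectors (its specific subspace), with the discarded eigenvalues pooled into a single noise variance. The function name `hddc_like_em`, the fixed subspace dimension `d_max`, and the `init_means` argument are illustrative assumptions, not the authors' implementation (which, for instance, selects each cluster's intrinsic dimension via the scree test rather than fixing it).

```python
import numpy as np

def hddc_like_em(X, K, d_max=2, n_iter=30, init_means=None, seed=0):
    """Toy EM for a Gaussian mixture whose per-cluster covariance is
    modeled through a low-dimensional subspace (HDDC-like sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if init_means is None:  # hypothetical default: K random data points
        init_means = X[rng.choice(n, K, replace=False)]
    # initial hard assignment to the nearest initial mean
    dists = ((X[:, None, :] - init_means[None, :, :]) ** 2).sum(-1)
    resp = np.zeros((n, K))
    resp[np.arange(n), dists.argmin(1)] = 1.0
    for _ in range(n_iter):
        # --- M-step: weights, means, and subspace-constrained covariances
        Nk = resp.sum(0) + 1e-10
        pi = Nk / n
        mu = (resp.T @ X) / Nk[:, None]
        params = []
        for k in range(K):
            Xc = X - mu[k]
            S = (Xc * resp[:, [k]]).T @ Xc / Nk[k]
            w, V = np.linalg.eigh(S)          # ascending eigenvalues
            w, V = w[::-1], V[:, ::-1]        # descending
            d = min(d_max, p - 1)             # cluster subspace dimension
            b = max(w[d:].mean(), 1e-8)       # pooled noise variance
            lam = np.concatenate([np.maximum(w[:d], b), np.full(p - d, b)])
            params.append((V, lam))
        # --- E-step: responsibilities from the reconstructed Gaussians
        # (the constant -p/2 * log(2*pi) cancels across clusters)
        logr = np.empty((n, K))
        for k in range(K):
            V, lam = params[k]
            Z = (X - mu[k]) @ V               # coordinates in eigenbasis
            logr[:, k] = (np.log(pi[k]) - 0.5 * np.log(lam).sum()
                          - 0.5 * (Z ** 2 / lam).sum(1))
        logr -= logr.max(1, keepdims=True)
        resp = np.exp(logr)
        resp /= resp.sum(1, keepdims=True)
    return resp.argmax(1), mu
```

On two well-separated synthetic groups, `hddc_like_em(X, 2)` recovers the partition; the subspace constraint keeps the number of covariance parameters linear in `p` instead of quadratic, which is the parsimony argument made in the summary.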

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62-08 Computational methods for problems pertaining to statistics

Software:

ARPACK
