zbMATH — the first resource for mathematics

Statistical theory in clustering. (English) Zbl 0575.62058
A number of statistical models for forming and evaluating clusters are reviewed. Hierarchical algorithms are evaluated by their ability to discover high density regions in a population, and complete linkage hopelessly fails; the others don’t do too well either. Single linkage is at least of mathematical interest because it is related to the minimum spanning tree and percolation.
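The relation between single linkage and the minimum spanning tree mentioned above can be illustrated with a minimal Python sketch. It assumes Euclidean distances and a brute-force Kruskal construction. The names `DisjointSet`, `minimum_spanning_tree` and `single_linkage_clusters` are hypothetical and are not taken from the reviewed paper. Deleting every tree edge longer than a chosen level leaves the single-linkage clusters at that level.

```python
from itertools import combinations
from math import dist


class DisjointSet:
    """Union-find structure used to track connected components."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        """Merge the components of i and j; return False if already merged."""
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return False
        self.parent[ri] = rj
        return True


def minimum_spanning_tree(points):
    """Kruskal's algorithm on the complete Euclidean graph.

    Returns the tree edges as (length, i, j), shortest first.
    """
    n = len(points)
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    components = DisjointSet(n)
    # An edge joins the tree only if it connects two separate components.
    return [(length, i, j) for length, i, j in edges if components.union(i, j)]


def single_linkage_clusters(points, threshold):
    """Single-linkage clusters at one level: drop MST edges longer than threshold."""
    components = DisjointSet(len(points))
    for length, i, j in minimum_spanning_tree(points):
        if length <= threshold:
            components.union(i, j)
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(components.find(i), []).append(i)
    return sorted(clusters.values())


# Illustrative data (an assumption, not from the paper): two tight pairs and one outlier.
points = [(0, 0), (0, 1), (5, 0), (5, 1), (10, 0)]
print(single_linkage_clusters(points, threshold=2))  # [[0, 1], [2, 3], [4]]
```

The sorted edge lengths of the tree are the heights at which single linkage merges clusters, so the whole hierarchy can be read off the tree without recomputing distances.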
Mixture methods are examined and related to k-means, and the failure of likelihood tests for the number of components is noted. The dip test for estimating the number of modes in a univariate population measures the distance between the empirical distribution function and the closest unimodal distribution function (or the closest k-modal distribution function when testing for k modes). Its properties are examined and multivariate extensions are proposed. Ultrametric and evolutionary distances on trees are considered briefly.
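In symbols, the dip statistic described above can be sketched as follows. The notation is an assumption made for illustration and does not come from this record: $F_n$ is the empirical distribution function, $\mathcal{U}$ the class of unimodal distribution functions, $\mathcal{U}_k$ the class of k-modal ones, and the distance is taken to be the supremum distance.

```latex
\[
  D(F_n) \;=\; \inf_{G \in \mathcal{U}} \, \sup_{x} \, \lvert F_n(x) - G(x) \rvert ,
  \qquad
  D_k(F_n) \;=\; \inf_{G \in \mathcal{U}_k} \, \sup_{x} \, \lvert F_n(x) - G(x) \rvert .
\]
```

Under this reading, a large value of $D(F_n)$ is evidence against unimodality, and a large value of $D_k(F_n)$ is evidence of more than k modes.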

62H30 Classification and discrimination; cluster analysis (statistical aspects)
[1] BAKER, F.B. (1974), “Stability of Two Hierarchical Grouping Techniques, Case I: Sensitivity to Data Errors,” Journal of the American Statistical Association, 69, 440–445.
[2] BINDER, D.A. (1978), Comment on ‘Estimating Mixtures of Normal Distributions and Switching Regressions’, Journal of the American Statistical Association, 73, 746–747.
[3] BROADBENT, S.R., and HAMMERSLEY, J.M. (1957), “Percolation Processes, I: Crystals and Mazes,” Proceedings of the Cambridge Philosophical Society, 53, 629–641. · Zbl 0091.13901
[4] DAY, N.E. (1969), “Estimating the Components of a Mixture of Normal Distributions,” Biometrika, 56, 463–474. · Zbl 0183.48106
[5] DICK, N.P., and BOWDEN, D.C. (1973), “Maximum Likelihood Estimation for Mixture of Two Normal Distributions,” Biometrics, 29, 781–790.
[6] EVERITT, B.S., and HAND, D.J. (1981), Finite Mixture Distributions, London: Chapman and Hall. · Zbl 0466.62018
[7] FITCH, W.M., and MARGOLIASH, E. (1967), “Construction of Phylogenetic Trees,” Science N.Y., 155, 279–284.
[8] GOWER, J.C., and ROSS, G.J.S. (1969), “Minimum Spanning Trees and Single Linkage Cluster Analysis,” Applied Statistics, 18, 54–65.
[9] HARTIGAN, J.A. (1967), “Representation of Similarity Matrices by Trees,” Journal of the American Statistical Association, 62, 1140–1158.
[10] HARTIGAN, J.A. (1975), Clustering Algorithms, New York: John Wiley. · Zbl 0372.62040
[11] HARTIGAN, J.A. (1977), “Distribution Problems in Clustering,” in Classification and Clustering, ed. J. Van Ryzin, New York: Academic Press.
[12] HARTIGAN, J.A. (1978), “Asymptotic Distributions for Clustering Criteria,” The Annals of Statistics, 6, 117–131. · Zbl 0377.62033
[13] HARTIGAN, J.A. (1981), “Consistency of Single Linkage for High Density Clusters,” Journal of the American Statistical Association, 76, 388–394. · Zbl 0468.62053
[14] HARTIGAN, J.A., and HARTIGAN, P.M. (1984), “The Dip Test of Multimodality,” The Annals of Statistics, submitted. · Zbl 0575.62045
[15] HOSMER, D.W. (1973), “A Comparison of Iterative Maximum Likelihood Estimates of the Parameters of a Mixture of Two Normal Distributions under Three Different Types of Sample,” Biometrics, 29, 761–770.
[16] JARDINE, C.J., JARDINE, N., and SIBSON, R. (1967), “The Structure and Construction of Taxonomic Hierarchies,” Math. Biosciences, 1, 173–179. · Zbl 0163.14604
[17] JOHNSON, S.C. (1967), “Hierarchical Clustering Schemes,” Psychometrika, 32, 241–254. · Zbl 1367.62191
[18] LING, R.F. (1973), “A Probability Theory of Cluster Analysis,” Journal of the American Statistical Association, 68, 159–169. · Zbl 0285.62035
[19] MACQUEEN, J. (1967), “Some Methods for Classification and Analysis of Multivariate Observations,” Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281–297.
[20] POLLARD, D. (1982), “A Central Limit Theorem for k-means Clustering,” Annals of Probability, 10, 919–926. · Zbl 0502.62055
[21] RAO, C.R. (1948), “The Utilization of Multiple Measurements in Problems of Biological Classification,” Journal of the Royal Statistical Society, Series B, 10, 159–203. · Zbl 0034.07902
[22] SMYTHE, R.T., and WIERMAN, J.C. (1978), “First Passage Percolation on the Square Lattice,” Lecture Notes in Mathematics, 671, Berlin: Springer-Verlag. · Zbl 0379.60001
[23] WISHART, D. (1969), “Mode Analysis: A Generalization of Nearest Neighbor Which Reduces Chaining Effects,” in Numerical Taxonomy, ed. A. J. Cole, London: Academic Press.
[24] WOLFE, J.H. (1970), “Pattern Clustering by Multivariate Analysis,” Multivariate Behavioral Research, 5, 329–350.
[25] WOLFE, J.H. (1971), “A Monte-Carlo Study of the Sampling Distribution of the Likelihood Ratio for Mixtures of Multinormal Distributions,” Research Memorandum, 72–2, Naval Personnel and Research Training Laboratory, San Diego.
[26] WONG, M.A. (1982), “A Hybrid Clustering Algorithm for Identifying High Density Clusters,” Journal of the American Statistical Association, 77, 841–847. · Zbl 0507.62061
[27] WONG, M.A., and LANE, T. (1983), “A kth Nearest Neighbor Clustering Procedure,” Journal of the Royal Statistical Society, Series B, 45, 362–368. · Zbl 0535.62055
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.