Bock, H. H. On some significance tests in cluster analysis. (English) Zbl 0587.62048 J. Classif. 2, 77-108 (1985).

The author investigates the properties of several significance tests for distinguishing between the hypothesis H of a "homogeneous" population and an alternative A involving "clustering" or "heterogeneity", with emphasis on the case of multidimensional observations \(x_1,\dots,x_n \in \mathbb{R}^p\). Four types of test statistics are considered: the (s-th) largest gap between observations, their mean distance (or similarity), the minimum within-cluster sum of squares resulting from a k-means algorithm, and the resulting maximum F statistic. If, for a given significance level (error probability) \(a\), such a test statistic exceeds the corresponding critical value \(c=c(a)\), the hypothesis H of homogeneity is rejected (e.g., in favor of a clustering structure A). The asymptotic distributions under H are given for \(n\to\infty\), and the asymptotic power of the tests is derived for neighboring alternatives \(A=A_n\) approaching H. In particular, the asymptotic distribution of the maximum F statistic is obtained. Moreover, the asymptotic power of the gap test is characterized by a speed factor \((\log n)^{-1}\) (for \(A_n\) converging to H), and by a factor \(n^{-1/4}\) for tests based on the mean similarity.
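The decision rule described in the review (reject H when a statistic exceeds \(c(a)\)) can be illustrated with a minimal sketch for two of the four statistics. Everything here is an assumption made for illustration and not taken from the paper: the function names are hypothetical, the sketch is restricted to \(p=1\) and \(k=2\), homogeneity is modelled as Uniform(0, 1), and the critical value is approximated by Monte Carlo simulation rather than by the asymptotic distributions derived by the author.

```python
import numpy as np


def largest_gap(x, s=1):
    """Return the s-th largest gap between neighbouring sorted observations."""
    gaps = np.diff(np.sort(np.asarray(x, dtype=float)))
    return np.sort(gaps)[-s]


def min_within_ss_two_clusters(x):
    """Minimum within-cluster sum of squares over all 2-cluster partitions.

    For p = 1 an optimal 2-means partition splits the sorted sample at one
    cut point, so an exhaustive scan over the n - 1 cuts is exact.
    """
    x = np.sort(np.asarray(x, dtype=float))
    best = np.inf
    for m in range(1, len(x)):
        left, right = x[:m], x[m:]
        w = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        best = min(best, w)
    return best


def max_f_two_clusters(x):
    """Maximum F ratio (between / within mean squares) for k = 2 clusters."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    total = ((x - x.mean()) ** 2).sum()
    within = min_within_ss_two_clusters(x)
    return (total - within) / (within / (n - 2))


def mc_critical_value(stat, n, a=0.05, reps=500, seed=0):
    """Simulated (1 - a)-quantile of `stat` under H.

    Assumption: H is modelled as n i.i.d. Uniform(0, 1) observations.
    """
    rng = np.random.default_rng(seed)
    sims = np.array([stat(rng.uniform(size=n)) for _ in range(reps)])
    return np.quantile(sims, 1 - a)


# Hypothetical data: two well-separated groups on [0, 1].
rng = np.random.default_rng(1)
clustered = np.concatenate([rng.uniform(0.0, 0.2, 20), rng.uniform(0.8, 1.0, 20)])
n = len(clustered)

c_gap = mc_critical_value(largest_gap, n)
c_f = mc_critical_value(max_f_two_clusters, n)

print("gap test rejects H:", largest_gap(clustered) > c_gap)
print("max F test rejects H:", max_f_two_clusters(clustered) > c_f)
```

The gap statistic depends on the scale of the data, so the simulated null is only meaningful for observations already on \([0,1]\); the F ratio is invariant under shifts and rescaling. For \(p>1\) or \(k>2\) the exhaustive scan no longer applies, and a k-means algorithm would be needed, as in the paper.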
Reviewer: A. Krzyzak

Cited in 1 review. Cited in 19 documents.

MSC:
62F03 Parametric hypothesis testing
62F05 Asymptotic properties of parametric tests
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62E20 Asymptotic distribution theory in statistics

Keywords: cluster analysis; asymptotic normality; classification; significance tests; clustering; heterogeneity; mean distance; similarity; minimum within-cluster sum of squares; k-means algorithm; maximum F statistic; homogeneity; neighboring alternatives; asymptotic power; gap test