A variable-selection heuristic for K-means clustering. (English) Zbl 1293.62237

Summary: One of the most vexing problems in cluster analysis is the selection and/or weighting of variables in order to include those that truly define cluster structure, while eliminating those that might mask such structure. This paper presents a variable-selection heuristic for nonhierarchical (K-means) cluster analysis based on the adjusted Rand index for measuring cluster recovery. The heuristic was subjected to Monte Carlo testing across more than 2200 datasets with known cluster structure. The results indicate the heuristic is extremely effective at eliminating masking variables. A cluster analysis of real-world financial services data revealed that using the variable-selection heuristic prior to the K-means algorithm resulted in greater cluster stability.


62P15 Applications of statistics to psychology
62H30 Classification and discrimination; cluster analysis (statistical aspects)


Full Text: DOI


[1] Anderberg, M.R. (1973).Cluster analysis for applications. New York, NY: Academic Press. · Zbl 0299.62029
[2] Arabie, P., & Hubert, L.J. (1994). Cluster analysis in marketing research. In R.P. Bagozzi (Ed),Advanced methods in marketing research (pp. 160–189). Oxford, England: Blackwell.
[3] Arabie, P., & Hubert, L.J. (1996). An overview of combinatorial data analysis. In P. Arabie, L.J. Hubert, & G. De Soete (Eds),Clustering and classification (pp. 5–63). River Edge, NJ: World Scientific Publishing. · Zbl 0902.62005
[4] Art, D., Gnanadesikan, R., & Kettenring, J.R. (1982). Data-based metrics for cluster analysis.Utilitas Mathematica, Series A, 21, 75–99. · Zbl 0501.62050
[5] Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., & Lewis, P.A. (1994). A study of the classification capabilities of neural networks using unsupervised learning: A comparison with K-means clustering.Psychometrika, 59, 509–525. · Zbl 0850.92057
[6] Balasubramanian, S., Gupta, S., Kamakura, W., & Wedel, M. (1998). Modelling large data sets in marketing.Statistica Neerlandica, 52, 303–323. · Zbl 0941.90044
[7] Berry, M.J.A., & Linoff, G. (1997).Data mining techniques: For marketing, sales, and customer support. New York, NY: John Wiley & Sons.
[8] Blattberg, R., Glazer, R., & Little, J. (1994).The marketing information revolution. Boston, MA: Harvard Business School Press.
[9] Box, G.E.P., & Muller, M.E. (1958). A note on the generation of random normal deviates.Annals of Mathematical Statistics, 29, 610–611. · Zbl 0085.13720
[10] Breckenridge, J.N. (1989). Replicating cluster analysis: Method, consistency, and validity.Multivariate Behavioral Research, 24, 147–161.
[11] Carmone, F.J., Kara, A., & Maxwell, S. (1999). HINoV: A new model to improve market segmentation by identifying noisy variables.Journal of Marketing Research, 36, 501–509.
[12] Chaturvedi, A., Carroll, J.D., Green, P.E., & Rotondo, J.A. (1997). A feature-based approach to market segmentation via overlapping K-centroids clustering.Journal of Marketing Research, 34, 370–377.
[13] Cheng, R., & Milligan, G.W. (1996). K-means clustering methods with influence detection.Educational and Psychological Measurement, 56, 833–838.
[14] Cormack, R.M. (1971). A review of classification (with Discussion).Journal of the Royal Statistical Society, Series A, 134, 321–367.
[15] DeSarbo, W.S., Carroll, J.D., Clark, L.A., & Green, P.E. (1984). Synthesized clustering: A method for amalgamating alternative clustering bases with different weighting of variables.Psychometrika, 49, 57–78. · Zbl 0594.62067
[16] DeSarbo, W.S., Manrai, A.K., & Manrai, L.A. (1993). Non-spatial tree models for the assessment of competitive market structure: An integrated review of the marketing and psychometric literature. In J. Eliashberg & G. Lilien (Eds),Handbook in operations research and management science: Marketing (pp. 193–257), New York, NY: Elsevier.
[17] De Soete, G. (1986). Optimal variable weighting for ultrametric and additive tree clustering.Quality and Quantity, 20, 169–180.
[18] De Soete, G. (1988). OVWTRE: A program for optimal variable weighting for ultrametric and additive tree fitting.Journal of Classification, 5, 101–104.
[19] De Soete, G., DeSarbo, W.S., & Carroll, J.D. (1985). Optimal variable weighting for hierarchical clustering: An alternating least-squares algorithm.Journal of Classification, 2, 173–192. · Zbl 0585.62111
[20] Fowlkes, E.B., Gnanadesikan, R., & Kettenring, J.R. (1987). Variable selection in clustering and other contexts. In C.L. Mallows (Ed.),Design, data, and analysis (pp. 13–34). New York, NY: John Wiley & Sons.
[21] Fowlkes, E.B., Gnanadesikan, R., & Kettenring, J.R. (1988). Variable selection in clustering.Journal of Classification, 5, 205–228.
[22] Fowlkes, E.B., & Mallows, C.L. (1983). A method for comparing two hierarchical clusterings (with comments and rejoinder).Journal of the American Statistical Association, 78, 553–584. · Zbl 0545.62042
[23] Friedman, H.P., & Rubin, J. (1967). On some invariant criteria for grouping data.Journal of the American Statistical Association, 62, 1159–1178.
[24] Gnanadesikan, R., Kettenring, J.R., & Tsao, S.L. (1995). Weighting and selection of variables for cluster analysis.Journal of Classification, 12, 113–136. · Zbl 0825.62540
[25] Green, P.E., Carmone, F.J., & Kim, J. (1990). A preliminary study of optimal variable weighting in K-means clustering.Journal of Classification, 7, 271–285.
[26] Helsen, K., & Green, P.E. (1991). A computational study of replicated clustering with an application to market segmentation.Decision Sciences, 22, 1124–1141.
[27] Hubert, L., & Arabie, P. (1985). Comparing partitions.Journal of Classification, 2, 193–218. · Zbl 0587.62128
[28] Knuth, D.E. (1997).The art of computing: Vol. 1. Fundamental algorithms. Reading, MA: Addison-Wesley. · Zbl 0895.68055
[29] Krieger, A., & Green, P.E. (1999). A generalized rand-index method for consensus clustering of separate partitions of the same data base.Journal of Classification, 16, 63–89.
[30] Kruskal, J.B. (1972). Linear transformations of multivariate data to reveal clustering. In R.N. Shepard, A.K. Romney, & S.B. Nerlove (Eds),Multidimensional scaling: Theory and applications in the behavioral sciences (pp. 181–191). New York, NY: Seminar Press.
[31] MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations.Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, I, 231–297. · Zbl 0214.46201
[32] McIntyre, R.M., & Blashfield, R.K. (1980). A nearest-centroid technique for evaluating the minimum variance clustering procedure.Multivariate Behavioral Research, 15, 225–238.
[33] Milligan, G.W. (1980). An examination of six types of the effect of six types of error perturbation on fifteen clustering algorithms.Psychometrika, 45, 325–342.
[34] Milligan, G.W. (1985). An algorithm for generating artificial test clusters.Psychometrika, 50, 123–127.
[35] Milligan, G.W. (1989). A validation study of a variable-weighting algorithm for cluster analysis.Journal of Classification, 6, 53–71.
[36] Milligan, G.W. (1996). Clustering validation: Results and implications for applied analyses. In P. Arabie, L.J. Hubert, & G. De Soete (Eds.),Clustering and classification (pp. 341–375). River Edge, NJ, World Scientific Publishing. · Zbl 0895.62069
[37] Milligan, G.W., & Cooper, M.C. (1986). A study of the comparability of external criteria for hierarchical cluster analysis.Multivariate Behavioral Research, 21, 441–458.
[38] Milligan, G.W., & Cooper, M.C. (1988). A study of the standardization of variables in cluster analysis.Journal of Classification, 5, 181–204.
[39] Milligan, G.W., Soon, S.C., & Sokol, L.M. (1983). The effect of cluster size, dimensionality, and the number of clusters on the recovery of true cluster structure.IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 40–47.
[40] Morey, L.C., Blashfield, R.K., & Skinner, H.A. (1983). A comparison of cluster analysis techniques within a sequential validation framework.Multivariate Behavioral Research, 18, 309–329.
[41] Rand, W.M. (1971). Objective criteria for evaluating clustering methods.Journal of the American Statistical Association, 66, 846–850.
[42] Rohlf, F.J. (1970). Adaptive hierarchical clustering schemes.Systematic Zoology, 19, 58–82.
[43] Salstone, R., & Stange, K. (1996). A computer program to calculate Hubert and Arabie’s adjusted Rand index.Journal of Classification, 13, 169–172.
[44] Ward, J.H. (1963). Hierarchical grouping to optimize an objective function.Journal of the American Statistical Association, 58, 236–244.
[45] Waller, N.G., Kaiser, H.A., Illian, J.B., & Manry, M. (1998). A comparison of the classification capabilities of the 1-dimensional Kohonen neural network with two partitioning and three hierarchical cluster analysis algorithms.Psychometrika, 63, 5–22. · Zbl 1063.62550
[46] Wedel, M., & Kamakura, W.A. (1997).Market segmentation: Conceptual and methodological foundations. Boston, MA: Kluwer Academic Publishers.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.