Selection of variables in cluster analysis: An empirical comparison of eight procedures. (English) Zbl 1143.62327
Summary: Eight variable selection techniques for model-based and non-model-based clustering are evaluated across a wide range of cluster structures. Several methods are shown to have difficulties when non-informative variables (i.e., random noise) are included in the model, and the distribution of the random noise strongly affects the performance of nearly all of the variable selection procedures. Overall, a variable selection technique that combines a variance-to-range weighting procedure with the largest decreases in the within-cluster sum-of-squares error performed best; variable selection methods used in conjunction with finite mixture models performed worst.
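
The selection rule highlighted in the summary can be sketched in a few lines of code. The Python fragment below is a minimal illustration under stated assumptions, not the authors' published algorithm: the variance-to-range weight is read here as w_j = var(x_j) / range(x_j)^2, the clustering step uses scikit-learn's KMeans, and both function names are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

def variance_to_range_weights(X):
    # Assumed reading of the weight: variance relative to squared range.
    # A variable whose variance is large for its range is more likely to
    # split into separated groups than to be uniform random noise.
    variances = X.var(axis=0)
    ranges = X.max(axis=0) - X.min(axis=0)
    return variances / ranges ** 2

def screen_variables(X, k, n_keep):
    # Cluster the weighted data, then score each variable by the share of
    # its total sum of squares left within clusters. For a fixed partition
    # the within-cluster SSE decomposes variable by variable, so a small
    # share means that variable drives a large decrease in within-cluster SSE.
    w = variance_to_range_weights(X)
    Xw = X * np.sqrt(w)  # sqrt(w) so squared distances carry weight w
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xw)
    total_ss = ((Xw - Xw.mean(axis=0)) ** 2).sum(axis=0)
    within_ss = np.zeros(X.shape[1])
    for g in np.unique(labels):
        block = Xw[labels == g]
        within_ss += ((block - block.mean(axis=0)) ** 2).sum(axis=0)
    return np.argsort(within_ss / total_ss)[:n_keep]  # lowest ratio = keep

# Toy check: two informative variables plus three uniform noise variables.
rng = np.random.default_rng(0)
signal = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
noise = rng.uniform(-3, 3, (100, 3))
print(screen_variables(np.hstack([signal, noise]), k=2, n_keep=2))  # expect {0, 1}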

MSC:
62H30 Classification and discrimination; cluster analysis (statistical aspects)
91C20 Clustering in the social and behavioral sciences
62P15 Applications of statistics to psychology
Software:
EDA
References:
[1] Banfield, J.D., & Raftery, A.E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803–821. · Zbl 0794.62034
[2] Bartholomew, D.J., & Knott, M. (1999). Latent variable models and factor analysis. London: Arnold. · Zbl 1066.62528
[3] Brusco, M.J., & Cradit, J.D. (2001). A variable-selection heuristic for K-means clustering. Psychometrika, 66, 249–270. · Zbl 1293.62237
[4] Carmone, F.J., Kara, A., & Maxwell, S. (1999). HINoV: A new model to improve market segment definition by identifying noisy variables. Journal of Marketing Research, 36, 501–509.
[5] Cormack, R.M. (1971). A review of classification. Journal of the Royal Statistical Society, Series A, 134, 321–367.
[6] Dempster, A.P., Laird, N.M., & Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38. · Zbl 0364.62022
[7] DeSarbo, W.S., Carroll, J.D., Clark, L.A., & Green, P.E. (1984). Synthesized clustering: A method for amalgamating alternative clustering bases with differential weighting of variables. Psychometrika, 49, 57–78. · Zbl 0594.62067
[8] De Soete, G., DeSarbo, W.S., & Carroll, J.D. (1985). Optimal variable weighting for hierarchical clustering: An alternative least-squares algorithm. Journal of Classification, 2, 173–192. · Zbl 0585.62111
[9] Donoghue, J.R. (1995). Univariate screening measures for cluster analysis. Multivariate Behavioral Research, 30, 385–427.
[10] Dy, J.G., & Brodley, C.E. (2004). Feature selection for unsupervised learning. Journal of Machine Learning Research, 5, 845–889. · Zbl 1222.68187
[11] Fowlkes, E.B., & Mallows, C.L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78, 553–569. · Zbl 0545.62042
[12] Fowlkes, E.B., Gnanadesikan, R., & Kettenring, J.R. (1988). Variable selection in clustering. Journal of Classification, 5, 205–228.
[13] Friedman, J.H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249–266. · Zbl 0664.62060
[14] Friedman, J.H., & Meulman, J.J. (2004). Clustering objects on subsets of variables. Journal of the Royal Statistical Society, Series B, 66, 1–25. · Zbl 1060.62064
[15] Friedman, J.H., & Tukey, J.W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, 23, 881–890. · Zbl 0284.68079
[16] Gnanadesikan, R., Kettenring, J.R., & Tsao, S.L. (1995). Weighting and selection of variables for cluster analysis. Journal of Classification, 12, 113–136. · Zbl 0825.62540
[17] Goffe, W.L., Ferrier, G.D., & Rogers, J. (1994). Global optimization of statistical functions with simulated annealing. Journal of Econometrics, 60, 65–99. · Zbl 0789.62095
[18] Green, P.E., Carmone, F.J., & Kim, J. (1990). A preliminary study of optimal variable weighting in k-means clustering. Journal of Classification, 7, 271–285.
[19] Hubert, L.J., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218. · Zbl 0587.62128
[20] Kruskal, J.B. (1969). Toward a practical method which helps uncover the structure of a set of multivariate observations by finding the linear transformation which optimizes a new index of condensation. In R.C. Milton, & J.A. Nelder (Eds.), Statistical Computation (pp. 427–440). New York: Academic Press.
[21] Law, M.H.C., Figueiredo, M.A.T., & Jain, A.K. (2004). Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26, 1154–1166. · Zbl 05112235
[22] Martinez, W.L., & Martinez, A.R. (2001). Computational statistics handbook with MATLAB. Boca Raton: Chapman & Hall. · Zbl 0986.62104
[23] Martinez, W.L., & Martinez, A.R. (2005). Exploratory data analysis with MATLAB. Boca Raton: Chapman & Hall. · Zbl 1067.62005
[24] McLachlan, G.J., & Basford, K.E. (1988). Mixture models: Inference and applications to clustering. New York: Dekker. · Zbl 0697.62050
[25] McLachlan, G.J., & Krishnan, T. (1997). The EM algorithm and extensions. New York: Wiley. · Zbl 0882.62012
[26] McLachlan, G.J., & Peel, D. (2000). Finite mixture models. New York: Wiley. · Zbl 0963.62061
[27] Milligan, G.W. (1980). An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45, 325–342.
[28] Milligan, G.W. (1985). An algorithm for generating artificial test clusters. Psychometrika, 50, 123–127.
[29] Milligan, G.W. (1989). A validation study of a variable weighting algorithm for cluster analysis. Journal of Classification, 6, 53–71.
[30] Montanari, A., & Lizzani, L. (2001). A projection pursuit approach to variable selection. Computational Statistics & Data Analysis, 35, 463–473. · Zbl 1080.62527
[31] Raftery, A.E., & Dean, N. (2006). Variable selection for model-based clustering. Journal of the American Statistical Association, 101, 168–178. · Zbl 1118.62339
[32] Steinley, D. (2003). Local optima in K-means clustering: What you don’t know may hurt you. Psychological Methods, 8, 294–304.
[33] Steinley, D. (2004a). Standardizing variables in K-means clustering. In D. Banks, L. House, F.R. McMorris, P. Arabie, & W. Gaul (Eds.), Classification, clustering, and data mining applications (pp. 53–60). New York: Springer.
[34] Steinley, D. (2004b). Properties of the Hubert–Arabie adjusted Rand index. Psychological Methods, 9, 386–396.
[35] Steinley, D. (2006a). K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59, 1–34.
[36] Steinley, D. (2006b). Profiling local optima in K-means clustering: Developing a diagnostic technique. Psychological Methods, 11, 178–192.
[37] Steinley, D., & Brusco, M.J. (2007, in press). A new variable weighting and selection procedure for K-means cluster analysis. Psychometrika. · Zbl 1151.91731
[38] Steinley, D., & Henson, R. (2005). OCLUS: An analytic method for generating clusters with known overlap. Journal of Classification, 22, 221–250. · Zbl 1336.62191
[39] Steinley, D., & McDonald, R.P. (2007). Examining factor score distributions to determine the nature of latent spaces. Multivariate Behavioral Research, 42, 133–156.
[40] Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B, 63, 411–423. · Zbl 0979.62046
[41] van Buuren, S., & Heiser, W.J. (1989). Clustering N objects into K groups under optimal scaling of variables. Psychometrika, 54, 699–706. · Zbl 04567856