Guided projections for analyzing the structure of high-dimensional data. (English) Zbl 07498988

Summary: A powerful data transformation method named guided projections is proposed, creating new possibilities to reveal the group structure of high-dimensional data in the presence of noise variables. Projecting onto a space spanned by a small selection of observations allows the similarity of the other observations to the selection to be measured by orthogonal and score distances. Observations are iteratively exchanged in the selection, creating a nonrandom sequence of projections, which we call guided projections. In contrast to conventional projection pursuit methods, which typically identify a single low-dimensional projection revealing some interesting features contained in the data, guided projections generate a series of projections that serve as a basis not just for diagnostic plots but also for directly investigating the group structure in data. Based on simulated data, we identify the strengths and limitations of guided projections in comparison to commonly employed data transformation methods. We further show the relevance of the transformation by applying it to real-world datasets.
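
The projection step described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `od_sd`, centering at the mean of the selection, the SVD-based basis, and scaling the scores by the selection's standard deviations are all assumptions made here for illustration, since the summary does not specify these details. The iterative exchange of observations is not shown because the summary does not state the exchange rule.

```python
import numpy as np

def od_sd(X, selection, tol=1e-10):
    """Hypothetical helper: orthogonal and score distances of every row of X
    relative to the subspace spanned by the centered selected observations."""
    S = X[selection]
    center = S.mean(axis=0)                      # assumed: center at selection mean
    # Rows of Vt form an orthonormal basis for the span of the centered selection.
    _, s, Vt = np.linalg.svd(S - center, full_matrices=False)
    keep = s > tol * s.max()                     # drop numerically null directions
    V = Vt[keep].T                               # p x k basis matrix
    scale = s[keep] / np.sqrt(len(selection) - 1)  # std. dev. along each direction
    Z = X - center
    scores = Z @ V                               # coordinates inside the subspace
    residual = Z - scores @ V.T                  # part orthogonal to the subspace
    od = np.linalg.norm(residual, axis=1)        # orthogonal distance
    sd = np.sqrt(((scores / scale) ** 2).sum(axis=1))  # score distance
    return od, sd

# Simulated example: 20 observations, 50 variables, 5 selected observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
selection = [0, 1, 2, 3, 4]
od, sd = od_sd(X, selection)
```

In this sketch the selected observations lie in the projection space by construction, so their orthogonal distances are numerically zero; observations that resemble the selection would be expected to have small values of both distances.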

MSC:

62-XX Statistics
