The geometry of kernelized spectral clustering. (English) Zbl 1312.62082

Summary: Clustering of data sets is a standard problem in many areas of science and engineering. The method of spectral clustering is based on embedding the data set using a kernel function, and using the top eigenvectors of the normalized Laplacian to recover the connected components. We study the performance of spectral clustering in recovering the latent labels of i.i.d. samples from a finite mixture of nonparametric distributions. The difficulty of this label recovery problem depends on the overlap between mixture components and how easily a mixture component is divided into two nonoverlapping components. When the overlap is small compared to the indivisibility of the mixture components, the principal eigenspace of the population-level normalized Laplacian operator is approximately spanned by the square-root kernelized component densities. In the finite sample setting, and under the same assumption, embedded samples from different components are approximately orthogonal with high probability when the sample size is large. As a corollary we control the fraction of samples mislabeled by spectral clustering under finite mixtures with nonparametric components.


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G20 Asymptotic properties of nonparametric inference
Full Text: DOI arXiv Euclid


[1] Amini, A. A., Chen, A., Bickel, P. J. and Levina, E. (2013). Pseudo-likelihood methods for community detection in large sparse networks. Ann. Statist. 41 2097-2122. · Zbl 1277.62166
[2] Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15 1373-1396. · Zbl 1085.68119
[3] Cao, Y. and Chen, D.-R. (2011). Consistency of regularized spectral clustering. Appl. Comput. Harmon. Anal. 30 319-336. · Zbl 1216.68213
[4] Donath, W. E. and Hoffman, A. J. (1973). Lower bounds for the partitioning of graphs. IBM J. Res. Develop. 17 420-425. · Zbl 0259.05112
[5] Fiedler, M. (1973). Algebraic connectivity of graphs. Czechoslovak Math. J. 23 298-305. · Zbl 0265.05119
[6] GinĂ©, E. and Koltchinskii, V. (2006). Empirical graph Laplacian approximation of Laplace-Beltrami operators: Large sample results. In High Dimensional Probability. Institute of Mathematical Statistics Lecture Notes-Monograph Series 51 238-259. IMS, Beachwood, OH. · Zbl 1124.60030
[7] Jenssen, R. (2010). Kernel entropy component analysis. IEEE Trans. Pattern Anal. Mach. Intell. 32 847-860.
[8] Lawler, G. F. and Sokal, A. D. (1988). Bounds on the \(L^2\) spectrum for Markov chains and Markov processes: A generalization of Cheeger’s inequality. Trans. Amer. Math. Soc. 309 557-580. · Zbl 0716.60073
[9] Meila, M. and Shi, J. (2001). A random walks view of spectral segmentation. In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics , Key West, FL.
[10] Ng, A. Y., Jordan, M. I. and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems ( NIPS ), Vancouver, BC, Canada.
[11] Reed, M. and Simon, B. (1980). Methods of Modern Mathematical Physics. I : Functional Analysis , 2nd ed. Academic Press, New York. · Zbl 0459.46001
[12] Rosasco, L., Belkin, M. and De Vito, E. (2010). On learning with integral operators. J. Mach. Learn. Res. 11 905-934. · Zbl 1242.62059
[13] Schiebinger, G., Wainwright, M. J. and Yu, B. (2015). Supplement to “The geometry of kernelized spectral clustering.” . · Zbl 1312.62082
[14] Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22 888-905.
[15] Shi, T., Belkin, M. and Yu, B. (2009). Data spectroscopy: Eigenspaces of convolution operators and clustering. Ann. Statist. 37 3960-3984. · Zbl 1191.62114
[16] Stewart, G. W. (1971). Error bounds for approximate invariant subspaces of closed linear operators. SIAM J. Numer. Anal. 8 796-808. · Zbl 0232.47010
[17] von Luxburg, U., Belkin, M. and Bousquet, O. (2008). Consistency of spectral clustering. Ann. Statist. 36 555-586. · Zbl 1133.62045
[18] Yan, D., Chen, A. and Jordan, M. I. (2013). Cluster forests. Comput. Statist. Data Anal. 66 178-192. · Zbl 1471.62225
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.