×

Path-based spectral clustering: guarantees, robustness to outliers, and fast algorithms. (English) Zbl 1497.68430

Summary: We consider the problem of clustering with the longest-leg path distance (LLPD) metric, which is informative for elongated and irregularly shaped clusters. We prove finite-sample guarantees on the performance of clustering with respect to this metric when random samples are drawn from multiple intrinsically low-dimensional clusters in high-dimensional space, in the presence of a large number of high-dimensional outliers. By combining these results with spectral clustering with respect to LLPD, we provide conditions under which the Laplacian eigengap statistic correctly determines the number of clusters for a large class of data sets, and prove guarantees on the labeling accuracy of the proposed algorithm. Our methods are quite general and provide performance guarantees for spectral clustering with any ultrametric. We also introduce an efficient, easy to implement approximation algorithm for the LLPD based on a multiscale analysis of adjacency graphs, which allows for the runtime of LLPD spectral clustering to be quasilinear in the number of data points.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)

Software:

UCI-ml; COIL-20
PDFBibTeX XMLCite
Full Text: arXiv Link

References:

[1] E. Abbe. Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18(177):1-86, 2018. · Zbl 1403.62110
[2] F. Alimoglu and E. Alpaydin. Methods of combining multiple classifiers based on different representations for pen-based handwritten digit recognition. InTAINN. Citeseer, 1996.
[3] N. Alon and B. Schieber.Optimal preprocessing for answering on-line product queries. Tel-Aviv University. The Moise and Frida Eskenasy Institute of Computer Sciences, 1987.
[4] M. Appel and R. Russo. The maximum vertex degree of a graph on uniform points in [0,1]d.Advances in Applied Probability, 29(3):567-581, 1997a. · Zbl 0899.05060
[5] M. Appel and R. Russo. The minimum vertex degree of a graph on uniform points in [0,1]d.Advances in Applied Probability, 29(3):582-594, 1997b. · Zbl 0899.05061
[6] M. Appel and R. Russo. The connectivity of a graph on uniform points on [0,1]d. Statistics & Probability Letters, 60(4):351-357, 2002. · Zbl 1039.05054
[7] E. Arias-Castro. Clustering based on pairwise distances when the data is of mixed dimensions.IEEE Transactions on Information Theory, 57(3):1692-1706, 2011. · Zbl 1366.62117
[8] E. Arias-Castro, G. Chen, and G. Lerman. Spectral clustering based on local linear approximations.Electronic Journal of Statistics, 5:1537-1587, 2011. · Zbl 1271.62132
[9] E. Arias-Castro, G. Lerman, and T. Zhang. Spectral clustering based on local PCA. Journal of Machine Learning Research, 18(9):1-57, 2017. · Zbl 1437.62225
[10] D. Arthur and S. Vassilvitskii.k-means++: The advantages of careful seeding. In SODA, pages 1027-1035. SIAM, 2007. · Zbl 1302.68273
[11] A. Azran and Z. Ghahramani. A new approach to data driven clustering. InICML, pages 57-64. ACM, 2006a.
[12] A. Azran and Z. Ghahramani. Spectral methods for automatic multiscale data clustering. InCVPR, volume 1, pages 190-197. IEEE, 2006b.
[13] S. Balakrishnan, M. Xu, A. Krishnamurthy, and A. Singh.Noise thresholds for spectral clustering. InNIPS, pages 954-962, 2011.
[14] S. Balakrishnan, S. Narayanan, A. Rinaldo, A. Singh, and L. Wasserman. Cluster trees on manifolds. InNIPS, pages 2679-2687, 2013.
[15] M. Banerjee, M. Capozzoli, L. McSweeney, and D. Sinha. Beyond kappa: A review of interrater agreement measures.Canadian journal of statistics, 27(1):3-23, 1999. · Zbl 0929.62117
[16] R.E. Bellman.Adaptive control processes: a guided tour. Princeton University Press, 2015.
[17] J.J. Benedetto and W. Czaja.Integration and modern analysis. Springer Science & Business Media, 2010. · Zbl 1191.28001
[18] J.L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509-517, 1975. · Zbl 0306.68061
[19] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML, pages 97-104. ACM, 2006.
[20] R.B. Bhatt, G. Sharma, A. Dhall, and S. Chaudhury. Efficient skin region segmentation using low complexity fuzzy decision tree model. InINDICON, pages 1-4. IEEE, 2009.
[21] R.L. Bishop and R.J. Crittenden.Geometry of manifolds, volume 15. Academic press, 2011. · Zbl 0984.53001
[22] P.M. Camerini. The min-max spanning tree problem and some extensions.Information Processing Letters, 7(1):10-14, 1978. · Zbl 0373.05028
[23] H. Chang and D.-Y. Yeung. Robust path-based spectral clustering.Pattern Recognition, 41(1):191-203, 2008. · Zbl 1122.68525
[24] K. Chaudhuri and S. Dasgupta. Rates of convergence for the cluster tree. InNIPS, pages 343-351, 2010.
[25] G. Chen and G. Lerman. Foundations of a multi-way spectral clustering framework for hybrid linear modeling.Foundations of Computational Mathematics, 9(5):517-558, 2009a. · Zbl 1176.68155
[26] G. Chen and G. Lerman. Spectral curvature clustering (SCC).International Journal of Computer Vision, 81(3):317-330, 2009b.
[27] F. Chung.Spectral graph theory, volume 92. American Mathematical Society, 1997. · Zbl 0867.05046
[28] R.R. Coifman and S. Lafon. Diffusion maps.Applied and computational harmonic analysis, 21(1):5-30, 2006. · Zbl 1095.68094
[29] R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S.W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps.Proceedings of the National Academy of Sciences of the United States of America, 102(21):7426-7431, 2005. · Zbl 1405.42043
[30] E.D. Demaine, G.M. Landau, and O. Weimann. On Cartesian trees and range minimum queries. InICALP, pages 341-353. Springer, 2009. · Zbl 1248.68165
[31] E.D. Demaine, G.M. Landau, and O. Weimann. On Cartesian trees and range minimum queries.Algorithmica, 68(3):610-625, 2014. · Zbl 1360.68378
[32] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications.IEEE Transactions on Pattern Analysis and Machine Intelligence, 35 (11):2765-2781, 2013.
[33] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. InKdd, volume 96, pages 226-231, 1996.
[34] J. Fan, W. Wang, and Y. Zhong. An‘∞eigenvector perturbation bound and its application to robust covariance estimation.Journal of Machine Learning Research, 18(207):1-42, 2018. · Zbl 1473.15015
[35] H. Federer. Curvature measures.Transactions of the American Mathematical Society, 93(3):418-491, 1959. · Zbl 0089.38402
[36] B. Fischer and J.M. Buhmann. Path-based clustering for grouping of smooth curves and texture segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(4):513-518, 2003.
[37] B. Fischer, T. Z¨oller, and J. Buhmann. Path based pairwise data clustering with application to texture segmentation. InEnergy minimization methods in computer vision and pattern recognition, pages 235-250. Springer, 2001. · Zbl 1001.68765
[38] B. Fischer, V. Roth, and J.M. Buhmann. Clustering with the connectivity kernel. In NIPS, pages 89-96, 2004.
[39] J. Friedman, T. Hastie, and R. Tibshirani.The Elements of Statistical Learning, volume 1. Springer series in Statistics Springer, Berlin, 2001. · Zbl 0973.62007
[40] H. Gabow and R.E. Tarjan. Algorithms for two bottleneck optimization problems. Journal of Algorithms, 9:411-417, 1988. · Zbl 0653.90087
[41] N. Garcia Trillos and D. Slepcev. Continuum limit of total variation on point clouds. Archive for Rational Mechanics and Analysis, 220(1):193-241, 2016a. · Zbl 1336.68215
[42] N. Garcia Trillos and D. Slepcev. A variational approach to the consistency of spectral clustering.Applied and Computational Harmonic Analysis, 2016b. · Zbl 1396.49013
[43] N. Garcia Trillos, D. Slepcev, J. Von Brecht T. Laurent, and X. Bresson. Consistency of Cheeger and ratio graph cuts.Journal of Machine Learning Research, 17(181): 1-46, 2016. · Zbl 1392.62180
[44] N. Garcia Trillos, M. Gerlach, M. Hein, and D. Slepˇcev. Error estimates for spectral convergence of the graph Laplacian on random geometric graphs toward the Laplace-Beltrami operator.Foundations of Computational Mathematics, pages 1- 61, 2019a.
[45] N. Garcia Trillos, F. Hoffmann, and B. Hosseini. Geometric structure of graph Laplacian embeddings.arXiv preprint arXiv:1901.10651, 2019b.
[46] E.N. Gilbert. Random plane networks.Journal of the Society for Industrial and Applied Mathematics, 9(4):533-543, 1961. · Zbl 0112.09403
[47] J.M. Gonz´alez-Barrios and A.J. Quiroz. A clustering procedure based on the comparison between theknearest neighbors graph and the minimal spanning tree. Statistics & Probability Letters, 62(1):23-34, 2003. · Zbl 1101.62351
[48] L. Gy¨orfi, M. Kohler, A. Krzyzak, and H. Walk.A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.
[49] T. Hagerup and C. R¨ub. A guided tour of Chernoff bounds.Information Processing Letters, 33(6):305-308, 1990. · Zbl 0702.60021
[50] J.A. Hartigan. Consistency of single linkage for high-density clusters.Journal of the American Statistical Society, 76(374):388-394, 1981. · Zbl 0468.62053
[51] T. Hastie, R. Tibshirani, and J. Friedman.Elements of Statistical Learning. Springer, 2009. · Zbl 1273.62005
[52] T.C. Hu. Letter to the editor: The maximum capacity route problem.Operations Research, 9(6):898-900, 1961.
[53] G. Hughes. On the mean accuracy of statistical pattern recognizers.IEEE Transactions on Information Theory, 14(1):55-63, 1968.
[54] M. Lichman. UCI machine learning repository, 2013. URLhttp://archive.ics. uci.edu/ml.
[55] M. Maggioni and J.M. Murphy. Learning by unsupervised nonlinear diffusion.Journal of Machine Learning Research, 20(160):1-56, 2019. · Zbl 1440.68233
[56] D. McKenzie and S. Damelin. Power weighted shortest paths for clustering Euclidean data.Foundations of Data Science, 1(3):307, 2019.
[57] G.J. McLachlan and K.E. Basford.Mixture models: Inference and applications to clustering, volume 84. Marcel Dekker, 1988. · Zbl 0697.62050
[58] M. Meila and J. Shi. Learning segmentation by random walks. InNIPS, pages 873-879, 2001.
[59] D.G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures by semidefinite programming.Information and Inference: A Journal of the IMA, 6(4):389-415, 2017. · Zbl 1381.62189
[60] J. Munkres. Algorithms for the assignment and transportation problems.Journal of the Society for Industrial and Applied Mathematics, 5(1):32-38, 1957. · Zbl 0083.15302
[61] J.M. Murphy and M. Maggioni. Diffusion geometric methods for fusion of remotely sensed data. InAlgorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XXIV, volume 10644, page 106440I. International Society for Optics and Photonics, 2018.
[62] J.M. Murphy and M. Maggioni. Unsupervised clustering and active learning of hyperspectral images with nonlinear diffusion.IEEE Transactions on Geoscience and Remote Sensing, 57(3):1829-1845, 2019.
[63] S.A. Nene, S.K. Nayar, and H. Murase. Columbia object image library (COIL-20). 1996.
[64] A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. InNIPS, pages 849-856, 2002.
[65] R. Ostrovsky, Y. Rabani, L.J. Schulman, and C. Swamy. The effectiveness of Lloydtype methods for thek-means problem. InFOCS, pages 165-176. IEEE, 2006.
[66] H.S. Park and C.-H. Jun. A simple and fast algorithm fork-medoids clustering. Expert Systems with Applications, 36(2):3336-3341, 2009.
[67] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review.ACM SIGKDD Explorations Newsletter, 6(1):90-105, 2004.
[68] M. Penrose. The longest edge of the random minimal spanning tree.Annals of Applied Probability, 7(2):340-361, 1997. · Zbl 0884.60042
[69] R. Penrose. A strong law for the longest edge of the minimal spanning tree.Annals of Probability, 27(1):246-260, 1999. · Zbl 0944.60015
[70] M. Pollack. Letter to the editor: The maximum capacity through a network.Operations Research, 8(5):733-736, 1960.
[71] A.P. Punnen. A linear time algorithm for the maximum capacity path problem. European Journal of Operational Research, 53:402-404, 1991. · Zbl 0732.90085
[72] A. Rinaldo and L. Wasserman. Generalized density clustering.The Annals of Statistics, 38(5):2678-2722, 2010. · Zbl 1200.62066
[73] F.D. Roberts and S.H. Storey. A three-dimensional cluster problem.Biometrika, 55 (1):258-260, 1968.
[74] A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks.Science, 344(6191):1492-1496, 2014.
[75] G. Sanguinetti, J. Laidler, and N.D. Lawrence.Automatic determination of the number of clusters using spectral algorithms. InMLSP, pages 55-60. IEEE, 2005.
[76] G. Schiebinger, M.J. Wainwright, and B. Yu. The geometry of kernelized spectral clustering.Annals of Statistics, 43(2):819-846, 2015. · Zbl 1312.62082
[77] J. Shi and J. Malik. Normalized cuts and image segmentation.IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
[78] R. Sibson. SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30-34, 1973.
[79] M. Soltanolkotabi and E.J. Cand‘es. A geometric analysis of subspace clustering with outliers.The Annals of Statistics, 40(4):2195-2238, 2012. · Zbl 1318.62217
[80] M. Soltanolkotabi, E. Elhamifar, and E.J. Cand‘es. Robust subspace clustering.Annals of Statistics, 42(2):669-699, 2014. · Zbl 1360.62353
[81] B. Sriperumbudur and I. Steinwart. Consistency and rates for clustering with DBSCAN. InAISTATS, pages 1090-1098, 2012.
[82] D. Stauffer and A. Aharony.Introduction to Percolation Theory. CRC Press, 1994. · Zbl 0862.60092
[83] H. Steinhaus. Sur la division des corps mat´eriels en parties.Bull. Acad. Polon. Sci., 4(12):801-804, 1957. · Zbl 0079.16403
[84] G.W Stewart. Matrix perturbation theory.Citeseer, 1990. · Zbl 0706.65013
[85] G. Szeg¨o. Inequalities for certain eigenvalues of a membrane of given area.Journal of Rational Mechanics and Analysis, 3:343-356, 1954. · Zbl 0055.08802
[86] L.N. Trefethen and D. Bau.Numerical linear algebra, volume 50. SIAM, 1997. · Zbl 0874.65013
[87] R. Vidal. Subspace clustering.IEEE Signal Processing Magazine, 28(2):52-68, 2011.
[88] U. Von Luxburg. A tutorial on spectral clustering.Statistics and Computing, 17(4): 395-416, 2007.
[89] V. Vu. A simple SVD algorithm for finding hidden partitions.Combinatorics, Probability and Computing, 27(1):124-140, 2018. · Zbl 1386.68110
[90] X. Wang, K. Slavakis, and G. Lerman. Riemannian multi-manifold modeling.arXiv preprint arXiv:1410.0095, 2014.
[91] H.F. Weinberger. An isoperimetric inequality for theN-dimensional free membrane problem.Journal of Rational Mechanics and Analysis, 5(4):633-636, 1956. · Zbl 0071.09902
[92] X. Xu, M. Ester, H.-P. Kriegel, and J. Sander. A distribution-based clustering algorithm for mining in large spatial databases. InICDE, pages 324-331. IEEE, 1998.
[93] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. InNIPS, pages 1601-1608, 2004.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.