×

Randomized near-neighbor graphs, giant components and applications in data science. (English) Zbl 1446.60013

Summary: If we pick \(n\) random points uniformly in \([0,1]^d\) and connect each point to its \(c_d \log{n}\) nearest neighbors, where \(d\geq 2\) is the dimension and \(c_d\) is a constant depending on the dimension, then it is well known that the graph is connected with high probability. We prove that it suffices to connect every point to \(c_{d,1} \log\log n\) points chosen randomly among its \(c_{d,2}\log n\) nearest neighbors to ensure a giant component of size \(n-o(n)\) with high probability. This construction yields a much sparser random graph with \(\sim n\log\log n\) instead of \(\sim n\log n\) edges that has comparable connectivity properties. This result has non-trivial implications for problems in data science where an affinity matrix is constructed: instead of connecting each point to its \(k\) nearest neighbors, one can often pick \(k'\ll k\) random points out of the \(k\) nearest neighbors and only connect to those without sacrificing quality of results. This approach can simplify and accelerate computation; we illustrate this with experimental results in spectral clustering of large-scale datasets.

MSC:

60D05 Geometric probability and stochastic geometry
60K35 Interacting random processes; statistical mechanics type models; percolation theory
05C40 Connectivity
05C80 Random graphs (graph-theoretic aspects)
62R07 Statistical aspects of big data and data science
PDF BibTeX XML Cite
Full Text: DOI arXiv Link

References:

[1] Balister, P. andBollobás, B. (2013). Percolation in the k-nearest neighbor graph. In Recent Results in Designs and Graphs: A Tribute to Lucia Gionfriddo (Quaderni di Matematica 28), eds Buratti, M.et al., pp. 83-100. Aracne.
[2] Balister, P., Bollobás, B., Sarkar, A. andWalters, M. (2005). Connectivity of random k-nearest-neighbour graphs. Adv. Appl. Prob.37 (1), 1-24. · Zbl 1079.05086
[3] Balister, P., Bollobás, B., Sarkar, A. andWalters, M. (2009). A critical constant for the k-nearest-neighbour model. Adv. Appl. Prob.41 (1), 1-12. · Zbl 1160.05333
[4] Balister, P., Bollobás, B., Sarkar, A. andWalters, M. (2009). Highly connected random geometric graphs. Discrete Appl. Math.157 (2), 309-320. · Zbl 1156.05054
[5] Beardwood, J., Halton, J. H. andHammersley, J. M. (1959). The shortest path through many points. Proc. Cambridge Philos. Soc. 55, 299-327. · Zbl 0118.35601
[6] Belkin, M. andNiyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput.15, 1373-1396. · Zbl 1085.68119
[7] Broutin, N., Devroye, L., Fraiman, N. andLugosi, G. (2014). Connectivity threshold of Bluetooth graphs. Random Struct. Algorithms44 (1), 45-66. · Zbl 1280.05069
[8] Coifman, R. andLafon, S. (2006). Diffusion maps. Appl. Comput. Harmon. Anal.21 (1), 5-30. · Zbl 1095.68094
[9] Erdős, P. andRényi, A. (1959). On random graphs I. Publ. Math. Debrecen6, 290-297. · Zbl 0092.15705
[10] Falgas-Ravry, V. andWalters, M. (2012). Sharpness in the k-nearest-neighbours random geometric graph model. Adv. Appl. Prob.44 (3), 617-634. · Zbl 1278.60142
[11] Few, L. (1955). The shortest path and the shortest road through n points. Mathematika2, 141-144. · Zbl 0067.12604
[12] Hein, M., Audibert, J.-Y. andVon Luxburg, U. (2005). From graphs to manifolds: weak and strong pointwise consistency of graph Laplacians. In Learning Theory (Lecture Notes Comput. Sci. 3559), pp. 470-485. Springer, Berlin. · Zbl 1095.68097
[13] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc.58 (301), 13-30. · Zbl 0127.10602
[14] Jones, P. W., Osipov, A. andRokhlin, V. (2011). Randomized approximate nearest neighbors algorithm. Proc. Nat. Acad. Sci. 108 (38), 15679-15686.
[15] Kolmogorov, A. N. andBarzdin, Y. (1993). On the realization of networks in three-dimensional space. In Selected Works of A. N. Kolmogorov (Mathematics and Its Applications (Soviet Series) 27), ed. A. N. Shiryayev, pp. 194-202. Springer, Dordrecht.
[16] Kusner, M. J., Tyree, S., Weinberger, K. andAgrawal, K. (2014). Stochastic neighbor compression. In 31st International Conference on Machine Learning (Proceedings of Machine Learning Research 32), pp. 622-630. PMLR.
[17] Lewis, D., Yang, Y., Rosen, T. andLi, F. (2004). RCV1: a new benchmark collection for text categorization research. J. Machine Learning Res.5, 361-397.
[18] Li, H., Linderman, G. C., Szlam, A., Stanton, K. P., Kluger, Y. andTygert, M. (2017). Algorithm 971: an implementation of a randomized algorithm for principal component analysis. ACM Trans. Math. Software43 (3), 28. · Zbl 1391.65085
[19] Linderman, G. C. andSteinerberger, S. (2019). Clustering with t-SNE, provably. SIAM J. Math Data Science1, 313-332.
[20] Loosli, G., Canu, S. andBottou, L. (2007). Training invariant support vector machines using selective sampling. In Large Scale Kernel Machines, eds L. Bottou et al., pp. 301-320. MIT Press. · Zbl 1222.68251
[21] Maier, M., Von Luxburg, U. andHein, M. (2009). Influence of graph construction on graph-based clustering measures. In Advances in Neural Information Processing Systems 21 (2008), pp. 1025-1032.
[22] Maier, M., Von Luxburg, U. andHein, M. (2013). How the result of graph clustering methods depends on the construction of the graph. ESAIM Prob. Statist.17, 370-418. · Zbl 1284.62382
[23] Margulis, G. (1973). Explicit constructions of concentrators. Problemy Peredachi Informatsii 9 (4), 71-80. English version: Problems Inform. Transmission10, 325-332 (1975).
[24] Mauldin, R. D. (ed.) (2015). The Scottish Book: Mathematics from The Scottish Café, with Selected Problems from The New Scottish Book, 2nd edn. Birkhäuser/Springer, Cham. · Zbl 1331.01039
[25] Munkres, J. (1957). Algorithms for the assignment and transportation problems. J. Soc. Indust. Appl. Math.5 (1), 32-38. · Zbl 0083.15302
[26] Penrose, M. (2003). Random Geometric Graphs (Oxford Studies in Probability 5). Oxford University Press, Oxford.
[27] Penrose, M. andPisztora, A. (1996). Large deviations for discrete and continuous percolation. Adv. Appl. Prob.28 (1), 29-52. · Zbl 0853.60085
[28] Penrose, M. andYukich, J. (2003). Weak laws of large numbers in geometric probability. Ann. Appl. Prob.13 (1), 277-303. · Zbl 1029.60008
[29] Pinsker, M. S. (1973). On the complexity of a concentrator. In Seventh International Teletraffic Congress (Stockholm, 1973), paper # 318.
[30] Rokhlin, V., Szlam, A. andTygert, M. (2009). A randomized algorithm for principal component analysis. SIAM J. Matrix Anal. Appl.31 (3), 1100-1124. · Zbl 1198.65035
[31] Shaham, U., Stanton, K., Li, H., Basri, R., Nadler, B. andKluger, Y. (2018). SpectralNet: spectral clustering using deep neural networks. In International Conference on Learning Representations, 2018.
[32] Singer, A. (2006). From graph to manifold Laplacian: the convergence rate. Appl. Comput. Harmon. Anal.21 (1), 128-134. · Zbl 1095.68102
[33] Steele, J. M. (1980). Shortest paths through pseudorandom points in the d-cube. Proc. Amer. Math. Soc. 80 (1), 130-134. · Zbl 0465.10044
[34] Steele, J. M. (1981). Subadditive Euclidean functionals and nonlinear growth in geometric probability. Ann. Prob.9 (3), 365-376. · Zbl 0461.60029
[35] Steele, J. M. (1997). Probability Theory and Combinatorial Optimization (CBMS-NSF Regional Conference Series in Applied Mathematics 69). SIAM, Philadelphia.
[36] Steinerberger, S. (2010). A new lower bound for the geometric traveling salesman problem in terms of discrepancy. Oper. Res. Lett.38 (4), 318-319. · Zbl 1193.90181
[37] Steinerberger, S. (2015). New bounds for the traveling salesman constant. Adv. Appl. Prob.47, 27-36 · Zbl 1309.60005
[38] Teng, S.-H. andYao, F. (2007). k-nearest-neighbor clustering and percolation theory. Algorithmica49 (3), 192-211. · Zbl 1131.60089
[39] Walters, M. (2011). Random geometric graphs. Surveys Combin.392, 365-402. · Zbl 1244.05206
[40] Walters, M. (2012). Small components in k-nearest neighbour graphs. Discrete Appl. Math.160 (13-14), 2037-2047. · Zbl 1246.05143
[41] Van Der Maaten, L. (2014). Accelerating t-SNE using tree-based algorithms. J. Machine Learning Res.15 (1), 3221-3245. · Zbl 1319.62134
[42] Van Der Maaten, L. andHinton, G. (2008). Visualizing data using t-SNE. J. Machine Learning Res.9, 2579-2605. · Zbl 1225.68219
[43] Von Luxburg, U. (2007). A tutorial on spectral clustering. Statist. Comput.17 (4), 395-416.
[44] Xie, J., Girshick, R. andFarhadi, A. (2016). Unsupervised deep embedding for clustering analysis. In 33rd International Conference on Machine Learning (Proceedings of Machine Learning Research 48), pp. 478-487. JMLR.
[45] Xue, F. andKumar, P. R. (2004). The number of neighbors needed for connectivity of wireless networks. Wireless Networks10, 169-181.
[46] Yukich, J. (1998). Probability Theory of Classical Euclidean Optimization Problems (Lecture Notes Math. 1675). Springer, Berlin.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.