## Randomized near-neighbor graphs, giant components and applications in data science.(English)Zbl 1446.60013

Summary: If we pick $$n$$ random points uniformly in $$[0,1]^d$$ and connect each point to its $$c_d \log{n}$$ nearest neighbors, where $$d\geq 2$$ is the dimension and $$c_d$$ is a constant depending on the dimension, then it is well known that the graph is connected with high probability. We prove that it suffices to connect every point to $$c_{d,1} \log\log n$$ points chosen randomly among its $$c_{d,2}\log n$$ nearest neighbors to ensure a giant component of size $$n-o(n)$$ with high probability. This construction yields a much sparser random graph with $$\sim n\log\log n$$ instead of $$\sim n\log n$$ edges that has comparable connectivity properties. This result has non-trivial implications for problems in data science where an affinity matrix is constructed: instead of connecting each point to its $$k$$ nearest neighbors, one can often pick $$k'\ll k$$ random points out of the $$k$$ nearest neighbors and only connect to those without sacrificing quality of results. This approach can simplify and accelerate computation; we illustrate this with experimental results in spectral clustering of large-scale datasets.

### MSC:

 60D05 Geometric probability and stochastic geometry 60K35 Interacting random processes; statistical mechanics type models; percolation theory 05C40 Connectivity 05C80 Random graphs (graph-theoretic aspects) 62R07 Statistical aspects of big data and data science

### Keywords:

$$k$$-nn graph; random graph; connectivity; sparsification

### Software:

RCV1; t-SNE; Algorithm 971; SpectralNet
Full Text: