Randomized near-neighbor graphs, giant components and applications in data science

George C Linderman; Gal Mishne; Ariel Jaffe; Yuval Kluger; Stefan Steinerberger

doi:10.1017/jpr.2020.21

J Appl Probab. 2020 Jun;57(2):458-476. doi: 10.1017/jpr.2020.21. Epub 2020 Jul 16.

Authors

George C Linderman¹, Gal Mishne¹, Ariel Jaffe¹, Yuval Kluger², Stefan Steinerberger³

Affiliations

¹ Postal address: Applied Mathematics, Yale University, New Haven, CT 06511.
² Dept. of Pathology & Applied Mathematics, Yale University, New Haven, CT 06511.
³ Dept. of Mathematics, Yale University, New Haven, CT 06511.

Abstract

If we pick n random points uniformly in [0, 1]^d and connect each point to its c_d log n nearest neighbors, where d ≥ 2 is the dimension and c_d is a constant depending on the dimension, then it is well known that the graph is connected with high probability. We prove that it suffices to connect every point to c_{d,1} log log n points chosen randomly among its c_{d,2} log n nearest neighbors to ensure a giant component of size n - o(n) with high probability. This construction yields a much sparser random graph with ~ n log log n instead of ~ n log n edges that has comparable connectivity properties. This result has nontrivial implications for problems in data science where an affinity matrix is constructed: instead of connecting each point to its k nearest neighbors, one can often pick k' ≪ k random points out of the k nearest neighbors and only connect to those without sacrificing quality of results. This approach can simplify and accelerate computation; we illustrate this with experimental results in spectral clustering of large-scale datasets.

Keywords: connectivity; k-nearest neighbor graph; random graph; sparsification.
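
The construction described in the abstract can be illustrated with a minimal sketch, which is not the authors' code. It samples n uniform points in [0, 1]^d, connects each point to k' random picks among its k nearest neighbors, and measures the largest connected component with a union-find. The function names, the brute-force neighbor search, the sample size n = 400, and the constants c_{d,1} = c_{d,2} = 2 are all assumptions chosen for illustration; the abstract does not state the constants.

```python
import math
import random


def sparse_knn_graph(points, k, k_prime, rng):
    """Connect each point to k_prime random picks among its k nearest neighbors.

    Hypothetical helper: brute-force neighbor search, suitable only for small n.
    Returns a set of undirected edges (i, j) with i < j.
    """
    n = len(points)
    edges = set()
    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance to p.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        # Keep only k_prime of the k nearest neighbors, chosen uniformly at random.
        for j in rng.sample(others[:k], k_prime):
            edges.add((min(i, j), max(i, j)))
    return edges


def largest_component_size(n, edges):
    """Size of the largest connected component, via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    sizes = {}
    for v in range(n):
        r = find(v)
        sizes[r] = sizes.get(r, 0) + 1
    return max(sizes.values())


rng = random.Random(0)
n, d = 400, 2
points = [tuple(rng.random() for _ in range(d)) for _ in range(n)]

# Assumed constants (not from the source): c_{d,2} = 2 and c_{d,1} = 2.
k = math.ceil(2 * math.log(n))                   # ~ log n nearest neighbors
k_prime = math.ceil(2 * math.log(math.log(n)))   # ~ log log n random picks

edges = sparse_knn_graph(points, k, k_prime, rng)
giant = largest_component_size(n, edges)
print(f"k={k}, k'={k_prime}, edges={len(edges)}, largest component={giant}/{n}")
```

Because each point contributes at most k' edges, the edge count is bounded by n·k', compared with up to n·k for the full k-nearest-neighbor graph. For large datasets the brute-force search would normally be replaced by a spatial index or an approximate nearest-neighbor method.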
