Locality sensitive hash function

Let $h: \mathbb{R}^d \rightarrow \{1, \dots, m\}$ be a random hash function.

We call $h$ locality sensitive for a similarity function $s(\mathbf{q}, \mathbf{y})$ if $\Pr[h(\mathbf{q}) = h(\mathbf{y})]$ is monotonically increasing in $s(\mathbf{q}, \mathbf{y})$.
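As a concrete illustration (a sketch, not part of the notes; names like `bits` are mine), random-hyperplane hashing maps a vector to one bit: which side of a random hyperplane it falls on. Two vectors collide with probability $1 - \theta/\pi$, where $\theta$ is the angle between them, so the collision probability is monotone in cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 16, 20000
planes = rng.normal(size=(trials, d))    # each row defines one random hash function

def bits(x):
    # h(x): which side of each random hyperplane x falls on (one bit per function)
    return (planes @ x) >= 0

x = rng.normal(size=d)
near = x + 0.1 * rng.normal(size=d)      # a small perturbation of x
far = rng.normal(size=d)                 # an unrelated random point

p_near = np.mean(bits(x) == bits(near))  # estimates Pr[h(x) = h(near)], close to 1
p_far = np.mean(bits(x) == bits(far))    # roughly 1/2 for near-orthogonal vectors
```

Since `near` makes a small angle with `x`, its estimated collision probability is close to 1, while the unrelated point's sits near 1/2.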

Locality Sensitive Hash Family

For distances $r_1, r_2$ with $r_1 < r_2$, a family of hash functions $\mathcal{H}: U \rightarrow S$ is $(r_1, r_2, p_1, p_2)$-locality sensitive if for any $x, y \in U$:

  1. If $d(x,y) \leq r_1$ then $\Pr_{h \in \mathcal{H}}[h(x) = h(y)] \geq p_1$
  2. If $d(x,y) \geq r_2$ then $\Pr_{h \in \mathcal{H}}[h(x) = h(y)] \leq p_2$

(For our purposes, we always have $p_2 < p_1$, i.e., two points that are close together have a strictly higher probability of hashing to the same bucket than two points that are far apart. This gap is ultimately what allows us to perform efficient near neighbor search.)
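For Jaccard distance $d(A,B) = 1 - J(A,B)$ on sets, MinHash (Broder, 1997) is the textbook example of such a family: a single MinHash function collides with probability exactly $J(A,B)$, so any thresholds $r_1 < r_2$ give $p_1 = 1 - r_1 > 1 - r_2 = p_2$. A minimal sketch (the seeded md5 value is an illustrative stand-in for a random permutation):

```python
import hashlib

def rank(seed, x):
    # Deterministic pseudo-random rank of element x under hash function `seed`
    return int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)

def minhash(s, seed):
    # MinHash: the minimum rank over the set; two sets agree with probability J(A, B)
    return min(rank(seed, x) for x in s)

A = set(range(0, 80))     # |A ∩ B| = 60, |A ∪ B| = 100, so J(A, B) = 0.6
B = set(range(20, 100))

n = 5000
agree = sum(minhash(A, i) == minhash(B, i) for i in range(n)) / n  # ≈ 0.6
```

The empirical agreement rate over many independent MinHash functions concentrates around the true Jaccard similarity of 0.6.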

Tunable LSH

The full LSH scheme has two parameters to tune: the number of tables $t$ and the number of bands $r$.

S-curve tuning: plotting collision probability against Jaccard similarity yields an S-shaped curve whose steepness and transition point are controlled by $r$ and $t$.
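Under the usual AND/OR amplification (each table concatenates $r$ hash functions; a pair is a candidate if it matches in any of the $t$ tables; my reading of the $t$/$r$ parameters above), the collision probability at Jaccard similarity $s$ is $1 - (1 - s^r)^t$, which traces exactly this S-curve:

```python
def collision_prob(s, r, t):
    # AND over the r hashes within a table, OR across the t tables
    return 1 - (1 - s**r) ** t

r, t = 5, 20
# The curve is near 0 for dissimilar pairs, near 1 for similar pairs,
# and steepest around the threshold s* ≈ (1/t)**(1/r)
threshold = (1 / t) ** (1 / r)   # ≈ 0.55 for these parameters
low = collision_prob(0.3, r, t)  # dissimilar pair: rarely a candidate
high = collision_prob(0.8, r, t) # similar pair: almost always a candidate
```

Raising $r$ sharpens the curve (fewer false positives); raising $t$ shifts the threshold left (fewer false negatives).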


This amplification step is crucial for the near neighbor search theorem of Indyk and Motwani (1998).

Also see SimHash and MinHash (Broder, 1997).

References:

  1. https://web.stanford.edu/class/cs246/slides/03-lsh.pdf
  2. https://www.cs.princeton.edu/courses/archive/fall18/cos521/Lectures/lec12.pdf
  3. https://users.cs.duke.edu/~kamesh/CPS294-PDF/Lecture4.pdf
  4. https://www.pinecone.io/learn/series/faiss/locality-sensitive-hashing-random-projection/
  5. http://infolab.stanford.edu/~ullman/mining/2009/similarity3.pdf