Locality Sensitive Hash Function

Let $h: \mathbb{R}^d \rightarrow \{1,\ldots,m\}$ be a random hash function.

We call $h$ locality sensitive for a similarity function $s(\mathbf{q},\mathbf{y})$ if $\mathrm{Pr}[h(\mathbf{q})=h(\mathbf{y})]$ is

higher when $\mathbf{q}$ and $\mathbf{y}$ are more similar, i.e. when $s(\mathbf{q},\mathbf{y})$ is higher
lower when $\mathbf{q}$ and $\mathbf{y}$ are more dissimilar, i.e. when $s(\mathbf{q},\mathbf{y})$ is lower
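
As an illustration of this definition, the sketch below estimates the collision probability of a one-bit SimHash (random hyperplane hash) for a near pair and a far pair of unit vectors. It assumes cosine similarity as $s$ and a two-bucket range ($m = 2$, labelled 0 and 1 here); the helper names `simhash_bit` and `collision_rate` are made up for this example. For this hash the collision probability is known to be $1 - \theta/\pi$, where $\theta$ is the angle between the vectors, so the estimates should land near $11/12$ and $1/3$.

```python
import math
import random

def simhash_bit(x, g):
    """One-bit SimHash: which side of the hyperplane with normal g contains x."""
    return 1 if sum(gi * xi for gi, xi in zip(g, x)) >= 0 else 0

def collision_rate(q, y, trials=20000, seed=0):
    """Estimate Pr[h(q) = h(y)] over random Gaussian hyperplanes."""
    rng = random.Random(seed)
    dim = len(q)
    hits = 0
    for _ in range(trials):
        # Draw a fresh random hash function, i.e. a fresh hyperplane normal.
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        hits += simhash_bit(q, g) == simhash_bit(y, g)
    return hits / trials

q = [1.0, 0.0]
near = [math.cos(math.pi / 12), math.sin(math.pi / 12)]       # 15 degrees from q
far = [math.cos(2 * math.pi / 3), math.sin(2 * math.pi / 3)]  # 120 degrees from q

p_near = collision_rate(q, near)  # theory: 1 - 15/180
p_far = collision_rate(q, far)    # theory: 1 - 120/180
print(p_near, p_far)
```

The more similar pair collides far more often, which is exactly the property the definition asks for.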

Locality Sensitive Hash Family

For distances $r_1,r_2$ with $r_1 < r_2$, a family $\mathcal{H}$ of hash functions $h: U \rightarrow S$ is $(r_1,r_2,p_1,p_2)$-locality sensitive with respect to a distance function $d$ on $U$ if for any $x,y \in U$:

If $d(x,y)\leq r_1$ then $\mathrm{Pr}_{h\in\mathcal{H}}[h(x)=h(y)]\geq p_1$
If $d(x,y)\geq r_2$ then $\mathrm{Pr}_{h\in\mathcal{H}}[h(x)=h(y)]\leq p_2$

(For our purposes, we always have $p_2 < p_1$. That is, two points that are close together have a strictly higher probability of hashing to the same bucket than two points that are far apart. This property is ultimately what allows us to perform efficient near neighbor search.)
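
To make the two conditions concrete, here is a minimal sketch that checks them exhaustively for one simple family. It assumes $U = \{0,1\}^n$ with Hamming distance as $d$ and the bit-sampling family $h_i(x) = x_i$ with $i$ chosen uniformly, for which $\mathrm{Pr}[h(x)=h(y)] = 1 - d(x,y)/n$ exactly. The function names and the values $n=8$, $r_1=2$, $r_2=6$ are illustrative choices, not taken from the text above.

```python
from itertools import product

def hamming(x, y):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a != b for a, b in zip(x, y))

def collision_prob(x, y):
    """Exact Pr[h(x) = h(y)] for h_i(x) = x[i], i uniform over coordinates."""
    n = len(x)
    return sum(x[i] == y[i] for i in range(n)) / n

def is_locality_sensitive(n, r1, r2, p1, p2):
    """Check both conditions of the definition over all pairs in {0,1}^n."""
    points = list(product([0, 1], repeat=n))
    for x in points:
        for y in points:
            dist = hamming(x, y)
            p = collision_prob(x, y)
            if dist <= r1 and p < p1:   # close pair collides too rarely
                return False
            if dist >= r2 and p > p2:   # far pair collides too often
                return False
    return True

n, r1, r2 = 8, 2, 6
p1, p2 = 1 - r1 / n, 1 - r2 / n   # 0.75 and 0.25
ok = is_locality_sensitive(n, r1, r2, p1, p2)
print(ok)
```

Under these assumptions the family is $(2, 6, 0.75, 0.25)$-locality sensitive; asking for a larger $p_1$ (say $0.9$) with the same $r_1$ makes the check fail.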

Tunable LSH

The full LSH scheme has two parameters to tune: the number of tables $t$ and the number of bands $r$.

Effect of increasing the number of tables $t$: fewer false negatives, more false positives
Effect of increasing the number of bands $r$: more false negatives, fewer false positives

S-curve tuning: choosing $r$ and $t$ shapes the curve of collision probability versus Jaccard similarity.
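
The trade-off can be sketched numerically. The code below assumes the usual MinHash construction, which the notes do not spell out: each band is an independent hash that collides with probability equal to the Jaccard similarity $s$, a table key concatenates $r$ bands, and a pair becomes a candidate if it collides in at least one of $t$ independent tables. Under those assumptions the collision probability is $1 - (1 - s^r)^t$. The function names are hypothetical.

```python
def collision_probability(s, r, t):
    """Pr that two items with similarity s share a bucket in at least one table.

    Assumes r independent bands per table (all must match) and
    t independent tables (any one match suffices).
    """
    return 1.0 - (1.0 - s ** r) ** t

def s_curve(r, t, steps=10):
    """Sample the S-curve at evenly spaced similarities in [0, 1]."""
    return [(i / steps, collision_probability(i / steps, r, t))
            for i in range(steps + 1)]

# Print the curve for one setting of the two parameters.
for s, p in s_curve(r=5, t=10):
    print(f"s = {s:.1f}  collision probability = {p:.3f}")
```

Raising $t$ lifts the whole curve (fewer false negatives, more false positives), while raising $r$ pushes it down (more false negatives, fewer false positives), matching the two effects listed above.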

Crucial for the Indyk and Motwani (1998) theorem.

Also see SimHash and MinHash (Broder, 1997).

References: