Locality sensitive hash function
Let
be a random hash function.
We call
locality sensitive for similarity function
if
is
- higher when
and
are more similar, i.e.
higher
- lower when
and
are more dissimilar, i.e.
lower
Locality Sensitive Hash
Family
For distances
with
,
a family of hash functions
is
-locality
sensitive if for any
:
- If
then
- If
then
(For our purposes, we always have
.
I.e. if two points are close together, they have a strictly higher
probability of hashing to the same bucket than two points that are far
apart. This property is ultimately what allows us to perform efficient
near neighbor search.)
Tunable LSH
Full LSH scheme has two parameters to tune:
tables and
bands
- effect of increasing number of tables
:
fewer false negatives, more false positives
- effect of increasing number of bands
:
more false negatives, fewer false positives
s-curve tuning, collision probability vs Jaccard similarity
Crucial for Indyk and
Motwani (1998) theorem.
Also see SimHash and MinHash (Broder, 1997).
References:
- https://web.stanford.edu/class/cs246/slides/03-lsh.pdf
- https://www.cs.princeton.edu/courses/archive/fall18/cos521/Lectures/lec12.pdf
- https://users.cs.duke.edu/~kamesh/CPS294-PDF/Lecture4.pdf
- https://www.pinecone.io/learn/series/faiss/locality-sensitive-hashing-random-projection/
- http://infolab.stanford.edu/~ullman/mining/2009/similarity3.pdf