SimHash
a Locality
sensitive hash function scheme for Cosine similarity:
- Let
be randomly chosen with each entry
.
- Let
be a Uniformly Random
Hash Function.
- Define the LSH hash function
as:
Proof
Let
.
Will show that (theorem to prove):
Intermediate result to show that:
Consider random vector
and its hyperplane. Since it is drawn from the standard normal
distribution, the direction of
is uniformly distributed around the unit circle. Similarly, the
hyperplane is also uniformly distributed around the unit circle.
The sign of the inner
product specifies
which side of the
hyperplane is
on. Intuitively, the probability
that and are
on the same side of the hyperplane is proportional to their angle.
The probability that they lie on different sides of the
hyperplane is the probability that the random hyperplane falls
between and which
is .
Then the probability that they lie on the same side of the
hyperplane
is .
In higher dimensions, we can use the same intuition. There is always
some rotation
matrix such
that and are
spanned by the first two standard basis vectors and have the same cosine
similarity
as and .
Then we can apply the result in one dimension
to and .
SimHash can be tuned, just like our MinHash based function for Jaccard similarity
- Suppose
be randomly chosen with each
.
- Let
be a Uniformly Random
Hash Function
-
is defined
- .
for a random vector
.
see Hashing
References:
- M. S. Charikar, “Similarity estimation techniques from rounding
algorithms,” in Proceedings of the thiry-fourth annual ACM symposium
on Theory of computing, Montreal Quebec Canada: ACM, May 2002, pp.
380–388. doi: 10.1145/509907.509965.
- https://www.chrismusco.com/amlds2023/notes/lecture05.html#SimHash
- https://en.wikipedia.org/wiki/SimHash
- https://ferd.ca/simhashing-hopefully-made-simple.html
- https://datascience.stackexchange.com/questions/6086/minhashing-vs-simhashing
- https://people.cs.umass.edu/~cmusco/CS514F20/slides/lecture8/lecture8Compressed.pdf
- https://sumonbis.github.io/academic-project/simhash/
- https://www.fromkk.com/posts/near-duplicate-with-simhash/