SimHash

A locality-sensitive hash function scheme for cosine similarity.

Proof

Let $\theta = \theta(\mathbf{x}, \mathbf{y})$ be the angle between $\mathbf{x}$ and $\mathbf{y}$. We will show (the theorem to prove):

$$\Pr[h(\mathbf{x}) = h(\mathbf{y})] = \left(1 - \frac{\theta}{\pi}\right) + \frac{\theta}{\pi} \cdot \frac{1}{m}$$

where $h$ composes SimHash with a uniform random hash into $m$ buckets. The key intermediate result:

$$\Pr\big[\operatorname{sign}(\langle \mathbf{g}, \mathbf{x} \rangle) = \operatorname{sign}(\langle \mathbf{g}, \mathbf{y} \rangle)\big] = 1 - \frac{\theta}{\pi}$$

First consider the two-dimensional case. Take a random vector $\mathbf{g}$ and the hyperplane orthogonal to it. Since $\mathbf{g}$ is drawn from the standard normal distribution, its direction is uniformly distributed around the unit circle, and so is the direction of its hyperplane. The sign of the inner product $\langle \mathbf{g}, \mathbf{x} \rangle$ specifies which side of the hyperplane $\mathbf{x}$ is on. Intuitively, the probability that $\mathbf{x}$ and $\mathbf{y}$ land on opposite sides of the hyperplane is proportional to the angle between them.

The probability that they lie on opposite sides of the hyperplane is the probability that the random hyperplane falls between $\mathbf{x}$ and $\mathbf{y}$, which is $\frac{2\theta}{2\pi} = \frac{\theta}{\pi}$. The probability that they lie on the same side of the hyperplane is therefore $1 - \frac{\theta}{\pi}$.
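This two-dimensional claim is easy to sanity-check numerically. A minimal Monte Carlo sketch in Python (function name is my own):

```python
import math
import random

def same_side_probability(theta, trials=200_000, seed=0):
    """Estimate Pr[sign(<g,x>) == sign(<g,y>)] for unit vectors x, y
    at angle theta, with g drawn from a 2D standard normal."""
    rng = random.Random(seed)
    x = (1.0, 0.0)
    y = (math.cos(theta), math.sin(theta))
    hits = 0
    for _ in range(trials):
        g = (rng.gauss(0, 1), rng.gauss(0, 1))
        # Same side of the hyperplane iff the inner products share a sign.
        same = (g[0] * x[0] + g[1] * x[1] >= 0) == (g[0] * y[0] + g[1] * y[1] >= 0)
        hits += same
    return hits / trials

# For theta = pi/3 the theory predicts 1 - 1/3 ≈ 0.667.
print(same_side_probability(math.pi / 3))
```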

In higher dimensions the same intuition applies. There is always some rotation matrix $\mathbf{U}$ such that $\mathbf{U}\mathbf{x}$ and $\mathbf{U}\mathbf{y}$ lie in the span of the first two standard basis vectors and have the same cosine similarity as $\mathbf{x}$ and $\mathbf{y}$ (and a standard Gaussian vector stays standard Gaussian under rotation). We can then apply the two-dimensional result to $\mathbf{U}\mathbf{x}$ and $\mathbf{U}\mathbf{y}$.


SimHash can be tuned, just like the MinHash-based function for Jaccard similarity.
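One way to see the tuning knob: concatenate $k$ independent sign bits into a signature. The fraction of matching bits concentrates around $1 - \frac{\theta}{\pi}$, so it can even be inverted to estimate the angle. A sketch with hypothetical helper names, assuming Gaussian hyperplanes as in the proof above:

```python
import math
import random

def simhash_signature(x, hyperplanes):
    """k-bit signature: one sign bit per random hyperplane g."""
    return [sum(g_i * x_i for g_i, x_i in zip(g, x)) >= 0 for g in hyperplanes]

def estimate_angle(x, y, k=2000, seed=0):
    """Invert Pr[bits match] = 1 - theta/pi to estimate the angle."""
    rng = random.Random(seed)
    d = len(x)
    planes = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
    sx = simhash_signature(x, planes)
    sy = simhash_signature(y, planes)
    match = sum(a == b for a, b in zip(sx, sy)) / k
    return (1 - match) * math.pi
```

Larger $k$ tightens the estimate; grouping bits into bands, as with MinHash, trades precision against recall.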


$\mathrm{SimHash}(\mathbf{x}) = \operatorname{sign}(\langle \mathbf{x}, \mathbf{g} \rangle)$ for a random Gaussian vector $\mathbf{g}$.
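To turn the single sign bit into a hash-table slot, compose it with a uniform random hash into $m$ buckets; this is where the $\frac{1}{m}$ term in the theorem above comes from (vectors on opposite sides of the hyperplane still collide with probability $\frac{1}{m}$). A simulation sketch under that reading, in 2D:

```python
import math
import random

def bucketed_collision_rate(theta, m, trials=100_000, seed=2):
    """Estimate Pr[h(x) == h(y)] where h = f(sign(<g, .>)) and f is a
    fresh uniform hash from {+, -} into m buckets, for 2D unit vectors
    at angle theta."""
    rng = random.Random(seed)
    y = (math.cos(theta), math.sin(theta))
    hits = 0
    for _ in range(trials):
        g = (rng.gauss(0, 1), rng.gauss(0, 1))
        sx = g[0] >= 0                       # <g, x> with x = (1, 0)
        sy = g[0] * y[0] + g[1] * y[1] >= 0
        # Uniform random hash from sign bits into m buckets.
        f = {True: rng.randrange(m), False: rng.randrange(m)}
        hits += (f[sx] == f[sy])
    return hits / trials

# Theory: (1 - theta/pi) + (theta/pi) / m.
```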


see Hashing

References:

  1. M. S. Charikar, “Similarity estimation techniques from rounding algorithms,” in Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, Montreal Quebec Canada: ACM, May 2002, pp. 380–388. doi: 10.1145/509907.509965.
  2. https://www.chrismusco.com/amlds2023/notes/lecture05.html#SimHash
  3. https://en.wikipedia.org/wiki/SimHash
  4. https://ferd.ca/simhashing-hopefully-made-simple.html
  5. https://datascience.stackexchange.com/questions/6086/minhashing-vs-simhashing
  6. https://people.cs.umass.edu/~cmusco/CS514F20/slides/lecture8/lecture8Compressed.pdf
  7. https://sumonbis.github.io/academic-project/simhash/
  8. https://www.fromkk.com/posts/near-duplicate-with-simhash/