MinHash (Broder, 1997)

Choose kk random hash functions h1,,hk:{1,,n}[0,1]h_1,…,h_k: \{1,…,n\} \rightarrow [0,1]. For i1,,ki \in 1,…,k, let ci=minj,𝐪j=1hi(j)c_i = \min_{j,\mathbf{q}_j =1} h_i(j). C(𝐪)=[c1,...,ck]C(\mathbf{q}) = [c_1,...,c_k]

Claim: For all ii, Pr[ci(𝐪)=ci(𝐲)]=J(𝐪,𝐲)=|𝐪𝐲||𝐪𝐲|\mathrm{Pr}[c_i(\mathbf{q})=c_i(\mathbf{y})]=J(\mathbf{q},\mathbf{y})=\frac{|\mathbf{q}\cap\mathbf{y}|}{|\mathbf{q}\cup\mathbf{y}|} Proof:

#incomplete


See Hashing.

References:

  1. A. Z. Broder, “On the resemblance and containment of documents,” in Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), Salerno, Italy: IEEE Comput. Soc, 1998, pp. 21–29. doi: 10.1109/SEQUEN.1997.666900.
  2. https://en.wikipedia.org/wiki/MinHash
  3. https://www.chrismusco.com/amlds2023/notes/lecture05.html
  4. https://people.cs.umass.edu/~cmusco/CS514F21/slides/lecture9/lecture9Annotated.pdf
  5. https://ekzhu.com/datasketch/minhash.html
  6. https://users.cs.utah.edu/~jeffp/teaching/cs5955/L5-Minhash.pdf
  7. https://aksakalli.github.io/2016/03/01/jaccard-similarity-with-minhash.html