This library provides tools for the b-bit MinHash algorithm.
If you have a problem or a question, please file an issue. (A Japanese forum is also available.)
Add the following dependency to your pom.xml:
<dependency>
    <groupId>org.codelibs</groupId>
    <artifactId>minhash</artifactId>
    <version>0.1.0</version>
</dependency>
The MinHash class provides methods to calculate a MinHash value.
// A Lucene Tokenizer that parses the text. If null, WhitespaceTokenizer is used.
Tokenizer tokenizer = null;
// The number of bits for each hash value.
int hashBit = 1;
// A base seed for hash functions.
int seed = 0;
// The number of hash functions.
int num = 128;
// Analyzer that produces 128 hash values of 1 bit each.
Analyzer analyzer = MinHash.createAnalyzer(tokenizer, hashBit, seed, num);
String text = "Fess is very powerful and easily deployable Enterprise Search Server.";
// Calculate the MinHash value. Its size is hashBit*num bits.
byte[] minhash = MinHash.calculate(analyzer, text);
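To make the parameters above more concrete, here is a minimal sketch of the general 1-bit MinHash idea in plain Java, without Lucene. It is not the library's implementation: the class name `BBitMinHashSketch`, the `hash` mixing function, and the whitespace splitting are all assumptions made for illustration. For each of `num` seeded hash functions it takes the minimum hash over all tokens and keeps only the lowest bit.

```java
class BBitMinHashSketch {

    // Hypothetical seeded hash: mixes a token's hashCode with a seed.
    // This is a stand-in, not the hash function used by the library.
    static int hash(String token, int seed) {
        int h = token.hashCode() ^ (seed * 0x9E3779B9);
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        h *= 0xC2B2AE35;
        h ^= h >>> 16;
        return h;
    }

    // Builds a signature of num bits: for each hash function, keep the
    // lowest bit of the minimum hash value over all whitespace-separated tokens.
    static byte[] signature(String text, int seed, int num) {
        String[] tokens = text.trim().split("\\s+");
        byte[] bits = new byte[(num + 7) / 8];
        for (int i = 0; i < num; i++) {
            int min = Integer.MAX_VALUE;
            for (String token : tokens) {
                min = Math.min(min, hash(token, seed + i));
            }
            if ((min & 1) != 0) {
                bits[i / 8] |= (byte) (1 << (i % 8));
            }
        }
        return bits;
    }
}
```

With `num = 128` and one bit per hash, the sketch packs the signature into 16 bytes, and the same text with the same seed always yields the same signature.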
The compare method returns the similarity between two texts as a value from 0 to 1. A value below 0.5 indicates that the texts are different.
// Compare a similar text.
String text1 = "Fess is very powerful and easily deployable Search Server.";
byte[] minhash1 = MinHash.calculate(analyzer, text1);
assertEquals(0.953125f, MinHash.compare(minhash, minhash1));
// Compare a different text.
String text2 = "Solr is the popular, blazing fast open source enterprise search platform";
byte[] minhash2 = MinHash.calculate(analyzer, text2);
assertEquals(0.453125f, MinHash.compare(minhash, minhash2));
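The expected values above are multiples of 1/128 (122/128 and 58/128), which is consistent with a similarity defined as the fraction of matching bits between two 128-bit signatures. The following sketch illustrates that idea under this assumption; `BitAgreementSketch` and `similarity` are hypothetical names, not part of the library's API.

```java
class BitAgreementSketch {

    // Returns the fraction of bit positions on which the two signatures agree.
    static float similarity(byte[] a, byte[] b) {
        if (a.length == 0 || a.length != b.length) {
            throw new IllegalArgumentException(
                    "signatures must be non-empty and of equal length");
        }
        int totalBits = a.length * 8;
        int differing = 0;
        for (int i = 0; i < a.length; i++) {
            // XOR leaves a 1 wherever the bits differ; mask to one byte.
            differing += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return (float) (totalBits - differing) / totalBits;
    }
}
```

Under this definition, two unrelated 1-bit signatures would be expected to agree on about half of their bits by chance, which fits the guideline that values below 0.5 indicate different texts.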