MinHash Library

Overview

This library provides tools for the b-bit MinHash algorithm.

Issues/Questions

Please file an issue. (Japanese forum is here.)

Installation

Maven

Add the following dependency to pom.xml:

<dependency>
  <groupId>org.codelibs</groupId>
  <artifactId>minhash</artifactId>
  <version>0.1.0</version>
</dependency>

References

Calculate MinHash

The MinHash class provides methods to calculate MinHash values.

// Lucene Tokenizer that splits the text into tokens.
// If null, WhitespaceTokenizer is used.
Tokenizer tokenizer = null;
// The number of bits kept from each hash value.
int hashBit = 1;
// The base seed for the hash functions.
int seed = 0;
// The number of hash functions.
int num = 128;
// Analyzer that produces 128 hash values of 1 bit each.
Analyzer analyzer = MinHash.createAnalyzer(tokenizer, hashBit, seed,
    num);

String text = "Fess is very powerful and easily deployable Enterprise Search Server.";

// Calculate a MinHash value. Its size is hashBit * num bits.
byte[] minhash = MinHash.calculate(analyzer, text);
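
To make the parameters above more concrete, here is a minimal, self-contained sketch of how a b-bit MinHash signature could be built without Lucene. It is not this library's implementation. The class name BBitMinHashSketch, the stand-in hash function, the whitespace tokenization, and the use of seed + i as the seed of the i-th hash function are all assumptions made for illustration.

```java
import java.util.BitSet;

// Illustrative sketch only; names and hash function are hypothetical.
final class BBitMinHashSketch {

    // Stand-in seeded hash: mixes String.hashCode() with the seed using
    // the 32-bit finalizer steps from MurmurHash3.
    static int hash(String token, int seed) {
        int h = token.hashCode() ^ seed;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Builds a signature of hashBit * num bits, packed into bytes.
    static byte[] signature(String text, int hashBit, int seed, int num) {
        String[] tokens = text.trim().split("\\s+");
        BitSet bits = new BitSet(hashBit * num);
        for (int i = 0; i < num; i++) {
            // MinHash step: keep the smallest hash over all tokens.
            int min = Integer.MAX_VALUE;
            for (String token : tokens) {
                min = Math.min(min, hash(token, seed + i));
            }
            // b-bit step: keep only the lowest hashBit bits of the minimum.
            for (int b = 0; b < hashBit; b++) {
                if (((min >>> b) & 1) != 0) {
                    bits.set(i * hashBit + b);
                }
            }
        }
        // BitSet.toByteArray() omits trailing zero bytes, so copy the
        // result into an array of the full, fixed size.
        byte[] out = new byte[(hashBit * num + 7) / 8];
        byte[] raw = bits.toByteArray();
        System.arraycopy(raw, 0, out, 0, raw.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] sig = signature(
                "Fess is very powerful and easily deployable Enterprise Search Server.",
                1, 0, 128);
        System.out.println(sig.length + " bytes");
    }
}
```

With hashBit = 1 and num = 128, the sketch yields a 16-byte signature. The bit values will not match those produced by this library, because the hash function and tokenizer differ.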

Compare Texts

The compare method returns the similarity between two texts as a value from 0 to 1. A value below 0.5 means that the texts are different.

// Compare a similar text.
String text1 = "Fess is very powerful and easily deployable Search Server.";
byte[] minhash1 = MinHash.calculate(analyzer, text1);
assertEquals(0.953125f, MinHash.compare(minhash, minhash1));

// Compare a different text.
String text2 = "Solr is the popular, blazing fast open source enterprise search platform";
byte[] minhash2 = MinHash.calculate(analyzer, text2);
assertEquals(0.453125f, MinHash.compare(minhash, minhash2));
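
The two values above are multiples of 1/128 (122/128 and 58/128), which fits a similarity defined as the fraction of matching bits in two 128-bit signatures. The sketch below illustrates that idea under this assumption. It is not the library's code, and the class name BitSimilaritySketch is hypothetical.

```java
// Illustrative sketch only; not the library's implementation.
final class BitSimilaritySketch {

    // Returns the fraction of bit positions at which the two signatures agree.
    static float similarity(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Signatures must have the same length.");
        }
        if (a.length == 0) {
            return 0f;
        }
        int differing = 0;
        for (int i = 0; i < a.length; i++) {
            // XOR leaves a 1 bit wherever the two bytes differ.
            differing += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        int totalBits = a.length * 8;
        return (float) (totalBits - differing) / totalBits;
    }

    public static void main(String[] args) {
        byte[] x = new byte[16]; // 128 bits, all zero
        byte[] y = new byte[16];
        y[0] = 0x3F;             // 6 of the 128 bits differ
        System.out.println(similarity(x, y)); // prints 0.953125 (122 / 128)
    }
}
```

Under this assumption, two unrelated texts hashed to 1 bit per function would still agree on roughly half of their bits by chance, which is consistent with treating values below 0.5 as different texts.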
