Giter VIP home page Giter VIP logo

simhash-cpp's Introduction

Simhash Near-Duplicate Detection

Build Status

This library enables the identification of all fingerprints that are nearly identical to a query fingerprint. In this context, a fingerprint is an unsigned 64-bit integer.

It also comes with an auxillary function designed to generate a fingerprint given a char* and a length. This fingeprint is generated with a tokenizer and a hash function (both of which may be provided as template paramters). Using a cyclic hash function, it then performs simhash on a moving window of tokens (as defined by the tokenizer).

The default hash function is jenkins hash and the default tokenizer splits on non-alpha characters.

Status: Production Team: Big Data Scope: External Open Source: Yes Critical: Yes

Usage

Library

The library provides two utilities for finding simhashes:

  • Simhash::find_all finds all matching pairs of simhashes
  • Simhash::find_clusters finds clusters of matching simhashes (see #clustering)

Binaries

This also provides two binaries to facilitate use from other languages. They both read hashes as newline-separated decimal strings, and print out newline-separated JSON arrays.

  • simhash-find-all writes all matching pairs as arrays with two elements
  • simhash-find-clusters writes all clusters as arrays of simhashes

Both have the following common arguments:

  • --input names a file from which to read (defaults to -, meaning stdin)
  • --output names a file to which to write (defaults to -, meaning stdout)
  • --blocks sets the number of blocks to use for simhash matching
  • --distance sets the maximum bit distance for considering matches

Architecture

In this context, there is a large corpus of known fingerprints, and we would like to determine all the fingerprints that differ by our query by k or fewer bits. To accomplish this, we divide up the 64 bits into at m blocks, where m is greater than k. If hashes A and B differ by at most k bits, then at least m - k groups are the same.

Choosing all the unique combinations of m - k blocks, we perform a permutation on each of the hashes for the documents so that those blocks are first in the hash. Perhaps a picture would illustrate it better:

63------53|52------42|41-----32|31------21|20------10|09------0|
|    A    |     B    |    C    |     D    |     E    |    F    |

If m = 6, k = 3, we'll choose permutations:
- A B C D E F
- A B D C E F
- A B E C D F
...
- C D F A B E
- C E F A B D
- D E F A B C

This generates a number of tables that can be put into sorted order, and then a small range of candidates can be found in each of those tables for a query, and then each candidate in that range can be compared to our query.

The corpus is represented by the union of these tables, could conceivably be hosted on a separate machine. And each of these tables is also amenable to sharding, where each shard would comprise a contiguous range of numbers. For example, you might divide a table into 256 shards, where each shard is associated with each of the possible first bytes.

The best partitioning remains to be seen, likely from experimentation, but the basis of this is the table. The table tracks hashes inserted into it subject to a permutation associated with the table. This permutation is described as a vector of bitmasks of contiguous bit ranges, whose populations sum to 64.

Example

Let's suppose that our corpus has a fingerprint:

0100101110111011001000101111101110111100001010011101100110110101

and we have a query:

0100101110111011011000101111101110011100001010011100100110110101

and they differ by only three bits which happen to fall in blocks B, D and E:

63------53|52------42|41-----32|31------21|20------10|09------0|
|    A    |     B    |    C    |     D    |     E    |    F    |
|         |          |         |          |          |         |
0000000000000000010000000000000000100000000000000001000000000000

Since any fingerprint matching the query differs by at most 3 bits, at most 3 blocks can differ, and at least 3 must match. Whatever table has the 3 blocks that do not differ as the leading blocks will match the query when doing a scan. In this case, the table that's permuted A C F B D E will match. It's important to note that it's possible for a query to match from more than one table. For example, if two of the non-matching bits are in the same block, or the query differs by fewer than 3 bits.

Clustering

The current clustering implementation considers the clusters to be the connected components of the connectivity graph. In other words:

For a simhash to be a member of a cluster, it must be a match with at least one
member of that cluster.

This does mean that a cluster may have pairs of members that aren't matches. For examples, (A, B, C, D) might be a cluster where A matches B, which matches C, which matches D, but A and D are too far apart to be a match.

simhash-cpp's People

Contributors

b4hand avatar eyvindn avatar simongog avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

simhash-cpp's Issues

Migrate off of libjudy

This has been discussed in the past, but both #7 and seomoz/simhash-py#8 are related to libjudy's reliance on undefined behavior which is now causing the library to malfunction on newer versions of GCC. It's caused us enough pain in the past that we should probably just bite the bullet and pick something new.

Implement method that returns all similar hashes

Hi,

Currently in the master branch there is a find_all method that returns all pair of matches.

Would be nice if there was a method that given a t_hash it returns all the similar ones to the t_hash

something like:

find(hashes, t_hash) -> returns a vector of all the similar hashes with t_hash

License?

@dlecocq Dan: what would the license for this fine piece of code? Thanks!

fnv.h and murmur.h appear to be stubs

The files fnv.h and murmur.h appear to be stubs and don't actually implement any hashing function.

Additionally, murmur.h does not contain a proper include guard.

The jenkins hash is the only valid implementation. I don't know if this is intentional or not, but it seems like it would be a good idea to drop these files if there's no plans of filling them in.

Unable to use simhash-find-all

I have tried the below command but it's giving me an error

g++ -o "simhash-find-all" simhash-find-all.cpp --input "/home/sheel/Downloads/simhash-cpp-master/temp.txt" --output "/home/sheel/Downloads/simhash-cpp-master/tmp2.txt"

Error -
g++: error: unrecognized command-line option ‘--input’

bench exits with error status 1 (libjudy related?)

Trying to compile simhash-cpp, I am running into a couple of issues. The first one is that I am unable to compile test.cpp. The issue is described already in #2.

The other issue is related to bench.cpp. It compiles, but when I am trying to run the executable, I get the following error message:

Inserting 10000000 hashes...
Running 40000000 queries...
Queries complete with 0 errors
There are 9997291 items in the table
File 'simhash.h', line 137: Judy1FreeArray(), JU_ERRNO_* == 9, ID == 113
zsh: exit 1     ./bench 10000000

I have used the following packages for compilation:

judy 1.0.5
g++ 4.9.2
Catch build 53

Any idea what is happening?

EDIT: The JU_ERRNO above corresponds to (from July.h):

// JU_ERRNO_CORRUPT occurs when Judy detects an impossible value in a Judy data
// structure:
//
// Note:  The Judy data structure contains some redundant elements that support
// this type of checking.

        JU_ERRNO_CORRUPT        = 9

test.cpp have problems.

when I did make, I got this.
Help help~

test.cpp:158:70: error: no match for call to ‘(Simhash::Simhash<>) (const char_, std::basic_string::size_type)’
hash.hpp:26:11: note: candidate is:
hash.hpp:39:16: note: Simhash::hash_t Simhash::Simhash::operator()(char__) [with Hash = Simhash::jenkins, Simhash::hash_t = long unsigned int]
hash.hpp:39:16: note: candidate expects 1 argument, 2 provided
test.cpp:161:70: error: no match for call to ‘(Simhash::Simhash<>) (const char_, std::basic_string::size_type)’
hash.hpp:26:11: note: candidate is:
hash.hpp:39:16: note: Simhash::hash_t Simhash::Simhash::operator()(char**) [with Hash = Simhash::jenkins, Simhash::hash_t = long unsigned int]
hash.hpp:39:16: note: candidate expects 1 argument, 2 provided
test.cpp:175:47: error: no match for call to ‘(Simhash::Simhash<>) (const char*, std::basic_string::size_type)’
hash.hpp:26:11: note: candidate is:
hash.hpp:39:16: note: Simhash::hash_t Simhash::Simhash::operator()(char**) [with Hash = Simhash::jenkins, Simhash::hash_t = long unsigned int]
hash.hpp:39:16: note: candidate expects 1 argument, 2 provided
make: *** [test] Error 1

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.