Giter VIP home page Giter VIP logo

olivierh59500 / simple_simhash Goto Github PK

View Code? Open in Web Editor NEW

This project forked from optimyze/simple_simhash

0.0 0.0 0.0 13 KB

A pure ANSI-C implementation of calculating a SimHash over 4-byte tuples (including multiplicities) for a given byte stream. Simple and reasonably fast, no dynamic memory allocations (outside of some stack usage). Uses a counting bloom filter to count multiplicities while keeping memory consumption constant.

License: Apache License 2.0

Makefile 3.03% C 91.15% C++ 5.82%

simple_simhash's Introduction

simple_simhash

A pure ANSI-C implementation of calculating a SimHash over 4-byte tuples (including multiplicities) for a given byte stream. Simple and reasonably fast, no dynamic memory allocations (outside of some stack usage).

Calculates a 256-bit hash for the input stream. Two input streams can be compared to each other by calculating the hamming distance between the hashes; this is just a matter of XOR and popcount.

The output approximates the cosine similarity -- S = |W1 \cap W2| / sqrt(|W1||W2|). The sets W1 and W2 in this scenario are the 4-byte sequences encountered in the input buffers, where the n-th occurrence of the same sequence is treated as distinct from the n+1-th occurrence.

Let's calculate through an example:

Buffer1: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Buffer2: ABCDEFGHIJKLMNOPQRSTUVWXYZ____abcdefghijklmnopqrstuvwxyz

There are 52 4-tuples in the first buffer, and 56 4-tuples in the second buffer.

Buffer2 does not contain the tuples XYZa, YZab, Zabc which are present in Buffer1, and Buffer1 does not contain the tuples XYZ_, YZ__, Z___, ____, ___a, __ab, _abc which are present in Buffer2.

This means that the cosine similarity is: 49 / sqrt(53 * 56) = 49 / sqrt(2968) = 49 / 54 ~= 0.899. The distance is hence a bit more than 0.1.

Let's have a look:

$ ./simhash_compare ./testfile1.txt ./testfile2.txt 
95e24d0f855f73aa-563488dc5716e32f-0095f987c0616a19-0be46dc9599106b8
d5c04d0e865f73aa-deb4885d5636ea27-1691f983c0616a19-1bc1edea59938679
Hamming Distance: 34 - 0.13

Seems to work.

The code itself is well-suited to be embedded into kernel drivers etc. - the runtime is predictable in terms of the number of bytes that need to be processed, outside of some stack use no dynamic allocations can happen, there is no recursion and no other unwanted surprises.

simple_simhash's People

Contributors

thomasdullien avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.