This is a Python implementation of Simhash.
http://leons.im/posts/a-python-implementation-of-simhash-algorithm/
A Python Implementation of Simhash Algorithm
License: MIT License
Hi,
Have you considered implementing the `__eq__` operator for the Simhash class, to compare the `.value` fields of two simhashes? That way a simhash would be much easier to integrate into comparison flows such as Python's unit tests.
I could submit a PR if you're interested in this.
I see an "add" feature is already there. Thanks!
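For what it's worth, a minimal sketch of the idea, using a stand-in class rather than the library's own `Simhash` (all names here are illustrative):

```python
class Simhash64:
    """Stand-in for a simhash object exposing a 64-bit fingerprint in .value."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        # Two simhashes are equal exactly when their fingerprints match.
        return isinstance(other, Simhash64) and self.value == other.value

    def __hash__(self):
        # Defining __eq__ disables the default __hash__; restore hashability
        # so instances keep working in sets, dicts, and unittest comparisons.
        return hash(self.value)
```

With something like that in place, `assertEqual(sh1, sh2)` in a unittest compares fingerprints directly.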
Is this correct? Is this the expected behavior? Thanks!
```
print '%x' % Simhash('aa').value
86f24ba207a4912
print '%x' % Simhash('aaa').value
86f24ba207a4912
print '%x' % Simhash('aaaa').value
86f24ba207a4912
print '%x' % Simhash('aaaaa').value
86f24ba207a4912
print '%x' % Simhash('aaaaab').value
86f24ba207a4912
print '%x' % Simhash('aaaaabb').value
86f24ba207a4912
print '%x' % Simhash('aaaaabbb').value
86f24ba207a4912
```
I see you set f = 64 in this program, but the `hexdigest()` method of md5 returns 128 bits. I think this is a contradiction.
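As far as I can tell there is no real contradiction: only the low `f` bits of the 128-bit digest end up being used, because the per-bit loop only iterates over `f` mask positions. A stdlib-only sketch of the same truncation (the helper name is mine, not the library's):

```python
import hashlib

def md5_64(token):
    """Derive a 64-bit integer from md5's 128-bit digest by masking."""
    full = int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16)  # 128 bits
    return full & ((1 << 64) - 1)                                   # keep low 64

h = md5_64('simhash')
assert 0 <= h < (1 << 64)
```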
I see the latest version, 2.1.0, listed on pypi.org, however when I try to install it I get a 'no matching distribution found' error:

```
$ pip3 install simhash==2.1.0
ERROR: Could not find a version that satisfies the requirement simhash==2.1.0 (from versions: 0.0.2, 0.0.3, 0.1.0, 0.1.1, 0.1.2, 0.1.3, 0.1.4, 0.1.5, 0.1.6, 0.1.7, 0.1.8, 0.1.10, 0.1.11, 1.0.0, 1.0.1, 1.0.11, 1.1.0, 1.1.1, 1.1.2, 1.2.0, 1.3.0, 1.4.0, 1.4.2, 1.4.4, 1.4.6, 1.5.0, 1.5.1, 1.6.0, 1.6.1, 1.6.2, 1.7.0, 1.7.1, 1.8.0, 1.9.0, 1.10, 1.10.1, 1.10.2, 1.11.0, 2.0.0)
ERROR: No matching distribution found for simhash==2.1.0
```

Is it possible something went wrong with the latest release during the publish process to PyPI?
`if v[i] > 0:`

Other Simhash implementations don't set the bit in the result hash to 1 if the count in the result vector is 0. Examples: https://github.com/seomoz/simhash-cpp/blob/e7aacb1642f406ff0815cf402e909d2002473812/src/simhash.cpp and https://github.com/admazely/simhash/blob/master/main.js
You can also check this paper from the University of Saskatchewan (page 3): https://www.cs.usask.ca/~croy/papers/2011/URKH_WCRE2011_simCad.pdf
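To illustrate the difference, the two conventions only disagree on exact ties, i.e. components whose weighted sum is zero (this is a sketch, not the library's code):

```python
def fingerprint(v, strict=True):
    """Collapse per-bit weighted sums into a fingerprint.

    strict=True  (like `v[i] > 0`)  maps a zero sum to bit 0;
    strict=False (like `v[i] >= 0`) maps a zero sum to bit 1.
    """
    ans = 0
    for i, s in enumerate(v):
        if s > 0 or (not strict and s == 0):
            ans |= 1 << i
    return ans

v = [3, 0, -2, 0]  # components 1 and 3 are exact ties
assert fingerprint(v, strict=True) == 0b0001
assert fingerprint(v, strict=False) == 0b1011
```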
I customized it as below:
```python
ans = PriorityQueue()
for key in self.get_keys(simhash):
    dups = self.bucket[key]
    self.log.debug('key:%s', key)
    if len(dups) > 200:
        self.log.warning('Big bucket found. key:%s, len:%s', key, len(dups))
    for dup in dups:
        sim2, obj_id = dup.split(',', 1)
        sim2 = Simhash(long(sim2, 16), self.f)
        d = simhash.distance(sim2)
        if d <= self.k:
            ans.put((d, obj_id))
res = []
tmp = {}
while not ans.empty():
    d, obj_id = ans.get()
    if obj_id not in tmp:
        res.append(str(obj_id))
        tmp[obj_id] = 1
```
```python
text1 = '**'
text2 = '**人'
words1 = list(text1)
words2 = list(text2)
print(Simhash(words1).distance(Simhash(words2)))
```

The result is 14. It seems it doesn't work on Chinese?
Dear author,
Could you please delete this line: https://github.com/1e0ng/simhash/blob/master/simhash/__init__.py#L210? I think there is no need to print all the debug information.
```
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
in ()
----> 1 Simhash(s2)

C:\Users\Administrator\Anaconda2\lib\site-packages\simhash\__init__.pyc in __init__(self, value, f, reg, hashfunc)
     47             self.value = value.value
     48         elif isinstance(value, basestring):
---> 49             self.build_by_text(unicode(value))
     50         elif isinstance(value, collections.Iterable):
     51             self.build_by_features(value)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 4072: ordinal not in range(128)
```
I just installed simhash using pip for Python 3.4. There are a few minor issues that break when using Python 3. Would you be able to run through your code and update it for Python 3?
Hi,
I faced the problem of "ImportError: cannot import name 'Simhash'", but I believe I installed the package correctly...
#44 introduces a gmpy2 dependency, which causes major issues downstream. Error:

```
src/gmpy.h:252:20: fatal error: mpfr.h: No such file or directory
```

gmpy2 doesn't seem to be pip-installable, at least not on OSX.
How does a bigger bucket affect the results in the code? Does it affect anything other than running time?
This should be about 10x faster than the current implementation.
However, the value calculation would also need to change, since the highest bit of the array is on the left side.
```python
import time
from gmpy2 import popcount
import numpy as np

a = '1000111110100001010100011111111101111111000111010011110001110111'
b = '1111010101001010101001001101010001'

def get_value(v):
    ans = 0
    masks = [1 << i for i in range(64)]
    for i in range(64):
        if v[i] > 0:
            ans |= masks[i]
    return ans

a = '0' * (64 - len(a)) + a
b = '0' * (64 - len(b)) + b

def dis1(a, b):
    x = (a ^ b) & ((1 << 64) - 1)
    ans = 0
    while x:
        ans += 1
        x &= x - 1
    return ans

def dis2(a, b):
    return popcount(a ^ b)

a_np = np.array(list(map(lambda x: int(x), a)))
b_np = np.array(list(map(lambda x: int(x), b)))
a_value = get_value(a_np)
b_value = get_value(b_np)
a_int = int(a, 2)
b_int = int(b, 2)

s = time.time()
for i in range(1000000):
    d = dis1(a_value, b_value)
print(time.time() - s, d)

s = time.time()
for i in range(1000000):
    d = dis2(a_int, b_int)
print(time.time() - s, d)
```

output:

```
4.0435950756073 35
0.323089599609375 35
```
I get an error when running test_simhash.py:
```
➜ tests git:(master) python test_simhash.py
Traceback (most recent call last):
  File "test_simhash.py", line 68, in test_sparse_features
    shs.append(Simhash(features))
  File "build/bdist.macosx-10.9-intel/egg/simhash/__init__.py", line 41, in __init__
    self.build_by_features(value)
  File "build/bdist.macosx-10.9-intel/egg/simhash/__init__.py", line 63, in build_by_features
    hashs = [self.hashfunc(w.encode('utf-8')) for w in features]
AttributeError: 'tuple' object has no attribute 'encode'
```
It also seems the code being executed is not the one in the simhash directory, but the system-installed copy.
Hello, may I ask how you tokenize text in different languages?
My code:
```python
from simhash import Simhash, SimhashIndex

data = {
    1: 'How are you? I Am fine. blar blar blar blar blar Thanks.',
    2: 'How are you i am fine. blar blar blar blar blar than',
    3: 'This is simhash test.',
}
objs = [(str(k), Simhash(v)) for k, v in data.items()]
index = SimhashIndex(objs)
print(index.bucket_size())

# s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
s1 = Simhash('How are you i am fine. blar blar blar blar blar thank'.split())
print(index.get_near_dups(s1))
index.add('4', s1)
print(index.get_near_dups(s1))
```
I also tried:

```python
s1 = Simhash('How are you i am fine. blar blar blar blar blar thank'.split())
```
OUTPUT:

```
7
[]
['4']
```

Expected output:

```
7
['1']
['4', '1']
```
An explanation of the functions and their parameters is required.
The problem is as follows; thanks for any suggestions. Example code:
```python
print as_target
sh1 = Simhash(as_target)
print '%x' % Simhash(get_features(as_target)).value
print id(sh1)
print as_user
sh2 = Simhash(as_user)
print '%x' % Simhash(get_features(as_user)).value
print id(sh2)
print sh1.distance(sh2)
```
The stdout looks like:

```
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
415435ffb1fa93d4
4367884816
0000001000000010001100001100010010000000000000000000000010000010000000100000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001
415435ffb1fa93d4
4367884944
0
```
I use this module to hash Chinese articles, and I do not always get a 64-bit simhash code; sometimes it is 63 or 62 bits.
What can I do to make it always 64 bits?
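For what it's worth, my understanding is that the fingerprint is a plain Python integer, so `bin()` simply omits leading zero bits; the hash itself is still 64 bits wide. A quick sketch of a width-safe way to inspect it:

```python
# A fingerprint whose top bits happen to be zero: bin() shows fewer than 64 digits.
value = 0x086f24ba207a4912

assert len(bin(value)) - 2 < 64      # bin() drops the leading zeros
bits = format(value, '064b')         # zero-padded to a fixed 64-bit width
assert len(bits) == 64
```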
I use more than 10,000 texts, and generating features with it takes 8 minutes, which is theoretically too long. How can I solve this problem?
I'm just opening this for documentation / FYI, since I previously sent a pull request on optimizations (#48) -- feel free to close. :)
It is possible to calculate simhashes about 10 times as fast with a dependency on numpy. Here's a simple example:
```python
import numpy as np
import hashlib
import simhash

def hashfunc(obj):
    return hashlib.md5(obj.encode('utf8')).digest()[-8:]

def np_simhash(features):
    bytestring = b''.join(hashfunc(f) * w for f, w in features)
    bitarray = np.unpackbits(np.frombuffer(bytestring, dtype='>B'))
    rows = np.reshape(bitarray, (-1, 64))
    sums = np.sum(rows, 0)
    return int.from_bytes(np.packbits(sums > len(features) // 2).tobytes(), "big")

features = [("foo", 1), ("bar", 2)]
assert np_simhash(features) == simhash.Simhash(features).value
```
This runs about 10 times as fast in my testing (using randomly generated documents of a few hundred tokens, and also real-life text documents in 3-word shingles). With this implementation the underlying hash function becomes the bottleneck, instead of bit counting being the bottleneck as it currently is.
This might not be a good fit for this library, because of the dependency on numpy. My simple example would also need a little tweaking to keep the API of the current library; it might want to do the summing in batches to avoid using too much RAM; and it would need some sort of fallback for large weights, where `* w` is a bad idea. But I wanted to share in case the speedup is of interest.
Recently, I have been studying how to use the simhash algorithm to check for duplicates among a large number of .html files, to identify whether different websites are duplicates. But I observed that the simhash package you provide seems to only support passing text directly into the code to calculate the simhash value and determine similarity. I want to ask: how can I use your package to check for duplicates between different .html files?
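One way to read this: simhash operates on text (or on feature lists), so the HTML handling has to happen before hashing. A stdlib-only sketch that strips tags and scripts so the extracted text can then be fed to `Simhash` / `SimhashIndex` (the helper names are mine):

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>/<style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(markup):
    """Return the visible text of an HTML string, whitespace-normalized."""
    parser = TextExtractor()
    parser.feed(markup)
    return re.sub(r'\s+', ' ', ' '.join(parser.parts)).strip()

# Hypothetical usage with the library:
#   text = html_to_text(open('page.html', encoding='utf-8').read())
#   sh = Simhash(text)   # then index/compare as with any other text
```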
Well, in my case I use simhash to search short texts (Twitter-like): not keyword search, but exact text search.
When the dataset is about 200,000 items, everything works fine, and it takes 0.01s to find the exact short text. But when the dataset is about 20,000,000, there are so many "Big bucket found" warnings that it becomes slow and sometimes does not return.
I think there is something I can do. After checking some blogs, I raised k from 3 to 10, but that slows things down even on the small dataset. @leonsim recommends a customized hash function, but I don't understand why or how.
The following example
```python
from simhash import Simhash, SimhashIndex

data = {
    1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',
    2: u'How are you i am fine. blar blar blar blar blar than',
    3: u'This is simhash test.',
    4: u'How are you i am fine. blar blar blar blar blar thank1',
}
objs = [(str(k), Simhash(v)) for k, v in data.items()]
index = SimhashIndex(objs)
print map(lambda x: x[1].value, objs)
print index.bucket_size()
s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
print index.get_near_dups(s1)
index.add(5, s1)
print index.get_near_dups(s1)
```
produces an unexpected result:

```
[8440240427182201978, 8440240356449459322, 9984379969213434071L, 17663612459742043242L]
10
[]
[u'5']
```
Also, passing k=3 or another value into SimhashIndex produces:

```
File "build/bdist.linux-x86_64/egg/simhash/__init__.py", line 112, in get_near_dups
File "build/bdist.linux-x86_64/egg/simhash/__init__.py", line 56, in __init__
Exception: Bad parameter with type <type 'in
```
```
>>> a
u'\u0114\u0115\u0116\u0117\u0118\u0119\u011a\u0119\u011a\u0119\u011a\u0119\u011a\u0119\u011a\u011b\u011c\u0114\u0115\u0116\u0117\u0118\u0119\u011a\u0119\u011a\u0119\u011a\u0119\u011a\u011b\u011c\u0114\u0115\u0116\u0117\u0118\u0119\u011a\u0119\u011a\u0119\u011a\u0119\u011a\u0119\u011a\u011b\u011c\u0114\u0115\u0116\u0117\u0118\u0119\u011a\u0119\u011a\u0119\u011a\u011b\u011c\u011d\u011e\u0114'
>>> b
u"!#$%$%&'()*+,\xa9\xaa\xab\xac\xad\xae\xaf\xae\xaf\xb0\xb1\xb2<=>?"
>>> Simhash(a).distance(Simhash(b))
0
```
It would be great if the simhash module could use a custom logger. The advantage of this approach is that certain messages could be muted solely for this module, which is particularly useful when using several different modules that all do extensive logging.

```python
log = logging.getLogger("simhash")
log.debug("my logging message")
```
Hi, I appreciate the great work!
There are some things I'm confused about.
I'm dealing with Chinese HTML text, so I customized the `get_features` function. Specifically, I first extract the text content from the .html file, then use `jieba.analyse.extract_tags` to get the top-K keywords and their tf-idf weights.
Then I call `sh = Simhash(get_features(content)).value` to get the simhash fingerprint, and I call `len(bin(sh)[2:])` to check the dimension of the fingerprint (`[2:]` because there is always a '0b' prefix in front of the simhash bits).
But here comes the question: I found that the dimension of the simhash bits is not always 64 even though the default `self.f` is set to 64; it can be 61, 62, 63, or 64 as far as I have seen.
Have you ever encountered this problem before? I really wonder why. Am I using the right method to check the dimension of the simhash bits? How could the dimension vary instead of always being 64? @1e0ng
Thanks!
Hi,
There is no doubt that this project helped me a lot when I tried to deduplicate 130,000 wiki docs. However, it gives me a headache when I try to re-run the simhash building process, even with multiprocessing. Does anyone plan to add serialization and deserialization support in order to save time?
It seems that ZODB may be a proper backend tool. (I will certainly give it a try.)
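Until something official exists, one cheap workaround is to persist only `(obj_id, fingerprint)` pairs and rebuild the index from them, since hashing the raw text is the expensive part. A sketch (assuming the `Simhash` constructor accepts an integer fingerprint, which it appears to):

```python
import pickle

# The expensive part is hashing the documents; the fingerprints themselves
# are just 64-bit ints and serialize trivially.
pairs = [('doc-1', 0x86f24ba207a4912), ('doc-2', 0x415435ffb1fa93d4)]

blob = pickle.dumps(pairs)        # write this to disk / a database
restored = pickle.loads(blob)
assert restored == pairs

# Hypothetical rebuild, with no re-hashing of text:
#   objs = [(obj_id, Simhash(value)) for obj_id, value in restored]
#   index = SimhashIndex(objs, k=3)
```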
Python 2.7.14 64bit
x86_64 Linux
AMD64 Windows
Try:

```python
print Simhash(4390059585430954713).value
```

I encounter an error:

```
Traceback (most recent call last):
  File "/home/mypc/test_simhash.py", line 4, in <module>
    print Simhash(4390059585430954713).value
  File "/home/mypc/.local/lib/python2.7/site-packages/simhash/__init__.py", line 58, in __init__
    raise Exception('Bad parameter with type {}'.format(type(value)))
Exception: Bad parameter with type <type 'int'>
```
Then I found that Python's sys.maxint is 9223372036854775807 on Linux but 2147483647 on Windows, even though the Windows system is 64-bit. I examined the code and found `isinstance(value, long)` at line 60 in simhash/__init__.py, so I had to add `isinstance(value, int)` to it: 4390059585430954713 is lower than 9223372036854775808 but greater than 2147483647, which is why `Simhash(4390059585430954713)` executed unsuccessfully on Linux but successfully on Windows. I don't know whether this change is right, because I have learned only a little about the simhash algorithm, so I am posting this issue to make sure.
Thanks for your efforts!
Simhash is returning the same hash value for different strings.
The strings may contain the same words, but the word counts and string lengths are different.
A small reproducible sample is given below:
```python
from simhash import Simhash

def get_simhash_value(input: str):
    simhash_value = Simhash(value=input).value
    return simhash_value

res1 = get_simhash_value("test test test test test test test test test test")
res2 = get_simhash_value("test test test test test test test")
print(res1)
print(res2)
```
The original text content is two HTML documents after removing all tags and scripts.
Thanks for your effort implementing the algorithm, first of all!
When storing hashes in a database, I can obviously only extract strings from it.
Is there any way to convert them back to the Simhash type (e.g. to measure distance)?
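If the stored strings are just the fingerprint (e.g. written out with `'%x' % sh.value`), converting back should only require parsing the integer; the constructor appears to accept an integer fingerprint directly in recent versions. A small sketch:

```python
stored = '86f24ba207a4912'        # hex string as it came out of the database
value = int(stored, 16)           # back to the 64-bit integer fingerprint
assert '%x' % value == stored     # round-trips cleanly

# Hypothetical usage with the library:
#   sh = Simhash(value)
#   sh.distance(other_simhash)
```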
How can I effectively suppress the "Big bucket found" message? On larger files I frequently see thousands of them in the output, which makes it difficult to review the results.
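If the warning goes through the standard `logging` module, raising the threshold should mute it; whether it is emitted on the root logger or a named one depends on the library version, so the logger name below is an assumption:

```python
import logging

# Mute WARNING-level messages (such as 'Big bucket found') from a named
# logger; the name 'simhash' is an assumption -- adjust it, or target the
# root logger, depending on where the library actually logs.
logging.getLogger('simhash').setLevel(logging.ERROR)

# A blunter option that silences WARNING and below everywhere:
logging.disable(logging.WARNING)
assert not logging.getLogger('simhash').isEnabledFor(logging.WARNING)
```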
I am using this package to generate simhashes for my text. I have found that different lists of tokens may generate the same hash value, which is confusing to me.
```python
from simhash import Simhash

x = [('恋爱', 2.61855744598), ('脱单', 0.29491400873), ('闪婚', 0.29491400873)]
y = [('恋爱', 3.92783616897), ('结婚', 3.282195995975)]
print(Simhash(x).value)
print(Simhash(y).value)
```
The last two print statements print the same hash value, 12369878657857125584. I am not sure why this happens. Is this the intended behavior?
Error info:

```
Traceback (most recent call last):
  File "test.py", line 2, in <module>
    from simhash import Simhash
  File "/usr/lib/python2.6/site-packages/simhash/__init__.py", line 69
    features = {k:sum(1 for _ in g) for k, g in groupby(sorted(features))}
                                    ^
SyntaxError: invalid syntax
```
This is my code:

```python
from simhash import Simhash
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

#print(Simhash('哎,江明也才发现,这边页面里面隐藏跳转都有好多页面,全部很满,基本不能共用,又没设计图,周三做完不可能了,做出来多半也是废的会重做').value)
#print(Simhash("this is test demo wewr").value)
```
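For context: dict comprehensions only exist from Python 2.7 onward, which is why Python 2.6 rejects that line. A 2.6-compatible spelling of the same counting logic, using illustrative data:

```python
from itertools import groupby

features = ['a', 'b', 'a', 'c', 'b', 'a']

# Python 2.7+ form (what the library uses):
#   {k: sum(1 for _ in g) for k, g in groupby(sorted(features))}
# Python 2.6-compatible equivalent, via dict() over a generator:
counts = dict((k, sum(1 for _ in g)) for k, g in groupby(sorted(features)))
assert counts == {'a': 3, 'b': 2, 'c': 1}
```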
The following link: https://leons.im/posts/a-python-implementation-of-simhash-algorithm/ gives me a security warning in Chrome with `NET::ERR_CERT_DATE_INVALID`.
Hello, thank you for your beautiful library!
In Python 3.9+ (corrected: 3.10!) the `collections` module no longer contains `Iterable`. It has been deprecated there since Python 3.3, and `collections.abc.Iterable` is suggested instead.
It is still used in the code, though:
Line 75 in 8987a06
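A version-tolerant import along the lines suggested:

```python
try:
    from collections.abc import Iterable   # Python 3.3+
except ImportError:                        # very old interpreters
    from collections import Iterable

assert isinstance([], Iterable)
assert not isinstance(42, Iterable)
```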
Please replace the broken reference to a PDF file in `SimhashIndex.offsets`.
Hi,
I've read your implementation of finding near-duplicates, but something doesn't seem clear to me, and I hoped you could clarify it. You use a dictionary to store a concatenation of a simhash chunk and its index as keys, and a concatenation of the full simhash and the simhash id as values.

Keys:

```python
yield '%x:%x' % (c, i)
```

Values:

```python
for key in self.get_keys(simhash):
    v = '%x,%s' % (simhash.value, obj_id)
    self.bucket[key].add(v)
```

Since dictionaries don't allow repeated keys (a repeated insert updates the value), if two simhashes have the same value for the same chunk at the same permutation index (for whatever reason), are we not losing one of the simhashes when building the SimhashIndex object, and therefore missing a possible duplicate?
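For what it's worth, the quoted snippet uses `self.bucket[key].add(v)`, which suggests each key maps to a *set* of entries rather than a single value; if so, colliding keys accumulate instead of overwriting. A quick sketch of that behavior (the key and entry strings are illustrative):

```python
from collections import defaultdict

bucket = defaultdict(set)

# Two different simhashes land on the same chunk:index key:
bucket['86f2:0'].add('86f24ba207a4912,doc-1')
bucket['86f2:0'].add('86f2deadbeef123,doc-2')

# Both entries survive; nothing is overwritten.
assert len(bucket['86f2:0']) == 2
```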
```
[root@test simhash]# py
Python 3.5.4 (default, Jul 10 2018, 23:55:37)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from simhash import Simhash
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'Simhash'
>>> import simhash
>>> sh = simhash.Simhash()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'simhash' has no attribute 'Simhash'
```