This is a Python implementation of Simhash.
http://leons.im/posts/a-python-implementation-of-simhash-algorithm/
A Python Implementation of Simhash Algorithm
License: MIT License
Hi,
Have you considered implementing the `__eq__` operator for the Simhash class, to compare the `.value` fields of two simhashes? That way a simhash would be much easier to integrate into comparison flows such as Python's unit tests.
I could submit a PR if you're interested in this.
I see an "add" feature is already there. Thanks!
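For what it's worth, a minimal sketch of the idea, using a stand-in class rather than the library's own `Simhash` (all names here are illustrative):

```python
class Simhash64:
    """Stand-in for a simhash object exposing a 64-bit fingerprint in .value."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        # Two simhashes are equal exactly when their fingerprints match.
        return isinstance(other, Simhash64) and self.value == other.value

    def __hash__(self):
        # Defining __eq__ disables the default __hash__; restore hashability
        # so instances keep working in sets, dicts, and unittest comparisons.
        return hash(self.value)
```

With something like that in place, `assertEqual(sh1, sh2)` in a unittest compares fingerprints directly.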
Is this correct? Is this the expected behavior? Thanks!
```
print '%x' % Simhash('aa').value
86f24ba207a4912
print '%x' % Simhash('aaa').value
86f24ba207a4912
print '%x' % Simhash('aaaa').value
86f24ba207a4912
print '%x' % Simhash('aaaaa').value
86f24ba207a4912
print '%x' % Simhash('aaaaab').value
86f24ba207a4912
print '%x' % Simhash('aaaaabb').value
86f24ba207a4912
print '%x' % Simhash('aaaaabbb').value
86f24ba207a4912
```
I see you set f = 64 in this program, but the `hexdigest()` method of md5 returns 128 bits. I think this is a contradiction.
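As far as I can tell there is no real contradiction: only the low `f` bits of the 128-bit digest end up being used, because the per-bit loop only iterates over `f` mask positions. A stdlib-only sketch of the same truncation (the helper name is mine, not the library's):

```python
import hashlib

def md5_64(token):
    """Derive a 64-bit integer from md5's 128-bit digest by masking."""
    full = int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16)  # 128 bits
    return full & ((1 << 64) - 1)                                   # keep low 64

h = md5_64('simhash')
assert 0 <= h < (1 << 64)
```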
I see the latest version, 2.1.0, listed on pypi.org, however when I try to install it I get a 'no matching distribution found' error:

```
$ pip3 install simhash==2.1.0
ERROR: Could not find a version that satisfies the requirement simhash==2.1.0 (from versions: 0.0.2, 0.0.3, 0.1.0, 0.1.1, 0.1.2, 0.1.3, 0.1.4, 0.1.5, 0.1.6, 0.1.7, 0.1.8, 0.1.10, 0.1.11, 1.0.0, 1.0.1, 1.0.11, 1.1.0, 1.1.1, 1.1.2, 1.2.0, 1.3.0, 1.4.0, 1.4.2, 1.4.4, 1.4.6, 1.5.0, 1.5.1, 1.6.0, 1.6.1, 1.6.2, 1.7.0, 1.7.1, 1.8.0, 1.9.0, 1.10, 1.10.1, 1.10.2, 1.11.0, 2.0.0)
ERROR: No matching distribution found for simhash==2.1.0
```

Is it possible something went wrong with the latest release during the publish process to PyPI?
`if v[i] > 0:`

Other Simhash implementations don't set the bit in the result hash to 1 if the count in the result vector is 0. Examples: https://github.com/seomoz/simhash-cpp/blob/e7aacb1642f406ff0815cf402e909d2002473812/src/simhash.cpp and https://github.com/admazely/simhash/blob/master/main.js
You can also check this paper from the University of Saskatchewan (page 3): https://www.cs.usask.ca/~croy/papers/2011/URKH_WCRE2011_simCad.pdf
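To illustrate the difference, the two conventions only disagree on exact ties, i.e. components whose weighted sum is zero (this is a sketch, not the library's code):

```python
def fingerprint(v, strict=True):
    """Collapse per-bit weighted sums into a fingerprint.

    strict=True  (like `v[i] > 0`)  maps a zero sum to bit 0;
    strict=False (like `v[i] >= 0`) maps a zero sum to bit 1.
    """
    ans = 0
    for i, s in enumerate(v):
        if s > 0 or (not strict and s == 0):
            ans |= 1 << i
    return ans

v = [3, 0, -2, 0]  # components 1 and 3 are exact ties
assert fingerprint(v, strict=True) == 0b0001
assert fingerprint(v, strict=False) == 0b1011
```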
I customized it as below:
```python
ans = PriorityQueue()
for key in self.get_keys(simhash):
    dups = self.bucket[key]
    self.log.debug('key:%s', key)
    if len(dups) > 200:
        self.log.warning('Big bucket found. key:%s, len:%s', key, len(dups))
    for dup in dups:
        sim2, obj_id = dup.split(',', 1)
        sim2 = Simhash(long(sim2, 16), self.f)
        d = simhash.distance(sim2)
        if d <= self.k:
            ans.put((d, obj_id))
res = []
tmp = {}
while not ans.empty():
    d, obj_id = ans.get()
    if obj_id not in tmp:
        res.append(str(obj_id))
        tmp[obj_id] = 1
```
```python
text1 = '**'
text2 = '**人'
words1 = list(text1)
words2 = list(text2)
print(Simhash(words1).distance(Simhash(words2)))
```

The result is 14. It seems it doesn't work on Chinese?
Dear author,
Could you please delete this line: https://github.com/1e0ng/simhash/blob/master/simhash/__init__.py#L210? I think there is no need to print all the debug information.
```
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
in ()
----> 1 Simhash(s2)

C:\Users\Administrator\Anaconda2\lib\site-packages\simhash\__init__.pyc in __init__(self, value, f, reg, hashfunc)
     47             self.value = value.value
     48         elif isinstance(value, basestring):
---> 49             self.build_by_text(unicode(value))
     50         elif isinstance(value, collections.Iterable):
     51             self.build_by_features(value)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 4072: ordinal not in range(128)
```
I just installed simhash using pip for Python 3.4. There are a few minor issues that break when using Python 3. Would you be able to run through your code and update it for Python 3?
Hi,
I faced the problem of "ImportError: cannot import name 'Simhash'", but I believe I installed the package correctly...
#44 introduces a gmpy2 dependency, which causes major issues downstream. Error:

```
src/gmpy.h:252:20: fatal error: mpfr.h: No such file or directory
```

gmpy2 doesn't seem to be pip-installable, at least not on OSX.
How does a bigger bucket affect the results in the code? Does it affect anything other than running time?
This should be about 10x faster than the current implementation.
However, the value calculation would also need to change, since the highest bit of the array is on the left side.
```python
import time
from gmpy2 import popcount
import numpy as np

a = '1000111110100001010100011111111101111111000111010011110001110111'
b = '1111010101001010101001001101010001'

def get_value(v):
    ans = 0
    masks = [1 << i for i in range(64)]
    for i in range(64):
        if v[i] > 0:
            ans |= masks[i]
    return ans

a = '0' * (64 - len(a)) + a
b = '0' * (64 - len(b)) + b

def dis1(a, b):
    x = (a ^ b) & ((1 << 64) - 1)
    ans = 0
    while x:
        ans += 1
        x &= x - 1
    return ans

def dis2(a, b):
    return popcount(a ^ b)

a_np = np.array(list(map(lambda x: int(x), a)))
b_np = np.array(list(map(lambda x: int(x), b)))
a_value = get_value(a_np)
b_value = get_value(b_np)
a_int = int(a, 2)
b_int = int(b, 2)

s = time.time()
for i in range(1000000):
    d = dis1(a_value, b_value)
print(time.time() - s, d)

s = time.time()
for i in range(1000000):
    d = dis2(a_int, b_int)
print(time.time() - s, d)
```

output:

```
4.0435950756073 35
0.323089599609375 35
```
I get an error when running test_simhash.py:
```
➜ tests git:(master) python test_simhash.py
Traceback (most recent call last):
  File "test_simhash.py", line 68, in test_sparse_features
    shs.append(Simhash(features))
  File "build/bdist.macosx-10.9-intel/egg/simhash/__init__.py", line 41, in __init__
    self.build_by_features(value)
  File "build/bdist.macosx-10.9-intel/egg/simhash/__init__.py", line 63, in build_by_features
    hashs = [self.hashfunc(w.encode('utf-8')) for w in features]
AttributeError: 'tuple' object has no attribute 'encode'
```
It also seems the code being executed is not the one in the simhash directory, but the system-installed copy.
Hello, may I ask how you tokenize text in different languages?
My code:
```python
from simhash import Simhash, SimhashIndex

data = {
    1: 'How are you? I Am fine. blar blar blar blar blar Thanks.',
    2: 'How are you i am fine. blar blar blar blar blar than',
    3: 'This is simhash test.',
}
objs = [(str(k), Simhash(v)) for k, v in data.items()]
index = SimhashIndex(objs)
print(index.bucket_size())

# s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
s1 = Simhash('How are you i am fine. blar blar blar blar blar thank'.split())
print(index.get_near_dups(s1))
index.add('4', s1)
print(index.get_near_dups(s1))
```
I also tried:

```python
s1 = Simhash('How are you i am fine. blar blar blar blar blar thank'.split())
```
OUTPUT:

```
7
[]
['4']
```

Expected output:

```
7
['1']
['4', '1']
```
An explanation of the functions and their parameters is required.
The problem is as follows; thanks for any suggestions. Example code:
```python
print as_target
sh1 = Simhash(as_target)
print '%x' % Simhash(get_features(as_target)).value
print id(sh1)
print as_user
sh2 = Simhash(as_user)
print '%x' % Simhash(get_features(as_user)).value
print id(sh2)
print sh1.distance(sh2)
```
The stdout looks like:

```
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
415435ffb1fa93d4
4367884816
0000001000000010001100001100010010000000000000000000000010000010000000100000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001
415435ffb1fa93d4
4367884944
0
```
I use this module to hash Chinese articles, and I do not always get a 64-bit simhash code; sometimes it is 63 or 62 bits.
What can I do to make it always 64 bits?
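For what it's worth, my understanding is that the fingerprint is a plain Python integer, so `bin()` simply omits leading zero bits; the hash itself is still 64 bits wide. A quick sketch of a width-safe way to inspect it:

```python
# A fingerprint whose top bits happen to be zero: bin() shows fewer than 64 digits.
value = 0x086f24ba207a4912

assert len(bin(value)) - 2 < 64      # bin() drops the leading zeros
bits = format(value, '064b')         # zero-padded to a fixed 64-bit width
assert len(bits) == 64
```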
I use more than 10,000 texts, and generating features with it takes 8 minutes, which is theoretically too long. How can I solve this problem?
I'm just opening this for documentation / FYI, since I previously sent a pull request on optimizations (#48) -- feel free to close. :)
It is possible to calculate simhashes about 10 times as fast with a dependency on numpy. Here's a simple example:
```python
import numpy as np
import hashlib
import simhash

def hashfunc(obj):
    return hashlib.md5(obj.encode('utf8')).digest()[-8:]

def np_simhash(features):
    bytestring = b''.join(hashfunc(f) * w for f, w in features)
    bitarray = np.unpackbits(np.frombuffer(bytestring, dtype='>B'))
    rows = np.reshape(bitarray, (-1, 64))
    sums = np.sum(rows, 0)
    return int.from_bytes(np.packbits(sums > len(features) // 2).tobytes(), "big")

features = [("foo", 1), ("bar", 2)]
assert np_simhash(features) == simhash.Simhash(features).value
```
This runs about 10 times as fast in my testing (using randomly generated documents of a few hundred tokens, and also real-life text documents in 3-word shingles). With this implementation the underlying hash function becomes the bottleneck, instead of bit counting being the bottleneck as it currently is.
This might not be a good fit for this library, because of the dependency on numpy. My simple example would also need a little tweaking to keep the API of the current library; it might want to do the summing in batches to avoid using too much RAM; and it would need some sort of fallback for large weights, where `* w` is a bad idea. But I wanted to share in case the speedup is of interest.
Recently, I have been studying how to use the simhash algorithm to check for duplicates among a large number of .html files, to identify whether different websites are duplicates. But I observed that the simhash package you provide seems to only support passing text directly into the code to calculate the simhash value and determine similarity. I want to ask: how can I use your package to check for duplicates between different .html files?
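One way to read this: simhash operates on text (or on feature lists), so the HTML handling has to happen before hashing. A stdlib-only sketch that strips tags and scripts so the extracted text can then be fed to `Simhash` / `SimhashIndex` (the helper names are mine):

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>/<style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def html_to_text(markup):
    """Return the visible text of an HTML string, whitespace-normalized."""
    parser = TextExtractor()
    parser.feed(markup)
    return re.sub(r'\s+', ' ', ' '.join(parser.parts)).strip()

# Hypothetical usage with the library:
#   text = html_to_text(open('page.html', encoding='utf-8').read())
#   sh = Simhash(text)   # then index/compare as with any other text
```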
Well, in my case I use simhash to search short texts (Twitter-like): not keyword search, but exact text search.
When the dataset is about 200,000 items, everything works fine, and it takes 0.01s to find the exact short text. But when the dataset is about 20,000,000, there are so many "Big bucket found" warnings that it becomes slow and sometimes does not return.
I think there is something I can do. After checking some blogs, I raised k from 3 to 10, but that slows things down even on the small dataset. @leonsim recommends a customized hash function, but I don't understand why or how.
The following example
```python
from simhash import Simhash, SimhashIndex

data = {
    1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',
    2: u'How are you i am fine. blar blar blar blar blar than',
    3: u'This is simhash test.',
    4: u'How are you i am fine. blar blar blar blar blar thank1',
}
objs = [(str(k), Simhash(v)) for k, v in data.items()]
index = SimhashIndex(objs)
print map(lambda x: x[1].value, objs)
print index.bucket_size()
s1 = Simhash(u'How are you i am fine. blar blar blar blar blar thank')
print index.get_near_dups(s1)
index.add(5, s1)
print index.get_near_dups(s1)
```
produces an unexpected result:

```
[8440240427182201978, 8440240356449459322, 9984379969213434071L, 17663612459742043242L]
10
[]
[u'5']
```
Also, passing k=3 or another value into SimhashIndex produces:

```
File "build/bdist.linux-x86_64/egg/simhash/__init__.py", line 112, in get_near_dups
File "build/bdist.linux-x86_64/egg/simhash/__init__.py", line 56, in __init__
Exception: Bad parameter with type <type 'in
```
```
>>> a
u'\u0114\u0115\u0116\u0117\u0118\u0119\u011a\u0119\u011a\u0119\u011a\u0119\u011a\u0119\u011a\u011b\u011c\u0114\u0115\u0116\u0117\u0118\u0119\u011a\u0119\u011a\u0119\u011a\u0119\u011a\u011b\u011c\u0114\u0115\u0116\u0117\u0118\u0119\u011a\u0119\u011a\u0119\u011a\u0119\u011a\u0119\u011a\u011b\u011c\u0114\u0115\u0116\u0117\u0118\u0119\u011a\u0119\u011a\u0119\u011a\u011b\u011c\u011d\u011e\u0114'
>>> b
u"!#$%$%&'()*+,\xa9\xaa\xab\xac\xad\xae\xaf\xae\xaf\xb0\xb1\xb2<=>?"
>>> Simhash(a).distance(Simhash(b))
0
```
It would be great if the simhash module could use a custom logger. The advantage of this approach is that certain messages could be muted solely for this module, which is particularly useful when using several different modules that all do extensive logging.

```python
log = logging.getLogger("simhash")
log.debug("my logging message")
```
Hi, I appreciate the great work!
There are some things I'm confused about.
I'm dealing with Chinese HTML text, so I customized the `get_features` function. Specifically, I first extract the text content from the .html file, then use `jieba.analyse.extract_tags` to get the top-K keywords and their tf-idf weights.
Then I call `sh = Simhash(get_features(content)).value` to get the simhash fingerprint, and I call `len(bin(sh)[2:])` to check the dimension of the fingerprint (`[2:]` because there is always a '0b' prefix in front of the simhash bits).
But here comes the question: I found that the dimension of the simhash bits is not always 64 even though the default `self.f` is set to 64; it can be 61, 62, 63, or 64 as far as I have seen.
Have you ever encountered this problem before? I really wonder why. Am I using the right method to check the dimension of the simhash bits? How could the dimension vary instead of always being 64? @1e0ng
Thanks!
Hi,
There is no doubt that this project helped me a lot when I tried to deduplicate 130,000 wiki docs. However, it gives me a headache when I try to re-run the simhash building process, even with multiprocessing. Does anyone plan to add serialization and deserialization support in order to save time?
It seems that ZODB may be a proper backend tool. (I will certainly give it a try.)
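Until something official exists, one cheap workaround is to persist only `(obj_id, fingerprint)` pairs and rebuild the index from them, since hashing the raw text is the expensive part. A sketch (assuming the `Simhash` constructor accepts an integer fingerprint, which it appears to):

```python
import pickle

# The expensive part is hashing the documents; the fingerprints themselves
# are just 64-bit ints and serialize trivially.
pairs = [('doc-1', 0x86f24ba207a4912), ('doc-2', 0x415435ffb1fa93d4)]

blob = pickle.dumps(pairs)        # write this to disk / a database
restored = pickle.loads(blob)
assert restored == pairs

# Hypothetical rebuild, with no re-hashing of text:
#   objs = [(obj_id, Simhash(value)) for obj_id, value in restored]
#   index = SimhashIndex(objs, k=3)
```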
Python 2.7.14 64bit
x86_64 Linux
AMD64 Windows
Try:

```python
print Simhash(4390059585430954713).value
```

I encounter an error:

```
Traceback (most recent call last):
  File "/home/mypc/test_simhash.py", line 4, in <module>
    print Simhash(4390059585430954713).value
  File "/home/mypc/.local/lib/python2.7/site-packages/simhash/__init__.py", line 58, in __init__
    raise Exception('Bad parameter with type {}'.format(type(value)))
Exception: Bad parameter with type <type 'int'>
```
Then I found that Python's sys.maxint is 9223372036854775807 on Linux but 2147483647 on Windows, even though the Windows system is 64-bit. I examined the code and found `isinstance(value, long)` at line 60 in simhash/__init__.py, so I had to add `isinstance(value, int)` to it: 4390059585430954713 is lower than 9223372036854775808 but greater than 2147483647, which is why `Simhash(4390059585430954713)` executed unsuccessfully on Linux but successfully on Windows. I don't know whether this change is right, because I have learned only a little about the simhash algorithm, so I am posting this issue to make sure.
Thanks for your efforts!
Simhash is returning the same hash value for different strings.
The strings may contain the same words, but the word counts and string lengths are different.
A small reproducible sample is given below:
```python
from simhash import Simhash

def get_simhash_value(input: str):
    simhash_value = Simhash(value=input).value
    return simhash_value

res1 = get_simhash_value("test test test test test test test test test test")
res2 = get_simhash_value("test test test test test test test")
print(res1)
print(res2)
```
The original text content is two HTML documents after removing all tags and scripts.
Thanks for your effort implementing the algorithm, first of all!
When storing hashes in a database, I can obviously only extract strings from it.
Is there any way to convert them back to the Simhash type (e.g. to measure distance)?
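If the stored strings are just the fingerprint (e.g. written out with `'%x' % sh.value`), converting back should only require parsing the integer; the constructor appears to accept an integer fingerprint directly in recent versions. A small sketch:

```python
stored = '86f24ba207a4912'        # hex string as it came out of the database
value = int(stored, 16)           # back to the 64-bit integer fingerprint
assert '%x' % value == stored     # round-trips cleanly

# Hypothetical usage with the library:
#   sh = Simhash(value)
#   sh.distance(other_simhash)
```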
How can I effectively suppress the "Big bucket found" message? On larger files I frequently see thousands of them in the output, which makes it difficult to review the results.
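If the warning goes through the standard `logging` module, raising the threshold should mute it; whether it is emitted on the root logger or a named one depends on the library version, so the logger name below is an assumption:

```python
import logging

# Mute WARNING-level messages (such as 'Big bucket found') from a named
# logger; the name 'simhash' is an assumption -- adjust it, or target the
# root logger, depending on where the library actually logs.
logging.getLogger('simhash').setLevel(logging.ERROR)

# A blunter option that silences WARNING and below everywhere:
logging.disable(logging.WARNING)
assert not logging.getLogger('simhash').isEnabledFor(logging.WARNING)
```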
I am using this package to generate simhashes for my text. I have found that different lists of tokens may generate the same hash value, which is confusing to me.
```python
from simhash import Simhash

x = [('恋爱', 2.61855744598), ('脱单', 0.29491400873), ('闪婚', 0.29491400873)]
y = [('恋爱', 3.92783616897), ('结婚', 3.282195995975)]
print(Simhash(x).value)
print(Simhash(y).value)
```
The last two print statements print the same hash value, 12369878657857125584. I am not sure why this happens. Is this the intended behavior?
Error info:

```
Traceback (most recent call last):
  File "test.py", line 2, in <module>
    from simhash import Simhash
  File "/usr/lib/python2.6/site-packages/simhash/__init__.py", line 69
    features = {k:sum(1 for _ in g) for k, g in groupby(sorted(features))}
                                    ^
SyntaxError: invalid syntax
```
This is my code:

```python
from simhash import Simhash
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

#print(Simhash('哎,江明也才发现,这边页面里面隐藏跳转都有好多页面,全部很满,基本不能共用,又没设计图,周三做完不可能了,做出来多半也是废的会重做').value)
#print(Simhash("this is test demo wewr").value)
```
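For context: dict comprehensions only exist from Python 2.7 onward, which is why Python 2.6 rejects that line. A 2.6-compatible spelling of the same counting logic, using illustrative data:

```python
from itertools import groupby

features = ['a', 'b', 'a', 'c', 'b', 'a']

# Python 2.7+ form (what the library uses):
#   {k: sum(1 for _ in g) for k, g in groupby(sorted(features))}
# Python 2.6-compatible equivalent, via dict() over a generator:
counts = dict((k, sum(1 for _ in g)) for k, g in groupby(sorted(features)))
assert counts == {'a': 3, 'b': 2, 'c': 1}
```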
The following link: https://leons.im/posts/a-python-implementation-of-simhash-algorithm/ gives me a security warning in Chrome with `NET::ERR_CERT_DATE_INVALID`.
Hello, thank you for your beautiful library!
In Python 3.9+ (corrected: 3.10!) the `collections` module no longer contains `Iterable`. It has been deprecated there since Python 3.3, and `collections.abc.Iterable` is suggested instead.
It is still used in the code, though:
Line 75 in 8987a06
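A version-tolerant import along the lines suggested:

```python
try:
    from collections.abc import Iterable   # Python 3.3+
except ImportError:                        # very old interpreters
    from collections import Iterable

assert isinstance([], Iterable)
assert not isinstance(42, Iterable)
```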
Please replace the broken reference to a PDF file in `SimhashIndex.offsets`.
Hi,
I've read your implementation of finding near-duplicates, but something doesn't seem clear to me, and I hoped you could clarify it. You use a dictionary to store a concatenation of a simhash chunk and its index as keys, and a concatenation of the full simhash and the simhash id as values.

Keys:

```python
yield '%x:%x' % (c, i)
```

Values:

```python
for key in self.get_keys(simhash):
    v = '%x,%s' % (simhash.value, obj_id)
    self.bucket[key].add(v)
```

Since dictionaries don't allow repeated keys (a repeated insert updates the value), if two simhashes have the same value for the same chunk at the same permutation index (for whatever reason), are we not losing one of the simhashes when building the SimhashIndex object, and therefore missing a possible duplicate?
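For what it's worth, the quoted snippet uses `self.bucket[key].add(v)`, which suggests each key maps to a *set* of entries rather than a single value; if so, colliding keys accumulate instead of overwriting. A quick sketch of that behavior (the key and entry strings are illustrative):

```python
from collections import defaultdict

bucket = defaultdict(set)

# Two different simhashes land on the same chunk:index key:
bucket['86f2:0'].add('86f24ba207a4912,doc-1')
bucket['86f2:0'].add('86f2deadbeef123,doc-2')

# Both entries survive; nothing is overwritten.
assert len(bucket['86f2:0']) == 2
```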
```
[root@test simhash]# py
Python 3.5.4 (default, Jul 10 2018, 23:55:37)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from simhash import Simhash
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'Simhash'
>>> import simhash
>>> sh = simhash.Simhash()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'simhash' has no attribute 'Simhash'
```