I want to find duplicates, so I need to pop a hash and then compare it against the remaining hashes.


how to restore minhashes from MinHashLSH with redis backend? about datasketch HOT 8 CLOSED

ekzhu commented on August 15, 2024

how to restore minhashes from MinHashLSH with redis backend?

from datasketch.

Comments (8)

duhaime commented on August 15, 2024

If you want to find all matches, you can do something like what you're saying (if I understand you).

Create an empty dictionary that will map hashbands (sequences of minhashes) to input identifiers. For each input object, store a unique id and generate the minhashes. Pass a sliding window over the minhashes and combine each sequence of n minhash values into a hashband (this concatenates several minhashes into one object). Then, for each hashband you get, add your input object's id to the list of ids in which that hashband occurs.

This gives you a mapping from hashbands to lists of ids, where ids that occur in the same list share a hashband (i.e., are estimated to have some Jaccard similarity). I'm doing exactly this with billions of hashbands on some supercompute clusters and it works quite nicely.

Perhaps this is not what you are trying to articulate though...
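The hashband mapping described above can be sketched in plain Python. This is a minimal illustration rather than datasketch's implementation: the function names, the `blake2b`-based toy MinHash, the whitespace tokenization, and the choice of non-overlapping windows (step equal to the band size) are all assumptions made for the example. The window's start position is stored with each band so that bands only match at the same position in the signature.

```python
import hashlib
from collections import defaultdict


def minhash_signature(tokens, num_perm=128):
    """Toy MinHash (assumption): one salted hash per permutation, keep the minimum."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(tok.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for tok in tokens
        ))
    return signature


def hashbands(signature, band_size=4, step=None):
    """Slide a window over the signature and yield (start, band) keys."""
    step = band_size if step is None else step  # assumption: non-overlapping by default
    for start in range(0, len(signature) - band_size + 1, step):
        yield (start, tuple(signature[start:start + band_size]))


def build_band_index(docs, num_perm=128, band_size=4):
    """Map each hashband to the list of ids whose signature contains it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        signature = minhash_signature(set(text.split()), num_perm)
        for band in hashbands(signature, band_size):
            index[band].append(doc_id)
    return index


def candidate_groups(index):
    """Ids that share at least one hashband are candidate near-duplicates."""
    return {tuple(ids) for ids in index.values() if len(ids) > 1}
```

Ids that end up in the same list are only candidates; a smaller `band_size` makes matches more likely and a larger one makes them stricter.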

ae-foster commented on August 15, 2024

Hi @bung87 do you think you could give a bit more detail about your question? What's your exact use case?

bung87 commented on August 15, 2024

My question is this. There are two kinds of data:

the articles, which may be stored in Redis
the LSH index built by create_lsh, which may use the Redis backend

Should I store both the articles and the LSH's minhashes in Redis?

In production I also need to store the key of each article.

This is what I am trying to do:

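from collections import defaultdict

def duplicates(content, debug=False):
    # split_articles, create_minhashs and create_lsh are helpers defined elsewhere.
    # Accept either a list of articles or raw text that still has to be split.
    if isinstance(content, list):
        articles = content
    else:
        articles = split_articles(content)
    minhashs = create_minhashs(articles, debug=debug)
    result = defaultdict(list)

    matched = set()  # indices already assigned to a duplicate group
    lsh = create_lsh(minhashs, threshold=0.75, num_perm=128, debug=debug)
    while minhashs:
        # pop() removes the last item, so its index is the current length minus one
        cur_index = len(minhashs) - 1
        cur = minhashs.pop()
        if not cur or cur_index in matched:
            continue
        # keys returned by the LSH are assumed to be the article indices
        for idx in lsh.query(cur):
            if idx != cur_index and idx not in matched and idx not in result:
                result[cur_index].append(idx)
                matched.add(idx)
    return result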

ae-foster commented on August 15, 2024

@bung87 I'm still a bit unsure what your objective is. In general though, don't use the Redis backend unless you need to. If you can solve your problem within a Python process that will be the way to go!

bung87 commented on August 15, 2024

Hmm... actually I just want to keep only one copy of the data (keys and hashes). Since datasketch only provides a query interface, to find duplicates I must first build a MinHash, which needs the text as input. So in this case I must feed the LSH with minhashes and also keep the text dataset around for each query.

What I need is what this project does, and it keeps just one copy of the data.
https://github.com/mattilyra/LSH/blob/master/examples/Introduction.ipynb

In production that may not be a problem (in my case). I just want to make sure whether there is a clear way to achieve this.

Thank you!

duhaime commented on August 15, 2024

@bung87 I for one am still not clear what you wish to accomplish. What makes you think multiple copies of data are being stored? What data exactly is being stored multiple times and where is it being stored?

bung87 commented on August 15, 2024

I think only keys and minhashes are needed to find duplicates, but for now I also need to store the text in order to create the MinHash that the query function requires.
If Redis is used as the backend, it stores the hashes (one copy of the data), while a query needs a MinHash, which in turn needs the text (a second copy of the data).
The way I imagine it could work is: lsh -> insert minhashes -> find_duplicates, with no further queries.
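As an illustration of the "keys and minhashes only" idea: once a signature is stored under its key, the text is no longer needed, because Jaccard similarity can be estimated as the fraction of positions where two signatures agree. The sketch below is plain Python, not datasketch's API. The names `estimate_jaccard` and `find_duplicates` are hypothetical, the 0.75 threshold is taken from the code earlier in the thread, and the comparison is all-pairs, which an LSH index would normally narrow down to candidates first.

```python
from itertools import combinations


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the share of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def find_duplicates(signatures, threshold=0.75):
    """signatures: dict mapping key -> stored signature (a sequence of ints).

    Returns (key_a, key_b, estimate) for every pair at or above the threshold.
    Only keys and signatures are read; the original text is never touched.
    """
    pairs = []
    for (key_a, sig_a), (key_b, sig_b) in combinations(signatures.items(), 2):
        estimate = estimate_jaccard(sig_a, sig_b)
        if estimate >= threshold:
            pairs.append((key_a, key_b, estimate))
    return pairs
```

The all-pairs loop is quadratic in the number of keys, so for large collections it would only be run on the candidates produced by an index.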

bung87 commented on August 15, 2024

Thanks, I am doing exactly what you describe. I posted my code above and it works; I opened this issue just to make sure I am not missing a simple, clean way.
