
Comments (8)

duhaime commented on August 15, 2024

If you want to find all matches, you can do something like what you're saying (if I understand you).

Create an empty dictionary that will map hashbands (sequences of minhashes) to input identifiers. For each input object, store a unique id and generate the minhashes. Pass a sliding window over the minhashes and combine each sequence of n minhash values into a hashband (this concatenates several minhashes into one object). Then, for each hashband you get, add your input object's id to the list of ids in which that hashband occurs.

This gives you a mapping from hashbands to lists of ids, where ids that occur in the same list share a hashband (i.e. are estimated to have some Jaccard similarity). I'm doing exactly this with billions of hashbands on some supercompute clusters and it works quite nicely.
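The steps above can be sketched in plain Python. This is only an illustration under assumptions, not datasketch's implementation: the `shingles`, `minhash`, `hashbands`, `build_index` and `candidate_pairs` helpers are hypothetical names, the minhash is a simple stand-in built on seeded `hashlib.blake2b` hashes, and the window start is stored with each hashband so that values from different positions in the signature are never compared.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 16   # minhash values per signature (assumed, kept small for the sketch)
BAND_SIZE = 4   # n: number of consecutive minhash values joined into one hashband


def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(items, num_perm=NUM_PERM):
    """Stand-in minhash: the minimum of a seeded 64-bit hash over all items, per seed."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def hashbands(signature, band_size=BAND_SIZE):
    """Slide a window over the signature; each window becomes one hashable hashband."""
    return [
        (start, tuple(signature[start:start + band_size]))
        for start in range(len(signature) - band_size + 1)
    ]


def build_index(docs):
    """Map each hashband to the list of ids whose signature contains it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for band in hashbands(minhash(shingles(text))):
            index[band].append(doc_id)
    return index


def candidate_pairs(index):
    """Ids that share at least one hashband are candidate near-duplicates."""
    pairs = set()
    for ids in index.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",
    "c": "completely unrelated sentence about redis backends and python processes",
}
print(candidate_pairs(build_index(docs)))
```

Only ids and hashbands are kept in the dictionary; the texts are needed once, to compute the signatures. A larger `BAND_SIZE` makes a shared hashband less likely, so fewer but more similar candidates come out.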

Perhaps this is not what you are trying to articulate though...

from datasketch.

ae-foster commented on August 15, 2024

Hi @bung87 do you think you could give a bit more detail about your question? What's your exact use case?


bung87 commented on August 15, 2024

My question is this. There are two types of data:

articles, which may be stored in Redis
the LSH index from create_lsh, which may use the Redis backend

Should I store both the articles and the LSH's minhashes in Redis?

In production I also need to store the key of each article.

This is what I am trying to do:

from collections import defaultdict

# split_articles, create_minhashs and create_lsh are helpers not shown here.
def duplicates(content, debug=False):
    if isinstance(content, list):
        articles = content
    else:
        articles = split_articles(content)
    minhashs = create_minhashs(articles, debug=debug)
    result = defaultdict(list)

    matched = []
    lsh = create_lsh(minhashs, threshold=0.75, num_perm=128, debug=debug)
    # Walk the minhashes from the last index down, querying the LSH for each one.
    while minhashs:
        cur_index = len(minhashs) - 1
        cur = minhashs.pop()
        if not cur or cur_index in matched:
            continue
        # debug and print("creat lsh done %s" % cur_index)
        r = lsh.query(cur)
        if r:
            for idx in r:
                # Skip the article itself and anything already grouped.
                if idx != cur_index and idx not in matched and idx not in result:
                    result[cur_index].append(idx)
                    matched.append(idx)
    return result


ae-foster commented on August 15, 2024

@bung87 I'm still a bit unsure what your objective is. In general though, don't use the Redis backend unless you need to. If you can solve your problem within a Python process that will be the way to go!


bung87 commented on August 15, 2024

Hmm, actually I just want to keep **only one copy of the data (keys and hashes)**. Since datasketch only provides a query interface, to find duplicates I must first build a MinHash, which needs the text as input. So in this case I must feed the LSH with minhashes and also store the text dataset for each query.

What I need is what this project does, and it keeps just one copy of the data:
https://github.com/mattilyra/LSH/blob/master/examples/Introduction.ipynb

In production that may not be a problem (in my case). I just want to make sure whether there is a clear way to achieve this.

Thank you!


duhaime commented on August 15, 2024

@bung87 I for one am still not clear what you wish to accomplish. What makes you think multiple copies of data are being stored? What data exactly is being stored multiple times and where is it being stored?


bung87 commented on August 15, 2024

I think only keys and minhashes are needed to find duplicates, but for now I also need to store the text in order to create the MinHash that the query function requires.
With Redis as the backend, it stores the hashes (one copy of the data), and since query needs a MinHash, which in turn needs the text, the text is stored as well (a second copy).
The way I think it could work is lsh -> insert minhashes -> find_duplicates, with no more queries.
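A minimal sketch of that insert-then-find_duplicates flow, assuming the minhash signatures have already been computed (for example, taken from the `hashvalues` of a datasketch MinHash) so that no text needs to be kept. `SignatureIndex` and its methods are hypothetical names, not part of datasketch, and the banding here uses non-overlapping bands of a fixed size.

```python
from collections import defaultdict
from itertools import combinations


class SignatureIndex:
    """Hypothetical helper: holds keys and minhash signatures only, never the text."""

    def __init__(self, band_size=4):
        self.band_size = band_size
        self.signatures = {}             # key -> tuple of minhash values
        self.buckets = defaultdict(set)  # (band start, band values) -> keys

    def insert(self, key, signature):
        """Store the signature and file the key under each of its bands."""
        sig = tuple(signature)
        self.signatures[key] = sig
        for start in range(0, len(sig) - self.band_size + 1, self.band_size):
            band = (start, sig[start:start + self.band_size])
            self.buckets[band].add(key)

    def estimated_jaccard(self, a, b):
        """Fraction of signature positions where the two keys agree."""
        sig_a, sig_b = self.signatures[a], self.signatures[b]
        return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

    def find_duplicates(self, threshold=0.75):
        """Return sorted (key, key, score) triples for pairs that share a band
        and whose estimated Jaccard similarity reaches the threshold."""
        seen = set()
        result = []
        for keys in self.buckets.values():
            for a, b in combinations(sorted(keys), 2):
                if (a, b) in seen:
                    continue
                seen.add((a, b))
                score = self.estimated_jaccard(a, b)
                if score >= threshold:
                    result.append((a, b, score))
        return sorted(result)


index = SignatureIndex(band_size=2)
index.insert("article-1", [1, 2, 3, 4])
index.insert("article-2", [1, 2, 3, 9])
index.insert("article-3", [5, 6, 7, 8])
print(index.find_duplicates(threshold=0.75))
```

The toy signatures are made up for the example. Keys must be sortable (strings or integers) because pairs are normalized by sorting, and all signatures are assumed to have the same length.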


bung87 commented on August 15, 2024

Thanks, I do exactly what you say. I put my code above and it works. I opened this issue just to make sure I am not missing a simple, clean way.

