Comments (3)

ekzhu commented on July 17, 2024

Thank you for your interest in this project! It's really encouraging!

This is a very interesting issue. In my past research projects, when I needed to create MinHash LSH indexes for hundreds of millions of sets, I used another library written in Golang, which is much more memory efficient.

However, I am very much interested in making this Python library memory efficient. One thing I learnt from using a (disk-based) database to store the LSH index is that it is very slow. See https://github.com/ekzhu/go-sql-lsh. So I think we have two options:

  1. Optimize the Python code to make it more memory efficient. I am sure a constant factor improvement can be done, since the current code was not written with any optimization in mind.
  2. Utilize a storage engine to store the MinHash LSH data structure. The disk I/O can be a huge issue here. Perhaps this can be overcome by using distributed in-memory systems or smart caching.

I am curious: how memory efficient is the current MinHash LSH implementation? Approximately how many MinHashes fit in 1 GB of memory?
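As a rough frame for that question, here is a back-of-the-envelope sketch of the lower bound on memory per signature. It assumes 128 permutations and 64-bit hash values per MinHash; both numbers are assumptions for illustration, and the function names are hypothetical. It counts only the raw hash values, so the real figure will be lower once Python object overhead and the LSH hash tables are included.

```python
def raw_signature_bytes(num_perm=128, bytes_per_hash=8):
    """Bytes needed for the hash values of one MinHash signature alone."""
    return num_perm * bytes_per_hash


def signatures_per_gib(num_perm=128, bytes_per_hash=8):
    """Upper bound on signatures per GiB, ignoring all container overhead."""
    return (1024 ** 3) // raw_signature_bytes(num_perm, bytes_per_hash)


print(raw_signature_bytes())  # 1024
print(signatures_per_gib())   # 1048576
```

Under these assumptions, about one million signatures per GiB is a ceiling, not a measurement. An actual number would have to come from profiling a populated index.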

from datasketch.

duhaime commented on July 17, 2024

Thanks so much for your response, @ekzhu, it's very helpful to see these other libraries!

I took a look through the MinHash and LSH classes last night, looking for memory optimizations, and ended up getting quite an education! If you are aware of optimizations that can reduce storage requirements within an in-memory implementation, those would be great, but I certainly didn't see anything obvious.

Even with additional optimizations in place, I suppose database storage would help for any arbitrarily large project. My main question now is whether this library is the place for the database hooks. Right now, I'm thinking the MinHash and LSH classes are beautifully clean, so it might make more sense for users to subclass the LSH class and override the insert() and query() methods to suit their own situations.

For the moment, I think I'll go in that direction, rather than introduce any changes to this excellent API, so it should be safe to close this issue.
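To make the subclassing idea concrete, here is a minimal, self-contained sketch. It does not use datasketch's classes: `signature`, `ToyLSH`, and `SqliteLSH` are hypothetical stand-ins written only for illustration, and SQLite is just one possible backing store. The subclass overrides insert() and query() so that buckets live in a database table instead of a Python dict.

```python
import hashlib
import sqlite3
from collections import defaultdict


def signature(tokens, num_perm=32):
    """Toy MinHash: for each seed, keep the minimum salted hash over all tokens."""
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        )
        for seed in range(num_perm)
    ]


class ToyLSH:
    """Banded LSH index that keeps its buckets in an in-memory dict."""

    def __init__(self, num_perm=32, bands=8):
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.bands = bands
        self.rows = num_perm // bands
        self.buckets = defaultdict(set)

    def _band_keys(self, sig):
        # One bucket key per band, built from that band's slice of the signature.
        for b in range(self.bands):
            chunk = sig[b * self.rows:(b + 1) * self.rows]
            yield f"{b}:" + ",".join(map(str, chunk))

    def insert(self, key, sig):
        for bucket in self._band_keys(sig):
            self.buckets[bucket].add(key)

    def query(self, sig):
        found = set()
        for bucket in self._band_keys(sig):
            found |= self.buckets.get(bucket, set())
        return found


class SqliteLSH(ToyLSH):
    """Same banding logic, but insert() and query() go to a SQLite table."""

    def __init__(self, path=":memory:", **kwargs):
        super().__init__(**kwargs)
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS buckets "
            "(bucket TEXT, key TEXT, PRIMARY KEY (bucket, key))"
        )

    def insert(self, key, sig):
        rows = [(bucket, key) for bucket in self._band_keys(sig)]
        self.db.executemany("INSERT OR IGNORE INTO buckets VALUES (?, ?)", rows)
        self.db.commit()

    def query(self, sig):
        found = set()
        for bucket in self._band_keys(sig):
            cur = self.db.execute(
                "SELECT key FROM buckets WHERE bucket = ?", (bucket,)
            )
            found.update(row[0] for row in cur)
        return found


lsh = SqliteLSH()
sig_a = signature({"minhash", "lsh", "index"})
lsh.insert("doc-a", sig_a)
print(lsh.query(sig_a))  # {'doc-a'}
```

Keeping the banding logic in the base class and swapping only the storage calls means the in-memory and database-backed versions return the same candidates for the same signatures. The per-band round trips to the database are exactly where the disk I/O cost mentioned earlier in the thread would show up.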

ekzhu commented on July 17, 2024

Thank you for the kind words! In fact, @fpug is working on a new MinHash class, LeanMinHash, that reduces memory usage and provides faster serialization than the current MinHash class. This should be helpful when you want to work with external storage.

If the external storage solution is generic enough, it may be a good addition to this library.
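To illustrate why a compact, fixed-layout serialization helps with external storage, here is a small sketch using only the standard library. This is not LeanMinHash's actual format: the layout (a 64-bit seed, a 32-bit count, then 64-bit hash values) and the function names `pack_signature` and `unpack_signature` are assumptions made for the example.

```python
import struct

HEADER = "<qi"  # little-endian: 64-bit seed, 32-bit number of hash values


def pack_signature(seed, hashvalues):
    """Pack a seed and its 64-bit hash values into one flat byte string."""
    return struct.pack(
        f"{HEADER}{len(hashvalues)}Q", seed, len(hashvalues), *hashvalues
    )


def unpack_signature(buf):
    """Inverse of pack_signature: return (seed, list of hash values)."""
    seed, count = struct.unpack_from(HEADER, buf, 0)
    values = struct.unpack_from(f"<{count}Q", buf, struct.calcsize(HEADER))
    return seed, list(values)


blob = pack_signature(1, list(range(128)))
print(len(blob))  # 1036
print(unpack_signature(blob) == (1, list(range(128))))  # True
```

A flat byte string like this can be written directly to a key-value store or a database BLOB column, and its size is predictable: 12 header bytes plus 8 bytes per hash value.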
