Comments (3)
Thank you for your interest in this project! It's really encouraging!
This is a very interesting issue. In my past research projects, when I needed to create MinHash LSH indexes for hundreds of millions of sets, I used another library in Golang. It is much more memory efficient.
However, I am very much interested in making this Python library memory efficient. One thing I learnt from using a (disk-based) database for storing the LSH index is that it is very slow. See https://github.com/ekzhu/go-sql-lsh. So I think we have two options:
- Optimize the Python code to make it more memory efficient. I am sure a constant factor improvement can be done, since the current code was not written with any optimization in mind.
- Utilize a storage engine to store the MinHash LSH data structure. The disk I/O can be a huge issue here. Perhaps this can be overcome by using distributed in-memory systems or smart caching.
I am curious, what is the memory efficiency of using the current MinHash LSH implementation? Approximately how many MinHashes per 1 GB of memory usage?
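For a rough sense of scale, here is a back-of-the-envelope sketch of the question above. It assumes each MinHash holds `num_perm` hash values of 8 bytes each (128 values is used as the default here); the helper name `minhashes_per_gib` is hypothetical. It counts only the raw hash values and ignores Python object overhead and the LSH hash tables, so the real number of MinHashes per GB will be lower.

```python
def minhashes_per_gib(num_perm=128, bytes_per_hash=8, gib=1):
    """Upper bound on how many MinHash signatures fit in `gib` GiB.

    Assumption: a signature is `num_perm` hash values of `bytes_per_hash`
    bytes each. Object overhead and index tables are not counted.
    """
    bytes_per_minhash = num_perm * bytes_per_hash
    return (gib * 1024 ** 3) // bytes_per_minhash


if __name__ == "__main__":
    for num_perm in (64, 128, 256):
        print(num_perm, minhashes_per_gib(num_perm=num_perm))
```

Measuring the actual process memory after inserting a known number of MinHashes would give the real figure; this only bounds it from above.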
Thanks so much for your response, @ekzhu, it's very helpful to see these other libraries!
I took a look through the Minhash and LSH classes last night looking for some memory optimizations and ended up getting quite an education! If you are aware of memory optimizations that can reduce storage requirements within an in-memory implementation, those would be great, but I certainly didn't see anything obvious.
Even with additional optimizations in place, I suppose for any arbitrarily sized project db storage would help. My main question now is whether this library is the place for the db hooks. Right now, I'm thinking the Minhash and LSH classes are beautifully clean, so it might make more sense for users to subclass the LSH class and modify the insert() and query() methods to suit their own situations.
For the moment, I think I'll go in that direction, rather than introduce any changes to this excellent API, so it should be safe to close this issue.
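A minimal sketch of that subclassing approach might look like the following. To stay self-contained it does not use the library itself: `ToyLSH` is a hypothetical stand-in for an in-memory banded LSH index, and `SQLiteLSH` is a hypothetical subclass that overrides insert() and query() to keep the buckets in SQLite. The table layout and all names are assumptions for illustration only.

```python
import sqlite3
from collections import defaultdict


class ToyLSH:
    """Hypothetical in-memory banded LSH index (stand-in, not the library's class)."""

    def __init__(self, bands, rows):
        self.bands = bands
        self.rows = rows
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        # Split the signature into `bands` chunks of `rows` values each.
        if len(signature) != self.bands * self.rows:
            raise ValueError("signature length must equal bands * rows")
        return [
            tuple(signature[i * self.rows:(i + 1) * self.rows])
            for i in range(self.bands)
        ]

    def insert(self, key, signature):
        for table, band in zip(self.tables, self._band_keys(signature)):
            table[band].add(key)

    def query(self, signature):
        # A key is a candidate if it shares at least one band with the query.
        candidates = set()
        for table, band in zip(self.tables, self._band_keys(signature)):
            candidates |= table.get(band, set())
        return candidates


class SQLiteLSH(ToyLSH):
    """Overrides insert() and query() so buckets live in SQLite, not in dicts."""

    def __init__(self, bands, rows, path=":memory:"):
        super().__init__(bands, rows)
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS buckets ("
            "band INTEGER, bucket TEXT, key TEXT, "
            "PRIMARY KEY (band, bucket, key))"
        )

    def insert(self, key, signature):
        records = [
            (i, repr(band), key)
            for i, band in enumerate(self._band_keys(signature))
        ]
        with self.conn:  # commits on success
            self.conn.executemany(
                "INSERT OR IGNORE INTO buckets VALUES (?, ?, ?)", records
            )

    def query(self, signature):
        candidates = set()
        for i, band in enumerate(self._band_keys(signature)):
            cur = self.conn.execute(
                "SELECT key FROM buckets WHERE band = ? AND bucket = ?",
                (i, repr(band)),
            )
            candidates.update(row[0] for row in cur)
        return candidates
```

Callers use `SQLiteLSH` exactly like the in-memory class, which is the appeal of keeping the storage change behind insert() and query(). The trade-off is the disk I/O cost mentioned earlier in the thread.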
Thank you for the kind words! In fact, @fpug is working on a new MinHash class, LeanMinHash, that reduces memory usage and provides faster serialization than the current MinHash class. This should be helpful when you want to deal with external storage.
If the external storage solution is generic enough, it may be a good addition to this library.
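To illustrate why a compact, fixed-width serialization helps with external storage, here is a small sketch using only the standard library. It is not LeanMinHash's actual format: the byte layout and the helper names `pack_signature` and `unpack_signature` are assumptions made up for this example.

```python
import struct

# Assumed header layout for this sketch: little-endian int64 seed,
# then a uint32 count of hash values. No padding is added with "<".
_HEADER = "<qI"


def pack_signature(seed, hashvalues):
    """Pack a seed and unsigned 64-bit hash values into a bytes object."""
    fmt = "%s%dQ" % (_HEADER, len(hashvalues))
    return struct.pack(fmt, seed, len(hashvalues), *hashvalues)


def unpack_signature(buf):
    """Inverse of pack_signature: returns (seed, list_of_hashvalues)."""
    seed, count = struct.unpack_from(_HEADER, buf, 0)
    values = struct.unpack_from("<%dQ" % count, buf, struct.calcsize(_HEADER))
    return seed, list(values)
```

A signature packed this way has a predictable size (header plus 8 bytes per value), so it can be written straight into a database blob column or a key-value store without pickling a whole Python object.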