Comments (4)
On the different top-k query results: this is expected. Because the methods are all approximate rather than exact, results may not be 100% accurate. The common workflow for top-k with LSH Forest is to first retrieve the "top-r" results from the index, re-rank those r results by exact similarity (or by similarity estimated from the MinHash signatures), and then output the top k. Here r is larger than k (e.g. 10x larger).
For MinHashLSH, you cannot directly control the number of results returned. When your result sets are large, you may want to use a higher threshold (e.g. 0.9). Having multiple LSH indexes optimized for a series of descending thresholds can help too.
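A minimal sketch of the top-r then top-k workflow, assuming the original sets are still available for exact re-ranking. The names `top_k_reranked`, `sets_by_key`, and `oversample` are hypothetical and not part of datasketch; `forest` is expected to be a `MinHashLSHForest` on which `index()` has already been called, though any object with a compatible `query(minhash, r)` method would work.

```python
def exact_jaccard(a, b):
    """Exact Jaccard similarity between two Python sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def top_k_reranked(forest, query_minhash, query_set, sets_by_key, k, oversample=10):
    """Retrieve r = k * oversample candidates, re-rank them exactly, keep k.

    `sets_by_key` is a hypothetical dict mapping each indexed key to its
    original set, so that exact similarity can be computed.
    """
    r = k * oversample
    candidates = forest.query(query_minhash, r)
    scored = [(exact_jaccard(query_set, sets_by_key[key]), key) for key in candidates]
    # Highest similarity first; ties are broken by key so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], str(pair[1])))
    return [(key, score) for score, key in scored[:k]]


def demo():
    """Small end-to-end run; requires datasketch to be installed."""
    from datasketch import MinHash, MinHashLSHForest

    def to_minhash(tokens, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for token in tokens:
            m.update(token.encode("utf8"))
        return m

    docs = {
        "d1": {"minhash", "lsh", "forest", "top", "k"},
        "d2": {"minhash", "lsh", "threshold"},
        "d3": {"unrelated", "words", "only"},
    }
    forest = MinHashLSHForest(num_perm=128)
    for key, tokens in docs.items():
        forest.add(key, to_minhash(tokens))
    forest.index()  # the forest is not searchable until index() is called

    query = {"minhash", "lsh", "forest"}
    return top_k_reranked(forest, to_minhash(query), query, docs, k=2)
```

If the original sets are not kept around, the re-ranking step could instead use the estimate from `MinHash.jaccard` on stored signatures, at the cost of some accuracy.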
from datasketch.
Thanks for your suggestions! You mentioned that having multiple LSH indexes optimized for a series of descending thresholds can help too. Can you give me some more guidance on how I can achieve this? Thanks!
Nothing fancy. What I meant is just some version of the following:
from datasketch import MinHashLSH

# One index per threshold, keyed by that threshold.
lshes = {
    0.9: MinHashLSH(threshold=0.9),
    0.8: MinHashLSH(threshold=0.8),
    0.7: MinHashLSH(threshold=0.7),
    0.6: MinHashLSH(threshold=0.6),
    0.5: MinHashLSH(threshold=0.5),
}
So your query processing can start with the highest threshold, try the lshes in descending order of thresholds, until you get enough query results.
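The query loop described above could look roughly like the sketch below. `query_with_cascade`, `build_cascade`, and `min_results` are hypothetical names used for illustration, not datasketch APIs; the only library calls assumed are `MinHashLSH(threshold=..., num_perm=...)`, `insert(key, minhash)`, and `query(minhash)`.

```python
def query_with_cascade(lshes, query_minhash, min_results):
    """Query indexes from highest to lowest threshold until enough keys are found.

    Returns (key, threshold) pairs, where threshold is the strictest index
    in which the key first appeared.
    """
    found = []
    seen = set()
    for threshold in sorted(lshes, reverse=True):
        for key in lshes[threshold].query(query_minhash):
            if key not in seen:
                seen.add(key)
                found.append((key, threshold))
        if len(found) >= min_results:
            break  # enough results; skip the looser indexes
    return found


def build_cascade(minhashes_by_key, thresholds=(0.9, 0.8, 0.7, 0.6, 0.5), num_perm=128):
    """Build one MinHashLSH per threshold; requires datasketch to be installed.

    `minhashes_by_key` is a hypothetical dict of key -> MinHash, where every
    MinHash was created with the same num_perm as the indexes.
    """
    from datasketch import MinHashLSH

    lshes = {t: MinHashLSH(threshold=t, num_perm=num_perm) for t in thresholds}
    for key, minhash in minhashes_by_key.items():
        for lsh in lshes.values():
            lsh.insert(key, minhash)  # every key goes into every index
    return lshes
```

Because every key is inserted into every index, memory use grows with the number of thresholds, so a shorter list of thresholds may be a reasonable trade-off for large collections.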
OK, I understand. Thanks!