Comments (3)
Some rough ideas on how to implement this:
Keep the same L2Lsh Mapping structure. This seems to be strictly a query-time optimization.
Add a parameter probes: Int
to the L2Lsh query structure, controlling how many additional hashes of length k
(k is called M in the original paper..) are generated for each of the L
tables.
Hashing function will generate the additional probes
hashes right after it generates the standard hash.
Roughly:
for l in range(L):
generate the original hash and do some bookkeeping for below..
for p in range(probes):
generate additional hash
Otherwise the query looks exactly the same. It takes the generated hashes and goes on to look them up the same as any other approx. query.
Implement the naive version first (enumerate, score, sort all possible perturbations). Make sure that you can get equivalent recall on SIFT with fewer tables and ideally but not necessarily shorter query time. Then go back and implement the optimized version which precomputes perturbation sets, estimates the scores, etc. Make sure the optimized matches the naive.
from elastiknn.
Implemented the naive version in #123.
As far I can tell, the naivety is not a bottleneck in the current benchmarking configuration (k = 2, probes = 3
).
In this configuration the overwhelming bottleneck is (understandably) still the countMatches
method.
The only thing I could find related to the multiprobe implementation was taking 0.1% of the runtime on search threads.
Even with k = 3, probes = 27
, any footprint from the hashWithProbes
method is minimal compared to countMatches
.
So for now there are likely more worthwhile things to implement than the optimized version.
from elastiknn.
🎆 🥳 🎈
Implemented Algorithm 1 from Qin et. al. in #124. This generates the perturbation sets iteratively instead of generating them exhaustively and sorting all of them. I'd say this is good enough or now. There doesn't seem to be a need to implement the expected scores optimization from section 4.5, and it's also not obvious to me why/how it works.
from elastiknn.
Related Issues (20)
- Cross-build for Elasticsearch 7.x and 8.x HOT 11
- Stop publishing Scala and Java libraries
- Migrate to Scala 3
- JAVA api
- RecallSuite tests are extremely slow in Github Actions HOT 2
- Adding elastiknn as an extension in the Elastic cloud fails with releases 8.4.2.1 and 8.4.3.0 HOT 4
- Migrate documentation site to github pages HOT 1
- Integrate with Coveralls for test coverage
- Try PyLucene for ann-benchmarks implementation
- Upgrade ann-benchmarks to 8.6.2 (or latest)
- Try Vectors from Project Panama for vector similarity computations HOT 1
- Plugin [.installing-18148280304972249747] is missing a descriptor properties file HOT 1
- Run benchmarks in Github Actions on a standalone EC2 instance HOT 1
- Try vectors from Project Panama for LSH operations HOT 3
- can't create a mapping HOT 1
- Try quick select algorithm for KthGreatest implementation HOT 4
- Try resampling vectors to speed up L2LshModel
- Try getting rid of HashAndFreq to minimize allocations HOT 1
- Try re-using threadlocal arrays in ArrayHitCounter HOT 2
- Try caching the query vector's FloatVector segments when computing distance HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from elastiknn.