Comments (7)
I believe this implementation uses 5 bits per count (See: https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/RegisterSet.java#L22). This means that to store 2p counts, you do need 5*(2p)/8 bytes -- however, it looks like RegisterSet uses (2p)/6 integers. This is equivalent to 4*(2p)/6 bytes, meaning 5/8 bytes and 4/6 bytes per count respectively.
You will note that 4/6 is a larger amount, and is even larger than strictly needed. I believe this is due to an implementation detail where the last two bits of an integer are left wasted. However, that should not cause a correctness issue in and of itself.
Your example failing case may still be valid for a different reason though? I am having a little trouble following the context exactly.. maybe you can include a failing test case?
from stream-lib.
The failing test is simple, feed the estimator with 2p synthetic hashes containing 64-p bits set to 0 and the full set of 2p prefixes, with p < 32, the result should be Long.MAX_VALUE, but since only 5 bits are used for keeping track of the count, the cardinality is off.
from stream-lib.
Increasing the bits per count has a non-trivial effect on memory footprint, and makes the choice between hll and hllp a little less straight forward. Additionally, I have doubts about the possibility of a real use case involving cardinalities on the order of Long.MAX_VALUE. (iirc, the original paper even says cardinalities close to that amount would require additional corrections that they didn't bother with for that reason).
However, it does seem like it would be easy enough to implement, and the hllp paper does suggest upgrading to 6 in general. It would have to be specific to hllp though (hll does not want more than 5), and ideally would not have a negative impact on existing use cases.
from stream-lib.
Google did end up needing 6 bits. I'm not aware of use cases outside of
that which required 6 bits. Should we make it a config param?
On Sun, Mar 15, 2015, 2:33 PM Ian Barfield [email protected] wrote:
Increasing the bits per count has a non-trivial affect on memory
footprint, and makes the choice between hll and hllp a little less straight
forward. Additionally, I have doubts about the possibility of a real use
case involving cardinalities on the order of Long.MAX_VALUE. (iirc, the
original paper even says cardinalities close to that amount would require
additional corrections that they didn't bother with for that reason).However, it does seem like it would be easy enough to implement, and the
hllp paper does suggest upgrading to 6 in general. It would have to be
specific to hllp though (hll does not want more than 5), and ideally would
not have a negative impact on existing use cases.—
Reply to this email directly or view it on GitHub
#88 (comment).
from stream-lib.
That is probably reasonable. We could also do a dynamic resizing if we wanted to get real fancy, but I'm not sure off hand how much of that value space is already handled by the sparse set rep.
In any event, the title/ initial post of this issue seems to no longer be representative of the continuing discussion, so maybe we should move this to a new issue called either "hllp should use six bits per register" or "better support for different register sizes" (depending on how you look at it).
from stream-lib.
+1
On Mon, Mar 16, 2015 at 12:30 PM Ian Barfield [email protected]
wrote:
That is probably reasonable. We could also do a dynamic resizing if we
wanted to get real fancy, but I'm not sure off hand how much of that value
space is already handled by the sparse set rep.In any event, the title/ initial post of this issue seems to no longer be
representative of the continuing discussion, so maybe we should move this
to a new issue called either "hllp should use six bits per register" or
"better support for different register sizes" (depending on how you look at
it).—
Reply to this email directly or view it on GitHub
#88 (comment).
from stream-lib.
closed in favor of #89
lmk if this does not seem to address your concerns
from stream-lib.
Related Issues (20)
- Implement LogLog-Beta/β (new cardinality estimation algo) HOT 1
- Implement LogLog-Beta HOT 1
- Explanation needed on StreamSummary's top-k results HOT 2
- Evaluate new HLL implementation against our current HLL++
- p10 > p50, p90 > p99 for a simple digest HOT 9
- Generic type for Count-Min Sketch
- Question: What's the purpose of the `size` field in CountMinSketch?
- CountMinSketch overflow check causes a lot of allocations HOT 4
- maybe there is an error in RegisterSet.java HOT 1
- Counter/StreamSummary class can't deserialize non-primitive type HOT 1
- 3.0 released? HOT 5
- Provide information on tuning/size parameters
- the question of com.clearspring.analytics.stream.SampleSet
- Perhaps a mistake in HyperLogLog HOT 1
- how does maxFalsePosProbability work HOT 2
- Sources published to maven central for 2.9.6 are for test-jar HOT 2
- HyperLogLogPlusPlus sparse precision 32 accuracy problem
- Deploy latest version to maven-central HOT 2
- CountMinSketch with Int type in the internal table
- BloomFilter doesn't work when amount of bits > 2^32 HOT 4
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from stream-lib.