Giter VIP home page Giter VIP logo

Comments (7)

tea-dragon avatar tea-dragon commented on June 11, 2024

I believe this implementation uses 5 bits per count (See: https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/RegisterSet.java#L22). This means that to store 2p counts, you do need 5*(2p)/8 bytes -- however, it looks like RegisterSet uses (2p)/6 integers. This is equivalent to 4*(2p)/6 bytes, meaning 5/8 bytes and 4/6 bytes per count respectively.

You will note that 4/6 is a larger amount, and is even larger than strictly needed. I believe this is due to an implementation detail where the last two bits of an integer are left wasted. However, that should not cause a correctness issue in and of itself.

Your example failing case may still be valid for a different reason though? I am having a little trouble following the context exactly.. maybe you can include a failing test case?

from stream-lib.

hbs avatar hbs commented on June 11, 2024

The failing test is simple, feed the estimator with 2p synthetic hashes containing 64-p bits set to 0 and the full set of 2p prefixes, with p < 32, the result should be Long.MAX_VALUE, but since only 5 bits are used for keeping track of the count, the cardinality is off.

from stream-lib.

tea-dragon avatar tea-dragon commented on June 11, 2024

Increasing the bits per count has a non-trivial effect on memory footprint, and makes the choice between hll and hllp a little less straight forward. Additionally, I have doubts about the possibility of a real use case involving cardinalities on the order of Long.MAX_VALUE. (iirc, the original paper even says cardinalities close to that amount would require additional corrections that they didn't bother with for that reason).

However, it does seem like it would be easy enough to implement, and the hllp paper does suggest upgrading to 6 in general. It would have to be specific to hllp though (hll does not want more than 5), and ideally would not have a negative impact on existing use cases.

from stream-lib.

abramsm avatar abramsm commented on June 11, 2024

Google did end up needing 6 bits. I'm not aware of use cases outside of
that which required 6 bits. Should we make it a config param?

On Sun, Mar 15, 2015, 2:33 PM Ian Barfield [email protected] wrote:

Increasing the bits per count has a non-trivial affect on memory
footprint, and makes the choice between hll and hllp a little less straight
forward. Additionally, I have doubts about the possibility of a real use
case involving cardinalities on the order of Long.MAX_VALUE. (iirc, the
original paper even says cardinalities close to that amount would require
additional corrections that they didn't bother with for that reason).

However, it does seem like it would be easy enough to implement, and the
hllp paper does suggest upgrading to 6 in general. It would have to be
specific to hllp though (hll does not want more than 5), and ideally would
not have a negative impact on existing use cases.


Reply to this email directly or view it on GitHub
#88 (comment).

from stream-lib.

tea-dragon avatar tea-dragon commented on June 11, 2024

That is probably reasonable. We could also do a dynamic resizing if we wanted to get real fancy, but I'm not sure off hand how much of that value space is already handled by the sparse set rep.

In any event, the title/ initial post of this issue seems to no longer be representative of the continuing discussion, so maybe we should move this to a new issue called either "hllp should use six bits per register" or "better support for different register sizes" (depending on how you look at it).

from stream-lib.

abramsm avatar abramsm commented on June 11, 2024

+1

On Mon, Mar 16, 2015 at 12:30 PM Ian Barfield [email protected]
wrote:

That is probably reasonable. We could also do a dynamic resizing if we
wanted to get real fancy, but I'm not sure off hand how much of that value
space is already handled by the sparse set rep.

In any event, the title/ initial post of this issue seems to no longer be
representative of the continuing discussion, so maybe we should move this
to a new issue called either "hllp should use six bits per register" or
"better support for different register sizes" (depending on how you look at
it).


Reply to this email directly or view it on GitHub
#88 (comment).

from stream-lib.

tea-dragon avatar tea-dragon commented on June 11, 2024

closed in favor of #89

lmk if this does not seem to address your concerns

from stream-lib.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.