
stream-lib's People

Contributors

abramsm, addthis-buildbot, andrewbts, b4hand, bobpoekert, cburroughs, codahale, cykl, danglotb, dave2718, david-wobrock, emopers, epollan, huitseeker, jiayangodc, jkff, kujon, lukasnalezenec, michaelklishin, mythguided, oertl, rxin, seancarr, ssserj, strongh, tdunning, tea-dragon, tedpearson, yuesong, yukim


stream-lib's Issues

count min sketch test bugs

For instance, on line 119 a <String, Long> map has an important get call made with an Integer. I think I made some attempt at fixing the weirdness I found when I wrote the conservative-add sketch tests, but that was a while ago, and the fixes are probably not ideal.
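
To illustrate the kind of bug this is (hypothetical names, just plain Java semantics): Map.get takes an Object, so a lookup with an Integer key on a <String, Long> map compiles fine but can never match a String key and silently returns null.

    Map<String, Long> expectedCounts = new HashMap<String, Long>();
    expectedCounts.put("42", 7L);

    Long wrong = expectedCounts.get(42);    // always null: Integer key in a String-keyed map
    Long right = expectedCounts.get("42");  // 7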

HyperLogLog returns negative cardinality()

Hi, I am trying out HyperLogLog and this simple Groovy snippet is giving me negative cardinality() results (although the negation of the returned value is quite accurate)

HyperLogLog hll = new HyperLogLog(12)

for (long i=0; i<2e8; ++i) {
    hll.offer(i)
}
log.info("hll estimate of 1e8 elements {}", hll.cardinality())

outputs

13:58:22.914 [main] INFO  misc.HyperLogLogTest - hll estimate of 2e8 elements -208378698

Something wrong here? Let me know if you can't reproduce. I am using 'com.clearspring.analytics:stream:2.1.1' from maven.

HyperLogLog builder with log2m

It would be nice to have the option to use log2m or rsd when using the HyperLogLog builder.

e.g.

HyperLogLog.builder().withLog2m(10).build()
HyperLogLog.builder().withRsd(0.01).build()

This would be useful, for example, with CountThenEstimate.
Currently, passing an int by mistake is not obvious:

new HyperLogLog.Builder(10).build() //wrong

Additionally, an incorrect rsd is only detected after tipping in CountThenEstimate instead of at construction.
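
A sketch of what such a builder could look like (hypothetical API; withLog2m and withRsd do not exist in the library today, and the rsd-to-log2m conversion shown is an assumption based on rsd ≈ 1.106 / sqrt(2^log2m)):

    public static class Builder {

        private Integer log2m;

        public Builder withLog2m(int log2m) {
            this.log2m = log2m;
            return this;
        }

        public Builder withRsd(double rsd) {
            // assumed conversion: rsd ≈ 1.106 / sqrt(2^log2m)
            this.log2m = (int) (Math.log((1.106 / rsd) * (1.106 / rsd)) / Math.log(2));
            return this;
        }

        public HyperLogLog build() {
            if (log2m == null) {
                throw new IllegalStateException("either log2m or rsd must be set");
            }
            return new HyperLogLog(log2m);
        }
    }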

2.9.1 HyperLogLogPlus does not count cardinality < 10000 with (p, sp) = (14, 32)

Run this code:

    HyperLogLogPlus hll = new HyperLogLogPlus(14, 32);
    for (int i = 0; i < 10000; i++) {
        UUID uuid = UUID.randomUUID();
        hll.offer(uuid);
    }
    System.out.println("Cardinality = " + hll.cardinality());

In 2.9.0, the result is correct (9999 / 10000).

In 2.9.1, the result is 0.

It breaks at sp > 30

ConcurrentStreamSummary does not implement Space-Saving algorithm

The ConcurrentStreamSummary implementation does not appear to implement the Space-Saving algorithm. In particular it appears to try to keep track of the minimum value (and introduces a race condition in the process), and remove this if the capacity is reached.

Also, there appears to be a bug where the count of the previous minimum is added to the count of the newly added element. For example, the result of the following:

        ConcurrentStreamSummary<String> vs = new ConcurrentStreamSummary<String>(3);
        String[] stream = {"X", "X", "Y", "Y", "Y", "Z", "Z", "Z", "B"};
        for (String i : stream) {
            vs.offer(i);
        }
        List<ScoredItem<String>> topK = vs.peekWithScores(3);

is the list of strings {"B", "Y", "Z"} (not {"X", "Y", "Z"}), because the score of "X" was added to that of "B" when "X" was removed.

QDigest implements neither Serializable nor Externalizable

Hi guys,

I was wondering if the fact that QDigest implements neither Serializable nor Externalizable is intentional?
I'm afraid it prevents QDigest from being serialized using ObjectOutputStream, which could be very useful (and, for better or worse, is still a quite common way to serialize things).

Unless I'm missing something, it would be pretty straightforward to fix, since QDigest already provides serialize and deserialize methods; they just need to be exposed through the standard Java serialization interfaces.
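
In the meantime, something like this helper works for me (a sketch that assumes QDigest's static serialize(QDigest) and deserialize(byte[]) methods; QDigestIO itself is hypothetical):

    public final class QDigestIO {

        private QDigestIO() {
        }

        public static void write(QDigest digest, ObjectOutputStream out) throws IOException {
            byte[] bytes = QDigest.serialize(digest);   // reuse the existing byte[] form
            out.writeInt(bytes.length);
            out.write(bytes);
        }

        public static QDigest read(ObjectInputStream in) throws IOException {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            return QDigest.deserialize(bytes);
        }
    }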

What do you think?
If it makes sense, I'd be more than willing to create a pull request.

-Stas

What does merge() in AdaptiveCounting do?

I expected that merging two AdaptiveCounting objects would add the cardinalities but that did not happen. The output of the following was:

A: 1000
B: 1000
C: 1000

I expected that C would have a cardinality of 2000. I realize that I can simply add the two cardinalities (A and B) but I wanted to understand what merge does.

Here is the code that I am using:

    ICardinality estimatorA = new AdaptiveCounting(16);
    for (long i = 0; i < 1000; i++) {
        estimatorA.offer(i);
    }
    System.out.println("A: " + estimatorA.cardinality());

    ICardinality estimatorB = new AdaptiveCounting(16);
    for (long i = 0; i < 1000; i++) {
        estimatorB.offer(i);
    }
    System.out.println("B: " + estimatorB.cardinality());

    ICardinality estimatorC = estimatorA.merge(estimatorB);
    System.out.println("C: " + estimatorC.cardinality());

Thanks.
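
For reference, merge combines the underlying registers (a union-style operation) rather than adding counts, so two estimators fed the same 1000 values still merge to roughly 1000. A variant of the snippet above with disjoint ranges, same API otherwise, should report about 2000 for C:

    ICardinality estimatorA = new AdaptiveCounting(16);
    for (long i = 0; i < 1000; i++) {
        estimatorA.offer(i);              // values 0..999
    }

    ICardinality estimatorB = new AdaptiveCounting(16);
    for (long i = 1000; i < 2000; i++) {
        estimatorB.offer(i);              // values 1000..1999, disjoint from A
    }

    ICardinality estimatorC = estimatorA.merge(estimatorB);
    System.out.println("C: " + estimatorC.cardinality());   // expect roughly 2000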

Merge two TDigests throws NullPointerException

Sometimes, when I merged two TDigests by calling the add method of one TDigest, it threw a NullPointerException as follows:
java.lang.NullPointerException
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:85)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:79)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:79)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:79)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:79)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:79)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:81)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:81)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:81)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:79)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:81)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:79)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:81)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:81)
at com.clearspring.analytics.stream.quantile.TDigest.add(TDigest.java:146)
at com.clearspring.analytics.stream.quantile.TDigest.add(TDigest.java:171)

maybe there is an error in RegisterSet.java

Line 52 of RegisterSet.java, in the getSizeForCount function, is else if (bits % Integer.SIZE == 0):

public static int getSizeForCount(int count) {
    int bits = getBits(count);
    if (bits == 0) {
        return 1;
    } else if (bits % Integer.SIZE == 0) {
        return bits;
    } else {
        return bits + 1;
    }
}

The parameter count is the number of buckets, and RegisterSet.REGISTER_SIZE, the number of buckets per int, is 5. The call int bits = getBits(count) returns the length of the int array needed for the buckets. I think the check on line 52 should be if (count % REGISTER_SIZE == 0): when that remainder is zero, bits is already the required length of the int array.
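
Expressed as code, the proposed change would look roughly like this (an untested sketch of the suggestion above, not a verified fix):

    public static int getSizeForCount(int count) {
        int bits = getBits(count);
        if (bits == 0) {
            return 1;
        } else if (count % RegisterSet.REGISTER_SIZE == 0) {
            return bits;
        } else {
            return bits + 1;
        }
    }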

Argument checks in HyperLogLogPlus constructor need to be more restrictive

The HyperLogLogPlus implementation only supports parameters p and sp for which 4 <= p <= sp <= 25. However, there is only a check for 4 <= p <= sp <= 32. Here is some test code with sp=26 that demonstrates the problem:

        HyperLogLogPlus hll = new HyperLogLogPlus(25, 26);

        hll.offerHashed(0X7FFFFF8000000000L);
        hll.offerHashed(0XFFFFFF8000000000L); 

        assertEquals(2, hll.cardinality()); // fails, hll.cardinality() returns 1 

The expected cardinality is 2, because the index (first 25 bits) as well as the sparse index (first 26 bits) are different.

Possibly out of range

I'm using the com.clearspring.analytics.stream.quantile.QDigest class to approximate quantiles over roughly 100k data points, and some of the values can approach the int64 maximum, which is above the range QDigest accepts. I found this when running on Amazon EMR:

Caused by: java.lang.IllegalArgumentException: Can only accept values in the range 0..4611686018427387903, got 9223372036854775807
    at com.clearspring.analytics.stream.quantile.QDigest.offer(QDigest.java:125)
    at com.liveramp.cascading_ext.combiner.lib.QuantileExactAggregator.partialAggregate(QuantileExactAggregator.java:38)
    at com.liveramp.cascading_ext.combiner.lib.QuantileExactAggregator.partialAggregate(QuantileExactAggregator.java:17)
    at com.liveramp.cascading_ext.combiner.CombinerFunctionContext.combineAndEvict(CombinerFunctionContext.java:130)
    at com.liveramp.cascading_ext.combiner.CombinerFunction.operate(CombinerFunction.java:130)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
    ... 11 more

I suppose this is because, although the offer method parameter is declared as a long, only values up to 2^62 - 1 are accepted. Is there any workaround for this?
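
One possible workaround, sketched under the assumption that only the relative ordering of the values matters for the quantiles you need (digest here is an existing QDigest instance): shift values into the accepted range before offering them, and scale query results back.

    long raw = Long.MAX_VALUE;          // example value outside QDigest's 0..2^62-1 range
    long adjusted = raw >>> 1;          // halve it: ordering preserved, one low-order bit lost
    digest.offer(adjusted);

    // When reading quantiles back, undo the scaling (approximately).
    long approxMedian = digest.getQuantile(0.5) << 1;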

CountMinSketch overflow check causes a lot of allocations

It starts with String.valueOf(item) here:

checkSizeAfterAdd(String.valueOf(item), count);

then with String concatenation here:

checkSizeAfterOperation(previousSize, "add(" + item + "," + count + ")", size);

and here:

checkSizeAfterOperation(previousSize, "merge(" + estimator + ")", size);

Could be related to #140
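
One possible shape for a fix, sketched only to illustrate the idea (not the library's actual code; the method below is a hypothetical variant that takes the raw operands): defer building the message string until the check actually fails, so the hot path allocates nothing.

    private void checkSizeAfterAdd(Object item, long count, long previousSize) {
        if (size >= previousSize) {
            return;  // common case: no String.valueOf and no concatenation
        }
        throw new IllegalStateException(
                "Overflow detected after add(" + String.valueOf(item) + ", " + count + ")");
    }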

Implement LogLog-Beta

A new paper released this month introduces a new cardinality estimation algorithm called LogLog-Beta/β:

https://arxiv.org/abs/1612.02284

"The new algorithm uses only one formula and needs no additional bias
corrections for the entire range of cardinalities, therefore, it is more
efficient and simpler to implement. Our simulations show that the accuracy
provided by the new algorithm is as good as or better than the accuracy
provided by either of HyperLogLog or HyperLogLog++."
Some comments about its accuracy (graphs included) can be found in this PR.
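
For reference, the estimator in the paper has roughly the following shape (a sketch based on the paper's formula, not code from this library; alphaM and the b[] coefficients are fitted constants that depend on the register count m):

    static double loglogBeta(int[] registers, double alphaM, double[] b) {
        int m = registers.length;
        double z = 0;                        // number of zero-valued registers
        double harmonicSum = 0;              // sum of 2^-M[j] over all registers
        for (int reg : registers) {
            if (reg == 0) {
                z++;
            }
            harmonicSum += 1.0 / (1L << reg);
        }
        double lnZ = Math.log(z + 1);
        double beta = b[0] * z;              // beta(z): polynomial in z and ln(z + 1)
        for (int i = 1; i < b.length; i++) {
            beta += b[i] * Math.pow(lnZ, i);
        }
        return alphaM * m * (m - z) / (beta + harmonicSum);
    }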

HyperLogLogPlus merge introduces error when counters overlap

When merging two HyperLogLogPlus counters that have a large intersection of their underlying sets, it is possible to introduce a large amount of error into the cardinality estimate. The cause is the check on whether the sizes of the two sparse lists sum to more than the sparseThreshold. If the two lists share many elements, their actual merge is a list much smaller than that sum, so a sparse counter can be promoted to a normal counter long before it should be. This both wastes space and, if it happens too early, can cause large errors in cardinality estimation (I have seen up to 30% in tests) because the bias estimation curves have no samples for cardinalities that low.

Forcing merges to always merge sparse lists before checking size helps, but does not completely eliminate the problem. I believe this is due to problems in the merge implementation that I will open another issue for.

The easiest way to reproduce this error is to produce two counters with identical elements at just over 1/2 the sparseThreshold, then merge them.
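
A repro sketch along those lines (run inside a method that declares CardinalityMergeException; the element count is a rough guess at just over half the sparse threshold for p = 14, not a value taken from the code):

    HyperLogLogPlus a = new HyperLogLogPlus(14, 25);
    HyperLogLogPlus b = new HyperLogLogPlus(14, 25);
    for (int i = 0; i < 6500; i++) {
        String item = "item-" + i;
        a.offer(item);
        b.offer(item);   // both counters hold exactly the same elements
    }
    HyperLogLogPlus merged = (HyperLogLogPlus) a.merge(b);
    // True cardinality is 6500; the size-sum check can push the merged counter
    // into normal mode prematurely and inflate the estimation error.
    System.out.println(merged.cardinality());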

Generic type for Count-Min Sketch

Hi guys,

First of all, thank you for the great work you put into this library!

I am using the Count-Min Sketch implementation and I wonder why the only supported item types are long and String.
Indeed, there are types of objects such as Double or Integer that convert to String easily and unambiguously.
For objects in general, the toString() method could be called, as with the MurmurHash used in HyperLogLog.
Is there a specific reason you decided not to do this for CountMinSketch? A small helper like the one below would already cover my case.
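
(A sketch; addObject is a hypothetical helper on top of the existing add(String, long) overload, not part of the library.)

    static void addObject(CountMinSketch sketch, Object item, long count) {
        // Fall back to toString() and reuse the existing String code path.
        sketch.add(item.toString(), count);
    }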

Best regards,

Arnaud

Implementing `equals`, `hashCode` and `toString` for CountMinSketch

I am interested in adding these 3 methods to CountMinSketch:

  • #equals()
  • #hashCode()
  • #toString()

I can create a pull request if you are interested...

The code would be something like this:

    @Override
    public String toString() {
        return "CountMinSketch{" +
                "relativeError=" + eps +
                ", confidence=" + confidence +
                ", depth=" + depth +
                ", width=" + width +
//                ", table=" + Arrays.toString(table) +
//                ", hashA=" + Arrays.toString(hashA) +
                ", size=" + size +
                '}';
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }

        final CountMinSketch that = (CountMinSketch) o;

        if (depth != that.depth) {
            return false;
        }
        if (width != that.width) {
            return false;
        }

        if (Double.compare(that.eps, eps) != 0) {
            return false;
        }
        if (Double.compare(that.confidence, confidence) != 0) {
            return false;
        }

        if (size != that.size) {
            return false;
        }

        if (!Arrays.deepEquals(table, that.table)) {
            return false;
        }
        return Arrays.equals(hashA, that.hashA);
    }

    @Override
    public int hashCode() {
        int result;
        long temp;
        result = depth;
        result = 31 * result + width;
        result = 31 * result + Arrays.deepHashCode(table);
        result = 31 * result + Arrays.hashCode(hashA);
        result = 31 * result + (int) (size ^ (size >>> 32));
        temp = Double.doubleToLongBits(eps);
        result = 31 * result + (int) (temp ^ (temp >>> 32));
        temp = Double.doubleToLongBits(confidence);
        result = 31 * result + (int) (temp ^ (temp >>> 32));
        return result;
    }

Typos, documentation and parameters

Hi guys,

thanks for such a nice collection of algorithms, I've been playing a bit with and stumbled over a few things I would like to clarify:

  • HyperLogLog:151,161 seems to have a redundant "+ 1", which perhaps appeared on copy-paste from LogLog; suggested correction:
final int r = Long.numberOfLeadingZeros((hashedValue << this.log2m) | (1 << (this.log2m - 1)) + 1);
  • TestLogLog uses 1.04 as the error estimate parameter (beta), while other sources, e.g. http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining and the original Flajolet papers, use 1.30. Which is correct?
  • Furthermore, TestLogLog asserts the estimated value to be within 2 * se of the actual value, while TestHyperLogLog checks whether it is within 3 * se. Is there a good reason for this?
  • HyperLogLog:94,106 mentions 1.106 vs 1.04 -- which is correct? (Googling 1.106 led me to the beta-32 bound given in Flajolet's 2007 paper, while 1.04 corresponds to beta-infinity.)
  • CountThenEstimate:217 -- perhaps it should be LogLog(bytes) instead of LinearCounting(bytes)?

Thanks and cheers,
Simon

HLL++: large values of p lead to AIOOBE

The HLL++ paper describe the following preconditions for p and p':

"The input data set S, the precision p, the precision p' used in the sparse representation where p ∈ [4..p'] and p'≤64"

However, bias-correction data is only provided for p in [4..18]. To estimate the cardinality when E < 5 * m, the current implementation performs a lookup in the rawEstimateData table as follows:

        double[] estimateVector = rawEstimateData[p - 4];

The issue is that when p > 22 and E < 5 * m, an ArrayIndexOutOfBoundsException is thrown. I'm not familiar with large precisions and not sure what to do:

  • Forbid p > 18 since we don't have accurate bias correction tables
  • Use the 18 table for all p >= 18
  • Don't use the bias correction algorithm for large p

The issue can easily be reproduced:

HyperLogLogPlus hllp = new HyperLogLogPlus(19);
hllp.cardinality();

Do you have any idea about this one? TestHyperLogLogPlus.selfMerge uses large values of p, up to 20, and the Javadoc says p can be up to p' == 64.

HyperLogLogPlus memory / sparse-to-normal transition issue

Hi,

We noticed this when we were playing around with serialization. If we have a HyperLogLogPlus object A and either (a) merge it with another one B into a third HLL++ object C, or (b) ask A for its cardinality frequently while offering it a stream of items, the memory usage of the original object A seems to grow linearly well beyond the point where it should have switched over to the "normal" mode of operation (instead of "sparse" mode). We believe we have narrowed this behavior down to the following. When an HLL++ object is asked for its cardinality or used as the input to merge, the mergeTempList method is called and the temp list's contents are moved into the sparseSet. However, the offer method only checks whether to switch to normal mode when the tmpSet exceeds one threshold, and only then checks whether the sparseSet exceeds a different threshold:

case SPARSE:
    // Call the sparse encoding scheme which attempts to stuff as much
    // helpful data into 32 bits as possible
    int k = encodeHash(x, p, sp);
    // Put the encoded data into the temp set
    tmpSet[tmpIndex++] = k;
    if (tmpIndex > sortThreshold) {
        mergeTempList();
        if (sparseSet != null && sparseSet.length > sparseSetThreshold) {
            convertToNormal();
        }
    }
    return true;

However, if we are calling cardinality() or merge() on this object, the tmpSet has already been emptied and never reaches the threshold needed to meet this condition, so the set is never converted to normal mode.

I didn't have time to try it out yet, but perhaps checking whether the sparseSet is above the sparseSetThreshold regardless of the state of the tmpSet, and using that to convert to normal mode, would fix this issue. Something like this:

case SPARSE:
    // Call the sparse encoding scheme which attempts to stuff as much
    // helpful data into 32 bits as possible
    int k = encodeHash(x, p, sp);
    // Put the encoded data into the temp set
    tmpSet[tmpIndex++] = k;
    if (tmpIndex > sortThreshold) {
        mergeTempList();
    }
    if (sparseSet != null && sparseSet.length > sparseSetThreshold) {
        mergeTempList();
        convertToNormal();
    }
    return true;

(I'm not sure whether the second mergeTempList is really needed, or whether the check could be folded into the previous if statement as an additional condition.)

Incorrectly handled edge cases in HyperLogLogPlus merge

The merging of the temporary list into the sparse list in HyperLogLogPlus assumes that, under normal integer ordering, each index will always appear in contiguous blocks in the sorted temp list. This is not the case; it is easy to construct a counterexample.

Suppose p = 11, sp = 20. We offer 3 objects to the HLLP that have the following hashes:

00000001010000000000 0^41 1 ...
00000001010000000000 0^31 1 ...
01010000000000011100 ...

The ellipses mean that I don't care about the bits to the right, and 0^n means n zeros.
The first two hashes have the same idx' and by assumption should be contiguous in the ordering (the consumeDuplicates function depends on this). However, if you encode each of these hashes you get (in the same order as above):

... 000000010100000000000101011 = 655403
... 000000010100000000001000001 = 655425
... 000000010100000000000111000 = 655416

Here the ... means that all higher-order bits are zeros. Notice that the third value sorts between the first and second, even though the first and second should appear together.

This can be avoided by using a custom comparator when sorting the temporary list that does a dictionary comparison on idx' and then on the encoded hash.
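
A sketch of such a comparator (getSparseIndex stands in for whatever extracts idx' from an encoded value; treat the exact method name as an assumption):

    Comparator<Integer> byIndexThenHash = new Comparator<Integer>() {
        @Override
        public int compare(Integer a, Integer b) {
            int idxA = getSparseIndex(a);
            int idxB = getSparseIndex(b);
            if (idxA != idxB) {
                return Integer.compare(idxA, idxB);   // group equal idx' values together
            }
            return Integer.compare(a, b);             // then fall back to the encoded hash
        }
    };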

CountMinSketch#size should be long everywhere in the code

This code shows that the CountMinSketch size() method does not work as expected.

        double confidence = 0.999;
        double epsilon = 0.0001;
        int seed = 1;

        CountMinSketch sketch = new CountMinSketch(epsilon, confidence, seed);
        sketch.add(0, Integer.MAX_VALUE);
        long expectedSize = Integer.MAX_VALUE;
        Assert.assertEquals(expectedSize, sketch.size());

        CountMinSketch newSketch = CountMinSketch.merge(sketch, sketch);
        // bug: the next line is failing
        Assert.assertEquals(2 * expectedSize, newSketch.size());

There are two places in the code where size is wrongly used as an int:
https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java#L68
https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java#L186

It should be long everywhere.

Happy to fix it and add some unit tests in a pull request.
Please let me know.

Explanation needed on StreamSummary's top-k results

Hello,
when running this test:

StreamSummary<String> vs = new StreamSummary<String>(3);
String[] stream = { "X", "X", "Y", "Z", "A", "B", "C", "X", "X", "A", "A", "A", "1", "2", "3", "4", "5", "6" };
for (String i : stream) {
    vs.offer(i);
}
System.out.println(vs);

I get this result: [{6:[{4:5},{5:5},{6:5}]}].

I wonder if this is normal behavior or a bug. I would suggest resetting counter.count to 0 in the StreamSummary#offerReturnAll method when a new item overrides the item with the fewest hits.

Thanks for your analysis. Best regards.

Varint bug

The check if (i > 63) is not correct; the bound should be 70.

The maximum number of bytes in a 64-bit varint is 10 (7 payload bits per byte, i.e. 70 bits in total).

CountMinSketch#size goes to overflow

This code (unit test) fails:

        CountMinSketch sketch = new CountMinSketch(0.0001, 0.99999, 1);
        sketch.add(3, Long.MAX_VALUE);
        long size1 = sketch.size();
        assertEquals(Long.MAX_VALUE, size1);

        sketch.add(4, 1);
        long size2 = sketch.size();
        assertTrue(size2 > size1);

I would check the size at construction and also after calling #add() or #merge():

  • the size must always be >= 0 (at construction and after any operation)
  • the size after an operation such as add or merge must be larger than the previous size
    If either condition is not satisfied (because of overflow), I would throw an exception.

Let me know if you agree.
If so, I could create a pull request with my solution.

RegisterSet incorrectly allocates the space needed to track counts

The int array that's allocated in RegisterSet does not contain enough entries to accommodate correct counts for a given value of p.

Each count takes 6 bits, meaning that to store 2^p counts you need 6*(2^p)/8 bytes, not 2^p/6.

The counts are therefore incorrect. This can be detected by feeding, for example, hashes from 0 to (2^p - 1) << (64 - p) in steps of 1 << (64 - p): the result should be 2^64 - 1, but it is 6194288074.

LogLog crashes on k=31

If LogLog is constructed with k = 31, an exception is thrown:
Exception in thread "main" java.lang.NegativeArraySizeException
at com.clearspring.analytics.stream.cardinality.LogLog.<init>(LogLog.java:76)

Actually, k can't be greater than 30, because (1 << 31) overflows to a negative int (Integer.MIN_VALUE).

Varint bug

The shift should be >>> otherwise it does sign extension, bug?
writeUnsignedVarLong((value << 1) ^ (value >> 63), out);

Implement LogLog-Beta/β (new cardinality estimation algo)

A new paper released this month introduces a new cardinality estimation algorithm called LogLog-Beta/β:

https://arxiv.org/abs/1612.02284

"The new algorithm uses only one formula and needs no additional bias
corrections for the entire range of cardinalities, therefore, it is more
efficient and simpler to implement. Our simulations show that the accuracy
provided by the new algorithm is as good as or better than the accuracy
provided by either of HyperLogLog or HyperLogLog++."
Some comments about its accuracy (graphs included) can be found in this PR: redis/redis#3677

use logging framework for tests

As noted by derekgr, the System.out spam has gotten a mite out of control. Finely detailed statements are great for debugging or maybe learning, but disabling them until needed is what logging frameworks are for. I suggest just a test-scoped slf4j simple logger.

how to serialize HLL using kryo

Hi, we are using HLL in a Storm project. We want to serialize the HLL object using Kryo and send it to another Bolt component for computation. But serialization fails when we try the following test code:

        Kryo kryo = new Kryo();
        kryo.setInstantiatorStrategy(new SerializingInstantiatorStrategy());
        kryo.register(HyperLogLogPlus.class);
        try {
            Output output = new Output(new FileOutputStream("/Users/Felix/Desktop/file.bin"));
            kryo.writeObject(output, card);
            output.close();

            Input input = new Input(new FileInputStream("/Users/Felix/Desktop/file.bin"));
            HyperLogLogPlus someObject = kryo.readObject(input, HyperLogLogPlus.class);
            System.out.println(someObject.cardinality());
            input.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }

We got following errors:

Exception in thread "main" com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): com.clearspring.analytics.stream.cardinality.HyperLogLog
	at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1050)
	at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1062)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:228)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:217)
	at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:626)
	at hyperloglog.TT.main(TT.java:45)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Does this mean that HLL cannot be serialized by Kryo, or is there a way to do it correctly?
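
One workaround to try (a sketch; it assumes HyperLogLogPlus.getBytes() and HyperLogLogPlus.Builder.build(byte[]) as the round-trip API) is a custom Kryo serializer that bypasses field-by-field reflection entirely:

    public class HyperLogLogPlusSerializer extends Serializer<HyperLogLogPlus> {

        @Override
        public void write(Kryo kryo, Output output, HyperLogLogPlus hll) {
            try {
                byte[] bytes = hll.getBytes();
                output.writeInt(bytes.length, true);
                output.writeBytes(bytes);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public HyperLogLogPlus read(Kryo kryo, Input input, Class<HyperLogLogPlus> type) {
            try {
                int length = input.readInt(true);
                return HyperLogLogPlus.Builder.build(input.readBytes(length));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    // Registered with: kryo.register(HyperLogLogPlus.class, new HyperLogLogPlusSerializer());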

Thank you very much.

HLL becomes incorrect if cardinality is over billions

@Override
public long cardinality() {
    double registerSum = 0;
    int count = registerSet.count;
    double zeros = 0.0;
    for (int j = 0; j < registerSet.count; j++) {
        int val = registerSet.get(j);
        registerSum += 1.0 / (1 << val);        // SHOULD BE -- 1L << val
        if (val == 0) {
            zeros++;
        }
    }
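
For context, this is plain Java shift semantics: with an int left operand the shift distance is taken modulo 32, so 1 << val stops giving the intended power of two once val reaches 31, while 1L << val keeps working:

    int val = 33;
    System.out.println(1 << val);    // prints 2: the int shift distance is val % 32
    System.out.println(1L << val);   // prints 8589934592: the intended 2^33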

HLL++ in sparse mode can be larger than in normal mode

Today I played a little bit with HLL++ in sparse mode. I have tens of millions of HLL estimators and most of them have a low cardinality. Using HLL or HLL++ in normal mode is not memory efficient for this use case. To be picky, I don't care that much about in-memory consumption; I am trying to minimize the serialized size of the estimators.

The whole idea behind sparse mode is to not waste memory on the normal representation when we can do better for small cardinalities. It sounds reasonable to switch back to the normal representation as soon as sparse mode consumes more memory.

However, I don't observe such behavior with HyperLogLogPlus:

HyperLogLog:

        HyperLogLog hll = new HyperLogLog(14);
        System.out.println(hll.getBytes().length);

=> 10932

HyperLogLogPlus in normal mode

        HyperLogLogPlus hllp = new HyperLogLogPlus(14);
        System.out.println(hllp.getBytes().length);

=> 10940 

Empty HyperLogLogPlus in sparse mode

        HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);
        System.out.println(hllp.getBytes().length);

=> 16

5K elements with HyperLogLogPlus in sparse mode

        Random r = new Random();

        HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);

        for (int i = 0; i < 5000; i++) {
            hllp.offer(Integer.toString(r.nextInt()));
        }

        System.out.println(hllp.getBytes().length);

=> 25495

According to the source code, the sparseSetThreshold depends only on p and is set to 12288 for p = 14. This means that if the set contains 12000 elements, I'm wasting almost 40 KB compared to the normal representation.

Am I wrong? Is this behavior expected?

My second question would be: do we really want to create the RegisterSet even when we are in sparse mode? It defeats the goal of being memory efficient. It currently doesn't matter for me, since my bottleneck is the serialization size, but I find it quite surprising.

p10 > p50, p90 > p99 for a simple digest

        TDigest tdigest = new TDigest(100);
        tdigest.add(0.18615591526031494);
        tdigest.add(0.4241943657398224);
        tdigest.add(0.8813006281852722);

        System.out.println("p10: " + tdigest.quantile(0.1));
        System.out.println("p50: " + tdigest.quantile(0.5));
        System.out.println("p90: " + tdigest.quantile(0.9));
        System.out.println("p95: " + tdigest.quantile(0.95));
        System.out.println("p99: " + tdigest.quantile(0.99));

The output doesn't look right:

p10: 0.35278283059597015
p50: 0.30517514050006866
p90: 1.018432506918907
p95: 0.9498665675520899
p99: 0.8950138160586358
  • p10 > p50
  • p90 > p95 > p99

Should I use https://github.com/tdunning/t-digest/blob/master/src/main/java/com/tdunning/math/stats/MergingDigest.java instead?

TDigest percentiles are off on the last centroid (and close by percentiles have inverted values)

I have a serialized TDigest in Base64:
AAAAAkBZAAAAAAAAAAABLkkzm5BGwGAARy0nAEYp0ABGlTYARLegAEU48ABHBuMARTgQAESsIABFSPAARDqAAEaLFgBEpKAARYYwAEQVgABE9mAARTdgAESuYABD3QAARiY4AEWVgABDMwAARTeAAEXByABEYcAARajAAEQwwABF5RgARShwAENWAABF/WAARMAgAEQxAABFGEAAQsoAAEUZMABF4wAARaVgAEV54ABFbuAARVYgAEQqQABGLPwARKngAEQwgABEhOAARIzgAER7gABEoaAARPOgAEXFoABFLkAARHWAAEVDwABE2WAAQ3kAAESfIABFdbAARX6gAESHAABELoAARXxQAETDgABDNQAARa5wAEU6oABFIkAARcg4AESlIABEpDAARcM0AEWy+ABFZlgARPSwAEVRcABE+aAAQmQAAESAgABF22wARZh4AESmUABEbkAAQ+yAAEQ24ABFb3AARdYoAEXyNABD70AARgjMq0WGQqtFxTgARQDAAEUBAABCNAAARRCQAEU/aABEy5AARPlAAETwkABFL1gARa0wAESiYABFR/AARJaVVUUmLVVF5uQARbnSq0Wc7VVFXVAARaTwAEWFtABFFpgARGJAAESQ0ABFMbAAQ08qqkW40qtFgIgARZYYAEWNUABFmOgARD2VVUWYM1VFyWYARWqiq0UGkABE3JqrRWFIAESeAABF+34ARUR0AETgwABElVAARIagAEUfcABFhZFVRKyqq0YD1qtFbPqqRMTdVUPG4ABFNOVVRMUqq0UNiqtE0AAARZe6q0XLUKtFaBAARDLQAEVwSqtFpQarRYb8AEXYSABFNlAARI5gAET+iABFgt4ARYlQAEVdkABEv3AARU5sAEWchgBFPHAARSGgAEW6iABFH/AARQFQAESnqABEZbAARNlwAETB5VVEpVVVRNoFVUUA2ABFVNVVRQ1qq0TMeABFTC6rRcuNVUTVVVVFX6AAROG1VUVlAABF2h1VRYf5VUUarVVFrKVVRMxAAETy9VVE6cqrRNtwAEXxbABFEeVVRTcwAEUOGqpEynVVRTU1VUWxWABEv0AARQ3YAEUN4qtGDBQAROLKq0VnIABFWcAARfqYAEWSaABFhSgAReyiq0XmhVVGEVFVRYohVUUOqABE72AARRvgAEYUigBFVTgARP8QAETH0ABGHsAARSs4AEUgSABFfLAARQ8gAEWXuABFTVAARl1AAEUBQABDuIAAQ6uAAEUDkABD/4AARbl4AEVbEABFEOAAReMIAEXSkABGL8AARW8wAEYHTABFFbAAQ0gAAEUBsABFt4AARY8wAEY0FABGDwwARJWAAEVIoABFSZAARZVoAEYOsABFNVAARgWwAEU5EABForgARSAQAEXb8ABFStAARaa4AEPTgABErMAARU8gAEQVwABE7CAARh8UAET1gABGN+AARn9sAEQlAABFOsAARZ1oAEPlgABGIoQARhb8AEW8GABD8oAARhOMAEVXgABE7AAARtkeAEZZ/ABGl7gARtqgAEW9qABGybIARnG0AEUs4ABGiKoAQ9eAAEeglQBGCMwARnf0AEbJvgBGfmgAR0fpAEgXB4BHsy4ASAMpAEgYdkBMBzAdAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQECAQEBAQECAgEBAQICAgIBAgEBAQICAQECAgICAgIDAQMBAgEBAgEBAgMCAQIDAgIDAwMBAgMBAgIDAgEDAwEDBAIDAwICAwQBAgICAgMDAwMEAQMDAwMDBAQBAwQEAgEEBAIEAwIEAgQBAgMBBAICAwMCAgMDBAMDAQMDAwMCAwMCAwMCAQMDAgMCAgMCAwMBAgIDAQMDAQMCAQECAgECAQICAQECAgEBAQEBAQEBAQECAQICAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQE=

To get it back into a TDigest tree:
byte[] data = DatatypeConverter.parseBase64Binary(encoded);
ByteBuffer bb = ByteBuffer.wrap(data);
TDigest td = TDigest.fromBytes(bb);

where encoded is the base64 string above.

when you query for quantiles it happens that:
td.quantile(0.999d) > td.quantile(0.9999d)
which should never happen.

I believe part of the problem is in https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java,
line 320 (https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java#L320).
I believe that in that line (center.count() - (q - t)) should instead be (q - t).
Symmetrically, line 317 (https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java#L317)
should move from (q - t) to (center.count() - (q - t)), re-establishing the symmetry.

The same holds true for lines 300 and 303 (they have to have the (q - t) and (center.count() - (q - t)) swapped around).
I also observed cases where td.quantile(0.5d) > td.quantile(0.5001d).

I am not sure it is the right fix but the invariants should be respected:
td.quantile(0.999d)<=td.quantile(0.9999d)
as well as
td.quantile(0.5d)<=td.quantile(0.5001d)

Thanks

There are no checks on the Count-Min Sketch parameters

During construction of a Count-Min Sketch, I suggest adding some checks on the parameter values.
I am talking about width, depth, confidence and epsilon.
I would require the Count-Min Sketch user to use only values within a [minimum allowed value, maximum allowed value] range.
This way, "possible surprises" are avoided.
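
As an illustration of the kind of check I mean (the bounds are illustrative assumptions and validate is a hypothetical helper, not part of the library):

    static void validate(double epsilon, double confidence, int depth, int width) {
        if (epsilon <= 0 || epsilon >= 1) {
            throw new IllegalArgumentException("epsilon must be in (0, 1): " + epsilon);
        }
        if (confidence <= 0 || confidence >= 1) {
            throw new IllegalArgumentException("confidence must be in (0, 1): " + confidence);
        }
        if (depth < 1 || width < 1) {
            throw new IllegalArgumentException("depth and width must be >= 1");
        }
    }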

CountMinSketch is not serializable

Could we make com.clearspring.analytics.stream.frequency.CountMinSketch implement java.io.Serializable, please?

I am getting this error:
org.apache.commons.lang3.SerializationException: java.io.NotSerializableException: com.clearspring.analytics.stream.frequency.CountMinSketch
at org.apache.commons.lang3.SerializationUtils.serialize(SerializationUtils.java:156)
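
In the meantime, a workaround sketch (assuming the static CountMinSketch.serialize(CountMinSketch) and CountMinSketch.deserialize(byte[]) helpers) is to round-trip through the library's own byte[] form and serialize that instead:

    byte[] bytes = CountMinSketch.serialize(sketch);
    // ship or persist the bytes however you like (ObjectOutputStream, file, network)...
    CountMinSketch restored = CountMinSketch.deserialize(bytes);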

consider support for additional hll(p) register sizes (currently only 5 is used)

(this issue takes over for #88 after clarification)

There is a trade off between bit usage and ultra-high cardinality accuracy. The current HLLP implementation uses 5 bits per register, similar to HLL. The benefit is an easier comparison between the two, and it is more efficient for most use cases. However, iirc, the original paper does recommend going up to 6 bits regardless, and it is not terribly difficult to make the register size non-constant (other than serialization format concerns).

Additionally, although the lower cardinality space is fairly well covered by the sparse set representation, there may also be benefit in allowing an even lower register size. This might work even better if some kind of additional, secondary dynamic switch is supported, e.g. "SPARSE -> NORMAL_4 -> NORMAL_5 -> NORMAL_6" or something. The runtime performance may be tricky to do well in that case, though.

The easiest solution is to add a config parameter and somehow deal with serialization issues.

HyperLogLog incorrectly estimates very high cardinality values

HyperLogLog uses a correction equation to estimate values > 1/30*2^32. Unfortunately, in Java:

Math.pow(2,32) == Math.pow(-2,32)

which is very sad. This broke our implementation.

While investigating the issue I also found that the long-range correction actually hurts the precision of the estimate in some cases, so I'm going to add an option to return the uncorrected value.
