
stream-lib's People

Contributors

abramsm, addthis-buildbot, andrewbts, b4hand, bobpoekert, cburroughs, codahale, cykl, danglotb, dave2718, david-wobrock, emopers, epollan, huitseeker, jiayangodc, jkff, kujon, lukasnalezenec, michaelklishin, mythguided, oertl, rxin, seancarr, ssserj, strongh, tdunning, tea-dragon, tedpearson, yuesong, yukim


stream-lib's Issues

count min sketch test bugs

For instance, on line 119 a <String, Long> map has an important get call made with an Integer. I think I made some attempt at fixing the weirdness I found when I wrote the conservative-add sketch tests, but that was a while ago, and the fixes are probably not ideal.
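
To illustrate the kind of bug this is (hypothetical names, just plain Java semantics): Map.get takes an Object, so a lookup with an Integer key on a <String, Long> map compiles fine but can never match a String key and silently returns null.

    Map<String, Long> expectedCounts = new HashMap<String, Long>();
    expectedCounts.put("42", 7L);

    Long wrong = expectedCounts.get(42);    // always null: Integer key in a String-keyed map
    Long right = expectedCounts.get("42");  // 7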

HyperLogLog returns negative cardinality()

Hi, I am trying out HyperLogLog and this simple Groovy snippet is giving me negative cardinality() results (although the negation of the returned value is quite accurate)

HyperLogLog hll = new HyperLogLog(12)

for (long i=0; i<2e8; ++i) {
    hll.offer(i)
}
log.info("hll estimate of 1e8 elements {}", hll.cardinality())

outputs

13:58:22.914 [main] INFO  misc.HyperLogLogTest - hll estimate of 2e8 elements -208378698

Something wrong here? Let me know if you can't reproduce. I am using 'com.clearspring.analytics:stream:2.1.1' from maven.

HyperLogLog builder with log2m

It would be nice to have the option to use log2m or rsd when using the HyperLogLog builder.

e.g.

HyperLogLog.builder().withLog2m(10).build()
HyperLogLog.builder().withRsd(0.01).build()

This would be useful, for example, with CountThenEstimate.
Currently, passing an int by mistake is not obvious:

new HyperLogLog.Builder(10).build() //wrong

Additionally, an incorrect rsd is only detected after tipping in CountThenEstimate instead of at construction.
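
A sketch of what such a builder could look like (hypothetical API; withLog2m and withRsd do not exist in the library today, and the rsd-to-log2m conversion shown is an assumption based on rsd ≈ 1.106 / sqrt(2^log2m)):

    public static class Builder {

        private Integer log2m;

        public Builder withLog2m(int log2m) {
            this.log2m = log2m;
            return this;
        }

        public Builder withRsd(double rsd) {
            // assumed conversion: rsd ≈ 1.106 / sqrt(2^log2m)
            this.log2m = (int) (Math.log((1.106 / rsd) * (1.106 / rsd)) / Math.log(2));
            return this;
        }

        public HyperLogLog build() {
            if (log2m == null) {
                throw new IllegalStateException("either log2m or rsd must be set");
            }
            return new HyperLogLog(log2m);
        }
    }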

2.9.1 HyperLogLogPlus does not count cardinality < 10000 with (p, sp) = (14, 32)

Run this code:

    HyperLogLogPlus hll = new HyperLogLogPlus(14, 32);
    for (int i = 0; i < 10000; i++) {
        UUID uuid = UUID.randomUUID();
        hll.offer(uuid);
    }
    System.out.println("Cardinality = " + hll.cardinality());

In 2.9.0, the result is correct (9999 / 10000).

In 2.9.1, the result is 0.

It breaks at sp > 30

ConcurrentStreamSummary does not implement Space-Saving algorithm

The ConcurrentStreamSummary implementation does not appear to implement the Space-Saving algorithm. In particular it appears to try to keep track of the minimum value (and introduces a race condition in the process), and remove this if the capacity is reached.

Also, there appears to be a bug where the count of the previous minimum is added to the count of the newly added element. For example, the result of the following:

        ConcurrentStreamSummary<String> vs = new ConcurrentStreamSummary<String>(3);
        String[] stream = {"X", "X", "Y", "Y", "Y", "Z", "Z", "Z", "B"};
        for (String i : stream) {
            vs.offer(i);
        }
        List<ScoredItem<String>> topK = vs.peekWithScores(3);

is the list of strings {"B", "Y", "Z"} (not {"X", "Y", "Z"}), because the score of "X" was added to that of "B" when "X" was removed.

QDigest implements neither Serializable nor Externalizable

Hi guys,

I was wondering if the fact that QDigest implements neither Serializable nor Externalizable is intentional?
I'm afraid it prevents QDigest from being serialized using ObjectOutputStream, which could be very useful (and, for better or worse, is still a quite common way to serialize things).

Unless I'm missing something, it would be pretty straightforward to fix, since QDigest already provides serialize and deserialize methods; they just need to be exposed through the standard Java serialization interfaces.
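
In the meantime, something like this helper works for me (a sketch that assumes QDigest's static serialize(QDigest) and deserialize(byte[]) methods; QDigestIO itself is hypothetical):

    public final class QDigestIO {

        private QDigestIO() {
        }

        public static void write(QDigest digest, ObjectOutputStream out) throws IOException {
            byte[] bytes = QDigest.serialize(digest);   // reuse the existing byte[] form
            out.writeInt(bytes.length);
            out.write(bytes);
        }

        public static QDigest read(ObjectInputStream in) throws IOException {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            return QDigest.deserialize(bytes);
        }
    }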

What do you think?
If it makes sense, I'd be more than willing to create a pull request.

-Stas

What does merge() in AdaptiveCounting do?

I expected that merging two AdaptiveCounting objects would add the cardinalities but that did not happen. The output of the following was:

A: 1000
B: 1000
C: 1000

I expected that C would have a cardinality of 2000. I realize that I can simply add the two cardinalities (A and B) but I wanted to understand what merge does.

Here is the code that I am using:

    ICardinality estimatorA = new AdaptiveCounting(16);
    for (long i = 0; i < 1000; i++) {
        estimatorA.offer(i);
    }
    System.out.println("A: " + estimatorA.cardinality());

    ICardinality estimatorB = new AdaptiveCounting(16);
    for (long i = 0; i < 1000; i++) {
        estimatorB.offer(i);
    }
    System.out.println("B: " + estimatorB.cardinality());

    ICardinality estimatorC = estimatorA.merge(estimatorB);
    System.out.println("C: " + estimatorC.cardinality());

Thanks.
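
For reference, merge combines the underlying registers (a union-style operation) rather than adding counts, so two estimators fed the same 1000 values still merge to roughly 1000. A variant of the snippet above with disjoint ranges, same API otherwise, should report about 2000 for C:

    ICardinality estimatorA = new AdaptiveCounting(16);
    for (long i = 0; i < 1000; i++) {
        estimatorA.offer(i);              // values 0..999
    }

    ICardinality estimatorB = new AdaptiveCounting(16);
    for (long i = 1000; i < 2000; i++) {
        estimatorB.offer(i);              // values 1000..1999, disjoint from A
    }

    ICardinality estimatorC = estimatorA.merge(estimatorB);
    System.out.println("C: " + estimatorC.cardinality());   // expect roughly 2000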

Merge two TDigests throws NullPointerException

Sometimes, when I merged two TDigests by calling the add method of one TDigest, it threw a NullPointerException as follows:
java.lang.NullPointerException
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:85)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:79)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:79)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:79)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:79)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:79)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:81)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:81)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:81)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:79)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:81)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:79)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:81)
at com.clearspring.analytics.stream.quantile.GroupTree.add(GroupTree.java:81)
at com.clearspring.analytics.stream.quantile.TDigest.add(TDigest.java:146)
at com.clearspring.analytics.stream.quantile.TDigest.add(TDigest.java:171)

maybe there is an error in RegisterSet.java

Line 52 of RegisterSet.java, in the getSizeForCount function, is else if (bits % Integer.SIZE == 0):

public static int getSizeForCount(int count) {
    int bits = getBits(count);
    if (bits == 0) {
        return 1;
    } else if (bits % Integer.SIZE == 0) {
        return bits;
    } else {
        return bits + 1;
    }
}

The parameter count is the number of buckets, and RegisterSet.REGISTER_SIZE, the number of buckets per int, is 5. The call int bits = getBits(count) returns the length of the int array needed for the buckets. I think the check on line 52 should be if (count % REGISTER_SIZE == 0): when that remainder is zero, bits is already the required length of the int array.
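
Expressed as code, the proposed change would look roughly like this (an untested sketch of the suggestion above, not a verified fix):

    public static int getSizeForCount(int count) {
        int bits = getBits(count);
        if (bits == 0) {
            return 1;
        } else if (count % RegisterSet.REGISTER_SIZE == 0) {
            return bits;
        } else {
            return bits + 1;
        }
    }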

Argument checks in HyperLogLogPlus constructor need to be more restrictive

The HyperLogLogPlus implementation only supports parameters p and sp for which 4 <= p <= sp <= 25. However, there is only a check for 4 <= p <= sp <= 32. Here is some test code with sp=26 that demonstrates the problem:

        HyperLogLogPlus hll = new HyperLogLogPlus(25, 26);

        hll.offerHashed(0X7FFFFF8000000000L);
        hll.offerHashed(0XFFFFFF8000000000L); 

        assertEquals(2, hll.cardinality()); // fails, hll.cardinality() returns 1 

The expected cardinality is 2, because the index (first 25 bits) as well as the sparse index (first 26 bits) are different.

Possibly out of range

I'm using the com.clearspring.analytics.stream.quantile.QDigest class to approximate quantiles over roughly 100k data points, and some of the values can approach the int64 maximum, which is above the range QDigest accepts. I found this when running on Amazon EMR:

Caused by: java.lang.IllegalArgumentException: Can only accept values in the range 0..4611686018427387903, got 9223372036854775807
    at com.clearspring.analytics.stream.quantile.QDigest.offer(QDigest.java:125)
    at com.liveramp.cascading_ext.combiner.lib.QuantileExactAggregator.partialAggregate(QuantileExactAggregator.java:38)
    at com.liveramp.cascading_ext.combiner.lib.QuantileExactAggregator.partialAggregate(QuantileExactAggregator.java:17)
    at com.liveramp.cascading_ext.combiner.CombinerFunctionContext.combineAndEvict(CombinerFunctionContext.java:130)
    at com.liveramp.cascading_ext.combiner.CombinerFunction.operate(CombinerFunction.java:130)
    at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:99)
    ... 11 more

I suppose this is because, although the offer method parameter is declared as a long, only values up to 2^62 - 1 are accepted. Is there any workaround for this?
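
One possible workaround, sketched under the assumption that only the relative ordering of the values matters for the quantiles you need (digest here is an existing QDigest instance): shift values into the accepted range before offering them, and scale query results back.

    long raw = Long.MAX_VALUE;          // example value outside QDigest's 0..2^62-1 range
    long adjusted = raw >>> 1;          // halve it: ordering preserved, one low-order bit lost
    digest.offer(adjusted);

    // When reading quantiles back, undo the scaling (approximately).
    long approxMedian = digest.getQuantile(0.5) << 1;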

CountMinSketch overflow check causes a lot of allocations

It starts with String.valueOf(item) here:

checkSizeAfterAdd(String.valueOf(item), count);

then with String concatenation here:

checkSizeAfterOperation(previousSize, "add(" + item + "," + count + ")", size);

and here:

checkSizeAfterOperation(previousSize, "merge(" + estimator + ")", size);

Could be related to #140
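
One possible shape for a fix, sketched only to illustrate the idea (not the library's actual code; the method below is a hypothetical variant that takes the raw operands): defer building the message string until the check actually fails, so the hot path allocates nothing.

    private void checkSizeAfterAdd(Object item, long count, long previousSize) {
        if (size >= previousSize) {
            return;  // common case: no String.valueOf and no concatenation
        }
        throw new IllegalStateException(
                "Overflow detected after add(" + String.valueOf(item) + ", " + count + ")");
    }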

Implement LogLog-Beta

A new paper released this month introduces a new cardinality estimation algorithm called LogLog-Beta/β:

https://arxiv.org/abs/1612.02284

"The new algorithm uses only one formula and needs no additional bias
corrections for the entire range of cardinalities, therefore, it is more
efficient and simpler to implement. Our simulations show that the accuracy
provided by the new algorithm is as good as or better than the accuracy
provided by either of HyperLogLog or HyperLogLog++."
Some comments about its accuracy (graphs included) can be found in this PR.
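
For reference, the estimator in the paper has roughly the following shape (a sketch based on the paper's formula, not code from this library; alphaM and the b[] coefficients are fitted constants that depend on the register count m):

    static double loglogBeta(int[] registers, double alphaM, double[] b) {
        int m = registers.length;
        double z = 0;                        // number of zero-valued registers
        double harmonicSum = 0;              // sum of 2^-M[j] over all registers
        for (int reg : registers) {
            if (reg == 0) {
                z++;
            }
            harmonicSum += 1.0 / (1L << reg);
        }
        double lnZ = Math.log(z + 1);
        double beta = b[0] * z;              // beta(z): polynomial in z and ln(z + 1)
        for (int i = 1; i < b.length; i++) {
            beta += b[i] * Math.pow(lnZ, i);
        }
        return alphaM * m * (m - z) / (beta + harmonicSum);
    }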

HyperLogLogPlus merge introduces error when counters overlap

When merging two HyperLogLogPlus counters that have a large intersection of their underlying sets, it is possible to introduce a large amount of error into the cardinality estimate. The cause is the check on whether the sizes of the two sparse lists sum to more than the sparseThreshold. If the two lists share many elements, their actual merge is a list much smaller than that sum, so a sparse counter can be promoted to a normal counter long before it should be. This both wastes space and, if it happens too early, can cause large errors in cardinality estimation (I have seen up to 30% in tests) because the bias estimation curves have no samples for cardinalities that low.

Forcing merges to always merge sparse lists before checking size helps, but does not completely eliminate the problem. I believe this is due to problems in the merge implementation that I will open another issue for.

The easiest way to reproduce this error is to produce two counters with identical elements at just over 1/2 the sparseThreshold, then merge them.
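
A repro sketch along those lines (run inside a method that declares CardinalityMergeException; the element count is a rough guess at just over half the sparse threshold for p = 14, not a value taken from the code):

    HyperLogLogPlus a = new HyperLogLogPlus(14, 25);
    HyperLogLogPlus b = new HyperLogLogPlus(14, 25);
    for (int i = 0; i < 6500; i++) {
        String item = "item-" + i;
        a.offer(item);
        b.offer(item);   // both counters hold exactly the same elements
    }
    HyperLogLogPlus merged = (HyperLogLogPlus) a.merge(b);
    // True cardinality is 6500; the size-sum check can push the merged counter
    // into normal mode prematurely and inflate the estimation error.
    System.out.println(merged.cardinality());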

Generic type for Count-Min Sketch

Hi guys,

First of all, thank you for the great work you put into this library!

I am using the Count-Min Sketch implementation and I wonder why the only supported item types are long and String.
Indeed, there are types of objects such as Double or Integer that convert to String easily and unambiguously.
For objects in general, the toString() method could be called, as with the MurmurHash used in HyperLogLog.
Is there a specific reason you decided not to do this for CountMinSketch? A small helper like the one below would already cover my case.
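
(A sketch; addObject is a hypothetical helper on top of the existing add(String, long) overload, not part of the library.)

    static void addObject(CountMinSketch sketch, Object item, long count) {
        // Fall back to toString() and reuse the existing String code path.
        sketch.add(item.toString(), count);
    }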

Best regards,

Arnaud

Implementing `equals`, `hashCode` and `toString` for CountMinSketch

I am interested in adding these 3 methods to CountMinSketch:

  • #equals()
  • #hashCode()
  • #toString()

I can create a pull request if you are interested...

The code would be something like this:

    @Override
    public String toString() {
        return "CountMinSketch{" +
                "relativeError=" + eps +
                ", confidence=" + confidence +
                ", depth=" + depth +
                ", width=" + width +
//                ", table=" + Arrays.toString(table) +
//                ", hashA=" + Arrays.toString(hashA) +
                ", size=" + size +
                '}';
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }

        final CountMinSketch that = (CountMinSketch) o;

        if (depth != that.depth) {
            return false;
        }
        if (width != that.width) {
            return false;
        }

        if (Double.compare(that.eps, eps) != 0) {
            return false;
        }
        if (Double.compare(that.confidence, confidence) != 0) {
            return false;
        }

        if (size != that.size) {
            return false;
        }

        if (!Arrays.deepEquals(table, that.table)) {
            return false;
        }
        return Arrays.equals(hashA, that.hashA);
    }

    @Override
    public int hashCode() {
        int result;
        long temp;
        result = depth;
        result = 31 * result + width;
        result = 31 * result + Arrays.deepHashCode(table);
        result = 31 * result + Arrays.hashCode(hashA);
        result = 31 * result + (int) (size ^ (size >>> 32));
        temp = Double.doubleToLongBits(eps);
        result = 31 * result + (int) (temp ^ (temp >>> 32));
        temp = Double.doubleToLongBits(confidence);
        result = 31 * result + (int) (temp ^ (temp >>> 32));
        return result;
    }

Typos, documentation and parameters

Hi guys,

thanks for such a nice collection of algorithms, I've been playing a bit with and stumbled over a few things I would like to clarify:

  • HyperLogLog:151,161 seems to have a redundant "+ 1", which perhaps appeared on copy-paste from LogLog; suggested correction:
final int r = Long.numberOfLeadingZeros((hashedValue << this.log2m) | (1 << (this.log2m - 1)) + 1);
  • TestLogLog uses 1.04 as the error estimate parameter (beta), while other sources, e.g. http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining and the original Flajolet papers, use 1.30. Which is correct?
  • Furthermore, TestLogLog asserts the estimated value to be within 2 * se of the actual value, while TestHyperLogLog checks whether it is within 3 * se. Is there a good reason for this?
  • HyperLogLog:94,106 mentions 1.106 vs 1.04 -- which is correct? (Googling 1.106 led me to the beta-32 bound given in Flajolet's 2007 paper, while 1.04 corresponds to beta-infinity.)
  • CountThenEstimate:217 -- perhaps it should be LogLog(bytes) instead of LinearCounting(bytes)?

Thanks and cheers,
Simon

HLL++: large values of p lead to AIOOBE

The HLL++ paper describe the following preconditions for p and p':

"The input data set S, the precision p, the precision p' used in the sparse representation where p ∈ [4..p'] and p'≤64"

However, bias-correction data is only provided for p in [4..18]. To estimate the cardinality when E < 5 * m, the current implementation performs a lookup in the rawEstimateData table as follows:

        double[] estimateVector = rawEstimateData[p - 4];

The issue is that when p > 22 and E < 5 * m, an ArrayIndexOutOfBoundsException is thrown. I'm not familiar with large precisions and not sure what to do:

  • Forbid p > 18 since we don't have accurate bias correction tables
  • Use the 18 table for all p >= 18
  • Don't use the bias correction algorithm for large p

The issue can easily be reproduced:

HyperLogLogPlus hllp = new HyperLogLogPlus(19);
hllp.cardinality();

Do you have any idea about this one? TestHyperLogLogPlus.selfMerge uses large values of p, up to 20, and the Javadoc says p can be up to p' == 64.

HyperLogLogPlus memory / sparse-to-normal transition issue

Hi,

We noticed this when we were playing around with serialization. If we have a HyperLogLogPlus object A and either (a) merge it with another one B into a third HLL++ object C, or (b) ask A for its cardinality frequently while offering it a stream of items, the memory usage of the original object A seems to grow linearly well beyond the point where it should have switched over to the "normal" mode of operation (instead of "sparse" mode). We believe we have narrowed this behavior down to the following. When an HLL++ object is asked for its cardinality or used as the input to merge, the mergeTempList method is called and the temp list's contents are moved into the sparseSet. However, the offer method only checks whether to switch to normal mode when the tmpSet exceeds one threshold, and only then checks whether the sparseSet exceeds a different threshold:

case SPARSE:
    // Call the sparse encoding scheme which attempts to stuff as much
    // helpful data into 32 bits as possible
    int k = encodeHash(x, p, sp);
    // Put the encoded data into the temp set
    tmpSet[tmpIndex++] = k;
    if (tmpIndex > sortThreshold) {
        mergeTempList();
        if (sparseSet != null && sparseSet.length > sparseSetThreshold) {
            convertToNormal();
        }
    }
    return true;

However, if we are calling cardinality() or merge() on this object, the tmpSet has already been emptied and never reaches the threshold needed to meet this condition, so the set is never converted to normal mode.

I didn't have time to try it out yet, but perhaps checking whether the sparseSet is above the sparseSetThreshold regardless of the state of the tmpSet, and using that to convert to normal mode, would fix this issue. Something like this:

case SPARSE:
    // Call the sparse encoding scheme which attempts to stuff as much
    // helpful data into 32 bits as possible
    int k = encodeHash(x, p, sp);
    // Put the encoded data into the temp set
    tmpSet[tmpIndex++] = k;
    if (tmpIndex > sortThreshold) {
        mergeTempList();
    }
    if (sparseSet != null && sparseSet.length > sparseSetThreshold) {
        mergeTempList();
        convertToNormal();
    }
    return true;

(I'm not sure whether the second mergeTempList is really needed, or whether the check could be folded into the previous if statement as an additional condition.)

Incorrectly handled edge cases in HyperLogLogPlus merge

The merging of the temporary list into the sparse list in HyperLogLogPlus assumes that, under normal integer ordering, each index will always appear in contiguous blocks in the sorted temp list. This is not the case; it is easy to construct a counterexample.

Suppose p = 11, sp = 20. We offer 3 objects to the HLLP that have the following hashes:

00000001010000000000 0^41 1 ...
00000001010000000000 0^31 1 ...
01010000000000011100 ...

The ellipses mean that I don't care about the bits to the right, and 0^n means n zeros.
The first two hashes have the same idx' and by assumption should be contiguous in the ordering (the consumeDuplicates function depends on this). However, if you encode each of these hashes you get (in the same order as above):

... 000000010100000000000101011 = 655403
... 000000010100000000001000001 = 655425
... 000000010100000000000111000 = 655416

Here the ... means that all higher-order bits are zeros. Notice that the third value sorts between the first and second, even though the first and second should appear together.

This can be avoided by using a custom comparator when sorting the temporary list that does a dictionary comparison on idx' and then on the encoded hash.
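
A sketch of such a comparator (getSparseIndex stands in for whatever extracts idx' from an encoded value; treat the exact method name as an assumption):

    Comparator<Integer> byIndexThenHash = new Comparator<Integer>() {
        @Override
        public int compare(Integer a, Integer b) {
            int idxA = getSparseIndex(a);
            int idxB = getSparseIndex(b);
            if (idxA != idxB) {
                return Integer.compare(idxA, idxB);   // group equal idx' values together
            }
            return Integer.compare(a, b);             // then fall back to the encoded hash
        }
    };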

CountMinSketch#size should be long everywhere in the code

This code shows that the CountMinSketch size() method does not work as expected.

        double confidence = 0.999;
        double epsilon = 0.0001;
        int seed = 1;

        CountMinSketch sketch = new CountMinSketch(epsilon, confidence, seed);
        sketch.add(0, Integer.MAX_VALUE);
        long expectedSize = Integer.MAX_VALUE;
        Assert.assertEquals(expectedSize, sketch.size());

        CountMinSketch newSketch = CountMinSketch.merge(sketch, sketch);
        // bug: the next line is failing
        Assert.assertEquals(2 * expectedSize, newSketch.size());

There are two places in the code where size is wrongly used as an int:
https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java#L68
https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java#L186

It should be long everywhere.

Happy to fix it and add some unit tests in a pull request.
Please let me know.

Explanation needed on StreamSummary's top-k results

Hello,
when running this test:

StreamSummary<String> vs = new StreamSummary<String>(3);
String[] stream = { "X", "X", "Y", "Z", "A", "B", "C", "X", "X", "A", "A", "A", "1", "2", "3", "4", "5", "6" };
for (String i : stream) {
    vs.offer(i);
}
System.out.println(vs);

I get this result: [{6:[{4:5},{5:5},{6:5}]}].

I wonder if this is normal behavior or a bug. I would suggest resetting counter.count to 0 in the StreamSummary#offerReturnAll method when a new item overrides the item with the fewest hits.

Thanks for your analysis. Best regards.

Varint bug

The check if (i > 63) is not correct; the bound should be 70.

The maximum number of bytes in a 64-bit varint is 10 (7 payload bits per byte, i.e. 70 bits in total).

CountMinSketch#size goes to overflow

This code (unit test) fails:

        CountMinSketch sketch = new CountMinSketch(0.0001, 0.99999, 1);
        sketch.add(3, Long.MAX_VALUE);
        long size1 = sketch.size();
        assertEquals(Long.MAX_VALUE, size1);

        sketch.add(4, 1);
        long size2 = sketch.size();
        assertTrue(size2 > size1);

I would check the size at construction and also after calling #add() or #merge():

  • the size must always be >= 0 (at construction and after any operation)
  • the size after an operation such as add or merge must be larger than the previous size
    If either condition is not satisfied (because of overflow), I would throw an exception.

Let me know if you agree.
If so, I could create a pull request with my solution.

RegisterSet incorrectly allocates the space needed to track counts

The int array that's allocated in RegisterSet does not contain enough entries to accommodate correct counts for a given value of p.

Each count takes 6 bits, meaning that to store 2^p counts you need 6*(2^p)/8 bytes, not 2^p/6.

The counts are therefore incorrect. This can be detected by feeding, for example, hashes from 0 to (2^p - 1) << (64 - p) in steps of 1 << (64 - p): the result should be 2^64 - 1, but it is 6194288074.

LogLog crashes on k=31

If LogLog is constructed with k = 31, an exception is thrown:
Exception in thread "main" java.lang.NegativeArraySizeException
at com.clearspring.analytics.stream.cardinality.LogLog.<init>(LogLog.java:76)

Actually, k can't be greater than 30, because (1 << 31) overflows to a negative int (Integer.MIN_VALUE).

Varint bug

The shift should be >>> otherwise it does sign extension, bug?
writeUnsignedVarLong((value << 1) ^ (value >> 63), out);

Implement LogLog-Beta/β (new cardinality estimation algo)

A new paper released this month introduces a new cardinality estimation algorithm called LogLog-Beta/β:

https://arxiv.org/abs/1612.02284

"The new algorithm uses only one formula and needs no additional bias
corrections for the entire range of cardinalities, therefore, it is more
efficient and simpler to implement. Our simulations show that the accuracy
provided by the new algorithm is as good as or better than the accuracy
provided by either of HyperLogLog or HyperLogLog++."
Some comments about its accuracy (graphs included) can be found in this PR: redis/redis#3677

use logging framework for tests

As noted by derekgr, the System.out spam has gotten a mite out of control. Finely detailed statements are great for debugging or maybe learning, but disabling them until needed is what logging frameworks are for. I suggest just a test-scoped slf4j simple logger.

how to serialize HLL using kryo

Hi, we are using HLL in a Storm project. We want to serialize the HLL object using Kryo and send it to another Bolt component for computation. But serialization fails when we try the following test code:

        Kryo kryo = new Kryo();
        kryo.setInstantiatorStrategy(new SerializingInstantiatorStrategy());
        kryo.register(HyperLogLogPlus.class);
        try {
            Output output = new Output(new FileOutputStream("/Users/Felix/Desktop/file.bin"));
            kryo.writeObject(output, card);
            output.close();

            Input input = new Input(new FileInputStream("/Users/Felix/Desktop/file.bin"));
            HyperLogLogPlus someObject = kryo.readObject(input, HyperLogLogPlus.class);
            System.out.println(someObject.cardinality());
            input.close();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }

We got following errors:

Exception in thread "main" com.esotericsoftware.kryo.KryoException: Class cannot be created (missing no-arg constructor): com.clearspring.analytics.stream.cardinality.HyperLogLog
	at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1050)
	at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1062)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:228)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:217)
	at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:626)
	at hyperloglog.TT.main(TT.java:45)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Does this mean that HLL cannot be serialized by Kryo, or is there a way to do it correctly?
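
One workaround to try (a sketch; it assumes HyperLogLogPlus.getBytes() and HyperLogLogPlus.Builder.build(byte[]) as the round-trip API) is a custom Kryo serializer that bypasses field-by-field reflection entirely:

    public class HyperLogLogPlusSerializer extends Serializer<HyperLogLogPlus> {

        @Override
        public void write(Kryo kryo, Output output, HyperLogLogPlus hll) {
            try {
                byte[] bytes = hll.getBytes();
                output.writeInt(bytes.length, true);
                output.writeBytes(bytes);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public HyperLogLogPlus read(Kryo kryo, Input input, Class<HyperLogLogPlus> type) {
            try {
                int length = input.readInt(true);
                return HyperLogLogPlus.Builder.build(input.readBytes(length));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }

    // Registered with: kryo.register(HyperLogLogPlus.class, new HyperLogLogPlusSerializer());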

Thank you very much.

HLL becomes incorrect if cardinality is over billions

@Override
public long cardinality() {
    double registerSum = 0;
    int count = registerSet.count;
    double zeros = 0.0;
    for (int j = 0; j < registerSet.count; j++) {
        int val = registerSet.get(j);
        registerSum += 1.0 / (1 << val);        // SHOULD BE -- 1L << val
        if (val == 0) {
            zeros++;
        }
    }
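
For context, this is plain Java shift semantics: with an int left operand the shift distance is taken modulo 32, so 1 << val stops giving the intended power of two once val reaches 31, while 1L << val keeps working:

    int val = 33;
    System.out.println(1 << val);    // prints 2: the int shift distance is val % 32
    System.out.println(1L << val);   // prints 8589934592: the intended 2^33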

HLL++ in sparse mode can be larger than in normal mode

Today I played a little bit with HLL++ in sparse mode. I have tens of millions of HLL estimators and most of them have a low cardinality. Using HLL or HLL++ in normal mode is not memory efficient for this use case. To be picky, I don't care that much about in-memory consumption; I am trying to minimize the serialized size of the estimators.

The whole idea behind sparse mode is to not waste memory on the normal representation when we can do better for small cardinalities. It sounds reasonable to switch back to the normal representation as soon as sparse mode consumes more memory.

However, I don't observe such behavior with HyperLogLogPlus:

HyperLogLog:

        HyperLogLog hll = new HyperLogLog(14);
        System.out.println(hll.getBytes().length);

=> 10932

HyperLogLogPlus in normal mode

        HyperLogLogPlus hllp = new HyperLogLogPlus(14);
        System.out.println(hllp.getBytes().length);

=> 10940 

Empty HyperLogLogPlus in sparse mode

        HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);
        System.out.println(hllp.getBytes().length);

=> 16

5K elements with HyperLogLogPlus in sparse mode

        Random r = new Random();

        HyperLogLogPlus hllp = new HyperLogLogPlus(14, 14);

        for (int i = 0; i < 5000; i++) {
            hllp.offer(Integer.toString(r.nextInt()));
        }

        System.out.println(hllp.getBytes().length);

=> 25495

According to the source code, the sparseSetThreshold depends only on p and is set to 12288 for p = 14. This means that if the set contains 12000 elements, I'm wasting almost 40 KB compared to the normal representation.

Am I wrong? Is this behavior expected?

My second question would be: do we really want to create the RegisterSet even when we are in sparse mode? It defeats the goal of being memory efficient. It currently doesn't matter for me, since my bottleneck is the serialization size, but I find it quite surprising.

p10 > p50, p90 > p99 for a simple digest

        TDigest tdigest = new TDigest(100);
        tdigest.add(0.18615591526031494);
        tdigest.add(0.4241943657398224);
        tdigest.add(0.8813006281852722);

        System.out.println("p10: " + tdigest.quantile(0.1));
        System.out.println("p50: " + tdigest.quantile(0.5));
        System.out.println("p90: " + tdigest.quantile(0.9));
        System.out.println("p95: " + tdigest.quantile(0.95));
        System.out.println("p99: " + tdigest.quantile(0.99));

The output doesn't look right:

p10: 0.35278283059597015
p50: 0.30517514050006866
p90: 1.018432506918907
p95: 0.9498665675520899
p99: 0.8950138160586358
  • p10 > p50
  • p90 > p95 > p99

Should I use https://github.com/tdunning/t-digest/blob/master/src/main/java/com/tdunning/math/stats/MergingDigest.java instead?

TDigest percentiles are off on the last centroid (and close by percentiles have inverted values)

I have a serialized TDigest in Base64:
AAAAAkBZAAAAAAAAAAABLkkzm5BGwGAARy0nAEYp0ABGlTYARLegAEU48ABHBuMARTgQAESsIABFSPAARDqAAEaLFgBEpKAARYYwAEQVgABE9mAARTdgAESuYABD3QAARiY4AEWVgABDMwAARTeAAEXByABEYcAARajAAEQwwABF5RgARShwAENWAABF/WAARMAgAEQxAABFGEAAQsoAAEUZMABF4wAARaVgAEV54ABFbuAARVYgAEQqQABGLPwARKngAEQwgABEhOAARIzgAER7gABEoaAARPOgAEXFoABFLkAARHWAAEVDwABE2WAAQ3kAAESfIABFdbAARX6gAESHAABELoAARXxQAETDgABDNQAARa5wAEU6oABFIkAARcg4AESlIABEpDAARcM0AEWy+ABFZlgARPSwAEVRcABE+aAAQmQAAESAgABF22wARZh4AESmUABEbkAAQ+yAAEQ24ABFb3AARdYoAEXyNABD70AARgjMq0WGQqtFxTgARQDAAEUBAABCNAAARRCQAEU/aABEy5AARPlAAETwkABFL1gARa0wAESiYABFR/AARJaVVUUmLVVF5uQARbnSq0Wc7VVFXVAARaTwAEWFtABFFpgARGJAAESQ0ABFMbAAQ08qqkW40qtFgIgARZYYAEWNUABFmOgARD2VVUWYM1VFyWYARWqiq0UGkABE3JqrRWFIAESeAABF+34ARUR0AETgwABElVAARIagAEUfcABFhZFVRKyqq0YD1qtFbPqqRMTdVUPG4ABFNOVVRMUqq0UNiqtE0AAARZe6q0XLUKtFaBAARDLQAEVwSqtFpQarRYb8AEXYSABFNlAARI5gAET+iABFgt4ARYlQAEVdkABEv3AARU5sAEWchgBFPHAARSGgAEW6iABFH/AARQFQAESnqABEZbAARNlwAETB5VVEpVVVRNoFVUUA2ABFVNVVRQ1qq0TMeABFTC6rRcuNVUTVVVVFX6AAROG1VUVlAABF2h1VRYf5VUUarVVFrKVVRMxAAETy9VVE6cqrRNtwAEXxbABFEeVVRTcwAEUOGqpEynVVRTU1VUWxWABEv0AARQ3YAEUN4qtGDBQAROLKq0VnIABFWcAARfqYAEWSaABFhSgAReyiq0XmhVVGEVFVRYohVUUOqABE72AARRvgAEYUigBFVTgARP8QAETH0ABGHsAARSs4AEUgSABFfLAARQ8gAEWXuABFTVAARl1AAEUBQABDuIAAQ6uAAEUDkABD/4AARbl4AEVbEABFEOAAReMIAEXSkABGL8AARW8wAEYHTABFFbAAQ0gAAEUBsABFt4AARY8wAEY0FABGDwwARJWAAEVIoABFSZAARZVoAEYOsABFNVAARgWwAEU5EABForgARSAQAEXb8ABFStAARaa4AEPTgABErMAARU8gAEQVwABE7CAARh8UAET1gABGN+AARn9sAEQlAABFOsAARZ1oAEPlgABGIoQARhb8AEW8GABD8oAARhOMAEVXgABE7AAARtkeAEZZ/ABGl7gARtqgAEW9qABGybIARnG0AEUs4ABGiKoAQ9eAAEeglQBGCMwARnf0AEbJvgBGfmgAR0fpAEgXB4BHsy4ASAMpAEgYdkBMBzAdAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQECAQEBAQECAgEBAQICAgIBAgEBAQICAQECAgICAgIDAQMBAgEBAgEBAgMCAQIDAgIDAwMBAgMBAgIDAgEDAwEDBAIDAwICAwQBAgICAgMDAwMEAQMDAwMDBAQBAwQEAgEEBAIEAwIEAgQBAgMBBAICAwMCAgMDBAMDAQMDAwMCAwMCAwMCAQMDAgMCAgMCAwMBAgIDAQMDAQMCAQECAgECAQICAQECAgEBAQEBAQEBAQECAQICAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQE=

To get it back into a TDigest tree:
byte[] data = DatatypeConverter.parseBase64Binary(encoded);
ByteBuffer bb = ByteBuffer.wrap(data);
TDigest td = TDigest.fromBytes(bb);

where encoded is the base64 string above.

when you query for quantiles it happens that:
td.quantile(0.999d) > td.quantile(0.9999d)
which should never happen.

I believe part of the problem is in https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java,
line 320 (https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java#L320).
I believe that in that line (center.count() - (q - t)) should instead be (q - t).
Symmetrically, line 317 (https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java#L317)
should move from (q - t) to (center.count() - (q - t)), re-establishing the symmetry.

The same holds true for lines 300 and 303 (they have to have the (q - t) and (center.count() - (q - t)) swapped around).
I also observed cases where td.quantile(0.5d) > td.quantile(0.5001d).

I am not sure it is the right fix but the invariants should be respected:
td.quantile(0.999d)<=td.quantile(0.9999d)
as well as
td.quantile(0.5d)<=td.quantile(0.5001d)

Thanks

There are no checks on the Count-Min Sketch parameters

During construction of a Count-Min Sketch, I suggest adding some checks on the parameter values.
I am talking about width, depth, confidence and epsilon.
I would require the Count-Min Sketch user to use only values within a [minimum allowed value, maximum allowed value] range.
This way, "possible surprises" are avoided.
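
As an illustration of the kind of check I mean (the bounds are illustrative assumptions and validate is a hypothetical helper, not part of the library):

    static void validate(double epsilon, double confidence, int depth, int width) {
        if (epsilon <= 0 || epsilon >= 1) {
            throw new IllegalArgumentException("epsilon must be in (0, 1): " + epsilon);
        }
        if (confidence <= 0 || confidence >= 1) {
            throw new IllegalArgumentException("confidence must be in (0, 1): " + confidence);
        }
        if (depth < 1 || width < 1) {
            throw new IllegalArgumentException("depth and width must be >= 1");
        }
    }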

CountMinSketch is not serializable

Could we make com.clearspring.analytics.stream.frequency.CountMinSketch implement java.io.Serializable, please?

I am getting this error:
org.apache.commons.lang3.SerializationException: java.io.NotSerializableException: com.clearspring.analytics.stream.frequency.CountMinSketch
at org.apache.commons.lang3.SerializationUtils.serialize(SerializationUtils.java:156)
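
In the meantime, a workaround sketch (assuming the static CountMinSketch.serialize(CountMinSketch) and CountMinSketch.deserialize(byte[]) helpers) is to round-trip through the library's own byte[] form and serialize that instead:

    byte[] bytes = CountMinSketch.serialize(sketch);
    // ship or persist the bytes however you like (ObjectOutputStream, file, network)...
    CountMinSketch restored = CountMinSketch.deserialize(bytes);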

consider support for additional hll(p) register sizes (currently only 5 is used)

(this issue takes over for #88 after clarification)

There is a trade off between bit usage and ultra-high cardinality accuracy. The current HLLP implementation uses 5 bits per register, similar to HLL. The benefit is an easier comparison between the two, and it is more efficient for most use cases. However, iirc, the original paper does recommend going up to 6 bits regardless, and it is not terribly difficult to make the register size non-constant (other than serialization format concerns).

Additionally, although the lower cardinality space is fairly well covered by the sparse set representation, there may also be benefit in allowing an even lower register size. This might work even better if some kind of additional, secondary dynamic switch is supported, e.g. "SPARSE -> NORMAL_4 -> NORMAL_5 -> NORMAL_6" or something. The runtime performance may be tricky to do well in that case, though.

The easiest solution is to add a config parameter and somehow deal with serialization issues.

HyperLogLog incorrectly estimates very high cardinality values

HyperLogLog uses a correction equation to estimate values > 1/30*2^32. Unfortunately, in Java:

Math.pow(2,32) == Math.pow(-2,32)

which is very sad. This broke our implementation.

While investigating the issue I also found that the long-range correction actually hurts the precision of the estimate in some cases, so I'm going to add an option to return the uncorrected value.
