datasketches-java's Introduction

Apache® DataSketches™ Core Java Library Component

This is the core Java component of the DataSketches library. It contains all of the sketching algorithms and can be accessed directly from user applications.

This component is also a dependency of other components of the library that create adaptors for target systems, such as the Apache Pig adaptor and the Apache Hive adaptor.

Note that we have a parallel core component for C++ and Python implementations of the same sketch algorithms, datasketches-cpp.

Please visit the main DataSketches website for more information.

If you are interested in making contributions to this site, please see our Community page for how to contact us.


Maven Build Instructions

NOTE: This component accesses resource files for testing. As a result, the directory elements of the full absolute path of the target installation directory must qualify as Java identifiers. In other words, the directory elements must not have any space characters (or non-Java identifier characters) in any of the path elements. This is required by the Oracle Java Specification in order to ensure location-independent access to resources: See Oracle Location-Independent Access to Resources
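The check described in the note can be sketched with the standard library alone. This is a minimal sketch under a strict reading of the note: the class and method names (`PathElementCheck`, `firstBadElement`) are hypothetical and not part of this component, and the check also flags characters such as hyphens, so treat a report as a hint rather than a verdict.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the DataSketches library.
final class PathElementCheck {

    /** Returns true if the element starts and continues with Java identifier characters. */
    static boolean isJavaIdentifier(String element) {
        if (element.isEmpty() || !Character.isJavaIdentifierStart(element.charAt(0))) {
            return false;
        }
        for (int i = 1; i < element.length(); i++) {
            if (!Character.isJavaIdentifierPart(element.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    /** Returns the first path element that does not qualify, or null if all do. */
    static String firstBadElement(Path absolutePath) {
        // Iterating a Path yields its name elements; the root is not included.
        for (Path element : absolutePath) {
            if (!isJavaIdentifier(element.toString())) {
                return element.toString();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Checks the directory given as the first argument, or the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".").toAbsolutePath().normalize();
        String bad = firstBadElement(dir);
        System.out.println(bad == null ? "OK: " + dir : "Problem element: " + bad);
    }
}
```

Running it against the intended installation directory before building gives an early warning about path elements containing spaces.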

A JDK8 with Hotspot or JDK11 with Hotspot is required to compile.

This component depends on the datasketches-memory component, and, as a result, must be compiled with one of the above JDKs. If your application only relies on the APIs of this component no special JVM arguments are required. However, if your application also directly relies on the APIs of the datasketches-memory component, you may need additional JVM arguments. Please refer to the datasketches-memory README for details.

If your application uses Maven, you can also use the pom.xml of this component as an example of how to automatically configure the JVM arguments for compilation and testing based on the version of the JDK.

Recommended Build Tool

This DataSketches component is structured as a Maven project and Maven is the recommended Build Tool.

There are two types of tests: normal unit tests and tests run by the strict profile.

To run normal unit tests:

$ mvn clean test

To run the strict profile tests (only supported in Java 8):

$ mvn clean test -P strict

To install jars built from the downloaded source:

$ mvn clean install -DskipTests=true

This will create the following jars:

  • datasketches-java-X.Y.Z.jar The compiled main class files.
  • datasketches-java-X.Y.Z-tests.jar The compiled test class files.
  • datasketches-java-X.Y.Z-sources.jar The main source files.
  • datasketches-java-X.Y.Z-test-sources.jar The test source files.
  • datasketches-java-X.Y.Z-javadoc.jar The compressed Javadocs.

Dependencies

Run-time

There is one run-time dependency:

  • org.apache.datasketches : datasketches-memory

Testing

See the pom.xml file for test dependencies.

Special Build / Test Instructions for Eclipse

Building and running tests using JDK 8 should not be a problem.

However, with JDK 9+ and Eclipse versions up to and including 4.22.0 (2021-12), Eclipse fails to translate the required JPMS JVM arguments specified in the POM compiler or surefire plugins into the .classpath file, which causes illegal reflection access errors (see eclipse-m2e/m2e-core Bug 543631).

There are two ways to fix this:

Method 1: Manually update .classpath file:

Open the .classpath file in a text editor and find the following classpathentry element (this assumes JDK11, change to suit):

	<classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER/org.eclipse.jdt.internal.debug.ui.launcher.StandardVMType/JavaSE-11">
		<attributes>
			<attribute name="module" value="true"/>
			<attribute name="maven.pomderived" value="true"/>
		</attributes>
	</classpathentry>

Then edit it as follows:

	<classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER/org.eclipse.jdt.internal.debug.ui.launcher.StandardVMType/JavaSE-11">
		<attributes>
			<attribute name="module" value="true"/>
			<attribute name="add-exports" value="java.base/jdk.internal.misc=ALL-UNNAMED:java.base/jdk.internal.ref=ALL-UNNAMED"/>
			<attribute name="add-opens" value="java.base/java.nio=ALL-UNNAMED:java.base/sun.nio.ch=ALL-UNNAMED"/>
			<attribute name="maven.pomderived" value="true"/>
		</attributes>
	</classpathentry>

Finally, refresh.

Method 2: Manually update Module Dependencies

In Eclipse, open the project Properties / Java Build Path / Module Dependencies ...

  • Select java.base
  • Select Configured details
  • Select Expose Package...
    • Enter Package = java.nio
    • Enter Target module = ALL-UNNAMED
    • Select button: opens
    • Hit OK
  • Select Expose Package...
    • Enter Package = jdk.internal.misc
    • Enter Target module = ALL-UNNAMED
    • Select button: exports
    • Hit OK
  • Select Expose Package...
    • Enter Package = jdk.internal.ref
    • Enter Target module = ALL-UNNAMED
    • Select button: exports
    • Hit OK
  • Select Expose Package...
    • Enter Package = sun.nio.ch
    • Enter Target module = ALL-UNNAMED
    • Select button: opens
    • Hit OK

NOTE: If you execute Maven/Update Project... from Eclipse with the option Update project configuration from pom.xml checked, all of the above will be erased, and you will have to redo it.

Known Issues

SpotBugs

  • Make sure you configure SpotBugs with the /tools/FindBugsExcludeFilter.xml file. Otherwise, you may get a lot of false positive or low risk issues that we have examined and eliminated with this exclusion file.

datasketches-java's People

Contributors

alexandersaydakov, cheddar, davecromberge, dengliming, freakyzoidberg, gianm, gitter-badger, inigoillan, jgeraerts, jmalkin, justin8712, leerho, niketh, p-, paulk-asert, pavelvesely, romseygeek, will-lauer, xcorail

datasketches-java's Issues

NPE when trying to grow the buffer in DoublesSketch

Hi, Druid uses the QuantilesSketch to compute approximate quantiles. To keep the approximate quantiles, Druid creates a DoublesUnion backed by a WritableMemory that wraps a DirectByteBuffer. Since Druid cannot know in advance how many items will arrive, but the buffer size must be fixed, it estimates an initial Memory size large enough to hold one billion items. The code snippet below shows this and can be found at https://github.com/apache/druid/blob/master/extensions-core/datasketches/src/main/java/org/apache/druid/query/aggregation/datasketches/quantiles/DoublesSketchMergeBufferAggregatorHelper.java#L47-L53.

  public void init(final ByteBuffer buffer, final int position)
  {
    final WritableMemory mem = getMemory(buffer);
    final WritableMemory region = mem.writableRegion(position, maxIntermediateSize);
    final DoublesUnion union = DoublesUnion.builder().setMaxK(k).build(region);
    putUnion(buffer, position, union);
  }

This causes DirectUpdateDoublesSketch to throw an NPE at this line when more than one billion items are actually added to the union (reported in apache/druid#11544). DirectUpdateDoublesSketch tried to allocate extra memory to hold more items than initially estimated, but its WritableMemory was a BBWritableMemoryImpl (because it was created by wrapping a DirectByteBuffer), which returns null from getMemoryRequestServer(). I thought the fix could be to return a valid memoryRequestServer from BBWritableMemoryImpl, but the Javadoc of WritableMemory.getMemoryRequestServer states that this method is supposed to return null for non-direct memory. So I was not sure what the right fix would be. What is the reason for non-direct memory to return null in getMemoryRequestServer()?

Avoid unnecessary allocations in HllSketch

MurmurHash3 does a bunch of unnecessary allocations, which means that on every HllSketch update there are at least two object allocations. All these objects are very short-lived and need to be garbage collected, which is undesirable.

Given the other sketches code doesn't do this and we don't guarantee thread safety, this seems to be an oversight. Let me know if you'd like contributions for this fix.

https://github.com/DataSketches/sketches-core/blob/master/src/main/java/com/yahoo/sketches/hash/MurmurHash3.java#L59
https://github.com/DataSketches/sketches-core/blob/master/src/main/java/com/yahoo/sketches/hash/MurmurHash3.java#L253

ThetaSketch update hangs

Here's a minimal repro:

Union sketch = new SetOperationBuilder()
        .setNominalEntries(4096)
        .buildUnion();
for (int i = 0; i < 64; i++) {
    sketch.update(i);
}

// serialize + deserialize
sketch = Sketches.wrapUnion(WritableMemory.wrap(sketch.toByteArray()));

for (int i = 64; i < 128; i++) {
    sketch.update(i);
}

sketch.update(128); // this hangs

It seems to hang in HashOperations#fastHashSearchOrInsert

I'm guessing there is some state that isn't getting serialized?

could not access class in package 'sun.misc'

I used sketch-core in a Hadoop-style environment with the QuantileSketch and Memory classes, but encountered the following problem. Because of the security policy of the platform, the library could not be used. Please help me!

Caused by: java.lang.RuntimeException: Unable to acquire Unsafe. 
    at com.yahoo.sketches.memory.UnsafeUtil.<clinit>(UnsafeUtil.java:105)
    ... 5 more
Caused by: java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "accessClassInPackage.sun.misc")
    at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
    at java.security.AccessController.checkPermission(AccessController.java:884)
    at java.lang.SecurityManager.checkPermission(SecurityManager.java:549)
    at com.alibaba.apsara.sandking.SandboxSecurityManager.checkPermission(SandboxSecurityManager.java:354)
    at java.lang.SecurityManager.checkPackageAccess(SecurityManager.java:1564)
    at com.alibaba.apsara.sandking.SandboxLauncher$AppClassLoader.loadClass(SandboxLauncher.java:247)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at com.yahoo.sketches.memory.UnsafeUtil.<clinit>(UnsafeUtil.java:68)
    ... 5 more

Merge and Update a Theta Sketch

I'm trying to update a theta sketch using a spark typed imperative aggregate. In order to achieve this, I want a representation of the sketch which supports both update and union. In my brief experimentation, this doesn't seem possible (or I could be missing something).

  • com.yahoo.sketches.theta.Sketch doesn't support update or union
  • com.yahoo.sketches.theta.CompactSketch doesn't support update, but can be merged via PairwiseSetOperations#union
  • com.yahoo.sketches.theta.Union seems the most feasible; it can be updated and converted to a compact result, but ultimately it would be fairly awkward to update and merge several times.

I hope I'm just missing something simple; thanks in advance!

DoublesUnion serde issue

Stacktrace:

java.lang.AssertionError: reqOffset: 8, reqLength: 8, (reqOff + reqLen): 16, allocSize: 8
	at com.yahoo.memory.UnsafeUtil.assertBounds(UnsafeUtil.java:141)
	at com.yahoo.memory.WritableMemoryImpl.getLong(WritableMemoryImpl.java:250)
	at com.yahoo.sketches.quantiles.DoublesUnionImpl.heapifyInstance(DoublesUnionImpl.java:80)
	at com.yahoo.sketches.quantiles.DoublesUnionBuilder.heapify(DoublesUnionBuilder.java:87)

Repro:

        DoublesUnion union = DoublesUnion.builder().build();
        DoublesUnionBuilder.heapify(Memory.wrap(union.toByteArray()));

Theta sketch - Concurrent union implementation

The implementation of a concurrent theta sketch is based on two main design choices

  1. updating threads write into a local buffer which is propagated in the background to a shared sketch
  2. query threads read from a snapshot created by the shared sketch, namely they always see a consistent state of the shared sketch.
    More details can be found here https://datasketches.github.io/docs/Theta/ConcurrentThetaSketch.html
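The two design choices above can be illustrated with exact sets standing in for sketches. This is a minimal sketch, not the library's implementation: `BufferedSharedSet` and `LocalBuffer` are hypothetical names, and propagation here runs synchronously in the writer thread when the buffer fills, whereas the design above propagates in the background.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; exact sets stand in for theta sketches.
final class BufferedSharedSet {
    private final Set<Long> shared = new HashSet<>();
    // Immutable snapshot published for readers; replaced as a whole on each merge.
    private volatile Set<Long> snapshot = Collections.emptySet();

    /** Merges a local buffer into the shared state and publishes a new snapshot. */
    synchronized void propagate(Set<Long> buffer) {
        shared.addAll(buffer);
        snapshot = Collections.unmodifiableSet(new HashSet<>(shared));
    }

    /** Query path: reads only the last published snapshot, never the mutable state. */
    int distinctCount() {
        return snapshot.size();
    }

    LocalBuffer newLocalBuffer(int limit) {
        return new LocalBuffer(limit);
    }

    /** Update path: each writer thread owns one buffer and never shares it. */
    final class LocalBuffer {
        private final int limit;
        private final Set<Long> pending = new HashSet<>();

        LocalBuffer(int limit) {
            this.limit = limit;
        }

        void update(long item) {
            pending.add(item);
            if (pending.size() >= limit) {
                flush();
            }
        }

        void flush() {
            if (!pending.isEmpty()) {
                propagate(pending);
                pending.clear();
            }
        }
    }

    public static void main(String[] args) {
        BufferedSharedSet set = new BufferedSharedSet();
        LocalBuffer buffer = set.newLocalBuffer(4);
        for (long i = 0; i < 10; i++) {
            buffer.update(i);
        }
        System.out.println("visible before final flush: " + set.distinctCount());
        buffer.flush();
        System.out.println("visible after final flush: " + set.distinctCount());
    }
}
```

Readers may lag behind writers by up to one buffer per writer, which is the trade-off the snapshot approach accepts in exchange for always seeing a consistent state.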

This issue discusses 3 design alternatives for implementing a concurrent union operation.

Design option I: This design works very similar to the design of concurrent theta sketch. It allows updating only the local buffers, and querying only the shared union object.

  1. ConcurrentSharedUnionImpl extends Union overrides all update methods with UnsupportedException
  2. ConcurrentHeapUnionBuffer extends Union overrides getResult and getByteArray methods with UnsupportedException
  3. Add to SetOperationBuilder 2 methods:
    -buildShared gets ConcurrentSharedThetaSketch returns ConcurrentSharedUnionImpl
    -buildLocal gets ConcurrentSharedUnionImpl returns ConcurrentHeapUnionBuffer

Design option II: This design has some pros over the first design alternative mainly simplicity, however it means the “backend” is a shared sketch, and querying the union object is done through one of the local buffers, which is only available to the updating threads. Question: is this a reasonable setting?

  1. ConcurrentHeapUnionBuffer extends Union supports all methods both updates and queries
  2. Add to SetOperationBuilder 1 method:
    -buildLocal gets ConcurrentSharedThetaSketch returns ConcurrentHeapUnionBuffer

Design option III: This design is very similar to the second option however users cannot query the local buffer. This design assumes that queries can be delegated to the ConcurrentSharedThetaSketch “backend” which is available somewhere in the system. Question: is this a reasonable setting?

  1. ConcurrentHeapUnionBuffer extends Union overrides getResult and getByteArray methods with UnsupportedException
  2. Add to SetOperationBuilder 1 method:
    -buildLocal gets ConcurrentSharedThetaSketch returns ConcurrentHeapUnionBuffer

Backward compatibility for empty sketches generated from older versions

The latest version of sketches doesn't seem to be backward compatible with empty sketches generated by older versions. For example, in a union operation, using an empty sketch such as AQMDAAAazJM= generated by an older version throws the error:

java.lang.AssertionError: reqOffset: 8, reqLength: 8, (reqOff + reqLen): 16, allocSize: 8

    at com.yahoo.memory.UnsafeUtil.assertBounds(UnsafeUtil.java:167)
    at com.yahoo.memory.BaseState.assertValidAndBoundsForRead(BaseState.java:330)
    at com.yahoo.memory.BaseWritableMemoryImpl.getNativeOrderedLong(BaseWritableMemoryImpl.java:284)
    at com.yahoo.memory.WritableMemoryImpl.getLong(WritableMemoryImpl.java:133)
    at com.yahoo.sketches.theta.UnionImpl.update(UnionImpl.java:278)

@AlexanderSaydakov @leerho This is the same problem that I had discussed with you and I'm logging this issue for tracking purposes.

How to serialize|deserialize sketch?

Hi,
thank you for that great library, one question: how to serialize|deserialize sketch?
For example, HeapUpdateSketch has toByteArray method and I can write to file, but how to deserialize from byte array?

Result list of FrequentItems sketch empty after a while

I'm currently running the FrequentItem Sketch on the ratings only dataset:
http://jmcauley.ucsd.edu/data/amazon/links.html
I'm creating the sketch and update it with the ASIN (Product ID) from the dataset. In the beginning it works but after a while the result simply gets empty.

TopNQueryResult{resultList=[TopNQueryResultItem{item=0007444117, estimate=1180, lowerBound=753, upperBound=1180}, TopNQueryResultItem{item=0007442920, estimate=975, lowerBound=548, upperBound=975}, TopNQueryResultItem{item=0007386648, estimate=872, lowerBound=445, upperBound=872}]}
TopNQueryResult{resultList=[TopNQueryResultItem{item=0007444117, estimate=1180, lowerBound=724, upperBound=1180}, TopNQueryResultItem{item=0007442920, estimate=975, lowerBound=519, upperBound=975}]}
TopNQueryResult{resultList=[TopNQueryResultItem{item=0007444117, estimate=1180, lowerBound=677, upperBound=1180}]}
TopNQueryResult{resultList=[TopNQueryResultItem{item=0007444117, estimate=1180, lowerBound=641, upperBound=1180}]}
TopNQueryResult{resultList=[TopNQueryResultItem{item=0007444117, estimate=1180, lowerBound=603, upperBound=1180}]}
TopNQueryResult{resultList=[]}

The TopNQueryResult is just a wrapper for reboxing the results.
SketchMapSize is 64 and error type is ErrorType.NO_FALSE_POSITIVES. I assume this is due to the error type?

0.10.0 SketchesReadOnlyException

Seeing the following after upgrading to 0.10.0:

com.yahoo.sketches.SketchesReadOnlyException: Write operation attempted on a read-only class.
	at com.yahoo.sketches.theta.DirectQuickSelectSketchR.hashUpdate(DirectQuickSelectSketchR.java:288)
	at com.yahoo.sketches.theta.UnionImpl.update(UnionImpl.java:261)

Didn't see anything helpful in the Union#update(Sketch sketchIn) javadocs regarding what might be going wrong.

Interface differences

Hi,
first thanks a lot for this amazing library! Really appreciated.

I'm currently wondering about the interface of the HllSketch. Is there a design choice in using Generics in the ItemsSketch but not in the HllSketch?
Similarly, the ThetaSketch uses a Builder Pattern while others do not. Would be great to know about these design choices!

Native Memory

When I create a NativeMemory object from a byte[] with a length greater than 1024, the remaining members of the array become 0 when I attempt to operate on them. Am I missing something?

Deserialize HLLSketch

HLLSketch has a toByteArray method for serialization, but I can't figure out how to construct a HLLSketch from a byte array. Am I missing something?

Theta sketch intersection estimation value is greater than source sketch estimation value

Sample code:

import org.apache.datasketches.memory.Memory;
import org.apache.datasketches.theta.*;

import java.nio.ByteOrder;
import java.util.Base64;

import static org.apache.datasketches.Util.DEFAULT_UPDATE_SEED;

public class ThetaSketchIntesectionApp {
    public static void main(String[] args) {
        byte[] sketch1Arr = Base64.getDecoder().decode("");
        final Memory serializedSketch = Memory.wrap(sketch1Arr,
                                                    0,
                                                    sketch1Arr.length,
                                                    ByteOrder.nativeOrder());
        Sketch sketch1 = Sketch.wrap(serializedSketch, DEFAULT_UPDATE_SEED);

        byte[] sketch2Arr = Base64.getDecoder().decode("");
        final Memory serializedSketch2 = Memory.wrap(sketch2Arr,
                                                    0,
                                                     sketch2Arr.length,
                                                    ByteOrder.nativeOrder());
        Sketch sketch2 = Sketch.wrap(serializedSketch2, DEFAULT_UPDATE_SEED);

        Sketch intSketch = Intersection.builder()
                    .buildIntersection()
                    .intersect(sketch1, sketch2);

        System.out.println("Sketch#1 Estimation: " + sketch1.getEstimate());
        System.out.println("Sketch#2 Estimation: " + sketch2.getEstimate());
        System.out.println("Sketch#1 and Sketch#2 Intersection Estimation: " + intSketch.getEstimate());
    }
}

Result:

Sketch#1 Estimation: 2.6420417306809786E8
Sketch#2 Estimation: 20693.591312562978
Sketch#1 and Sketch#2 Intersection Estimation: 64502.97194045358

Does Union support multi Sketch flavor updates?

I am trying to understand why the following code doesn't work as intended

Version

 <groupId>com.yahoo.datasketches</groupId>
 <artifactId>sketches-core</artifactId>
 <version>0.12.0</version>

Code

    Union set1 = Union.builder().setNominalEntries(4096).buildUnion();
    set1.update(1);
    set1.update(2);
    set1.update(3);
    set1.update(4);
    set1.update(5);
    set1.update(6);
    set1.update(6);
    set1.update(6);

    Union set2 = Union.builder().setNominalEntries(4096).buildUnion();
    set2.update(1);

    AnotB aNotB = Sketches.setOperationBuilder().setNominalEntries(4096).buildANotB();
    aNotB.update(set1.getResult(), set2.getResult());
    System.out.println(aNotB.getResult().getEstimate());

    Union finalUnion = Union.builder().setNominalEntries(4096).buildUnion();
    finalUnion.update(aNotB.getResult());
    System.out.println(finalUnion.getResult().getEstimate());

I am expecting the printout to be

5.0
5.0

but instead I get

5.0
0.0

I must be doing something trivially wrong, or does Union update only support compactSketch from Union?

Maximal estimate error

Hi, is there a formula for CPC or HLL sketches to calculate the maximum error of an estimate? I wonder what the practices are for assuring correctness (high accuracy) of execution while using sketches.

Sketches, as probabilistic models for operations such as distinct counting, look great in terms of both storage and performance. However, the crucial thing for me is whether there are hard boundaries for accuracy and error.
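As a rough orientation for the question above: the classical HyperLogLog analysis gives a relative standard error of about 1.04 / sqrt(m) for m registers, so error bounds of this kind are confidence intervals rather than hard limits. The sketch below only evaluates that textbook formula; the class name `HllErrorEstimate` is hypothetical, and the constants that apply to this library's HLL and CPC implementations may differ, so the library's own bound methods and Javadoc remain the reference.

```java
// Hypothetical helper evaluating the classical HyperLogLog error formula.
final class HllErrorEstimate {

    /** Classical HyperLogLog relative standard error for m = 2^lgK registers. */
    static double relativeStandardError(int lgK) {
        return 1.04 / Math.sqrt(Math.pow(2.0, lgK));
    }

    /** Approximate interval around an estimate at the given number of standard deviations. */
    static double[] interval(double estimate, int lgK, int numStdDev) {
        double delta = numStdDev * relativeStandardError(lgK) * estimate;
        return new double[] {estimate - delta, estimate + delta};
    }

    public static void main(String[] args) {
        for (int lgK = 10; lgK <= 14; lgK += 2) {
            System.out.printf("lgK=%d rse=%.5f%n", lgK, relativeStandardError(lgK));
        }
    }
}
```

Each increase of lgK by 2 quadruples the register count and halves the relative standard error, which is the usual lever for trading memory against accuracy.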

Util.java throws InvalidPathException on Windows

The test suite fails with numerous errors on Windows:

[ERROR] Failures:
[ERROR]   UtilTest.resourceBytesCorrect:348 » InvalidPath Illegal char <:> at index 2: /...
[ERROR]   KllFloatsSketchTest.deserializeOneItemV1:355 » InvalidPath Illegal char <:> at...
[ERROR]   ForwardCompatibilityTest.check030_1000:49->getAndCheck:121 » InvalidPath Illeg...
[ERROR]   ForwardCompatibilityTest.check030_50:39->getAndCheck:121 » InvalidPath Illegal...
[ERROR]   ForwardCompatibilityTest.check060_1000:69->getAndCheck:121 » InvalidPath Illeg...
[ERROR]   ForwardCompatibilityTest.check060_50:59->getAndCheck:121 » InvalidPath Illegal...
[ERROR]   ForwardCompatibilityTest.check080_1000:89->getAndCheck:121 » InvalidPath Illeg...
[ERROR]   ForwardCompatibilityTest.check080_50:79->getAndCheck:121 » InvalidPath Illegal...
[ERROR]   ForwardCompatibilityTest.check083_1000:109->getAndCheck:121 » InvalidPath Ille...
[ERROR]   ForwardCompatibilityTest.check083_50:99->getAndCheck:121 » InvalidPath Illegal...
[ERROR]   ArrayOfDoublesUnionTest.noSupportHeapifyV0_9_1 » Test
Expected exception of t...
[ERROR]   ArrayOfDoublesUnionTest.noSupportWrapV0_9_1 » Test
Expected exception of type...
[ERROR]   CompactSketchWithDoubleSummaryTest.serialVersion1Compatibility:192 » InvalidPath
[INFO]
[ERROR] Tests run: 1387, Failures: 13, Errors: 0, Skipped: 0

The stacktrace of one of the errors is shown below:

[ERROR] serialVersion1Compatibility(org.apache.datasketches.tuple.CompactSketchWithDoubleSummaryTest)  Time elapsed: 0.002 s  <<< FAILURE!
java.nio.file.InvalidPathException: Illegal char <:> at index 2: /D:/projects/incubator-datasketches-java/target/test-classes/CompactSketchWithDoubleSummary4K_serialVersion1.bin
        at sun.nio.fs.WindowsPathParser.normalize(WindowsPathParser.java:182)
        at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:153)
        at sun.nio.fs.WindowsPathParser.parse(WindowsPathParser.java:77)
        at sun.nio.fs.WindowsPath.parse(WindowsPath.java:94)
        at sun.nio.fs.WindowsFileSystem.getPath(WindowsFileSystem.java:255)
        at java.nio.file.Paths.get(Paths.java:84)
        at org.apache.datasketches.Util.getResourceBytes(Util.java:761)
        at org.apache.datasketches.tuple.CompactSketchWithDoubleSummaryTest.serialVersion1Compatibility(CompactSketchWithDoubleSummaryTest.java:192)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:124)
        at org.testng.internal.Invoker.invokeMethod(Invoker.java:583)
        at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:719)
        at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:989)
        at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:125)
        at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:109)
        at org.testng.TestRunner.privateRun(TestRunner.java:648)
        at org.testng.TestRunner.run(TestRunner.java:505)
        at org.testng.SuiteRunner.runTest(SuiteRunner.java:455)
        at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:450)
        at org.testng.SuiteRunner.privateRun(SuiteRunner.java:415)
        at org.testng.SuiteRunner.run(SuiteRunner.java:364)
        at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
        at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:84)
        at org.testng.TestNG.runSuitesSequentially(TestNG.java:1208)
        at org.testng.TestNG.runSuitesLocally(TestNG.java:1137)
        at org.testng.TestNG.runSuites(TestNG.java:1049)
        at org.testng.TestNG.run(TestNG.java:1017)
        at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:135)
        at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeMulti(TestNGDirectoryTestSuite.java:193)
        at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:94)
        at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:146)
        at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
        at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
        at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
        at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
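The message shows a path string of the form /D:/..., which is what a file URL's path component looks like on Windows, and the trace shows it reaching Paths.get from Util.getResourceBytes. Assuming the string comes from a resource URL, one common way to avoid the problem is to convert the URL to a URI and let Paths.get(URI) do the parsing. The following is a minimal sketch of that technique, not the project's actual fix; `ResourceBytes` and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not the library's Util class.
final class ResourceBytes {

    /** Reads the bytes behind a file URL without passing URL.getPath() to Paths.get(String). */
    static byte[] read(URL url) {
        try {
            // Paths.get(URI) delegates parsing to the file system provider,
            // so a drive-letter URL is mapped to a valid platform path.
            Path path = Paths.get(url.toURI());
            return Files.readAllBytes(path);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URI: " + url, e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes data to a temporary file and reads it back through its file URL. */
    static byte[] roundTrip(byte[] data) {
        try {
            Path tmp = Files.createTempFile("resource", ".bin");
            try {
                Files.write(tmp, data);
                return read(tmp.toUri().toURL());
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new byte[] {1, 2, 3}).length + " bytes read");
    }
}
```

In a test, the URL would typically come from a call such as getClass().getClassLoader().getResource(name) instead of a temporary file.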

Alternative to reservoir sketches

I recently did a review of the literature and some experiments with reservoir sampling, and found there are lower overhead sampling algorithms equivalent to the implementation of "algorithm R" in this library. I would like to contribute the best performing implementation to the library.

I have a few questions

  • does this library need another reservoir sketch and is it likely to accept one? That is, is preparing a patch a worthwhile use of my time?
  • Is my testing strategy agreeable? If not, what kind of tests would I need to provide? My strategy is:
    • generate input data from a known parametric distribution
    • sort the data to introduce sequential bias which must not be evident in the sample
    • take a sample
    • evaluate the maximum likelihood estimator over the sample to check I recover the distribution parameters within some tolerance.
  • is it better to replace or to complement ReservoirLongsSketch/ReservoirItemsSketch?
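For reference, the textbook form of "algorithm R" mentioned above can be sketched as follows. This is a minimal sketch over a long array, not the code of ReservoirLongsSketch; the class name `AlgorithmR` is hypothetical. The main method feeds sorted input, in the spirit of the proposed testing strategy, and prints the sample mean for comparison with the population mean.

```java
import java.util.Random;

// Hypothetical textbook implementation, not the library's reservoir sketch.
final class AlgorithmR {

    /** Returns a uniform random sample of up to k items from the stream. */
    static long[] sample(long[] stream, int k, Random rnd) {
        long[] reservoir = new long[Math.min(k, stream.length)];
        for (int i = 0; i < stream.length; i++) {
            if (i < k) {
                // Fill phase: the first k items are always kept.
                reservoir[i] = stream[i];
            } else {
                // Item i replaces a random slot with probability k / (i + 1).
                int j = rnd.nextInt(i + 1);
                if (j < k) {
                    reservoir[j] = stream[i];
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Sorted input introduces sequential bias that must not show in the sample.
        long[] stream = new long[100_000];
        for (int i = 0; i < stream.length; i++) {
            stream[i] = i;
        }
        long[] s = sample(stream, 1_000, new Random());
        double mean = 0;
        for (long v : s) {
            mean += v;
        }
        System.out.println("sample mean: " + mean / s.length + " (population mean 49999.5)");
    }
}
```

The algorithm draws one random number per item after the fill phase, which is the per-item overhead that skip-based alternatives aim to reduce.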

getPMF of UpdateDoublesSketch is giving different results for same data

I am using a tuple sketch to get the distribution of data through a probability histogram of the values. For larger datasets I get a different distribution for the same data.

I am doing something like this:

// 'list' holds the input sketches
ArrayOfDoublesUnion unionSketch = getUnionOfSketches(list);

ArrayOfDoublesSketch result = unionSketch.getResult();
UpdateDoublesSketch quantilesSketch = DoublesSketch.builder().build();
ArrayOfDoublesSketchIterator itr = result.iterator();
while (itr.next()) {
    // feed the first summary value of each retained entry into the quantiles sketch
    quantilesSketch.update(itr.getValues()[0]);
}
double[] fr = quantilesSketch.getPMF(new double[] {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0});

public static ArrayOfDoublesUnion getUnionOfSketches(List<ArrayOfDoublesSketch> arrayOfDoublesSketches) {
    ArrayOfDoublesUnion union = new ArrayOfDoublesSetOperationBuilder().buildUnion();
    for (ArrayOfDoublesSketch arrayOfDoublesSketch : arrayOfDoublesSketches) {
        union.update(arrayOfDoublesSketch);
    }
    return union;
}

In fr[] I am getting different values for same data on multiple runs. Basically the values are oscillating between 2-3 values.

Question related to union performance

Hi, I just wanted to ask whether CpcSketch is the fastest sketch for performing unions. Is there any other way to speed up unions besides decreasing the logK parameter?

static class UnionPerfTest {
    public static void main(String[] args) {

        int logK = 11;
        Random r = new Random();

        CpcSketch fllcpc = new CpcSketch(logK);
        IntStream.generate(() -> r.nextInt(10_000_000)).distinct().limit(100_000).forEach(fllcpc::update);

        CpcSketch sllcpc = new CpcSketch(logK);
        IntStream.generate(() -> r.nextInt(10_000_000)).distinct().limit(1_000_000).forEach(sllcpc::update);

        long blackHole = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000; i++) {
            CpcUnion cpcUnion = new CpcUnion(logK);

            cpcUnion.update(sllcpc);
            cpcUnion.update(fllcpc);

            blackHole += cpcUnion.getResult().getEstimate();
        }

        System.out.println("Millis: " + (System.currentTimeMillis() - start));
        System.out.println("bc: " + blackHole);
    }
}

NullPointer exception when resize factor != X1

Hi, I'm running data sketches on spark using scala serialisation wrapper like this.

@throws(classOf[IOException])
  private def writeObject(out: ObjectOutputStream): Unit = synchronized {
    out.writeObject(union.toByteArray)
  }

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = synchronized {
    val bytes = in.readObject().asInstanceOf[Array[Byte]]
    Sketches.wrapUnion(new NativeMemory(bytes))
  }

But when my sketches don't have ResizeFactor.X1 property set I get a null pointer error:

Caused by: java.lang.NullPointerException
  at com.yahoo.sketches.theta.DirectQuickSelectSketch.hashUpdate(DirectQuickSelectSketch.java:404)
  at com.yahoo.sketches.theta.UpdateSketch.update(UpdateSketch.java:169)
  at com.yahoo.sketches.theta.UnionImpl.update(UnionImpl.java:269)
  at com.myproject.spark.SerUnion.update(SerUnion.scala:32)
  at $anonfun$createDataSketches$2.apply(<console>:73)
  at $anonfun$createDataSketches$2.apply(<console>:73)
  at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:190)
  at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:189)
  at org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:144)
  at org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
  at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:195)
  at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

I'm using version 0.8.3 for the following modules.

 'com.yahoo.datasketches:sketches:0.8.3'
 'com.yahoo.datasketches:memory:0.8.3'
 'com.yahoo.datasketches:sketches-core:0.8.3'

Quantile error bounds

Is it possible to get upper/lower bounds for quantile results? E.g. I'd like to know that the median is 5.0 +/- 0.05 with 95% confidence.

There seem to be comments that mention that the normalized rank error can't directly be applied to determine these bounds, but surely there must be some way to bound the error here?
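One common way to turn a rank error into bounds on a quantile is to query the quantile at the shifted ranks r - eps and r + eps. The following is a minimal sketch of that idea on exact, sorted data. It does not use any DataSketches API; the class name, the method names, and the value eps = 0.02 are assumptions for illustration only. The confidence attached to the interval depends on the guarantees of the sketch that supplies eps and is not modelled here.

```java
import java.util.Arrays;

public class QuantileBounds {
    // Value at normalized rank r (clamped to [0, 1]) of an ascending-sorted array.
    public static double quantile(double[] sorted, double r) {
        double clamped = Math.max(0.0, Math.min(1.0, r));
        int idx = (int) Math.round(clamped * (sorted.length - 1));
        return sorted[idx];
    }

    // Returns {lower, estimate, upper} for rank r, given a normalized rank error eps.
    public static double[] bounds(double[] sorted, double r, double eps) {
        return new double[] {
            quantile(sorted, r - eps),
            quantile(sorted, r),
            quantile(sorted, r + eps)
        };
    }

    public static void main(String[] args) {
        double[] data = new double[101];
        for (int i = 0; i <= 100; i++) {
            data[i] = i; // 0.0 .. 100.0, already sorted
        }
        // eps = 0.02 is an assumed value, not taken from any sketch configuration.
        System.out.println(Arrays.toString(bounds(data, 0.5, 0.02)));
    }
}
```

With a real sketch, the two outer lookups would be quantile queries against the sketch at the shifted ranks, so the width of the interval in value space depends on how dense the data is around the requested rank.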

Naming Conventions ItemSketch

Small nit. Since we (Splice Machine) are using both frequencies and quantiles, it would be nice if they were not named the same (ItemSketch)...

Update sketches will have large error rate in some cases.

Update sketches do not keep track of duplicate data.
So in a situation like this:

val sketch1 = UpdateSketch()
sketch1.update(1)
sketch1.update(1)
sketch1.update(xxx)
val sketch2 = UpdateSketch()
sketch2.update(1)
sketch2.update(xxx)
anotb.update(sketch1, sketch2)
anotb.getResult()

In this case, because the number 1 is added twice and removed only once, it should still be counted in the final result.
But it is not, which causes a very large error rate on small data.
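For context, theta sketches estimate the number of distinct items, so updating with the same value twice has the same effect as updating once. The minimal sketch below uses java.util.HashSet as an exact stand-in for the sketches (an assumption for illustration; no DataSketches API is used, and the value 2 stands in for the xxx placeholder) to show why A-not-B comes out empty under set semantics.

```java
import java.util.HashSet;
import java.util.Set;

public class DistinctSetDifference {
    // Exact A-not-B over distinct values; the number of times a value was added is not tracked.
    public static Set<Integer> aNotB(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new HashSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> a = new HashSet<>();
        a.add(1);
        a.add(1); // duplicate update: the set is unchanged
        a.add(2);

        Set<Integer> b = new HashSet<>();
        b.add(1);
        b.add(2);

        System.out.println(aNotB(a, b).size()); // prints 0
    }
}
```

Counting how many times each value occurs is a multiset (frequency) problem rather than a distinct-count problem, so it needs a different data structure than a plain set.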

PairwiseSetOperations.aNotB throws ArrayOutOfBoundsException

Reproduced with:

        UpdateSketch one = new UpdateSketchBuilder().setNominalEntries(4096).build();
        UpdateSketch two = new UpdateSketchBuilder().setNominalEntries(4096).build();
        UpdateSketch three = new UpdateSketchBuilder().setNominalEntries(4096).build();
        for (int i = 0; i < 1_000_000; i++) {
            one.update(i);
            two.update(1_000_000 + i);
            three.update(2_000_000 + i);
        }
        PairwiseSetOperations.aNotB(PairwiseSetOperations.intersect(one.compact(), two.compact()), three.compact());

The bug itself doesn't seem to be in the implementation of aNotB, but rather that CompactSketch.isEmpty() (which aNotB checks) is deceiving, and may return false despite the internal array being length 0. When aNotB checks to see if the sketch is empty, it sees it is not, and then proceeds to attempt the set difference, which fails on the zero-length array. If you attempt the same thing, but replace one.compact() with a newly constructed empty sketch, aNotB works as expected.

ArrayIndexOutOfBoundsException during serialization

Seeing this sporadically for FrequentItems:

java.lang.ArrayIndexOutOfBoundsException: 13
	at com.yahoo.sketches.frequencies.ReversePurgeItemHashMap.getActiveValues(ReversePurgeItemHashMap.java:180)
	at com.yahoo.sketches.frequencies.ItemsSketch.toByteArray(ItemsSketch.java:316)

as well as Quantiles:

java.lang.ArrayIndexOutOfBoundsException: null
	at java.lang.System.arraycopy(Native Method)
	at com.yahoo.sketches.quantiles.ItemsByteArrayImpl.combinedBufferToItemsArray(ItemsByteArrayImpl.java:104)
	at com.yahoo.sketches.quantiles.ItemsByteArrayImpl.toByteArray(ItemsByteArrayImpl.java:55)
	at com.yahoo.sketches.quantiles.ItemsSketch.toByteArray(ItemsSketch.java:471)
	at com.yahoo.sketches.quantiles.ItemsSketch.toByteArray(ItemsSketch.java:461)

Has anyone seen this before? Might it be related to memory corruption as we suspected in #175?

Does the Kappa parameter for KLL determine the number of retained elements?

AFAIK KLL uses a cascade of compactors to produce its interval estimates.

What I'm wondering is what exactly is the effect of the K parameter. Does setting K to a specific value mean that overall, that many values will be retained between all the compactors?

Or does K indicate some other aspect of the sketch? In that case, is it actually possible to determine how many values are being retained in the compactors?

CompactSketch ArrayIndexOutOfBoundsException

Hi, we are using Theta Sketches java library to calculate reach metrics. Based on the Java Example from the Data Sketch website, we are using Union to join multiple sketches and then get the CompactSketch in binary format.

However, we observe failures when getting the CompactSketch from the Union, with the following stack trace:

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 137 out of bounds for length 137 
at org.apache.datasketches.theta.CompactSketch.compactCache(CompactSketch.java:97) 
at org.apache.datasketches.theta.UnionImpl.getResult(UnionImpl.java:238) 
at org.apache.datasketches.theta.UnionImpl.getResult(UnionImpl.java:212) 

Can you guys let us know under what case this would happen and what's the root cause?

Thanks,
Bill

DirectDoublesSketch has error (maybe in propagate-carry?)

Running the following code (yes, allocating way more memory than needed):

final int k = 16;
final int n = k * 2 + 7;
final ByteBuffer bb = ByteBuffer.allocateDirect(100 + n << 3);
final Memory mem = AllocMemory.wrap(bb);

final DoublesSketchBuilder dsb = DoublesSketch.builder();
dsb.initMemory(mem);
final DoublesSketch ds = dsb.build(k);

for (int i = 0; i < n; ++i) {
  ds.update(i);
}

System.out.println(ds.toString(true, true));

The base buffer seems OK, but data in the first level has unexpected 0.0s. If I use n = k * 4 + m, I get 0.0 values in the middle of the level.

Quantiles DirectDoublesSketch DATA DETAIL:

 BaseBuffer   :       32.0      33.0      34.0      35.0      36.0      37.0      38.0
 Valid | Level
   T       0:        0.0       0.0       0.0       2.0       4.0       6.0       8.0      10.0      12.0      14.0      16.0      18.0      20.0      22.0      24.0      26.0
### END DATA DETAIL

Quantiles Sketch Density

If you have a quantiles sketch for a bounded set of data, how would you provide an estimate for data out of those bounds?

For example,

Known values:

int[] values = new int[]{1,2,3,4,5,6,7,8,9,10};

Estimate Between
[12,15]

Can I get the average density from the known values and then apply that to the unknown values?

Thanks
John

Duplicate detection

Hey folks,

Was wondering if you had any thoughts on approximate duplicate detection algorithms. One could conceivably use an HLL (compare cardinality to total stream length), but I don't think the confidence interval there will be satisfyingly tight.

Found this paper which seems potentially interesting: http://www.vldb.org/pvldb/vol6/p589-dutta.pdf

In general, have you had any requests for supporting approximate membership queries?

EDIT: let me know if this isn't the proper channel for discussions like these

Testing equality of HllSketch

The following code snippet fails:

        Union hll1 = new Union(12);
        Stream.of("1", "2", "3", "4", "5", "6", "7", "8", "9").forEach(hll1::update);

        Union hll2 = Union.heapify(Memory.wrap(hll1.toCompactByteArray()));

        if (!Arrays.equals(hll1.toCompactByteArray(), hll2.toCompactByteArray())) {
            throw new AssertionError("Hlls not equal");
        }

(using com.yahoo.sketches.hll.Union)

Is this a bug in hll serde?

Track Purge count in ReversePurgeItemHashMap

I want to add a purgeCount field to the ReversePurgeItemHashMap.
I also want to expose this in the ItemsSketch object.

My use case is this: I want to be able to tell if my Stream is an enum (i.e. contains < k distinct values). This can be achieved by creating an ItemsSketch with capacity k, and then seeing if the map is ever purged.

I am happy to implement this; what do you guys think?
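The use case can be illustrated with an exact stand-in, shown below as a minimal sketch. It is not the library's ReversePurgeItemHashMap and does not purge anything; the class BoundedDistinctTracker and its methods are hypothetical names. Its overflow counter only plays the role that the proposed purgeCount would play: it stays at zero as long as the stream has shown at most k distinct values.

```java
import java.util.HashSet;
import java.util.Set;

public class BoundedDistinctTracker<T> {
    private final int k;
    private final Set<T> seen = new HashSet<>();
    // Number of updates that carried an untracked value after k values were already tracked.
    private long overflowCount = 0;

    public BoundedDistinctTracker(int k) {
        this.k = k;
    }

    public void update(T item) {
        if (seen.contains(item)) {
            return; // already tracked, nothing to do
        }
        if (seen.size() < k) {
            seen.add(item);
        } else {
            overflowCount++; // a (k+1)-th distinct value has shown up
        }
    }

    public long getOverflowCount() {
        return overflowCount;
    }

    // True while at most k distinct values have been observed.
    public boolean looksLikeEnum() {
        return overflowCount == 0;
    }
}
```

A caller would feed every stream item to update and check looksLikeEnum at the end; with a real frequent-items sketch the same check would read the purge counter instead.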

tuple.QuickSelectSketch throws NullPointerException

I'm running into this NullPointerException error when getting results for a tuple.Union:

java.lang.NullPointerException
at com.yahoo.sketches.tuple.QuickSelectSketch.rebuild(QuickSelectSketch.java:434)
at com.yahoo.sketches.tuple.QuickSelectSketch.rebuild(QuickSelectSketch.java:409)
at com.yahoo.sketches.tuple.Union.getResult(Union.java:65)

After reading the source code for tuple.QuickSelectSketch, this seems to be linked to a null "summaries_" array.

I'm having difficulty diagnosing the problem because it doesn't seem to affect every getResult operation, only a few of them. The only similarity I can see is that the failing Unions have a single tuple.CompactSketch with 0 retained entries and a theta less than 1.0--though some Unions with a tuple.CompactSketch with 0 retained entries and a theta less than 1.0 don't fail.

The code is something like:

val union = new Union(k, summarySetOperations)
union.update(sketch)
union.getResult

Range Selectivity

My reading of the quantiles sketch api is that it includes the values provided as the bounds.

int[] foo = new int[] {1,2,3,4,5,6,7,8,9,10};

so getCDF can easily handle [2,5].

Is there a clear way to handle (2,5)?

Would I need to increment in the outer code (transform (2,5) to [3,4])?
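For integer-valued data the transformation suggested in the question is exact, because (2,5) and [3,4] contain the same integers. The minimal sketch below checks this on the exact data rather than on a sketch; the class and method names are hypothetical, and the shift by one does not carry over to floating-point data.

```java
public class IntervalSelectivity {
    // Fraction of values v with lo <= v <= hi.
    public static double closed(int[] values, int lo, int hi) {
        int count = 0;
        for (int v : values) {
            if (v >= lo && v <= hi) {
                count++;
            }
        }
        return (double) count / values.length;
    }

    // Fraction of values v with lo < v < hi, valid for integer data only.
    public static double open(int[] values, int lo, int hi) {
        return closed(values, lo + 1, hi - 1);
    }

    public static void main(String[] args) {
        int[] foo = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(closed(foo, 2, 5)); // prints 0.4
        System.out.println(open(foo, 2, 5));   // prints 0.2
    }
}
```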

Need flexible ArrayOfByte tuple sketch with creation option from theta sketch

I'm building a Druid aggregator to do distinct counting with a frequency estimate (how many unique values occurred a minimum number of times). To build this, I want to use an efficient tuple sketch implementation. Druid has an aggregator that uses the ArrayOfDouble tuple sketch, but I really want to do integral counting, so a LongTuple sketch would be ideal, but an ArrayOfByte tuple sketch would probably allow me the most flexibility, allowing me to treat the bytes as an int, long, or any other value that I want.

For this to work, I would need to control how the tuple value is computed in both union and intersection collision cases. ArrayOfDouble assumes double sum when a union occurs, but allows an ArrayOfDoubleSketchCombiner instance to be used to compute the result in an intersection case. I would want the Combiner to be used for Union, Intersection, and Difference (separate methods for each).

The other feature I need is the ability to create one of these new sketches from a theta sketch, providing a default value for the tuple. It's not hard to do this, but it requires access to some internals of the theta sketch. If the new tuple sketch lived in the same package as the theta sketch, it could access these. If not, some part of the protections in theta sketch would need to be loosened. In order to properly construct a tuple sketch from a theta sketch, it looks like I need access to Sketch.getCache(), Sketch.getThetaLong(), and Sketch.getSeedHash().

FrequentItems IllegalArgumentException

Still working on the repro, but we recently saw this in our logs:

java.lang.IllegalArgumentException: reqOffset: 4, reqLength: , (reqOff + reqLen): 1413567575, allocSize: 480
  at com.yahoo.memory.UnsafeUtil.checkBounds(UnsafeUtil.java:156)
  at com.yahoo.sketches.ArrayOfStringsSerDe.deserializeFromMemory(ArrayOfStringsSerDe.java:54)
  at com.yahoo.sketches.ArrayOfStringsSerDe.deserializeFromMemory(ArrayOfStringsSerDe.java:23)
  at com.yahoo.sketches.frequencies.ItemsSketch.getInstance(ItemsSketch.java:263)

Posting here in case someone might know just from the message what might be going wrong.

cpc/hll - hardcoded hash function

I noticed that both CpcSketch and HllSketch use MurmurHash3 hashing, and it's hardcoded. In some cases, inputs may have been hashed already for other purposes, or a faster hash function may be preferred. It would be nice if there were a way to specify the hash function (constructor, setter, builder, etc.), with MurmurHash3 as the default.

Python Bindings for HLL Sketch

As per the discussion in this thread in the google group - https://groups.google.com/d/msg/sketches-user/8TaAXaT_6qo/A2JJkIuZBQAJ

there is traction for having a Python binding for the different sketch families in sketches-core, similar to what the library has for Pig and Hive. I was thinking we could get started on the Python adaptors by having a wrapper library for the HyperLogLog sketches. Would that be a good place to start?

For Pig and Hive the bindings were defined as UDFs that Pig and Hive scripts can use. How will we define the wrapper classes in Python? Will it be something along the lines of Jython - http://www.jython.org/jythonbook/en/1.0/JythonAndJavaIntegration.html

NULL handling in Quantiles sketch

If the input data contains NULL values, the Quantiles sketch throws a NullPointerException.

$ cat data.json
{"value": 1, "category": "a"}
{"value": 2, "category": "a"}
{"value": null, "category": "a"}

$ cat test.pig
register memory-0.11.0.jar;
register sketches-core-0.11.1.jar;
register sketches-pig-0.11.0.jar;

define dataToSketch com.yahoo.sketches.pig.quantiles.DataToDoublesSketch();
--a = load 'data.txt' as (value:double, category);
a = load 'data.json' Using JsonLoader('value:double, category: chararray');
b = group a by category;
c = foreach b generate flatten(group) as (category), flatten(dataToSketch(a.value)) as sketch;
dump c;

$ pig -x local test.pig 

...
2019-02-05 19:38:49,748 [Thread-24] WARN  org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing (Name: b: Local Rearrange[tuple]{chararray}(false) - scope-37 Operator Key: scope-37): org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POUserFunc (Name: POUserFunc(com.yahoo.sketches.pig.quantiles.DataToDoublesSketch$IntermediateFinal)[tuple] - scope-28 Operator Key: scope-28) children: null at []]: java.lang.NullPointerException
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:287)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.processOnePackageOutput(PigCombiner.java:198)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:176)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigCombiner$Combine.reduce(PigCombiner.java:52)
	at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
	at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1502)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1436)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
	at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POUserFunc (Name: POUserFunc(com.yahoo.sketches.pig.quantiles.DataToDoublesSketch$IntermediateFinal)[tuple] - scope-28 Operator Key: scope-28) children: null at []]: java.lang.NullPointerException
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:364)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:404)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:321)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
	... 12 more
Caused by: java.lang.NullPointerException
	at com.yahoo.sketches.pig.quantiles.DataToDoublesSketch$IntermediateFinal.exec(DataToDoublesSketch.java:320)
	at com.yahoo.sketches.pig.quantiles.DataToDoublesSketch$IntermediateFinal.exec(DataToDoublesSketch.java:269)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:326)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextTuple(POUserFunc.java:365)
	at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:359)
	... 15 more
Input(s):
Failed to read data from "data.json"

Output(s):
Failed to produce result in "file:/tmp/temp-120668530/tmp-2117801155"

Counting Nulls

I am currently using Theta, Quantile, and Frequent Sketches to plug in to an RDBMS statistics structure. It was not clear to me whether there is an existing place to count nulls or whether that is something I should implement separately.

Set Operations should support implied hash seeds

Currently, all the set operations enforce that incoming sketches have a specified hash seed. If no hash seed is specified, the default hash seed is implied. As an option, these operations should be able to infer the hash seed from the first sketch seen, allowing the operations to be used with non-standard seeds without having to specify the seed (a secret) up front. This has the effect of moving a possible error (mismatched seeds) later in the processing and possibly hiding the error altogether if only a single sketch is seen, but has the advantage that users of the set operations don't have to know what their seeds are.
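The proposed behaviour can be sketched roughly as follows. This is a hypothetical illustration only: SeedInferringUnion is not a DataSketches class, and a plain long stands in for whatever seed (or seed hash) an incoming sketch carries. It shows the trade-off described above: the first sketch fixes the seed, and a mismatch is only detected when a later sketch arrives.

```java
public class SeedInferringUnion {
    private Long seed = null; // unknown until the first sketch is seen
    private int accepted = 0;

    // seedOfSketch is a hypothetical stand-in for the seed information carried by a sketch.
    public void update(long seedOfSketch) {
        if (seed == null) {
            seed = seedOfSketch; // infer the seed from the first sketch
        } else if (seed.longValue() != seedOfSketch) {
            throw new IllegalArgumentException(
                "Seed mismatch: expected " + seed + " but got " + seedOfSketch);
        }
        accepted++;
    }

    public int getAccepted() {
        return accepted;
    }

    public static void main(String[] args) {
        SeedInferringUnion union = new SeedInferringUnion();
        union.update(42L); // arbitrary example seed, becomes the expected seed
        union.update(42L); // same seed, accepted
        try {
            union.update(7L); // different seed, rejected only at this point
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```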

Comparison between DataSketches, DDSketch, t-digests et al.

I see DataSketches and DDSketch both offer a floating-point quantile-based sketch based on Agarwal et al.'s Mergeable summaries. I also noticed that DataSketches maintains an extensive test suite for benchmarking purposes. It would be nice to see a performance comparison between these two mergeable quantile sketching implementations, as their interfaces (at least at first glance) appear to be quite similar. Or are there any significant differences I might have overlooked? Thank you!

cc: @CharlesMasson @richardstartin
