Giter VIP home page Giter VIP logo

dynatrace-oss / dynahist Goto Github PK

View Code? Open in Web Editor NEW
45.0 5.0 8.0 1.88 MB

DynaHist: A Dynamic Histogram Library for Java

License: Apache License 2.0

Java 99.23% Python 0.77%
quantile-estimation data-sketches histogram data-structures approximation-algorithms quantiles memory-efficiency dynamic-allocation histogram-library hdrhistogram order-statistics java quantile ddsketch compression-algorithm sketches

dynahist's Introduction

logo

DynaHist: A Dynamic Histogram Library for Java

License Maven Central javadoc Quality Gate Status Coverage Java 11 or higher

This Java library contains histogram implementations with configurable bin layouts specifically designed for fast value recording. The basis are three different implementations:

  • The static histogram enables an allocation-free recording of values, as the internal bin count array is already fully occupied during construction.
  • The dynamic histogram is memory-efficient as it resizes the internal bin count array on demand. Furthermore, it dynamically adjusts the number of bits used for each counter.
  • The preprocessed histogram is an immutable implementation which contains the cumulative bin counts. In this way sublinear queries for order statistics are possible through binary search. If many of those queries are performed subsequently, it is recommended to convert to a preprocessed histogram first.

The library ships with predefined bin layout implementations:

  • LogOptimalLayout: Allows to specify absolute and relative bin width limits, where one of them must be satisfied over a given value range. In this way the error of recorded values can be controlled. This layout is most space-efficient but involves a logarithm evaluation.
  • LogLinearLayout: Trades space for speed by replacing the logarithm of LogOptimalLayout by a piecewise linear function. For the same bin width limits this layout results in up to 44% more bins and therefore in a correspondingly larger memory footprint.
  • LogQuadraticLayout: This layout is a compromise between LogOptimalLayout and LogLinearLayout. It uses a piecewise quadratic approximation to reduce the space overhead to about 8% compared to the optimal mapping.
  • CustomLayout: Allows to set the bin boundaries individually. It can be used to map a histogram, which was recorded with some fine-grained bin layout, to a coarser custom bin layout with well-defined bins. For example, this can be useful as a preparatory step for creating a visualization of the histogram. This mapping should not be used for high-frequency recording as it involves a slow binary search.
  • OpenTelemetryExponentialBucketsLayout: A layout that is compatible with OpenTelemetry exponential histograms (see specification).

Basic Functionality

// Defining a bin layout
Layout layout = LogQuadraticLayout.create(1e-5, 1e-2, -1e9, 1e9); // use bins with maximum 
                                                                  // absolute width of 1e-5 
                                                                  // or relative width of 1% 
                                                                  // to cover the range [-1e9, 1e9]
// Creating a dynamic histogram
Histogram histogram = Histogram.createDynamic(layout);             

// Adding values to the histogram
histogram.addValue(-5.5); // adds the value -5.5
histogram.addValue(4.3, 6); // adds the value 4.3 six times
histogram.addAscendingSequence(i -> i + 1, 1000000000); // adds the first billion positive integers

// Querying the histogram
histogram.getTotalCount();
histogram.getMin();
histogram.getMax();
histogram.getValue(1); // returns an estimate of the 2nd smallest value
histogram.getValue(3, ValueEstimator.UPPER_BOUND); // returns an upper bound of the 4th smallest value
histogram.getQuantile(0.5); // returns an estimate of the sample median
histogram.getQuantile(0.5, ValueEstimator.LOWER_BOUND); // returns a lower bound of the sample median

// Merging histograms
histogram.addHistogram(otherHistogram);

// Serialization
histogram.write(dataOutput); // write histogram to a java.io.DataOutput
histogram.readAsDynamic(layout, dataInput); // read dynamic histogram from a java.io.DataInput

Getting Started

To add a dependency on DynaHist using Maven, use the following:

<dependency>
    <groupId>com.dynatrace.dynahist</groupId>
    <artifactId>dynahist</artifactId>
    <version>1.4</version>
</dependency>

To add a dependency using Gradle:

implementation group: 'com.dynatrace.dynahist', name: 'dynahist', version: '1.4'

History

At Dynatrace we were looking for a data sketch with a fast update time, which can also answer order statistic queries with error guarantees. As an example, such a data structure should be able to provide the 99th percentile with a maximum relative error of 1%. Other data structures like t-digest do not have strict error limits. In our search, we finally came across HdrHistogram, a histogram implementation that intelligently selects bin boundaries so that the relative error is limited over a range of many orders of magnitude. The core of HdrHistogram is a fast mapping of values to bin indices by bit twiddling, which reduces the recording time to less than 10ns. Although we loved this idea, this data structure did not quite meet our requirements for several reasons:

  • The original HdrHistogram was designed for recording integer values. Usually we are dealing with floating point values. The wrapper class for double values, which is shipped with HdrHistogram, introduces an indirection, which slows down the recording.
  • Another disadvantage is that HdrHistogram does not give you full control over the error specification. It is only possible to define the number of significant digits corresponding to relative errors of 10%, 1%, 0.1%, etc. It is not possible to select a relative error of 5%. You must fall back on 1%, which unnecessarily increases the number of bins and wastes memory space.
  • HdrHistogram has no support for negative values. Two histograms, one for the positive and one for the negative value range, have to be used instead.
  • HdrHistogram does not keep track of the exact minimum and maximum values.
  • With HdrHistogram it is not possible to define the maximum error for values that are between zero and the range where the relative error limit applies.
  • The mapping of values to bin indices is fast, but not optimal. The mapping used by HdrHistogram requires about 40% more bins than necessary to satisfy the specified relative error. In 2015 we have proposed a better and similarly fast mapping for HdrHistogram (see HdrHistogram/HdrHistogram#54) with less than 10% space overhead. However, as this would have resulted in an incompatible change, the author of HdrHistogram decided not to pursue our idea any further.

Therefore, we started developing our own histogram data sketch which uses the proposed better mapping and which also solves all the mentioned issues. After many years of successful application and the emergence of an open source initiative at Dynatrace, we decided to publish this data structure as a separate library here on GitHub.

Benchmarks

For our benchmarks we used random values drawn from a reciprocal distribution (log-uniform distribution) with a minimum value of 1000 and a maximum value of 1e12. In order not to distort the test results, we have generated 1M random numbers in advance and kept them in main memory. For the comparison with HdrHistogram we used the DoubleHistogram with highestToLowestValueRatio=1e9 and numberOfSignificantValueDigits=2. To record values with equivalent precision we used an absolute bin width limit of 10 and a relative bin width limit of 1% over the range [0, 1e12]. The corresponding layouts LogLinearLayout(10, 0.01, 0, 1e12), LogQuadraticLayout(10, 0.01, 0, 1e12), and LogOptimal(10, 0.01, 0, 1e12) have been combined with the static and dynamic implementations of DynaHist resulting in 6 different cases.

The recording speed was measured using JMH on a Dell Precision 5530 Notebook with an Intel Core i9-8950HK CPU. We measured the average time to insert the 1M random values into an empty histogram data structure, from which we derived the average time for recording a single value. All four investigated DynaHist variants outperform HdrHistogram's DoubleHistogram significantly. The static histogram implementation with the LogLinearLayout was the fastest one and more than 35% faster than HdrHistogram.

Recording Speed

The memory usage of the histogram data structures was analyzed after adding 1M random values as in the speed benchmark before. Again due to the better bin layout DynaHist significantly outperforms HdrHistogram. Especially the dynamic histogram implementation together with LogOptimalLayout or LogQuadraticLayout reduce the memory footprint by more than 85% compared to HdrHistogram.

Memory Footprint

Similarly, the serialization, which is more or less a memory snapshot of the dynamic histogram implementation, is much more compact than that of HdrHistogram.

Raw Serialization

The space advantage is maintained even with compression. The reason is that DynaHist requires much fewer bins to guarantee the same relative error and therefore less information has to be stored.

Compressed Serialization

Bin Layouts

A Layout specifies the bins of a histogram. The regular bins of a layout span a certain value range. In addition, DynaHist uses an underflow and an overflow bin to count the number of values which are below or beyond that value range, respectively.

DynaHist comes with three Layout implementations LogOptimalLayout, LogLinearLayout and LogQuadraticLayout which can be configured using an absolute bin width limit a and a relative bin width limit r. If b(i) denotes the bin boundary between the (i-1)-th and the i-th bin, the i-th bin covers the interval [b(i), b(i+1)]. Then the absolute bin width limit can be expressed as

|b(i+1) - b(i)| <= a                                    (1)

and the relative bin width limit corresponds to

|b(i+1) - b(i)| / max(|b(i)|, |b(i+1)|) <= r.           (2)

If a bin satisfies either (1) or (2), any point of [b(i), b(i+1)] approximates a recorded value mapped to the i-th bin with either a maximum absolute error of a or a maximum relative error of r. In particular, if the midpoint of the interval (b(i) + b(i+1)) / 2 is chosen as estimate of the recorded value, the error is even limited by the absolute error bound a/2 or the relative error bound r/2, respectively.

The bin boundaries of a layout, for which either (1) or (2) holds for all its bins, must satisfy

b(i+1) <= b(i) + max(a, r * b(i))

in the positive value range. For simplicity, we focus on the positive value range. However, similar considerations can be made for the negative range. Obviously, an optimal mapping, that minimizes the number of bins and therefore the memory footprint, would have

b(i+1) = b(i) + max(a, r * b(i)). 

We call bins close to zero and having a width of a subnormal bins. Normal bins are those representing an interval with larger values, which have a width defined by the relative bin width limit. The following figure shows an example of such a bin layout.

Layout Figure

In this example, the bins are enumerated with integers from -9 to 8. The first and last bin indices address the underflow and the overflow bin, respectively. The widths of normal bins correspond to a geometric sequence. For an optimal bin layout

b(i) = c * (1 + r)^i

must hold for all bin boundaries in the normal range with some constant c. Values of the normal range can be mapped to its corresponding bin index by the following mapping function

f(v) := floor(g(v))    with    g(v) := (log(v) - log(c)) / log(1 + r).

While log(1 + r) and log(c) can be precomputed, the calculation of log(v) remains which is an expensive operation on CPUs. Therefore, DynaHist defines alternatives to the optimal mapping used by LogOptimalLayout that trade space efficiency for less computational costs. Obviously, if g(v) is replaced by any other function that is steeper over the whole value range, the error limits will be maintained. Therefore, DynaHist uses linear (LogLinearLayout) or quadratic (LogQuadraticLayout) piecewise functions instead. Each piece spans a range between powers of 2 [2^k, 2^(k+1)]. The coefficients of the polynomials are chosen such that the resulting piecewise function is continuous and the slope is always greater than or equal to that of g(v). The theoretical space overhead of LogLinearLayout is about 44% and that of LogQuadraticLayout is about 8% compared to LogOptimalLayout.

License

Apache Version 2.0

dynahist's People

Contributors

andreas-doppelhofer avatar centic9 avatar dependabot[bot] avatar markusremplbauer avatar oertl avatar sullis avatar sunnychanwork avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

dynahist's Issues

Role of null values in (de)serialized histograms

Ambiguity Report

Context, creating a Rust implementation. To keep the exercise finite/do-able within the resource and time constraints available, I intend that a 1.0 release:

  • Will not support any serialization version prior to the release date.
  • Will not support any serialized histogram produced by an implementation that could serialize a null value.

While I haven't finished, right now it isn't clear to me what the role of null values is in a DynaHist histogram.

Related to this, there is no explicit statement(s) about what data types are supported in a DynaHist.
As best I can tell the supported data types are Java's Int, Long and IEEE 754 - excluding NaN, and the infinities.

Expected Behavior

In DynaHist null is never a (de)serialized value.

Current Behavior

Current code comments in src/dynahist/src/main/java/com/dynatrace/dynahist/serialization/SerializationReader.java:

Implementations should never return {@code null} except for the case {@code null} was really the serialized value.

We're curious to known the context for expecting null in a (de)serialized histogram.

Possible Solution

Document a requirement that implementation MUST ensure that null is never a serialized value.

Detailed Description

Likely a update to the (de)serialization comments:

  • README.md
  • SerializationReader.java
  • SerializayionWriter.java
  • Test case

Possible Implementation

Happy to draft a PR but the maintainers will likely find it quicker to add any such details themselves?

histogram of 100 billion elements

Dear Dynahist team,

I have 100 billion elements (value between 0.75 and 1, can be any one, 4 digits, one element per line) stored in a file (100G), I want to plot a histogram of this file, bin size 0.005, starting from 0.75 to 1 and count number of elements fall in each bin. I did not find any other library/tool that can do the job. Exact counting is not possible because it requires a huge mount of memory. Since Dynahist is a Java library and I am not a java person at all. I think I have no choice but to do it myself using this library.

Want to hear your suggestions.

Thanks,

Jianshu

Tests are failing when run on non-English locale

When executing gradlew check I get the following test-failures.

This is on Ubuntu Linux, Java 11, German Locale via LANG=de_AT.UTF-8

org.junit.ComparisonFailure: expected:<-5[,50000000000000000E+00 - -5,]50000000000000000E+0...> but was:<-5[.50000000000000000E+00 - -5.]50000000000000000E+0...>
	at org.junit.Assert.assertEquals(Assert.java:117)
	at org.junit.Assert.assertEquals(Assert.java:146)
	at com.dynatrace.dynahist.examples.HistogramUsage.addSingleValue(HistogramUsage.java:67)
org.junit.ComparisonFailure: expected:<-5[,50000000000000000E+00 - -5,]50000000000000000E+0...> but was:<-5[.50000000000000000E+00 - -5.]50000000000000000E+0...>
	at org.junit.Assert.assertEquals(Assert.java:117)
	at org.junit.Assert.assertEquals(Assert.java:146)
	at com.dynatrace.dynahist.examples.HistogramUsage.addValueWithMultiplicity(HistogramUsage.java:83)
org.junit.ComparisonFailure: expected:< 0[,00000000000000000E+00 -  9,99999999999999900E-01 : *
 1,00000000000000000E+03 -  9,99999999999999800E+03 : *****
 1,00000000000000000E+04 -  7,]77237591081370300E+0...> but was:< 0[.00000000000000000E+00 -  9.99999999999999900E-01 : *
 1.00000000000000000E+03 -  9.99999999999999800E+03 : *****
 1.00000000000000000E+04 -  7.]77237591081370300E+0...>
	at org.junit.Assert.assertEquals(Assert.java:117)
	at org.junit.Assert.assertEquals(Assert.java:146)
	at com.dynatrace.dynahist.examples.ResponseTimeExample.mappingResponseTimes1(ResponseTimeExample.java:43)
org.junit.ComparisonFailure: expected:< 0[,00000000000000000E+00 -  9,99999999999999900E-01 :                  14
 1,00000000000000000E+00 -  9,99999999999999800E+00 :                 114
 1,00000000000000000E+01 -  9,99999999999999900E+01 :                 924
 1,00000000000000000E+02 -  9,99999999999999900E+02 :                6971
 1,00000000000000000E+03 -  9,99999999999999800E+03 :               47866
 1,00000000000000000E+04 -  9,]98000950924521900E+0...> but was:< 0[.00000000000000000E+00 -  9.99999999999999900E-01 :                  14
 1.00000000000000000E+00 -  9.99999999999999800E+00 :                 114
 1.00000000000000000E+01 -  9.99999999999999900E+01 :                 924
 1.00000000000000000E+02 -  9.99999999999999900E+02 :                6971
 1.00000000000000000E+03 -  9.99999999999999800E+03 :               47866
 1.00000000000000000E+04 -  9.]98000950924521900E+0...>
	at org.junit.Assert.assertEquals(Assert.java:117)
	at org.junit.Assert.assertEquals(Assert.java:146)
	at com.dynatrace.dynahist.examples.ResponseTimeExample.mappingResponseTimes2(ResponseTimeExample.java:65)

Feature: Publication of reference implementation repository (JOSS)

Congratulations @oertl on a very interesting and substantial advance to the state of the art.
Quite disappointing that there has not been more widespread adoption and implementation....:

Feature Request

Summary

Publish the Dynahist reference implementation.

Feature Description

Publish the Java reference implementation repository to for example the Journal of Open Source Software

Here is an example repository/publication and what the review process can look like.

I've reviewed previously and IMO the implementation is good to go - you can (transparently) suggest reviewers to the editor.

I'd be happy to review, however my PhD is in math-stats & option pricing, so perhaps reaching out to some of the people that implemented HDRHistogram and the Julia community would yield more feedback and interest?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.