hdrhistogram / hdrhistogram-go

A pure Go implementation of Gil Tene's HDR Histogram.

License: MIT License

hdrhistogram-go's Introduction

HdrHistogram

HdrHistogram: A High Dynamic Range (HDR) Histogram

This repository currently includes a Java implementation of HdrHistogram. C, C#/.NET, Python, Javascript, Rust, Erlang, and Go ports can be found in other repositories, all of which share common concepts and data representation capabilities. Look at repositories under the HdrHistogram organization for various implementations and useful tools.

Note: The below is an excerpt from a Histogram JavaDoc. While much of it generally applies to other language implementations as well, some details may vary by implementation (e.g. iteration and synchronization), so you should consult the documentation or header information of the specific API library you intend to use.

HdrHistogram supports the recording and analyzing of sampled data value counts across a configurable integer value range with configurable value precision within the range. Value precision is expressed as the number of significant digits in the value recording, and provides control over value quantization behavior across the value range and the subsequent value resolution at any given level.

For example, a Histogram could be configured to track the counts of observed integer values between 0 and 3,600,000,000 while maintaining a value precision of 3 significant digits across that range. Value quantization within the range will thus be no larger than 1/1,000th (or 0.1%) of any value. This example Histogram could be used to track and analyze the counts of observed response times ranging between 1 microsecond and 1 hour in magnitude, while maintaining a value resolution of 1 microsecond up to 1 millisecond, a resolution of 1 millisecond (or better) up to one second, and a resolution of 1 second (or better) up to 1,000 seconds. At its maximum tracked value (1 hour), it would still maintain a resolution of 3.6 seconds (or better).

The HdrHistogram package includes the Histogram implementation, which tracks value counts in long fields, and is expected to be the commonly used Histogram form. IntHistogram and ShortHistogram, which track value counts in int and short fields respectively, are provided for use cases where smaller count ranges are practical and smaller overall storage is beneficial.

HdrHistogram is designed for recording histograms of value measurements in latency and performance sensitive applications. Measurements show value recording times as low as 3-6 nanoseconds on modern (circa 2012) Intel CPUs. AbstractHistogram maintains a fixed cost in both space and time. A Histogram's memory footprint is constant, with no allocation operations involved in recording data values or in iterating through them. The memory footprint is fixed regardless of the number of data value samples recorded, and depends solely on the dynamic range and precision chosen. The amount of work involved in recording a sample is constant, and directly computes storage index locations such that no iteration or searching is ever involved in recording data values.

A combination of high dynamic range and precision is useful for collection and accurate post-recording analysis of sampled value data distribution in various forms. Whether it's calculating or plotting arbitrary percentiles, iterating through and summarizing values in various ways, or deriving mean and standard deviation values, the fact that the recorded data information is kept in high resolution allows for accurate post-recording analysis with low [and ultimately configurable] loss in accuracy when compared to performing the same analysis directly on the potentially infinite series of sourced data values samples.

A common use example of HdrHistogram would be to record response times in units of microseconds across a dynamic range stretching from 1 usec to over an hour, with a good enough resolution to support later performing post-recording analysis on the collected data. Analysis can include computing, examining, and reporting of distribution by percentiles, linear or logarithmic value buckets, mean and standard deviation, or by any other means that can be easily added by using the various iteration techniques supported by the Histogram. In order to facilitate the accuracy needed for various post-recording analysis techniques, this example can maintain a resolution of ~1 usec or better for times ranging to ~2 msec in magnitude, while at the same time maintaining a resolution of ~1 msec or better for times ranging to ~2 sec, and a resolution of ~1 second or better for values up to 2,000 seconds. This sort of example resolution can be thought of as "always accurate to 3 decimal points." Such an example Histogram would simply be created with a highestTrackableValue of 3,600,000,000, and a numberOfSignificantValueDigits of 3, and would occupy a fixed, unchanging memory footprint of around 185KB (see "Footprint estimation" below).

Histogram variants and internal representation

The HdrHistogram package includes multiple implementations of the AbstractHistogram class:

Histogram, which is the commonly used Histogram form and tracks value counts in long fields.
IntHistogram and ShortHistogram, which track value counts in int and short fields respectively, are provided for use cases where smaller count ranges are practical and smaller overall storage is beneficial (e.g. systems where tens of thousands of in-memory histograms are being tracked).
AtomicHistogram and SynchronizedHistogram (see 'Synchronization and concurrent access' below)

Internally, data in HdrHistogram variants is maintained using a concept somewhat similar to that of floating point number representation: using an exponent and a (non-normalized) mantissa to support a wide dynamic range at a high but varying (by exponent value) resolution. AbstractHistogram uses exponentially increasing bucket value ranges (the parallel of the exponent portion of a floating point number), with each bucket containing a fixed number of linear sub-buckets (the parallel of a non-normalized mantissa portion of a floating point number). Both dynamic range and resolution are configurable, with highestTrackableValue controlling dynamic range, and numberOfSignificantValueDigits controlling resolution.
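The exponent/mantissa analogy can be made concrete with a small Go sketch. It is an illustration only, not the code of any HdrHistogram port: the names layout, newLayout, and indexOf are hypothetical, and it assumes a lowest discernible value of 1 so that no unit-magnitude shift is needed.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// layout holds the derived constants for a histogram whose lowest
// discernible value is 1 (an assumption made to keep the sketch short).
type layout struct {
	subBucketCount int64 // linear sub-buckets per bucket, a power of two
	subBucketBits  int   // log2(subBucketCount)
}

// newLayout derives the sub-bucket count from the number of significant
// decimal digits: the smallest power of two >= 2 * 10^digits.
func newLayout(significantDigits int) layout {
	largest := uint64(2 * math.Pow10(significantDigits))
	b := bits.Len64(largest - 1)
	return layout{subBucketCount: 1 << b, subBucketBits: b}
}

// indexOf returns the exponent-like bucket index and the mantissa-like
// sub-bucket index for a non-negative value v.
func (l layout) indexOf(v int64) (bucket int, subBucket int64) {
	mask := uint64(l.subBucketCount - 1)
	bucket = bits.Len64(uint64(v)|mask) - l.subBucketBits
	subBucket = v >> uint(bucket)
	return bucket, subBucket
}

func main() {
	l := newLayout(3)
	for _, v := range []int64{1000, 5000, 1000000, 3600000000} {
		b, s := l.indexOf(v)
		// Every value in bucket b is recorded with a resolution of 2^b units.
		fmt.Printf("value=%d bucket=%d subBucket=%d resolution=%d\n", v, b, s, int64(1)<<uint(b))
	}
}
```

Each step up in bucket index doubles the width of a sub-bucket, which is why resolution degrades gradually while the relative error stays bounded.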

Synchronization and concurrent access

In the interest of keeping value recording cost to a minimum, the commonly used Histogram class and its IntHistogram and ShortHistogram variants are NOT internally synchronized, and do NOT use atomic variables. Callers wishing to make potentially concurrent, multi-threaded updates or queries against Histogram objects should either take care to externally synchronize and/or order their access, or use the ConcurrentHistogram, AtomicHistogram, or SynchronizedHistogram variants.

A common pattern seen in histogram value recording involves recording values in a critical path (multi-threaded or not), coupled with a non-critical path reading the recorded data for summary/reporting purposes. When such continuous non-blocking recording operation (concurrent or not) is desired even when sampling, analyzing, or reporting operations are needed, consider using the Recorder and SingleWriterRecorder recorder variants that were specifically designed for that purpose. Recorders provide a recording API similar to Histogram, and internally maintain and coordinate active/inactive histograms such that recording remains wait-free in the presence of accurate and stable interval sampling.

It is worth mentioning that since Histogram objects are additive, it is common practice to use per-thread non-synchronized histograms or SingleWriterRecorders, and use a summary/reporting thread to perform histogram aggregation math across time and/or threads.
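The per-thread pattern described above might look like the following Go sketch. The counts map is a hypothetical stand-in for a histogram (it is not the library's type), and recordConcurrently and aggregate are illustrative names. Each goroutine owns its map exclusively, so recording needs no locks; aggregation happens only after the writers have finished.

```go
package main

import (
	"fmt"
	"sync"
)

// counts is a stand-in for a histogram: value -> number of samples.
type counts map[int64]int64

// merge adds every count in src into dst; this works because
// histogram counts are additive.
func (dst counts) merge(src counts) {
	for v, c := range src {
		dst[v] += c
	}
}

// recordConcurrently gives each worker its own unsynchronized counts map
// and returns the per-worker results once all workers have finished.
func recordConcurrently(workers, samplesPerWorker int) []counts {
	results := make([]counts, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			local := counts{} // touched by this goroutine only
			for i := 0; i < samplesPerWorker; i++ {
				local[int64(i%10)]++
			}
			results[w] = local
		}(w)
	}
	wg.Wait()
	return results
}

// aggregate plays the role of the summary/reporting thread.
func aggregate(parts []counts) counts {
	total := counts{}
	for _, p := range parts {
		total.merge(p)
	}
	return total
}

func main() {
	total := aggregate(recordConcurrently(4, 1000))
	fmt.Println("samples recorded for value 3:", total[3])
}
```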

Iteration

Histograms support multiple convenient forms of iterating through the histogram data set, including linear, logarithmic, and percentile iteration mechanisms, as well as means for iterating through each recorded value or each possible value level. The iteration mechanisms are accessible through the HistogramData available through getHistogramData(). Iteration mechanisms all provide HistogramIterationValue data points along the histogram's iterated data set, and are available for the default (corrected) histogram data set via the following HistogramData methods:

percentiles: An Iterable<HistogramIterationValue> through the histogram using a PercentileIterator
linearBucketValues: An Iterable<HistogramIterationValue> through the histogram using a LinearIterator
logarithmicBucketValues: An Iterable<HistogramIterationValue> through the histogram using a LogarithmicIterator
recordedValues: An Iterable<HistogramIterationValue> through the histogram using a RecordedValuesIterator
allValues: An Iterable<HistogramIterationValue> through the histogram using an AllValuesIterator

Iteration is typically done with a for-each loop statement. E.g.:

 for (HistogramIterationValue v :
      histogram.getHistogramData().percentiles(ticksPerHalfDistance)) {
     ...
 }

 for (HistogramIterationValue v :
      histogram.getRawHistogramData().linearBucketValues(unitsPerBucket)) {
     ...
 }

The iterators associated with each iteration method are resettable, such that a caller that would like to avoid allocating a new iterator object for each iteration loop can re-use an iterator to repeatedly iterate through the histogram. This iterator re-use usually takes the form of a traditional loop using the Iterator's hasNext() and next() methods.

So to avoid allocating a new iterator object for each iteration loop:

 PercentileIterator iter =
    histogram.getHistogramData().percentiles(ticksPerHalfDistance).iterator();
 ...
 iter.reset(percentileTicksPerHalfDistance);
 while (iter.hasNext()) {
     HistogramIterationValue v = iter.next();
     ...
 }

Equivalent Values and value ranges

Due to the finite (and configurable) resolution of the histogram, multiple adjacent integer data values can be "equivalent". Two values are considered "equivalent" if samples recorded for both are always counted in a common total count due to the histogram's resolution level. HdrHistogram provides methods for determining the lowest and highest equivalent values for any given value, as well as determining whether two values are equivalent, and for finding the next non-equivalent value for a given value (useful when looping through values, in order to avoid double-counting).
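As a rough illustration of equivalence, the Go sketch below computes equivalent-value ranges for a histogram with 3 significant digits (2048 sub-buckets) and a lowest discernible value of 1. Both settings are assumptions fixed in the code, and the function names are hypothetical rather than taken from any port.

```go
package main

import (
	"fmt"
	"math/bits"
)

// subBucketBits is log2 of the sub-bucket count; 11 (2048 sub-buckets)
// corresponds to 3 significant digits. Lowest discernible value is 1.
const subBucketBits = 11

// rangeShift returns log2 of the width of the equivalent-value range
// that contains v.
func rangeShift(v int64) uint {
	mask := uint64(1)<<subBucketBits - 1
	return uint(bits.Len64(uint64(v)|mask) - subBucketBits)
}

// lowestEquivalent rounds v down to the start of its range.
func lowestEquivalent(v int64) int64 {
	s := rangeShift(v)
	return (v >> s) << s
}

// highestEquivalent returns the last value counted together with v.
func highestEquivalent(v int64) int64 {
	return lowestEquivalent(v) + (int64(1) << rangeShift(v)) - 1
}

// nextNonEquivalent returns the first value that lands in a different slot.
func nextNonEquivalent(v int64) int64 {
	return highestEquivalent(v) + 1
}

// equivalent reports whether a and b are counted in the same slot.
func equivalent(a, b int64) bool {
	return lowestEquivalent(a) == lowestEquivalent(b)
}

func main() {
	// Stepping by nextNonEquivalent visits each slot once, which is what
	// avoids double-counting when looping over values.
	for v := int64(4090); v < 4110; v = nextNonEquivalent(v) {
		fmt.Printf("[%d, %d]\n", lowestEquivalent(v), highestEquivalent(v))
	}
}
```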

Corrected vs. Raw value recording calls

In order to support a common use case needed when histogram values are used to track response time distribution, Histogram provides for the recording of corrected histogram values through a recordValueWithExpectedInterval() variant. This value recording form is useful in [common latency measurement] scenarios where response times may exceed the expected interval between issuing requests, leading to "dropped" response time measurements that would typically correlate with "bad" results.

When a value recorded in the histogram exceeds the expectedIntervalBetweenValueSamples parameter, recorded histogram data will reflect an appropriate number of additional values, linearly decreasing in steps of expectedIntervalBetweenValueSamples, down to the last value that would still be higher than expectedIntervalBetweenValueSamples.

To illustrate why this corrective behavior is critically needed in order to accurately represent value distribution when large value measurements may lead to missed samples, imagine a system for which response time samples are taken once every 10 msec to characterize response time distribution. The hypothetical system behaves "perfectly" for 100 seconds (10,000 recorded samples), with each sample showing a 1 msec response time value. The hypothetical system then encounters a 100 sec pause during which only a single sample is recorded (with a 100 second value). The raw data histogram collected for such a hypothetical system (over the 200 second scenario above) would show ~99.99% of results at 1 msec or below, which is obviously "not right". The same histogram, corrected with the knowledge of an expectedIntervalBetweenValueSamples of 10 msec, will correctly represent the response time distribution: only ~50% of results will be at 1 msec or below, with the remaining ~50% coming from the auto-generated value records covering the missing increments spread between 10 msec and 100 sec.

Data sets recorded with and without an expectedIntervalBetweenValueSamples parameter will differ only if at least one value recorded with the recordValue method was greater than its associated expectedIntervalBetweenValueSamples parameter. Data sets recorded with an expectedIntervalBetweenValueSamples parameter will be identical to ones recorded without it if all values recorded via the recordValue calls were smaller than their associated (and optional) expectedIntervalBetweenValueSamples parameters.

When used for response time characterization, the recording with the optional expectedIntervalBetweenValueSamples parameter will tend to produce data sets that would much more accurately reflect the response time distribution that a random, uncoordinated request would have experienced.
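The 200 second scenario above can be replayed with a short Go sketch. The counts map is a hypothetical stand-in for a histogram and recordCorrected is an illustrative name; in particular, treating the lower bound of the back-fill loop as inclusive (>=) is an assumption, since the text does not pin down the boundary case. Values are in milliseconds.

```go
package main

import "fmt"

// counts is a stand-in for a histogram: value (msec) -> number of samples.
type counts map[int64]int64

func (c counts) record(v int64) { c[v]++ }

// recordCorrected records v, then back-fills the samples a stalled
// measurement loop would have missed, stepping down by expectedInterval.
// The inclusive lower bound (>=) is an assumption of this sketch.
func (c counts) recordCorrected(v, expectedInterval int64) {
	c.record(v)
	if expectedInterval <= 0 {
		return
	}
	for missing := v - expectedInterval; missing >= expectedInterval; missing -= expectedInterval {
		c.record(missing)
	}
}

// fractionAtOrBelow returns the share of samples with value <= limit.
func (c counts) fractionAtOrBelow(limit int64) float64 {
	var below, total int64
	for v, n := range c {
		total += n
		if v <= limit {
			below += n
		}
	}
	return float64(below) / float64(total)
}

// scenario replays the example: 10,000 samples of 1 msec followed by a
// single 100 sec sample, with an expected interval of 10 msec.
func scenario(corrected bool) counts {
	c := counts{}
	for i := 0; i < 10000; i++ {
		if corrected {
			c.recordCorrected(1, 10)
		} else {
			c.record(1)
		}
	}
	if corrected {
		c.recordCorrected(100000, 10)
	} else {
		c.record(100000)
	}
	return c
}

func main() {
	fmt.Printf("raw:       %.4f of samples at or below 1 msec\n", scenario(false).fractionAtOrBelow(1))
	fmt.Printf("corrected: %.4f of samples at or below 1 msec\n", scenario(true).fractionAtOrBelow(1))
}
```

The 1 msec samples are unaffected by correction because they never exceed the expected interval; only the single long pause generates back-filled records.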

Footprint estimation

Due to its dynamic range representation, Histogram is relatively efficient in memory space requirements given the accuracy and dynamic range it covers. Still, it is useful to be able to estimate the memory footprint involved for a given highestTrackableValue and numberOfSignificantValueDigits combination. Beyond a relatively small fixed-size footprint used for internal fields and stats (which can be estimated as "fixed at well less than 1KB"), the bulk of a Histogram's storage is taken up by its data value recording counts array. The total footprint can be conservatively estimated by:

 largestValueWithSingleUnitResolution =
        2 * (10 ^ numberOfSignificantValueDigits);
 subBucketSize =
        roundedUpToNearestPowerOf2(largestValueWithSingleUnitResolution);

 expectedHistogramFootprintInBytes = 512 +
      ({primitive type size} / 2) *
      (log2RoundedUp((highestTrackableValue) / subBucketSize) + 2) *
      subBucketSize

A conservative (high) estimate of a Histogram's footprint in bytes is available via the getEstimatedFootprintInBytes() method.
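The estimate above can be evaluated directly. The Go sketch below transcribes the formula as written; estimateFootprint is a hypothetical helper name, and countSizeBytes stands for the {primitive type size} term (8 for long counts, 4 for int, 2 for short).

```go
package main

import (
	"fmt"
	"math"
)

// estimateFootprint evaluates the conservative footprint formula for a
// given range, precision, and per-count storage size in bytes.
func estimateFootprint(highestTrackableValue int64, significantDigits int, countSizeBytes int64) int64 {
	// largestValueWithSingleUnitResolution = 2 * 10^digits
	largest := 2 * math.Pow10(significantDigits)
	// subBucketSize = largest, rounded up to the nearest power of two
	subBucketSize := math.Pow(2, math.Ceil(math.Log2(largest)))
	// log2RoundedUp(highestTrackableValue / subBucketSize) + 2
	buckets := math.Ceil(math.Log2(float64(highestTrackableValue)/subBucketSize)) + 2
	return 512 + int64(float64(countSizeBytes)/2*buckets*subBucketSize)
}

func main() {
	// The response-time example: up to 3,600,000,000 usec, 3 digits, long counts.
	bytes := estimateFootprint(3600000000, 3, 8)
	fmt.Printf("%d bytes (about %.1f KB)\n", bytes, float64(bytes)/1024)
}
```

For the response-time example this evaluates to 188,928 bytes, in line with the "around 185KB" figure quoted earlier.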

hdrhistogram-go's Issues

Overflow while calculating mean

I am dealing with a large distribution of values. In your mean calculation here total overflows because you keep adding without keeping it down. May I suggest to use a running mean? Because this is anyway an approximate mean, the rounding error that occurs should be negligible.

// Mean returns the approximate arithmetic mean of the recorded values.
func (h *Histogram) Mean() float64 {
    if h.totalCount == 0 {
        return 0
    }
    var total int64
    i := h.iterator()
    for i.next() {
        if i.countAtIdx != 0 {
            total += i.countAtIdx * h.medianEquivalentValue(i.valueFromIdx)
        }
    }
    return float64(total) / float64(h.totalCount)
}
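One way the running mean suggested in this issue could look is sketched below. It is not a patch against the library: bucket, naiveMean, and runningMean are hypothetical names operating on plain (value, count) pairs, chosen so the overflow can be shown in isolation.

```go
package main

import "fmt"

// bucket is a stand-in for one histogram slot: a representative value
// and the number of samples counted at it.
type bucket struct {
	value int64
	count int64
}

// naiveMean mirrors the summing approach: the int64 total can wrap
// around when counts and values are large.
func naiveMean(bs []bucket) float64 {
	var total, n int64
	for _, b := range bs {
		total += b.count * b.value
		n += b.count
	}
	if n == 0 {
		return 0
	}
	return float64(total) / float64(n)
}

// runningMean folds each bucket into a weighted running mean, so no
// intermediate value grows beyond the magnitude of the data itself.
func runningMean(bs []bucket) float64 {
	var mean float64
	var n int64
	for _, b := range bs {
		if b.count == 0 {
			continue
		}
		n += b.count
		mean += (float64(b.value) - mean) * float64(b.count) / float64(n)
	}
	return mean
}

func main() {
	big := int64(1) << 62
	bs := []bucket{{value: big, count: 2}, {value: big, count: 2}}
	fmt.Println("naive:  ", naiveMean(bs))
	fmt.Println("running:", runningMean(bs))
}
```

The running form trades exact integer accumulation for float64 rounding, which matches the issue's point that the mean is approximate anyway.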

why int64 and not float64?

Do you have any plans to add support for recording float64 values as well as ints, similar to the Java version?

Record Float64 values in HDR

Is there any particular reason why int64 was chosen over float64 for the type of recorded values? I am interested in incorporating the HDR into our existing metrics pipeline, but our pipeline transmits metrics as float64 type. If I forked this project and refactored to use the float, are there potential issues that perhaps are not immediately obvious? I'm just wondering if int64 was chosen for a very specific reason, or was a somewhat arbitrary decision. Worst case scenario I can update my pipeline services to convert from float to int and back again, but I would really like to avoid that. The histogram I am most familiar with is implemented in Prometheus and accepts float64 as value.
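The "convert from float to int and back again" fallback mentioned above can be kept small. The following Go sketch shows one possible fixed-point conversion; toFixed, fromFixed, and the scale factor are hypothetical and would have to match the precision the pipeline actually needs.

```go
package main

import (
	"fmt"
	"math"
)

// toFixed converts a float64 measurement into an integer number of
// 1/scale units, rounding to the nearest unit.
func toFixed(v, scale float64) int64 {
	return int64(math.Round(v * scale))
}

// fromFixed converts a recorded integer back into the original unit.
func fromFixed(n int64, scale float64) float64 {
	return float64(n) / scale
}

func main() {
	const scale = 1000 // keep three decimal places (an arbitrary choice)
	recorded := toFixed(12.5, scale)
	fmt.Println(recorded, fromFixed(recorded, scale))
}
```

The highest trackable value passed to the histogram would need to be scaled by the same factor.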

recording value 11 out of range 0-200

I was perplexed that this

package main

import (
    "fmt"

    "github.com/codahale/hdrhistogram"
)

func main() {
    h := hdrhistogram.New(0, 200, 4)
    err := h.RecordValue(11)
    fmt.Println(err)
}

produced this

$ go run y.go 
value 11 is too large to be recorded

Then I checked the unit tests, and they all use 1 as the min. Is 1 the minimum minimum? I'm happy to send a documentation patch if you think this should be explicitly warned about.

class intervals and counts?

Hello,
do i understand correctly that this histogram is not able to return class intervals ("buckets") and the counts thereof? (which is what i thought histograms do)
the closest i could find was the func (*Histogram) CumulativeDistribution but the brackets returned are delineated by the quantile they're at (e.g. how many values have been included) as opposed to value thresholds.

thanks

Include coverage info

I've manually pushed the current master coverage info there:
https://codecov.io/gh/HdrHistogram/hdrhistogram-go
but we need to add it to GH Actions on a per commit/PR basis.

Tag semver releases?

Hi! I notice that this project doesn't have any tagged releases. Would you mind adding some SemVer-compatible release tags? It would really, really help those of us using dep and similar tools.

go1.12 got error parsing go.mod: unexpected module path "github.com/HdrHistogram/hdrhistogram-go"

This MR (bb05e18) changed the module path; using go mod with go1.12 now fails with this error:

go: github.com/codahale/[email protected]: parsing go.mod: unexpected module path "github.com/HdrHistogram/hdrhistogram-go"

Weird quantile values

Hello,

I have a pending change set which is on hold because I am probably doing something stupid with this library. Would you mind having a look at what's going on? Reported values are zero with sigfigs=5. When sigfigs=3 all quantiles report the same number.

Enable Sourcegraph

I want to use Sourcegraph for hdrhistogram code search, browsing, and usage examples. Can an admin enable Sourcegraph for this repository? Just go to https://sourcegraph.com/github.com/codahale/hdrhistogram. (It should only take 30 seconds.)

Thank you!

Fix godoc to use new repository URL

The original godoc location is still showing the old repository content
http://godoc.org/github.com/codahale/hdrhistogram

Need to generate documentation to
http://godoc.org/github.com/HdrHistogram/hdrhistogram

from github CI and update the reference in the README.md file

compare two histogram image

hello, I have a question: how to use your project to implement similarity comparison of two histogram images?

PackedHistogram support

Further reference: https://www.javadoc.io/static/org.hdrhistogram/HdrHistogram/2.1.12/org/HdrHistogram/PackedHistogram.html

CumulativeDistribution: Bracket should track value

Looking at Bracket, I think that for it to be useful for reporting cumulative distribution, it should be tracking the value, either instead of or in addition to the count.

In a world with non-repeating precise values, the count for a given quantile is a direct function of the total count in the data set. count = f(total_count, quantile) would be invariant to the data values recorded. In a world with repeating values (and/or imprecise ones) the cumulative count can "jump" by more than one between distinguishable value levels, but will still approximate the invariant behavior, so it doesn't add much useful info.

Including the actual count iterated to along with the value at the quantile can be useful for sanity checking (I use that for printing the cumulative count column right next to the value, which makes it easy to tell when multiple iterated percentile levels fall within the same sub-bucket, this is sometimes useful when looking at the very high nines)

getBucketIndex calculates index wrong, overflows on small values

There is an error in the following code
pow2Ceiling := bitLen(v | h.subBucketMask)

Leading zero base is missing. In this case it is constant 64
pow2Ceiling := 64 - bitLen(v | h.subBucketMask)
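Whether the 64 - term is needed depends on what bitLen returns, which the snippet above does not show. The Go sketch below only demonstrates the standard-library relationship between a bit length and a leading-zero count; the two helper names are hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitLength returns the number of bits needed to represent x.
func bitLength(x uint64) int {
	return bits.Len64(x)
}

// bitLengthFromLeadingZeros derives the same quantity from a
// leading-zero count, which is where the constant 64 comes from.
func bitLengthFromLeadingZeros(x uint64) int {
	return 64 - bits.LeadingZeros64(x)
}

func main() {
	for _, x := range []uint64{0, 1, 2047, 2048, 1 << 40} {
		fmt.Println(x, bitLength(x), bitLengthFromLeadingZeros(x))
	}
}
```

If a helper already returns the bit length, subtracting it from 64 would instead yield the number of leading zeros.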

codahale/hdrhistogram repo has been transferred under the GitHub HdrHistogram umbrella

hdrhist impl. does not track values correctly for seconds since epoch

see pull request #23

Export and import histograms

Hi Coda!

I have a case when I'm collecting histograms from various data sources via API and would like to merge them to get the combined histogram.

What do you think is the best way to export/import histograms for this package?

One way to do this is to add Export() Dump and Import(Dump) (*Histogram, error) functions that would operate on the following data structure:

type Dump struct {
    Version int
    LowestTrackableValue int64
    HighestTrackableValue int64
    SignificantFigures int64
    Counts []int64
}

The other approach is to make the raw iterator public, so I can implement any serialization/de-serialization logic in my app just by using iterator over counter values.

I can send a PR if you think that any of these approaches makes sense.
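To make the first proposal concrete, here is a hedged Go sketch of how a Dump could travel as JSON and be combined on the receiving side. Only the Dump struct comes from the proposal above; roundTrip and mergeDumps are hypothetical helpers, and the sketch never touches a real Histogram.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Dump is the data structure proposed in this issue.
type Dump struct {
	Version               int
	LowestTrackableValue  int64
	HighestTrackableValue int64
	SignificantFigures    int64
	Counts                []int64
}

// roundTrip serializes a Dump to JSON and parses it back, as a remote
// data source and a collector would do between them.
func roundTrip(d Dump) (Dump, error) {
	raw, err := json.Marshal(d)
	if err != nil {
		return Dump{}, err
	}
	var back Dump
	err = json.Unmarshal(raw, &back)
	return back, err
}

// mergeDumps adds the counts of two dumps that share a configuration.
func mergeDumps(a, b Dump) (Dump, error) {
	if a.LowestTrackableValue != b.LowestTrackableValue ||
		a.HighestTrackableValue != b.HighestTrackableValue ||
		a.SignificantFigures != b.SignificantFigures ||
		len(a.Counts) != len(b.Counts) {
		return Dump{}, errors.New("dumps were recorded with different configurations")
	}
	out := a
	out.Counts = make([]int64, len(a.Counts))
	for i := range a.Counts {
		out.Counts[i] = a.Counts[i] + b.Counts[i]
	}
	return out, nil
}

func main() {
	a := Dump{Version: 1, LowestTrackableValue: 1, HighestTrackableValue: 1000, SignificantFigures: 3, Counts: []int64{1, 0, 2}}
	b, err := roundTrip(a)
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	merged, err := mergeDumps(a, b)
	fmt.Println(merged.Counts, err)
}
```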

Support for textual output percentile distribution

As seen in other implementations, we should support outputting text percentile distributions as follows:

       Value     Percentile TotalCount 1/(1-Percentile)

      79.360 0.000000000000          1           1.00
      80.435 0.166666666667         17           1.20
      80.896 0.333333333333         36           1.50
      81.050 0.500000000000         52           2.00
      81.152 0.583333333333         59           2.40
      81.254 0.666666666667         70           3.00
      81.357 0.750000000000         76           4.00
      81.459 0.791666666667         86           4.80
      81.459 0.833333333333         86           6.00
      81.510 0.875000000000         93           8.00
      81.510 0.895833333333         93           9.60
      81.510 0.916666666667         93          12.00
      81.562 0.937500000000         94          16.00
      81.613 0.947916666667         98          19.20
      81.613 0.958333333333         98          24.00
      81.613 0.968750000000         98          32.00
      81.613 0.973958333333         98          38.40
      81.613 0.979166666667         98          48.00
      81.664 0.984375000000         99          64.00
      81.664 0.986979166667         99          76.80
      81.664 0.989583333333         99          96.00
      86.067 0.992187500000        100         128.00
      86.067 1.000000000000        100
#[Mean    =       80.964, StdDeviation   =        0.746]
#[Max     =       86.067, Total count    =          100]
#[Buckets =           26, SubBuckets     =         2048]
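The Percentile column above follows a regular pattern: each halving of the remaining distance to 1.0 is split into the same number of ticks (three in this output). The Go sketch below reproduces only those levels, not the values or counts; quantileLevels is a hypothetical helper, and the closed-form spacing is inferred from the table rather than taken from any implementation.

```go
package main

import "fmt"

// quantileLevels returns the first n quantile levels produced when every
// "half distance" to 1.0 is divided into ticksPerHalfDistance steps.
func quantileLevels(ticksPerHalfDistance, n int) []float64 {
	if ticksPerHalfDistance <= 0 {
		return nil
	}
	levels := make([]float64, 0, n)
	for half := 0; len(levels) < n; half++ {
		// Start of this half distance: 1 - 2^-half.
		base := 1 - 1/float64(uint64(1)<<uint(half))
		// Tick width within it: 1 / (ticks * 2^(half+1)).
		step := 1 / float64(uint64(ticksPerHalfDistance)<<uint(half+1))
		for t := 0; t < ticksPerHalfDistance && len(levels) < n; t++ {
			levels = append(levels, base+float64(t)*step)
		}
	}
	return levels
}

func main() {
	for _, q := range quantileLevels(3, 22) {
		fmt.Printf("%.12f\n", q)
	}
}
```

A real printer would stop once the histogram's maximum value is reached and emit a final 1.0 row, as the last line of the table does.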

panic: runtime error: index out of range

After upgrading to v1.1.2, occasionally we are observing the following panic:

panic: runtime error: index out of range [25600] with length 25600
github.com/HdrHistogram/hdrhistogram-go.(*Histogram).getCountAtIndexGivenBucketBaseIdx(...)
	/go/pkg/mod/github.com/!hdr!histogram/[email protected]/hdr.go:599
github.com/HdrHistogram/hdrhistogram-go.(*Histogram).getValueFromIdxUpToCount(0xb7, 0x3ff800000)
	/go/pkg/mod/github.com/!hdr!histogram/[email protected]/hdr.go:361 +0xb7
github.com/HdrHistogram/hdrhistogram-go.(*Histogram).ValueAtPercentile(0xc0000b6480, 0x0)
	/go/pkg/mod/github.com/!hdr!histogram/[email protected]/hdr.go:335 +0x65
github.com/HdrHistogram/hdrhistogram-go.(*Histogram).ValueAtQuantile(...)
	/go/pkg/mod/github.com/!hdr!histogram/[email protected]/hdr.go:319

Histogram is created with:

hdrhistogram.NewWindowed(windowCount, 1, maxLatency.Nanoseconds(), 3)

WindowedHistogram is rotated every 10 minutes and Histogram.ValueAtQuantile() is called on a Histogram produced by WindowedHistogram.Merge().

This panic happened twice in the last ten days, but I'm not able to reproduce it in a local environment. This application has been running in production for more than a year and we haven't seen this issue with earlier versions.

Eliminate Panic?

If any of the New methods receives a significant figure param outside the acceptable range the library will trigger a panic.

This makes the library unsuitable as-is for use cases that involve user input. From a consumer standpoint, this could be worked around by making a wrapper function that returns an error if the input isn't valid - but I think it would be cleaner to not have the library panic.

However, changing New to return an error will break the library's API. Would you accept a PR for this?
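The consumer-side workaround mentioned above could be sketched as follows. newHistogram is a hypothetical stand-in for a panicking constructor (its accepted range of 1 to 5 is an assumption made for the example), and safeNew is an illustrative wrapper that turns the panic into an error without changing the constructor's signature.

```go
package main

import "fmt"

// histogram is a placeholder type for this sketch.
type histogram struct{ sigfigs int }

// newHistogram is a hypothetical stand-in for a constructor that panics
// on invalid input; the accepted range is an assumption.
func newHistogram(sigfigs int) *histogram {
	if sigfigs < 1 || sigfigs > 5 {
		panic(fmt.Sprintf("sigfigs must be in [1,5], got %d", sigfigs))
	}
	return &histogram{sigfigs: sigfigs}
}

// safeNew converts a constructor panic into an ordinary error so that
// user-supplied configuration cannot crash the caller.
func safeNew(sigfigs int) (h *histogram, err error) {
	defer func() {
		if r := recover(); r != nil {
			h, err = nil, fmt.Errorf("invalid histogram configuration: %v", r)
		}
	}()
	return newHistogram(sigfigs), nil
}

func main() {
	if _, err := safeNew(9); err != nil {
		fmt.Println("rejected:", err)
	}
	if h, err := safeNew(3); err == nil {
		fmt.Println("created with sigfigs =", h.sigfigs)
	}
}
```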

Return smaller Snapshot

Would it be possible to return only an array of all counts up to TotalCount? Given a log-normal distribution it is very likely that the high numbers don't have a single count.

That means if you serialize the hdr histogram snapshot as JSON for example you can optimize massively for small snapshots by returning only an array of counts up to the last one that has a positive count.

Since the import already iterates over countsLen it could be used to initialize h.counts as well and only the Export() needs an additional loop.
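A minimal Go sketch of the idea, operating on a bare counts slice rather than on the library's snapshot type: trimTrailingZeros and restore are hypothetical helpers, and restore assumes the importer can recompute the full counts length from the histogram configuration.

```go
package main

import "fmt"

// trimTrailingZeros returns the prefix of counts that ends at the last
// non-zero entry; the result shares memory with the input slice.
func trimTrailingZeros(counts []int64) []int64 {
	end := len(counts)
	for end > 0 && counts[end-1] == 0 {
		end--
	}
	return counts[:end]
}

// restore pads a trimmed slice back to the full length expected by the
// importer, filling the tail with zeros.
func restore(trimmed []int64, fullLen int) []int64 {
	full := make([]int64, fullLen)
	copy(full, trimmed)
	return full
}

func main() {
	counts := []int64{0, 3, 0, 1, 0, 0, 0, 0}
	small := trimTrailingZeros(counts)
	fmt.Println(len(counts), "->", len(small))
	fmt.Println(restore(small, len(counts)))
}
```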

Test on recent Go versions

Are you open to a PR that enables Travis testing on more recent Go versions?

I was thinking of testing only on Go 1.5, 1.6, and tip.
