Giter VIP home page Giter VIP logo

goestimators's Introduction

Fast Estimator Algorithms for Go: (Hyper|Super)?LogLog, Bloom filter

HyperLogLog (and LogLog and SuperLogLog) for Go

This algorithms in this family implement fast approximate counting of arbitrary data (here implemented as binary keys), with a trivially small memory footprint. Their workings are similar enough that the only point of difference is the summarisation step, where the approximation is calculated from internal data. The same internal data can be used for both LogLog, SuperLogLog and HyperLogLog approximation.

The number of LogLog buckets (which affects accuracy, but soon saturates to diminishing returns) is adjustable, with 1024 being the recommended default, and 32 being the smallest practical size. This number must be an integral power of 2.

This library was implemented from scratch, by interpreting the following papers and tutorials:

Any implementation errors are probably my own.

This library uses the Murmur3 hash function implementation from github.com/spaolacci/murmur3.

Example

ll, _ := NewLogLog(1024)
var buf := make([]byte, 8)
// Add 100000 random b-byte buffers to the set
for i := 0; i < 100000; i++ {
	rand.Read(buf)
	ll.Observe(buf)  // Observes the buffer, i.e. updates internal representation from it
}
est := ll.Estimate() // Returns the estimated number of unique observed buffers
fmt.Println("Estimate of a set of 100k random entries: ", est)

This example uses the plain old LogLog estimation algorithm, implemented in the Estimate() function. All three algorithms can be used from the same observation results:

  • Estimate() - LogLog (fastest)
  • SuperEstimate() - SuperLogLog
  • HyperEstimate() - HyperLogLog (slowest)

Performance

Measured on an i5-5200U, the benchmark results (using 1024 buckets) are:

BenchmarkLogLog-4                        2000000               712 ns/op
BenchmarkSuperLogLog-4                   2000000               763 ns/op
BenchmarkHyperLogLog-4                    500000              2367 ns/op
BenchmarkObservationLogLog-4            20000000                61.6 ns/op

Since HyperLogLog improves accuracy only slightly compared to LogLog and SuperLogLog, users should decide for themselves if they can accept the hit in performance.

Accuracy

A typical run of the test functions (using 1024 buckets) looks like this:

LogLogRandom_1M 918170
SuperLogLogRandom_1M 932190
HyperLogLogRandom_1M 1059335

These are the estimates for observing 1M sequential 8-byte buffers (i.e. reasonably unique) by the respective algorithms. YMMV.

Bloom filter

"Bloom filters" test for the absence of an element in a large set. If a Bloom filter returns negative, the element certainly isn't in the set. If it returns positive, the element might be in the set. Memory usage for this is trivially small.

This algorithm was implemented from scratch, mostly by reading the Wikipedia article. It uses the Murmur3 function for the 64-bit hash implementation.

For performance reasons, there are only two supported Bloom filter sizes: 256-bit (which is 32 bytes), and 65536-bit (8 KiB). The 256-bit Bloom filter implements two hash functions (k=2), and the 65536-bit one implements four hash functions (k=4).

Example

const N = 100
b := NewBloom(goestimators.Size256)

// Add N random 64-bit numbers to the set
var i uint64
buf := make([]byte, 8)
for i = 0; i < N; i++ {
	uint64ToBytes(i, buf)
	b.Observe(buf)
}

// Test for existance of N+10 elements
notfound := 0
for i = 0; i < N+10; i++ {
	uint64ToBytes(i, buf)
	if !b.Check(buf) {
		notfound++
	}
}

goestimators's People

Contributors

ivoras avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

xiaoconstantine

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.