twitter / algebird Goto Github PK

View Code? Open in Web Editor NEW

2.3K 235.0 346.0 9.34 MB

Abstract Algebra for Scala

Home Page: https://twitter.github.io/algebird

License: Apache License 2.0

Scala 97.15% Java 1.79% Ruby 1.06%

algebird's Introduction

Algebird

Overview

Abstract algebra for Scala. This code is targeted at building aggregation systems (via Scalding or Apache Storm). It was originally developed as part of Scalding's Matrix API, where Matrices had values which are elements of Monoids, Groups, or Rings. Subsequently, it was clear that the code had broader application within Scalding and on other projects within Twitter.

See the Algebird website for more information.

What can you do with this code?

> sbt algebird-core/console

Welcome to Scala 2.12.14 (OpenJDK 64-Bit Server VM, Java 11.0.1).
Type in expressions for evaluation. Or try :help.

scala> import com.twitter.algebird._
import com.twitter.algebird._

scala> import com.twitter.algebird.Operators._
import com.twitter.algebird.Operators._

scala> Map(1 -> Max(2)) + Map(1 -> Max(3)) + Map(2 -> Max(4))
res0: scala.collection.immutable.Map[Int,com.twitter.algebird.Max[Int]] = Map(2 -> Max(4), 1 -> Max(3))

In the above, the class Max[T] signifies that the + operator should actually be max (this is accomplished by providing an implicit instance of a typeclass for Max that handles +).

Model a wide class of "reductions" as a sum on some iterator of a particular value type. For example, average, moving average, max/min, set union, approximate set size (in much less memory with HyperLogLog), approximate item counting (using CountMinSketch).
All of these combine naturally in tuples, vectors, maps, options and more standard scala classes.
Implementations of Monoids for interesting approximation algorithms, such as Bloom filter, HyperLogLog and CountMinSketch. These allow you to think of these sophisticated operations like you might numbers, and add them up in hadoop or online to produce powerful statistics and analytics.

Documentation

To learn more and find links to tutorials and information around the web, check out the Algebird website.

The latest API docs are hosted on Algebird's ScalaDoc index.

Get Involved + Code of Conduct

Pull requests and bug reports are always welcome! Check out our Contributing guide for information on what we most need help with and how you can get started contributing.

Discussion occurs primarily on the Algebird Gitter channel:

We also monitor the Algebird mailing list.

Issues should be reported on the GitHub issue tracker.

We use a lightweight form of project governance inspired by the one used by Apache projects.

Please see Contributing and Committership for our code of conduct and our pull request review process.

The TL;DR is send us a pull request, iterate on the feedback + discussion, and get a +1 from a Committer in order to get your PR accepted.

The current list of active committers (who can +1 a pull request) can be found here: Committers

A list of contributors to the project can be found here: Contributors

Maven

Algebird modules are available on maven central. The current groupid and version for all modules is, respectively, "com.twitter" and 0.13.5.

See Algebird's page on the Scaladex for information on all published artifacts and their associated Scala versions. Algebird currently supports Scala 2.10, 2.11 and 2.12.

Projects using Algebird

Scalding
Spark
Simmer
Summingbird
Scio
Packetloop (see this tweet)
Ebay uses Algebird for machine learning: ScalaDays talk
Apple (FEAR Team)

Other projects built with Algebird, as compiled by the Scaladex:

Authors

Oscar Boykin http://twitter.com/posco
Avi Bryant http://twitter.com/avibryant
Edwin Chen http://twitter.com/echen
ellchow http://github.com/ellchow
Mike Gagnon https://twitter.com/gmike
Moses Nakamura https://twitter.com/mnnakamura
Steven Noble http://twitter.com/snoble
Sam Ritchie http://twitter.com/sritchie
Ashutosh Singhal http://twitter.com/daashu
Argyris Zymnis http://twitter.com/argyris

License

Licensed under the Apache License, Version 2.0.

Thanks to Yourkit

YourKit supports open source projects with innovative and intelligent tools for monitoring and profiling Java and .NET applications. YourKit is the creator of YourKit Java Profiler, YourKit .NET Profiler, and YourKit YouMonitor.

algebird's People

Contributors

Stargazers

Watchers

Forkers

johnynek ealliaume yuzeh echen singhala mosesn avibryant snoble imaxxs ellchow mikegagnon azymnis steamshon krishnanraman aaron-siegel splittingfield bkirwi tixxit rjhall dreamfrog alfa07 weilinear genba dleen ccsevers oztc wlue isaka lashchenko adamilardi david206 rlotun fiq reconditesea smarden1 anujsrc mansurashraf stephanh lijinhui olegych jcoveney aniket486 tsdeng mediamath ai-align crvidya uservidya michael-natapov prasadwagle hansmire eshioji sigmoidanalytics gigamonkey sagemintblue manuzhang cheecheeo xuanhan863 apsaltis ddccffvv travisbrown allenday cfregly debop julienledem bdacode themodernlife rgcase alkhaled evanvolgas adriaanm complexqubit tripple-drops lkjx77 banno securapawn coderxiang solidm lucentcosmos alvinhom debasish83 bitisony emaxerrno rnadrag lvicentesanchez cjwebb vineethvarghese yingjiemiao fengjian428 zhangmiao97 agiledon mark80 asciola berngp zacharyvoase danielhfrank mlmanapat jinlee fignewtons miguelperalvo daniellesucher

algebird's Issues

Use Spire's typeclasses.

Not trolling. The definitions of Semigroup, Monoid, VectorSpace etc in https://github.com/non/spire are not obviously incompatible with us, and it might be nicer for the broader Scala community if we could unify.

FlowerFilter monoid

http://eng.42go.com/a-simple-time-decaying-approximate-membership-filter/

This stores each item T in each of it's hash bucket indices (so it is considerably larger than a normal bloom filter), but it considers something present if t: T exists in one of the hashed locations.

This is interesting because it can only remember recent items, which is often useful for some streaming join, or some decayed approximation algorithm.

ApproximateBoolean Confidence Incorrectness

Consider:

ApproximateBoolean(false, 0.6) && ApproximateBoolean(false, 0.4)

This yields ApproximateBoolean(false, 0.6), which is actually inaccurate. The false is correct, but the probability is low. The actual probability should be as follows: (0.6 + 0.4) - (0.6 * 0.4) = 0.76. Which is to say, the probability that the left is false, or the right is false (non-exclusively).

There is a similar problem with the || operation.

The implementations are nearly correct, they just require a bit of refinement. Here is a corrected && operator (bonus: also much faster!):

  def &&(that: ApproximateBoolean): ApproximateBoolean = {
    if(isTrue && that.isTrue) {
      //We need both to be correct:
      ApproximateBoolean(true, withProb * that.withProb)
    } else if (this.isTrue || that.isTrue) {
      ApproximateBoolean(false, if (!this.isTrue) withProb else that.withProb)
    } else {
      ApproximateBoolean(false, this.withProb + that.withProb - (this.withProb * that.withProb))
    }
  }

The || operator should be analogous.

PairMonoid/Semigroup to combine Set/Hyperloglog

Here is the idea, two functions:

sealed abstract class EitherAlgebra[+L,+R]
case class LeftAlg[+L](get: L) extends EitherAlgebra[L, Nothing]
case class RightAlg[+R](get: R) extends EitherAlgebra[Nothing,R]

//Then we have a monoid/semigroup on L,R and:

def monoid(shouldConvert: (R) => Boolean, convert: (R) => L): Monoid[EitherAlgebra[L,R]]

Then, shouldConvert(s: Set[T]) = s.size > threshold, and convert(s: Set[T]): HLLInstance = iterate over all the items and insert them.

In this way, we can make a monoid that is exact when size is small, and converts to approximate when we cross some size boundary (say about the size of the serialized hyperloglog).

We can do the same with count-min-sketch with a Map[K,Long]. The pattern is: use the exact data structure when the size is less than the approximate, and then switch on demand.

sortedSum implementation

Could this be useful inside of Monoid?

import scala.util.Sorting.stableSort

def sortedSum[T: Ordering: Manifest](vs: Seq[T]): T = Monoid.sum(stableSort(vs))

Generalize Sparse/Dense vector

The recent refactoring of HLL does almost exactly what I am contemplating doing with MinHash, except that HLL is Map[Int,Max[Byte]] and I would want Map[Int,Min[Int]]. I'm wondering if we want to generalize to a OptionallySparseVector[V], with an OptionallySparseVectorMonoid[V:Monoid]. (Better name would be appreciated).

trait Approximate

As Oscar pointed out in #44, we're starting to get a lot of approximate collections. We tend to implement them right now as returning, say, a Long, and then having various other methods for describing the degree of error. Maybe instead they should be returning something like

trait Approximate[T:Numeric] {
   def estimate : T
   def max : T
   def min : T
   def probWithinBounds : Double
}

Approximate collections API.

There is something going on with some of the data-structures we have here: BloomFilter is a kind of approximate Set. HyperLogLog is an approximate set with Cardinality. Count-min-sketch is an approximate Map (so Count-min-sketch has an approximate set on the keys).

I think we could define a few traits to represent these concepts.

Something like:

trait ApproximateSet[V] {
  def mayContain(v: V): Boolean
  def containsFalsePosProb: Double
  def containsFalseNegProb: Double
}

In fact, most of these have false neg prob == 0.0, so maybe we should encode that in the type (for cases where you want to be sure you don't miss anything).

trait ApproxAcceptingSet[V] extends ApproximateSet[V] {
  final def containsFalseNegProb = 0.0
}

Maybe we want to punt and avoid the returning the false negatives, and if we want to support set union (which these as monoids would have):

trait ApproximateSet[V,S<:ApproximateSet[V,S]] {
  // if we return false, the item is definitely NOT in this set.
  def mayContain(v: V): Boolean
  // f-bounded polymorphism to preserve the types:
  def union(that: S): S
}

For size, maybe something like:

trait ApproximateCardinality {
  def approxSize: Long
}

For Count-min-sketch, it looks like we need something like:

trait ApproximateMap[K,V] {
  def approxValue(k: K): Option[V]
}

In this view, I think CMS extends ApproximateMap, BloomFilter extends ApproximateSet, and HyperLogLog extends ApproximateSet with ApproximateCardinality

Add Countable[T]

Not totally sure about this:

Think it is:

trait Countable[T] {
  def next(t: T): Option[T]
  def prev(t: T): Option[T]
}

an infinite T might never return None, but Int, Long, Double etc instances would.

Deal with Abelian/Commutative objects better.

There are often optimizations the can be done if a semigroup/monoid/group is commutative.

How can we best communicate that?

Two obvious solutions: a property of the instance (.isCommutative). This rules out the ability to have some methods that only apply to commutative objects (because that check is at runtime).

A second type-class approach would be something like:

// use this to check at runtime:
sealed trait CommutativeStatus[O] {
  def commutes: Boolean
}
trait IsNotCommutative[O] extends CommutativeStatus[O] { val commutes = false }
// use this when you want to require commutativity
trait IsCommutative[O] extends CommutativeStatus[O] { val commutes = true }

trait LowP {
  implicit def defaultIsNot[O]: CommutativeStatus new IsNotCommutative[O] { }
}

object CommutativeStatus extends LowP {
  implicit def commutes[O](implicit isc: IsCommutative[O]): CommutativeStatus = isc
}

We can add instances of IsCommutative to the Monoid/Group/Semigroup objects.

Use sbt-boilerplate instead of custom codegen

https://github.com/sbt/sbt-boilerplate

That might be cleaner to generate the tuple/product algebras.

Count-min-sketch on any Monoid

You can think of Count-min-sketch as an approximate: Map[Long,Long].

The algorithm seems to generalize to Map[Long,V : Ordering : Monoid] because we use the count (monoid) and the min (ordering).

If we include a 64 bit hash from K => Long, then we get

SketchMap[K, V](implicit keyhash: Hashable[K,Long], monoidv: Monoid[V], ord: Ordering[V])

And generalize the CMS code in the obvious way.

Should we remove CMS and move to the above? I don't know.

More Hashable discussion

It seemed like there was some progress here: #35

Did anything get decided? It would be nice to be able to use other types in bloomfilter (and implement a bloom join).

I'd be happy to help out with it but I don't want to step on anything in progress.

algebird-serialization

This module should contain thrift mappings for all Algebird case classes. Out of this we'll publish scrooge and java jars of these structs.

Cokleisli star operators

Algebird is much more than just the Monoid trait now. Isn't it time we added the Cokleisli star operators?

Algebird should have a "testing" submodule.

This would allow libraries that use Algebird to write scalacheck tests using the laws.

Add BigInt/BigInteger Ring

Useful in cases where we might need very large exact counts.

Chernoff test

We should write some reusable code in algebird-test to have a scalacheck property that is true with probability at least P. Then, using the chernoff bound, we should verify that this claim is true with a very high probability (something like p=10^{-9}), so that the tests will almost always pass.

Update readme

Algebird is now much more than just the Monoid trait. Anyone want to take a stab at updating the readme with all the algebird goodness?

algebird-util depends on algebird-test

Is this intentional?
As it is now, anything that depends on algebird somewhere pulls in algebird-test, and transitively, its dependencies: specs2 and scalacheck, onto the runtime classpath.

Add join method to Aggregator object

We need to add something like:

object Aggregator {
  def join2(ag1: Aggregator[A1,B1,C1], ag2: Aggregator[A2,B2,C2]):
    Aggregator[(A1,A2),(B1,B2),(C1,C2)] = new Aggregator[(A1,A2),(B1,B2),(C1,C2)] {
    def prepare(a: (A1,A2)) = (ag1.prepare(a._1), ag2.prepare(a._2))
    // etc..., to bad this kind of lifting is not easier to do automatically. I don't see how...
  }
  // and so on for join3, join4, up to join22.  Ideally these are all code-gen in scala.
}

InvertibleAffineFn

Once we pull in a Bijection dependency,

case class InvertibleAffineFunction[R](slope: R, intercept: R)(implicit field: Field[R])
  extends Bijection[R,R] {
  lazy val toFn: Function1[R,R] = {x => this.apply(x) }
  override def apply(x: R): R = field.plus(field.times(slope, x), intercept)
  override def invert(x: R): R = field.minus(field.div(x, slope), field.div(intercept, slope))
  def andThen(other: InvertibleAffineFunction[R]): InvertibleAffineFunction[R] =
    InvertibleAffineFunction[R](slope * other.slope, intercept + (slope * other.intercept))
  def compose(other: InvertibleAffineFunction[R]): InvertibleAffineFunction[R] = andThen(other)
}

add murmur hash to HyperLogLog

Add adjoin trick to add identity to pseduo-rings

http://en.wikipedia.org/wiki/Pseudo-ring#Adjoining_an_identity_element

What we want is something like:

// in Ring object:
  def adjoinOne[T](implicit pseudoring: Ring[T]): Ring[(BigInt, T)] = new Ring[(BigInt,T)] {
    lazy val one = (BigInt(1), pseudoring.zero)
    lazy val zero = (BigInt(0), pseudoring.zero)
   // and follow the rules above.
  }

This would allow us to use a Map as a true ring, not a pseudo-ring, which could be useful in Matrix.scala (but honestly, this is mostly a correctness issue, I don't think matrix.scala often assumes ring.one exists).

PoissonDecayedValue

For counts-based data, this may be a better approach than the current exponential decay. See http://word.bitly.com/post/41284219720/forget-table

publishing 0.1.12

Any last minute (binary compatible) fixes should be submitted in the next 24 hours.

Goal is to publish on Thursday.

Use scala.math.Equiv[T] where we need generic equivalence.

Range queries for CountMinSketch

See eg http://www.cse.cuhk.edu.hk/~taoyf/course/5705f10/lec8.pdf for how this works.

I have a basic working implementation of the DyadicRange logic (for positive integer keys, which you can map whatever else to), but it needs to be hooked up to levels instances of CMS. We also want to implement binary search using this to get quantile queries.

case class DyadicRange(maxValue : Long = Long.MaxValue) {
  val levels = math.ceil(math.log(maxValue) / math.log(2)).toInt

  def indicesForPoint(v : Long) = (1 to levels).map{level => (level, indexForPoint(v, level))}
  def indicesForRange(start : Long, end : Long)  : List[(Int,Long)] = indicesForRange(start, end, levels)

  def indexForPoint(v : Long, level : Int) = v >> (level - 1)
  def rangeForIndex(i : Long, level : Int) = (i << (level-1), ((i+1) << (level-1)) - 1)

  def indicesForRange(start : Long, end : Long, maxLevel : Int) : List[(Int,Long)] = {
    if(start > end) {
      Nil
    } else if(start == end) {
      List((1, start))
    } else {
      val startIndex = indexForPoint(start, maxLevel)
      val (a,b) = rangeForIndex(startIndex, maxLevel)
      if(a >= start) {
        if(b <= end) {
          List((maxLevel, startIndex)) ++ indicesForRange(start, a - 1, maxLevel) ++ indicesForRange(b + 1, end, maxLevel) 
        } else {
          indicesForRange(start, end, maxLevel - 1)
        }
      } else {
        val (a2, b2) = rangeForIndex(startIndex + 1, maxLevel)
        if(b2 <= end && a2 <= end) {
          List((maxLevel, startIndex + 1)) ++ indicesForRange(start, a2 - 1, maxLevel - 1) ++ indicesForRange(b2 + 1, end, maxLevel)
        } else {
          indicesForRange(start, end, maxLevel - 1)
        }
      }
    }
  }

}

Flaky test in HyperLogLog

I got the following failure, unrelated to any code change that I did:

[error] x HyperLogLog should
[info]   + count with 5-bits
[error]   x count with 6-bits
[error]     0.3370433439011089 is not less than 0.325 (HyperLogLogTest.scala:56)

Specialize Max, Min, First and Last

Max, Min, First and Last should be specialized on the usual primitives (which are the salient use cases) in order to minimize (un-)boxing overhead

Evaluate function for lifting into MapAlgebra

  protected def mergeLookup[T,U]
  (keys: Iterable[T], lookup: (T) => Option[V], present: (T) => U): Map[U,V] = {
    val kvPairs = keys map { key =>
      val v = lookup(key) getOrElse monoid.zero
      (present(key), v): (U,V)
    }
    MapAlgebra.sum(kvPairs)
  }

SummingQueue fails w/ capacity of 0

scala> SummingQueue[Int](0)
java.lang.IllegalArgumentException
    at java.util.concurrent.ArrayBlockingQueue.<init>(ArrayBlockingQueue.java:169)
    at SummingQueue.<init>(<console>:27)
    at SummingQueue$.apply(<console>:21)
    at .<init>(<console>:22)
    at .<clinit>(<console>)
    at .<init>(<console>:11)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:704)
    at scala.tools.nsc.interpreter.IMain$Request$$anonfun$14.apply(IMain.scala:920)
    at scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
    at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
    at java.lang.Thread.run(Thread.java:680)

Add Equatable[T] typeclass and use it where needed

This will be of use in tests and elsewhere.

trait Equatable[T] extends Function2[T,T,Boolean] {
  def apply(left: T, right: T): Boolean
}

object Equatable {
  def apply[T](left: T, right:T)(implicit eq: Equatable[T]): Boolean = eq(left, right)
  def from[T](fn :(T,T) => Boolean) = new Equatable[T] { def apply(l: T, r:T) = fn(l,r) }

  //Make usual ones for numbers and all the containers we use in groups/monoids/etc...

  // Might be useful to have something like:
  def withEps[T:Metric](eps: Double) = from[T] { Metric(_,_) <= eps } 
}

Ring.mapRing => diverging chain of implicits.

Starting here: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/Ring.scala#L104

This seems to preclude some stuff using the scalding matrix API, for example templated functions which operate on generic matrices and do multiplication. I guess the reason is that the compiler gives up trying to figure out all possible kinds of values that can be in the matrix and still lead to a valid function.

Some code duplication in tests

Currently a lot of the tests contain code for doing approximate equality of maps, lists, etc. It would be nice to move these to a common trait.

Count-Min-Sketch Heavy Hitters duplication.

When adding two CountMinSketch objects, is it intentional that the heavy hitters are duplicated when they're summed using the monoid? The heavyHitters method on CMS is correct because it filters out. Here's an example where I add two CMS objects, and inspect the heavy hitters internal representation.

scala> import com.twitter.algebird._
import com.twitter.algebird._

scala> val monoid = CMS.monoid(0.01, 0.01, 0, 0.01)
monoid: com.twitter.algebird.CountMinSketchMonoid = com.twitter.algebird.CountMinSketchMonoid@3219ab8d

scala> monoid.create(Seq(1L, 1L, 1L))
res0: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> monoid.create(Seq(1L, 1L, 1L, 1L))
res1: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> monoid.plus(res0, res1)
res2: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> res2.asInstanceOf[CMSInstance].heavyHitters
res3: Set[Long] = TreeSet(1)

scala> res2.asInstanceOf[CMSInstance].hhs
res4: com.twitter.algebird.HeavyHitters = HeavyHitters(TreeSet(HeavyHitter(1,3), HeavyHitter(1,4)))

I would expect that the internal representation of hhs would remove the duplicate value. Since I serialize the data, it becomes a major waste of space storing a bunch of old values. For now, I just remove duplicate 'item' HeavyHitter keys when serialized, but I was wondering if this was intentional, since I can see that being the case for faster summing.

Count-Min-Sketch Monoid yields inaccurate frequency.

scala> import com.twitter.algebird._
import com.twitter.algebird._

scala> implicit val monoid = CMS.monoid(0.01, 0.01, 0, 0.01)
monoid: com.twitter.algebird.CountMinSketchMonoid = com.twitter.algebird.CountMinSketchMonoid@56fb2ac4

scala> val two = monoid.create(Seq(1L, 1L))
two: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
scala> val three = monoid.create(Seq(1L, 1L, 1L))
three: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
scala> two.frequency(1L)
res0: com.twitter.algebird.Approximate[Long] = Approximate(2,2,2,0.99)

scala> three.frequency(1L)
res1: com.twitter.algebird.Approximate[Long] = Approximate(3,3,3,0.99)

scala> monoid.plus(two, three)
res2: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> res2.frequency(1L)
res3: com.twitter.algebird.Approximate[Long] = Approximate(2,2,2,0.99)

scala> res2.asInstanceOf[CMSInstance].hhs
res4: com.twitter.algebird.HeavyHitters = HeavyHitters(TreeSet(HeavyHitter(1,2), HeavyHitter(1,3)))

Summing two and three should give me '5' as the frequency, but it's giving me two. Might be related to the duplicated Heavy Hitters issue.

Use a MurmurHash instead of MD5 for HyperLogLog

Cryptographic hashes are often slower than non-cryptographic. And MD5 suffers from known collision problems.

I've been using MurmurHash in my hyperloglog impl with good results. Scala is not my language of choice, but I'd provide a patch if you guys wanted.

Add monoids for other streaming algorithms.

Particularly Count-min sketch:

https://github.com/clearspring/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java

Add implicitNotFound to type classes

http://suereth.blogspot.com/2011/03/annotate-your-type-classes.html

We should put something like that on Semigroup/Monoid/Ring/Field/Metric/VectorSpace.

Add a way to signal + commutative, * is commutative.

Here is an idea:

// sigil trait
trait CommutativePlus

// sigil trait
trait CommutativeTimes

Then we can add with CommutativePlus to most of the Monoids/Semigroups/etc... and we can use isInstanceOf to test at runtime, and type constraints to check at compile-time.

Implement FenwickTreeGroup

http://en.wikipedia.org/wiki/Fenwick_tree

Use SortedMap[K,V: Group] and make a group on SortedMap[K,V: Group] such that at K you have the sum(k <= K) V_k

Then, add a method on the Group object to get the sum over a range by getting the sum at the lower bound and subtracting that from the value at the upper bound of the range.

Count-min-sketch on any T + Serialization

CountMinSketch should work on any T with a Hashable or at least %> Array[Byte](to use an injection/bijection).

Also, it should be easy to serialize the count-min-sketch (this complicates always including the heavy-hitters.

Burst Detection Monoid

I'd like to see a monoid for detecting bursts in streams of data:

http://www.cs.cornell.edu/home/kleinber/bhs.pdf

/cc @johnynek

Linear bloom filters

See http://comonad.com/reader/2008/linear-bloom-filters/:

So to recap, we took a normal (or counted or spectral) Bloom filter, crossbred it with Litwin's linear hash table and found that the mutant offspring is an approximation of a set that is better suited to sharing over the network than either structure alone, with a memory usage profile similar to that of a linear hash table. Interestingly as a side effect you can go one step further and allow for transmission of a requested subset of the exact hashes present in which case we've really only used the Blooms to provide partial information about the underlying linear hash table, which can aid in the subsequent join process.

And yes, they are probably better named something like Bloomed linear hash tables, but that doesn't roll off the tongue.

We should implement Q-Digest for efficient quantile estimations - this is probably better than using CMS as in #58. For reference, there's a Java implementation here:
https://github.com/clearspring/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/QDigest.java

Add Bounded[T]

// string/list are instance here, 
trait LowerBounded[T] {
  def ordering: Ordering[T]
  def min: T
}


trait UpperBounded[T] {
  def ordering: Ordering[T]
  def min: T
}

// primitive types are instances here
trait Bounded[T] {
  def ordering: Ordering[T]
  def min: T
  def max: T
}

// Make an implicit Lower/Upper if you have a Bounded (don't subclass)