Giter VIP home page Giter VIP logo

algebird's Introduction

Algebird

Build status Codecov branch Latest version Chat

Overview

Abstract algebra for Scala. This code is targeted at building aggregation systems (via Scalding or Apache Storm). It was originally developed as part of Scalding's Matrix API, where Matrices had values which are elements of Monoids, Groups, or Rings. Subsequently, it was clear that the code had broader application within Scalding and on other projects within Twitter.

See the Algebird website for more information.

What can you do with this code?

> sbt algebird-core/console

Welcome to Scala 2.12.14 (OpenJDK 64-Bit Server VM, Java 11.0.1).
Type in expressions for evaluation. Or try :help.

scala> import com.twitter.algebird._
import com.twitter.algebird._

scala> import com.twitter.algebird.Operators._
import com.twitter.algebird.Operators._

scala> Map(1 -> Max(2)) + Map(1 -> Max(3)) + Map(2 -> Max(4))
res0: scala.collection.immutable.Map[Int,com.twitter.algebird.Max[Int]] = Map(2 -> Max(4), 1 -> Max(3))

In the above, the class Max[T] signifies that the + operator should actually be max (this is accomplished by providing an implicit instance of a typeclass for Max that handles +).

  • Model a wide class of "reductions" as a sum on some iterator of a particular value type. For example, average, moving average, max/min, set union, approximate set size (in much less memory with HyperLogLog), approximate item counting (using CountMinSketch).
  • All of these combine naturally in tuples, vectors, maps, options and more standard scala classes.
  • Implementations of Monoids for interesting approximation algorithms, such as Bloom filter, HyperLogLog and CountMinSketch. These allow you to think of these sophisticated operations like you might numbers, and add them up in hadoop or online to produce powerful statistics and analytics.

Documentation

To learn more and find links to tutorials and information around the web, check out the Algebird website.

The latest API docs are hosted on Algebird's ScalaDoc index.

Get Involved + Code of Conduct

Pull requests and bug reports are always welcome! Check out our Contributing guide for information on what we most need help with and how you can get started contributing.

Discussion occurs primarily on the Algebird Gitter channel: Chat

We also monitor the Algebird mailing list.

Issues should be reported on the GitHub issue tracker.

We use a lightweight form of project governance inspired by the one used by Apache projects.

Please see Contributing and Committership for our code of conduct and our pull request review process.

The TL;DR is send us a pull request, iterate on the feedback + discussion, and get a +1 from a Committer in order to get your PR accepted.

The current list of active committers (who can +1 a pull request) can be found here: Committers

A list of contributors to the project can be found here: Contributors

Maven

Algebird modules are available on maven central. The current groupid and version for all modules is, respectively, "com.twitter" and 0.13.5.

See Algebird's page on the Scaladex for information on all published artifacts and their associated Scala versions. Algebird currently supports Scala 2.10, 2.11 and 2.12.

Projects using Algebird

Other projects built with Algebird, as compiled by the Scaladex: Scaladex Dependents

Authors

License

Copyright 2016 Twitter, Inc.

Licensed under the Apache License, Version 2.0.

Thanks to Yourkit

YourKit supports open source projects with innovative and intelligent tools for monitoring and profiling Java and .NET applications. YourKit is the creator of YourKit Java Profiler, YourKit .NET Profiler, and YourKit YouMonitor.

algebird's People

Contributors

avi-stripe avatar avibryant avatar azymnis avatar daniellesucher avatar dependabot[bot] avatar df-stripe avatar erik-stripe avatar erikerlandson avatar hansmire avatar ianoc avatar isnotinvain avatar jcoveney avatar johnynek avatar julienledem avatar koertkuipers avatar mansurashraf avatar mosesn avatar nevillelyh avatar oscar-stripe avatar reconditesea avatar regadas avatar rgcase avatar scala-steward avatar sid-kap avatar singhala avatar smarden1 avatar sritchie avatar sritchie-stripe avatar stephanh avatar wlue avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

algebird's Issues

ApproximateBoolean Confidence Incorrectness

Consider:

ApproximateBoolean(false, 0.6) && ApproximateBoolean(false, 0.4)

This yields ApproximateBoolean(false, 0.6), which is actually inaccurate. The false is correct, but the probability is low. The actual probability should be as follows: (0.6 + 0.4) - (0.6 * 0.4) = 0.76. Which is to say, the probability that the left is false, or the right is false (non-exclusively).

There is a similar problem with the || operation.

The implementations are nearly correct, they just require a bit of refinement. Here is a corrected && operator (bonus: also much faster!):

  def &&(that: ApproximateBoolean): ApproximateBoolean = {
    if(isTrue && that.isTrue) {
      //We need both to be correct:
      ApproximateBoolean(true, withProb * that.withProb)
    } else if (this.isTrue || that.isTrue) {
      ApproximateBoolean(false, if (!this.isTrue) withProb else that.withProb)
    } else {
      ApproximateBoolean(false, this.withProb + that.withProb - (this.withProb * that.withProb))
    }
  }

The || operator should be analogous.

Update readme

Algebird is now much more than just the Monoid trait. Anyone want to take a stab at updating the readme with all the algebird goodness?

Evaluate function for lifting into MapAlgebra

  protected def mergeLookup[T,U]
  (keys: Iterable[T], lookup: (T) => Option[V], present: (T) => U): Map[U,V] = {
    val kvPairs = keys map { key =>
      val v = lookup(key) getOrElse monoid.zero
      (present(key), v): (U,V)
    }
    MapAlgebra.sum(kvPairs)
  }

InvertibleAffineFn

Once we pull in a Bijection dependency,

case class InvertibleAffineFunction[R](slope: R, intercept: R)(implicit field: Field[R])
  extends Bijection[R,R] {
  lazy val toFn: Function1[R,R] = {x => this.apply(x) }
  override def apply(x: R): R = field.plus(field.times(slope, x), intercept)
  override def invert(x: R): R = field.minus(field.div(x, slope), field.div(intercept, slope))
  def andThen(other: InvertibleAffineFunction[R]): InvertibleAffineFunction[R] =
    InvertibleAffineFunction[R](slope * other.slope, intercept + (slope * other.intercept))
  def compose(other: InvertibleAffineFunction[R]): InvertibleAffineFunction[R] = andThen(other)
}

trait Approximate

As Oscar pointed out in #44, we're starting to get a lot of approximate collections. We tend to implement them right now as returning, say, a Long, and then having various other methods for describing the degree of error. Maybe instead they should be returning something like

trait Approximate[T:Numeric] {
   def estimate : T
   def max : T
   def min : T
   def probWithinBounds : Double
}

Add Equatable[T] typeclass and use it where needed

This will be of use in tests and elsewhere.

trait Equatable[T] extends Function2[T,T,Boolean] {
  def apply(left: T, right: T): Boolean
}

object Equatable {
  def apply[T](left: T, right:T)(implicit eq: Equatable[T]): Boolean = eq(left, right)
  def from[T](fn :(T,T) => Boolean) = new Equatable[T] { def apply(l: T, r:T) = fn(l,r) }

  //Make usual ones for numbers and all the containers we use in groups/monoids/etc...

  // Might be useful to have something like:
  def withEps[T:Metric](eps: Double) = from[T] { Metric(_,_) <= eps } 
}

Count-min-sketch on any T + Serialization

CountMinSketch should work on any T with a Hashable or at least %> Array[Byte](to use an injection/bijection).

Also, it should be easy to serialize the count-min-sketch (this complicates always including the heavy-hitters.

Cokleisli star operators

Algebird is much more than just the Monoid trait now. Isn't it time we added the Cokleisli star operators?

Implement FenwickTreeGroup

http://en.wikipedia.org/wiki/Fenwick_tree

Use SortedMap[K,V: Group] and make a group on SortedMap[K,V: Group] such that at K you have the sum(k <= K) V_k

Then, add a method on the Group object to get the sum over a range by getting the sum at the lower bound and subtracting that from the value at the upper bound of the range.

Add Bounded[T]

// string/list are instance here, 
trait LowerBounded[T] {
  def ordering: Ordering[T]
  def min: T
}


trait UpperBounded[T] {
  def ordering: Ordering[T]
  def min: T
}

// primitive types are instances here
trait Bounded[T] {
  def ordering: Ordering[T]
  def min: T
  def max: T
}

// Make an implicit Lower/Upper if you have a Bounded (don't subclass)

Deal with Abelian/Commutative objects better.

There are often optimizations the can be done if a semigroup/monoid/group is commutative.

How can we best communicate that?

Two obvious solutions: a property of the instance (.isCommutative). This rules out the ability to have some methods that only apply to commutative objects (because that check is at runtime).

A second type-class approach would be something like:

// use this to check at runtime:
sealed trait CommutativeStatus[O] {
  def commutes: Boolean
}
trait IsNotCommutative[O] extends CommutativeStatus[O] { val commutes = false }
// use this when you want to require commutativity
trait IsCommutative[O] extends CommutativeStatus[O] { val commutes = true }

trait LowP {
  implicit def defaultIsNot[O]: CommutativeStatus new IsNotCommutative[O] { }
}

object CommutativeStatus extends LowP {
  implicit def commutes[O](implicit isc: IsCommutative[O]): CommutativeStatus = isc
}

We can add instances of IsCommutative to the Monoid/Group/Semigroup objects.

More Hashable discussion

It seemed like there was some progress here: #35

Did anything get decided? It would be nice to be able to use other types in bloomfilter (and implement a bloom join).

I'd be happy to help out with it but I don't want to step on anything in progress.

Linear bloom filters

See http://comonad.com/reader/2008/linear-bloom-filters/:

So to recap, we took a normal (or counted or spectral) Bloom filter, crossbred it with Litwin's linear hash table and found that the mutant offspring is an approximation of a set that is better suited to sharing over the network than either structure alone, with a memory usage profile similar to that of a linear hash table. Interestingly as a side effect you can go one step further and allow for transmission of a requested subset of the exact hashes present in which case we've really only used the Blooms to provide partial information about the underlying linear hash table, which can aid in the subsequent join process.

And yes, they are probably better named something like Bloomed linear hash tables, but that doesn't roll off the tongue.

Add Countable[T]

Not totally sure about this:

Think it is:

trait Countable[T] {
  def next(t: T): Option[T]
  def prev(t: T): Option[T]
}

an infinite T might never return None, but Int, Long, Double etc instances would.

Add join method to Aggregator object

We need to add something like:

object Aggregator {
  def join2(ag1: Aggregator[A1,B1,C1], ag2: Aggregator[A2,B2,C2]):
    Aggregator[(A1,A2),(B1,B2),(C1,C2)] = new Aggregator[(A1,A2),(B1,B2),(C1,C2)] {
    def prepare(a: (A1,A2)) = (ag1.prepare(a._1), ag2.prepare(a._2))
    // etc..., to bad this kind of lifting is not easier to do automatically. I don't see how...
  }
  // and so on for join3, join4, up to join22.  Ideally these are all code-gen in scala.
}

Generalize Sparse/Dense vector

The recent refactoring of HLL does almost exactly what I am contemplating doing with MinHash, except that HLL is Map[Int,Max[Byte]] and I would want Map[Int,Min[Int]]. I'm wondering if we want to generalize to a OptionallySparseVector[V], with an OptionallySparseVectorMonoid[V:Monoid]. (Better name would be appreciated).

Add DecayedVector class

Use the new VectorSpace to make an analog of DecayedValue except on a VectorSpace[Double,C] for all container spaces C. Would be of use for IndexedSeq and Map's.

Add adjoin trick to add identity to pseduo-rings

http://en.wikipedia.org/wiki/Pseudo-ring#Adjoining_an_identity_element

What we want is something like:

// in Ring object:
  def adjoinOne[T](implicit pseudoring: Ring[T]): Ring[(BigInt, T)] = new Ring[(BigInt,T)] {
    lazy val one = (BigInt(1), pseudoring.zero)
    lazy val zero = (BigInt(0), pseudoring.zero)
   // and follow the rules above.
  }

This would allow us to use a Map as a true ring, not a pseudo-ring, which could be useful in Matrix.scala (but honestly, this is mostly a correctness issue, I don't think matrix.scala often assumes ring.one exists).

algebird-util depends on algebird-test

Is this intentional?
As it is now, anything that depends on algebird somewhere pulls in algebird-test, and transitively, its dependencies: specs2 and scalacheck, onto the runtime classpath.

Range queries for CountMinSketch

See eg http://www.cse.cuhk.edu.hk/~taoyf/course/5705f10/lec8.pdf for how this works.

I have a basic working implementation of the DyadicRange logic (for positive integer keys, which you can map whatever else to), but it needs to be hooked up to levels instances of CMS. We also want to implement binary search using this to get quantile queries.

case class DyadicRange(maxValue : Long = Long.MaxValue) {
  val levels = math.ceil(math.log(maxValue) / math.log(2)).toInt

  def indicesForPoint(v : Long) = (1 to levels).map{level => (level, indexForPoint(v, level))}
  def indicesForRange(start : Long, end : Long)  : List[(Int,Long)] = indicesForRange(start, end, levels)

  def indexForPoint(v : Long, level : Int) = v >> (level - 1)
  def rangeForIndex(i : Long, level : Int) = (i << (level-1), ((i+1) << (level-1)) - 1)

  def indicesForRange(start : Long, end : Long, maxLevel : Int) : List[(Int,Long)] = {
    if(start > end) {
      Nil
    } else if(start == end) {
      List((1, start))
    } else {
      val startIndex = indexForPoint(start, maxLevel)
      val (a,b) = rangeForIndex(startIndex, maxLevel)
      if(a >= start) {
        if(b <= end) {
          List((maxLevel, startIndex)) ++ indicesForRange(start, a - 1, maxLevel) ++ indicesForRange(b + 1, end, maxLevel) 
        } else {
          indicesForRange(start, end, maxLevel - 1)
        }
      } else {
        val (a2, b2) = rangeForIndex(startIndex + 1, maxLevel)
        if(b2 <= end && a2 <= end) {
          List((maxLevel, startIndex + 1)) ++ indicesForRange(start, a2 - 1, maxLevel - 1) ++ indicesForRange(b2 + 1, end, maxLevel)
        } else {
          indicesForRange(start, end, maxLevel - 1)
        }
      }
    }
  }

}

Flaky test in HyperLogLog

I got the following failure, unrelated to any code change that I did:

[error] x HyperLogLog should
[info]   + count with 5-bits
[error]   x count with 6-bits
[error]     0.3370433439011089 is not less than 0.325 (HyperLogLogTest.scala:56)

Count-min-sketch on any Monoid

You can think of Count-min-sketch as an approximate: Map[Long,Long].

The algorithm seems to generalize to Map[Long,V : Ordering : Monoid] because we use the count (monoid) and the min (ordering).

If we include a 64 bit hash from K => Long, then we get

SketchMap[K, V](implicit keyhash: Hashable[K,Long], monoidv: Monoid[V], ord: Ordering[V])

And generalize the CMS code in the obvious way.

Should we remove CMS and move to the above? I don't know.

SummingQueue fails w/ capacity of 0

scala> SummingQueue[Int](0)
java.lang.IllegalArgumentException
    at java.util.concurrent.ArrayBlockingQueue.<init>(ArrayBlockingQueue.java:169)
    at SummingQueue.<init>(<console>:27)
    at SummingQueue$.apply(<console>:21)
    at .<init>(<console>:22)
    at .<clinit>(<console>)
    at .<init>(<console>:11)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:704)
    at scala.tools.nsc.interpreter.IMain$Request$$anonfun$14.apply(IMain.scala:920)
    at scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
    at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
    at java.lang.Thread.run(Thread.java:680)

Add a way to signal + commutative, * is commutative.

Here is an idea:

// sigil trait
trait CommutativePlus

// sigil trait
trait CommutativeTimes

Then we can add with CommutativePlus to most of the Monoids/Semigroups/etc... and we can use isInstanceOf to test at runtime, and type constraints to check at compile-time.

Count-Min-Sketch Monoid yields inaccurate frequency.

scala> import com.twitter.algebird._
import com.twitter.algebird._

scala> implicit val monoid = CMS.monoid(0.01, 0.01, 0, 0.01)
monoid: com.twitter.algebird.CountMinSketchMonoid = com.twitter.algebird.CountMinSketchMonoid@56fb2ac4

scala> val two = monoid.create(Seq(1L, 1L))
two: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
scala> val three = monoid.create(Seq(1L, 1L, 1L))
three: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
scala> two.frequency(1L)
res0: com.twitter.algebird.Approximate[Long] = Approximate(2,2,2,0.99)

scala> three.frequency(1L)
res1: com.twitter.algebird.Approximate[Long] = Approximate(3,3,3,0.99)

scala> monoid.plus(two, three)
res2: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> res2.frequency(1L)
res3: com.twitter.algebird.Approximate[Long] = Approximate(2,2,2,0.99)

scala> res2.asInstanceOf[CMSInstance].hhs
res4: com.twitter.algebird.HeavyHitters = HeavyHitters(TreeSet(HeavyHitter(1,2), HeavyHitter(1,3)))

Summing two and three should give me '5' as the frequency, but it's giving me two. Might be related to the duplicated Heavy Hitters issue.

PairMonoid/Semigroup to combine Set/Hyperloglog

Here is the idea, two functions:

sealed abstract class EitherAlgebra[+L,+R]
case class LeftAlg[+L](get: L) extends EitherAlgebra[L, Nothing]
case class RightAlg[+R](get: R) extends EitherAlgebra[Nothing,R]

//Then we have a monoid/semigroup on L,R and:

def monoid(shouldConvert: (R) => Boolean, convert: (R) => L): Monoid[EitherAlgebra[L,R]]

Then, shouldConvert(s: Set[T]) = s.size > threshold, and convert(s: Set[T]): HLLInstance = iterate over all the items and insert them.

In this way, we can make a monoid that is exact when size is small, and converts to approximate when we cross some size boundary (say about the size of the serialized hyperloglog).

We can do the same with count-min-sketch with a Map[K,Long]. The pattern is: use the exact data structure when the size is less than the approximate, and then switch on demand.

Approximate collections API.

There is something going on with some of the data-structures we have here: BloomFilter is a kind of approximate Set. HyperLogLog is an approximate set with Cardinality. Count-min-sketch is an approximate Map (so Count-min-sketch has an approximate set on the keys).

I think we could define a few traits to represent these concepts.

Something like:

trait ApproximateSet[V] {
  def mayContain(v: V): Boolean
  def containsFalsePosProb: Double
  def containsFalseNegProb: Double
}

In fact, most of these have false neg prob == 0.0, so maybe we should encode that in the type (for cases where you want to be sure you don't miss anything).

trait ApproxAcceptingSet[V] extends ApproximateSet[V] {
  final def containsFalseNegProb = 0.0
}

Maybe we want to punt and avoid the returning the false negatives, and if we want to support set union (which these as monoids would have):

trait ApproximateSet[V,S<:ApproximateSet[V,S]] {
  // if we return false, the item is definitely NOT in this set.
  def mayContain(v: V): Boolean
  // f-bounded polymorphism to preserve the types:
  def union(that: S): S
}

For size, maybe something like:

trait ApproximateCardinality {
  def approxSize: Long
}

For Count-min-sketch, it looks like we need something like:

trait ApproximateMap[K,V] {
  def approxValue(k: K): Option[V]
}

In this view, I think CMS extends ApproximateMap, BloomFilter extends ApproximateSet, and HyperLogLog extends ApproximateSet with ApproximateCardinality

publishing 0.1.12

Any last minute (binary compatible) fixes should be submitted in the next 24 hours.

Goal is to publish on Thursday.

Use a MurmurHash instead of MD5 for HyperLogLog

Cryptographic hashes are often slower than non-cryptographic. And MD5 suffers from known collision problems.

I've been using MurmurHash in my hyperloglog impl with good results. Scala is not my language of choice, but I'd provide a patch if you guys wanted.

Chernoff test

We should write some reusable code in algebird-test to have a scalacheck property that is true with probability at least P. Then, using the chernoff bound, we should verify that this claim is true with a very high probability (something like p=10^{-9}), so that the tests will almost always pass.

algebird-serialization

This module should contain thrift mappings for all Algebird case classes. Out of this we'll publish scrooge and java jars of these structs.

Some code duplication in tests

Currently a lot of the tests contain code for doing approximate equality of maps, lists, etc. It would be nice to move these to a common trait.

Specialize Max, Min, First and Last

Max, Min, First and Last should be specialized on the usual primitives (which are the salient use cases) in order to minimize (un-)boxing overhead

sortedSum implementation

Could this be useful inside of Monoid?

import scala.util.Sorting.stableSort

def sortedSum[T: Ordering: Manifest](vs: Seq[T]): T = Monoid.sum(stableSort(vs))

Count-Min-Sketch Heavy Hitters duplication.

When adding two CountMinSketch objects, is it intentional that the heavy hitters are duplicated when they're summed using the monoid? The heavyHitters method on CMS is correct because it filters out. Here's an example where I add two CMS objects, and inspect the heavy hitters internal representation.

scala> import com.twitter.algebird._
import com.twitter.algebird._

scala> val monoid = CMS.monoid(0.01, 0.01, 0, 0.01)
monoid: com.twitter.algebird.CountMinSketchMonoid = com.twitter.algebird.CountMinSketchMonoid@3219ab8d

scala> monoid.create(Seq(1L, 1L, 1L))
res0: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> monoid.create(Seq(1L, 1L, 1L, 1L))
res1: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> monoid.plus(res0, res1)
res2: com.twitter.algebird.CMS = CMSInstance(CMSCountsTable(Vector(Vector(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
scala> res2.asInstanceOf[CMSInstance].heavyHitters
res3: Set[Long] = TreeSet(1)

scala> res2.asInstanceOf[CMSInstance].hhs
res4: com.twitter.algebird.HeavyHitters = HeavyHitters(TreeSet(HeavyHitter(1,3), HeavyHitter(1,4)))

I would expect that the internal representation of hhs would remove the duplicate value. Since I serialize the data, it becomes a major waste of space storing a bunch of old values. For now, I just remove duplicate 'item' HeavyHitter keys when serialized, but I was wondering if this was intentional, since I can see that being the case for faster summing.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.