Giter VIP home page Giter VIP logo

Comments (8)

krishnanraman avatar krishnanraman commented on August 15, 2024
 So you need a family of hash functions to pull this off. If you have n hashfunctions, and an array of size s, the following block of code does the rest:

scala> def hashFunction(n:Int, s:Int) = {
 | val p = math.ceil( (1 + math.sqrt(1+4*n))/2.0).toInt
 | val a = (1 to (p-1))
 | val b = (1 to p)
 | (k:Int) => a.map( ia => b.map( ib => ((ia*k+ib)%p)%s)).flatten.slice(0,n)
 | }
 hashFunction: (n: Int, s: Int)Int => scala.collection.immutable.IndexedSeq[Int]

So if you need 10 hash functions over an array with capacity to store 100 elements:

 scala> hashFunction(10,100)
 res3: Int => scala.collection.immutable.IndexedSeq[Int] = <function1>

To actually use the 10 hash functions on a key, say key = 34
scala> res3(34)
res4: scala.collection.immutable.IndexedSeq[Int] = Vector(3, 0, 1, 2, 1, 2, 3, 0, 3, 0)
---

The family of hashfunction algo : From pg 267, Corman & Leiserson ( Designing a universal class of hash functions )

from algebird.

krishnanraman avatar krishnanraman commented on August 15, 2024

Edit: p has to be prime, btw :)
So the algorithm above needs a little more work.

from algebird.

johnynek avatar johnynek commented on August 15, 2024

We've already done this three times in the code: min hash, count min sketch, and bloom filter. Here's one more approach: http://stackoverflow.com/questions/9553989/what-hashing-techniques-to-use-when-building-a-bloom-filter-in-clojure

I don't think we need a new hash family. In fact, I'd argue we should only use one approach in this code.

from algebird.

avibryant avatar avibryant commented on August 15, 2024

Agreed, we should have a HashFamily that gets used by all of these.

BTW this is a variation of what @jmhodges calls "The opposite of a bloom filter", right?
http://www.somethingsimilar.com/2012/05/21/the-opposite-of-a-bloom-filter/

from algebird.

krishnanraman avatar krishnanraman commented on August 15, 2024

In that case, the 'val hashFunctions' implementation in min hash should suffice for my purposes.

I will suggest benchmarking... the Leiserson algo is very,very cheap - generate a prime number once and integral mod twice. Whereas the implementation you have uses a random number gen plus other heavy machinery - maybe more robust but also more expensive.

from algebird.

davidthomas5412 avatar davidthomas5412 commented on August 15, 2024

Do you feel aFlowerFilter.contains(item) should return a Boolean or an ApproximateBoolean like BloomFilter?

Although it goes against the precedent, I believe it might be better to return a Boolean because:

  1. goal is to be fast/streaming and calculating probability for this structure is really expensive (based on http://eng.42go.com/flower-filter-an-update/ plus extra multiplicative factor for possible collisions when converting item to Byte or Short)
  2. the user sets their decay and accuracy thresholds when initializing
  3. wouldn't be that useful for standard use cases

Feedback appreciated.

from algebird.

johnynek avatar johnynek commented on August 15, 2024

David, if we could get a correct bound on the Boolean, then we should do so. So, when it says true, the probabilty it is correct is 1. But when we say absent, what is the probability that another item has evicted that item? Can we derive such a bound?

from algebird.

davidthomas5412 avatar davidthomas5412 commented on August 15, 2024

Yes sir, a slight modification to http://eng.42go.com/flower-filter-an-update/ for fingerprinting collisions yields the true probability. Two other things that I missed:

  1. We can find a simple log bound as opposed to exact computation
  2. We can make the computation lazy so that the data structure is fast when the probabilities are not used.

from algebird.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.