callidon / bloom-filters

JS implementation of probabilistic data structures: Bloom Filter (and its derived), HyperLogLog, Count-Min Sketch, Top-K and MinHash

Home Page: https://callidon.github.io/bloom-filters/

License: MIT License

JavaScript 37.75% TypeScript 62.25%
bloom-filter probabilistic cuckoo-filter count-min-sketch javascript invertible-bloom-filter hyperloglog topk minhash

bloom-filters's Introduction

Bloom-Filters

JavaScript/TypeScript implementation of probabilistic data structures: Bloom Filter (and its derivatives), HyperLogLog, Count-Min Sketch, Top-K and MinHash. This package relies on non-cryptographic hash functions.

📕Online documentation

Keywords: bloom filter, cuckoo filter, HyperLogLog, MinHash, Top-K, probabilistic data-structures, XOR-Filter.

❗️Compatibility❗️

  • Be careful when migrating from one version to another.
  • Bug fixes were introduced in 1.3.7 and from 1.3.9 to 2.0.0+ for hashing and indexing data. You must therefore rebuild your filters from scratch to be compatible with the new versions.
  • To respect npm's breaking-changes rule, since 1.3.9 we release a new major version whenever a modification touches the hashing/indexing system or breaks the current API.

Table of contents

Installation

npm install bloom-filters --save

Supported platforms

Data structures

Classic Bloom Filter

A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.

Reference: Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422-426. (Full text article)

Methods

  • add(element: HashableInput) -> void: add an element to the filter.
  • has(element: HashableInput) -> boolean: test an element for membership, returning False if the element is definitively not in the filter and True if the element might be in the filter.
  • equals(other: BloomFilter) -> boolean: test whether two filters are equal.
  • rate() -> number: compute the filter's false positive rate (or error rate).
const {BloomFilter} = require('bloom-filters')
// create a Bloom Filter with a size of 10 and 4 hash functions
let filter = new BloomFilter(10, 4)
// insert data
filter.add('alice')
filter.add('bob')

// lookup for some data
console.log(filter.has('bob')) // output: true
console.log(filter.has('daniel')) // output: false

// print the error rate
console.log(filter.rate())

// alternatively, create a bloom filter optimal for a number of items and a desired error rate
const items = ['alice', 'bob']
const errorRate = 0.04 // 4 % error rate
filter = BloomFilter.create(items.length, errorRate)

// or create a bloom filter optimal for a collections of items and a desired error rate
filter = BloomFilter.from(items, errorRate)
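For reference, the "optimal" filter produced by create() follows the textbook Bloom filter sizing formulas m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. A minimal sketch of that arithmetic (the library's own rounding may differ slightly):

const n = 1000 // expected number of items
const p = 0.01 // target false positive rate
const m = Math.ceil(-(n * Math.log(p)) / Math.LN2 ** 2) // 9586 bits
const k = Math.round((m / n) * Math.LN2) // 7 hash functions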

Partitioned Bloom Filter

A Partitioned Bloom Filter is a variation of the classic Bloom Filter.

This filter works by partitioning the M-sized bit array into k slices of size m = M/k bits, where k is the number of hash functions in the filter. Each hash function produces an index over m for its respective slice. For example, with M = 100 bits and k = 5 hash functions, each slice holds m = 20 bits. Thus, each element is described by exactly k bits, meaning the distribution of false positives is uniform across all elements.

Be careful, as a Partitioned Bloom Filter has a much higher collision risk than a classic Bloom Filter on small sets of data.

Reference: Chang, F., Feng, W. C., & Li, K. (2004, March). Approximate caches for packet classification. In INFOCOM 2004. Twenty-third AnnualJoint Conference of the IEEE Computer and Communications Societies (Vol. 4, pp. 2196-2207). IEEE. (Full text article)

Methods

  • add(element: HashableInput) -> void: add an element to the filter.
  • has(element: HashableInput) -> boolean: test an element for membership, returning False if the element is definitively not in the filter and True if the element might be in the filter.
  • equals(other: PartitionedBloomFilter) -> boolean: test whether two filters are equal.
  • rate() -> number: compute the filter's false positive rate (or error rate).
const {PartitionedBloomFilter} = require('bloom-filters')

// create a PartitionedBloomFilter of size 10, with 5 hash functions and a load factor of 0.5
let filter = new PartitionedBloomFilter(10, 5, 0.5)

// add some value in the filter
filter.add('alice')
filter.add('bob')

// lookup for some data
console.log(filter.has('bob')) // output: true
console.log(filter.has('daniel')) // output: false

// now use it like a classic bloom filter!
// ...

// alternatively, create a PartitionedBloomFilter optimal for a number of items and a desired error rate
const items = ['alice', 'bob']
const errorRate = 0.04 // 4 % error rate
filter = PartitionedBloomFilter.create(items.length, errorRate)

// or create a PartitionedBloomFilter optimal for a collections of items and a desired error rate
filter = PartitionedBloomFilter.from(items, errorRate)

Scalable Bloom Filter

A Scalable Bloom Filter is a variant of Bloom Filters that can adapt dynamically to the number of elements stored, while assuring a maximum false positive probability.

Reference: ALMEIDA, Paulo Sérgio, BAQUERO, Carlos, PREGUIÇA, Nuno, et al. Scalable bloom filters. Information Processing Letters, 2007, vol. 101, no 6, p. 255-261. (Full text article)

Internally, this filter uses Partitioned Bloom Filters.

Methods

  • add(element: HashableInput) -> void: add an element to the filter.
  • has(element: HashableInput) -> boolean: test an element for membership, returning False if the element is definitively not in the filter and True if the element might be in the filter.
  • equals(other: ScalableBloomFilter) -> boolean: test whether two filters are equal.
  • capacity() -> number: return the total capacity of this filter.
  • rate() -> number: compute the filter's false positive rate (or error rate).
const {ScalableBloomFilter} = require('bloom-filters')

// by default it creates an ideally scalable bloom filter for 8 elements with an error rate of 0.01 and a load factor of 0.5
const filter = new ScalableBloomFilter()
filter.add('alice')
filter.add('bob')
filter.add('carl')
for (let i = 0; i < 10000; i++) {
  filter.add('elem:' + i)
}
filter.has('somethingwrong') // false

filter.capacity() // total capacity
filter.rate() // current rate of the current internal filter used

Cuckoo Filter

Cuckoo filters improve on Bloom filters by supporting deletion, limited counting, and a bounded false positive rate, with storage efficiency similar to a standard Bloom Filter.

Reference: Fan, B., Andersen, D. G., Kaminsky, M., & Mitzenmacher, M. D. (2014, December). Cuckoo filter: Practically better than bloom. In Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies (pp. 75-88). ACM. (Full text article)

Methods

  • add(element: HashableInput) -> void: add an element to the filter.
  • remove(element: HashableInput) -> boolean: delete an element from the filter, returning True if the deletion was a success and False otherwise.
  • has(element: HashableInput) -> boolean: test an element for membership, returning False if the element is definitively not in the filter and True if the element might be in the filter.
  • equals(other: CuckooFilter) -> boolean: test whether two filters are equal.
  • rate() -> number: compute the filter's false positive rate (or error rate).
const {CuckooFilter} = require('bloom-filters')

// create a Cuckoo Filter with size = 15, fingerprint length = 3 and bucket size = 2
let filter = new CuckooFilter(15, 3, 2)
filter.add('alice')
filter.add('bob')

// lookup for some data
console.log(filter.has('bob')) // output: true
console.log(filter.has('daniel')) // output: false

// remove something
filter.remove('bob')
console.log(filter.has('bob')) // output: false

// alternatively, create a Cuckoo Filter optimal for a number of items and a desired error rate
const items = ['alice', 'bob']
const errorRate = 0.04 // 4 % error rate
filter = CuckooFilter.create(items.length, errorRate)

// or create a Cuckoo Filter optimal for a collections of items and a desired error rate
filter = CuckooFilter.from(items, errorRate)

WARNING: The error rate cannot be lower than 1 * 10^-18. Below this value, you will get an exception stating that the fingerprint length exceeds the hash length.
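This bound follows from the standard cuckoo filter sizing in the Fan et al. paper, where the fingerprint must hold about log2(2b/ε) bits for bucket size b and error rate ε. A rough sketch of the arithmetic (the library's exact computation may differ):

const bucketSize = 4
const targetErrorRate = 1e-18
// bits needed by the fingerprint: log2(2b / ε)
const fingerprintBits = Math.ceil(Math.log2((2 * bucketSize) / targetErrorRate))
console.log(fingerprintBits) // 63, just under a 64-bit hash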

Counting Bloom Filter

A Counting Bloom filter works in a similar manner as a regular Bloom filter; however, it is able to keep track of insertions and deletions. In a counting Bloom filter, each entry in the Bloom filter is a small counter associated with a basic Bloom filter bit.

Reference: F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, “An Improved Construction for Counting Bloom Filters,” in 14th Annual European Symposium on Algorithms, LNCS 4168, 2006

Methods

  • add(element: HashableInput) -> void: add an element to the filter.
  • remove(element: HashableInput) -> boolean: delete an element from the filter, returning True if the deletion was a success and False otherwise.
  • has(element: HashableInput) -> boolean: test an element for membership, returning False if the element is definitively not in the filter and True if the element might be in the filter.
  • equals(other: CountingBloomFilter) -> boolean: test whether two filters are equal.
  • rate() -> number: compute the filter's false positive rate (or error rate).
const {CountingBloomFilter} = require('bloom-filters')

// create a Counting Bloom Filter with capacity = 15 and 4 hash functions
let filter = new CountingBloomFilter(15, 4)

// add some value in the filter
filter.add('alice')
filter.add('bob')
filter.add('carole')

// remove some value
filter.remove('carole')

// lookup for some data
console.log(filter.has('bob')) // output: true
console.log(filter.has('carole')) // output: false
console.log(filter.has('daniel')) // output: false

// print false positive rate (around 0.1)
console.log(filter.rate())

// alternatively, create a Counting Bloom Filter optimal for a number of items and a desired error rate
const items = ['alice', 'bob']
const errorRate = 0.04 // 4 % error rate
filter = CountingBloomFilter.create(items.length, errorRate)

// or create a Counting Bloom Filter optimal for a collections of items and a desired error rate
filter = CountingBloomFilter.from(items, errorRate)

Count Min Sketch

The Count Min Sketch (CM sketch) is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear space, at the expense of overcounting some events due to collisions.

Reference: Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75. (Full text article)

Methods

  • update(element: HashableInput, count = 1) -> void: add count occurrences of an element to the sketch.
  • count(element: HashableInput) -> number: estimate the number of occurrences of an element.
  • merge(other: CountMinSketch) -> CountMinSketch: merge the occurrences of two sketches.
  • equals(other: CountMinSketch) -> boolean: test whether two sketches are equal.
  • clone(): CountMinSketch: clone the sketch.
const {CountMinSketch} = require('bloom-filters')

// create a new Count Min sketch with 2048 columns and 1 row
let sketch = new CountMinSketch(2048, 1)

// push some occurrences in the sketch
sketch.update('alice')
sketch.update('alice')
sketch.update('bob')

// count occurrences
console.log(sketch.count('alice')) // output: 2
console.log(sketch.count('bob')) // output: 1
console.log(sketch.count('daniel')) // output: 0

// alternatively, create a Count Min sketch optimal for a target error rate and probability of accuracy
const items = ['alice', 'bob']
const errorRate = 0.04 // 4 % error rate
const accuracy = 0.99 // 99% accuracy
sketch = CountMinSketch.create(errorRate, accuracy)

// or create a Count Min Sketch optimal for a collections of items,
// a target error rate and probability of accuracy
sketch = CountMinSketch.from(items, errorRate, accuracy)
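For reference, the "optimal" dimensions correspond to the textbook Count-Min sizing: width w = ceil(e / ε) and depth d = ceil(ln(1 / δ)), where ε is the additive error factor and δ = 1 - accuracy. A minimal sketch of that arithmetic (the library's own constants may differ slightly):

const epsilon = 0.04 // additive error factor
const delta = 1 - 0.99 // allowed failure probability
const width = Math.ceil(Math.E / epsilon) // 68 columns
const depth = Math.ceil(Math.log(1 / delta)) // 5 rows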

HyperLogLog

HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset. Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality. The HyperLogLog algorithm is able to estimate cardinalities greater than 10^9 with a typical accuracy (standard error) of 2%, using around 1.5 kB of memory (see reference).

Reference: Philippe Flajolet, Éric Fusy, Olivier Gandouet and Frédéric Meunier (2007). "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm". Discrete Mathematics and Theoretical Computer Science Proceedings. (Full text article)

Methods

  • update(element: HashableInput) -> void: add a new occurrence of an element to the sketch.
  • count() -> number: estimate the number of distinct elements in the sketch.
  • merge(other: HyperLogLog) -> HyperLogLog: merge the occurrences of two sketches.
  • equals(other: HyperLogLog) -> boolean: test whether two sketches are equal.
const {HyperLogLog} = require('bloom-filters')

// create a new HyperLogLog with 100 registers
const sketch = new HyperLogLog(100)

// push some occurrences in the sketch
sketch.update('alice')
sketch.update('alice')
sketch.update('bob')

// count occurrences
console.log(sketch.count())

// print accuracy
console.log(sketch.accuracy())
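As a rule of thumb, the relative error of a HyperLogLog is about 1.04 / sqrt(m), where m is the number of registers (Flajolet et al.). A quick sketch of what to expect from the example above, assuming accuracy() follows the same formula:

const registers = 100
console.log(1.04 / Math.sqrt(registers)) // ≈ 0.104, i.e. roughly 10% error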

MinHash

MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The goal of MinHash is to estimate the Jaccard similarity coefficient, a commonly used indicator of the similarity between two sets, without explicitly computing the intersection and union of the two sets. It does so by computing fixed sized signatures for a set of numbers using randomly generated hash functions.

❗️WARNINGS

  • The MinHash class only accepts numbers (integers and floats) as input.
  • Two MinHash structures can be compared only if they share the same set of randomly generated hash functions. To ease the creation of MinHash sets, we provide a MinHashFactory class that creates MinHash structures sharing the same set of hash functions. We recommend that most users rely on the factory, but the MinHash class remains importable for advanced usage.

Reference: Andrei Z. Broder, "On the resemblance and containment of documents", in Compression and Complexity of Sequences: Proceedings (1997). (Full text article)

MinHashFactory methods

  • create() -> MinHash: create a new empty MinHash structure, using the parameters of the factory.

MinHash methods

  • add(element: number) -> void: add a new element to the set.
  • bulkLoad(elements: number[]) -> void: efficiently add several new elements to the set.
  • isEmpty() -> boolean: test if the signature of the MinHash is empty.
  • compareWith(other: MinHash) -> number: estimate the Jaccard similarity coefficient with another MinHash set.
const {MinHashFactory} = require('bloom-filters')

// create the MinHashFactory, to create several comparable MinHash sets
// it uses 10 random hash functions and expects to see a maximum value of 999
const factory = new MinHashFactory(10, 999)

// create two empty MinHash
const firstSet = factory.create()
const secondSet = factory.create()

// push some occurrences in the first set
firstSet.add(1)
firstSet.add(2)

// the MinHash class also supports bulk loading
secondSet.bulkLoad([1, 3, 4])

// estimate the Jaccard similarity between the two sets
// (the exact value here is |{1}| / |{1, 2, 3, 4}| = 0.25)
const jaccardSim = firstSet.compareWith(secondSet)
console.log(`The estimated Jaccard similarity is ${jaccardSim}`)

Top-K

Given a multiset of elements, the Top-K problem is to compute the ranking of these elements (by an arbitrary score) and return the k results with the highest scores. This package provides an implementation of the Top-K problem that sorts items based on their estimated cardinality in the multiset. It is based on a Count-Min Sketch, for estimating the cardinality of items, and a MinHeap, for implementing a sliding window over the k results with the highest scores.

Items produced by the TopK class are JavaScript objects with the following content (shown in TypeScript notation).

interface TopkElement {
  // The element's value
  value: string
  // The element's frequency
  frequency: number
  // The element's rank in the TopK, ranging from 1 to k
  rank: number
}

Methods

  • add(element: string, count: number = 1) -> void: add one or more new occurrences of an element to the sketch.
  • values() -> Array<TopkElement>: get the top-k values as an array of objects.
  • iterator() -> Iterator<TopkElement>: get the top-k values as an iterator that yields objects.
const {TopK} = require('bloom-filters')

// create a new TopK with k = 10, an error rate of 0.001 and an accuracy of 0.99
const topk = new TopK(10, 0.001, 0.99)

// push occurrences one-at-a-time in the multiset
topk.add('alice')
topk.add('bob')
topk.add('alice')

// or, equally, push multiple occurrences at-once in the multiset
// topk.add('alice', 2)
// topk.add('bob', 1)

// print the top k values
for (let item of topk.values()) {
  console.log(
    `Item "${item.value}" is in position ${item.rank} with an estimated frequency of ${item.frequency}`
  )
}
// Output:
// Item "alice" is in position 1 with an estimated frequency of 2
// Item "bob" is in position 2 with an estimated frequency of 1

Invertible Bloom Filters

An Invertible Bloom Filter (IBLT), also called an Invertible Bloom Lookup Table, is a space-efficient and probabilistic data structure for solving the set-difference problem efficiently without the use of logs or other prior context. It computes the set difference with communication proportional to the size of the difference between the sets being compared. Two IBLTs can simultaneously compute D(A−B) and D(B−A) using O(d) space. This data structure encodes sets in a fashion similar in spirit to Tornado codes' construction, in that it randomly combines elements using the XOR function.

❗️WARNING❗️ An IBLT only accepts Buffer as input. If you are using bloom-filters in a Web browser, you might consider using the feross/buffer package, which provides a polyfill for Buffer in a browser.

Reference: Eppstein, D., Goodrich, M. T., Uyeda, F., & Varghese, G. (2011). What's the difference?: efficient set reconciliation without prior context. ACM SIGCOMM Computer Communication Review, 41(4), 218-229. (Full text article)

Methods

  • add(element: Buffer) -> void: add an element to the filter.
  • remove(element: Buffer) -> void: delete an element from the filter.
  • has(element: Buffer) -> boolean: test an element for membership, returning False if the element is definitively not in the filter and True if the element might be in the filter.
  • equals(other: InvertibleBloomFilter) -> boolean: test whether two filters are equal.
  • substract(remote: InvertibleBloomFilter): perform the XOR subtraction of two IBLTs.
  • decode() -> {additional: Buffer[], missing: Buffer[]}: decode an IBLT.
  • listEntries() -> Generator<Buffer, number, void>: list all entries in the IBLT using a Generator.
const {InvertibleBloomFilter} = require('bloom-filters')

const hashcount = 3
const size = 50
const iblt = new InvertibleBloomFilter(size, hashcount)

// push some data in the IBLT
iblt.add(Buffer.from('alice'))
iblt.add(Buffer.from('42'))
iblt.add(Buffer.from('help'))
iblt.add(Buffer.from('meow'))
iblt.add(Buffer.from('json'))

console.log(iblt.has(Buffer.from('alice'))) // output: true
console.log(iblt.has(Buffer.from('daniel'))) // output: false

iblt.remove(Buffer.from('alice'))
console.log(iblt.has(Buffer.from('alice'))) // output: false

// Now, let's demonstrate the decoding power of IBLT!
const remote = new InvertibleBloomFilter(size, hashcount)
remote.add(Buffer.from('alice'))
remote.add(Buffer.from('car'))
remote.add(Buffer.from('meow'))
remote.add(Buffer.from('help'))

// decode the difference between the two filters
const result = iblt.substract(remote).decode()

console.log(
  `Did we successfully decode the subtracted iblts? ${result.success}. Why? ${result.reason}`
)
console.log(
  `Elements of iblt missing elements from remote: ${result.additional}`
)
console.log(`Elements of remote missing elements from iblt: ${result.missing}`)

// alternatively, create an IBLT optimal for a number of items and a desired error rate
const items = [Buffer.from('alice'), Buffer.from('bob')]
const errorRate = 0.04 // 4 % error rate
let filter = InvertibleBloomFilter.create(items.length, errorRate)

// or create an IBLT optimal for a collections of items and a desired error rate
filter = InvertibleBloomFilter.from(items, errorRate)

Tuning the IBLT: we recommend using a hashcount of at least 3 and an alpha of at least 1.5 for up to 50 differences, which equals 1.5 * 50 = 75 cells. If you then insert a huge number of values, the decoding will work (as long as the number of differences stays below 50), but testing the presence of a value is still probabilistic, based on the number of elements inserted (even for functions like listEntries). For more details, you should read the seminal research paper on IBLTs (full-text article).
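A minimal sketch of that sizing rule (alpha and the expected number of differences are the tuning knobs; the variable names are illustrative):

const {InvertibleBloomFilter} = require('bloom-filters')

const alpha = 1.5
const differences = 50 // expected size of the set difference
const hashcount = 3
const cells = Math.ceil(alpha * differences) // 75 cells
const tunedIblt = new InvertibleBloomFilter(cells, hashcount)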

XOR Filter

Available with 8-bit and 16-bit fingerprint lengths.

An XOR Filter is a probabilistic data structure that is more space-efficient than a Bloom Filter. It is very useful for read-only sets.

Reference: Graf, Thomas Mueller, and Daniel Lemire. "Xor filters: Faster and smaller than bloom and cuckoo filters." Journal of Experimental Algorithmics (JEA) 25 (2020): 1-16. (Full text article)

Methods

  • add(elements: XorHashableInput[]) -> void: add elements to the filter. Calling this method more than once overrides the current filter with the new elements.
  • has(element: XorHashableInput) -> boolean: true/false depending on whether the element is in the set or not.

  • Extended input types: type XorHashableInput = HashableInput | Long
  • We use Buffers internally, which are exported/imported to/from base64 strings.
const {XorFilter} = require('bloom-filters')
const xor8 = new XorFilter(1)
xor8.add(['a'])
xor8.has('a') // true
xor8.has('b') // false
// or use the combined create() helper
const filter = XorFilter.create(['a'])
filter.has('a') // true
// using a 16-bit fingerprint length
XorFilter.create(['a'], 16).has('a') // true
const a = new XorFilter(1, 16)
a.add(['a'])
a.has('a') // true

Export and import

All data structures exposed by this package can be exported and imported to/from JSON:

  • Use the method saveAsJSON() to export any data structure into a JSON object.
  • Use the static method fromJSON(json) to load a data structure from a JSON object.
const {BloomFilter} = require('bloom-filters')

const filter = BloomFilter.create(15, 0.01) // 15 items, 1% error rate
filter.add('alice')

// export a bloom filter to JSON
const exported = filter.saveAsJSON()

// do something with the JSON object (save it as file, send it to a server, etc)
// ...

// import the same filter from its JSON export
const importedFilter = BloomFilter.fromJSON(exported)
console.log(importedFilter.has('alice')) // output: true
console.log(importedFilter.has('bob')) // output: false
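A minimal sketch of persisting a filter to disk with Node's fs module (the file name is illustrative):

const fs = require('fs')
const {BloomFilter} = require('bloom-filters')

const myFilter = BloomFilter.from(['alice', 'bob'], 0.01)
fs.writeFileSync('filter.json', JSON.stringify(myFilter.saveAsJSON()))

// later, or in another process
const json = JSON.parse(fs.readFileSync('filter.json', 'utf8'))
const restored = BloomFilter.fromJSON(json)
console.log(restored.has('alice')) // output: true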

Seeding and Hashing

By default, every hash function is seeded with an internal seed equal to 0x1234567890. If you want to change it:

const { BloomFilter } = require('bloom-filters')
const bl = new BloomFilter(...)
console.log(bl.seed) // 78187493520
bl.seed = 0xABCD
console.log(bl.seed) // 43981

By default, we hash elements using the XXH.h64 function from xxhashjs. If you want to use your own hash functions, you can provide your own Hashing class by extending the default one. Example:

const {BloomFilter, Hashing} = require('bloom-filters')

class CustomHashing extends Hashing {
  serialize(_element, _seed) {
    return Number(1)
  }
}

const bl = BloomFilter.create(2, 0.01)
// override just your structure locally
bl._hashing = new CustomHashing()
bl.add('a')

See the "Use different hash functions" describe block in test/utils-test.js.

Documentation

See the documentation online or generate it in the doc/ directory with: npm run doc

Tests and Development

  • Tests are performed using mocha and nyc (code coverage) on node 12.x, 14.x, 15.x and 16.x for the moment.
  • Linting and formatting are done using prettier and eslint.

When submitting pull requests, please follow this guidance:

  • Open pull requests on the develop branch. Direct contributions to the master branch will be refused without comment.
  • Add tests when possible in the test folder.
  • Functions, methods, variables and types must be documented using typedoc annotations.
  • Run yarn test (build, lint and run the mocha test suite).

References

  • Classic Bloom Filter: Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7), 422-426.
  • Partitioned Bloom Filter: Chang, F., Feng, W. C., & Li, K. (2004, March). Approximate caches for packet classification. In INFOCOM 2004. Twenty-third AnnualJoint Conference of the IEEE Computer and Communications Societies (Vol. 4, pp. 2196-2207). IEEE.
  • Cuckoo Filter: Fan, B., Andersen, D. G., Kaminsky, M., & Mitzenmacher, M. D. (2014, December). Cuckoo filter: Practically better than bloom. In Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies (pp. 75-88). ACM.
  • Counting Bloom Filter: F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese, An Improved Construction for Counting Bloom Filters, in 14th Annual European Symposium on Algorithms, LNCS 4168, 2006.
  • Count Min Sketch: Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.
  • HyperLogLog: Philippe Flajolet, Éric Fusy, Olivier Gandouet and Frédéric Meunier (2007). "Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm". Discrete Mathematics and Theoretical Computer Science Proceedings.
  • MinHash: Andrei Z. Broder, "On the resemblance and containment of documents", in Compression and Complexity of Sequences: Proceedings (1997).
  • Invertible Bloom Filters: Eppstein, D., Goodrich, M. T., Uyeda, F., & Varghese, G. (2011). What's the difference?: efficient set reconciliation without prior context. ACM SIGCOMM Computer Communication Review, 41(4), 218-229.
  • XOR Filter: Graf, T. M., & Lemire, D. (2020). Xor filters: Faster and smaller than bloom and cuckoo filters. Journal of Experimental Algorithmics (JEA), 25(1), 1-16. DOI: 10.1145/3376122
  • Scalable Bloom Filter: Almeida, P. S., Baquero, C., Preguiça, N., et al. (2007). Scalable bloom filters. Information Processing Letters, 101(6), 255-261.

Changelog

Version Release date Major changes
v2.1.0 03/2022 - Add Scalable Bloom Filters
- Use an array of BitSets for the Partitioned Bloom Filter
- Fix wrong MinHash comparison
v2.0.0 02/2022 - Use double hashing correctly #issue43
- Move all hashing-related functions to a dedicated Hashing class, a component of the BaseFilter class. This also allows overriding the serialize function to use custom hash functions
- Add #PR44, optimizing the BloomFilter internal storage with Uint arrays
- Disable 10.x and 15.x node tests
- Add XorFilter #29
- Add a .nextInt32() function to get a new random seeded 32-bit integer from the current seed
- Make all properties public to allow developers to override everything
v1.3.0 10/04/2020 Add the MinHash set
v1.2.0 08/04/2020 Add the TopK class
v1.1.0 03/04/2020 Add the HyperLogLog sketch
v1.0.0 23/03/2020 Rework the whole library using TypeScript, unify the API and fix the documentation
v0.8.0 11/11/2019 Fix some issues with the cuckoo filter (performance). Fix the global API: it now allows customizing each filter. If you want to use the old API, use the .create() or .from() functions to match it.
v0.7.1 11/09/2019 Add the Counting Bloom Filter
v0.7.0 01/07/2019 Move to XXHASH for hashing elements in the library. One property, _seed, has been added to the exported JSON; it is used to seed every hash of every element. Update Invertible Bloom Filters with #add, #has, #delete, #listEntries, #substract and #Static.decode methods. Update the way distinct indices are generated, which could previously collide in many cases.
v0.6.1 18/06/2019 Add Invertible Bloom Filters (only #encode/#substract/#Static.decode methods)

License

MIT License

bloom-filters's People

Contributors

callidon, dependabot[bot], dleppik, folkvir, jonahharris, justrox, lqmanh, zergity


bloom-filters's Issues

Exported JSON isn't portable

I was evaluating this library for a project I'm working on, where I need to export the bloom filter as a file, store it, then load it again from the file.

If I try using saveAsJSON(), the binary data is stored as an array of booleans, which, when saved to a file, looks like this:

[true, false, true, true, true, ...]

The bloom filter for a 25 MB dataset ended up being ~120 MB!

It would be nice to be able to save the file in a more space-efficient format.

`Buffer is not defined` error when using in the browser.

Hi, I was trying to create a BloomFilter instance in the browser and it keeps throwing a Buffer is not defined error, even though I'm only importing the BloomFilter export, which doesn't use the Buffer class.

Is there a workaround for this?

I'm using the library with the vite bundler on firefox 99, if that helps.

Hyperloglog Example Not working

The example HyperLogLog provided in the documentation does not work as expected. When counting on a sketch initialized without any updates, I get 71.36002532672464, and when I update the sketch with alice, the count returns the same number. Is there something I'm missing?

CuckooFilter returns false negative

I'm using CuckooFilter to test the membership of the following items. The filter returns false - definitely not in the list - for 2 items. These cases are false negatives, which should not happen.

Is this a bug or am I missing something?
Version: "bloom-filters": "^3.0.1"

import { CuckooFilter } from "bloom-filters";
const items = [
  "https://www.youtube.com/watch?v=HJjxN05ewEc",
  "https://www.youtube.com/watch?v=BZNUo7orS3k",
  "https://www.youtube.com/watch?v=SD-McWZz_pk",
  "https://www.youtube.com/watch?v=De4QjH9fpgo",
  "https://www.youtube.com/watch?v=Hzko-cjHhTg",
  "https://www.youtube.com/watch?v=vqR-8lgOmBE",
  "https://www.youtube.com/watch?v=j6u0LH67YLk",
  "https://www.youtube.com/watch?v=B2z8ikGLRh8",
  "https://www.youtube.com/watch?v=N3ftBeP16TA",
  "https://www.youtube.com/watch?v=38RBRPwODUk",
  "https://www.youtube.com/watch?v=Ry8nSUfX6fY",
  "https://www.youtube.com/watch?v=-KrYohUJvYw",
  "https://www.youtube.com/watch?v=zRpl7Pr0fs4",
  "https://www.youtube.com/watch?v=uYYiypp6WaY",
  "https://www.youtube.com/watch?v=EPap21FBGbE",
  "https://www.youtube.com/watch?v=zN2_0WC7UfU",
  "https://www.youtube.com/watch?v=DNVwnkgTzbY",
];

const errorRate = 0.04; // 4 % error rate
const filter = CuckooFilter.from(items, errorRate);
console.log(filter.has("hello")); // return false (ok)
console.log(filter.has("https://www.youtube.com/watch?v=HJjxN05ewEc")); // return true (ok)
console.log(filter.has("https://www.youtube.com/watch?v=38RBRPwODUk")); // return false (false negative)
console.log(filter.has("https://www.youtube.com/watch?v=-KrYohUJvYw")); // return false (false negative)

Question about implementation of double hashing

Hi. I'm just looking into the hashing implementation here and I'm baffled by several aspects of the implementation, hoping you can tell me if there's anything I'm missing.

For example, in getIndices, you're rehashing the input on every iteration of the loop:

bloom-filters/src/utils.ts

Lines 206 to 209 in 5c81fa4

for (let i = 1; i <= hashCount; i++) {
  const hashes = hashTwice(element, true, seed + size % i)
  arr.push(doubleHashing(i, hashes.first, hashes.second, size))
}

Surely this defeats the point of double hashing, which is to simulate k independent hash functions given only two real ones? Not using double hashing at all would need one hash per index; your implementation does two hashes per index.

It's true that the hashes you're calculating on each loop aren't quite the same, because you're adding size % i -- that's the number of cells modulo the loop iteration -- to the seed each time. But why? That seems like a really strange thing to add to the seed. It doesn't guarantee that the seed is different on each loop (e.g. if the number of cells is even, it'll be 0 for the first 2 iterations). But again, that's not something you should want or need in double hashing anyway.

Am I missing something?
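For context, the Kirsch–Mitzenmacher double hashing the question refers to computes the two base hashes once and derives every index from them. A sketch of that technique (not the library's actual code; the function name is illustrative):

// classic double hashing: hash the element exactly twice, then derive
// k indices as (h1 + i * h2) mod size
function kirschMitzenmacherIndices(h1, h2, size, hashCount) {
  const indices = []
  for (let i = 0; i < hashCount; i++) {
    indices.push((h1 + i * h2) % size)
  }
  return indices
}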

failed to load bloom-filter module after npm installation

$ npm install --save bloom-filters
$ node
Welcome to Node.js v12.12.0.
Type ".help" for more information.
> const { BloomFilter } = require('bloom-filters')
Thrown:
Error: Cannot find module '/Users/henry/tmp/bloom/node_modules/bloom-filters/bloom-filters.js'. Please verify that the package.json has a valid "main" entry
    at tryPackage (internal/modules/cjs/loader.js:294:19)
    at Function.Module._findPath (internal/modules/cjs/loader.js:525:18)
    at Function.Module._resolveFilename (internal/modules/cjs/loader.js:781:27)
    at Function.Module._load (internal/modules/cjs/loader.js:687:27)
    at Module.require (internal/modules/cjs/loader.js:849:19)
    at require (internal/modules/cjs/helpers.js:74:18) {
  code: 'MODULE_NOT_FOUND',
  path: '/Users/henry/tmp/bloom/node_modules/bloom-filters/package.json',
  requestPath: 'bloom-filters'
}

But if I use version 0.8.0, it works well.

[Bug?] npm install produces only a api.js file

npm install installs version 1.3.7.
The npm package only contains an api.js file in the dist directory. Did I miss something?
The package cannot run because all the files required by api.js are missing.

Unexpected behaviour in CuckooFilter

See the example: many false positives (it happens especially after the first positive result in a row). What am I doing wrong?

const { CuckooFilter } = require('bloom-filters')
let filter = new CuckooFilter(15, 3, 2)

filter.add('alice')
filter.add('andrew')
filter.add('bob')
filter.add('sam')

filter.add('alice')
filter.add('andrew')
filter.add('bob')
filter.add('sam')

// lookup for some data
console.log(filter.has('samx')) // output: false [ok]
console.log(filter.has('samy')) // output: true [?]
console.log(filter.has('alice')) // output: true [ok]
console.log(filter.has('joe')) // output: true [?]
console.log(filter.has('joe')) // output: true [?]

Version 0.7.1

Error in Browser Chrome 105

The environment is Quasar + Vite + Vue 3.

vue-router.mjs:3428 ReferenceError: Buffer is not defined
at cell.js:178:76
at node_modules/bloom-filters/dist/iblt/cell.js (cell.js:200:1)
at __require (chunk-OL3AADLO.js?v=aca5486d:9:50)
at node_modules/bloom-filters/dist/iblt/invertible-bloom-lookup-tables.js (invertible-bloom-lookup-tables.js:84:30)
at __require (chunk-OL3AADLO.js?v=aca5486d:9:50)
at node_modules/bloom-filters/dist/api.js (api.js:54:40)
at __require (chunk-OL3AADLO.js?v=aca5486d:9:50)
at dep:bloom-filters:1:16

saveToJSON doesn't export default seed

This code saves the cuckoo filter to localStorage in v3.0.0, but upon importing it again from localStorage, the default seed turns out not to have been serialized, so the import throws an error on this check:

!('_seed' in json)

import { CuckooFilter } from "bloom-filters";

// Check if there's a saved filter in localStorage.
let filter;
const savedFilter = JSON.parse(localStorage.getItem("--kiwi-news-read-links"));
console.log(savedFilter);
try {
  // Try to load the filter from localStorage.
  filter = CuckooFilter.fromJSON(savedFilter);
} catch (error) {
  console.log(error);
  // If an error is thrown, create a new filter.
  filter = CuckooFilter.create(1000, 0.01);
}

// Keep a queue of the most recent N links.
let queue = [];

// Check the current path.
const path = window.location.pathname;

if (path === "/" || path === "/new") {
  // Extract all the links on page load.
  const links = Array.from(document.querySelectorAll("a.story-link")).map(
    (a) => a.href,
  );

  links.forEach((link) => {
    if (!filter.has(link)) {
      const parentElement = document
        .querySelector(`a[href="${link}"]`)
        .closest("div");
      const domainElement = parentElement.querySelector(".story-domain");
      domainElement.textContent += " •";
      addLink(link);
    }
  });

  console.log("safe", filter.saveAsJSON());
  // Save the filter to localStorage.
  localStorage.setItem(
    "--kiwi-news-read-links",
    JSON.stringify(filter.saveAsJSON()),
  );
}

function addLink(link) {
  // If the filter is full, remove the oldest link.
  if (queue.length >= 1000) {
    const oldestLink = queue.shift();
    filter.remove(oldestLink);
  }

  // Add the new link to the filter and the queue.
  filter.add(link);
  queue.push(link);
}

however, after adding a custom seed manually, the seed gets exported as JSON and the problem is resolved

try {
  // Try to load the filter from localStorage.
  filter = CuckooFilter.fromJSON(savedFilter);
} catch (error) {
  console.log(error);
  // If an error is thrown, create a new filter.
  filter = CuckooFilter.create(1000, 0.01);
  filter.seed = 0xabcd;
}

Still, I think this should also work with the default seed, shouldn't it?

Compact serialized form for BloomFilter

The serialized form of BloomFilter is often larger than the original data. I haven't tested in-memory usage, but since it uses an array of 64-bit 1's and 0's, the same is likely true. This implies that there is no advantage to a BloomFilter over a JavaScript Set when hashing short strings.

For example, an 8.2 MB file (3.6 MB gzip compressed) of 1 million leaked passwords serializes to 27 MB (2.3 MB compressed). This is with a false positive tolerance of 0.1%.

I have written a fix for this which replaces the array with a BitSet backed by a Uint8Array and serializes the array with base64 encoding. This changes the 27 MB filter to 2.3 MB (1.3 MB compressed).

I'm prepared to make a pull request for this fix, if you give me permission. Otherwise I can send you the changes a different way.
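For anyone hitting this before a fix lands, the core idea of the described workaround can be sketched like this (assuming the exported filter data is an array of booleans; the function name is illustrative):

// pack an array of booleans into a Uint8Array, then base64-encode it
function packBits(bools) {
  const bytes = new Uint8Array(Math.ceil(bools.length / 8))
  bools.forEach((bit, i) => {
    if (bit) bytes[i >> 3] |= 1 << (i & 7)
  })
  return Buffer.from(bytes).toString('base64')
}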

Counting bloom filter

Hello,

Is there any chance that you could implement a counting bloom filter?
That would be great!

Thanks in advance!

Better examples for CustomHashing

Is your feature request related to a problem? Please describe.

It was confusing to figure out how to use the CustomHashing feature, and I discovered that naively constructing numbers from hex-encoded cryptographic hashes, such as sha256 or shake256, can lead to scenarios where the filter always returns true.

Describe the solution you'd like

An example showing how to use cryptographic hashes, more precisely, how to safely convert from a hex-encoded string to a number. Is it OK to truncate? Why not use BigInt?

Some explanation of how to use standard hash functions safely would be awesome.

Acceptance criteria

Unit tests for sha256 and shake256.

Additional context

In some environments (regulated / security-oriented), only specific hash functions are allowed, so it would be nice to have compliant examples to build from.
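One possible pattern (an illustrative sketch, not an official recommendation): truncate the hex digest so the resulting number stays below Number.MAX_SAFE_INTEGER (2^53 - 1), e.g. keep 12 hex characters (48 bits):

const {createHash} = require('crypto')
const {BloomFilter, Hashing} = require('bloom-filters')

class Sha256Hashing extends Hashing {
  serialize(element, seed) {
    // mix the seed into the input, then truncate the digest to 48 bits
    const digest = createHash('sha256')
      .update(String(seed) + element)
      .digest('hex')
    return parseInt(digest.slice(0, 12), 16)
  }
}

const filter = BloomFilter.create(1000, 0.01)
filter._hashing = new Sha256Hashing()
filter.add('a')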

Persistence

Thanks for making this library.

Is there any way to serialise the structures into something (e.g. JSON or a base64-encoded string) so that they can be persisted somewhere and re-instantiated?

Question: v8 release?

I would like to use a cuckoo filter, and I saw some work on improving it (ergonomics, and it looks like speed?) in the master branch that hasn't been released. When do you think it will be in shape to go out?

Todo list

  • HyperLogLog
  • Top-K or Mine Top-K: Using Bloom Filters for Mining Top-k Frequent Itemsets in Data Streams, Younghee Kim et Al.
  • Min Hash Wikipedia source, need to find the paper
  • Scalable Bloom Filter: as the Partitioned Bloom Filter is fixed now, the Scalable Bloom Filter can be easily implemented.
  • XorFilter #29

Feel free to modify this list or add suggestions in comments.

saveAsJSON() method not found

The README says that you can call .saveAsJSON() on any bloom filter to get a serialization of the bloom filter state that can be persisted or sent over the network. When I try to use it, however, my TypeScript code fails to compile with the error: error TS2339: Property 'saveAsJSON' does not exist on type 'BloomFilter'.

My code looks like:

import { BloomFilter } from 'bloom-filters'

// [...]
const bloomFilterEntries = new Set<string>()
// [...]

const bloomFilter = BloomFilter.from(bloomFilterEntries, 0.01)
const bloomJSON = bloomFilter.saveAsJSON()

import / export BloomFilter not working as expected

Hi, these functionalities seem to have issues:

static create (items = 1000, errorRate = 0.001, seed = utils.getDefaultSeed()) {
  const size = fm.optimalFilterSize(items, errorRate)
  const hashes = fm.optimalHashes(size, items)
  const filter = new BloomFilter(size, hashes)
  filter.seed = seed
  return filter
}

  • BloomFilter.create should be the correct way to construct a filter, since it calculates the sizing internally, but it forgets to save its _errorRate for serialization later on

const BloomFilterSpecs = {
  export: cloneObject('BloomFilter', '_errorRate', '_size', '_length', '_nbHashes', '_filter', '_seed'),
  import: (FilterConstructor, json) => {
    if ((json.type !== 'BloomFilter') || !assertFields(json, '_errorRate', '_size', '_length', '_nbHashes', '_filter', '_seed')) {
      throw new Error('Cannot create a BloomFilter from a JSON export which does not represent a bloom filter')
    }
    const filter = new FilterConstructor(json._capacity, json._errorRate)
    filter.seed = json._seed
    filter._size = json._size
    filter._nbHashes = json._nbHashes
    filter._filter = json._filter.slice(0)
    filter._length = json._length
    return filter
  }
}

  • because _errorRate is never saved, it is left undefined during export and then omitted by JSON.stringify, which later causes assertFields to throw an exception on import
  • _capacity is neither set on the instance nor serialized, yet it is passed as the size to FilterConstructor

class BloomFilter extends Exportable {
  /**
   * Constructor
   * @param {int} size - the number of cells
   * @param {number} nbHashes - the number of hash functions used
   */
  constructor (size = 1000, nbHashes = 4) {

Implements Scalable Bloom Filter

From #8, we need to implement the Scalable Bloom Filter; since the Partitioned Bloom Filter is fixed now, the Scalable Bloom Filter can be easily implemented.

Filters created with version 1.3.4 and exported to json do not work when imported to version 1.3.9

We use this library as part of the Ceramic Network. We are publishing batches of work done on the network along with a JSON-serialized bloom filter to make it easier to look up what happened in each batch. Our hosted service making these batches and publishing these bloom filters seems to be using version 1.3.4 of the bloom-filters package. Importing those bloom filters with the current 1.3.9 version of the library causes the .has() method of the reconstituted bloom filter to always return false, even for entries that are definitely present in the filter. Reverting to version 1.3.4 and rebuilding the bloom filter from the same input data causes it to behave correctly. It seems that there was a format change somewhere between versions 1.3.4 and 1.3.9 that causes filters created on the old version, exported with saveAsJSON(), and then imported with fromJSON() on the new version to not be rebuilt correctly.

Code that builds the filter: https://github.com/ceramicnetwork/ceramic-anchor-service/blob/9ff9e1a20e46c65036fcbed600a0abf1b09d0bec/src/merkle/merkle-objects.ts#L247-L249

Serialized filters: are they safe to expose publicly?

Hi, I'm working on a blockchain, on-chain scalable filtering solution, and I wish to know whether an attacker could gain an advantage if the entire serialized filter is stored publicly.

If that's the case, are there any suggestions to follow, like keeping some parts of the filter off-chain? The filter should be exposed somehow to decentralize the filtering.

The error rate should stay as low as possible, and the filter content size can vary.

I'm evaluating this nice library.

Btw: I tried setting the error rate to zero just for testing and browser CPU usage went crazy. I know it was a utopian case, but I'm flagging it in case it's useful for you guys.

Reusing Hashes for faster lookups

Is your feature request related to a problem? Please describe.

My problem: I replicate bloom filters from many clients (up to 1000) to my server.
The server then has to look up a given key "foobar" and check which of the replicated filters contains the key.
My bloom filters are pretty big and need about 13 hashes.
To check which replicated filter has the key, the server would then have to do 13 * 1000 hash operations.

Describe the solution you'd like

A way to precalculate the hash once and then look up its existence in multiple bloom filters without having to hash again.

import { hash } from 'bloom-filters';
const lookupHash = hash(13, 'foobar');

const foundInFilters = bloomFilters.filter(bf => bf.hasHash(lookupHash));

Acceptance criteria

For testing we could run a test that ensures that lookups for pre-calculated hashes are exactly equal to normal key lookups.

Additional context

Also a question: is it possible to use a different hash function? My plan is to run the hashing inside WebAssembly for better performance, so maybe even an async hash function would be useful.

ERROR TypeError: "bloom_filters__WEBPACK_IMPORTED_MODULE_2__ is not a constructor"

In an Angular project:
Step 1: run npm install bloom-filters --save
Step 2: add "node_modules/bloom-filters/bloom-filters.js" to angular.json
Step 3: import in the component file as import * as BloomFilter from 'bloom-filters'

my package.json is
{
"name": "",
"version": "0.0.0",
"scripts": {
"ng": "ng",
"start": "ng serve",
"build": "ng build",
"test": "ng test",
"lint": "ng lint",
"e2e": "ng e2e"
},
"private": true,
"dependencies": {
"@angular/animations": "~7.1.0",
"@angular/cdk": "~7.1.0",
"@angular/common": "~7.1.0",
"@angular/compiler": "~7.1.0",
"@angular/core": "~7.1.0",
"@angular/forms": "~7.1.0",
"@angular/platform-browser": "~7.1.0",
"@angular/platform-browser-dynamic": "~7.1.0",
"@angular/router": "~7.1.0",
"@types/axios": "^0.14.0",
"aes-js": "^3.1.2",
"angular-filepond": "^1.0.5",
"bloom-filters": "^0.5.2",
"chart.js": "^2.7.3",
"core-js": "^2.5.4",
"d3": "^5.7.0",
"datatables.net": "^1.10.19",
"datatables.net-dt": "^1.10.19",
"echarts": "^4.2.0-rc.2",
"js-sha256": "^0.9.0",
"jssha": "^2.3.1",
"ng2-charts": "^1.6.0",
"ng2-pdf-viewer": "^5.2.3",
"ng2-toastr": "^4.1.2",
"ngx-echarts": "^4.1.0",
"ngx-extended-pdf-viewer": "^0.9.14",
"rxjs": "~6.3.3",
"tslib": "^1.9.0",
"zone.js": "~0.8.26"
},
"devDependencies": {
"@angular-devkit/build-angular": "~0.11.0",
"@angular/cli": "~7.1.2",
"@angular/compiler-cli": "~7.1.0",
"@angular/language-service": "~7.1.0",
"@schematics/angular": "~7.1.0",
"@types/echarts": "^4.1.3",
"@types/jasmine": "~2.8.8",
"@types/jasminewd2": "~2.0.3",
"@types/node": "^8.9.5",
"codelyzer": "~4.5.0",
"jasmine-core": "~2.99.1",
"jasmine-spec-reporter": "~4.2.1",
"karma": "~3.1.1",
"karma-chrome-launcher": "~2.2.0",
"karma-coverage-istanbul-reporter": "~2.0.1",
"karma-jasmine": "~1.1.2",
"karma-jasmine-html-reporter": "^0.2.2",
"protractor": "~5.4.0",
"ts-node": "~7.0.0",
"tslint": "~5.11.0",
"typescript": "~3.1.6"
}
}

I got the error in the title.

A solution might be:
1.

2. how to import a node npm module in an Angular project

Bug: Endless recursion in getDistinctIndices with certain parameters.

Explanation
The getDistinctIndicesBis function within getDistinctIndices will fail with a Maximum call stack size exceeded error when too few distinct indices are returned while n < size.

How to reproduce

const filter = new BloomFilter(39, 28);
filter.add("da5e21f8a67c4163f1a53ef43515bd027967da305ecfc741b2c3f40f832b7f82");

This code will trigger the issue.

Why does this happen?

// This returns the same hashes after n is larger than size
const hashes = hashTwice(elem, true, seed! + (size % n));
// This return from the doubleHashing function will also return the same values when called with the same hashes and different n values
return Math.abs((hashA + n * hashB) % size);

This, combined with the filtering of duplicate indices in getDistinctIndicesBis, leads to the issue.
It's important to note that this will only happen in certain cases with a high hash count and a low bit count.
