mgunlogson / cuckoofilter4j Goto Github PK
View Code? Open in Web Editor NEWHigh performance Java implementation of a Cuckoo filter - Apache Licensed
License: Apache License 2.0
High performance Java implementation of a Cuckoo filter - Apache Licensed
License: Apache License 2.0
I want to know how to serialize and deserialize the cuckoo filter with gson. Can you provide an example?
Sorry about my code ability.
Hey,
Thanks for authors' work !
I find curIndex and altIndex might be equal in put method (cuckoofilter.java), which results in an error for the countTag method (FilterTable.java), An item will be counted twice, I don't know what other effects this error has.
Fix countTag bug:
In the second Bucket, you can determine if i1(curIndex ) does not equal i2(altIndex )
if (i1!=i2 && checkTag(i2, posInBucket, tag))
From the original paper, It should be like this
return DoubleMath.roundToInt((DoubleMath.log2(1 / fpProb) + 3) / loadFactor, RoundingMode.UP);
Hey,
I find it better in some aspects than bloom filters but I depend on storing it to a file and use it later.
Are you planing to add support for that in future? Do you see any technical problems in doing so?
Also I'm wondering how to get size of a cuckoo filter in memory or in persisted state. I'm doing some benchmarks and without knowing this I cannot really use them ...
Btw the results of the benchmark that tests false positive rate on 100M unique UUIDs :
sparkBF finished in 205.543 seconds with 0 collisions and size 299533092 bytes ...
guavaBF finished in 229.518 seconds with 79 collisions and size 299533086 bytes ...
sipHash24 cuckooF finished in 183.235 seconds with 1093 collisions ...
Murmur3_128 cuckooF finished in 181.725 seconds with 1131 collisions ...
sha256 cuckooF finished in 229.018 seconds with 1084 collisions ...
As of version 23, guava bloomfilter is thread safe.
the sequence of mightContain + put is racy.
would it make sense to create a putIfAbsent variant that is atomic and race-free from the callers point of view?
or are there other options available to achieve something similar without synchronizing on the CuckooFilter instance?
Hi MGunlogson,
when i create a cuckoo filter for string and set the fpp little than 1e-10, there comes some false negative. after test a while, i find than:
in the readTagAndSet method of FilterTable.class, the following line:
tag |= (1 << tagPos);
should be changed to:
tag |= (1L << tagPos);
because when fpp little than 1e-10, the tagBit woule be more than 32bits, so there would do
something wrong.
@MGunlogson this library remove duplicate line from a file. I'm currently work with next-generation sequencing (ngs) and need to remove duplicate readings from large volumes of data (big data)
I am wondering if it's possible to resize (grow) an existing Cuckoo filter once the capacity is reached? Rather than rebuild one anew.
Hi, loving using your Cuckoo filter always loved the concept behind cuckoo hashing but never ended up using it until now.
There are some ways to make it faster however.
1: Support primitives such as int, long so there is no autoboxing overhead and garbage generation,
this would probably necessitate a class for each type like the trove collections do.
eg, IntCuckooFilter, LongCuckooFilter etc.
2: use atomicArray and compare and swap instead of locks for concurrency, locking always has a big overhead. see> https://github.com/Valloric/guava/tree/lock-free-bloom
3: Use xxHash instead of MurMur it is twice as fast with the same or better quality hashes and made the normal guava bloom filter access go from 440ns for the MURMUR128_MITZ_64 to 93ns for xxHash
see> google/guava#2748 (comment)
thanks for building such as great filter!
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.