Giter VIP home page Giter VIP logo

cuckoofilter4j's People

Contributors

mgunlogson avatar mpdaedalus avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

cuckoofilter4j's Issues

Serialized CuckooFilter

I want to know how to serialize and deserialize the cuckoo filter with gson. Can you provide an example?
Sorry about my code ability.

curIndex and altIndex might be equal, which results in an error for the countTag

Hey,
Thanks for authors' work !
I find curIndex and altIndex might be equal in put method (cuckoofilter.java), which results in an error for the countTag method (FilterTable.java), An item will be counted twice, I don't know what other effects this error has.

Fix countTag bug:
In the second Bucket, you can determine if i1(curIndex ) does not equal i2(altIndex )
if (i1!=i2 && checkTag(i2, posInBucket, tag))

Question: How to get size and serialize cuckoo filter

Hey,

I find it better in some aspects than bloom filters but I depend on storing it to a file and use it later.

Are you planing to add support for that in future? Do you see any technical problems in doing so?

Also I'm wondering how to get size of a cuckoo filter in memory or in persisted state. I'm doing some benchmarks and without knowing this I cannot really use them ...

Btw the results of the benchmark that tests false positive rate on 100M unique UUIDs :

sparkBF                 finished in 205.543 seconds with 0 collisions and size 299533092 bytes ...
guavaBF                 finished in 229.518 seconds with 79 collisions and size 299533086 bytes ...
sipHash24 cuckooF       finished in 183.235 seconds with 1093 collisions  ...
Murmur3_128 cuckooF     finished in 181.725 seconds with 1131 collisions ...
sha256 cuckooF          finished in 229.018 seconds with 1084 collisions ...

putIfAbsent support

the sequence of mightContain + put is racy.

would it make sense to create a putIfAbsent variant that is atomic and race-free from the callers point of view?

or are there other options available to achieve something similar without synchronizing on the CuckooFilter instance?

when fpp smaller than 1e-10, there is little bug in FilterTable.java,makes false negative

Hi MGunlogson,
when i create a cuckoo filter for string and set the fpp little than 1e-10, there comes some false negative. after test a while, i find than:
in the readTagAndSet method of FilterTable.class, the following line:
tag |= (1 << tagPos);
should be changed to:
tag |= (1L << tagPos);
because when fpp little than 1e-10, the tagBit woule be more than 32bits, so there would do
something wrong.

Remove duplicate rows from file

@MGunlogson this library remove duplicate line from a file. I'm currently work with next-generation sequencing (ngs) and need to remove duplicate readings from large volumes of data (big data)

Resize the filter?

I am wondering if it's possible to resize (grow) an existing Cuckoo filter once the capacity is reached? Rather than rebuild one anew.

suggestions for speedups

Hi, loving using your Cuckoo filter always loved the concept behind cuckoo hashing but never ended up using it until now.

There are some ways to make it faster however.

1: Support primitives such as int, long so there is no autoboxing overhead and garbage generation,
this would probably necessitate a class for each type like the trove collections do.
eg, IntCuckooFilter, LongCuckooFilter etc.

2: use atomicArray and compare and swap instead of locks for concurrency, locking always has a big overhead. see> https://github.com/Valloric/guava/tree/lock-free-bloom

3: Use xxHash instead of MurMur it is twice as fast with the same or better quality hashes and made the normal guava bloom filter access go from 440ns for the MURMUR128_MITZ_64 to 93ns for xxHash
see> google/guava#2748 (comment)

thanks for building such as great filter!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.