Giter VIP home page Giter VIP logo

Comments (2)

domodwyer avatar domodwyer commented on June 2, 2024

Hi @d33tah

I'm surprised it's not fast enough for your use case! Are you hashing the raw data and inserting the hash or is record[1] the raw data? What sort of numbers are you seeing?

I just ran a quick check using xxhash (though in reality the choice of hash here has no impact):

pub fn xxhash_bench(c: &mut Criterion) {
    use std::hash::BuildHasher;
    use std::hash::Hasher;
    use twox_hash::*;

    // Really any u64/hash bytes is fine, but this shows how to use xxhash.
    let mut bloom = CompressedBitmap::new(FilterSize::KeyBytes2);
    let mut hasher = RandomXxHashBuilder64::default().build_hasher();
    hasher.write(&[42]); // Hash the raw data bytes
    let hash = hasher.finish().to_le_bytes();

    c.bench_function("xxhash_insert", |b| b.iter(|| bloom.insert_hash(hash)));

    c.bench_function("xxhash_lookup hit", |b| {
        b.iter(|| black_box(bloom.contains_hash(hash)))
    });

    c.bench_function("xxhash_lookup miss, partial match", |b| {
        b.iter(|| black_box(bloom.contains_hash([1, 2, 3, 4, 5, 6, 7, 8])))
    });
}

And got:

curiosity:bloom2 % cargo bench -- xxhash
   Compiling bloom2 v0.1.0 (/Users/dom/Documents/rust/bloom2)
    Finished bench [optimized] target(s) in 12.46s
     Running target/release/deps/bench-8985f05ed1c18d86
Gnuplot not found, using plotters backend
xxhash_insert           time:   [26.184 ns 26.251 ns 26.321 ns]
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  2 (2.00%) high mild
  2 (2.00%) high severe

xxhash_lookup hit       time:   [19.978 ns 20.069 ns 20.176 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

xxhash_lookup miss, partial match
                        time:   [4.6333 ns 4.6693 ns 4.7297 ns]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

I'm compiling with -Ctarget-cpu=native to enable SSE/AVX and allow rust to use the all the features of the CPU - if your aiming for flexible builds (say, packaging up the binary to be run by 3rd parties on unknown CPUs) then skip this - it will still be pretty fast! But if you're deploying binaries to known hardware, everything will be a bit faster with this enabled and is worth doing!


To answer your actual question: sounds like a great idea!

Maybe something like:

fn try_insert<T: AsRef<[u8]>>(&self, hash: T) -> bool

Where the returned bool is true/false depending on if it is inserted?

I think this crate could do with some love to be honest - adding a generic parameter for the default hasher (xxhash?) the user can override instead of requiring them to pass hashes directly, and adding more utility methods seems like an obvious improvement.

My original use case was dealing with hashes, so I didn't want the bloom filter to then _re_hash the hash I was giving it, so that's why it wound up like this, but now it's public that is likely not the most useful form this crate could take...

Dom

from bloom2.

d33tah avatar d33tah commented on June 2, 2024

@domodwyer thanks for the RUSTFLAGS hint! It speeded up the program twice. Now I guess that the next low hanging fruit is the try_insert function you proposed. If you're interested in adding it, please ping me once it's ready for testing. :)

from bloom2.

Related Issues (4)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.