
vers's Introduction

Vers - Very Efficient Rank and Select


Vers (vers-vecs on crates.io) contains pure-Rust implementations of several data structures backed by rank and select operations. When using this library, it is strongly recommended to enable the bmi2 and popcnt target features on x86_64 CPUs, or to compile with -C target-cpu=native, since the intrinsics speed up both rank and select operations by a factor of 2-3.
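
For orientation, here is a minimal usage sketch of the succinct bit vector. The construction calls (BitVec::from_zeros, set, RsVec::from_bit_vec) follow the crate's examples; the query names rank1/select1 reflect the documented rank/select API, but exact signatures and indexing conventions should be checked against the docs.

use vers_vecs::{BitVec, RsVec};

fn main() {
    // Build a plain bit vector and set a few bits.
    let mut bv = BitVec::from_zeros(64);
    bv.set(2, 1).unwrap();
    bv.set(17, 1).unwrap();
    bv.set(40, 1).unwrap();

    // Convert it into the succinct rank/select structure.
    let rs = RsVec::from_bit_vec(bv);

    // rank1(i): number of 1-bits before position i (assumed exclusive bound).
    assert_eq!(rs.rank1(18), 2);

    // select1(k): position of the k-th 1-bit (assumed 0-indexed rank).
    assert_eq!(rs.select1(2), 40);
}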

Data Structures

  • A fully-featured bit vector with no memory overhead.
  • A succinct bit vector supporting fast rank and select queries.
  • An Elias-Fano encoding of monotone sequences supporting constant-time predecessor/successor queries.
  • Two Range Minimum Query vector structures for constant-time range minimum queries.
  • A Wavelet Matrix supporting O(k) rank, select, statistical, predecessor, and successor queries.

Why Vers?

  • Vers is among the fastest publicly available bit vector implementations for rank and select operations.
  • Vers has a substantially lower memory overhead than its competitors.
  • Without crate features, all data structures are implemented in pure Rust and have no dependencies outside the standard library.
  • All functionality is extensively documented.
  • Vers aims to provide more functionality for its data structures than competitors (e.g., Elias-Fano sequences and the Wavelet Matrix support predecessor and successor queries, the Wavelet Matrix supports statistical queries, all data structures implement various iterators, etc.).

Crate Features

  • simd: Enables the use of SIMD instructions for rank and select operations. This feature requires AVX-512 support and uses unsafe code. It also enables a special iterator for the rank/select bit vector that uses vectorized operations. The feature only works on nightly Rust; enabling it on stable Rust is a no-op, because the required intrinsics are not available there.
  • serde: Enables serialization and deserialization of the data structures using the serde crate.
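
As a rough illustration of the serde feature, the sketch below serializes a bit vector and restores it. It assumes the feature is enabled in Cargo.toml; serde_json is used here purely for demonstration and is not a dependency of Vers.

use vers_vecs::{BitVec, RsVec};

fn main() -> Result<(), serde_json::Error> {
    let rs = RsVec::from_bit_vec(BitVec::from_zeros(128));

    // With the `serde` feature enabled, the data structures implement
    // Serialize and Deserialize and work with any serde format crate.
    let encoded = serde_json::to_string(&rs)?;
    let decoded: RsVec = serde_json::from_str(&encoded)?;

    assert_eq!(rs.len(), decoded.len());
    Ok(())
}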

Benchmarks

I benchmarked the implementations against publicly available implementations of the same data structures. The benchmarking code is available in the vers-benchmarks repository. The benchmark uses the simd feature of rsdict, which requires nightly Rust.

I performed the benchmarks on a Ryzen 9 7950X with 32GB of RAM. Some of the results are shown below. All benchmarks were run with the target-cpu=native flag enabled, and the simd feature enabled for Vers. More results can be found in the benchmark repository.

Benchmarks for the Wavelet Matrix are still missing because I want to improve the benchmarking code before running them. Since Wavelet Matrices leave very little room for engineering, however, no surprising results are to be expected: the performance depends solely on the underlying bit vector implementation, so the results will mirror the bit vector benchmarks. The only exception is the qwt crate, which uses quad vectors instead and is substantially faster than any other crate due to the reduced number of cache misses.

Bit-Vector

Rank & Select

The bit vector implementation is among the fastest publicly available implementations for rank and select operations. Note that the succinct crate's rank operation substantially outperforms Vers', but that crate does not provide an efficient select operation.

The x-axis is the number of bits in the bit vector. An increase in all runtimes can be observed for input sizes exceeding the L2 cache size (16 MB).

Legend            | Crate                                   | Notes
bio               | https://crates.io/crates/bio            | with adaptive block-size
fair bio          | https://crates.io/crates/bio            | with constant block-size
fid               | https://crates.io/crates/fid            |
indexed bitvector | https://crates.io/crates/indexed_bitvec |
rank9             | https://crates.io/crates/succinct       | fastest of multiple implementations
rsdict            | https://crates.io/crates/rsdict         |
vers              | https://github.com/Cydhra/vers          |
sucds-rank9       | https://crates.io/crates/sucds          |
sucds-darray      | https://crates.io/crates/sucds          | dense set implementation
bitm              | https://crates.io/crates/bitm           |

(Figures: Bit-Vector Rank Benchmark; Bit-Vector Select Benchmark)

Heap Size

The memory overhead of the bit vector implementation is significantly lower than that of other implementations. The x-axis is the number of bits in the bit vector; the y-axis is the additional overhead as a percentage of the size of the bit vector. Only the fastest competitors are shown, to keep the graph readable. (I would like to include the bio crate as well, since it is the only truly succinct one, but it does not offer an operation to measure its heap size. The same is true for the bitm crate, which claims a lower memory overhead than Vers but does not offer a convenient way of measuring it.) Vers achieves its high speeds with significantly less memory overhead, as the heap size benchmark shows. The legend contains the measurement for the largest input size, because I assume that the overhead approaches a constant value for large inputs.

(Figure: Bit-Vector Heap Size Benchmark)

Elias-Fano

The benchmark compares the access times for random elements in the sequence. The x-axis is the number of elements in the sequence. Note that the elias-fano crate is inefficient for random-order access; in-order access benchmarks can be found in the benchmark repository.

(Figure: Elias-Fano Randomized Access Benchmark)

The following two benchmarks show the predecessor query times for an average element distribution and for the worst-case element distribution. Note that Vers' worst-case query times are logarithmic, while sucds has linear worst-case query times.

(Figures: Elias-Fano Predecessor Benchmark, average-case and worst-case distributions)
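
A minimal sketch of how the Elias-Fano structure is used. The names from_slice, get, and predecessor mirror the operations described above; the exact signatures and return types are assumptions and should be verified against the crate documentation.

use vers_vecs::EliasFanoVec;

fn main() {
    // The input sequence must be monotonically non-decreasing.
    let ef = EliasFanoVec::from_slice(&[0, 3, 7, 7, 12, 40]);

    // Random access to the i-th element (assumed to return an Option).
    assert_eq!(ef.get(4), Some(12));

    // Predecessor query: the largest stored value <= the argument.
    assert_eq!(ef.predecessor(10), Some(7));
}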

Range Minimum Query

The Range Minimum Query implementations are compared against the range_minimum_query and librualg crates. Both Vers implementations outperform both crates by a significant margin. An increase in runtime can be observed for input sizes exceeding the L3 cache size (64 MB). The increase occurs earlier for the BinaryRmq implementation, because it has a substantially higher memory overhead. For the same reason, the final two measurements for the BinaryRmq implementation are missing (the data structure exceeded the available 32 GB of main memory).

(Yes, the naming of both implementations is unfortunate, but they will stay until I do a major version bump.)

(Figure: RMQ Comparison)
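
A minimal usage sketch for the RMQ structures. The names BinaryRmq and range_min follow the crate's naming; the range convention (assumed here to be an inclusive index range returning the position of the minimum) should be verified against the docs.

use vers_vecs::BinaryRmq;

fn main() {
    let data = vec![9, 6, 10, 4, 20, 7];
    let rmq = BinaryRmq::from_vec(data);

    // Index of the minimum element in the (assumed inclusive) range 1..=4.
    assert_eq!(rmq.range_min(1, 4), 3); // data[3] == 4 is the minimum
}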

Intrinsics

This crate uses compiler intrinsics for bit manipulation. The intrinsics are supported by all modern x86_64 CPUs, but not by other architectures. Fallback implementations exist for when the intrinsics are not available, but they are significantly slower. Using this library on x86_64 CPUs without enabling the bmi2 and popcnt target features is not recommended.

The intrinsics in question are popcnt (supported since SSE4.2 on Intel and SSE4a on AMD, ca. 2007-2008), pdep (part of BMI2, available since Intel Haswell and AMD Excavator, but only implemented in hardware since AMD Zen 3; ca. 2011-2013), and tzcnt (part of BMI1, available since Intel Haswell and AMD Jaguar, ca. 2013).
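
To illustrate why pdep matters for select: the k-th set bit of a 64-bit word can be located by depositing a single bit into the word's set-bit positions and counting trailing zeros. This is a sketch of the general technique, not the crate's exact code.

// Position of the k-th (0-indexed) set bit of `word`, using the pdep/tzcnt trick.
#[cfg(all(target_arch = "x86_64", target_feature = "bmi2"))]
fn select_in_word(word: u64, k: u32) -> u32 {
    // pdep scatters the single bit `1 << k` into the positions of the set bits
    // of `word`; the result has exactly one bit set, at the position of the
    // k-th set bit. trailing_zeros (tzcnt) then extracts that position.
    unsafe { std::arch::x86_64::_pdep_u64(1 << k, word).trailing_zeros() }
}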

Safety

This crate uses no unsafe code, with the only exception being the compiler intrinsic for pdep. The intrinsic cannot fail with the provided inputs (provided it is supported by the target machine), so even if it were implemented incorrectly, no memory corruption could occur, only incorrect results.

All unsafe code is hidden behind the safe public API.

Dependencies

The library has no dependencies outside the Rust standard library by default. It has a plethora of dependencies for benchmarking purposes, but these are not required for normal use. Optionally, the serde feature can be enabled to allow serialization and deserialization of the data structures, which requires the serde crate and its derive feature.

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT License

at your option.

This project includes code developed by Gonzalo Brito Gadeschi originally licensed under the MIT license. It is redistributed under the above dual license.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

vers's People

Contributors

chris-ha458, cydhra


vers's Issues

`indexset`

Hey there!

Thanks a lot for the PRs that made range iteration work better.

How does indexset's rank and select fare against what you have here?

Errors in `iter1()`

Very cool crate! I ran into some cases where iter1() returns incorrect results: it appears to skip blocks of ~8 one-bits. I augmented test_random_data_rank() with some tests of iter0()/iter1() and ran into a crash. The test below fails with: thread 'bit_vec::fast_rs_vec::tests::test_random_data_rank' panicked at src/bit_vec/fast_rs_vec/iter.rs:327:19: attempt to shift left with overflow (on x86_64). I get the same failure with and without target-cpu=native.

Test:

// This test extends the crate's internal `bit_vec::fast_rs_vec::tests` module,
// so `BitVec`, `RsVec`, `SUPER_BLOCK_SIZE`, and the rand imports (`StdRng`,
// `SeedableRng`, `Uniform`, `Distribution`) are assumed to be in scope there.
#[test]
fn test_random_data_rank() {
    static LENGTH: usize = 4 * SUPER_BLOCK_SIZE;

    let mut bv = BitVec::with_capacity(LENGTH);
    let mut rng = StdRng::from_seed([
        0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5,
        6, 7,
    ]);
    let sample = Uniform::new(0, 2);

    for _ in 0..LENGTH {
        bv.append_bit(sample.sample(&mut rng));
    }

    let bv = RsVec::from_bit_vec(bv);

    assert_eq!(bv.len(), LENGTH);

    // new iter0/1 tests
    for bit1 in bv.iter1() {
        assert_eq!(bv.get(bit1), Some(1));
    }

    for bit0 in bv.iter0() {
        assert_eq!(bv.get(bit0), Some(0));
    }

    let mut all_bits: Vec<_> = bv.iter0().chain(bv.iter1()).collect();
    all_bits.sort();
    all_bits.dedup();
    assert_eq!(all_bits.len(), LENGTH);
}

Relevant stack:

                               at /mnt/home/pat/code/vers/src/bit_vec/fast_rs_vec/iter.rs:327:19
  21:     0x559235af78b8 - <vers_vecs::bit_vec::fast_rs_vec::iter::SelectIter<_> as core::iter::traits::iterator::Iterator>::next::h917ec949850a2aeb
                               at /mnt/home/pat/code/vers/src/bit_vec/fast_rs_vec/iter.rs:358:13
  22:     0x559235adffcf - vers_vecs::bit_vec::fast_rs_vec::tests::test_random_data_rank::hbe5742e7b9163e43
                               at /mnt/home/pat/code/vers/src/bit_vec/fast_rs_vec/tests.rs:36:17
  23:     0x559235ac5077 - vers_vecs::bit_vec::fast_rs_vec::tests::test_random_data_rank::{{closure}}::h661c2093803983ac
                               at /mnt/home/pat/code/vers/src/bit_vec/fast_rs_vec/tests.rs:19:27
  24:     0x559235b27da6 - core::ops::function::FnOnce::call_once::h22b1d390239012f7

Breaking Changes Wishlist

Listed here are breaking changes to implement if a major version release is planned

  • Rename FastRmq to SmallRmq or something similar, expressing its higher memory efficiency
  • Rename BinaryRmq to SparseRmq
  • Change return value of count_ones and count_zeros functions in both bit vectors to usize
  • Refactor the data structure APIs into traits, so alternative backing structs can be used interchangeably with trait objects, and code can be reused when providing alternative implementations or utility types like MaskedBitVec (a rough sketch of the idea follows this list). This also enables a graceful implementation of #5
  • Rename BitVector::from_bits to from_bits_u8
  • Change the Vecs inside immutable data structures into Box<&[u64]> where applicable
  • Change u64 limbs into a struct worth 512 bits to force both cache-alignment and avx-alignment
  • Merge superblocks and blocks into a cache-aligned interleaved structure to reduce cache misses
  • Maybe abstract over the RS support structure, such that it can be exchanged through different implementations
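
A rough sketch of what the trait refactor mentioned in the list above could look like; the trait and method names here are purely illustrative and are not part of the crate.

// Hypothetical traits for the wishlist item above; names are illustrative only.
pub trait RankSupport {
    /// Number of 1-bits before position `pos`.
    fn rank1(&self, pos: usize) -> usize;

    /// Number of 0-bits before position `pos`, derived from `rank1`.
    fn rank0(&self, pos: usize) -> usize {
        pos - self.rank1(pos)
    }
}

pub trait SelectSupport {
    /// Position of the `rank`-th 1-bit, if it exists.
    fn select1(&self, rank: usize) -> Option<usize>;
}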

bug in iter1() with length >8192

Incorrect results from iter1() when the length is greater than 8192 and certain bits are present. This is a "local minimum": making the length shorter or the bit list shorter makes the test pass again.

use vers_vecs::{BitVec, RsVec};

#[test]
fn iter1_test() {
    // Positions of the bits to set; iter1() should yield exactly these indices.
    let input_on_bits = vec![1, 14, 21, 24, 36, 48, 57, 59, 65, 69, 81, 87, 97, 100, 101, 104, 111, 117];

    let mut bv = BitVec::from_zeros(8193);

    for idx in &input_on_bits {
        bv.set(*idx as usize, 1).unwrap();
    }

    let bv = RsVec::from_bit_vec(bv);
    let output_on_bits: Vec<_> = bv.iter1().collect();
    assert_eq!(input_on_bits, output_on_bits);
}

Zero-Copy Serialization/Deserialization

The pointer-free nature of succinct data structures makes them very amenable to (de)serialization by simply casting their memory to/from a bunch of bytes.

Not only would this remove most (de)serialization costs, it could also enable very fast and simple on-disk storage when combined with mmap.

One might want to implement this via rkyv, but simply providing a safe transmute to and from bytes::Bytes (with a potential rkyv implementation on top of that) might be the simpler, more agnostic solution.
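
As a minimal illustration of the idea (not part of the vers API): because the backing storage is a flat slice of u64 limbs with no pointers, it can be reinterpreted as bytes without copying. The helper below is hypothetical.

// Hypothetical helper: view a slice of u64 limbs as raw bytes without copying.
fn limbs_as_bytes(limbs: &[u64]) -> &[u8] {
    // Safety: u64 has no padding bytes and every byte pattern is a valid u8,
    // so reinterpreting the memory is sound. Note that the byte order depends
    // on the host endianness, which matters for on-disk formats.
    unsafe {
        std::slice::from_raw_parts(limbs.as_ptr().cast::<u8>(), std::mem::size_of_val(limbs))
    }
}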

WaveletMatrix

It would be nice if there was a WaveletMatrix implementation, as it would enable the most common succinct data structure applications in text processing and databases.
