
gxhash's Introduction

GxHash


GxHash is a blazingly fast and robust non-cryptographic hashing algorithm.

Usage

cargo add gxhash

Used directly as a hash function:

let bytes: &[u8] = "hello world".as_bytes();
let seed = 1234;
println!(" 32-bit hash: {:x}", gxhash::gxhash32(&bytes, seed));
println!(" 64-bit hash: {:x}", gxhash::gxhash64(&bytes, seed));
println!("128-bit hash: {:x}", gxhash::gxhash128(&bytes, seed));

Used in HashMap/HashSet:

// Type alias for HashSet::<String, GxBuildHasher>
let mut hashset = gxhash::GxHashSet::default();
hashset.insert("hello world");

Features

Blazingly Fast 🚀

To date, GxHash is the fastest non-cryptographic hashing algorithm of its class, for all input sizes. This performance is possible mostly thanks to heavy usage of SIMD intrinsics, a high-ILP construction, a small code footprint (easily inlined and cached) and some (outrageously unsafe) tricks.

See the benchmarks.

Highly Robust 🗿

GxHash uses several rounds of hardware-accelerated AES block cipher for efficient bit mixing.
Thanks to this, GxHash passes all SMHasher tests, which is the de facto quality benchmark for non-cryptographic hash functions, gathering most of the existing algorithms. GxHash has low collisions, uniform distribution and high avalanche properties.

Check out the paper for more technical details.

Portability

Important: Because GxHash relies on AES hardware acceleration, you must make sure the aes target feature is enabled when building (otherwise the build fails). This can be done by setting the RUSTFLAGS environment variable to -C target-feature=+aes or -C target-cpu=native (the latter works if your CPU is properly recognized by rustc, which is the case most of the time).

Architecture Compatibility

GxHash is compatible with:

  • X86 processors with AES-NI & SSE2 intrinsics
  • ARM processors with AES & NEON intrinsics

Warning: Other platforms are currently not supported (there is no fallback). GxHash will not build on these platforms.

Hashes Stability

All generated hashes for a given version of GxHash are stable, meaning that for a given input the output hash will be the same across all supported platforms.

Benchmarks

GxHash is continuously benchmarked on x86 and ARM GitHub runners.

To run the benchmarks locally use one of the following:

# Benchmark throughput
# Add --features bench-md for markdown output or --features bench-plot for .svg plots
cargo bench --bench throughput

# Benchmark performance of GxHash's Hasher when used in a HashSet
cargo bench --bench hashset

Throughput

Throughput is measured as the number of bytes hashed per second.

Some prefer talking about latency (the time to generate a hash) or hash rate (the number of hashes generated per second) when measuring hash function performance, but these are all equivalent in the end: they all boil down to measuring the time it takes to hash some input, followed by a different scalar transformation. For instance, if the latency for a 4-byte hash is 1 ms, then the throughput is 1 / 0.001 * 4 = 4000 bytes per second. Throughput allows us to conveniently compare the performance of a hash function for any input size on a single graph.
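The latency-to-throughput conversion can be sketched as a tiny helper (hypothetical, not part of the gxhash API):

```rust
// Convert a measured latency into throughput (bytes per second).
// Hypothetical helper, not part of gxhash.
fn throughput_bytes_per_sec(input_len_bytes: f64, latency_secs: f64) -> f64 {
    input_len_bytes / latency_secs
}

fn main() {
    // A 4-byte input hashed in 1 ms gives 4000 bytes per second,
    // matching the worked example in the text.
    println!("{}", throughput_bytes_per_sec(4.0, 0.001));
}
```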

Latest Benchmark Results:
aarch64 x86_64 x86_64-hybrid

Security

DOS Resistance

GxHash is a seeded hashing algorithm, meaning that depending on the seed used, it will generate completely different hashes. The default hasher builder (GxHasherBuilder::default()) uses seed randomization, making any HashMap/HashSet more DOS resistant: without knowing the seed, it is much more difficult for attackers to predict which hashes may collide. This does not mean, however, that it is completely DOS resistant; this has to be analyzed further.
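The effect of seed randomization can be demonstrated with std's own randomly seeded builder, RandomState, which applies the same idea as gxhash's default builder (each builder instance draws a random seed, so colliding keys cannot be precomputed):

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

// Hash a key with a given randomly seeded builder.
fn hash_with(state: &RandomState, key: &str) -> u64 {
    let mut h = state.build_hasher();
    key.hash(&mut h);
    h.finish()
}

fn main() {
    let a = RandomState::new();
    let b = RandomState::new();
    // Two independently seeded builders produce different hashes for the
    // same key (with overwhelming probability), so an attacker cannot
    // precompute a colliding key set without knowing the seed.
    assert_ne!(hash_with(&a, "hello world"), hash_with(&b, "hello world"));
}
```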

Multicollisions Resistance

GxHash uses a 128-bit internal state. This makes GxHash a wide-pipe construction when generating hashes of 64 bits or smaller, which, among other properties, makes it inherently more resistant to multicollision attacks. See this paper for more details.

Cryptographic Properties

GxHash is a non-cryptographic hashing algorithm, thus it is not recommended to use it as a cryptographic algorithm (it is not a replacement for SHA). It has not been assessed whether GxHash is preimage resistant, nor how difficult it is to reverse.

Contributing

  • Feel free to submit PRs
  • Repository is entirely usable via cargo commands
  • Versioning is the following
    • Major for stability-breaking changes (output hashes for the same input differ after the change)
    • Minor for API changes/removal
    • Patch for new APIs, bug fixes and performance improvements

ℹ️ cargo-asm is an easy way to view the actual generated assembly code (cargo asm gxhash::gxhash::gxhash64). The #[inline] attribute must be removed from the method first, otherwise it won't be visible to the tool.
ℹ️ AMD μProf gives some useful insights on time spent per instruction.

Publication

Author note: I'm committed to the open dissemination of scientific knowledge. In an era where access to information is more democratized than ever, I believe that science should be freely available to all – both for consumption and contribution. Traditional scientific journals often involve significant financial costs, which can introduce biases and can shift the focus from purely scientific endeavors to what is currently trendy.

To counter this trend and to uphold the true spirit of research, I have chosen to share my work on "gxhash" directly on GitHub, ensuring that it's openly accessible to anyone interested. Additionally, the use of a free Zenodo DOI ensures that this research is citable and can be referenced in other works, just as traditional publications are.

I strongly believe in a world where science is not behind paywalls, and I am in for a more inclusive, unbiased, and open scientific community.

Publication:
PDF

Cite this publication / algorithm:
DOI

gxhash's People

Contributors

actions-user, dherse, notsatvrn, ogxd, virtualritz


gxhash's Issues

Don't finalize in Hasher::write, but in finish instead

Context

Currently, each Hasher::write implies a finalization. This is a problem because hashing a string, for instance, requires 2 writes (see the write_str implementation) and thus 2 finalization steps, finalization being the most expensive part of the hashing process.

Todo

Only finalize in finish
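The proposed shape can be sketched with a toy hasher (the mixing and finalization below are stand-ins, not gxhash's real AES-based internals): `write` only folds bytes into the running state, and the expensive finalization runs exactly once, in `finish`.

```rust
use std::hash::Hasher;

// Toy sketch of deferred finalization (not gxhash's actual algorithm).
struct DeferredHasher {
    state: u64,
}

impl Hasher for DeferredHasher {
    fn write(&mut self, bytes: &[u8]) {
        // Cheap compression only: fold each byte into the state.
        for &b in bytes {
            self.state = self.state.rotate_left(5) ^ u64::from(b);
        }
    }

    fn finish(&self) -> u64 {
        // Single finalization step (stand-in for gxhash's AES rounds).
        self.state.wrapping_mul(0x9E3779B97F4A7C15)
    }
}

fn main() {
    // Hashing a &str issues two writes (the bytes, then a terminator),
    // but finalization still runs only once, in `finish`.
    let mut h = DeferredHasher { state: 0 };
    h.write(b"hello");
    h.write(&[0xff]); // write_str's extra terminating write
    println!("{:x}", h.finish());
}
```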

terminated by signal SIGILL (Illegal instruction)

RUSTFLAGS="-C target-feature=+aes" cargo build --release

fn main() {
    let bytes: &[u8] = "hello world".as_bytes();
    let seed = 1234;

    println!(" 32-bit hash: {:x}", gxhash::gxhash32(&bytes, seed));
    println!(" 64-bit hash: {:x}", gxhash::gxhash64(&bytes, seed));
    println!("128-bit hash: {:x}", gxhash::gxhash128(&bytes, seed));
}

OS: Ubuntu 23
Cargo: cargo 1.80.0-nightly (4de0094ac 2024-05-09)

Hybrid state gxhash

Context

Currently, gxhash with a 128-bit state is much faster for small input sizes, while gxhash with a 256-bit state is faster for large input sizes. The idea is to study the possibilities of making a hybrid-state version of gxhash, combining the advantages of 128-bit state processing with those of 256-bit state processing, for maximum throughput at all input sizes.

See the break-even point:
(throughput plot showing the break-even point, omitted)

Challenges

There are two ways to achieve this:

  • Either modify the 256-bit state version to use 128-bit state logic for small input sizes
  • Or try to make the 256-bit state processing produce output stable with the 128-bit state version. This is more complex, but if achieved it could greatly simplify the usage of gxhash as a whole (stable output regardless of intrinsics width)

Some challenges are:

  • It must all be evaluated at compile time, because we don't want to increase the code size of the algorithm. There cannot be a dynamic switch between two state-width paths in the algorithm.
  • Making the 256-bit and 128-bit versions output-stable requires rethinking the high-ILP construction

Todo

  • Study possibilities of using generics to avoid dynamic state width evaluation
  • Figure out where the break-even point is
  • Study the possibility of having a stable high ILP construction for both 128 and 256 bit states

Quality issue: permutations produce the same hashes?

I have noticed a strange behavior which I am not sure whether it is intended or not:

The hashes of two u8 values a and b are the same regardless of order:

  • (a, b).hash(&mut state)
  • (b, a).hash(&mut state)

This is an issue in my particular use case because it prevents me from distinguishing between two distinct cases only by the hash.

I don't know whether this is actually a bug or not, as I am a total noob about cryptography and hash functions.

Thanks for your help/advice,

Dherse

Missing `std` compatibility traits on `GxHasher` and `GxBuildHasher`

I often use AHashMap as a drop in for HashMap in my code.
Switching to GxHashMap in one of my projects, I get this:

error[E0277]: the trait bound `GxBuildHasher: Clone` is not satisfied
  --> src/app.rs:60:26
   |
59 | #[derive(Clone, Debug, Default)]
   |          ----- in this derive macro expansion
60 | struct UnicodeCollection(HashSet<char>);
   |                          ^^^^^^^^^^^^^ the trait `Clone` is not implemented for `GxBuildHasher`
   |
   = note: required for `HashSet<char, GxBuildHasher>` to implement `Clone`

Here is a PR that, among other stuff, fixes this..

Make GxHasher DOS resistant

Context

Currently, GxHasher::default() is not DOS resistant. That is because it uses a default state/seed of 0, which implies that an attacker can generate in advance a set of colliding keys for this seed and dramatically degrade the efficiency of a given HashMap / HashSet on a targeted service.
It is very possible to use GxHash with a seed (a full state or a short, fixed-size i64 value). Thus, by making GxHasher::default() create a Hasher with a random state (using a cryptographically secure RNG) we can make it DOS resistant, as attackers will no longer be able to know in advance which sets of keys collide.
An i64 seed is more than enough for this purpose.

Todo

  • Change GxHasher::default() to use a random seed.
  • Mention DOS resistance in README
  • Release new version
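One std-only way to obtain such a random i64 seed (an illustration of the idea, not necessarily what gxhash ships) is to piggyback on std's RandomState, which is itself keyed from OS entropy:

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

// Derive a random i64 seed from std's randomly keyed RandomState.
// Sketch only: a real implementation might read OS entropy directly.
fn random_seed() -> i64 {
    RandomState::new().build_hasher().finish() as i64
}

fn main() {
    println!("seed: {}", random_seed());
}
```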

WASM support

Hello, I find this project quite interesting. Do you think there could be a WASM-compatible version (perhaps just not manually vectorized) for applications that compile to both WASM and native targets?

Congratulations on the awesome performance and good quality metrics! 🎉

Thanks in advance,

Dherse

build script is wrong

The build script in this repository is wrong in many ways.

gxhash/build.rs

Lines 5 to 6 in c9c9f61

// When conditions permits, enable hybrid feature to leverage wider intrinsics for even more throughput
if version_meta().unwrap().channel == Channel::Nightly

Checking for nightly to enable features should not be done. If the feature ever changes, builds using old versions of this library on new nightlies will break, which is highly undesirable. Not everyone who uses nightly wants to use unstable features, and they should never be implicitly enabled behind the user's back. Use a feature that can be enabled by the user, or some global cfg, instead.

gxhash/build.rs

Lines 7 to 9 in c9c9f61

&& cfg!(target_arch = "x86_64")
&& cfg!(target_feature = "avx2")
&& cfg!(target_feature = "vaes") {

This checks the features of the build script, i.e. of the host. This makes no sense; the host bears no resemblance to the target.

gxhash/build.rs

Lines 13 to 15 in c9c9f61

// If not cross compiling, make sure the aes feature is available
if std::env::var("HOST").unwrap_or_default() == std::env::var("TARGET").unwrap_or_default()
&& cfg!(not(target_feature = "aes")) {

While this attempts to handle cross compilation (unlike the previous check), it does not handle it correctly. By default, this works. But when using cargo cross compilation mode, even for the same target as the host (by passing --target explicitly, which is done by Miri and also required for -Zbuild-std), then RUSTFLAGS only apply to the target binaries, not programs on the host. Using cfg in the build script to attempt to detect whether the target feature is enabled does not work. Instead, check the cfg in the code and compile_error! when it is not enabled.
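The suggested fix can be sketched as a target-side check placed in the library code rather than the build script (the exact feature list and message are assumptions for illustration):

```rust
// In lib.rs, not build.rs: this cfg is evaluated for the *target*,
// so it behaves correctly under cross compilation, Miri and -Zbuild-std.
#[cfg(not(any(
    all(target_arch = "x86_64", target_feature = "aes"),
    all(target_arch = "aarch64", target_feature = "aes")
)))]
compile_error!(
    "gxhash requires AES hardware acceleration; \
     build with RUSTFLAGS=\"-C target-feature=+aes\""
);
```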

Primitives specialization for Hasher

Context

The Hasher trait exposes methods to hash primitives. Currently, we hash primitives by treating them all as slices of bytes. Hashing can be much faster if the type is known in advance (e.g. by loading the primitive directly into a SIMD vector).

Triggered by the following discussion: rust-lang/hashbrown#487

Goals

  • Hashing a primitive type is faster.
  • Hashing a primitive still follows the algorithm principles (and thus remains stable and passes SMHasher)

Todo

  • Add benchmark to test hashing on primitive, like u32
  • Implement all methods (write_u32, write_u64, ...)
    • Make it work for ARM
    • Make it work for x86
  • Publish benchmark results before/after on ARM/x86

Possible seed and value recovery

For small inputs (<16 bytes), the input is zero-padded to 16 bytes and the length of the input is added to each byte.

Value recovery
During compression, the byte array { 1, 2, 3, 4 } becomes the 16-byte vector <5, 6, 7, 8, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4>.
It is then run through Finalize, which runs 4 rounds of AES encryption with static keys:

output = Encrypt(input, seed);
output = Encrypt(output, key1);
output = Encrypt(output, key2);
output = Encrypt(output, key3);

It is possible for an attacker who knows the seed to recover the original value by inverting the function.

output = Decrypt(input, key3);
output = Decrypt(output, key2);
output = Decrypt(output, key1);
output = Decrypt(output, seed);

We get a masked input, but can simply unmask it by taking the length from the end of the vector and subtracting it from each of the values in the vector. <5, 6, 7, 8, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4> then becomes the original input <1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
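The unmasking step can be sketched directly (valid for inputs shorter than 16 bytes, where at least one padding byte carries the length):

```rust
// Subtract the input length (read from the tail padding) from every byte
// of the decrypted vector, recovering the zero-padded original input.
fn unmask(masked: &[u8; 16]) -> [u8; 16] {
    let len = *masked.last().unwrap(); // padding bytes hold the length
    let mut out = [0u8; 16];
    for (o, &m) in out.iter_mut().zip(masked) {
        *o = m.wrapping_sub(len);
    }
    out
}

fn main() {
    // The masked vector from the example above.
    let masked = [5, 6, 7, 8, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4];
    assert_eq!(unmask(&masked), [1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]);
}
```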

Invertibility in a non-cryptographic hash function is not an issue, but it makes certain attacks much easier.

Seed recovery
In the case where an attacker does not know the seed, it is possible to recover the seed through brute-force, given only a single full-length hash output.

Let's say we are given a 128-bit hash value of <213, 184, 163, 143, 142, 216, 110, 155, 68, 66, 183, 166, 216, 225, 120, 52>
We run it through our inverted hash function above and get <148, 13, 13, 13, 13, 13, 13, 207, 13, 13, 58, 13, 13, 144, 13, 13>

At this point we don't know that the original input was <5, 6, 7, 8, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4>, but we can brute-force the seed by running a single decrypt round: output = Decrypt(input, seed);

We know we have the right seed if the result of decryption ends with a repeating integer (the length). We can filter false positives by checking that the repeating bytes start length-number-of-bytes into the vector. Once we have the seed, we can recover the original value.
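The false-positive filter described above can be sketched as a predicate over a candidate decryption:

```rust
// A candidate seed is plausible if the decrypted (still masked) vector ends
// with a run of some byte `len`, and that run starts exactly `len` bytes in.
fn looks_like_valid_padding(v: &[u8; 16]) -> bool {
    let len = v[15] as usize;
    if len >= 16 {
        return false;
    }
    v[len..].iter().all(|&b| b as usize == len)
}

fn main() {
    // The masked 4-byte input from the example: tail run of 4s starting at index 4.
    assert!(looks_like_valid_padding(&[5, 6, 7, 8, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]));
    // A wrong-seed decryption from the example does not fit the pattern.
    assert!(!looks_like_valid_padding(&[148, 13, 13, 13, 13, 13, 13, 207, 13, 13, 58, 13, 13, 144, 13, 13]));
}
```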

Couple issues in the paper

"Combiniting" is not a word

"Ideally, half of the bit should change on average." should be "Ideally, half of the hash should change on average"

Functionality on stable

It seems like this is nightly only, but it is probably pretty easy to get it working on stable. Here are some workarounds for the nightly features it uses:

  • pointer_byte_offsets: just cast it back and forth ptr.cast::<u8>().add(...).cast()
  • stmt_expr_attributes: looks like you only use this to make #[allow(unused_assignments)] ok. You can probably eliminate this attribute and work around it with something like let _tmp_ptr; before the if block then load_unaligned(_tmp_ptr, ...)
  • core_intrinsics for likely: you may be able to reverse this and get the same effect with the stable #[cold] attribute instead. There is also the likely_stable crate, which may work. You also might not even need it: not sure if you've benchmarked, but IIRC likely doesn't gain as much in recent LLVM versions
  • stdsimd for _mm_loadu_epi8. I'm unfortunately not really sure what to do here, maybe you could inline assembly it?
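The first workaround above (for pointer_byte_offsets) can be sketched concretely; this is the stable back-and-forth cast standing in for the nightly byte_add:

```rust
// Stable replacement for nightly `ptr.byte_add(offset)`:
// cast to *const u8, offset in bytes, cast back to *const T.
unsafe fn byte_offset<T>(ptr: *const T, offset: usize) -> *const T {
    ptr.cast::<u8>().add(offset).cast()
}

fn main() {
    let data: [u32; 4] = [1, 2, 3, 4];
    let p = data.as_ptr();
    // Advancing by 4 bytes lands on the second u32.
    let q = unsafe { byte_offset(p, 4) };
    assert_eq!(unsafe { *q }, 2);
}
```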

Missing generic implementation or helper

The crate seems to be missing a non-arch-specific implementation, which forces anyone using gxhash to select another hash and use cfg directives to choose between them (obviously, releasing non-specialized crates that only work on x86-64 and arm64 is not acceptable).

If the user selecting another hash is desired, then all GxHash types should take the other hash as a generic parameter, for user convenience.

write(&[1u8]) and write_u8(1) yield different hashes

Hello,

this is mostly an API question -
should these two invocations yield the same hash?
I would expect that they would yield the same hash as they do for the DefaultHasher (see here).

use gxhash::*;
use std::hash::Hasher;

fn main() {
    let mut h = GxHasher::with_seed(0);
    h.write_u8(1);
    dbg!(h.finish()); // 7327909443358324775

    let mut h = GxHasher::with_seed(0);
    h.write(&[1u8]);
    dbg!(h.finish()); // 8871263162142091921
}
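For comparison, the expectation does hold for std's DefaultHasher, because Hasher::write_u8 has a default implementation that forwards the byte to write; a quick check:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// With DefaultHasher, write_u8(1) and write(&[1u8]) feed identical byte
// streams into the hasher, so the two finishes agree.
fn main() {
    let mut a = DefaultHasher::new();
    a.write_u8(1);

    let mut b = DefaultHasher::new();
    b.write(&[1u8]);

    assert_eq!(a.finish(), b.finish());
}
```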

Kind regards,
ambiso

How many bytes should be passed to write() each time for best efficiency?

I see that blake3
https://docs.rs/blake3/latest/blake3/struct.Hasher.html

Hasher implements std::io::Write, so it’s possible to use std::io::copy to update a Hasher from any reader. Unfortunately, this standard approach can limit performance, because copy currently uses an internal 8 KiB buffer that isn’t big enough to take advantage of all SIMD instruction sets. (In particular, AVX-512 needs a 16 KiB buffer.) update_reader avoids this performance problem and is slightly more convenient.

I want to ask: how many bytes should be passed to write() each time for best efficiency when using gxhash?
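A streaming loop with an explicit buffer size looks like this (DefaultHasher stands in here, since GxHasher implements the same std::hash::Hasher trait; the 16 KiB constant is blake3's AVX-512 advice quoted above, gxhash's own sweet spot is not documented, so treat it as a tunable assumption; also note that chunked writes only produce a consistent hash if the hasher does not finalize per write call):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::Read;

// Buffer size per write() call; a tunable assumption, see lead-in.
const BUF_SIZE: usize = 16 * 1024;

// Stream any reader into a Hasher, BUF_SIZE bytes at a time.
fn hash_reader<R: Read>(mut reader: R) -> std::io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    let mut buf = vec![0u8; BUF_SIZE];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.write(&buf[..n]);
    }
    Ok(hasher.finish())
}

fn main() -> std::io::Result<()> {
    let data = vec![42u8; 100_000];
    println!("{:x}", hash_reader(&data[..])?);
    Ok(())
}
```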

cargo bench --no-fail-fast -F=avx2 failed


     Running benches/hashset.rs (target/release/deps/hashset-e0bfa844d705dd83)
Gnuplot not found, using plotters backend
HashSet/u32/Default Hasher
                        time:   [16.380 ns 16.711 ns 17.067 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking HashSet/u32/GxHash: Warming up for 3.0000 serror: bench failed, to rerun pass `--bench hashset`

Caused by:
  process didn't exit successfully: `/root/git/gxhash/target/release/deps/hashset-e0bfa844d705dd83 --bench` (signal: 4, SIGILL: illegal instruction)
     Running benches/ilp.rs (target/release/deps/ilp-18f7b97131886606)
Gnuplot not found, using plotters backend
baseline                time:   [134.34 µs 135.20 µs 136.28 µs]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

unrolled                time:   [129.67 µs 132.18 µs 136.40 µs]
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe

temp                    time:   [54.313 µs 55.820 µs 57.345 µs]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

laned                   time:   [46.377 µs 47.916 µs 49.764 µs]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

     Running benches/throughput/main.rs (target/release/deps/throughput-ce001a906bb8460b)
gxhash-avx2
error: bench failed, to rerun pass `--bench throughput`

Caused by:
  process didn't exit successfully: `/root/git/gxhash/target/release/deps/throughput-ce001a906bb8460b --bench` (signal: 4, SIGILL: illegal instruction)
     Running benches/throughput_criterion.rs (target/release/deps/throughput_criterion-c233dcb596180928)
Gnuplot not found, using plotters backend
Benchmarking all/gxhash-avx2/4: Warming up for 3.0000 serror: bench failed, to rerun pass `--bench throughput_criterion`

Caused by:
  process didn't exit successfully: `/root/git/gxhash/target/release/deps/throughput_criterion-c233dcb596180928 --bench` (signal: 4, SIGILL: illegal instruction)
error: 3 targets failed:
    `--bench hashset`
    `--bench throughput`
    `--bench throughput_criterion`
~/git/gxhash main                                                                                                                                                     1m 30s root@u2 13:43:20
❯ cargo bench --no-fail-fast -F=avx2

C# version

I saw your post on dotnet/runtime#85206 about a C# port. Are you able to share it? I'd like to benchmark it both in terms of speed and correctness.

Use aligned loads in get_partial_safe

In get_partial_safe, it's possible to use aligned loads by declaring the buffer as MaybeUninit<State>, then casting the pointer for std::ptr::copy and zeroing the rest of the buffer, instead of declaring the buffer as a u8 array.

let mut buffer = [0i8; VECTOR_SIZE];
// Copy data into the buffer
std::ptr::copy(data as *const i8, buffer.as_mut_ptr(), len);
// Load the buffer into a __m128i vector
let partial_vector = _mm_loadu_epi8(buffer.as_ptr());
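The proposed change can be sketched without intrinsics; `State` below is a stand-in for gxhash's 128-bit state type (a 16-byte array forced to 16-byte alignment), so the subsequent SIMD load could be an aligned one:

```rust
use std::mem::MaybeUninit;

// Stand-in for gxhash's SIMD state: 16 bytes, 16-byte aligned.
#[repr(align(16))]
#[derive(Clone, Copy)]
struct State([u8; 16]);

// Copy `len` bytes into an aligned, zeroed staging buffer. After this,
// an aligned load (e.g. _mm_load_si128) on the result would be valid.
unsafe fn get_partial_safe(data: *const u8, len: usize) -> State {
    let mut buffer = MaybeUninit::<State>::zeroed(); // zero the padding
    std::ptr::copy(data, buffer.as_mut_ptr().cast::<u8>(), len);
    buffer.assume_init()
}

fn main() {
    let input = [1u8, 2, 3, 4];
    let state = unsafe { get_partial_safe(input.as_ptr(), input.len()) };
    assert_eq!(&state.0[..4], &[1, 2, 3, 4]);
    assert!(state.0[4..].iter().all(|&b| b == 0));
    assert_eq!(std::mem::align_of::<State>(), 16);
}
```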

Major bug: out of bounds read for some input sizes

Context

For input sizes >= 80 bytes that are a multiple of 16 (the vector size), the construction reads one vector out of bounds, making such hashes invalid.
This is a major bug for two reasons:

  • It means, for instance, that hashing the same 80-byte input twice can produce different hashes
  • It can lead to segfaults from reading memory out of bounds

Todo

  • Fix the issue
  • Add a unit test to prevent out-of-bounds reads from happening again
  • Publish update (2.2.5)
  • Cargo yank bugged versions (2.X.X up to 2.2.4 included)
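The shape of such a regression test can be sketched as follows, with std's DefaultHasher standing in (the real test would call gxhash): hash two separate allocations with identical contents, for every length around the reported 80-byte / multiple-of-16 boundary, so that with the bug an out-of-bounds read would pick up different neighboring memory.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Stand-in for gxhash::gxhash64 in this sketch.
fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(bytes);
    h.finish()
}

fn main() {
    for len in 0..=256usize {
        // Two separate allocations with identical contents: with the bug,
        // an out-of-bounds read would see different neighboring bytes.
        let a = vec![0xABu8; len];
        let b = vec![0xABu8; len];
        assert_eq!(hash_bytes(&a), hash_bytes(&b), "mismatch at len {len}");
    }
}
```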

Non-hardcoded AES keys

The code seems to hardcode AES keys.

This makes no sense; it should instead use a mechanism similar to what the standard library uses to provide keys for SipHash.

Latency benchmarks

Hi! It would be interesting to see a comparison of the time to hash a short string (like in the aHash README). Throughput is great, but since in many cases hashmap keys are short strings, the time to hash 40 KB of data is less relevant than the time to hash e.g. 16-byte strings. A plot of hashing time per key size would be really helpful IMO.

Tracking: missing compatibility with `std::HashMap` & `std::HashSet`

I just looked into replacing ahash with gxhash elsewhere and ran into HashMap::new() missing.

I reckon there may be more. I'll have a look later and may add some PRs to this issue. I'd keep it open until I'm certain the gxhash containers are full drop-in replacements for the std versions.
