wangyi-fudan / wyhash Goto Github PK

View Code? Open in Web Editor NEW

919.0 37.0 71.0 9.6 MB

The FASTEST QUALITY hash function, random number generators (PRNG) and hash map.

License: The Unlicense

C++ 8.35% Makefile 0.66% C 90.99%

hash fast simple random-number-generators c hashmap bloom-filter

wyhash's People

Contributors

Stargazers

Watchers

wyhash's Issues

Use fixed width integer types instead of unsigned long long

unsigned long long has at least 64 bits and really has 64 bits on most platforms. However, it is possible that it has more than 64 bits on some platforms. Therefore, it would be better to replace unsigned long long by uint64_t which is defined since C99 (see https://en.cppreference.com/w/c/types/integer). Otherwise, wyhash would behave unexpectedly on exotic platforms that implement unsigned long long with more than 64 bits.

wyhash paper

My dear friends:
I feel almost satisfied with current maturity of wyhash. Maybe it's time to write an "immortal" paper. I need your help on history review/technical words/tables/language editing etc. I bravely imagine that this can be achieved on Github. Join me to be an author freely!
Many thanks!
Yours,
Yi

gamma version is only compiled on gcc

__uint128_t is a gcc extension, which do not works on msvc and clang

lose entropy

it should be noted that wyhash can lose entropy (similarly as described below). For instance, when data could be correlated with the seed ^ _wypN values or equal to it.

@leo-yuriev described this issue on the README of t1ha hash.

Wyhash version gamma - no TCC support

Wyhash version gamma does not seem to be working with Tiny C Compiler (TCC). It would be greatly appreciated if support for this compiler was added.

/home/runner/.cache/v/v2.tmp.c:688: warning: implicit declaration of function '_wyread64'
tcc: error: undefined symbol '_wyread64'

wyrand interface change

Is there any reason why the interface of wyrand was changed to

wyhash/wyhash.h

Line 72 in 05d803d

static inline unsigned long long wyrand(void){

? I prefered the old interface of version 1:

wyhash/wyhash.h

Line 69 in e1e1b50

 static inline unsigned long long wyrand(unsigned long long *s){ *s+=_wyp0; return _wymum(*s^_wyp1,*s); } 

which does not rely on a single global state. Otherwise, it is difficult to create two independent streams of pseudo-random numbers.

Wyhash is used in Zig language

wyhash.zig

Currently wyhash is a hasher by default in a hash table.
IMHO this can be mentioned in readme.

Thank you!

Trivial DoS collision for hash combiners

Somewhat related to the previous issue: #36

Imagine any code that combines hashes with wyhash64:

uint64_t getUserHash(const User& user) {
  auto h0 = hash(user.firstName.data(), user.firstName.size(), 0);
  auto h1 = hash(user.lastName.data(), user.lastName.size(), h0);
  auto h2 = hash(user.company.data(), user.company.size(), h1);
  return wyhash64(h2, user.phoneNumber);
  // or
  // return wyhash64(user.phoneNumber, h2);
}

If an adversary can control this field, all they have to do is to write _wyp0 (for the first variation) or _wyp1 (for the second variation) there to make getUserHash always return 0 regardless of the value of the other data (because _wyp0 ^ _wyp0 is 0, and multiplying by 0 returns 0).

wyhash64(EvilConstant, input) should be a good hash for any value of EvilConstant.
The same is true for wyhash64(input, EvilConstant).

I understand that wyhash is not a crypto hash, but this is probably not a very difficult thing to fix, and it directly affects the quality of the hash.

Unexpected behavior of the new _wymum with WYHASH32

Regarding this update: fd9229f

Consider this snippet (includes and common functions are omitted):

uint64_t _wymum64(uint64_t A, uint64_t B) {
  A = _umul128(A, B, &B);
  return A ^ B;
}

uint64_t _wymum32_old(uint64_t A, uint64_t B) {
  uint64_t hh = (A >> 32) * (B >> 32), hl = (A >> 32) * (unsigned)B,
           lh = (unsigned)A * (B >> 32),
           ll = (uint64_t)(unsigned)A * (unsigned)B;
  return _wyrotr(hl, 32) ^ _wyrotr(lh, 32) ^ hh ^ ll;
}

uint64_t _wymum32_new(uint64_t A, uint64_t B) {
  uint64_t c = A ^ B;
  return ((c >> 32) * (unsigned)c) ^
         _wyrotr(((A >> 32) * (unsigned)A) ^ ((B >> 32) * (unsigned)B), 32);
}

int main() {
  for (uint64_t a = 100; a < 105; ++a) {
    for (uint64_t b = 100; b < 105; ++b) {
      std::cout << "A: " << a << " B: " << b
          << " _wymum64: " << _wymum64(a, b)
          << " _wymum32_old: " << _wymum32_old(a, b)
          << " _wymum32_new: " << _wymum32_new(a, b) << '\n';
    }
  }
}

The output for me (VS2019) is:

A: 100 B: 100 _wymum64: 10000 _wymum32_old: 10000 _wymum32_new: 0
A: 100 B: 101 _wymum64: 10100 _wymum32_old: 10100 _wymum32_new: 0
A: 100 B: 102 _wymum64: 10200 _wymum32_old: 10200 _wymum32_new: 0
A: 100 B: 103 _wymum64: 10300 _wymum32_old: 10300 _wymum32_new: 0
A: 100 B: 104 _wymum64: 10400 _wymum32_old: 10400 _wymum32_new: 0
A: 101 B: 100 _wymum64: 10100 _wymum32_old: 10100 _wymum32_new: 0
A: 101 B: 101 _wymum64: 10201 _wymum32_old: 10201 _wymum32_new: 0
A: 101 B: 102 _wymum64: 10302 _wymum32_old: 10302 _wymum32_new: 0
A: 101 B: 103 _wymum64: 10403 _wymum32_old: 10403 _wymum32_new: 0
A: 101 B: 104 _wymum64: 10504 _wymum32_old: 10504 _wymum32_new: 0
A: 102 B: 100 _wymum64: 10200 _wymum32_old: 10200 _wymum32_new: 0
A: 102 B: 101 _wymum64: 10302 _wymum32_old: 10302 _wymum32_new: 0
A: 102 B: 102 _wymum64: 10404 _wymum32_old: 10404 _wymum32_new: 0
A: 102 B: 103 _wymum64: 10506 _wymum32_old: 10506 _wymum32_new: 0
A: 102 B: 104 _wymum64: 10608 _wymum32_old: 10608 _wymum32_new: 0
A: 103 B: 100 _wymum64: 10300 _wymum32_old: 10300 _wymum32_new: 0
A: 103 B: 101 _wymum64: 10403 _wymum32_old: 10403 _wymum32_new: 0
A: 103 B: 102 _wymum64: 10506 _wymum32_old: 10506 _wymum32_new: 0
A: 103 B: 103 _wymum64: 10609 _wymum32_old: 10609 _wymum32_new: 0
A: 103 B: 104 _wymum64: 10712 _wymum32_old: 10712 _wymum32_new: 0
A: 104 B: 100 _wymum64: 10400 _wymum32_old: 10400 _wymum32_new: 0
A: 104 B: 101 _wymum64: 10504 _wymum32_old: 10504 _wymum32_new: 0
A: 104 B: 102 _wymum64: 10608 _wymum32_old: 10608 _wymum32_new: 0
A: 104 B: 103 _wymum64: 10712 _wymum32_old: 10712 _wymum32_new: 0
A: 104 B: 104 _wymum64: 10816 _wymum32_old: 10816 _wymum32_new: 0

You can adjust a and b to pretty high values, the result of the new _wymym will stay 0 (I picked the range between 100 and 105 just as an example).
I don't know if this is a problem for the overall result or not, but it looks suspicious. Is this the expected behavior?

Commit messages [suggestion]

Hi Yi, again thanks for wyhash and wyrand! 🙏

I'm watching this repo, to keep abreast of what your doing, and with an eye on updating my C# port to include wyhash v2 and v3 once they've stabilised.

I have a polite suggestion to propose 😄 Most of your git commit messages are very generic, such as "Update wyhash.h", or "Add files via upload", and these make it a bit difficult to follow your changes. My suggestion is to use more specific commit messages than explain what has been changed, such as you did here.

It is of course but a suggestion - feel free to ignore it 🙂

when len % 64 === 0, we can reduce one time of mix

If len = n * 64 where n >= 2`, we will do a a mix again, that could be saved.

Now:

   if(_likely_(i>=4)) return _wymix(_wyr4(p)^secret[0],_wyr4(p+i-4)^seed);
   else return _wymix((_likely_(i)?_wyr3(p,i):0)^secret[0],seed); // come here if len = 0, 1, 2, 3, 128, 128 + 64....

Modified:

   if(_likely_(i>=4)) return _wymix(_wyr4(p)^secret[0],_wyr4(p+i-4)^seed);
   else if(_likely_(loop)) return seed; // come here if len = 128, 128 + 64, ...
   else return _wymix((_likely_(i)?_wyr3(p,i):0)^secret[0],seed); // come here if len = 0, 1, 2, 3

Use of wyhash with BBhash causes collision

I'm trying to use wyhash with BBhash (https://github.com/rizkg/BBHash), however this is causing a strange collision. I've opened an issue with BBHash here:

rizkg/BBHash#12

Since the problem goes away with every other hash implementation I've tried, it could very well be an issue with wyhash or my use of wyhash. Therefore I'm posting this here for your feedback. I apologize if this is truly not relevant, however the issue is sufficiently odd that I thought I would ask.

madhash

I would like to introduce the madhash. It samples up to 32 byte of the data to produce a hash.
It breaks every rules, however, it is valid to short string hashing up to 32 byte. for longer strings, it is a bad guy but works. The bulk speed is insane...

Benchmarking /usr/share/dict/words
HashFunction Words Hashmap Bulk64K Bulk16M
madhash 303.92 50.70 11295.11 2863311.53
std::hash 96.09 36.57 7.34 7.36
wyhash 265.71 45.59 26.35 21.78
xxHash64 110.84 35.85 14.71 14.58
XXH3_scalar 183.82 42.64 13.09 13.05
t1ha2_atonce 127.93 36.26 16.60 16.32

non-uniform distribution of wyrand()

@wangyi-fudan and @lemire, I would like to discuss distribution of wyrand().

AFAIK, the function mum(x ^ y, x) is not bijection for x with the any y.
Therefore wyrand() should show a statistical skew/bias.
Simply put, it should not (be able to) return all values from set 0...2**64-1, but some more than once.

What do you think about this?
Did you make any estimates of the number of "collisions"?
Did you found some tactics for optimal choice the value of y?

For instance, check by brute-force for 16-bit case shown at least 39483 collisions.
I.e. my test shown that 16-bit variant of wyrand() generates only ~40% values from complete set of 16-bit integers.

wyhash_beta version is ready

Yann Collet discovered that early wyhash has 62 bit uniqueness strength. https://github.com/Cyan4973/xxHash/wiki/Collision-ratio-comparison
The wyhash_beta version solved the problem.

And now wyhash is optimized for a new benchmark: 0-31 random byte test. It generate 0-31 unpredictable random length according to hash result, thus do not require a PRNG. I PRed smhasher.

any way to avoid unaligned reads?

I think wyhash looks really promising and am considering using it in my project. Looking at the code, it seems like it will perform unaligned reads from memory if the input isn't aligned. This is undefined behavior, and while it works fine on architectures like x86, some architectures will generate hardware exceptions for unaligned reads. More info here: https://stackoverflow.com/questions/32062894/take-advantage-of-arm-unaligned-memory-access-while-writing-clean-c-code

Perhaps you could add some (optional) code to handle this? Maybe something like this?

unsigned long long wyhash_nounaligned(const void* key, unsigned long long len, unsigned long long seed) {
  const	unsigned char *ptr=(const unsigned char*)key;
  int remainder = (uintptr_t)ptr & 7;
  if (len < 32 && remainder) {
    union {
      long long align;  // Ensure 8-byte alignment of buffer.
      char buf[8];
    } tmp;
    int patch = 8 - remainder;
    memcpy(&tmp.buf, ptr, patch);
    seed = wyhash(tmp.buf, patch, seed);
    ptr += patch;
    len -= patch;
  }
  return wyhash(ptr, len, seed);
}

minimal hash map/set and bloom filter

Dear All:
I just implemented the minimal hash table/set and bloom filter. Each of them are only 7 lines, but the hash table speed is on par with state-of-the-art and is most memory efficient They are manual, but are flexible. In some genomic applications, we just want all memory be used and disallow auto growth. The std::pair+one flag byte may waste space due to alignment.
Hope you like them.

Stability across platforms (little/big endian, ARM/x86/..., ...)

Could you please add detailed information about hash stability across different platforms to README.md? Thanks 😉.

major/fundamental: probability of fatal entropy loss.

Well, it was obvious from the beginning.
However, I did not see that this issue was voiced and/or discussed somewhere.
I assume that not all users can see this drawback, respectively, can be misled and make a decision without realizing all the risks.
So I decided to raise this issue (better late than never).

From the source code make it clear that wyhash uses problematic "Narrow-pipe" Merkle–Damgård construction with the MUL-XOR mixer as an entropy sponge.

static inline uint64_t wyhash(const void *key, uint64_t len, uint64_t seed) {
  const uint8_t *p = (const uint8_t *)key;
  uint64_t i;
  for (i = 0; i + 32 <= len; i += 32, p += 32)
    seed = _wymum(seed ^ _wyp0,
                  _wymum(_wyr64(p) ^ _wyp1, _wyr64(p + 8) ^ _wyp2) ^
                      _wymum(_wyr64(p + 16) ^ _wyp3, _wyr64(p + 24) ^ _wyp4));
  seed ^= _wyp0;
  switch (len & 31) {
  case 1:
  /* ... omitted for brevity ... */
  case 30:
    seed = _wymum(__wyr64(p) ^ seed, __wyr64(p + 8) ^ _wyp2) ^
           _wymum(__wyr64(p + 16) ^ seed,
                  ((_wyr32(p + 24) << 16) | _wyr16(p + 24 + 4)) ^ _wyp4);
    break;
  case 31:
    seed = _wymum(__wyr64(p) ^ seed, __wyr64(p + 8) ^ _wyp2) ^
           _wymum(__wyr64(p + 16) ^ seed,
                  ((_wyr32(p + 24) << 24) | (_wyr16(p + 24 + 4) << 8) |
                   _wyr08(p + 24 + 6)) ^
                      _wyp4);
    break;
  }
  return _wymum(seed, len ^ _wyp5);
}

The problem here is that if one multiplier equals zero, then the result is independent of the other multiplier.

Therefore, some individual 64-bit words and/or on any initial part of the processed data may not affects the hashing result.
The probability of such cases cannot be called large, but it increases in proportion to the size of the data.

In my opinion, this is unacceptable for a universal/common hash function. However, in some cases where the kind of the data is known (e.g. for ascii texts), you can be sure that no problems will occur.

So I think you need to clearly describe this defect.

For instance the test case:

#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

#include "wyhash.h"

int main(int argc, const char *argv[]) {
  const unsigned N = 128;
  uint64_t array[N];
  static_assert(N > 4, "precondition");

  const unsigned offset = 4;
  static_assert(offset % 4 == 0 && offset < N, "precondition");

  memset(array, 'x', sizeof(array));
  array[N - 4] = UINT64_C(0xe7037ed1a0b428db);
  array[N - 1] = UINT64_C(0x1d8e4e27c47d124f);

  array[0] = 0;
  const uint64_t A = wyhash(array, sizeof(array), 42);
  printf("wyhash(%" PRIx64 ", xxx, poison, xxx) = %" PRIx64 "\n", array[0], A);

  array[0] = 1;
  const uint64_t B = wyhash(array, sizeof(array), 42);
  printf("wyhash(%" PRIx64 ", xxx, poison, xxx) = %" PRIx64 "\n", array[0], B);

  array[0] = 2;
  array[N - 3] = 'Y';
  const uint64_t C = wyhash(array, sizeof(array), 42);
  printf("wyhash(%" PRIx64 ", xxx, poison, Yxx) = %" PRIx64 "\n", array[0], C);

  return (A != B && B != C && A != C) ? EXIT_SUCCESS : EXIT_FAILURE;
}

Output:

wyhash(0, xxx, poison, xxx) = d1fb6b2bf52fd8db
wyhash(1, xxx, poison, xxx) = d1fb6b2bf52fd8db
wyhash(2, xxx, poison, Yxx) = d1fb6b2bf52fd8db

Use Git Tags for Versions

I have some trouble finding the latest stable version of the hash. Furthermore, I'm confused since there are several variants: FastestHash and wyhash. Is there a guide on when to use which? Or is it that one of them is always the best? Then why would you keep the other one?

For wyhash the repository contains several versions at the same time. Which one should I use? If only the latest is relevant: why not tag the older versions (using Git tags) and remove it from the master/HEAD? This way it's clear which version is the preferred one but older versions are still easily accessible.

The current version contains large chunks of commented out code. Is this code still relevant or can it be removed entirely? If it's just for historical reasons: why not archive it using Git? It will still be available through the Git history of the repository.

performance with version

can you give any performance compare from v1-v6?

v5 default key fails MomentChi2 and Sparse

On my local testing:

wyhash secret[0]: 0xa0761d6478bd642f
wyhash secret[1]: 0xe7037ed1a0b428db
wyhash secret[2]: 0x8ebc6af09c88c6e3
wyhash secret[3]: 0x589965cc75374cc3
wyhash secret[4]: 0x1d8e4e27c47d124f

[[[ Keyset 'Sparse' Tests ]]]
Worst bias is the 13-bit window at bit 11 - 1.045% !!!!!

[[[ MomentChi2 Tests ]]]

Running 1st unseeded MomentChi2 for the low 32bits/step 3 ... 38917465.185213 - 410425.007092
Running 2nd seeded MomentChi2 for the low 32bits/step 3 ... 38919451.497948 - 410461.736885
KeySeedMomentChi2: 4.80631 FAIL!!!!

Representative set of test strings.

What would be representative set of strings to test the hash?
Should they be long or short?
I thought to run it million/billion times with that strings to see whether it is a difference between hashes for such operation.

I came up with more real life set of strings containing strings from single words to long sentences which goes like this:

# 1: top 1000 internet search terms @ 2008
max length: 57
avg length: 16.056

# 2: 1000 Most Asked Questions On Google
max length: 45
avg length: 23.024

# 3: Most common passwords list
max length: 9
avg length: 6.237

# 4: User:Adam78/List of tongue-twisters
max length: 999
avg length: 72.465

# 5: Top 500 domains from Mozilla
max length: 28
avg length: 11.808

Which gives almost 5000 strings.
max length: 999
avg length: 31.211

I also thought about repo containg a list of the 10,000 most common English words in order of frequency but they are just single words which implies short length therefore they could supplement up to 5k and that's it.

What do you think?

How does one calculate split/buffered input hash?

What is intermediate state of hash in case of split/buffered input? I thought it was seed and I used it as has intermediate hash for every slice. But I made a simple test and failed on first sliced input (at second iteration). Assume I get a date which is a string "abc" and split it in 1-byte slices then is it:
wyhash("abc", len, seed) == wyhash("c", len_slice, wyhash("b", len_slice, wyhash("a", len_slice, seed))) ?

		seed = _seed;
		cptr = (char*)_data;
		len = _data_len;
		for(j=0; j<len; j++)
			{
			hash = wyhash(cptr+j, 1, seed);
			seed = hash;
			}

my output is like this ($xx means seed):

#reference: seed: $0000000000000000
"a" : 31db9c4e34072a5f
"ab" : 8bb082c4c3cc236e
"abc" : e3db0f558c63ddee

"a" : 31db9c4e34072a5f $0000000000000000
---- summary #a ----
"a" : 31db9c4e34072a5f/31db9c4e34072a5f (SUCCEEDED) // this is ok

"ab" : 31db9c4e34072a5f $0000000000000000 // first iteration is ok
"b" : 8efeffb8472c70e0 $31db9c4e34072a5f // second is already not
---- summary #ab ----
"ab" : 8efeffb8472c70e0/8bb082c4c3cc236e (FAILED)

"abc" : 31db9c4e34072a5f $0000000000000000
"bc" : 8efeffb8472c70e0 $31db9c4e34072a5f
"c" : e17fff3caab4cb76 $8efeffb8472c70e0
---- summary #abc ----
"abc" : e17fff3caab4cb76/e3db0f558c63ddee (FAILED)

If this is right then what is wrong with this approach or code. If it's wrong then what is the right way?

I attached my test programs and their results.
wyhash_test-1abc-issue#28.tar.gz

Make things neater and more professional

There are many things that are generally messy and unprofessional about this repo.

First of all, as mentioned before in #34 and #61, the way you are using git is rather poor:

You are doing everything in the GitHub web editor, this is literally the most inefficient way of doing this.
Commit messages are still left out. Even if it seems redundant, it helps a lot. Even something as simple as fix typo is better than Update README.md, Add files via upload, etc. That way you can get an idea of what change each commit made without having to look at the diff itself.
Make small changes which can easily be undone.

If I am not mistaken, it seems you don't know how to use git. Don't fear git: Out of the hundreds of options, you will rarely ever use more than 20, and you can get away with only partially knowing 10. I can help you if you can't find a resource online that works for you.

Second of all, your code is very messy.

Code should be treated like a work of art, or at the very least, kept relatively organized.

If you want others to be comfortable with your code enough to use/contribute to it, they have to be able to understand it. This means making things easy to read, at least trying to stick to a code style, and documenting everything.

Nobody cares if the source code is tiny, including the compiler. The only place where that matters is JavaScript, but minified JavaScript is usually "compiled" from normal source code.

Obviously, nobody wants a 2 MB source file, but nobody cares about a few extra kilobytes, especially if it improves readability.

Would you rather read this:

static inline uint64_t FastestHash(const void *key, size_t len, uint64_t seed) {
	const uint8_t *p = (const uint8_t *)key;
	return _likely_(len >= 4) ? (_wyr4(p) + _wyr4(p + len - 4)) * (_wyr4(p + (len >> 1) - 2) ^ seed) : (_likely_(len) ? _wyr3(p, len) * (*_wyp ^ seed) : seed);
}

or this?

// A quick and dirty hash for very short keys.
// It is not secure by any means; it aims only to be fast.
// This only reads a sample of up to 12 bytes, ignoring
// the rest. This makes it unsuitable for long hashes.
static inline uint64_t FastestHash(const void *key, size_t len, uint64_t seed)
{
    const uint8_t *input = (const uint8_t *)key;
    if (_likely_(len >= 4)) {
        // Read the first, middle, and last 4 bytes.
        uint64_t first_4 = _wyr4(input);
        uint64_t last_4  = _wyr4(input + len - 4);
        uint64_t mid_4   = _wyr4(input + (len >> 1) - 2);
        // Mix and multiply.
        return (first_4 + last_4) * (mid_4 ^ seed);
    } else if (_likely_(len != 0)) {
        // Mix with _wyp and multiply.
        return _wyr3(input, len) * (_wyp[0] ^ seed);
    }
    // Return the seed on a zero-length hash.
    return seed;
}

It is significantly more readable, easier to contribute to and analyze, and GCC and Clang generate exact same code. (Compilers break things up into single steps and reorder any way they want, so combining into ternaries and other nonsense is pointless)

Also, please don't put commented out blocks of code in the repo — it is fine locally when testing, but nobody wants to look at this

Lastly, everything should be ready to compile and run from a fresh clone. This is unacceptable:

$ git clone https://github.com/wangyi-fudan/wyhash
$ cd wyhash
$ make
make: *** No rule to make target 'wyhash/wyhash.h', needed by 'benchhash'.  Stop.

Also, what is up with supplemental_material.zip? git repos are compressed already, and the extra disk space is insignificant.

I'd be happy to help with this — I love making things neat and writing/proofreading documentation — but I really don't want to put time into cleaning things up if you are going to just throw it all away. I'd be seriously disappointed (and slightly offended), and I feel the other people who tried to clean things up feel the same way.

Use Yves tests as they're stricter/better than smhasher

See rurban/smhasher#31 .

wyhash is slower than other popular hash functions for large enough key size in my performance benchmarks

I add wyhash hash function into my benchmark and below are my observations:

myhash is the fastest for the key sizes from 1-3 bytes
std::hash is faster than wyhash for the key size from 4-31 bytes.
AquaHash is the fastest hash function for key size > 31 bytes
clhash, created by @lemire, outperforms wyhash if the key sizes are divisible by 8. clhash is an excellent candidate if the data size is >= 96 bytes
The maximum bytes/cycle rate of wyhash is about 6 bytes/cycle and it is slower than all studied hash functions except boost::hash
xxhash64, created by @Cyan4973, is the best in term of usability i.e it provides the update function and is very easy to use, portability (support all OSes), and performant i.e the maximum rate is 7.5 bytes/cycle.

My results might be biased, however, you can always run the benchmark in your machine using steps described from this page https://github.com/hungptit/AquaHash/blob/master/benchmark.md.

The output for empty inputs doesn't depend on the seed.

There are multiple scenarios where the missing dependency from the seed can be a problem.

For example, chaining the hashes. Imagine a case where the hash of the first input is used as a seed for the second input, etc.

h0 = hash(data0, size0, 0)
h1 = hash(data1, size1, h0)
...
h100 = hash(data100, size100, h99)

If, for example, size97 was 0, the final h100 will never depend on data10 .. data96 (because the seed for h97 will be ignored). And if size100 is 0, the result will always be 0, regardless of all the other inputs.

Chaining hashes like in the example above is a common pattern, and people do it quite often:

uint64_t getUserHash(const User& user) {
  auto h0 = hash(user.firstName.data(), user.firstName.size(), 0);
  auto h1 = hash(user.lastName.data(), user.lastName.size(), h0);
  auto h2 = hash(user.company.data(), user.company.size(), h1);
  // notes are almost always empty
  return hash(user.notes.data(), user.notes.size(), h2);
}

If they will decide to replace their old hash with wyhash without reading the implementation, they will silently break their code (because getUserHash() will start returning 0 almost all the time).

The suggested solution is to either return the seed, or a simple hash of the seed. For example, something like:

return _wymum(_wymum(seed^_wyp0, _wyp1), _wyp2);

This is just a suggestion, I didn't run tests for this.

Ideally the seed should be hashed equally well with normal inputs, so that the quality of hash(&foo64, 8, 0) is the same as the quality of hash(0, 0, foo64).

PS: thank you for creating wyhash. It is by far the best quick hash function I tried so far, I really hope it will become more popular.

wytruerand "secret" [question]

The wytruerand function takes as parameters both a seed and a "secret" - I was wondering about the purpose of this, and whether it is indeed intended to be kept, well, "secret"?

Release/Tag for wyhash v4

Could you please create a release and/or a tag for version 4 as you have done for version 1 and 3. This would make it easier to reference. BTW, thank you very much for your work on wyhash!

wyrand(void) and wyrand(uint64_t) conflict

I understand this might be a v1 to v2 change (#12). In any case, if you want to keep both available, either a name change or an ifdef/else around them is in order.

Test vectors

Any chance you could publish some test vectors for wyhash?

This would be really useful for verifying correctness of ports to other languages.

define WYHASH32 will produce different hash

_wymum will return different values if WYHASH32 is defined at compile time. So it is better to define a separate hash func for WYHASH32 version.

License not defined

Are there any plans to define a license? I am asking, because I have seen that very first versions have been published under Apache 2.0.

wyhash_v6 is coming

Dear All:
Based on recent experiments, wyhash_v6 with avx2 support is coming. some preliminary
benchmark here, let's meow:

Benchmarking /usr/share/dict/words
HashFunction Words Hashmap Bulk64K Bulk16M
std::hash 96.28 34.17 7.34 7.36
MeowHash 73.90 25.59 37.18 26.88
wyhash_v6_avx2 253.69 44.53 40.26 25.74
xxHash64 105.03 35.13 14.70 14.62
XXH3_avx2 185.96 40.64 27.89 23.21
t1ha2_atonce 127.92 34.68 17.10 16.81

Provide suggested secrets or secret seed for v5

From v5, we have to generate secrets before hashing. Even though we have make_secret() helper function, getting qualified secrets still requires many knowledge about this project and other benchmark/testing frameworks, and spends hours to run many test cases. The testing results vary on different secret. For current v5 dev version (master@36aec06), about 20% of 1-byte seeds for make_secret fails rurban/smhasher. After many testing, it's a little lucky to find make_secret(13, secret) give out a group secrets passing rurban/smhasher, demerphq/smhasher and LongNeighborsTest from hmakholm/smhasher. When the hash algorithm updates, it have to search qualified secrets from zero.

So, when the codes become stable, it could be better to lock a default group of secrets that can pass all/most test cases.

wyhash V5 draft is out

Dear friends:
wyhash V5 draft is out. It make _wyp as variable secrets. By assuming _wyp as secrets, we reduced one xor operation and make the bulk speed crazy.
Tomorrow, I will write a help-function to generate secrets from a random uint64_t.
Exciting!

hashes.h not exists in the repository

#include        "hashes.h"

in hashbench.cpp

Wyhash v3 is not optimal for stream processing

Unlike the first two versions, the third version of Wyhash is not optimal for stream processing. After Wyhash v1 and v2 received a chunk of 32 bytes, they were able to process it immediately. Wyhash v3 must wait for the next byte to decide how to process the previous 32 bytes. This is because the last 32 bytes are processed by the main loop when they are not at the end of the stream. Otherwise, they are processed within the finalization step that does not match the main loop. As a result, a streaming version of Wyhash v3 will be more complex than those of v1 and v2.

Hash function versus PRNG function

A quote from http://www.reedbeta.com/blog/quick-and-easy-gpu-random-numbers-in-d3d11/ :

Incidentally, I was confused for a long time about the distinction between PRNGs and hash functions—they seem to do the same thing, i.e. create an “unpredictable” but deterministic output by jumbling up input bits. It wasn’t until I started playing around with GPU PRNGs that I realized the difference: PRNGs are designed for going deep, and hashes for going wide.

@wangyi-fudan could you maybe shed some light how wyhash performs as PRNG - i.e. when going "deep" instead of wide?

crypto secure hash function

according to https://github.com/rurban/smhasher/tree/crypto , sha2 and sha3 hash functions are not only slow but also bad. My new mission is to design a new crypto secure hash function, not only secure but also fast.

java version

Update documentation to point to java port https://github.com/OpenHFT/Zero-Allocation-Hashing

Wyhash v4 is not optimal for stream processing

While updating my Zig port of wyhash I noticed that unlike the first version, the fourth version of Wyhash is not optimal for stream processing. This is because the fourth version requires the length at the start of the hashing algorithm because it immediately processes the remaining (mod 32) bytes. The previous version would first process the 32 byte chunks. This makes implementing a streaming version much more complex.

make_secret() seems not guaranteed to return

Is the following loop guaranteed to end or is there a chance of infinite looping (depending on the given seed)?

...
  }while(!ok);

I'd say there should be detection of some fixed number of iterations and if that won't be enough, then just return some error. Or better, make the function always succeed 😉 (without sacrificing entropy from the seed).

Please increase the version number each time the test vectors change.

I.e. when an hash implementation change causes the output values to change. Otherwise, it is impossible to identify such versions and distinguish them from each other. So, it should not be a "old v4" and "new v4" version, but v4 and v5, and so on.

make_secret is slow

build with clang -O3, 4.5G intel cpu:

time ./a.out 
secret=7354222319401939409, 10064039272631357873

real	0m22.942s
user	0m22.901s
sys	0m0.019s

Is a random 16 byte nonce a valid secret ?

in safety wymum, why use bit or operation?

*A|=(uint64_t)r; *B|=(uint64_t)(r>>64);

is here a typo for bit xor?

wyhash 20190324 broke MomentChi2 for wyhash32low from 0.000751594 to 2.76359

See my wyhash-20190324 branch
2.7 is too high for my taste for the lower 32bits

I'd rather mix the seed before, not after

-       for(i=0,        seed^=_wyp0;    i+32<=len;      i+=32,  p+=32)  seed=_wymum(seed,
       _wymum(_wyr64(p)^_wyp1,_wyr64(p+8)^_wyp2)+_wymum(_wyr64(p+16)^_wyp3,_wyr64(p+24)^_wyp4));
+       for(i=0;        i+32<=len;      i+=32,  p+=32)  seed=_wymum(seed^_wyp0, _wymum(_wyr64(p)^_wyp1,_wyr64(p+8)^_wyp2)^_wymum(_wyr64(p+16)^_wyp3,_wyr64(p+24)^_wyp4));
+       seed^=_wyp0;

Please provide test vectors.

For instance like this (#1, #2).

a 'MUM' variant friendly to SSE and AVX

Here is a variant of 'MUM' function. I's quality is good enough, when just replacing the original one in alpha version wyhash, smhasher only fail in LongNeighbors test.
The new one is obviously slow in scalar mode, but all its internal computing should be easily rewritten as SSE/AVX instructions.

static inline uint64_t
high52(uint64_t a, uint64_t b)
{
        uint64_t aa = (a >> 12) | 0x42F0000000000000;
        uint64_t bb = (b >> 12) | 0x42F0000000000000;
        double c = *((double*)&aa) * *((double*)&bb);
        uint64_t dl = *((uint64_t*)&c);
        return (dl << 12) ^ dl;
}

static inline uint64_t
mum_d(uint64_t a, uint64_t b)
{
        return __builtin_bswap64(high52(a, b))
               ^ high52(__builtin_bswap64(a), __builtin_bswap64(b));
}

wangyi-fudan / wyhash Goto Github PK

wyhash's People

Contributors

Stargazers

Watchers

Forkers

wyhash's Issues

Recommend Projects

Recommend Topics

Recommend Org