Comments (13)
I think a hash function does not promise you that.
from wyhash.
So how do you calculate hash of "big data"? If I wanted to calculate hash of some bigger data, like files I have to buffer them and calculate hash of buffer, then next and next... as far as I understand it in crc32 the intermediate state is hash itself that is fed as seed to next iteration*
, in xxhash it's XXH32_state_t, so what it is in wyhash? Is there any way to calculate hash of data that don't fit in single buffer? If I wanted to calculate hash of Silesia corpus and wasn't able to fit it in single memory buffer**
how do I do that?
*
for example: CRC32_update_byte()/CRC32_update_buf()
**
Same applies to streaming scenario.
from wyhash.
got it! thanks! I will think about that.
from wyhash.
Maybe I expressed it imprecisely and not clearly.
It's like adding values. You can add y=a+b+c
and you can add one part at a time and eventually get whole sum:
istate = 0
--------------------+
istate += a |
istate += b |
istate += c |
--------------------+
y = istate
Assume you have lot of data to hash. If you wanted to hash it in one go it would be like that ($
is to show intermediate state of function):
hash(&abc, whole_len, $0000000) == 352441c2
Assume that for practical reasons you have to split them in parts called a, b, c.
+========+========+========+
|....a....................|....b....................|....c....................|
+========+========+========+
^$istate0..........^$istate1..........^$istate2..........^istate3
To calculate hash you have to hash part 'a' with initial value = 0 ($istate0), save intermediate state ($istate1), fill the buffer and hash part 'b' with initial value $istate1, save intermediate state ($istate2), fill the buffer and hash part 'c' with initial value $istate2. That would give you result for whole set of data.
$istate := 0x00000000 // $istate0
hash(&a, slice_len, $0000000) == e8b7be43
$istate := e8b7be43
hash(&b, slice_len, $istate) == 9e83486d
$istate := 9e83486d
hash(&c, slice_len, $istate) == 352441c2
$istate := 352441c2
_end_of_data ?
yes: hash := $istate == 352441c2 // result
You can do that as many times as you want until you process all the data. But to be able to do that you need to have some kind of intermediate value that allows you to start where you stopped off. This intermediate state can be hash itself that is becoming seed in next iteration od some other "structure" that save place where you stopped off.
from wyhash.
Presumably that works now, with the seed also being the current state. As long as you use the same block size, you should get the same result each time, with the caveat that the result won't be the same as if you didn't buffer it - I assume this last part if your problem?
from wyhash.
See my first post. That's exactly what I did and it didn't work. If that doesn't work then what does? So there are only two options:
- it doesn't work like that
- there is an error in my program
If you see what's wrong then tell me.
I just checked it once again. Wyhash works correctly with all test vectors (from here).
wyhash("a", 1, 0x0000000000000000); // == 31db9c4e34072a5f; // that's right
wyhash("b", 1, 0x31db9c4e34072a5f); // == 8efeffb8472c70e0 // that's wrong; should be 8bb082c4c3cc236e
Also I made my test program more general and it can use any slice you want now. Presumably it stil doesn't work like that.
I'm actually surprised why didn't I take a look into <wyhash.h>. It would give me an answer quicker.
Wyhash works like that (pseudocode):
wyhash (key, len, seed)
{
seed = internal_wyhash(seed, data);
;
return _wymum(seed, len^_wyp5);
}
Just this one line: return _wymum(seed, len^_wyp5);
shows that seed
is not a hash although seed
is some internal state of wyhash.
My guess is if one reverses _wymum(seed, len^_wyp5)
and feeds internal seed to next iteration it might work. Might, because there is seed^=_wyp0;
for "unaligned bytes". (They are not literally unaligned as there is switch(len&31) {...}
which makes things mixed up)
I don't get exact mathematical side of this hash and I'm talking as regular programmer here, person who would like to use a library function with external interface, without going into details.
So if you tell me how to _un_wymum(seed, len^_wyp5);
I may try to do that but it may or may not work. Probably only @wangyi-fudan knows.
from wyhash.
I had read your comment, but did find it a little unclear.
My point is that you can technically today hash a file/stream piece by piece, but the result won't be the same as if you did all the data at once. As long as you use the same block size, the result will be repeatable (but not the same as the test vectors!).
To get the same result, you need to process each 32-byte block, and only then transform/finalise the final block.
If I remember, I'll take a stab at it with my c# fork later on π
from wyhash.
So, my approach works. I've created a gist to demonstrate.
The idea is as above - you hash block by block, only transforming/computing after the final block.
I wrote the code in around 3 minutes, so it's a mess and likely doesn't work for all input lengths - but it demonstrates the principal π Probably best to start at the test file to make sense of it.
I've also created an issue for this over on the repo of my C# fork, as this would be really useful.
from wyhash.
My point is that you can technically today hash a file/stream piece by piece, but the result won't be the same as if you did all the data at once.
If you don't get same result for same data then what is the point os calculating hash? You can equally well use random number with the same result.
Few things. I don't have nor am intending to have cs compiler. I don't know language anyways. So I cannot test your WyHash64Tests.cs
. As far as I understand your WyHash.cs
is "a bit" different than wyhash.h
. You have divided hashing into few steps and that's core of your approach. With wyhash.h
it's impossible. Simple as that. Unless I take it apart into pieces and reorganise the cogs to make them do what I want. I respect your skill, I'm not as good but still I'm not going to reinvent the wheel or a food processor for that matter by rewriting the whole algorithm from scratch. That's not what I meant.
The way it is now it's useful to hash strings, and some small data sets and that's all. The whole alleged speed it totally wasted as noone in no real circumstances will utilise it.
from wyhash.
If you don't get same result for same data then what is the point os calculating hash? You can equally well use random number with the same result
That's just silly. You will get the same result as long as you use the same block size. So if this is for internal use, it works. I'm just trying to help out if you have an immediate need.
Few things. I don't have nor am intending to have cs compiler. I don't know language anyways. So I cannot test your
WyHash64Tests.cs
. As far as I understand yourWyHash.cs
is "a bit" different thanwyhash.h
. You have divided hashing into few steps and that's core of your approach.
Yes, it's essentially the same as the original c version, just a different public API (yeah, in my gist it's terrible, but I'll iterate on that!). I added comments to the tests to try to explain, but as I mentioned, I didn't spend any time on it, so go easy π .
With
wyhash.h
it's impossible. Simple as that. Unless I take it apart into pieces and reorganise the cogs to make them do what I want. I respect your skill, I'm not as good but still I'm not going to reinvent the wheel or a food processor for that matter by rewriting the whole algorithm from scratch. That's not what I meant.
I'm no cryptographer, I just have a keen interest in the field; the approach I took is tried and tested, but I absolutely defer to ηδΈεθΎ (WangYi) with regards to any refactoring of the original c source to enable hashing in chunks!
The way it is now it's useful to hash strings, and some small data sets and that's all. The whole alleged speed it totally wasted as noone in no real circumstances will utilise it.
Ouch, that's a bit harsh, IMO π. Wyhash is a young hash, and particularly well suited to hashing small keys. But of course I do agree that the ability to hash streams is extremely useful.
from wyhash.
If you don't get same result for same data then what is the point of calculating hash? You can equally well use random number with the same result
That's just silly. You will get the same result as long as you use the same block size. So if this is for internal use, it works. I'm just trying to help out if you have an immediate need.
It's not. Till yesterday I didn't know there are hashes that cannot split the data. Silly me.
I'm no cryptographer, (...)
I'm not a cryptographer either. My math skills are good but not good enough to do that. My interest in this come from sheer interest in hashes as recently I tried to improve string comparison program and hashes proved to be the answer. Enough to say that strcmp(str1, str2);
did the job of (n^2-n)/2 iterations for n=216555 in ~25 min and int i1=wyhash(str1); (...); if (i1==i2){}
in ~1.5 min. It's 16 times faster.
The way it is now it's useful to hash strings, and some small data sets and that's all. The whole alleged speed it totally wasted as noone in no real circumstances will utilise it.
Ouch, that's a bit harsh, IMO οΏΌ. Wyhash is a young hash, and particularly well suited to hashing small keys. But of course I do agree that the ability to hash streams is extremely useful.
Sorry, I'm not good in beating around the bush sometimes.
I don't care if it's young or old as long as it does its job. And it does but inability to split the input or in general application and limitations could be mentioned in documentation. Not everyone is wanting or able to read and understand the source. But (almost) everyone should be able to use it as library function.
When it comes to speed i made small experiment:
put silesia.tar in ramdisk (tmpfs) and calculated few hashes: crc32-1 (Sarwate byte-by-byte), crc32-2 (slice-by-8-bytes), xxhash, wyhash (attached; it puts whole file into memory butter and then hashes it wyhash(buffer, sizeof(buffer), 0llu); [wyhash_file-1c.c.gz](https://github.com/wangyi-fudan/wyhash/files/3386849/wyhash_file-1c.c.gz)
).
the results I got are somewhat disappointing, especially after smahasher claims that wyhash is much faster than xxhash.
$ time crc322 silesia.tar
8c7ee36e
real 0m0.299s
# 676 [MiB/s]
$ time xxhsum silesia.tar
a604faf88ddecf47
real 0m0.134s
# 1508 [MiB/s]
$ time wyhash_file-1 silesia.tar
e6b859f26a2ab90b
real 0m0.242s
# 835 [MiB/s]
Feel free to test it and say if anything is wrong with that.
from wyhash.
Sorry to the delay as I am busy recently.
Thanks for your discussions.
I think wyhash is mainly designed for short string hashing. It is not the fastest one for hashing large files.
On my personal way, I use mmap to map large file to memory and do wyhash at once.
Thus I consider to keep the original interface to keep things simple :-)
from wyhash.
@wangyi-fudan personally, I think this is right for the "reference implementation", as long as the design is such that stream hashing is possible (which it certainly is for v1; I haven't looked at the latest version yet).
For real-world development, I'd prefer to use implementations with documentation and dev-friendly APIs - these exist already at least for C#, Rust and Go. I haven't written any real C or C++ for a decade, but I might even have a crack at one of these myself if I find time.
from wyhash.
Related Issues (20)
- v4 has even more bad seeds HOT 5
- Benchmark not measuring what you expect
- How do you use practrand? HOT 2
- License issue
- Using wyhash64 to mix numbers into prngs? HOT 5
- New release for wy_hash_final4?
- WyRand fails 64-bit 1-dimensional collision tests HOT 14
- Streaming hash HOT 1
- `make_secret` but for strings or other data
- Link to absl's wyhash implement seems to be changed. HOT 1
- WyRand64 (bit reversed) fails PractRand at 32TB HOT 3
- Question about wymum HOT 1
- Full round for every multiple 48-bytes, including last one HOT 1
- Secret seeds and primes HOT 4
- sprp and is_prime should be static inline?
- c-string optimized version?
- wy2u0k returns [1, k] instead of [0, k) when WYHASH_CONDOM=2
- wyhash election started HOT 1
- ζ―ε¦ε―θ½η»ε½ζ°ζ΄ε€ηη΅ HOT 2
- Is wy2u01 mistakenly not making use of all 53 bits?
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. πππ
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google β€οΈ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from wyhash.