Comments (11)
I'm not passing bytes (i.e. a string) into the function; rather, I'm passing 32-bit integers into it. Hashing 300 bytes or 300 integers should have about the same performance, right? So that means I'm essentially passing "4 bytes at a time" into the function.
Okay, I think I understand now. I would bring up two concerns:
A) Changing a hash function to read 32-bits instead of 8-bits could very likely compromise the hash quality in some ways.
As an example, here is what my (extremely dumb) hash test suite shows for a 32-bit FNV hash variant I made:
Unmodified:
The results are fairly good; when it looks like this, it's probably not broken. My test suite is not as sensitive as SMHasher - it can only show the more obvious bias and collision issues.
Code:
Hash.FNV1a_BYC = function FNV1a_BYC(key) {
  var h = 2166136261 | 0;
  for (var i = 0; i < key.length; i++) {
    h = Math.imul(h ^ key[i], 16777619);
  }
  h += h << 17;
  h ^= h >>> 7;
  return h >>> 0;
}
And here is a modified variant which reads 4 bytes (32-bit integers) into the algorithm instead of 1 (8-bit integers):
Modified (32-bit input):
This causes a significant bias issue in the first test - basically it breaks the avalanche effect. There are also 24 collisions out of 17,116 hashes, which is not good: for a 32-bit hash it should be extremely unlikely to see any collision after ~17,000 generated hashes. At most 1 collision is within error.
There's also some severe bias in the textual tests.
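To put that collision count in perspective, here is a quick birthday-bound estimate (a sketch I'm adding for illustration; the 17,116 figure comes from the test above) of how many collisions an ideal 32-bit hash would produce:

```javascript
// Expected number of colliding pairs among n uniformly random 32-bit hashes,
// by the birthday approximation: C(n, 2) / 2^32.
const n = 17116;
const expectedCollisions = (n * (n - 1) / 2) / 2 ** 32;
console.log(expectedCollisions.toFixed(3)); // ~0.034, so 24 observed collisions is a red flag
```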
Functions designed to read four bytes at a time are usually tuned specifically for that (see MurmurHash3 or my WIP BlazeHashB). There's also the concept of integer hashing, which is intended to deal with lists of integers instead of arrays. Some reading material on that:
https://gist.github.com/badboy/6267743
https://burtleburtle.net/bob/hash/integer.html
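As a concrete taste of integer hashing, here is one of the classic 32-bit integer mixers covered in that material (a Thomas Wang-style hash, as mirrored in the linked gist), transcribed to JS as a sketch:

```javascript
// Thomas Wang-style 32-bit integer mixer: a fixed sequence of shifts, adds
// and one multiply, so that every input bit influences the output.
function hashInt32(x) {
  x = ((x ^ 61) ^ (x >>> 16)) | 0;
  x = (x + (x << 3)) | 0;
  x = x ^ (x >>> 4);
  x = Math.imul(x, 0x27d4eb2d); // keep the multiply in 32-bit space
  x = x ^ (x >>> 15);
  return x >>> 0;
}
```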
And the code modifications for this version:
Hash.FNV1a_BYC4 = function FNV1a_BYC4(key) {
  var h = 2166136261 | 0;
  for (var k, i = 0, b = key.length & -4; i < b; i += 4) {
    k = key[i+3] << 24 | key[i+2] << 16 | key[i+1] << 8 | key[i];
    h = Math.imul(h ^ k, 16777619);
  }
  k = 0;
  // handle remaining 1-3 bytes, if any (cases intentionally fall through)
  switch (key.length & 3) {
    case 3: k ^= key[i+2] << 16;
    case 2: k ^= key[i+1] << 8;
    case 1: k ^= key[i];
            h = Math.imul(h ^ k, 16777619);
  }
  h += h << 17;
  h ^= h >>> 7;
  return h >>> 0;
}
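To make point (A) tangible, here is a self-contained sketch (standalone renamed copies of the two variants above) showing that the byte-wise and 32-bit-input versions yield different hashes for the same byte array, so they are not interchangeable:

```javascript
// Byte-wise FNV-1a variant: one multiply per input byte.
function fnv1a_byc(key) {
  var h = 2166136261 | 0;
  for (var i = 0; i < key.length; i++) h = Math.imul(h ^ key[i], 16777619);
  h += h << 17; h ^= h >>> 7;
  return h >>> 0;
}

// 32-bit-input variant: one multiply per 4-byte block, plus a tail step.
function fnv1a_byc4(key) {
  var h = 2166136261 | 0, k, i = 0;
  for (var b = key.length & -4; i < b; i += 4) {
    k = key[i+3] << 24 | key[i+2] << 16 | key[i+1] << 8 | key[i];
    h = Math.imul(h ^ k, 16777619);
  }
  k = 0;
  switch (key.length & 3) { // trailing 1-3 bytes fall through
    case 3: k ^= key[i+2] << 16;
    case 2: k ^= key[i+1] << 8;
    case 1: k ^= key[i];
            h = Math.imul(h ^ k, 16777619);
  }
  h += h << 17; h ^= h >>> 7;
  return h >>> 0;
}

const bytes = [1, 2, 3, 4, 5];
console.log(fnv1a_byc(bytes) !== fnv1a_byc4(bytes)); // almost certainly true: different mixing schedules
```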
B) If the input size does not divide evenly into 32-bit integers, does your for...of loop automatically deal with the remaining 24-bit, 16-bit or 8-bit integer on the final iteration, or is it truncated? If it's handled automatically, that is convenient; the traditional method requires a switch statement to handle any trailing data after the last full 32-bit block.
Truncation could very likely break the hash quality, resulting in more collisions, or compromise the avalanche effect (i.e. it could leave patterns in the data).
Thanks for bringing this up; charCodeAt reads the numerical value of a character. More specifically, it reads a single UTF-16 code unit. So a character of value 256 will simply cause a single loop iteration using that literal value, 0x100, in the calculation. The alternative way to handle this would be interpreting the value as two bytes, 0x01 and 0x00, and processing them individually in two subsequent iterations.
In most cases this is usable, as it's still deterministic and it won't "break" the calculation, but the two methods won't result in the same hash output when Unicode is used. Unfortunately, this also means that for characters that require multiple UTF-16 code units, such as emojis, reading a single index with charCodeAt returns just the most-significant code unit (the high surrogate), which I am only realizing now is not exactly a good thing - it's only half the character.
So basically, many emojis (and potentially rare CJK characters, etc.) can appear to have the same value:
"🌀".charCodeAt(0); // returns 55356
"🌈".charCodeAt(0); // returns 55356
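As an aside I'll add here (not part of the original snippet): codePointAt reads the full code point across a surrogate pair, so distinct emojis become distinguishable again:

```javascript
// codePointAt decodes the whole surrogate pair into one code point,
// while charCodeAt sees only the shared high surrogate.
console.log("🌀".codePointAt(0)); // 127744 (U+1F300)
console.log("🌈".codePointAt(0)); // 127752 (U+1F308)
console.log("🌀".charCodeAt(0) === "🌈".charCodeAt(0)); // true - both are 55356
```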
So it's not really Unicode-compatible - it was originally intended to read one byte at a time (0-255), and using charCodeAt
was a quick change from str[i]
, for convenient casual usage of string input.
A more rigorous solution would be to convert an input string to a byte array like so:
// Here, a 4-byte array is produced for an emoji, representing two UTF-16 code units
new TextEncoder().encode("🌀"); // returns Uint8Array([240, 159, 140, 128])
And simply use ch = str[i] instead of ch = str.charCodeAt(i) in the code (where str is now the encoded Uint8Array).
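Putting those two pieces together, here's a sketch of what that byte-array route could look like end to end (hashString is an illustrative name I'm introducing, not something from the repo):

```javascript
// Hash a string by its UTF-8 bytes, so surrogate pairs contribute fully,
// using the same byte-wise FNV-1a mixing as above.
function hashString(str) {
  const bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
  let h = 2166136261 | 0;
  for (let i = 0; i < bytes.length; i++) {
    h = Math.imul(h ^ bytes[i], 16777619);
  }
  h += h << 17;
  h ^= h >>> 7;
  return h >>> 0;
}

hashString("🌀"); // distinct from hashString("🌈"), unlike a charCodeAt(0) comparison
```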
Hope this helps.
Also keep in mind that cyrb53a is not final; it will likely change without notice when I finally make sense of the SMHasher test suite. It just scores slightly better in avalanche testing, but it's really not a practical difference. So any hash values it produces and stores will not be the same if it's updated.
Probably your handle backwards. But why not make it a fun acronym too?
I honestly never intended cyrb53 to be anything spectacular at first - just a minimal example of something far better than the accepted answer on StackOverflow, which is vulnerable to the same Unicode issues while also being very weak.
When I finalize everything, it will be called BlazeHash; I have a few WIPs of it already, but anything in the "experimental" folder is provided "as-is", so to speak.
Though the state of the original cyrb53 will be kept as it was on StackOverflow.
https://github.com/bryc/code/blob/master/jshash/experimental/blaze.js
https://github.com/bryc/code/blob/master/jshash/experimental/blaze128.js
For very large inputs, such as files, reading 4 bytes at a time (or more; 8-16 can be done reasonably well) is a huge performance boost. But it becomes harder to ensure the hash's quality due to fewer iterations. Doing one byte at a time has the benefit of allowing rapid mixing of the input while also keeping the code short and simple, so I will end up having a couple of lettered variants to make it a hash family.
Thanks again for your input, always insightful!
I have now prepared a POC PR for the hash function (mausworks/tyin#11); it's essentially a "deep hash" function that can be used for objects, arrays, etc.
In the PR I have also included collision testing using fast-check (one of my favorite testing frameworks!). It's not exactly meant for this purpose, but it still works great for it. You mentioned earlier that you wanted to "make sense of the SMHasher test suite", so if you're looking for some inspiration, maybe you can find some there!
I'm completely unfamiliar with Iterables, so I'm afraid I may not be much help there.
Iterables are just a way to iterate over something without having to allocate memory for an array (or other data structure). The caveat is that the length can't be known beforehand; you just have to "blindly" iterate it.
But I see now that MurmurHash doesn't need to know the length of the input beforehand - my mistake! The length is only needed after the first iteration, so I could essentially "count up" the length during the first iteration.
Using WASM is a compelling idea; however, one of the key metrics for Tyin is bundle size, and these implementations seem to ship a 3K binary (gzipped), which is significantly larger than the entirety of Tyin itself. That's why I like cyrb53/a: it's quite compact (without being silly!). MurmurHash, while probably faster, also produces more code. But I will do some rigorous collision testing (and perf testing to boot) before shipping anything. Let me know if you want me to report back the results! It's tough to speculate on the actual impact beforehand.
[...] If performance is important, and your input data is >300 bytes [...] on average, you may benefit from considering an algorithm that can read 16 bytes per iteration.
Is this "300 iterations"? Can I effectively make that "1200 bytes" now if I use 32-bit integers instead? That's quite a large object! 1.2K JSON is a lot of data, and 1.2K of "lazily serialized, packed, 32-bit integers" is a lot more.
I will close this issue now since you have more than answered the initial question. Thanks again for everything!
I'm completely unfamiliar with Iterables, so I'm afraid I may not be much help there. I'm not sure what the best approach would be, but if performance is important, and your input data is >300 bytes (the bigger the input, the better it will do) on average, you may benefit from considering an algorithm that can read 16 bytes per iteration.
The best such function is probably this: https://github.com/bryc/code/blob/master/jshash/hashes/murmurhash3_128.js
Its speed should be quite good, and it outputs a 128-bit hash (4 x 32-bit hashes). It reads 16 bytes per loop iteration, and it's a fairly respected algorithm as well. It would also be straightforward to modify the return statement to output a 53-bit Number like cyrb53.
But it does expect an Array or Uint8Array as input, so you'd have to adapt it to achieve what you want.
Alternatively, you may consider WASM-based hashes, which should highly accelerate the calculations. This looks decent:
https://github.com/jungomi/xxhash-wasm
It supports 32-bit hashes as well as 64-bit hashes. If you're only comparing two hashes to find matches, and not storing millions of hashes, you may not need more than 32 bits.
In which case, you may also consider a vanilla implementation of MurmurHash1, which I believe is very fast and hard to beat speed-wise:
https://github.com/bryc/code/blob/master/jshash/hashes/murmurhash1.js
It reads 4 bytes at a time, and is basically barebones in the amount of operations used. Quality of the hash is at least acceptable, I'd say.
Or a WASM version of MurmurHash3:
https://github.com/jonahsnider/murmurhash-wasm
If you are storing a very large amount of 32-bit hashes, you may need to consider collision handling techniques often employed in hash tables.
Is this "300 iterations"? Can I effectively make that "1200 bytes" now if I use 32-bit integers instead? That's quite a large object! 1.2K JSON is a lot of data, and 1.2K of "lazily serialized, packed, 32-bit integers" is a lot more.
I'm not entirely sure what you're asking. What I meant was that reading large chunks of input is generally faster than reading one byte at a time - a sort of I/O bottleneck, where the actual math operations end up being faster than the reads. This is more true in languages like C/C++ or Rust, but my old benchmarks definitely seemed to suggest that JS had similar results (it's been a while!).
But the benefits only really begin when the inputs are somewhat large. Come to think of it, 300 bytes is probably too small to notice much of a difference; only at around 64K-1M bytes would you really start to see performance improvements.
Assuming you're sending relatively small data into cyrb53, let's say <= 2000 bytes of data, it's probably not a concern IMO.
Do keep me posted on how things go! If it ends up practical maybe others would find it useful. I've seen people talk about object-hash (https://www.npmjs.com/package/object-hash), but have yet to actually need to hash objects directly myself.
Okay, that's good then!
BlazeHashB is not yet final in its current state and might be slightly worse than cyrb53 in terms of randomness quality/avalanche effect (it still needs tweaks and deep testing), but I imagine 53 bits are enough to ward off collisions at least.
Pic of what I mean (1 out of 32 bits in test 6 has a consistent bias above threshold in the h1 output):
Thanks for your wonderful reply!
I'm planning to use cyrb53a as a state hashing solution in Tyin, shipping it as an alternative to JSON.stringify.
One of the reasons it fits my use case is that the algorithm doesn't need to know the length of the input to produce the hash. This means I can modify the code slightly to accept an Iterable<number> instead of a Uint8Array, and effectively "stream numbers" into the hash function with a generator. This generator serializes the input into a number sequence, which is only meant to be consumed by the cyrb53a function. So, if it's safe to pack 4 bytes into a number, I figured I might as well do that to save a few hash iterations - WDYT?
Performance is paramount, since the function is competing against JSON.stringify (i.e. native code) followed by a string comparison. So to motivate its existence it needs to perform better, whether in throughput, memory allocation, or ideally a combination of the two. It also doesn't matter if the output of the function changes between versions, because this function is only meant to be used for internal state comparisons. I may even swap it out for a different function altogether in the future.
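A rough harness for that comparison might look like the following (a sketch with a made-up workload; bench is an illustrative helper, and a candidate hash function would be timed the same way alongside the baseline):

```javascript
// Micro-benchmark sketch: time the JSON.stringify + string-compare baseline.
// Numbers will vary wildly by engine and state shape; treat as relative only.
function bench(label, fn, iters = 10000) {
  const t0 = performance.now();
  for (let i = 0; i < iters; i++) fn();
  return `${label}: ${(performance.now() - t0).toFixed(1)} ms`;
}

// Hypothetical state object standing in for a Tyin store.
const state = { items: Array.from({ length: 100 }, (_, i) => ({ id: i, done: false })) };
const prev = JSON.stringify(state);
console.log(bench("stringify+compare", () => JSON.stringify(state) === prev));
```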
If you have a function in mind that fits my use case, I'd be happy to hear it!
I'm not entirely sure what you're asking.
My apologies, I may be explaining myself poorly.
I'm not passing bytes (i.e. a string) into the function; rather, I'm passing 32-bit integers into it. Hashing 300 bytes or 300 integers should have about the same performance, right? So that means I'm essentially passing "4 bytes at a time" into the function.
If the difference is negligible up until a few MBs of data, then this is not a concern at all.
Do keep me posted on how things go! If it ends up practical maybe others would find it useful.
I will, for sure, report back my findings on the function!
I had a look at object-hash (I have even used it before). Looking at the source code, it seems to build up a sort of pseudo-JSON string, which it then hashes with a common hashing function (SHA or MD5). I want to try something different!
I want Tyin to be able to produce a single 53-bit hash (number) for a given state (object) as efficiently as possible. For the purpose of knowing "whether the state changed", 53 bits are more than enough. I just need enough entropy to ensure that there aren't any collisions between one state update to the next (hyperbolic, but you get the point).
I also don't want to pay the cost of string allocations if I don't have to (see object-hash), which is why I'm serializing directly into 32-bit integers. I can just yield those into a for loop and produce a hash without allocating any memory (or at least not very much of it).
I'll make sure to include a few benchmarks once done. But most of them will be apples-to-oranges comparisons, since most object hashing libraries seem to produce string hashes with much higher entropy.
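For what it's worth, here's roughly how I picture that idea (a hypothetical sketch; packUint32 and hashIterable are illustrative names, not the real Tyin code):

```javascript
// Lazily yield packed little-endian 32-bit words from a string's UTF-8 bytes;
// trailing bytes of a partial word are zero-padded.
function* packUint32(str) {
  const bytes = new TextEncoder().encode(str);
  for (let i = 0; i < bytes.length; i += 4) {
    yield (bytes[i] |
      (bytes[i + 1] | 0) << 8 |
      (bytes[i + 2] | 0) << 16 |
      (bytes[i + 3] | 0) << 24) >>> 0;
  }
}

// Fold any iterable of 32-bit integers into an FNV-style hash,
// without ever materializing the full sequence in memory.
function hashIterable(nums) {
  let h = 2166136261 | 0;
  for (const n of nums) {
    h = Math.imul(h ^ n, 16777619);
  }
  h += h << 17;
  h ^= h >>> 7;
  return h >>> 0;
}

hashIterable(packUint32('{"a":1}')); // one 32-bit number, no intermediate array
```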
A) Changing a hash function to read 32-bits instead of 8-bits could very likely compromise the hash quality in some ways.
Thus, the original question! So thanks a lot for these screenshots and explanations of how the quality can be worsened. These really helped me visualize the problem space!
Functions designed to read four bytes at a time are usually tuned specifically for that (see MurmurHash3 or my WIP BlazeHashB). There's also the concept of integer hashing, which is intended to deal with lists of integers instead of arrays. Some reading material on that:
https://gist.github.com/badboy/6267743
https://burtleburtle.net/bob/hash/integer.html
This is great material, thanks!
If the input size does not divide evenly into 32-bit integers, does your for...of loop automatically deal with the remaining 24-bit, 16-bit or 8-bit integer on the final iteration, or is it truncated? If it's handled automatically, that is convenient; the traditional method requires a switch statement to handle any trailing data after the last full 32-bit block.
I'm not sure I understand the question. My function only accepts 32-bit integers as input; I'm packing as many characters of a string into a 32-bit integer as possible, and just leaving the remaining bits as 0 (for example). Are you saying those 0-bits have to be truncated before hashing?
This very likely could break the hash quality, resulting in more collisions or compromise the avalanche effect (i.e. it could make patterns in the data).
I'll report back on that once I sit down and do some experimentation again. Thanks again for all the material!
I'm not sure I understand the question. My function only accepts 32-bit integers as input, and I'm packing as many characters of a string into a 32-bit integer as possible, and just leaving the remaining bits to 0 (for example). Are you saying those 0-bits have to be truncated before hashing?
Put it like this: let's say you are trying to hash an object like {apples:"yum"}.
If it gets split into 32-bit integers, we could imagine the segments like so:
{app = 4 chars (32 bits)
les: = 4 chars (32 bits)
"yum = 4 chars (32 bits)
"} = 2 chars (16 bits)
What happens to the final segment of input, which naturally might contain less than 32 bits of data?
Is it ignored (bad), or is it simply the final 32-bit integer in the sequence, which has a reduced max value?
What happens to the final segment of input, which naturally might contain less than 32 bits of data?
Is it ignored (bad), or is it simply the final 32-bit integer in the sequence, which has a reduced max value?
Basically the latter. Using your example, it would essentially get encoded like this:
{app = 4 chars (32 bits) [7B 61 70 70]
les: = 4 chars (32 bits) [6C 65 73 3A]
"yum = 4 chars (32 bits) [22 79 75 6D]
"}\0\0 = 4 chars (32 bits) [22 7D 00 00]
The last two bytes get treated like null characters. I guess this is less than ideal, since a string could technically contain two actual null characters, and you'd have a collision in that case. This is probably unusual in JS, since null termination is uncommon and archaic, but it is something I should probably cover anyway.
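That collision case can be shown concretely (pack4 is a throwaway helper for illustration, packing up to 4 char codes little-endian with zero padding, like the tail handling above):

```javascript
// Zero-padding makes a short tail indistinguishable from explicit NULs:
// '"}' and '"}\0\0' pack to the same 32-bit word.
function pack4(str) {
  let k = 0;
  for (let i = 0; i < 4 && i < str.length; i++) {
    k |= str.charCodeAt(i) << (i * 8);
  }
  return k >>> 0;
}

pack4('"}');     // 32034 (0x00007D22)
pack4('"}\0\0'); // 32034 - same word, hence the same hash downstream
```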
I realize that I need to be considerate of how I serialize the data, and not just how the hashing function works.
So I guess this is saying that I definitely should be truncating my data, or just pass bytes into the function, as god intended. But I now have a deeper understanding of these functions (many thanks to you), so I still want to give integer hashing ("4 bytes at a time") a go, since it will dramatically speed up the function. In that case, though, I think BlazeHash will be better, since it already operates on 4 bytes at a time.
Thanks for all your help again, bryc. You are a great mentor, and I salute you.