
Comments (11)

bryc commented on September 28, 2024

I'm not passing bytes (i.e. a string) into the function, rather, I'm passing 32-bit integers into it. Hashing 300 bytes or 300 integers should have about the same performance, right? So that means I'm essentially passing "4 bytes at a time" into the function.

Okay, I think I understand now. I would bring up two concerns:

A) Changing a hash function to read 32 bits at a time instead of 8 bits could very likely compromise the hash quality in some ways.

As an example, here is what my (extremely dumb) hash test suite shows for a 32-bit FNV hash variant I made:

Unmodified:

[screenshot: test-suite results for the unmodified variant]
The results are fairly good; when they look like this, the function is probably not broken. My test suite is not as sensitive as SMHasher - it can only show the more obvious bias and collision issues.

Code:

Hash.FNV1a_BYC = function FNV1a_BYC(key) {
    var h = 2166136261 | 0;
    for(var i = 0; i < key.length; i++) {
        h = Math.imul(h ^ key[i], 16777619);
    }
    h += h << 17;
    h ^= h >>> 7;
    return h >>> 0;
}

And here is a modified variant which reads 4 bytes (32-bit integers) into the algorithm instead of 1 (8-bit integers):

Modified (32-bit input):

[screenshot: test-suite results for the modified (32-bit input) variant]

This causes a significant bias issue in the first test - basically it breaks the avalanche effect. There are also 24 collisions out of 17,116 hashes, which is also not good - it should be extremely unlikely to see any collision after 17,000 32-bit hashes; at most 1 collision is within error.

There's also some severe bias in the textual tests.

Functions designed to read four bytes at a time are usually tuned specifically for that (see MurmurHash3 or my WIP BlazeHashB). There's also the concept of integer hashing, which is intended to deal with lists of integers instead of arrays. Some reading material on that:

https://gist.github.com/badboy/6267743
https://burtleburtle.net/bob/hash/integer.html

And the code modifications for this version:

Hash.FNV1a_BYC4 = function FNV1a_BYC4(key) {
    var h = 2166136261 | 0;
    for(var k, i = 0, b = key.length & -4; i < b; i += 4) {
        // read 4 bytes as one little-endian 32-bit integer
        k = key[i+3] << 24 | key[i+2] << 16 | key[i+1] << 8 | key[i];
        h = Math.imul(h ^ k, 16777619);
    }
    k = 0;
    // handle remaining 1-3 bytes, if any (cases intentionally fall through)
    switch(key.length & 3) {
        case 3: k ^= key[i+2] << 16;
        case 2: k ^= key[i+1] << 8;
        case 1: k ^= key[i];
                h = Math.imul(h ^ k, 16777619);
    }
    h += h << 17;
    h ^= h >>> 7;
    return h >>> 0;
}
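As a taste of the integer-hashing material linked above, a standalone 32-bit mixer looks something like this (a sketch; the constants follow skeeto's well-known triple32 finalizer and are an illustrative choice, not the only good one):

```javascript
// Bijective 32-bit integer mixer (xorshift + multiply rounds).
function mix32(x) {
    x = x >>> 0;
    x ^= x >>> 17; x = Math.imul(x, 0xed5ad4bb) >>> 0;
    x ^= x >>> 11; x = Math.imul(x, 0xac4c1b51) >>> 0;
    x ^= x >>> 15; x = Math.imul(x, 0x31848bab) >>> 0;
    x ^= x >>> 14;
    return x >>> 0;
}

// Hashing a list of integers by folding each element into a running state.
function hashInts(ints) {
    var h = 0;
    for (var i = 0; i < ints.length; i++) {
        h = mix32(h ^ ints[i]);
    }
    return h;
}
```

Since every round is invertible, mix32 is a bijection on 32-bit integers - distinct inputs always produce distinct outputs, which is exactly what you want from an integer hash.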

B) If the input size does not divide evenly into 32-bit integers, does your `for...of` loop automatically deal with the remaining 24-bit, 16-bit or 8-bit value on the final iteration, or is it truncated? If it handles it automatically, that is convenient; the traditional method requires a switch statement to handle any trailing data after the last full 32-bit block.

This could very likely break the hash quality, resulting in more collisions or a compromised avalanche effect (i.e. patterns in the data could show through in the output).

from code.

bryc commented on September 28, 2024

Thanks for bringing this up; charCodeAt reads the numerical value of a character. More specifically, it returns the UTF-16 code unit at the given index - for characters outside the Basic Multilingual Plane, that means only the first (high) surrogate.

So, a character of value 256 will simply cause a single loop iteration using that literal value 0x100 in the calculation. The alternative way to handle this would be interpreting the value as two bytes 0x01 and 0x00, and process them individually in two subsequent iterations.

In most cases this is usable, as it's still deterministic and it won't "break" the calculation, but the two methods won't produce the same hash output when Unicode is used. Unfortunately, this also means that for characters that require multiple UTF-16 code units, such as emojis, only the first code unit is returned by charCodeAt - which, I'm only realizing now, is not exactly a good thing: it's only half the character.

So basically many emojis (and potentially Chinese characters, etc) can have the same value:

"πŸŒ€".charCodeAt(0); // returns 55356
"πŸŒ‚".charCodeAt(0); // returns 55356

So it's not really Unicode-compatible - it was originally intended to read one byte at a time (0-255), and using charCodeAt was a quick change from str[i] for convenient casual usage of string input.

A more rigorous solution would be to convert an input string to a byte array like so:

// Here, a 4-byte array is produced for an emoji, representing two UTF-16 code units
new TextEncoder().encode("πŸŒ€"); // returns Uint8Array([240, 159, 140, 128])

And simply use ch = str[i] instead of ch = str.charCodeAt(i) in the code.
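Putting the two together, a minimal sketch of a Unicode-safe string hash (reusing the FNV-style loop from my earlier comment; the function name is just for illustration):

```javascript
// Encode the string to UTF-8 bytes first, then hash one byte at a time.
function hashString(str) {
    var bytes = new TextEncoder().encode(str);
    var h = 2166136261 | 0;
    for (var i = 0; i < bytes.length; i++) {
        h = Math.imul(h ^ bytes[i], 16777619);
    }
    h += h << 17;
    h ^= h >>> 7;
    return h >>> 0;
}
```

Now "πŸŒ€" and "πŸŒ‚" hash differently, since their UTF-8 byte sequences differ even though their first UTF-16 code units are equal.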

Hope this helps.


Also keep in mind that cyrb53a is not final; it will likely change without notice once I finally make sense of the SMHasher test suite. It only scores slightly better in avalanche testing, which is not a practical difference - but any hash values it produces and stores now will not match those of an updated version.


Probably your handle backwards. But why not make it a fun acronym too? πŸ™Œ

I honestly never intended cyrb53 to be anything spectacular at first - just a minimal example of something far better than the accepted answer on StackOverflow, which is vulnerable to the same Unicode issues while also being very weak.

When I finalize everything, it will be called BlazeHash, of which I have a few WIPs already - but anything in the "experimental" folder is provided "as-is", so to speak.

Though the state of the original cyrb53 will be kept as it was on StackOverflow.

https://github.com/bryc/code/blob/master/jshash/experimental/blaze.js
https://github.com/bryc/code/blob/master/jshash/experimental/blaze128.js

For very large inputs, such as files, reading 4 bytes at a time (or more; 8-16 can be done reasonably well) is a huge performance boost. But it becomes harder to ensure the hash's quality due to fewer iterations. Doing one byte at a time allows rapid mixing of the input while keeping the code short and simple, so I will end up having a couple of lettered variants to make it a hash family.


mausworks commented on September 28, 2024

Thanks again for your input, always insightful!

I have now prepared a POC PR for the hash function (mausworks/tyin#11), it's essentially a "deep hash" function that can be used for objects, arrays, etc …

In the PR I have also included collision testing using fast-check (one of my favorite testing frameworks!). It's not exactly meant for this purpose, but it still works great for it. You mentioned earlier that you wanted to "make sense of the SMHasher test suite", so if you're looking for some inspiration, maybe you can find some there!

I'm completely unfamiliar with Iterables, so I'm afraid I may not be much help there.

Iterables are just a way to iterate over something without having to allocate memory for an array (or other data structure). The caveat is that the length can't be known beforehand; you just have to "blindly" iterate.

But I see now that MurmurHash doesn't need to know the length of the input beforehand - my mistake! The length is only needed after the first iteration, so I could essentially "count up" the length during the first iteration.

Using WASM is a compelling idea; however, one of the key metrics for Tyin is bundle size, and these implementations seem to ship a 3K binary (gzipped), which is significantly larger than the entirety of Tyin itself. That's why I like cyrb53/a: it's quite compact (without being silly!). MurmurHash, while probably faster, also produces more code. But I will do some rigorous collision testing (and perf testing to boot) before shipping anything. Let me know if you want me to report back the results! It's tough to speculate on what the actual impact will be beforehand.

[...] If performance is important, and your input data is >300 bytes [...] on average, you may benefit from considering an algorithm that can read 16 bytes per iteration.

Is this "300 iterations"? Can I effectively make that "1200 bytes" now if I use 32-bit integers instead? That's quite a large object! 1.2K JSON is a lot of data, and 1.2K of "lazily serialized, packed, 32-bit integers" is a lot more.

I will close this issue now since you have more than answered the initial questionβ€”thanks again for everything!


bryc commented on September 28, 2024

I'm completely unfamiliar with Iterables, so I'm afraid I may not be much help there. I'm not sure what the best approach would be, but if performance is important, and your input data is >300 bytes (the bigger the input, the better it will do) on average, you may benefit from considering an algorithm that can read 16 bytes per iteration.

The best of such a function is probably this: https://github.com/bryc/code/blob/master/jshash/hashes/murmurhash3_128.js

Its speed should be quite good, and it outputs a 128-bit hash (4 x 32-bit hashes). It reads 16 bytes per loop iteration, and it's a fairly respected algorithm as well. It would also be straightforward to modify the return statement to output a 53-bit Number like cyrb53.

But it does expect Array or Uint8Array as input, so you'd have to adapt it to achieve what you want.
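For reference, that return-statement change would look like cyrb53's own final fold: take two of the 32-bit lanes and combine them into one 53-bit Number (a sketch; h1/h2 stand for any two of the four 32-bit outputs):

```javascript
// Fold two 32-bit values into a single Number that fits in 53 bits
// (the integer-safe range of a JS double): 21 bits from h2, 32 from h1.
function to53bit(h1, h2) {
    return 4294967296 * (2097151 & h2) + (h1 >>> 0);
}
```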

Alternatively, you may consider WASM-based hashes, which should greatly accelerate the calculations. This looks decent:

https://github.com/jungomi/xxhash-wasm

It supports 32-bit as well as 64-bit hashes. If you're only comparing two hashes to find matches, and not storing millions of them, you may not need more than 32 bits.

In that case, you may also consider a vanilla implementation of MurmurHash1, which I believe is very fast and hard to beat speed-wise:

https://github.com/bryc/code/blob/master/jshash/hashes/murmurhash1.js

It reads 4 bytes at a time and is basically barebones in the number of operations used. The quality of the hash is at least acceptable, I'd say.

Or a WASM version of MurmurHash3:

https://github.com/jonahsnider/murmurhash-wasm

If you are storing a very large amount of 32-bit hashes, you may need to consider collision handling techniques often employed in hash tables.


bryc commented on September 28, 2024

Is this "300 iterations"? Can I effectively make that "1200 bytes" now if I use 32-bit integers instead? That's quite a large object! 1.2K JSON is a lot of data, and 1.2K of "lazily serialized, packed, 32-bit integers" is a lot more.

I'm not entirely sure what you're asking 😦. What I meant was: reading large chunks of input is generally faster than reading one byte at a time. It's a sort of I/O bottleneck - the actual math operations end up being faster. This is more true in languages like C/C++ or Rust, but my old benchmarks definitely seemed to suggest that JS had similar results (it's been a while!).

But the benefits only really begin when the inputs are somewhat large. Come to think of it, 300 bytes is probably too small to notice much of a difference; only at around 64K-1M bytes would you really start to see performance improvements.

Assuming you're sending relatively small data into cyrb53, let's say <= 2000 bytes, it's probably not a concern IMO.

Do keep me posted on how things go! If it ends up practical, maybe others would find it useful. I've seen people talk about object-hash (https://www.npmjs.com/package/object-hash), but I have yet to actually need to hash objects directly myself.


bryc commented on September 28, 2024

Okay, that's good then!

BlazeHashB is not yet final in its current state and might be slightly worse than cyrb53 in terms of randomness quality/avalanche effect (it still needs tweaks and deep testing), but I imagine 53 bits are enough to ward off collisions at least.

Pic of what I mean (1 out of 32 bits in test 6 has a consistent bias above the threshold in the h1 output):
[screenshot: avalanche bias in the h1 output]


mausworks commented on September 28, 2024

Thanks for your wonderful reply!

I'm planning to use cyrb53a as a state hashing solution in Tyin, shipping it as an alternative to JSON.stringify.

One of the reasons that it fits my use case is because the algorithm doesn't need to know the length of the input to produce the hash. This means that I can modify the code slightly to accept an Iterable<number> instead of a Uint8Array, and effectively "stream numbers" into the hash function with a generator. This generator serializes the input into a number sequence, which is only meant to be consumed by the cyrb53a function. So, if it's safe to pack 4 bytes into a number, I figured I might as well do that to save a few hash iterations, WDYT?
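To sketch what I mean (hypothetical code, not Tyin's actual implementation; the serializer here is a toy that only handles flat objects):

```javascript
// An FNV-1a-style loop rewritten to consume any Iterable<number>.
function hashIterable(numbers) {
    var h = 2166136261 | 0;
    for (var n of numbers) {
        h = Math.imul(h ^ n, 16777619);
    }
    h += h << 17;
    h ^= h >>> 7;
    return h >>> 0;
}

// Toy generator that lazily "streams" numbers out of an object,
// so no intermediate string or array is allocated.
function* serialize(obj) {
    for (var key of Object.keys(obj)) {
        for (var i = 0; i < key.length; i++) yield key.charCodeAt(i);
        var v = String(obj[key]);
        for (var j = 0; j < v.length; j++) yield v.charCodeAt(j);
    }
}

// Usage: hashIterable(serialize({ apples: "yum" }))
```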

Performance is paramount, since the function is competing against JSON.stringify (i.e. native code) followed by a string comparison. So to motivate its existence, it needs to perform better: either in throughput, memory allocation, or ideally a combination of the two. It also doesn't matter if the output of the function changes between versions, because this function is only meant to be used for internal state comparisons. I may even swap it out for a different function altogether in the future.

If you have a function in mind that fits my use case, I'd be happy to hear it!


mausworks commented on September 28, 2024

I'm not entirely sure what you're asking 😦.

My apologies, I may be explaining myself poorly. πŸ™

I'm not passing bytes (i.e. a string) into the function, rather, I'm passing 32-bit integers into it. Hashing 300 bytes or 300 integers should have about the same performance, right? So that means I'm essentially passing "4 bytes at a time" into the function.

If the difference is negligible up until a few MBs of data, then this is not a concern at all.

Do keep me posted on how things go! If it ends up practical maybe others would find it useful.

I will, for sure, report back my findings on the function!

I had a look at object-hash (I have even used it before). Looking at the source code, it seems to build up a sort of pseudo-JSON string, which it then hashes using a common hashing function (SHA or MD5). I want to try something different!

I want Tyin to be able to produce a single 53-bit hash (a number) for a given state (object) as efficiently as possible. For the purpose of knowing "whether the state changed", 53 bits are more than enough. I just need enough entropy to ensure that there aren't any collisions between one state update and the next (hyperbolic, but you get the point).

I also don't want to pay the cost of string allocations if I don't have to (see object-hash), which is why I'm serializing directly into 32-bit integers. I can just yield those into a for loop and produce a hash without allocating any memory (or at least not very much of it).

I'll make sure to include a few benchmarks once done. But most of them will be apples-to-oranges comparisons, since most object hashing libraries seem to produce string hashes with much higher entropy.


mausworks commented on September 28, 2024

A) Changing a hash function to read 32-bits instead of 8-bits could very likely compromise the hash quality in some ways.

Thus the original question! So thanks a lot for these screenshots and explanations of how the quality can be worsened. They really helped me visualize the problem space! πŸ‘

Functions designed to read four bytes at a time are usually tuned specifically for that (see MurmurHash3 or my WIP BlazeHashB). There's also the concept of integer hashing, which is intended to deal with lists of integers instead of arrays. Some reading material on that:

https://gist.github.com/badboy/6267743
https://burtleburtle.net/bob/hash/integer.html

This is great material, thanks!

If the input size does not divide evenly into 32-bit integers, does your for of loop automatically deal with the remaining 24-bit, 16-bit or 8-bit integer on the final iteration? Or is it truncated? If so, that is convenient; the traditional method requires a switch statement to handle any trailing data after each 32-bit block.

I'm not sure I understand the question. My function only accepts 32-bit integers as input; I'm packing as many characters of a string into a 32-bit integer as possible and just leaving the remaining bits at 0 (for example). Are you saying those 0-bits have to be truncated before hashing?
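Concretely, the packing I have in mind looks roughly like this (an illustrative sketch; charCodeAt past the end of the string returns NaN, which the bitwise ops coerce to 0, giving the zero padding):

```javascript
// Pack four 8-bit character codes per 32-bit integer, little-endian,
// zero-padding the final integer when length % 4 !== 0.
function packString(str) {
    var out = [];
    for (var i = 0; i < str.length; i += 4) {
        out.push((
            ((str.charCodeAt(i)     & 0xff)      ) |
            ((str.charCodeAt(i + 1) & 0xff) << 8 ) |
            ((str.charCodeAt(i + 2) & 0xff) << 16) |
            ((str.charCodeAt(i + 3) & 0xff) << 24)
        ) >>> 0);
    }
    return out;
}

// packString("{app") β†’ [0x7070617b]
```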

This very likely could break the hash quality, resulting in more collisions or compromise the avalanche effect (i.e. it could make patterns in the data).

I'll report back on that once I sit down and do some experimentation again. Thanks again for all the material!


bryc commented on September 28, 2024

I'm not sure I understand the question. My function only accepts 32-bit integers as input, and I'm packing as many characters of a string into a 32-bit integer as possible, and just leaving the remaining bits to 0 (for example). Are you saying those 0-bits have to be truncated before hashing?

Put it like this: let's say you are trying to hash an object like {apples:"yum"}.

If it gets split into 32-bit integers, we could imagine the segments like so:

{app  = 4 chars (32 bits)
les:  = 4 chars (32 bits)
"yum  = 4 chars (32 bits)
"}    = 2 chars (16 bits)

What happens to the final segment of input, which naturally might contain less than 32 bits of data?

Is it ignored (bad), or is it simply the final 32-bit integer in the sequence, which has a reduced max value?


mausworks commented on September 28, 2024

What happens to the final segment of input, which naturally might contain less than 32 bits of data?

Is it ignored (bad), or is it simply the final 32-bit integer in the sequence, which has a reduced max value?

Basically the latter. Using your example, it would essentially get encoded like this:

{app    = 4 chars (32 bits) [7B 61 70 70]
les:    = 4 chars (32 bits) [6C 65 73 3A]
"yum    = 4 chars (32 bits) [22 79 75 6D]
"}\0\0  = 4 chars (32 bits) [22 7D 00 00]

The last two bytes get treated like null characters. I guess this is less than ideal, since a string could technically contain two actual null characters, and you'd have a collision in that case. This is probably unusual in JS, since null termination is uncommon and archaic, but it's something I should probably cover anyway.

I realize that I need to be considerate with how I serialize the data, and not just how the hashing function works.

So I guess this is saying that I definitely should be truncating my data, or just pass bytes into the function, as god intended. But I now have a deeper understanding of these functions (many thanks to you), so I still want to give integer hashing ("4 bytes at a time") a go, since it will dramatically speed up the function. And in that case, I think BlazeHash will be better, since it already operates on 4 bytes at a time.
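For instance, one way to cover the embedded-null ambiguity might be to mix the original (pre-padding) length into the state before finalizing, so a string ending in real null characters no longer hashes like one that was merely padded (a sketch on top of the FNV-style loop from earlier in the thread; the function name is mine):

```javascript
function hashPacked(ints, originalLength) {
    var h = 2166136261 | 0;
    for (var i = 0; i < ints.length; i++) {
        h = Math.imul(h ^ ints[i], 16777619);
    }
    // One extra round with the pre-padding length disambiguates
    // inputs that differ only by trailing zero bytes.
    h = Math.imul(h ^ originalLength, 16777619);
    h += h << 17;
    h ^= h >>> 7;
    return h >>> 0;
}
```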

Thanks for all your help again, bryc. You are a great mentor, and I salute you. πŸ––

