Comments (13)

manugoyal commented on July 17, 2024

Hi,

So the reason this is happening is that by design, each hash table bucket is a fixed size. So if all elements are hashed to the same bucket (actually 2 buckets, each of size 4 by default), then after 8 insertions, the cuckoo hashing algorithm won't be able to move elements around to different buckets, because they all hash to the same 2 buckets.

If the hash table can't shuffle items around to make more space in the fixed-size buckets, it will expand the table by increasing the number of buckets (not the size of each bucket, though). This obviously won't solve the problem because all the elements will still hash to the same 2 buckets, so the table will get stuck in this loop and keep expanding forever.
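
To make the failure mode concrete, here is a minimal sketch of the degenerate case (the header path and the unqualified cuckoohash_map name reflect the layout at the time of this discussion; newer releases may place the class in a namespace, so adjust accordingly):

    #include <cstddef>
    #include <string>
    #include <libcuckoo/cuckoohash_map.hh>

    // Deliberately broken hasher: every key hashes to the same value, so every
    // key competes for the same two candidate buckets (2 buckets x 4 slots = 8
    // usable positions with the default bucket size).
    struct constant_hash {
        std::size_t operator()(const std::string&) const { return 42; }
    };

    int main() {
        cuckoohash_map<std::string, int, constant_hash> map;
        // The first handful of inserts succeed; once both candidate buckets are
        // full, each failed insert triggers another doubling of the bucket
        // array, which never helps, so the table keeps expanding until
        // allocation fails (the std::bad_alloc this issue reports).
        for (int i = 0; i < 100; ++i) {
            map.insert("key" + std::to_string(i), i);
        }
    }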

Rather than overhaul the entire design of the table, one way to stop the table from expanding forever would be to set a limit on the number of expansions we try before quitting. However, since the case you brought up seems fairly unlikely to occur in the real world, I'm not sure it's worth the tradeoff of introducing a new failure case to insertion. Also, stopping after a certain number of expansions wouldn't actually solve this problem, just prevent an infinite loop.

I suppose this situation would occur more generally whenever the range of values returned by your hash function is smaller than the actual table size, and you want to insert more than 8x the number of values that your hash function ranges over, but since std::hash or any other good hash function won't have this problem, I'm not sure this is worth fixing.

Are there any other scenarios you found that would trigger this behavior?

hb020 commented on July 17, 2024

The reason I was trying this scenario is that I am planning to store strings as keys. The range of values from any reasonably sized hash function is by definition smaller than the range of anything but very short strings, especially when the strings are Unicode. Simple math there. You will get collisions. And I was testing that. Now I understand that I can tune the bucket size, but that is a compile-time choice. At run time anything can happen, and the code should be able to handle that gracefully. Personally I need robust code. Crashing is out of the question.
So to answer your question: to me this is worth fixing, on principle. After your explanation I understand the design a lot better, and it looks like the fix is not that complicated and would have close to no impact on performance. But it's your call. It would make the library more robust...

manugoyal commented on July 17, 2024

I see. So the hash table is certainly built to handle collisions (the wikipedia article on cuckoo hashing is a great place to get an introduction to how we handle collisions, and the paper linked in the README has the full details), however with fixed bucket sizes, there's really no good way to handle the case when there are more collisions than the possible range of values that the hash function can represent. With a good hash function, though, you will almost certainly run out of memory before you can cause enough collisions to trigger the scenario you've identified, so it shouldn't really be a problem.

As a rough calculation, imagine your hash function returns a 64 bit integer (as most hash functions do), which means it can return 2^64 different values. In reality, your keys are not going to be perfectly distributed over the entire 2^64 range, so let's say it covers something like 2^54 different values. This means you would have to insert more than 2^54 * [the bucket size] keys in your hash table for it to run into this issue, because then there would be more than [bucket size] keys for every hash value, so resizing is not enough to resolve collisions. That's over 10^17 keys, which, if you're using string keys, is almost certainly more memory than any one machine has.
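
As a sanity check on that arithmetic (the 2^54 figure and the two-candidate-buckets-of-4 layout are just the assumptions from the paragraph above):

    #include <cstdio>

    int main() {
        // ~2^54 distinct hash values, each of which can hold at most
        // 2 candidate buckets x 4 slots = 8 keys before resizing stops helping.
        const double distinct_hashes = static_cast<double>(1ULL << 54);
        const double slots_per_hash  = 2.0 * 4.0;
        const double keys_needed     = distinct_hashes * slots_per_hash;
        // Prints roughly 1.4e+17; even at a very optimistic 16 bytes per entry
        // that's on the order of exabytes of memory.
        std::printf("keys needed to defeat resizing: %.2e\n", keys_needed);
    }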

I think variable-sized buckets would be possible to implement, but it would require a major re-design of the table and a significant change in the underlying cuckoo hashing algorithms, since there would be the additional problem of figuring out when it is best to resize the entire table (which is generally better for cuckoo hashing) versus resizing a specific bucket (which seems like it would only be necessary if you had a pathologically bad hash function).

I encourage you to play around with the insert_throughput benchmark in the tests/benchmarks directory. It'll let you run a randomized insert workload (with string keys if specified) and different table sizes, so you can explore how the table behaves under different workload types and sizes.

tewarfel commented on July 17, 2024

The whole intent of a hash table is that it relies on a good hash function to generate a repeatable (key-driven) pseudorandom mapping between keys and buckets. If your hash function does not pseudorandomly distribute your items, then you need a better hash function, not bigger buckets (or else your formerly fast hash table degenerates into a big, slow unsorted list). If "there are more collisions than the possible range of values that the hash function can represent", it means you are using the wrong hash function.

You can use strings as keys to the hash table, but you still need to run them through the hash function to obtain the pseudorandom mapping. It is a (very nice) feature of the library that you may plug in whichever hash function you want for your application. CityHash, the recommended default hash function, generates 64-bit hashes, which is adequate for most applications on modern hardware. If you don't want to use CityHash, I'd suggest you look at Bob Jenkins's "SpookyHash", which has 32-bit, 64-bit, and 128-bit output versions and works particularly well on older machines. (http://burtleburtle.net/bob/hash/spooky.html )
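
For what it's worth, plugging in a different hash function is just a matter of supplying a hasher type as a template argument; a rough sketch (the SpookyV2.h header and the SpookyHash::Hash64 call come from Bob Jenkins's distribution, and the libcuckoo header path may differ in your version):

    #include <cstddef>
    #include <string>
    #include <libcuckoo/cuckoohash_map.hh>
    #include "SpookyV2.h"  // from the SpookyHash distribution linked above

    // Hash std::string keys with 64-bit SpookyHash; the seed is arbitrary.
    struct spooky_string_hash {
        std::size_t operator()(const std::string& s) const {
            return SpookyHash::Hash64(s.data(), s.size(), 0x12345678);
        }
    };

    cuckoohash_map<std::string, int, spooky_string_hash> table;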

Also note that libcuckoo internally masks the hash function output so that it only uses as many hash bits as necessary to address the bucket array (which is always sized as a power of 2). Don't intentionally limit your hash function to the exact number of hash bits you think you'll need (in an attempt to avoid an array out-of-bounds condition); if the hash table then grows unexpectedly, such a hash function would be unable to address any of the buckets added by the resize. Instead, just configure your hash function to generate the maximum number of bits you think you may ever need in the worst-case scenario. Libcuckoo always masks off the ones it doesn't need.
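
The masking amounts to nothing more than the following (an illustrative stand-in, not the library's internals verbatim):

    #include <cstddef>
    #include <cstdint>

    // With 2^hashpower buckets, only the low `hashpower` bits of the hash pick
    // a bucket; the higher bits are masked away, and they automatically come
    // into play if the table later grows.
    std::size_t bucket_index(std::uint64_t hash, std::size_t hashpower) {
        const std::uint64_t num_buckets = std::uint64_t{1} << hashpower;
        return static_cast<std::size_t>(hash & (num_buckets - 1));
    }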

hb020 commented on July 17, 2024

I hear all of that, and I am aware that a good hash will probably not make libcuckoo crash. That is not the point. My point is robustness. All libraries I use must be stable and not provoke unforeseeable side effects. Potential infinite loops are simply a no-go, for any reason, even if the probability is low. If it is possible to trap incompatible use (i.e. wanting to put more than X elements in a bucket), then do it. Leaving it to chance is planning for disaster.

manugoyal commented on July 17, 2024

I think that's fair. With every hash function, there is always some very small chance that you'll insert more than 2*[bucket size] different elements with exactly the same hash value, and I think throwing an exception if we exceed some maximum number of expansion retries is better than looping forever and running out of memory.
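
Expressed as a standalone sketch (a hypothetical helper, not the library's actual internals), the bounded retry loop would look something like this:

    #include <stdexcept>

    // try_insert: attempts the cuckoo insert, returns true on success.
    // expand:     doubles the bucket array.
    // Instead of looping until std::bad_alloc, give up after a fixed number of
    // expansions and surface the problem as an exception.
    template <class TryInsert, class Expand>
    void bounded_insert(TryInsert try_insert, Expand expand, int max_retries = 8) {
        for (int attempt = 0; attempt <= max_retries; ++attempt) {
            if (try_insert()) return;
            expand();
        }
        throw std::runtime_error(
            "insert failed after repeated expansions: too many keys share the "
            "same hash buckets");
    }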

tewarfel commented on July 17, 2024

Manu, please consider the following alternative:
Define two optional const parameters that could be set at table creation: "maximum hashpower" and "minimum load factor"

  • If "maximum hashpower" is defined, cuckoo_expand_simple() will check when an expansion (not contraction) is requested. If the table hashpower is already at the "maximum hashpower", and if an expansion is requested (whether by the user or by an insertion failure) the exception gets thrown. I can see a number of cases where the user would set a "largest permissible data set" and could want that threshold set. A simple comparison in cuckoo_expand_simple() would have negligible performance impact.
  • if "minimum load factor" threshold is defined, cuckoo_insert() will check the table load factor just prior to returning a "failure_table_full" status. If the current load factor is below the "minimum load factor" threshold, throw the exception. We check at the end of cuckoo_insert() and not in cuckoo_expand() because the user may pre-emptively expand the table using .reserve(), and a small load factor (even zero) would be fine in that case.

I see typical load factors between 0.3 and 0.9 when failure_table_full gets returned. If a user wanted to define an optional minimum threshold, 0.06 would probably be a good first guess.

These thresholds would allow earlier run-time detection of bad hashing (or too much data) conditions, but these exceptions will not be any more recoverable than an "out of memory" condition is recoverable. You just get more information about what went wrong when they happen.

hb020:
What you are asking for (detecting whether "too many" of my data points are hashing to the same bucket) is essentially a form of Turing's halting problem, and cannot be answered prospectively. A program can't "figure out" whether everything hashes to the same location until it runs every bit of data through the hash function and evaluates the output.

The CityHash function is recommended because it has been extensively tested by Google and runs fast on modern hardware. If you choose some other hash, you need to do your own prospective evaluation (using SMHasher or the Dieharder test suites) before putting your hash function into the driver's seat of your hash table.

The cuckoo hash table provides fast, efficient data storage for large (but sparse) data structures; it sacrifices some determinism and worst-case performance in exchange for fast average performance - which is great for well-understood, long-running, large sparse data problems.

If your need is strict predictability, I would consider using a different data structure rather than a hash table. If you can afford the memory, a sparsely-filled conventional array with lots of zeros would probably run the fastest (maybe with a second vector or boolean array to keep track of which entries are non-zero). The other approach would be to use some variant of a linked list (more memory efficient, but slower than an array for inserts and changes). And (only slightly tongue-in-cheek) consider writing your code in Haskell (or maybe Clojure) rather than C/C++ if you want to minimize side effects and maximize predictability.

dave-andersen commented on July 17, 2024

Sorry, but to be very clear, a good hash will not make the table crash with any more probability than the machine exploding in a fire. It's only "probably" in the sense of "with high probability" -- it's simply not going to happen. The argument about being robust to a crappy hash is reasonable if it can be implemented at low overhead. But with a modern 64 bit hash function, the only way you'd get 8 exact hash collisions is with an adversarial workload, and if your hash table is subject to adversarial input, you should use siphash.

That said, I do support handling this in some reasonable way, because exhausting the machine's memory is a worse consequence of using the wrong hash function under adversarial input than the linear CPU exhaustion attacks that hash tables are more typically subject to. Any such users should still be using siphash, of course, but some won't.

manugoyal commented on July 17, 2024

@tewarfel I think it might be good to have both mechanisms in place. With just your alternative, I'm still worried about the totally default case, where somebody doesn't set those parameters and gets into an infinite loop due to a bad hash function. The expansion limit should prevent the infinite loop in the worst case, and the load factor and maximum hash power parameters can help catch the problem earlier.

It'd probably be best if they were settable parameters, so people could change the maximum hash power or minimum load factor dynamically if the need arises.

tewarfel commented on July 17, 2024

Hi Manu,
As your colleague Prof. Andersen said, the only times a multiple over-expansion failure will occur are:
-broken hash function
-intentionally crafted “toxic” keys designed to cause hash collisions

In either of those cases, the table will double in size and “load factor” will keep halving each time. There’s essentially no chance that automatically-triggered expansions (with non-pathological data and a working hash function) should ever take the “load factor” below ~6.25%.
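
A tiny numeric illustration of that halving (assuming 4-slot buckets and an item count stuck at the 8 that fit in the two colliding buckets):

    #include <cstdio>

    int main() {
        const double items = 8.0;  // everything collides into the same two buckets
        for (int hashpower = 1; hashpower <= 8; ++hashpower) {
            const double slots = 4.0 * (1u << hashpower);  // 4 slots per bucket
            // Load factor halves on every doubling: 1.0, 0.5, 0.25, ...
            std::printf("hashpower %d: load factor %.4f\n", hashpower, items / slots);
        }
    }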

Code the test for “minimum allowable load factor check before automatic resizing” and set the threshold at ~5%, with an error message along the lines of
“excessive number of collisions to the same hash buckets – either the hash is performing poorly, or keys are intentionally colliding.
Verify proper hash function operation, and consider switching to SIPHASH for improved resistance against intentional collisions."

This will flag the overt pathological case (where everything hashes to the same bucket) at the first expansion attempt for any non-trivially-sized array, and flag marginal home-rolled hash functions within a few expansions if they are sufficiently bad (or are “helpfully” pre-truncating their hash function’s output bits like I did at first).

Since checking is only needed just before an automatic expansion is triggered, the performance impact is near-zero and implementation complexity is low.

If you let the user raise or lower the threshold, they can be more paranoid (when testing home-built hash functions) by setting it higher at ~15-23%, or set it lower if they don’t care (a value of 0 effectively turns the checking off), or just want a hash power (max memory consumption) limit.

Since you’re trying to protect the user, I would disallow setting the threshold over 25% (as normal conditions can occasionally reach that).

I do not see the benefit for any checking more complex than this one test for detecting those two conditions (bad hash function / bad key values in the data).

The other check I suggested (maximum size (hashpower)) is just a nice-to-have option to warn the developer (or user) when data storage consumption exceeds the initial expectations. A normally-working system will trigger this simply by storing more data than expected.

Checking the hash power only before expansions would have negligible overhead (as opposed to a cautious developer trying to poll the array hash power himself after each insertion). I would not enable the maximum hashpower check by default, as there’s no predicting what a “reasonable” maximum size would be for all intended uses.

-Tom

manugoyal commented on July 17, 2024

Oh I see, so you're suggesting we set a minimum load factor for expansion during insertion (of say 5%) by default, but no maximum hash power by default, and let the user set them if they'd like? That sounds like it should work.

tewarfel commented on July 17, 2024

Yes. Initially, I was arguing against having any checks on by default, but now I agree you can and should do just the one test by default: confirm that the load factor is above a minimum threshold before allowing an automatically-triggered expansion to run, and I think a good default threshold would be around 5%.

I changed my mind after thinking about it, when I realized that running the check event-driven inside the library will be simpler to implement and have far less performance impact than having the user poll the table's size and load factor on every insert/update.

Allowing the user to change that threshold to anything between 0 and 25% would be a nice (but probably not necessary) bonus; being able to set an optional maximum hash power check as well would be another nice feature that would better detect and distinguish "pathological key/hashing" from “too much data”.

-Tom

manugoyal commented on July 17, 2024

Okay, I added minimum load factor and maximum hash power parameters. I think this resolves the issue.
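
For anyone finding this thread later, usage looks roughly like the following; the setter names match what went into the library, but check the header of your libcuckoo version for the exact signatures, defaults, and exception types (newer releases also wrap the class in a namespace):

    #include <string>
    #include <libcuckoo/cuckoohash_map.hh>

    int main() {
        cuckoohash_map<std::string, int> table;

        // Throw instead of expanding forever once an automatic expansion would
        // leave the load factor below 5% (the default discussed in this thread).
        table.minimum_load_factor(0.05);

        // Optional: refuse to grow past 2^20 buckets; exceeding it throws.
        table.maximum_hashpower(20);

        table.insert("hello", 1);
    }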
