
Comments (13)

ulfjack avatar ulfjack commented on September 24, 2024

We are considering a scenario where FindMissingBlobs may temporarily return incorrect results, i.e., report a blob as existing even though it does not, which can result in an infinite loop depending on how the client is implemented: the client calls FindMissingBlobs, then Execute, then FindMissingBlobs again, and so on.


EricBurnett avatar EricBurnett commented on September 24, 2024

@ulfjack I'd hope that would result in a build failure, not an infinite loop, once the overall retries for execution attempts had been exhausted. Each FAILED_PRECONDITION should count against the retry limit, no?

What clarification would you like to see? I could imagine e.g. "Clients SHOULD re-upload blobs named in PreconditionFailure errors, if available, without trusting FindMissingBlobs response values for these blobs" as a defense-in-depth mechanism clients can leverage against incorrect FindMissingBlobs responses. But clients may not have the blob to upload, in which case it should be expected to be a build failure eventually.
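A minimal client-side sketch of that defense-in-depth behaviour (Go; `reuploadBlob` and the surrounding wiring are hypothetical stand-ins, not the actual SDK API - only the gRPC status/errdetails handling and the REv2 `"blobs/{hash}/{size}"` subject format are taken from the spec):

```go
package casretry

import (
	"context"
	"strings"

	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// reuploadBlob is a stand-in for however the client writes a blob to the CAS
// (ByteStream Write or BatchUpdateBlobs); purely hypothetical here.
func reuploadBlob(ctx context.Context, hash, sizeBytes string) error {
	// ... look the blob up locally and upload it, or fail if we no longer have it.
	return nil
}

// handleExecuteError inspects a failed Execute call. If the error is a
// FAILED_PRECONDITION carrying PreconditionFailure details, it re-uploads
// every blob named in a "MISSING" violation directly, without consulting
// FindMissingBlobs again.
func handleExecuteError(ctx context.Context, err error) error {
	st, ok := status.FromError(err)
	if !ok || st.Code() != codes.FailedPrecondition {
		return err // Not something we can repair by re-uploading.
	}
	for _, detail := range st.Details() {
		pf, ok := detail.(*errdetails.PreconditionFailure)
		if !ok {
			continue
		}
		for _, v := range pf.Violations {
			// Per the REv2 spec the subject has the form "blobs/{hash}/{size}".
			if v.Type != "MISSING" || !strings.HasPrefix(v.Subject, "blobs/") {
				continue
			}
			parts := strings.Split(v.Subject, "/")
			if len(parts) != 3 {
				continue
			}
			if upErr := reuploadBlob(ctx, parts[1], parts[2]); upErr != nil {
				return upErr // Blob not available locally: surface as a build failure.
			}
		}
	}
	return nil // Caller may retry Execute, counting this attempt against its retry budget.
}
```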

I would not change the expectation of correctness around FindMissingBlobs in general...it's really the responsibility of FindMissingBlobs to be correct, and to err on the side of false-positives over false-negatives. Obviously we can't ever guarantee that for lossy CAS implementations since the blob can go missing between the two calls, but if any inconsistencies can be retroactively-corrected (e.g. upon a client trying to read a blob that is learned to be missing, update the database synchronously to no longer register it as present) and/or time-boxed (in the face of a lossy CAS, don't cache presence states outside the canonical DB for longer than a client will take to exhaust all its retries), maybe that will suffice. But as you've heard me say before, I consider a reliable CAS an important investment.

I'll note that even if blobs are going to be removed arbitrarily due to space pressure, the FindMissingBlobs semantics can still be maintained - e.g. by leveraging a two-phase GC where blobs are flagged as missing for FindMissingBlobs purposes but still downloadable for a period of time before they're removed entirely. That way there's time to propagate that state around and no risk that the removal of the blob is observable before a FindMissingBlobs call will tell clients to expect it.
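Roughly, such a two-phase index could look like the sketch below (made-up types, not Buildbarn's or anyone's real implementation): doomed blobs are reported missing immediately, but their bytes stay readable until a grace period longer than clients' retry horizon has elapsed.

```go
package twophasegc

import (
	"sync"
	"time"
)

// blobState is a hypothetical in-memory index entry; a real CAS would keep
// this in its metadata store.
type blobState struct {
	doomedAt time.Time // zero value means the blob is live
}

// Index sketches the two-phase GC described above.
type Index struct {
	mu    sync.Mutex
	blobs map[string]*blobState // keyed by "{hash}/{size}"
	grace time.Duration
}

// Missing reports whether FindMissingBlobs should list the blob as missing.
// Doomed blobs are treated as missing immediately, so clients re-upload
// before the data actually disappears.
func (ix *Index) Missing(key string) bool {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	st, ok := ix.blobs[key]
	return !ok || !st.doomedAt.IsZero()
}

// Readable reports whether a read (ByteStream Read) may still be served.
// Doomed blobs stay readable until the grace period has elapsed.
func (ix *Index) Readable(key string) bool {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	st, ok := ix.blobs[key]
	if !ok {
		return false
	}
	return st.doomedAt.IsZero() || time.Since(st.doomedAt) < ix.grace
}

// Doom marks a blob for collection without deleting its bytes yet.
func (ix *Index) Doom(key string) {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	if st, ok := ix.blobs[key]; ok && st.doomedAt.IsZero() {
		st.doomedAt = time.Now()
	}
}
```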


ulfjack avatar ulfjack commented on September 24, 2024

@EricBurnett, yes, it should eventually give up after exhausting the retry attempts.

Yes, I think your text proposal makes sense, although I'd try to add some details to provide more of a rationale.

Unfortunately, Bazel is currently badly behaved in that it asks about the same digest a lot, and we had no choice but to implement deduplication and caching of existence checks in the scheduler, as we were otherwise seeing both excessive backend load and frontend throughput issues. I have a patch for Bazel to dedupe requests client-side, but it's not merged yet, and we have seen similar behavior from other clients.

In general, in a distributed system we may see interesting failure modes due to the CAP theorem. Our implementation errs on the side of availability and performance over consistency. Our cache is scheduler-local, so if the client hits different schedulers for different calls, then we may not see the failed CAS read on the scheduler that has the cache entry (we are considering broadcasting CAS read failures to improve consistency in this case).

Unfortunately, caching increases the chance of false negatives as well as false positives, and both can cause problems for Bazel at this time.

Both of these can be handled gracefully by the client. False positives primarily affect execution, and having clients re-upload the blobs named in a failed Execute rather than retrying FindMissingBlobs helps safeguard against them.

False negatives primarily affect performance, but can also break the build for clients that do something like Bazel's build-without-the-bytes mode. Note that Bazel is currently susceptible to build breakage on any backend GC. It currently never times out the local cache and never re-requests the action. Apart from timing out local entries, the general solution for Bazel is 'action rewinding', i.e., re-running the action that would have generated that output file. There is code in Blaze for this, but it doesn't work in Bazel right now (FWIW this has been repeatedly requested by the community).

Note that a) Google internally uses an unreliable CAS, and b) its internal API doesn't differentiate between FindMissingBlobs and Execute, which allows it to avoid this problem by hiding the details in the service. REAPI can't do that because of the Merkle tree structure.

Finally, I'll also say that a lossy CAS can potentially provide better performance at a lower cost, but it does require some client cooperation.


EricBurnett avatar EricBurnett commented on September 24, 2024

Thanks for the details Ulf. Mostly makes sense - just extracting a few questions/comments. Feel free to send a PR for the text you want to see, expanded however you were thinking.

Unfortunately, Bazel is currently badly-behaved in that it asks about the same digest a lot

Agreed - would be happy to see bazel-side improvements here!

Unfortunately, caching increases the chance of false negatives as well as false positives, and both can cause problems for Bazel at this time.

Do you find much value in caching negatives? We do not cache them, and I wouldn't expect it to be a benefit - most negatives arise from either a blob the client can upload (and so are one-off transient - will be present on next request) or stale cache entries, which should be self-limiting (either action cache entries, which the server can check for and prune, or client-side cache, which will presumably cause the build to restart avoiding checks for any other equally stale entries from that client). In general I'd expect caching negative values to be more dangerous than efficient?

Note that Bazel is currently susceptible to build breakage on any backend GC. It currently never times out the local cache and never re-requests the action. Apart from timing out local entries, the general solution for Bazel is 'action rewinding'

Fair point, I forgot that wrinkle. Independent of full action-level rewinding, timing out local entries should be fairly tractable (if prioritized; I can't speak for bazel as to where it is on their list). One approach would be a simplistic change: if doing an incremental build on top of cached action state and encountering a missing output file, dump state and start the build from the beginning, refreshing or re-running all action-cache entries along the way. Kinda like rewinding, except without all the messy complicated logic.

Note that a) Google internally uses an unreliable CAS, and b) doesn't differentiate between FindMissingBlobs and Execute, which allows it to avoid this problem by hiding the details in the service. REAPI can't do that because of the Merkle tree structure.

Interesting point w.r.t. the FindMissingBlobs/Execute split - yeah if we were designing for an unreliable cache from the start that would not have been the optimal choice. Beyond that, I'd be careful of over-indexing on the Google-internal design decisions: it was designed that way in part because a reliable CAS was a lot harder to imagine implementing when it was created (predates most scalable DB tech you might use today - Spanner paper was 2012), not because it's necessarily the right "first-principles" design for today.

A lossy CAS is only useful if you can use it to materially reduce your storage costs. You probably can't have lossy "pure" outputs - handing off a build to a user and then rapidly losing their test logs is unlikely to be sufficient - so it's really only intermediate outputs you may want to risk losing. But even there, losing say 0.1% of files would probably cause you to rewind most in-flight builds to the beginning anyway...between each action taking multiple inputs and each action depending on many upstream actions, a small number of randomized outputs being lost will invalidate a large portion of the build graph. From there, if we posit you want to lose <1/1000 at a time for rewinding to make sense, you'll need to shard over >1000 servers (if in-memory) or disks. And then either under-load those servers/disks to deal with poor distribution and hotspotting (costing you say 2x in over-provisioning), or implement chunking. But with chunking, if you want to lose <0.1% of files you have to lose considerably fewer chunks, so more sharding, at which point your frequency of loss events is starting to get noticeable.
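To make the "invalidates a large portion of the build graph" point concrete, here is a back-of-the-envelope calculation (the loss fraction and build sizes are assumptions purely for illustration):

```go
package main

import (
	"fmt"
	"math"
)

// If a fraction p of CAS blobs is lost at random, a build whose incremental
// state references k intermediate outputs is unaffected only if none of those
// k blobs were lost, i.e. with probability (1-p)^k.
func main() {
	p := 0.001 // lose 0.1% of files
	for _, k := range []int{100, 1000, 10000} {
		survive := math.Pow(1-p, float64(k))
		fmt.Printf("k=%5d referenced outputs -> P(build unaffected) = %.5f\n", k, survive)
	}
	// k=10000 gives roughly e^-10 ~ 0.00005: essentially every large in-flight
	// build is disturbed, so "rewind a few actions" degrades into "restart".
}
```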

To mitigate this, you probably want to store more than one copy of each blob or chunk. Full redundancy is an expensive way to achieve that if you're prioritizing storage costs, so you probably want to save by implementing some sort of erasure coding instead, a la this post. Ok great. But now we should be comparing the cost difference between

  • "not durable": accepting periodic loss events that require restart of all in-flight builds, e.g. using a few memcache nodes for your CAS and BwtB downloading all final outputs for persisting somewhere else. Simple, cheap.
  • "somewhat durable": losing files frequently enough that rewinding is worth implementing, but not in "large" loss events where fully restarting builds is just as good. Most complex (not taking advantage of assuming files stay present; implementing rewinding); marginally cheaper than "highly durable".
  • "highly durable": losing files infrequently enough that the rate of failed builds due to this is not worth over-engineering for - accept the failures and retry the builds. A subset of what's required for "somewhat durable", with a higher redundancy factor.

Which is all to say, I can buy the argument that having a low-durability CAS and being willing to restart in-flight builds upon losing a server or disk makes sense as a way to have a cheaper CAS. Go for it! We should make sure the API remains amenable to this, and I'd be open to bazel having some sort of "restart whole build" failure mechanism to improve the UX.

I'm very skeptical though that the "somewhat durable" middle-ground is wide enough to engineer for. Cost-wise, it's a 10-20% delta on storage costs I'd guess, and a superset of the complexity needed to be "highly durable". And this is not accounting for the hidden cost of state synchronization for re-checking that files that should exist actually do all the time - might actually cost more on net, all-in. I remain open to the possibility...but am at the point where I want to see a concrete argument for how exactly it can be cheaper (and how much cheaper) before I'll support anything tied to mostly-durable-CAS+action-rewinding as a cost savings. The math never seems to work out materially in its favour.
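For a rough sense of that delta, compare redundancy factors under a couple of illustrative erasure-coding configurations (the (data, parity) parameters below are assumptions chosen for the comparison, not anyone's actual setup):

```go
package main

import "fmt"

// Redundancy factor of a (data+parity, data) erasure code is (data+parity)/data.
func main() {
	overhead := func(data, parity int) float64 { return float64(data+parity) / float64(data) }

	fmt.Printf("3x replication:         %.2fx raw bytes per logical byte\n", 3.0)
	fmt.Printf("RS(10 data, 2 parity):  %.2fx  (a lower-durability choice)\n", overhead(10, 2))
	fmt.Printf("RS(10 data, 4 parity):  %.2fx  (a higher-durability choice)\n", overhead(10, 4))
	// 1.2x vs 1.4x is a ~17% storage delta, in the 10-20% range guessed above -
	// the margin the "somewhat durable" middle ground has to pay for itself from.
}
```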

(Note: I don't conflate "low durability" with "short lifetime" - getting enough metadata into the system to store blobs for as short a period as possible I'm all for! I just remain fairly convinced it should be durable in the sense of "expected to survive to the end of its desired lifetime", for whatever lifetime that is).
(Note 2: I'm aware Google uses rewinding internally; that's targeted at solving a somewhat different problem though and doesn't really apply to the discussion here)


ulfjack avatar ulfjack commented on September 24, 2024

We are not caching negative results as such, but we are coalescing requests for the same file, and there are multiple race conditions that can result in false negatives due to that. For example, one client asks about a file X and another client uploads the file concurrently. Then the second client asks about X's presence again (for whatever reason). Now, if the second client's request is merged with the first one, then it may see a false negative even though it just uploaded the file. We can update the scheduler's cache where the upload happens, but then what about hitting another scheduler? Broadcasting the successful upload between schedulers may help?
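A sketch of where that false negative comes from, using Go's singleflight package for the coalescing (the backend check is a hypothetical stand-in):

```go
package coalesce

import (
	"context"

	"golang.org/x/sync/singleflight"
)

// checkBackend is a stand-in for the authoritative (and expensive) existence
// check against the storage backend; hypothetical for this sketch.
func checkBackend(ctx context.Context, digest string) (bool, error) {
	// ...
	return false, nil
}

var group singleflight.Group

// Exists coalesces concurrent existence checks for the same digest, as the
// scheduler-side deduplication described above does. The race: if a check for
// X is already in flight, a caller that just finished uploading X is attached
// to that older check and can observe a stale "missing" result (a false
// negative), even though its own upload succeeded.
func Exists(ctx context.Context, digest string) (bool, error) {
	v, err, _ := group.Do(digest, func() (interface{}, error) {
		return checkBackend(ctx, digest)
	})
	if err != nil {
		return false, err
	}
	return v.(bool), nil
}

// One mitigation is to record successful uploads locally (or broadcast them
// between schedulers) and consult that record before trusting a coalesced
// negative result.
```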

Wrt. Bazel timing out local actions: this may be more difficult than supporting action rewinding. The reason is that we don't currently have a concept of time in Skyframe, whereas action rewinding is already mostly implemented, except for that one missing piece (famous last words?).

Wrt. the lossy CAS: The scenario we're looking at is that we want to run a remote execution cluster and there is no existing large-scale storage infrastructure (might be on-prem or a cheap data center). We think this use case is important enough to support. We think we can make do with an integrated 'lossy' CAS as long as individual machine failures are sufficiently rare (and yes, keeping multiple copies of the data can help cover cases where they are not quite as rare). IIRC, at least in some configurations, Buildbarn also loses CAS entries on machine failures.

Another use case is using GCS or S3 as a CAS, neither of which guarantees consistency (this is a bit simplified).


EdSchouten avatar EdSchouten commented on September 24, 2024

IIRC, at least in some configurations, Buildbarn also loses CAS entries on machine failures.

Yes. If you don’t use MirroredBlobAccess, for example. That said, those setups are simply best effort. There is no need to complicate the REv2 spec to account for that. I’m more than happy to extend any documentation on my end to make this more clear if desired.


illicitonion avatar illicitonion commented on September 24, 2024

action rewinding is already mostly implemented, except for that one missing piece (famous last words?).

Can you elaborate on what's missing (maybe a link to an issue?) - I was trying it out a few weeks ago and it seemed to be working fine for all of the use-cases I tested out, but those use-cases were all fairly simple.


EricBurnett avatar EricBurnett commented on September 24, 2024

@ulfjack :

We are not caching negative results as such, but we are coalescing requests for the same file...

Depending on how often this results in negative results, you could follow the pattern of "make a cheap/stale check (the coalesced check)", and then if false, "make a check with strong consistency" to confirm the negative (coalesced with nothing from before the original request was initiated, though other subsequent lookups can be coalesced with it).
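In sketch form (both checks are hypothetical stand-ins for the scheduler's cheap coalesced path and its strongly consistent path):

```go
package existence

import "context"

// fastCheck is the coalesced / possibly stale path; strongCheck is a fresh,
// strongly consistent lookup started only after this request arrived, so it
// cannot be merged with an older in-flight check.
func fastCheck(ctx context.Context, digest string) (bool, error)   { return false, nil }
func strongCheck(ctx context.Context, digest string) (bool, error) { return false, nil }

// Exists trusts a cheap positive answer, but confirms every negative with a
// strongly consistent check before reporting the blob as missing in a
// FindMissingBlobs response.
func Exists(ctx context.Context, digest string) (bool, error) {
	present, err := fastCheck(ctx, digest)
	if err != nil || present {
		return present, err
	}
	return strongCheck(ctx, digest)
}
```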

The scenario we're looking at is that we want to run a remote execution cluster and there is no existing large-scale storage infrastructure (might be on-prem or a cheap data center)

This to me argues for some sort of bazel-level build retries - being able to tolerate discrete "catastrophic" loss events of anywhere up to the entire CAS at a time. A starting point might be a wrapper that looks for exit code 32 and tries again? Though for all cases not using BwtB, I think bazel can be made robust enough on the existing APIs to recover from the loss and re-upload, and I'd support changes in the spirit of robustness like ignoring FindMissingBlobs upon Execute file consistency errors as we discussed above.
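A minimal sketch of such a wrapper (treating exit code 32 and the retry budget as assumptions for illustration):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// Re-runs the given bazel command when it exits with code 32 (the value
// mentioned above; an assumption here, not a spec).
func main() {
	const maxAttempts = 3
	for attempt := 1; ; attempt++ {
		cmd := exec.Command("bazel", os.Args[1:]...)
		cmd.Stdout, cmd.Stderr, cmd.Stdin = os.Stdout, os.Stderr, os.Stdin
		err := cmd.Run()
		if err == nil {
			os.Exit(0)
		}
		exitErr, ok := err.(*exec.ExitError)
		if !ok {
			fmt.Fprintln(os.Stderr, err) // bazel could not be started at all
			os.Exit(1)
		}
		code := exitErr.ExitCode()
		if code != 32 || attempt == maxAttempts {
			os.Exit(code)
		}
		fmt.Fprintf(os.Stderr, "build lost remote state (exit %d), retrying (%d/%d)\n", code, attempt, maxAttempts)
	}
}
```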

But now I'm also wondering the same as @illicitonion ... just how close is rewinding to being fully implemented in Bazel? I'm mostly a curmudgeon around it as unnecessary complexity, but if that complexity already exists, I have no leg to stand on :).

Another use case is using GCS or S3 as a CAS, neither of which guarantees consistency (this is a bit simplified).

I'd be interested in changes necessary to support these better also, but I think they have a different problem? GCS is strongly consistent and durable, which I think is all you need...bit of messiness around garbage collection also, but that only applies if you want to combine both GC and BwtB.

S3 is harder, in that it's eventually consistent for visibility of new uploads. That's problematic for a CAS - if you've successfully uploaded a blob but it's not visible when someone wants to go download it, what can they reasonably do? I guess it depends how often this inconsistency arises, and how long it can persist. If infrequent and short, allowing the higher-level Execute retry loop to kick in may suffice to mask this. If frequent and short, it may be worth waiting and re-confirming negatives before erroring. If long enough to survive all retries...I guess retry longer/slower, or don't use S3? A CAS is not much good without read-after-write.


ulfjack avatar ulfjack commented on September 24, 2024

I think Bazel is fairly robust to CAS loss except for BwtB. IIRC, I did run some tests for that earlier this year, but that was before we added the cache on the server-side.

Last I talked with @anakanemison, rewinding was incompatible with the default Skyframe graph implementation, but admittedly, that was 8+ months ago. @illicitonion, did you see actual rewinding, or did you 'just' enable the flag? I wonder if Bazel might be silently ignoring the flag.

Ah, so I was wrong about GCS. My apologies. That is good to know.

S3 provides read-after-write consistency within a region. It is eventually consistent for overwrites, but that isn't necessarily a problem. It is eventually consistent for delete, which can lead to false positives from FindMissingBlobs.


EdSchouten avatar EdSchouten commented on September 24, 2024

Keep in mind that if you first do a HEAD on S3 (e.g. as part of a FindMissingBlobs implementation) and then PUT the object (ByteStream Write), it is not guaranteed that GET requests (ByteStream Read) issued afterwards will immediately succeed.

From the S3 documentation:

Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all Regions with one caveat. The caveat is that if you make a HEAD or GET request to a key name before the object is created, then create the object shortly after that, a subsequent GET might not return the object due to eventual consistency.
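One defensive option on the read path is to retry transient not-found results for a short grace window; a rough sketch (hypothetical helpers standing in for the S3-backed fetch, not the AWS SDK; the retry interval and window are assumptions):

```go
package s3cas

import (
	"context"
	"errors"
	"time"
)

// errNotFound and fetch stand in for the S3-backed read path (a 404/NoSuchKey
// from GetObject would map to errNotFound here).
var errNotFound = errors.New("blob not found")

func fetch(ctx context.Context, key string) ([]byte, error) { return nil, errNotFound }

// readWithGrace retries transient not-found results for a short window, to
// paper over the HEAD-then-PUT caveat quoted above: a blob that was just
// written may briefly appear missing on GET even though the upload succeeded.
func readWithGrace(ctx context.Context, key string, grace time.Duration) ([]byte, error) {
	deadline := time.Now().Add(grace)
	for {
		data, err := fetch(ctx, key)
		if !errors.Is(err, errNotFound) || time.Now().After(deadline) {
			return data, err
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(200 * time.Millisecond):
		}
	}
}
```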


illicitonion avatar illicitonion commented on September 24, 2024

Last I talked with @anakanemison, rewinding was incompatible with the default Skyframe graph implementation, but admittedly, that was 8+ months ago. @illicitonion, did you see actual rewinding, or did you 'just' enable the flag? I wonder if Bazel might be silently ignoring the flag.

IIRC my experiment was roughly (all with build without the bytes):

  • Set up a chain of 100 genrules each of which depends on output of the last
  • Execute a build for the first 99 actions
  • Flushed the remote CAS
  • Executed a build for the last action
    I observed all 99 of the required actions being re-run, and the 100th succeeding.


EdSchouten avatar EdSchouten commented on September 24, 2024

Last I talked with @anakanemison, rewinding was incompatible with the default Skyframe graph implementation, but admittedly, that was 8+ months ago. @illicitonion, did you see actual rewinding, or did you 'just' enable the flag? I wonder if Bazel might be silently ignoring the flag.

IIRC my experiment was roughly (all with build without the bytes):

  • Set up a chain of 100 genrules each of which depends on output of the last
  • Execute a build for the first 99 actions
  • Flushed the remote CAS
  • Executed a build for the last action
    I observed all 99 of the required actions being re-run, and the 100th succeeding.

I’ve observed that enabling builds without the bytes causes Bazel to rebuild stuff unnecessarily. Did you verify that no rebuilding happened when you didn’t flush the remote CAS? It may have been the case that Bazel always did a rebuild, regardless of whether data was present in the CAS or not.


illicitonion avatar illicitonion commented on September 24, 2024

Last I talked with @anakanemison, rewinding was incompatible with the default Skyframe graph implementation, but admittedly, that was 8+ months ago. @illicitonion, did you see actual rewinding, or did you 'just' enable the flag? I wonder if Bazel might be silently ignoring the flag.

IIRC my experiment was roughly (all with build without the bytes):

  • Set up a chain of 100 genrules each of which depends on output of the last
  • Execute a build for the first 99 actions
  • Flushed the remote CAS
  • Executed a build for the last action
    I observed all 99 of the required actions being re-run, and the 100th succeeding.

I’ve observed that enabling builds without the bytes causes Bazel to rebuild stuff unnecessarily. Did you verify that no rebuilding happened when you didn’t flush the remote CAS? It may have been the case that Bazel always did a rebuild, regardless of whether data was present in the CAS or not.

I'm pretty sure I did, but I'm not certain. I'll run some experiments tomorrow :)

