Currently, when an execution fails, we allow returning a partially-populated <code cla

Exactly what <a class="user-mention notranslate" data-hovercard-type="user" data-hover

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

I don't really understand what the issue is here. If an infras

Protect against partial actions in the cache,about bazelbuild/remote-apis

Comments (24)

tjgq commented on September 24, 2024 3

Or are you suggesting that the client (user) should fix their build rules, generate a new Command, and then execute it in a new Action?

This is the point I'm making above. Fixing the genrule to produce 2.txt would result in a distinct Command and would not hit the existing AC entry.

But Invalid action cache entry is a really bad error message (it makes it sound like a programmer error, but it's actually a rule author error). I think it should say something along the lines of mandatory output 2.txt was not produced by action execution.

from remote-apis.

sluongng commented on September 24, 2024 1

One of our customers run into this situation recently and it's unclear to me where the problem should be fixed: on the client side (Bazel), on the API spec (this repo), on the Server side, or a combination of the above.

The problematic code in Bazel is over at https://cs.opensource.google/bazel/bazel/+/master:src/main/java/com/google/devtools/build/lib/remote/RemoteExecutionService.java;l=1266-1288;drc=3e79472690126689304c714711d911395db3a278. In most action implementation in Bazel, all declared outputs in Command would be treated as mandatory. But in some special native actions (CppCompile, JavaCompile, etc..) some output paths would be considered as "optional". This "optional/mandatory" property is definitely not described in the current API spec and thus, is not known to the server side, but the client side does apply a validation over the ActionResult and yell Invalid action cache entry if some of the "mandatory" were missing.

Our customers have some internal setup where the remote action is executed via multiple layers of wrappers. In some infrastructure changes (i.e. server rotation), one of these wrappers handles SIGTERM in a way that it would exit 0, signaling a successful action, without producing the outputs that the client deemed to be mandatory. Because of exit code zero, our server treated this as a valid ActionResult and recorded the partial outputs into AC. However, the users would experience confusing errors from Bazel as they would see it as Invalid action cache entry and question the cache integrity.

Our current solution is to advise the customer to fix their wrapper to produce a non-zero exit code in case of SIGTERM. However, I think we should discuss what is the right solution here moving forward. A couple of options to consider:

Client and Server both agree that all outputs listed in Command are Mandatory when exit code zero is given. This way, the server could perform extra validation over the missing outputs and avoid invalid writing ActionResults into AC that the client won't be able to accept. This means that the client is responsible for ensuring all actions need to create corresponding outputs (i.e. create an empty file if the file is previously "optional").
The client and Server both agree that optional outputs are possible. ActionResult validity is judged solely based on the exit code equal to zero. The client needs to make sure that when exit code zero, the output files are correct. Subsequent downstream action failures for missing inputs (as previous outputs were not given) should be handled on the client side.

IMO (1) will provide a better user experience, as the server could provide verbose errors for the client regarding missing outputs. But it will take a bit more effort to implement and enforce. (2) is quick and easy, as it's closer to the current state. But it delegates more responsibilities toward build rules authors and end-users, causing worse UX.

cc: @EdSchouten for V3 consideration.

from remote-apis.

tjgq commented on September 24, 2024 1

A user-provided command could fail to produce some requested outputs because of other reasons and then exit with code zero as well. Such "Action Cache Entry" would be considered invalid by some clients (i.e. Bazel).

Again, under the assumption of determinism, I don't necessarily see this as a problem: Bazel would consider the action result invalid, but that's fine because you'd have to change something in the Command to produce a valid result (as opposed to just rerunning it and expecting a different result). It does place a burden on users (which in the Bazel case means not just Bazel itself, but also rule authors) to ensure all error conditions produce a nonzero exit, but that burden already exists irrespective of mandatory outputs.

I think we should codify the definition of success in the proto documentation:

To be completely clear, what "success" means here is whether the action result is allowed to be cached, right? (i.e., a nonzero exit would still result in an ExecutionResponse.status == OK?)

For completeness, there's an option 3, which is to let the client explicitly mark mandatory outputs in the Command message. (After some spelunking I did yesterday, it turns out that's what the Google-internal RE equivalent does.)

from remote-apis.

EdSchouten commented on September 24, 2024 1

Exactly what @tjgq says. :-)

The AC entry that is created in case of the erroneous action is when taken in isolation valid. You're only touching 1.txt, so you get an ActionResult that only contains 1.txt. The AC entry, though not very useful, will in no way conflict with newer versions of the rule that do touch the correct set of files.

from remote-apis.

ola-rozenfeld commented on September 24, 2024

To clarify, this was an RBE bug which imo is against the API spec. The API states that if an action was not executed, the server returns a failure status in the ExecuteResponse field. Obviously, whenever this status is not OK is considered an action failure and should not populate the remote cache (unlike what we accidentally did).

But amending the API to additionally consider output-less actions as failures may be a good idea regardless.

from remote-apis.

bergsieker commented on September 24, 2024

Adding to list of issues to consider for v3. Not clear what the right answer is, but we can consider it at that time.

from remote-apis.

mostynb commented on September 24, 2024

I can think of one scenario where an ActionResult with no output apart from the exit code might be useful: sanity checks.

Maybe in v3 it would be sufficient to require that the ActionResult's execution_metadata field has at least one non-default field value? Setting the worker name would be the simplest way to implement this.

from remote-apis.

mostynb commented on September 24, 2024

Actually, maybe we can advise that the backend implement this behaviour in REAPIv2: when receiving an ActionResult that does not have the worker set in the metadata, set it to some non-default value (such as the IP address of the client)? This allows for non-zero filesize checks on the backend, and we can consider making it compulsory for the client to set this field in REAPIv3 (which would then allow us to catch some client and transmission errors).

(I'm trialing such a workaround in bazel-remote at the moment.)

from remote-apis.

tjgq commented on September 24, 2024

@sluongng As far as I can tell, there's no distinction between "mandatory" and "optional" outputs in Bazel. The declared outputs of a spawn are exactly the ones we populate Command.output_paths with (see here). My reading of the spec is that the ActionResult.output_* fields are required to be a subset of Command.output_paths. Therefore, the code you linked to would seem to imply that we treat every output as mandatory. I don't expect native actions to invalidate this, because the implementation is the same as far as remote caching/execution is concerned; can you clarify in what way C++/Java actions differ?

I do agree that, if we are to always treat outputs as mandatory, it's problematic for a remote implementation to cache action results where some of the declared outputs are missing. Unless my analysis above is wrong and we do have a use for optional outputs in Bazel, I'd strongly prefer (1).

from remote-apis.

EdSchouten commented on September 24, 2024

I don't really understand what the issue is here.

If an infrastructure failure occurred while running the action, Execute() will return an ExecuteResponse that has a non-null status. Even if exit_code is set to zero in that case, clients can distinguish that from normal process termination.
In the case of GetActionResult() there is no ExecuteResponse. You get the bare ActionResult back. But why would a remote execution system write anything into the Action Cache in case of an infrastructure failure? It is completely valid to just drop it on the floor. My assumption was that any reasonable implementation would already do that.

The ActionResult you get back simply describes which files were present after the action ran. It is up to the client to determine whether those results pass any client-imposed restrictions.

from remote-apis.

sluongng commented on September 24, 2024

@tjgq The default implementation for isMandatoryOutput in Spawn.java could be overridden by classes that implement that interface. For example: SimpleSpawn.java (which is used in CcCompile?) and JavaCompileAction.java (used in JavaCompile?) both override it to set some output paths to be optional.

@EdSchouten

But why would a remote execution system write anything into the Action Cache in case of an infrastructure failure?

I think this is the key question here.

From our server perspective, there was no infra failure. Our workers were gracefully shutting down and the action on said worker exited with code zero after given SIGTERM. We interpreted a zero exit code, generated by a user's provided wrapper, as a successful action. Hence we wrote the Action Result to AC.

If we could agree that a user action exits with code zero, but does not produce all the listed outputs in Command, is a fail action, then we could implement that check (for missing outputs) on the worker side and avoid writing that Action Result to AC. However, this is not how some Bazel native actions are currently interpreting things, so we would need to fix those native actions accordingly before rolling out such a validation scheme on the worker side.

from remote-apis.

EdSchouten commented on September 24, 2024

If workers are shutting down gracefully, why are they sending SIGTERM to the build action? That's not graceful.

If workers send SIGTERM to an action, they should already record some state in their bookkeeping that they did not allow the action to run to completion, and should therefore not write anything to the AC, regardless of the exit code returned by the build action.

from remote-apis.

tjgq commented on September 24, 2024

@sluongng Apologies; you're completely right. We do have a notion of optional outputs (for Java and C++). I don't feel like I have a solid understanding of why they're necessary at this time, but in the absence of one, I'm going to assume they exist for $important_reasons, and we can't (easily) get rid of them.

Still, even if we establish that an action is allowed to produce partial outputs, it's unclear to me that we need to revise the definition of success: an action should deterministically produce the same subset of the outputs when presented with the same inputs. A client might later interpret a missing output as an error, but that shouldn't compromise the ability to reuse the action result (it would just deterministically lead to the same error at the client level).

But, to Ed's point, under the current definition of success, you do need to ensure that infrastructure failures always result in a nonzero exit code or a non-success status (which seems completely orthogonal to the discussion of mandatory vs. optional outputs).

from remote-apis.

sluongng commented on September 24, 2024

If workers are shutting down gracefully, why are they sending SIGTERM to the build action? That's not graceful.

This is a good point. I guess we could consider all commands that are interrupted by infrastructure reasons to be retriable failures, regardless of the exit code.

However, this is only one of the possible causes that could yield this "Invalid action cache entry" error.
A user-provided command could fail to produce some requested outputs because of other reasons and then exit with code zero as well. Such "Action Cache Entry" would be considered invalid by some clients (i.e. Bazel).

Still, even if we establish that an action is allowed to produce partial outputs, it's unclear to me that we need to revise the definition of success: an action should deterministically produce the same subset of the outputs when presented with the same inputs.

I think we should codify the definition of success in the proto documentation:

A zero exit code, regardless of actual outputs matching requested outputs in Command.
OR a zero exit code AND all requested outputs in Command are available.

Right now, I would assume that most server implementations are going with (1).
However, Bazel is performing client-side validation for (2) in most cases.

If we were to agree upon (1), which is what I am picking up from @tjgq's message, then we could have a separate discussion in Bazel regarding guidance on user-provided build rules.

from remote-apis.

sluongng commented on September 24, 2024

Agree. Option 3 seems like a great solution :)

Is it possible to include it in REv2 or is it disruptive and have to wait for REv3 (as proposed in some earlier comments in this issue a few years ago)?

from remote-apis.

EdSchouten commented on September 24, 2024

For completeness, there's an option 3, which is to let the client explicitly mark mandatory outputs in the Command message. (After some spelunking I did yesterday, it turns out that's what the Google-internal RE equivalent does.)

I personally don't see the point in doing that. Clients already can't necessarily trust the ActionResult they receive. Even if the server does some checking on its end, there is no absolute guarantee that a call to GetActionResult() will yield an ActionResult message that contains entries for all the paths you're interested in. Clients need to do some form of error handling based on that anyway. Otherwise they would crash on a null pointer dereference/KeyError exception/...

from remote-apis.

tjgq commented on September 24, 2024

How about simply codifying that an implementation MUST NOT cache an ActionResult with exit_code != 0? Would that still be too disruptive for v2? (I feel like this is a pretty glaring omission - caching nonzero exits makes it very easy for a flaky test to poison a cache!)

from remote-apis.

sluongng commented on September 24, 2024

For completeness, there's an option 3, which is to let the client explicitly mark mandatory outputs in the Command message. (After some spelunking I did yesterday, it turns out that's what the Google-internal RE equivalent does.)

I personally don't see the point in doing that. Clients already can't necessarily trust the ActionResult they receive. Even if the server does some checking on its end, there is no absolute guarantee that a call to GetActionResult() will yield an ActionResult message that contains entries for all the paths you're interested in. Clients need to do some form of error handling based on that anyway.

The key differentiation here is for the server to know how the client would validate the ActionResult.
If server could validate the result in the same way client does, then storing the result in AC would be relatively safe.

If the only validation happens on the client side, then by that time, the AC entry might have already been written. We don't really provide an API in the spec for the user to purge a specific AC entry then, thus poorer UX.

from remote-apis.

EdSchouten commented on September 24, 2024

The issue is that it's virtually impossible for the server to validate the ActionResult. File existence may not be sufficient. For example, I've seen cases where programs terminate with exit zero, even though they were only halfway done. Even worse: I have seen workers with memory corruption, yielding output files containing bit flips. In order for a server to detect those cases based on generated outputs, you need something far stronger than simply checking whether a file exists. Regex matching of file contents? Invocation of a separate validation tool? It all leads to a path with no clear outcome.

If the only validation happens on the client side, then by that time, the AC entry might have already been written. We don't really provide an API in the spec for the user to purge a specific AC entry then, thus poorer UX.

With the use case discussed above, there is no need to actually purge the AC entry. The client can just ignore it and rerun the action. This causes the AC entry to be overwritten. That said, if we really want to offer some kind of special API for removing AC entries, this is worth discussing.

from remote-apis.

sluongng commented on September 24, 2024

For the sake of easier demonstration of the issue, I have created an example repo here https://github.com/sluongng/bb-rbe-test/blob/sluongng/invalid-ac/BUILD

genrule(
    name = "temp",
    outs = [
        "1.txt",
        "2.txt",
    ],
    cmd = """
    touch $(execpath 1.txt)
    exit 0
    """,
    exec_properties = {
        # "workload-isolation-type": "firecracker",
    },
)

Building this remotely with our server currently will make Bazel pew out

Invalid action cache entry ec4b980964439d5ca71081dc87a30e7a881e66e7af9bb9d6924cbf012edcf1dc: expected output 2.txt does not exist.

However, this ActionResult is written into our AC for subsequent builds to re-use because the exit code was zero.

With the use case discussed above, there is no need to actually purge the AC entry. The client can just ignore it and rerun the action. This causes the AC entry to be overwritten.

Wouldn't a rerun yield the same (error-nous) cached result from AC? Or are you suggesting that the client (user) should fix their build rules, generate a new Command, and then execute it in a new Action? Perhaps I am missing something here.

from remote-apis.

sluongng commented on September 24, 2024

These are valuable feedbacks with great clarity. 🙏
Much appreciated.

Do you think it's worth keeping this issue open for a V3 discussion? or should we close it?

from remote-apis.

tjgq commented on September 24, 2024

All things considered, I'm comfortable with the status quo of allowing partial outputs and still caching the respective results. But I do think we need to codify that results with exit_code != 0 must not be cached (otherwise, a lot of things would probably break). I'm hoping that's small enough for v2, and added it to the agenda for the next REAPI meeting.

from remote-apis.

EdSchouten commented on September 24, 2024

@tjgq Then we're actually going back and forth on this. In the past it wasn't permitted to write AC entries with exit_code != 0, but some people outside of Bazel explicitly asked for this. (Buildstream? Not sure.)

Buildbarn still uses the historical behaviour of only writing AC entries in the case of exit_code == 0. The reason being that I write those AC entries into the CAS instead, inside a message named HistoricalExecuteResponse. This allows me to obtain a stable link to a historical build failure for people to inspect.

from remote-apis.

tjgq commented on September 24, 2024

@tjgq Then we're actually going back and forth on this. In the past it wasn't permitted to write AC entries with exit_code != 0, but some people outside of Bazel explicitly asked for this. (Buildstream? Not sure.)

Oh, I wasn't aware of that bit of history. I suppose we could at best make it configurable, then? (e.g. an ExecuteRequest.no_cache_exit_nonzero field)? I think this behavior is pretty much required to use Bazel effectively: since we use the exit code to determine test success, caching nonzero exits means flaky tests can poison the cache. (I'm less sure that it matters for non-test actions.)

from remote-apis.

Protect against partial actions in the cache about remote-apis HOT 24 OPEN

Comments (24)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent