
remote-apis

This repository contains a collection of APIs which work together to enable large-scale distributed execution and caching of source code and other inputs. It describes how to upload inputs, request execution, monitor for results, and cache those results. Its overall aim is to enable large-scale parallel executions that wouldn't be feasible on a single system, while minimizing the number of uploads and executions needed by storing data in a content-addressable format and caching results.

The Remote Execution API is an API that, at its most general, allows clients to request execution of binaries on a remote system. It is intended primarily for use by build systems, such as Bazel, to distribute build and test actions through a worker pool, and also to provide a central cache of build results. This allows builds to execute faster, both by reusing results already built by other clients and by allowing many actions to be executed in parallel, in excess of the resource limits of the machine running the build.

The Remote Asset API is an API to associate Qualifiers and URIs to Digests stored in Content Addressable Storage. It is primarily intended to allow clients to use semantically relevant identifiers, such as a git repository or tarball location, to get the corresponding Digest. This mapping may be pushed by a client directly, or dynamically resolved and added to CAS by the asset server when fetched by a client.

The Remote Logstream API is an API supporting ordered reads and writes of LogStream resources. It is intended primarily for streaming the stdout and stderr of ongoing Action executions, enabling clients to view them while the Action is executing instead of waiting for its completion.

API users

There are a number of clients and services using these APIs; they are listed below.

Clients

These tools use the Remote Execution API to distribute builds to workers.

Servers

These applications implement the Remote Execution API to serve build requests from the clients above.

Workers

Servers generally distribute work to a fleet of workers. The Remote Worker API defines a generic protocol for worker and server communication, although this API is considered too heavyweight for most use cases. Because of that, many implementations have designed their own protocols. Links to these APIs are provided below as a reference. Adhering to any one of these protocols is not a requirement.

API Community

The Remote Execution APIs group hosts discussions related to the APIs in this repository.

Interested parties meet monthly via VC to discuss issues related to the APIs, and several contributors have organized occasional meetups, hack-a-thons, and summits. Joining the email discussion group will automatically add you to the Google Calendar invite for the monthly meeting.

Dependencies

The APIs in this repository refer to several general-purpose APIs published by Google in the Google APIs repository. You will need to refer to packages from that repository in order to generate code using this API. If you build the repository using the included BUILD files, Bazel will fetch the protobuf compiler and googleapis automatically.

Using the APIs

The repository contains BUILD files to build the protobuf library with Bazel. If you wish to use them with your own Bazel project, you will likely want to declare cc_proto_library, java_proto_library, etc. rules that depend on them.

Other build systems will have to run protoc on the protobuf files manually and link in the googleapis and well-known proto types.

Go (for non-Bazel build systems)

This repository contains the generated Go code for interacting with the API via gRPC. Get it with:

go get github.com/bazelbuild/remote-apis

Import it with, for example:

repb "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"

Development

Enable the git hooks to automatically generate Go proto code on commit:

git config core.hooksPath hooks/

This is a local setting, so applies only to this repository.


remote-apis's Issues

Execute keepalive client expectations

We should clarify the expected behavior in the API for streaming responses that are meant to ensure an operation is still in progress when there are long intervals between reported status changes. Namely, that operations will be updated with done = false, the operation name in place, and metadata that is either unchanged since the last update or absent.

Add properties for files

In order to preserve file properties such as timestamps and permission bits as metadata, I propose adding the concept of NodeProperties as repeated key/values on FileNodes, DirectoryNodes, and SymlinkNodes. This would allow servers to enrich directory trees and is similar to the Platform properties in #38. For example, in the Directory proto:

{
  files: [
    {
      name: "bar",
      digest: {
        hash: "4a73bc9d03...",
        size: 65534
      },
      "properties": [
        {
          "mtime": "2019-08-12 10:00:00.000000000 +0100"
        }
      ]
    }
  ]
}

Ambiguity extracting instance names from resource names

remote_execution.proto describes instance names but there appears to be a small hole in the specification:

the `instance_name` is an identifier, possibly containing multiple path segments,

There are no further restrictions, so the following would appear to be a valid instance name: uploads/6c92172c-8064-4351-93a2-640d5e8761fe/blobs/187d384348c73a2c0246a42fd061167039e551c1fe8c24a51d9538f4536fa72c/1034

In which case, if I sent this resource name:

/uploads/6c92172c-8064-4351-93a2-640d5e8761fe/blobs/187d384348c73a2c0246a42fd061167039e551c1fe8c24a51d9538f4536fa72c/1034/uploads/6c92172c-8064-4351-93a2-640d5e8761fe/blobs/187d384348c73a2c0246a42fd061167039e551c1fe8c24a51d9538f4536fa72c/1034/uploads/6c92172c-8064-4351-93a2-640d5e8761fe/blobs/187d384348c73a2c0246a42fd061167039e551c1fe8c24a51d9538f4536fa72c/1034

What is the actual instance name and what is the hash of the object requested? I think the instance name could be either one or two repetitions of the first string I posted.

Because anything after the 'size' field is ignored, you cannot pattern match from the right.

I ask because we have an implementation of a remote execution server, and it attempts to determine an optional instance name from the resource name. I'm wondering how to get this code exactly correct.
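To make the ambiguity concrete, here is a rough sketch (not from the specification) of a parser that anchors on the first "uploads" segment. It only yields a unique answer if instance names are forbidden from containing an "uploads" segment, which is exactly the restriction the spec does not state.

package resourcename

import (
    "fmt"
    "strconv"
    "strings"
)

// parseUploadResource splits a ByteStream upload resource name of the form
// {instance_name}/uploads/{uuid}/blobs/{hash}/{size}[/...]. It anchors on the
// FIRST "uploads" segment, so a multi-segment instance name that itself
// contains "uploads" would be split differently by a right-anchored parser.
func parseUploadResource(name string) (instance, hash string, size int64, err error) {
    segs := strings.Split(strings.TrimPrefix(name, "/"), "/")
    for i, s := range segs {
        if s == "uploads" && i+4 < len(segs) && segs[i+2] == "blobs" {
            size, err = strconv.ParseInt(segs[i+4], 10, 64)
            return strings.Join(segs[:i], "/"), segs[i+3], size, err
        }
    }
    return "", "", 0, fmt.Errorf("not a valid upload resource name: %q", name)
}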

Are outputs relative to exec root or working directory?

The documentation for output paths makes several references to output paths being relative to the action's "working directory". I interpret this as outputs being relative to the "working directory" as specified in the command.

Example:

exec root: "/foo"
working dir: "bar"
output: "baz.out"

absolute path of output "/foo/bar/baz.out"

However, after reading the Go client in remote-api-sdks, I am pretty sure that this code treats outputs as being relative to the exec root, not the working directory of the command.

So either they are relative to the command's working dir, in which case there's a bug to be filed against remote-api-sdks, or they are relative to the exec root, in which case the proto documentation should be fixed to remove the ambiguity.

Clarify BatchUpdateBlobs semantics

We're discussing whether the reply should have the same digests in the same order as the request, or whether it's ok to reorder the replies, or elide replies. I don't have a strong opinion either way, but it would be good to clarify what the minimum requirements are.

Clarify RW state diagram to allow direct transition to "completed" state

At the API level, it is not required for bots to transition a task to "assigned" before transitioning to "completed." It is acceptable for the bot to transition the task directly to the "completed" state.

This should be clearly documented in the RW Google Doc (and eventually markdown).

Clarify what a client should do on FAILED_PRECONDITION

As discussed with @ola-rozenfeld, we need to clarify what the client should do in case of FAILED_PRECONDITION

https://github.com/bazelbuild/remote-apis/blob/master/build/bazel/remote/execution/v2/remote_execution.proto#L101

In particular, for any blobs that are reported as MISSING, the client should, if a blob is a Directory, ensure that the entire tree rooted at that directory is going to be present in the CAS prior to re-submission. This practically involves a call to FindMissingBlobs for anything transitively referred to by the directory.
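For illustration, a rough sketch of the client-side extraction step, assuming the violation type "MISSING" and subjects of the form "blobs/{hash}/{size}" as described in remote_execution.proto. The re-upload and the FindMissingBlobs pass over the affected subtrees would follow before Execute is retried.

package client

import (
    "google.golang.org/genproto/googleapis/rpc/errdetails"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// missingSubjects extracts the subjects of all "MISSING" violations from a
// FAILED_PRECONDITION error returned by Execute. Each subject names a blob
// that must be re-uploaded; for Directory blobs the client should also verify
// the whole subtree via FindMissingBlobs before re-submitting the action.
func missingSubjects(err error) []string {
    st, ok := status.FromError(err)
    if !ok || st.Code() != codes.FailedPrecondition {
        return nil
    }
    var subjects []string
    for _, d := range st.Details() {
        pf, ok := d.(*errdetails.PreconditionFailure)
        if !ok {
            continue
        }
        for _, v := range pf.GetViolations() {
            if v.GetType() == "MISSING" {
                subjects = append(subjects, v.GetSubject())
            }
        }
    }
    return subjects
}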

Clarification: Ordering of entries in BatchReadBlobsResponse

The API allows a client to request a list of blobs using BatchReadBlobs(). That sends a BatchReadBlobsRequest with a list of digests to the server, which in turn replies with a BatchReadBlobsResponse.

That response will contain, in the best-case scenario, a list of the blobs that the client asked for. Since the specification states that blobs should be allowed to fail separately, it could happen that not all of the requested blobs are contained in the response; there will still be an entry for each of them in the reply, with the status set to an error code.

The current specification, however, does not specify what should happen with the ordering of the blobs contained in the BatchReadBlobsResponse.

Should there be any guarantees or expectations regarding the relation between the ordering of the blobs in the response and the order of their corresponding requests?

Clarification on case-sensitivity of platform properties

Are the platform properties (names and values) case-sensitive? As far as I can tell, there's no explicit specification in the proto file.

My assumption is that they are, given the rendering of "OSFamily" as well as the following statement in platform.md:

Multiple values are not allowed and an exact match is required.

Prefix non-standardised platform keys

At the London Build Meetup @sstriker suggested to prefix all platform keys with the backend type, e.g. rbe-container-image instead of container-image (enforce namespaces).

Sander, can you please remind me the rationale for this?

V3 idea: No longer allow Digest.size_bytes <= 0

Right now it is allowed to create Digest messages that have size_bytes == 0, referring to the empty blob. In #131 we're extending the protocol to require that the empty blob is always present, because it can be derived trivially. I personally find this a bit problematic:

  • It makes the protocol less regular and consistent.
  • Naïvely implemented client/servers will get this wrong. For example, what is FindMissingBlobs() on {hash: "e984d2bdd07318c4e29f7a2ceea4a9e4569e2d8e695a953a4e2df6f69fbdec95", size_bytes: 0} supposed to do? Report existence, because it has size zero? Or should it report absence, because the empty blob actually has SHA-256 sum e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855?
  • When digests for empty blobs are embedded in other messages, they still waste space. We still end up storing a SHA-256 sum.

I would like to suggest that we simply deny the existence of Digest messages with size_bytes <= 0. In any field where the empty blob needs to be referenced, we should use null instead. This means that the optimization Bazel performs of not loading the empty blob becomes the norm, as there is no longer any way to even address the empty blob.
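For context, the client-side special-casing that this proposal would make the norm looks roughly like the following sketch (not prescribed by the current spec):

package client

import (
    "context"

    repb "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
)

// findMissing wraps FindMissingBlobs but never asks the server about
// size-zero digests, treating the empty blob as always present. Under the
// proposal above, such digests could not be expressed at all.
func findMissing(ctx context.Context, cas repb.ContentAddressableStorageClient, instance string, digests []*repb.Digest) ([]*repb.Digest, error) {
    var nonEmpty []*repb.Digest
    for _, d := range digests {
        if d.GetSizeBytes() > 0 {
            nonEmpty = append(nonEmpty, d)
        }
    }
    if len(nonEmpty) == 0 {
        return nil, nil
    }
    resp, err := cas.FindMissingBlobs(ctx, &repb.FindMissingBlobsRequest{
        InstanceName: instance,
        BlobDigests:  nonEmpty,
    })
    if err != nil {
        return nil, err
    }
    return resp.GetMissingBlobDigests(), nil
}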

Protect against partial actions in the cache

Currently, when an execution fails, we allow returning a partially-populated ActionResult message. RBE experienced a failure where these partial results were making it into the cache, and because the exit_code field was unpopulated, it was interpreted as 0. Subsequent builds read the result from the cache and interpreted it as successful, but since it had no output files the build failed later when the requisite file was not present. It's certainly believable that a failure like this could reoccur in RBE or in other implementations, so we would like to put protections in place against it.

As I understand it, Bazel's architecture does not easily lend itself to having outputs be mandatory, which is why it can only detect the failure downstream. This is why all outputs are considered optional at the API level; even trying to separate out optional and mandatory outputs on the Bazel side might prove difficult.

One suggestion was to require that all action results have at least one output file or directory to be considered valid; an action that has no meaningful output files could add a dummy output and touch it on the bot side (or even include it as an input) to ensure that the empty ActionResult is not propagated.

V3: Clarify relationship between APIs and Endpoints

As discussed in #116 , and previous places.

The REAPI currently does not specify which APIs and resources must be accessible from the same endpoint (domain), and which can be varied independently. For example, hosting the Execution API on a different endpoint from the CAS. A few clients currently support varying the endpoints along specific bounds, but not all consistently with each other, and no client allows full flexibility in how APIs/resources and endpoints can be paired. This ambiguity routinely comes up and causes problems for implementers.

There are two directions I see that we could go as a community. One is to say that clients should expect all APIs available at a single endpoint, period. The other is to define a clear set of different 'virtual' endpoints (Service APIs, perhaps scoped to a particular set of resources, e.g. like the Execution service, plus 'CAS blob bytestreams', 'streamed bytestreams', etc), specify how they're bucketed in terms of which are allowed to vary independently and which should be co-hosted, and recommend a consistent way to configure clients and/or let them discover it.

The Capabilities API will also need to be revised in light of whatever is decided here.

cc @edbaunton @sstriker

ActionCache: explicit delete operation?

The action cache currently does not have a delete operation. According to GCP, the RBE permissions include a delete permission. One could consider an update request without a result as a delete. However, this isn't clear from the documentation. I think I'd have a slight preference for having a separate delete operation. WDYT?

*:go_default_library should be alias of *:xyz_go_proto

I'm seeing the following error when building a go binary that depends on semver.proto and remote_execution.proto:

$ bazel build //:test
INFO: Invocation ID: 5938ca3c-1dae-404a-833e-20b364aed4bb
INFO: Analyzed target //:test (1 packages loaded, 2 targets configured).
INFO: Found 1 target...
INFO: From GoLink darwin_amd64_stripped/test:
link: warning: package "github.com/bazelbuild/remote-apis/build/bazel/semver" is provided by more than one rule:
    //build/bazel/semver:semver_go_proto
    //build/bazel/semver:go_default_library
Set "importmap" to different paths in each library.
This will be an error in the future.
Target //:test up-to-date:
  bazel-bin/darwin_amd64_stripped/test
INFO: Elapsed time: 2.112s, Critical Path: 1.76s
INFO: 2 processes: 2 darwin-sandbox.
INFO: Build completed successfully, 3 total actions

BUILD:

go_binary(
    name = "test",
    srcs = [
        "test.go",
    ],
    deps = [
        "//build/bazel/remote/execution/v2:go_default_library",
        "//build/bazel/semver:semver_go_proto",
    ],
)

test.go:

package main

import ()

func main() {
}

Design doc link

Is the main google doc design doc still maintained to match the API? If so, should it be linked here? I am pointing some people at the API and was surprised that it isn't linked. Alternatively, perhaps it should be moved into the repository in Markdown so that it can be updated alongside the proto?

Clarification: GetTree() pages and relation with gRPC max. message size

The GetTreeRequest parameter of the GetTree() call allows the client to set a page_size limit. According to the spec, "the server may place its own limit [...] and require the client to retrieve more items using a subsequent request".

Servers read the page_token field in requests and include a next_page_token in responses. That last token "[i]f present, signifies that there are more results" and "[i]f empty, [...] the last page of results".

My specific questions related to this are what would happen in the case where a GetTreeResponse message exceeds the maximum gRPC message size, and whether it would be correct to make a distinction between logical pages (defined by a token value) and physical pages (gRPC messages in the stream).

So, for example, if a client makes a GetTree() call without setting a limit, and the server is happy to return the whole tree, could the server split it into multiple GetTreeResponses without setting a next_page_token and expect the client's implementation to keep reading the stream?

That same scenario might also happen if the page_size given by the client produces a GetTreeResponse larger than the gRPC maximum message size. Could then a page span multiple GetTreeResponses? Or should the server keep a 1:1 relationship between pages and messages?

Thank you, and sorry if this is something that is totally clear from reading the specification.
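For what it's worth, here is a sketch of a client that tolerates both interpretations: it drains each gRPC stream to EOF and additionally follows next_page_token. This reflects one possible reading of the spec, not an answer to the question.

package client

import (
    "context"
    "io"

    repb "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
)

// readTree collects every Directory under rootDigest. It reads each stream to
// EOF (so a server splitting a logical page across several GetTreeResponses
// still works) and issues a follow-up request whenever the last response
// carried a non-empty next_page_token.
func readTree(ctx context.Context, cas repb.ContentAddressableStorageClient, instance string, rootDigest *repb.Digest) ([]*repb.Directory, error) {
    var dirs []*repb.Directory
    pageToken := ""
    for {
        stream, err := cas.GetTree(ctx, &repb.GetTreeRequest{
            InstanceName: instance,
            RootDigest:   rootDigest,
            PageToken:    pageToken,
        })
        if err != nil {
            return nil, err
        }
        nextToken := ""
        for {
            resp, err := stream.Recv()
            if err == io.EOF {
                break
            }
            if err != nil {
                return nil, err
            }
            dirs = append(dirs, resp.GetDirectories()...)
            nextToken = resp.GetNextPageToken()
        }
        if nextToken == "" {
            return dirs, nil // last page reached
        }
        pageToken = nextToken
    }
}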

Base64 encoding of RequestMetadata

Looking at the RequestMetadata proto:

// * contents: the base64 encoded binary RequestMetadata message.

https://github.com/bazelbuild/remote-apis/blob/master/build/bazel/remote/execution/v2/remote_execution.proto#L1416

It states that the contents should be base64 encoded binary. I don't think this is what Bazel currently does:

https://github.com/bazelbuild/bazel/blob/master/src/main/java/com/google/devtools/build/lib/remote/util/TracingMetadataUtils.java#L65

To me it looks as if it is just sending down a raw protobuf. Would it make sense to change the docs to reflect what Bazel currently does as I'm not sure it makes sense to encode it as base64?
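As far as I understand, gRPC metadata keys ending in "-bin" carry raw binary values at the application level and are base64-encoded by the transport on the wire, which may be how the two views reconcile. A sketch of attaching the metadata in Go, assuming the header name documented in the proto:

package client

import (
    "context"

    repb "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
    "github.com/golang/protobuf/proto"
    "google.golang.org/grpc/metadata"
)

// withRequestMetadata attaches a serialized RequestMetadata to the outgoing
// context. Because the key ends in "-bin", grpc-go treats the value as binary
// and the transport applies base64 on the wire; application code only ever
// handles raw proto bytes.
func withRequestMetadata(ctx context.Context, md *repb.RequestMetadata) (context.Context, error) {
    b, err := proto.Marshal(md)
    if err != nil {
        return nil, err
    }
    return metadata.AppendToOutgoingContext(ctx,
        "build.bazel.remote.execution.v2.requestmetadata-bin", string(b)), nil
}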

Should we be making version 2.0.0?

The initial API version we released was labeled as v1test, i.e. a prerelease version of v1. The name was based on Google practice while we were still largely following Google's API processes. But in practice, the API has evolved faster.

I picked v0.2 for the API when publishing this repository because we needed a number and that seemed like a good choice---what Semver would recommend for the second in-development version. Semver says "If you have a stable API on which users have come to depend, you should be 1.0.0. If you’re worrying a lot about backwards compatibility, you should probably already be 1.0.0." Given the immense amount of dancing around backwards compatibility that we are currently doing, this implies to me that the current API is effectively already 1.0.0.

So I think that perhaps we should be calling the new API 2.0.0, and commit to no major breaking changes. In terms of the proto package path, we should just go with "v2" with no minor number, as unlike with prerelease versions, the minor number changing cannot introduce breaking changes.

V3 idea: Convert Digest.hash into a oneof for each hashing algorithm

Consider the case that you have a storage service that supports multiple hashing algorithms. If someone downloads/uploads a blob, you can currently derive the hashing algorithm that was used by checking the length of Digest.hash. 32 hexadecimal characters? Likely MD5. 64 of them? Probably SHA-256. This allows a storage service to do integrity checking.

This approach becomes problematic if people want to use other hashing algorithms that use the same size. Even worse, modern hashing algorithms like BLAKE3 use a XOF (Extendable Output Function) where the digest length is user configurable. There is no proper way you can even derive the hashing algorithm in that case.

One way would be to simply remove the hash field, replacing it by a oneof:

message Digest {
  oneof hash {
    string sha256 = 1;
    string blake3 = 2;
    ...
  }
  ...
}

Unfortunately, not all Protobuf implementations (e.g., the Go one) guarantee that oneof fields are serialized in a stable way. Maybe we need to simply use separate fields, emulating a oneof kind of construct at a higher level.

Lack of description on what the permissions on files in the input root are

(Note: This is a continuation of a remark I made in #40)

It seems to be the case that the protocol doesn't document what the permissions of files and directories in the input root are in relation to the credentials of the build action.

  • Are build actions permitted to overwrite input files? When using Sandboxfs, this would be easy to achieve. Without Sandboxfs it's also possible, but it has the downside that optimizations built around caching input files and hardlinking them are out of the question.
  • Related to the previous question, is it permitted to hardlink input files to some other location? Recent versions of Linux have fs.protected_hardlinks enabled by default. This implies that if input files are read-only (due to the use of hardlinking caches), we cannot guarantee that the kernel will allow the creation of hardlinks to input files.
  • In the general sense: is there even any guarantee that hardlinks can be created between two distinct directories in the input root? For example, may a worker place the output directories on a separate file system (tmpfs)? If so, this means you can't hardlink input files into an output directory.
  • What are permissions on directories? Where may the build action create temporary files? In any directory in the input root, or only inside of a directory containing one or more outputs?

Buildbarn's workers don't use anything like FUSE (yet!), for the reason that I initially aimed at using Buildbarn on Kubernetes, where you can't simply make mounts inside of containers. After giving it enough tweaks, I eventually concluded that:

  • All directories in the input root need to be writable to appease the build rules out there. Build rules should be allowed to rename and remove any file in the input root, and to create files in any directory in the input root.
  • Input files may be read-only. They may be replaced by removing them first and creating a new one with the old name.
  • Disallowing input files to be hardlinked (fs.protected_hardlinks == 1) causes a very small number of build rules to break, but those may be easy to fix.

Do we want to document this in the .proto file somewhere?

go get fail

go get fails with the recent commit

..
go: extracting github.com/bazelbuild/remote-apis v0.0.0-20190524141337-c0682f068a60
-> unzip /usr/local/google/home/ukai/go/pkg/mod/cache/download/github.com/bazelbuild/remote-apis/@v/v0.0.0-20190524141337-c0682f068a60.zip: case-insensitive file name collision: "BUILD" and "build"

loads in //build/bazel/remote/execution/v2 induce dependencies

The loads of grpc and rules_go starlark methods mandate that any workspace that references the //build/bazel/remote/execution/v2 package - even to non-language targets - provide both the rules_go and grpc repositories. This will break implementors that do not have these dependencies already, and require that wholly unrelated implementations (i.e. buildfarm in Java) include declarations for these go and grpc repos, that must match by name.

If remote-apis is going to continue down this path of providing language implementations (C++ and go so far), it should provide an extension library for initializing the repo with these dependencies activated, and personally I would like to have these available in separate packages by language, if only to make it possible not to import the growing list of supported language rules toolchains.

Execute keep alive response

Clarify and document the behavior for using an operation response with done=false. The expected behavior of both server and client has not been documented, and should be detailed.

Client behavior should be to ignore the update and continue to wait for responses.

Server behavior should be to emit this update on some reasonable interval - with or without updates, perhaps upon some (maximally limited) client request period, since this will vary by client and should be tunable to address connectivity issues over multi-hop routes.

Update proto_library import for 1.0.0

remote-apis will not build against Bazel 1.0.0 due to the now-default --incompatible_load_proto_rules_from_bzl. The necessary load statements will need to be added in remote/execution/v2 and semver.

V3 idea: Let the CAS be encrypted + the AC be encrypted/signed

Right now the CAS and the AC are not encrypted and/or signed. This means that systems that store the data have full access to the entire data set. This is bad for confidentiality. Even though the CAS is immutable, the AC can easily be tampered with.

At first glance, it would be trivial to encrypt the CAS: simply apply some symmetrical encryption on top of it and only let clients and workers have access to the key. Unfortunately, this wouldn't allow storage infrastructure to implement GetActionResult() anymore, as ActionResult messages reference Tree objects. GetActionResult() is supposed to touch everything referenced by the Tree. The Tree would need to be decomposed into an encrypted and a non-encrypted portion.

Encrypting the CAS also doesn't allow us to build schedulers that don't have the encryption key, as those need to parse the Action and the Command to be able to extract do_not_cache and platform properties to route the request properly.

Platforms Standardisation

From the previous Buildfarm meeting there was some discussion around providing standardisation for the Platforms specification in the Remote APIs. This issue is to track the follow-up and see where we should go.

The benefit we gain by standardising these platform specifications is interoperability between services. For example, all services could implement the DockerImage specification in the same way, and one could seamlessly switch between, e.g., RBE and BuildGrid without having to respecify the platform configuration.

Current Status

Per my understanding, there is currently a mismatch in the way that platforms are specified between the Remote Execution API and the Remote Worker API. The REAPI simply provides an alphabetical key/value list of properties that must be satisfied by the servicing worker (owning bot). On the other side, the BotsInterface provides 3 separate attributes for describing the executing worker in a richer way: properties, configurations and devices.

The RWAPI properties are matched against the platform specification exactly to find an appropriate worker. The rest are all 'hints'.

The RWAPI also provides the following conventions:

  • Case sensitivity and camelcase
  • Standardised keys begin with an uppercase letter
  • Standardised keys have standardised values (these will be listed in the proto)

Standardisation

So far we have come up with the following use cases and keys for standardisation; under this ticket hopefully that can be formalised into the proto. The list is specified here.

Please update this issue if clarifications/changes are suggested to those keys specified here.

Provide recommendations for platform properties labelling

Implementors of the remote APIs should have some guidance for useful identification of action runtime specifications like cores, memory, disk, bandwidth, and bounds (min/max) for all of those.

This issue is a discussion point for these recommended field names, values, and the ability to encapsulate and translate them throughout an operation's lifecycle in any given implementation.

V3 idea: Let Digest.hash use 'bytes' instead of 'string'

Right now the Digest message uses a string to store the checksum of the object. This is wasteful, because for the hashing algorithms that we use, the output is always a sequence of bytes. Using a bytes field cuts the storage space in half (for that field; not necessarily for the full object).
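The space argument in concrete terms (SHA-256 shown; the same factor applies to any hex-encoded hash):

package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

func main() {
    sum := sha256.Sum256([]byte("example blob"))
    fmt.Println(len(sum[:]))                     // 32 bytes as a raw `bytes` field
    fmt.Println(len(hex.EncodeToString(sum[:]))) // 64 bytes as today's hex `string` field
}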

Pre-commit hook fails with bazel ^27.0

I suspect the project will require some modifications due to changes in Bazel's Go rules. When the hook is triggered and the build is executed with a Bazel release greater than 26.0, the following error occurs:

ERROR: /home/traveltissues/.cache/bazel/_bazel_traveltissues/7ef9b178bea50aa11f4ea96ef499099e/external/io_bazel_rules_go/BUILD.bazel:62:1: in go_context_data rule @io_bazel_rules_go//:go_context_data: 
Traceback (most recent call last):
	File "/home/traveltissues/.cache/bazel/_bazel_traveltissues/7ef9b178bea50aa11f4ea96ef499099e/external/io_bazel_rules_go/BUILD.bazel", line 62
		go_context_data(name = 'go_context_data')
	File "/home/traveltissues/.cache/bazel/_bazel_traveltissues/7ef9b178bea50aa11f4ea96ef499099e/external/io_bazel_rules_go/go/private/context.bzl", line 396, in _go_context_data_impl
		cc_common.configure_features(cc_toolchain = cc_toolchain, reque..., ...)
Incompatible flag --incompatible_require_ctx_in_configure_features has been flipped, and the mandatory parameter 'ctx' of cc_common.configure_features is missing. Please add 'ctx' as a named parameter. See https://github.com/bazelbuild/bazel/issues/7793 for details.
ERROR: Analysis of target '//build/bazel/remote/execution/v2:remote_execution_go_proto' failed; build aborted: Analysis of target '@io_bazel_rules_go//:go_context_data' failed; build aborted
INFO: Elapsed time: 0.358s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (1 packages loaded, 1 target configured)

(see bazelbuild/bazel#7793)

ActionResult inline restoration

I'd like to suggest resurrecting the inline content field in the file-output-related fields under ActionResult.

The cached-result path acts as a gatekeeper to all other action interactions, and serves to throttle extremely heavyweight activity: findMissingBlobs for a huge tree, possibly many [concurrent] writes, executes, and downloads. Minimizing additional requests in the optimized case of getActionResult is therefore of substantial benefit, since cached-result retrieval can outweigh the frequency of the missing-blob procedures by extraordinary factors of load (i.e. as many cached actions times builds as you have).

With an extension to the GetActionResultRequest to indicate inlining preferences (size limitations, selection mechanisms), we can accommodate ongoing efforts like https://github.com/buchgr/bazel/tree/minimize-downloads as well, and extending ActionResult to be capable of minimized tree delivery could substantially improve that content path as well.

WDYT?

api releases

IIUC this repository wants to use semantic versioning but so far there have been no releases. I suppose my question is simple: Does a plan exist to start doing releases?

Execute/ActionResult Rejection Response

Executions are capable of depending upon or producing invalid definitions or products, respectively.

Currently, the remote system has limited options for influencing the behavior of the client, exemplified here with bazel. This sequence is very complicated and boils down to:

  1. Any stock RemoteRetrier 'retriable' status codes in the ExecuteResponse status field trigger a waitExecution.
  2. A FAILED_PRECONDITION status in ExecuteResponse with an array of ExecuteResponse->Status->Details[PreconditionFailures] with only "MISSING" types triggers a restart of the execution loop, starting with ensureInputsPresent, if the outer execution retrier has not been exhausted.

If the execution retrier is exhausted, the client will revert to fallback behavior, either local execution or failing.

Currently, a remote instance may identify subsequent requests for the action (pursuant to RequestMetadata) and coordinate a short-circuited response (more like [2] above).

But this practice is awkward, and depends upon a reasonable retry count and hopefully no exponential backoff for any sub request. Can we provide/standardize a status like GOAWAY that indicates that an action is unsuitable for remote execution or caching?

Clarify OutputFile.path and OutputDirectory.path relativeness

In REAPI v1, OutputFile.path and OutputDirectory.path were both documented as:

message OutputFile {
// The full path of the file relative to the input root, [...]

message OutputDirectory {
// The full path of the directory relative to the input root, [...]

But now, in REAPI v2, the OutputDirectory.path documentation has changed to:

message OutputDirectory {
// The full path of the file relative to the working directory. [...]

There doesn't seem to be any good reason for OutputFile.path and OutputDirectory.path to be relative to different roots, nor to depend on a root (the working directory) that is defined outside of the current message's scope.

Is there any reason why OutputDirectory.path's relativeness was changed?

I think that this modification has been introduced by mistake.

SymlinkAbsolutePathStrategy enum should be wrapped in a message

Protobuf's enum scoping follows traditional C/C++ rules, where the enum name is not a new level of namespacing. This means the current definition of SymlinkAbsolutePathStrategy:

message CacheCapabilities {
  enum SymlinkAbsolutePathStrategy {
    UNKNOWN = 0;
    DISALLOWED = 1;
    ALLOWED = 2;
  }
  SymlinkAbsolutePathStrategy symlink_absolute_path_strategy = 5;

will define CacheCapabilities::ALLOWED, CacheCapabilities::DISALLOWED, etc. This is very confusing because the symbols indicate that caching itself is allowed/disallowed, instead of just this one particular feature.

I recommend the following structure instead:

message SymlinkAbsolutePathStrategy {
  enum Enum {
    UNKNOWN = 0;
    DISALLOWED = 1;
    ALLOWED = 2;
  }
}

message CacheCapabilities {
  SymlinkAbsolutePathStrategy.Enum symlink_absolute_path_strategy = 5;

Capabilities in multi-endpoints configuration

The REAPI describes four services: Execution (EXEC), ActionCache (AC), ContentAddressableStorage (CAS) and Capabilities. The services do have inter-dependencies, but the specification doesn't restrict or advise how separately a server implementation can or should expose them. Clients tend to support only some possible configurations:

  • Bazel has support for two (as far as I know):
    • EXEC + CAS + AC available at one end-point.
    • EXEC separated from CAS + AC.
  • BuildStream has support for any: three different endpoints can be specified for the three main services.
  • RECC has support for two (goal is to support any):
    • EXEC + CAS + AC available at one end-point.
    • EXEC + AC separated from CAS.

The Capabilities service allows requesting server capabilities at a given endpoint. The ServerCapabilities message contains two sets of capabilities:

  • CacheCapabilities relevant for CAS and AC.
  • ExecutionCapabilities relevant for EXEC.

The specification currently doesn't mention anything about how the server should advertise capabilities. Disparities in client expectations already exist. For example, Bazel expects all capabilities to be served at the EXEC endpoint (even if CAS + AC are separated), while BuildStream expects a Capabilities service to be available at every endpoint, but advertising only the capabilities relevant to the services hosted at that endpoint.

I think we should discuss and agree on how capabilities should be advertised by server implementations and how client implementations should query them depending on the endpoint configuration. Sensible conclusions should probably be part of the specification.

can't import v2: need semantic import versioning

with require github.com/bazelbuild/remote-apis v2.0.0 in go.mod
go claims

go: finding github.com/bazelbuild/remote-apis v2.0.0
go: finding github.com/bazelbuild/remote-apis v2.0.0
go: errors parsing go.mod:
/workspace/go.mod:10: require github.com/bazelbuild/remote-apis: version "v2.0.0" invalid: module contains a go.mod file, so major version must be compatible: should be v0 or v1, not v2

with require github.com/bazelbuild/remote-apis/v2 v2.0.0

go: github.com/bazelbuild/remote-apis/v2@v2.0.0: go.mod has non-.../v2 module path "github.com/bazelbuild/remote-apis" (and .../v2/go.mod does not exist) at revision v2.0.0

I think remote-apis's go.mod should say module github.com/bazelbuild/remote-apis/v2.

cf: https://github.com/golang/go/wiki/Modules#semantic-import-versioning

GetActionResult non-error cache miss response

The error classification response of NOT_FOUND for GetActionResult cache misses means that some response observers, including those of the circuit-breaker implementation available (but not currently in use) in bazel, cannot use success vs. failure as a signal for the reliability of remote availability. These requests represent a completely successful round trip through the service that produces an application-level meaningful response; lumping them in with all other errors, some of which can indicate a failure anywhere along the communication hierarchy, misrepresents the nature of a cache miss.

I suggest that the response should be wrapped in a GetActionResultResponse, with the present (hasActionResult() == false) interpretation via protobuf available as the proper means of determining a cache miss, rather than the RESTful error response of NOT_FOUND.
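For reference, this is roughly how a client has to distinguish a cache miss today, by inspecting the status code rather than a response field (a sketch only, not the proposed API):

package client

import (
    "context"

    repb "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// lookup returns (result, hit, err). A NOT_FOUND error is folded into a cache
// miss here; under the wrapped-response proposal above, the miss would instead
// be an ordinary successful response with no ActionResult set.
func lookup(ctx context.Context, ac repb.ActionCacheClient, instance string, action *repb.Digest) (*repb.ActionResult, bool, error) {
    res, err := ac.GetActionResult(ctx, &repb.GetActionResultRequest{
        InstanceName: instance,
        ActionDigest: action,
    })
    if status.Code(err) == codes.NotFound {
        return nil, false, nil // cache miss, not an infrastructure failure
    }
    if err != nil {
        return nil, false, err
    }
    return res, true, nil
}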

declared but never created output directories

Hi,

when debugging a change with rules_kotlin I found that some actions declare output directories that they never create. This special case works well in Bazel local execution because the local strategies will create all output (and input) directories before even running the action.

I ran rules_kotlin against several remote execution backends and found that they behave differently. Some will include the never created directories as empty directories in the ActionResult and others will only include in the ActionResult what the action actually created.

The API isn't very specific on how to behave in this situation. It says

A list of the output directories that the client expects to retrieve from the action

I think we should specify this better and remove the ambiguity. I believe there are three options:

  1. Actions that don't create all declared outputs should fail.
  2. The remote execution system should declare never created directories (by the action) as empty directories in the action result.
  3. It's fine to only return a subset of the declared outputs directories.

At this point, I would argue for (3) as the sanest behavior, because it would also work for output files and not break existing clients (who are free to enforce this by themselves). I think this behavior is a bit unsatisfying, though, because the names of the expected outputs are part of the action key computation.

Thoughts?

P.S.: Somewhat related but not the same issue: bazelbuild/bazel#6393

[remote asset api] resource_type qualifier is too vague

The remote asset api refers to resource_type as a recommended qualifier for resolving ambiguities, but the Qualifier Lexicon file merely has this circular definition: "resource_type: This describes the type of resource." It's the only standard qualifier in the doc without examples; I think we should add some.

Clarify whether nested output directories are permitted

The comments for Command.output_files state that an output file cannot be the parent of another output file, but there is not a corresponding comment for directories.

Please add a comment to clarify which, if any, of these output layouts are valid:

Output directory is child of another output directory:

output_directories: "upper"
output_directories: "upper/lower"

Output file is child of output directory:

output_directories: "upper"
output_files: "upper/lower"

Output directory is child of output file (possible if the file is a symlink):

output_files: "upper"
output_directories: "upper/lower"

Remove output type specification by clients

Currently, the client needs to specify on each output whether it is an output_file or an output_directory. This turned out to be too restrictive -- some build tools don't necessarily know what an action produces until it is done.

We could fix this as a non-breaking change in v2; it would be a bit challenging, but possible, if needed (add a new outputs field, keep supporting all fields on the servers, stop type-checking the outputs on the servers, change the clients once servers add support for the new field).

But it will be much simpler to just change this in v3.

Move Remote Workers API into this repository

Per discussions in the monthly meetup, we're going to publish the remote workers API in this repository. Moving the API here will make it easier to handle subsequent changes and clarifications via public issues and PRs rather than the current opaque, Google docs-based process.

For now, we'll move the API itself and provide a link to the existing docs. We expect the existing design doc to transition more gradually (converting it, including the comment threads, to markdown will probably be a decent amount of work) and we don't want to block downstream changes on that step.

use github.com/googleapis/googleapis directly

I am integrating this repository into my own Bazel project.

I am already using github.com/googleapis/googleapis as the @googleapis repository (the canonical one), which collides with the definition of the @googleapis repository for which a BUILD.googleapis file is vendored in external (the vendored one).

I looked through the differences. I noticed that the @googleapis repository vendored here mainly ships with cc_grpc_library targets on top of a few things that the canonical repo already has.

It would probably be best to consolidate the two. I would like to open the discussion a bit.
