tracemachina / nativelink

NativeLink is an open source high-performance build cache and remote execution server, compatible with Bazel, Buck2, Reclient, and other RBE-compatible build systems. It offers drastically faster builds, reduced test flakiness, and specialized hardware.

Home Page: https://nativelink.com

License: Apache License 2.0

Starlark 4.13% Rust 90.46% Python 0.30% Shell 0.97% Dockerfile 0.16% Nix 2.17% C++ 0.01% Go 1.80%
apache2 bazel buck2 build-automation build-system chromium ci content-addressable-storage free nix re-client remote-execution rust simulation

nativelink's People

Contributors

aaronmondal, adam-singer, aleksdmladenovic, allada, bclark8923, blakehatch, blizzardc0der, caass, chrisstaite, chrisstaite-menlo, cormacrelf, dependabot[bot], dolcetriade, eltociear, froody, harper-carroll, ibilalkayy, jaroeichler, jhpratt, krishmoran, marcussorealheis, matdexir, mhz5, nfarah86, renovate[bot], schahinrohani, steedmicro, triplekai, tyr-one, zbirenbaum


nativelink's Issues

Wrapping GrpcStore with FastSlowStore failure for AC

Now that GrpcStore supports forwarding to an upstream AcServer, I attempted wrapping it in a FastSlowStore to make a local cache and am getting the following error:

[2023-07-17T08:06:26.568Z ERROR ac_server] get_action_result Resp: 0.14148201 Some("0f3c5b706e843dc5323f8a21ed8ea59bd8d55c53310abf8b11471f0e1e95924c") Err(Error { code: Internal, messages: ["Action result not found", "Failed to get_part in get_part_unchunked", "---", "Writer was dropped before EOF was sent", "Failed to recv first chunk in collect_all_with_size_hint", "Failed to read stream to completion in get_part_unchunked"] })

This is obviously not as expected, so something is wrong with the implementation. I haven't dug into what's causing it yet.

benchmarks?

This isn't so much an issue, but I'm curious whether any benchmarks were run against any of the other popular Bazel remote cache implementations.

AC context lost in store trait

When forwarding to an upstream AC store via a GrpcStore, it uses the AcServer. However, if you wrap it in a fast_slow store (for example, to have a local cache), it loses the context and queries the CAS instead of the AC.

Inefficient upload of stderr/stdout for workers

Currently we upload stderr and stdout, wait for them to finish uploading, and only then start uploading the output files.

This is silly because just below we have a FuturesUnordered that we could add the futures to instead and upload them in parallel with the files.

Because these uploads currently run serially, this fix has a high chance of dramatically reducing upload time for fast-running tasks.

Offending code:
https://github.com/allada/turbo-cache/blob/master/cas/worker/running_actions_manager.rs#L628
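As a sketch of the suggested change (with a hypothetical upload_blob helper, not the actual running_actions_manager API), the stdout/stderr uploads could be pushed into the same FuturesUnordered as the file uploads so everything runs concurrently:

use futures::stream::{FuturesUnordered, StreamExt};

// `upload_blob` is a hypothetical stand-in for the real upload routines.
async fn upload_blob(_name: &'static str, _data: Vec<u8>) -> Result<(), String> {
    Ok(())
}

async fn upload_outputs(
    stdout: Vec<u8>,
    stderr: Vec<u8>,
    output_files: Vec<Vec<u8>>,
) -> Result<(), String> {
    // Push stdout/stderr into the same FuturesUnordered as the output files
    // so everything uploads concurrently instead of serially.
    let mut uploads = FuturesUnordered::new();
    uploads.push(upload_blob("stdout", stdout));
    uploads.push(upload_blob("stderr", stderr));
    for file in output_files {
        uploads.push(upload_blob("output_file", file));
    }
    // Drive all uploads to completion, failing fast on the first error.
    while let Some(result) = uploads.next().await {
        result?;
    }
    Ok(())
}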

EvictingMap doesn't remove the oldest entries

I was just looking into how EvictingMap determines which is the least recently used entry, and it appears the ordering is handled entirely by the lru::LruCache implementation: entries are not promoted when they are touch()'d, and they are not placed in the correct order when insert_with_time is used.
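For reference, here is a minimal demonstration of lru::LruCache's promotion semantics (assuming a recent version of the lru crate): get promotes an entry to most-recently-used while peek does not, so a touch() that only peeks will not affect eviction order.

use lru::LruCache;
use std::num::NonZeroUsize;

fn main() {
    let mut cache = LruCache::new(NonZeroUsize::new(2).unwrap());
    cache.put("a", 1);
    cache.put("b", 2);

    // `peek` does NOT promote "a"; it remains the least recently used entry.
    let _ = cache.peek("a");
    cache.put("c", 3); // evicts "a"
    assert!(cache.peek("a").is_none());

    // `get` DOES promote, so the other entry gets evicted instead.
    cache.put("a", 1);      // evicts "b"; cache now holds {"c", "a"}
    let _ = cache.get("c"); // promote "c" to most recently used
    cache.put("d", 4);      // evicts "a", not "c"
    assert!(cache.peek("c").is_some());
}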

Stop vendoring protos

Since rules_rust 0.25 introduced prost/tonic rules, it's now theoretically possible to remove the entire proto directory and trivially generate the current proto target with this:

rust_tonic_library(
    name = "remote_execution",
    proto = "@remote-apis//build/bazel/remote/execution/v2:remote_execution_proto",
)

Modulo some toolchain setup/configuration and adjustments to bring turbo-cache in sync with the upstream proto, this will net us a removal of over 6k LoC. Fantastic 😍

There seems to be one bug remaining in 0.25 that prevents us from implementing this and we'll probably have to wait for 0.25.1 or 0.26, but we're getting close:

Hard linking in `download_to_directory` in `running_actions_manager.rs` does not give guarantees

The linking logic here:
https://github.com/allada/turbo-cache/blob/e172756613b5398f1ccdaaf258f3f7b80ac4b08e/cas/worker/running_actions_manager.rs#L101

does not properly hold the FileEntry object, which is needed to guarantee that the file does not get deleted (among other guarantees). Properly holding a reference to the FileEntry should also fix the bug outlined here: https://github.com/allada/turbo-cache/blob/e172756613b5398f1ccdaaf258f3f7b80ac4b08e/cas/store/filesystem_store.rs#L411
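A rough sketch of the intended pattern (hypothetical types, not NativeLink's actual FileEntry API): keep a strong reference to the store entry alive while, and after, the hard link is created, so a concurrent eviction cannot remove the backing file underneath the action.

use std::path::{Path, PathBuf};
use std::sync::Arc;

// Hypothetical stand-in for the store's FileEntry, whose Drop would normally
// delete the on-disk file; holding the Arc defers that Drop.
struct FileEntryGuard {
    backing_path: PathBuf,
}

fn hard_link_entry(
    entry: Arc<FileEntryGuard>,
    dest: &Path,
) -> std::io::Result<Arc<FileEntryGuard>> {
    std::fs::hard_link(&entry.backing_path, dest)?;
    // Hand the guard back so the caller keeps it alive alongside the new link.
    Ok(entry)
}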

Create test harness for Grpc*

Currently we have no tests for GrpcScheduler or GrpcStore. This has resulted in at least one bug (see #199) and an outstanding ticket to create tests (#154).

In order to test this a gRPC service needs to be spun up to test against. This is non-trivial.

This issue exists to track the effort to create a framework that can create a gRPC service to write tests against.
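As a rough sketch of the harness shape (using tonic's health service purely as a placeholder backend, with tokio, tonic, tonic-health, and tokio-stream assumed as dev-dependencies; the real harness would mount mock CAS/AC/Execution services instead):

use tokio_stream::wrappers::TcpListenerStream;
use tonic::transport::{Channel, Server};

#[tokio::test]
async fn can_spin_up_in_process_grpc_service() -> Result<(), Box<dyn std::error::Error>> {
    // Bind to an OS-assigned port so tests can run in parallel.
    let listener = tokio::net::TcpListener::bind("127.0.0.1:0").await?;
    let addr = listener.local_addr()?;

    // Placeholder service; a real harness would register mock RBE services here.
    let (_health_reporter, health_service) = tonic_health::server::health_reporter();
    tokio::spawn(
        Server::builder()
            .add_service(health_service)
            .serve_with_incoming(TcpListenerStream::new(listener)),
    );

    // The code under test (e.g. GrpcStore/GrpcScheduler) only needs this channel.
    let _channel = Channel::from_shared(format!("http://{addr}"))?
        .connect()
        .await?;
    Ok(())
}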

Add AWS-native k8s Deployment

I'd like to run turbo-cache in a k8s cluster deployed with Pulumi so that we can automatically set it up for users as part of rules_ll. Simple yaml-manifests would be usable for users of raw k8s, Terraform and Pulumi.

I'd be willing to work on this ☺️

Publish more prometheus stats throughout system

Now that Prometheus is added and the API is established, we need to spread the usage around the system.

  • We can't consider our Prometheus publishing stable until prometheus/client_rust#155 is fixed.
  • GrpcServices
    • Implement for AcServer
    • Implement for BytestreamServer
    • Implement for CapabilitiesServer
    • Implement for CasServer
    • Implement for ExecutionServer
    • Implement for WorkerApiServer
  • Schedulers
    • Implement for PlatformPropertiesManager
    • Implement for CacheLookupScheduler
    • Implement for GrpcScheduler
    • #216 Implement for SimpleScheduler
    • #216 Implement for scheduler::Worker
  • Stores
    • Implement for CompressionStore
    • Implement for DedupStore
    • #263 Implement for FastSlowStore
    • #202 Implement for FilesystemStore
    • Implement for GrpcStore
    • #207 Implement for MemoryStore
    • Implement for S3Store
    • Implement for SizePartitioningStore
    • #208 Implement for VerifyStore
  • Workers
    • #213 Implement for LocalWorker
    • #213 Implement for RunningActionsManager
    • Implement for WorkerApiClientWrapper
  • Global State
    • #230 How many active GRPC connections there are
    • Total GRPC connections since server started
    • Time server started (timestamp)
    • Global config setting used
    • Maybe internal tokio metrics. This might require unstable tokio stuff 😦

Add official nix package

As the project matures and gets closer to an actual release version we should start considering packaging options. One such option (that I'm most interested in 😆) is the nix package repository.

@allada Plz ping me when we're getting close to a release and I can prepare a nixpkgs release for turbo-cache. Apart from maybe a CI workflow this probably doesn't require a PR to this repo, just one to nixpkgs.

Add prometheus logging

We need to gather stats on what is happening and when. Prometheus is a great choice for this and there's pretty good rust support.
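As a minimal sketch of what that could look like with the prometheus-client crate mentioned above (recent versions; the metric name here is illustrative, not the actual wiring):

use prometheus_client::encoding::text::encode;
use prometheus_client::metrics::counter::Counter;
use prometheus_client::registry::Registry;

fn main() {
    let mut registry = Registry::default();
    let requests: Counter = Counter::default();
    registry.register(
        "cas_requests",
        "Number of CAS requests handled",
        requests.clone(),
    );

    requests.inc();

    // Render the registry in the Prometheus text exposition format,
    // e.g. to serve from a /metrics endpoint.
    let mut body = String::new();
    encode(&mut body, &registry).unwrap();
    println!("{body}");
}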

Workers that spawn child processeses may cause zombies

In fixing another bug I discovered that tokio's kill command does not kill the entire child process tree; it only kills the immediate child. This can cause zombies if the child spawns further child processes and is then killed.

This is going to be very tricky to write tests for because Bazel uses a sandbox internally, so we may need to break out of the sandbox in order to call setsid() to create a process group.

This library does work, but when I looked at its implementation I wasn't sure we should use it, because it uses spawn_blocking when waiting on processes. This might cause all our threads to block on high-CPU machines if we are not careful.
https://docs.rs/command-group/latest/command_group/index.html

It is very common for people to use the entrypoint_cmd config to wrap their program in a shell script that runs it under Docker, which makes this problem much less of an issue.
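A Unix-only sketch of the setsid + process-group approach described above (using std::process for brevity; tokio's Command exposes the same pre_exec hook):

use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

fn spawn_in_own_group(mut cmd: Command) -> std::io::Result<Child> {
    // Put the child in its own session/process group so the whole tree can be
    // signalled later, not just the immediate child.
    unsafe {
        cmd.pre_exec(|| {
            if libc::setsid() == -1 {
                return Err(std::io::Error::last_os_error());
            }
            Ok(())
        });
    }
    cmd.spawn()
}

fn kill_process_group(child: &Child) {
    // A negative PID addresses the whole process group created by setsid().
    unsafe {
        libc::kill(-(child.id() as i32), libc::SIGKILL);
    }
}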

Support building with only Cargo

To make the project more accessible, we should support building with both Cargo and Bazel.

The major reason for this is that we should support Windows, but in my experience it's quite difficult to get Bazel working on Windows, so I'd like to just use Cargo in that case.

Build stall when CAS reset

I had everything stall during a full Chromium build. It all came back to life after a minute or so, but the cause appears to be this:

[2023-09-07T07:59:41.326Z WARN  h2::proto::streams::recv] recv_reset; remotely-reset pending-accept streams reached limit (20)
[2023-09-07T07:59:41.326Z ERROR cas] Failed running service : hyper::Error(Http2, Error { kind: GoAway(b"", ENHANCE_YOUR_CALM, Library) })

I'm not sure why this happened and it's the first time I've seen it.

Implement blake3

Sha256 is known to be quite slow. There's a new kid in town, Blake3. It's crazy fast, super secure and just better overall.

The Remote Execution API already supports it:
https://github.com/bazelbuild/remote-apis/blob/39c174e10d224c46b556d8d4615863804d5b2ff6/build/bazel/remote/execution/v2/remote_execution.proto#L1900

Bazel appears to be in the process of supporting it:
bazelbuild/bazel#18658

Micro-bench testing shows it is worth the effort:
https://gist.github.com/allada/6b4321a6487c2888ff73ce1cc0fc86ed

All results are on a 16-core i9 (single-threaded):
1GB @ 10:

sha256: 35.546221869s
blake3: 2.346503712s

abs difference: 33.199718157s
% difference r: 1514.86%
% difference i: 6.60%

1MB @ 10_000:

sha256: 34.653737424s
blake3: 2.129524155s

abs difference: 32.524213269s
% difference r: 1627.30%
% difference i: 6.15%

1KB @ 1_000_000:

sha256: 3.714629289s
blake3: 725.049013ms

abs difference: 2.989580276s
% difference r: 512.33%
% difference i: 19.52%

100B @ 1_000_000:

sha256: 453.309362ms
blake3: 113.922135ms

abs difference: 339.387227ms
% difference r: 397.91%
% difference i: 25.13%

This is a significant difference, and when things are under high workload it is often because we are spending so much time hashing. This has a very high chance of dramatically improving the performance of the CAS stores.
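For anyone wanting a rough local comparison, here is a minimal sketch using the sha2 and blake3 crates (the real stores hash streamed chunks rather than one in-memory buffer):

use sha2::{Digest, Sha256};
use std::time::Instant;

fn main() {
    // 64 MiB of zeros as a stand-in input.
    let data = vec![0u8; 64 * 1024 * 1024];

    let start = Instant::now();
    let sha_digest = Sha256::digest(&data);
    println!("sha256 ({} byte digest): {:?}", sha_digest.as_slice().len(), start.elapsed());

    let start = Instant::now();
    let b3_digest = blake3::hash(&data);
    println!("blake3 ({} byte digest): {:?}", b3_digest.as_bytes().len(), start.elapsed());
}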

Workers do not honor timeout

Currently we do not honor the timeout field in the proto for actions. We also need a "max job time" type setting and should force-kill jobs if they go over this limit regardless of what the proto requests.

This should be trivial to implement.
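The enforcement shape could look roughly like this (tokio-based sketch; max_job_time and the spawned command are placeholders):

use std::time::Duration;
use tokio::process::Command;

async fn run_with_timeout(
    requested: Option<Duration>,
    max_job_time: Duration,
) -> std::io::Result<()> {
    // Respect the action's requested timeout but clamp it to the worker limit.
    let limit = requested.unwrap_or(max_job_time).min(max_job_time);

    let mut child = Command::new("sleep").arg("3600").spawn()?;
    match tokio::time::timeout(limit, child.wait()).await {
        Ok(status) => {
            println!("finished: {:?}", status?);
        }
        Err(_elapsed) => {
            // Timed out: force-kill regardless of what the proto requested.
            child.kill().await?;
        }
    }
    Ok(())
}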

Use `crate_universe` instead of `cargo_raze`

cargo_raze seems to be mostly abandoned in favor of crate_universe.

Moving from cargo_raze to crate_universe means essentially just changing build files like this:

"//third-party:prost_types" -> "@crate_index//:prost-types"

The entire third_party directory is then superseded by a single Cargo.Bazel.lock file.

There are two ways crate_universe can be used:

  1. Track dependencies in the WORKSPACE directly and delete Cargo.toml, i.e. something like
    crates_repository(
        ...
        packages = {
            "somecrate": crate.spec(version = "1.2.3")
        },
    )
  2. Keep the Cargo.toml and generate dependencies from that, i.e. something like
    crates_repository(
        ...
        manifests = ["@//:Cargo.toml"],
    )

@allada I already have an implementation of option 1 but I need to update it to the recently changed deps. I'll send a PR when it's ready ❤️

If everything runs on the same host (worker, scheduler, CAS, etc.) it is possible to deadlock

Buck2 hammers the remote execution as hard as it can (which is a good thing). In my testing I was running everything in the same process (which is not how it should be done in production); this caused the maximum number of open files to be reached and then everything deadlocked, because every thread was waiting for another thread to release a file.

This only happens when you are reading and writing from one file to another (i.e. CAS(file) -> worker(file)).
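One possible mitigation, sketched under the assumption of a process-wide limit on simultaneously open files (the limit value and helper names are illustrative):

use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

const OPEN_FILE_LIMIT: usize = 512;

fn new_open_file_permits() -> Arc<Semaphore> {
    Arc::new(Semaphore::new(OPEN_FILE_LIMIT))
}

async fn open_bounded(
    permits: Arc<Semaphore>,
    path: &str,
) -> std::io::Result<(OwnedSemaphorePermit, tokio::fs::File)> {
    // Wait for a permit before opening; the permit is released when it is
    // dropped, i.e. once the caller is done with the file.
    let permit = permits.acquire_owned().await.expect("semaphore closed");
    let file = tokio::fs::File::open(path).await?;
    Ok((permit, file))
}

A file-to-file copy path would need to acquire both of its permits up front, otherwise the circular wait the issue describes can still occur.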

Unable to build on Ubuntu 20.04

The build now fails for Ubuntu 20.04:

gcc: error: unrecognized command line option '-std=c++20'; did you mean '-std=c++2a'?
gcc: error: unrecognized command line option '-std=c++20'; did you mean '-std=c++2a'?

I also still have a requirement to build on 18.04.

This was broken in 6a72841.

Remote execution is not supported by the remote server, or the current account is not authorized to use remote execution

Looks like a cool project, so I tried to follow the TL;DR, but got this error:

Remote execution is not supported by the remote server, or the current account is not authorized to use remote execution

The "docker-compose up" seemed be build great and get the containers running.

das@das-T14s-g1:~/temp$ sudo docker ps
[sudo] password for das: 
CONTAINER ID   IMAGE                       COMMAND                  CREATED       STATUS          PORTS                                                      NAMES
0c9ef89096f2   allada/turbo-cache:latest   "turbo-cache /root/w…"   2 hours ago   Up 53 minutes   50051-50052/tcp                                            docker-compose_turbo_cache_executor_1
375d537172ef   allada/turbo-cache:latest   "turbo-cache /root/s…"   2 hours ago   Up 53 minutes   50051/tcp, 0.0.0.0:50052->50052/tcp, :::50052->50052/tcp   docker-compose_turbo_cache_scheduler_1
27cdf217dd15   allada/turbo-cache:latest   "turbo-cache /root/l…"   2 hours ago   Up 53 minutes   0.0.0.0:50051->50051/tcp, :::50051->50051/tcp, 50052/tcp   docker-compose_turbo_cache_local_cas_1

Then tried the test example https://github.com/allada/turbo-cache/#tldr

das@das-T14s-g1:~/Downloads/turbo-cache$ bazelisk test //...   --remote_instance_name=main   --remote_cache=grpc://127.0.0.1:50051   --remote_executor=grpc://127.0.0.1:50051
2023/08/30 17:35:52 Downloading https://releases.bazel.build/6.2.1/release/bazel-6.2.1-linux-x86_64...
Extracting Bazel installation...
Starting local Bazel server and connecting to it...
INFO: Invocation ID: d46b187b-1d91-477e-8f72-7d814b681684
ERROR: Remote execution is not supported by the remote server, or the current account is not authorized to use remote execution.

This is likely a beginner error, so sorry about that.

Thanks in advance

Add rate limiting to GrpcStore

When on-boarding an upstream Goma proxy, every file must be uploaded to populate the Redis cache. However, this is causing concurrency errors in the GrpcStore:

[2023-07-17T07:52:43.032Z ERROR cas_server] Error during .has() call in .find_missing_blobs() : Error { code: Internal, messages: ["status: Internal, message: \"h2 protocol error: http2 error: connection error received: unspecific protocol error detected (b\\\"[p]req HEADERS: max concurrency reached\\\")\", details: [], metadata: MetadataMap { headers: {} }", "in GrpcStore::find_missing_blobs"] } - ff2dde80d3e78f42b128dcf6b4fe7b1173908e4276a4ef9ac3819c09f668bbb2

There should be a back-off for handling this situation.
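A simple retry-with-backoff wrapper could look like this (sketch only; the retryable status codes and delays are illustrative):

use std::time::Duration;

async fn with_backoff<T, F, Fut>(mut call: F) -> Result<T, tonic::Status>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, tonic::Status>>,
{
    let mut delay = Duration::from_millis(50);
    for _ in 0..5 {
        match call().await {
            Ok(value) => return Ok(value),
            // Treat transport-level concurrency errors as retryable.
            Err(status)
                if status.code() == tonic::Code::Internal
                    || status.code() == tonic::Code::ResourceExhausted =>
            {
                tokio::time::sleep(delay).await;
                delay *= 2;
            }
            Err(status) => return Err(status),
        }
    }
    // Final attempt after the retries are exhausted.
    call().await
}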

build failed on mac

Hi,

I'm trying to build turbo-cache on Mac. Bazel build returns this error message:

error[E0432]: unresolved import libc
--> external/raze__tempfile__3_3_0/src/file/imp/unix.rs:19:5
|
19 | use libc::{c_char, c_int, link, rename, unlink};
| ^^^^ use of undeclared crate or module libc

error: aborting due to previous error

[idea] Amount of magic for publishing metrics

Looking for feedback on how automatic metrics publishing should be.

Right now I have two approaches: a procedural but verbose one, and a magical macro one.

Here's the syntax for the two:

// FilesystemStore example.
impl<Fe: FileEntry> MetricsComponent for FilesystemStore<Fe> {
    fn gather_metrics(&self, c: &mut CollectorState) {
        c.publish(
            "read_buff_size",
            self.read_buffer_size,
            "Size of the configured read buffer size",
        );
        c.publish(
            "active_drop_spawns",
            &self.shared_context.active_drop_spawns,
            "Number of active drop spawns",
        );
        c.publish(
            "temp_path",
            &self.shared_context.temp_path,
            "Path to the configured temp path",
        );
        c.publish(
            "content_path",
            &self.shared_context.content_path,
            "Path to the configured content path",
        );
        c.publish("evicting_map", &self.evicting_map, "");
    }
}

// VerifyStore example.
impl MetricsComponent for VerifyStore {
    fn gather_metrics(&self, c: &mut CollectorState) {
        c.publish(
            "verify_size",
            self.verify_size,
            "If the verification store is verifying the size of the data",
        );
        c.publish(
            "verify_hash",
            self.verify_hash,
            "If the verification store is verifying the hash of the data",
        );
        c.publish(
            "size_verification_failures",
            &self.size_verification_failures,
            "Number of failures the verification store had due to size mismatches",
        );
        c.publish(
            "hash_verification_failures",
            &self.hash_verification_failures,
            "Number of failures the verification store had due to hash mismatches",
        );
    }
}

// MemoryStore example.
impl MetricsComponent for MemoryStore {
    fn gather_metrics(&self, c: &mut CollectorState) {
        c.publish("evicting_map", &self.evicting_map, "");
    }
}

The macro version would look like this:

// FilesystemStore example.
publish_metrics! {
    FilesystemStore<Fe> {
        evicting_map,
        read_buff_size "Size of the configured read buffer size" Bytes,
        shared_context {
            active_drop_spawns "Number of active drop spawns",
            temp_path "Path to the configured temp path",
            content_path "Path to the configured content path",
        }
    }
}

// VerifyStore example.
publish_metrics! {
    VerifyStore {
        verify_size "If the verification store is verifying the size of the data",
        verify_hash "If the verification store is verifying the hash of the data",
        size_verification_failures "Number of failures the verification store had due to size mismatches",
        hash_verification_failures "Number of failures the verification store had due to hash mismatches",
    }
}

// MemoryStore example.
publish_metrics! {
    MemoryStore {
        evicting_map,
    }
}

These two examples would do 100% identical things, except that the macro one would also make it easy to denote the type (which is a bit tricky to do with the procedural one).

Thoughts?

Let's get goma going

I'm trying to get turbo-cache working with goma. We're running a 512-core goma + buildbarn cluster with about a dozen builds per day. I got somewhere by just replacing buildbarn's CAS with turbo-cache, but would love to try running the whole thing on tc.

I've got a turbo-cache scheduler, cas and worker running but currently stuck on this error from goma:

exec call: error in check missing blobs: rpc error: code = Unimplemented desc = missing blobs: rpc error: code = Unimplemented

EDIT: got it to work! lmk if you want me to test with goma + chromium + lots of cores!

Enable clippy during tests

A draft of this is at #152.

My current migration plan:

  • #158 Change impls of Into to From since the latter gives the former for free
  • #163 Implement is_empty for LenEntry and all its impls

Add remaining fixes separately:

Finally:

Additional issues/questions I encountered so far:

  • Clippy doesn't like the highest_priority_action_first and equal_priority_earliest_first tests in cas/scheduler/tests/action_messages_test.rs. I'm not sure what these tests are trying to test. Is this about Ord? Would testing something like assert!(first_action < current_action) or similar also work?
  • #174 The GetFinishedResult in cas/worker/tests/utils/mock_running_actions_manager.rs pointed out that the largest GetFinishedResult variant contains at least 496 bytes. It's talking about ActionResult here. I'd expect this to be fairly large but should it really be this large at all times?
  • We "hold a RefCell reference across an await point" in cas/store/tests/filesystem_store_test.rs L731. Seems like something is wrong there.

[Idea] Checkpoint support

Today I heard an interesting use case. Sometimes users have processes that take a very long time, like training an ML model, and want to upload resumable checkpoints so that if the program is restarted it resumes from the last checkpoint.

Specific use case:

  1. Training program takes 3 days to run on a single GPU instance.
  2. The intermediate state can be quite large (100GB+), so uploads are slow.
  3. While the intermediate state is being uploaded, we want to keep the ML model training on the same GPU with the same state.
  4. If the task is terminated, turbo-cache should attempt to resume the process from the last saved state.
  5. A special ActionResult will be uploaded to the AC for the task with a last_state tag in the hash (maybe an environment variable?). This will allow actions to be run against whatever the most recent state of the action cache is, for example running some heuristics on the last model being trained (like TensorBoard).

Obviously this would be very difficult to implement right. It would be great if we could just snapshot memory state & files, upload them, and allow them to be resumed, but certain things like GPU drivers present issues. We could instead do this fairly easily by sending a special signal to the program (e.g. SIGUSR1, SIGUSR2, SIGVTALRM); the program would then save its resume-state files to disk and inform the turbo-cache worker process when it is done. Turbo-cache would then upload the state and the special "latest" ActionCache result.

This would obviously represent non-deterministic behavior, but it would be a configured parameter on the worker, so only use cases that specifically request this functionality would be allowed to use it (i.e. opt-in to non-determinism).

Projects that do similar stuff:
https://github.com/checkpoint-restore/criu
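A sketch of what the program side of the signal-driven approach above might look like (tokio-based; the checkpoint file names and the "done" notification are placeholders, since the real mechanism for informing the worker is undecided):

use tokio::signal::unix::{signal, SignalKind};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let mut checkpoint_requests = signal(SignalKind::user_defined1())?;
    loop {
        tokio::select! {
            _ = checkpoint_requests.recv() => {
                // Placeholder for serializing model/optimizer state.
                tokio::fs::write("checkpoint.bin", b"state").await?;
                // Placeholder "I'm done" notification for the worker process.
                tokio::fs::write("checkpoint.done", b"").await?;
            }
            _ = do_training_step() => {}
        }
    }
}

async fn do_training_step() {
    // Stand-in for one unit of training work.
    tokio::time::sleep(std::time::Duration::from_secs(1)).await;
}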

Remove CacheLookupScheduler

I created CacheLookupScheduler because Goma wasn't getting cache hits. Turns out this was actually due to #176. This is now an entirely unused scheduler type.

I thought I'd raise an issue to determine if we want to simply delete it to avoid the maintenance cost. If I understand the RBE protocol correctly, it should actually never be required...

Create GrpcScheduler tests

There are cases where proxying a Scheduler to another instance would be useful.

An example use case is where an instance sits close to clients that is running a fast-slow store for CAS and AC over a slow network connection. This would allow an action cache lookup scheduler to utilise the local AC and then forward actions that are not found to a remote scheduler that is closer to the workers.

Sanitizer tracking issue

Sanitizer integration in Rust is still quite experimental and tends to produce false positives. I went through a bunch of logs and think the issues below could be real bugs. I've added a few points of interest (POI) in the codebase after some initial disentangling of the error logs.

AddressSanitizer:

  • #187 A lot of leaked memory during cas/store:ref_store_test. Seems to occur in get_test and update_test. POI:
    • The setup_stores call in update_test, in ref_store_owned. Seemingly during some clone operations.
    • The add_store call in setup_stores.
    • The first scope in each of the tests.
    • The ref_store field in the RefStore::new implementation.
    • The unsafe impl Sync for StoreReference {}.
    • The RefStore struct.
    • The name.to_string call in stores.insert in cas/store/lib.rs in add_store.
    • Seems like everything points to the string keys of the hashmap in the stores.insert call.
    • Related issues:

ThreadSanitizer:

  • Data race in cas/worker:running_actions_manager_test in cleanup_happens_on_job_failure. POI:

    • Creation of fast_store in the setup_stores for these tests.
    • The try_join in running_actions_manager.rs in upload_results.
  • Data race in cas/worker:local_worker_test in new_local_worker_removes_work_directory_before_start_test. POI:

    • The new_local_worker in the failing test.
    • The fs::canonicalize call in new_local_worker in cas/worker/local_worker.rs.
  • Data race in cas/store:filesystem_store_test in oldest_entry_evicted_with_access_times_loaded_from_disk. POI:

    • The match statement at the end of the test on store.get_file_entry_for_digest.
    • The fs::create_dir_all call in the test.
    • The write_file calls in that test.

Make custom binaries for different services

As outlined in #116, we should make a few binaries for CAS, Scheduler, Worker instead of forcing every service to run everything.

We will still have a single binary that can do everything, but for special cases where certain dependencies cannot be filled (like wasm or certain OSs/kernels) users could use the split-out binaries.

Make WASM compatible binary optimized for edge computing

As outlined in #116, WASM could be very useful for edge-computing services. Since many of the Bazel files are likely to live in S3, Redis, or other similar layers, it would likely save money, time, and effort if users could hit a local edge point where a WASM module starts up, runs the tasks, then powers down. Since these kinds of services would be data transformations, the compute and resource requirements would likely be low.

[info] Chrome stats

While building Chrome with turbo-cache and a completely full cache, here are some useful stats on disk usage for a single fresh build (iterating over the newest 10k items):

item_size_bytes{quantile="0.00"} 102
item_size_bytes{quantile="0.01"} 883
item_size_bytes{quantile="0.03"} 3540
item_size_bytes{quantile="0.05"} 6320
item_size_bytes{quantile="0.10"} 11102
item_size_bytes{quantile="0.30"} 37572
item_size_bytes{quantile="0.50"} 83920
item_size_bytes{quantile="0.70"} 187360
item_size_bytes{quantile="0.90"} 477504
item_size_bytes{quantile="0.95"} 722688
item_size_bytes{quantile="0.97"} 932288
item_size_bytes{quantile="0.99"} 1433904
item_size_bytes{quantile="1.00"} 9157664

If a worker is killed, stale work directories exist

I had a worker disconnect and then reconnect, and then the scheduler logged:

[2023-09-07T08:29:17.678Z WARN  simple_scheduler] Internal error for worker 7b2e3f8d-0afb-4774-82e3-cecef835dbf3: Error { code: AlreadyExists, messages: ["File exists (os error 17) : Error creating work directory /root/.cache/turbo-cache/work/f5596fdea33511e1c7f513873c087faee689c6ee024c9d90d2908c351e5b0e83"] }

I think #129 may still be required.

Document that users should use `-c opt` when going into production

If the user just compiles using bazel build //cas it will result in a very slow binary. This is because of the way Rust compiles by default; Rust is extremely slow in non-optimized binaries.

We can simply update the readme to tell users to use -c opt when building for production.
