roc-lang / rbt
Roc Build Tool
License: Universal Permissive License v1.0
Right now we have to do things like:

Rbt = [Rbt { default: Job }]

but it would be better to make that an opaque, non-union type:

Rbt := { default : Job }

The problem is that glue cannot see through module boundaries right now, so making that change in Package-Config.roc would mean that the API in Rbt.roc could not construct values! Once glue can see through module boundaries, this should be trivial to fix (and should simplify the generated glue code a lot!)
As in ADR 008 (docs/adrs/008-unified-inputs.md) we need some data structure like this to avoid conflicts between jobs and source files:

FileMapping := { sourceFile : Str, workspacePath : Str }

sourceFile : Str -> FileMapping
sourceFile = \path -> @FileMapping { sourceFile: path, workspacePath: path }

withWorkspacePath : FileMapping, Str -> FileMapping
withWorkspacePath = \@FileMapping { sourceFile }, workspacePath ->
    @FileMapping { sourceFile, workspacePath }

Right now we're just using a Str for input paths; they should be replaced by this data structure.
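For orientation, the host side of that data structure might look roughly like this in Rust. The names, derives, and builder-style methods are assumptions for the sketch, not the actual glue-generated types:

```rust
// Hypothetical Rust mirror of the Roc `FileMapping` opaque type.
#[derive(Debug, Clone, PartialEq, Eq)]
struct FileMapping {
    source_file: String,
    workspace_path: String,
}

impl FileMapping {
    // Mirrors Roc's `sourceFile`: both paths start out the same.
    fn source_file(path: &str) -> Self {
        FileMapping {
            source_file: path.to_string(),
            workspace_path: path.to_string(),
        }
    }

    // Mirrors Roc's `withWorkspacePath`.
    fn with_workspace_path(self, workspace_path: &str) -> Self {
        FileMapping {
            workspace_path: workspace_path.to_string(),
            ..self
        }
    }
}

fn main() {
    let mapping = FileMapping::source_file("src/main.roc").with_workspace_path("main.roc");
    assert_eq!(mapping.source_file, "src/main.roc");
    assert_eq!(mapping.workspace_path, "main.roc");
}
```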
Right now, rbt creates directories and moves output files around before checking whether it actually needs to. See these comment threads for a way to improve the situation:

The temporary directory should also probably live somewhere more inspectable in case of failure. (#33 (comment))
Once we have a bunch of jobs running in parallel (#73) we should make a nicer CLI output. Something like superconsole should come in quite handy!
We want to be able to download files. We probably need to do this fairly frequently, and should be able to do it without invoking some system tool like curl. At minimum, we probably need to be able to:
The task here is not to actually implement this, but to create an ADR detailing what kind of API we want, exactly.
When implementing #7, @celsobonutti found that it's pretty inconvenient to capture stdout as a build result: you need to spawn a shell and do input redirection. This is a common-enough pattern, in my mind, that we should allow for it in the rbt API. So, this issue needs two things:
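For reference, capturing stdout on the Rust side doesn't need a shell at all; std::process::Command buffers it for us. A sketch of what the runner could do (capture_stdout is a hypothetical helper, not rbt's actual API):

```rust
use std::process::Command;

// Run a command and capture its stdout as a build output, with no shell
// or input redirection involved. Error handling is simplified.
fn capture_stdout(program: &str, args: &[&str]) -> std::io::Result<Vec<u8>> {
    let output = Command::new(program).args(args).output()?;
    if !output.status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!("{} exited with {}", program, output.status),
        ));
    }
    Ok(output.stdout)
}

fn main() {
    let out = capture_stdout("echo", &["hello"]).expect("echo should succeed");
    assert_eq!(out, b"hello\n");
}
```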
In order to avoid pointers, we currently copy a little too much memory. @bhansconnect wrote a helper to avoid doing some of that, which we should use instead: roc-lang/roc#4361. That way we could have owned lists and dicts, and in so doing avoid some copies!
In order to walk the build graph in parallel, the data store has to be accessible across threads. We've been thinking about using sled for a while (in fact, it's still in the dependencies from the initial spike!)
We have an ADR for this (docs/adrs/008-unified-inputs.md) but we're going to need to make some modifications to work around some problems in how Roc interprets the memory ("alias references to mutually recursive types" in the Roc Zulip).

We will probably need to have Job be defined something like this for the first pass, and work towards the design from ADR 008 in future PRs:

Job := [Job {
    command : Command,
    inputs : List [FromProjectSource (List FileMapping), FromJob Job (List FileMapping)],
    outputs : List Str,
}]
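To make the memory-layout trouble concrete, here is a rough Rust analogue of that type (all names are illustrative, not the real generated glue). The recursion through FromJob is exactly the part that needs indirection:

```rust
// Rough Rust analogue of the Roc `Job` type above. The recursive
// `FromJob` variant needs a Box in Rust, which hints at why mutually
// recursive type aliases are awkward for glue to interpret.
#[derive(Debug)]
struct FileMapping {
    source_file: String,
    workspace_path: String,
}

#[derive(Debug)]
enum Input {
    FromProjectSource(Vec<FileMapping>),
    FromJob(Box<Job>, Vec<FileMapping>),
}

#[derive(Debug)]
struct Command {
    tool: String,
    args: Vec<String>,
}

#[derive(Debug)]
struct Job {
    command: Command,
    inputs: Vec<Input>,
    outputs: Vec<String>,
}

fn main() {
    let compile = Job {
        command: Command { tool: "gcc".into(), args: vec!["-c".into(), "main.c".into()] },
        inputs: vec![Input::FromProjectSource(vec![FileMapping {
            source_file: "main.c".into(),
            workspace_path: "main.c".into(),
        }])],
        outputs: vec!["main.o".into()],
    };
    // A downstream job can depend on the compile job's outputs.
    let link = Job {
        command: Command { tool: "gcc".into(), args: vec!["main.o".into()] },
        inputs: vec![Input::FromJob(Box::new(compile), vec![])],
        outputs: vec!["a.out".into()],
    };
    assert!(matches!(&link.inputs[0], Input::FromJob(_, _)));
}
```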
We're currently using the default Rust hasher, as @bhansconnect points out in https://github.com/roc-lang/rbt/pull/33/files#r954403324. We should use something else; xxhash (crate) looks like a pretty reasonable/fast option.
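Whatever hasher we settle on, keeping the key computation generic over BuildHasher would make the swap cheap; xxhash could then drop in via its BuildHasher implementation. A sketch using only std types:

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash, Hasher};

// Compute a job key with any pluggable hasher. Swapping in xxhash (or
// seahash) later just means passing a different `BuildHasher`.
// Note: `RandomState` is randomized per process, so it's shown only to
// demonstrate the shape; a cache key needs a *stable* hasher.
fn job_key<T: Hash, S: BuildHasher>(job: &T, build: &S) -> u64 {
    let mut hasher = build.build_hasher();
    job.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let state = RandomState::new();
    // Equal inputs always produce equal keys under the same BuildHasher.
    assert_eq!(job_key(&"echo hello", &state), job_key(&"echo hello", &state));
}
```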
We currently inherit environment variables from the parent process. We need to stop doing that!

This should be a pretty reasonable first issue for someone, except for the fact that a lot of tools will get annoyed at you if you don't set some environment variables. The most common one I've seen is HOME (which we're setting in #62), but I tested a few things out and it seems like it might be necessary to set LOCALE to some fake value as well.
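A minimal sketch of what the fix could look like on the Rust side, using std's env_clear (the fake HOME path is a made-up placeholder):

```rust
use std::process::Command;

fn main() {
    // Spawn a job with an emptied environment, then add back only what
    // the job is allowed to see. `/tmp/rbt-fake-home` stands in for
    // whatever fake HOME we end up creating.
    let output = Command::new("/usr/bin/env")
        .env_clear()
        .env("HOME", "/tmp/rbt-fake-home")
        .output()
        .expect("failed to run env");

    let vars = String::from_utf8_lossy(&output.stdout);
    // Only the variable we set explicitly is present; nothing inherited.
    assert!(vars.contains("HOME=/tmp/rbt-fake-home"));
    assert!(!vars.contains("PATH="));
}
```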
https://github.com/tokio-rs/console should be a useful tool for us to figure out if rbt is performing well after merging #82. Let's set it up and see if it can give us insight into how well the system is working!
Right now, stdout and stderr from jobs get dropped. That's not really tenable! We probably want to store it so we can more easily debug builds.
This might depend on #72 if we decide we want to store it in the database (or if we want to store it in the CAS).
If someone other than @BrianHicks picks this up, the best bet is probably to have a chat with him on the Roc Zulip about design considerations here!
Jobs currently just have their IDs and the first 20 characters (or so) of their commands. We should do two things to improve this:

1. Allow job to take an optional name field.
2. If we get that field, present jobs with it everywhere we can.

Those items can be one PR or two; they're at least partially separate!
With #69, we have an actual build graph. Next step: walk it in parallel!
Dependencies:
We probably don't want to do this in an async context, by the way. The async_std implementation of process is unstable, and we're going to want to avoid spawning all the build tasks at once!
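Here's one possible shape for the parallel walk using plain threads and a channel, with a parallelism limit so we don't spawn everything at once. The walk function and job names are illustrative, not rbt's real code:

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// Walk a dependency graph with bounded parallelism: at most `limit`
// jobs run at once, and a job only starts after its dependencies have
// finished. Assumes the graph is acyclic, every job appears as a key
// in `deps`, and `limit >= 1`. Returns jobs in completion order.
fn walk(deps: &HashMap<&'static str, Vec<&'static str>>, limit: usize) -> Vec<&'static str> {
    // dependents: reverse edges; indegree: unfinished dependency count.
    let mut dependents: HashMap<&'static str, Vec<&'static str>> = HashMap::new();
    let mut indegree: HashMap<&'static str, usize> = HashMap::new();
    for (&job, ds) in deps {
        indegree.insert(job, ds.len());
        for &d in ds {
            dependents.entry(d).or_default().push(job);
        }
    }

    let (done_tx, done_rx) = mpsc::channel();
    let mut ready: Vec<&'static str> = indegree
        .iter()
        .filter(|&(_, &n)| n == 0)
        .map(|(&j, _)| j)
        .collect();
    let mut running = 0;
    let mut finished = Vec::new();

    while finished.len() < deps.len() {
        // Start ready jobs up to the parallelism limit.
        while running < limit {
            match ready.pop() {
                Some(job) => {
                    let tx = done_tx.clone();
                    running += 1;
                    thread::spawn(move || {
                        // The real work (running the command) goes here.
                        tx.send(job).unwrap();
                    });
                }
                None => break,
            }
        }
        // Wait for one job to finish, then release its dependents.
        let job = done_rx.recv().unwrap();
        running -= 1;
        finished.push(job);
        if let Some(nexts) = dependents.get(job) {
            for &next in nexts {
                let n = indegree.get_mut(next).unwrap();
                *n -= 1;
                if *n == 0 {
                    ready.push(next);
                }
            }
        }
    }
    finished
}

fn main() {
    let mut deps = HashMap::new();
    deps.insert("link", vec!["compile-a", "compile-b"]);
    deps.insert("compile-a", vec![]);
    deps.insert("compile-b", vec![]);

    let order = walk(&deps, 2);
    // Both compiles finish before "link" is allowed to start.
    assert_eq!(order.len(), 3);
    assert_eq!(order.last(), Some(&"link"));
}
```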
I think that we shouldn't check .envrc into the repo. Instead we should let the user specify it if they want it. The reason for this is so that users aren't stuck with the default .envrc. For example, I use lorri instead of direnv directly, which means that I want .envrc to be eval "$(lorri direnv)" instead of use nix.
Currently:

- Workspace is responsible for setting up files for a job
- Job is responsible for creating the command
- Runner is responsible for bringing the workspace and job together
- Coordinator is responsible for executing jobs in the right order

It might make sense to refactor these in terms of responsibilities:

- Runner is responsible for taking a job (not a job and workspace, just a job), running it, and saying where we can get the output
- Workspace is responsible for isolating the file system as much as possible
  - HOME (see #62)
- Job is responsible for command isolation
  - PATH (see #61, or should Workspace maybe handle this?)
  - is there a wrapper around std::process::Command that can do this wrapping for us?
- Coordinator is responsible for the same things as before

This'll be a biggish refactor, but should mean we have better-defined responsibilities for all the components in the system.
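One way to make the proposed boundaries concrete is to sketch them as traits. All of these names and signatures are hypothetical, purely to illustrate the split:

```rust
use std::path::PathBuf;

// Hypothetical shapes for the refactor described above; none of these
// are rbt's real types.

struct Job; // placeholder for the real Job type

// Runner: takes *just* a job, runs it, says where the output landed.
trait Runner {
    fn run(&self, job: &Job) -> std::io::Result<PathBuf>;
}

// Workspace: isolates the file system (fake HOME, scratch dirs, ...).
trait Workspace {
    fn isolated_root(&self) -> PathBuf;
}

// Coordinator: executes jobs in the right order, same as today.
trait Coordinator {
    fn run_all(&mut self) -> std::io::Result<()>;
}

// A no-op runner just to show the trait compiles and can be called.
struct DryRunner;

impl Runner for DryRunner {
    fn run(&self, _job: &Job) -> std::io::Result<PathBuf> {
        Ok(PathBuf::from("/dev/null"))
    }
}

fn main() {
    let runner = DryRunner;
    let out = runner.run(&Job).unwrap();
    assert_eq!(out, PathBuf::from("/dev/null"));
}
```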
Lots of languages I've used (Ruby, Elm, Python, Rust) have a single manifest file (sometimes with a lock file) to define all the dependencies and versions needed for a project. In a lot of cases, we're just going to want to leave this alone and call (e.g.) pip install with the manifest and lock files as targets. But in certain cases, we may be able to take over some of the downloading and installing. (For example, look at all the x2nix packages out there!)
So, in the case where we can download a bunch of files, we probably want to cache those—how? Does the cache need to be mutable? In what cases? Can we avoid having a shared mutable state escape hatch altogether while still supporting these use cases?
When rbt gets a Job from Roc, it uses the information to create a key, which is then used as the identifier for the job in the build graph. However unlikely it is, it's possible that we could get a collision there.
To guard against this, rbt could keep a mapping of key to job while constructing the build graph. Every time we got a new job, we'd insert into this mapping. If we already had an item in there, we'd verify that the new value was exactly equal to the existing one. If it wasn't, we'd raise an error and ask the caller to set some field designed to avoid hash collisions.
Thanks, @bhansconnect, for the idea of how to mitigate this!
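The mitigation could look something like this (Key and Job are simplified stand-ins for rbt's real types):

```rust
use std::collections::HashMap;

// Stand-ins for rbt's real key/job types, just for the sketch.
type Key = u64;

#[derive(Debug, Clone, PartialEq, Eq)]
struct Job {
    command: String,
}

// Insert a job under its key, erroring if a *different* job already
// claimed that key (an actual collision, as opposed to a duplicate).
fn insert_job(seen: &mut HashMap<Key, Job>, key: Key, job: Job) -> Result<(), String> {
    match seen.get(&key) {
        None => {
            seen.insert(key, job);
            Ok(())
        }
        Some(existing) if *existing == job => Ok(()), // same job, fine
        Some(existing) => Err(format!(
            "key collision: {:?} vs {:?}; set a collision-avoidance field on one of them",
            existing, job
        )),
    }
}

fn main() {
    let mut seen = HashMap::new();
    let a = Job { command: "echo a".into() };
    assert!(insert_job(&mut seen, 1, a.clone()).is_ok());
    // Re-inserting the identical job is fine...
    assert!(insert_job(&mut seen, 1, a).is_ok());
    // ...but a different job under the same key is a collision.
    let b = Job { command: "echo b".into() };
    assert!(insert_job(&mut seen, 1, b).is_err());
}
```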
The way to go seems to be to set up the CODEOWNERS file with at least two trusted contributors as code owners per folder. That way every PR will be reviewed by a (different) trusted contributor.
We know that it'll be way too annoying to manually specify every single one of our files, so we'd like something a little better!
I'm assigning @zwilias on this since we've already paired on it and he has more context on the ADR that will be written up to close this issue.
We currently fudge a little with systemTool: we assume the tool is in PATH, but we don't check at all and also don't prevent it from running anything else by name in PATH.
Implementation idea: look up the binary in PATH, then symlink the binary location to some discrete bin directory that the job has access to. Takes a little more work but isolates more and lets us give better error messages for missing binaries.
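The PATH-lookup half of that idea might look like this in Rust (the symlinking half, e.g. via std::os::unix::fs::symlink into a per-job bin directory, is left out of the sketch):

```rust
use std::env;
use std::path::PathBuf;

// Find a binary by searching PATH, the first half of the idea above.
// Note: this checks for an existing file but not the executable bit;
// a real implementation would check permissions too.
fn which(name: &str) -> Option<PathBuf> {
    let path = env::var_os("PATH")?;
    env::split_paths(&path)
        .map(|dir| dir.join(name))
        .find(|candidate| candidate.is_file())
}

fn main() {
    // A missing tool gives us the chance to print a good error message
    // instead of a confusing exec failure at job runtime.
    match which("sh") {
        Some(path) => println!("found sh at {}", path.display()),
        None => eprintln!("job needs `sh`, but it isn't anywhere in PATH"),
    }
}
```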
I've had this as a personal TODO forever but realized it needs to be public! ADR 4 (https://github.com/rtfeldman/rbt/blob/trunk/docs/adrs/004-symlinking.md) is not implemented and it totally could be picked up by someone else if they'd like!
Just a note for later, as I've just learned about seahash: https://lib.rs/crates/seahash
We don't currently empty out the environment, which means that paths like HOME are totally available for caching, config files, etc. Not isolated even a little!

What we want: create a fake HOME, then look through it after the build completes. If the build leaves anything in it, issue a warning. Eventually this will be an error, but for now we don't have any way to work with mutable caches, so we should be a little gentler.
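A sketch of that post-build check (the paths and helper name are made up for illustration):

```rust
use std::fs;
use std::path::Path;

// After the build finishes, look inside the fake HOME. Anything left
// behind means the job wrote to HOME, which we warn about for now (and
// would eventually reject).
fn leftover_files(fake_home: &Path) -> std::io::Result<Vec<String>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(fake_home)? {
        found.push(entry?.file_name().to_string_lossy().into_owned());
    }
    Ok(found)
}

fn main() -> std::io::Result<()> {
    let fake_home = std::env::temp_dir().join("rbt-fake-home-demo");
    fs::create_dir_all(&fake_home)?;

    // Pretend a job dropped a cache file in HOME during the build.
    fs::write(fake_home.join(".some-tool-cache"), b"oops")?;

    let leftovers = leftover_files(&fake_home)?;
    if !leftovers.is_empty() {
        eprintln!("warning: build left files in HOME: {:?}", leftovers);
    }
    assert_eq!(leftovers, vec![".some-tool-cache"]);

    fs::remove_dir_all(&fake_home)?;
    Ok(())
}
```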