dimo414 / bkt Goto Github PK

View Code? Open in Web Editor NEW

213.0 6.0 13.0 215 KB

a subprocess caching utility, available as a command line binary and a Rust library.

Home Page: https://www.bkt.rs

License: MIT License

Rust 97.14% Shell 2.86%

rust subprocess caching cli bash shell fzf cache memoization

bkt's People

Contributors

Stargazers

Watchers

Forkers

tranzystorekk rrx 00mjk kumy dmgolembiowski jqlin minimal101 iq-scm alexanderkjall eatnumber1 vorburger larsks

bkt's Issues

Provide arm64 Darwin build

Hi @dimo414

could you please provide release builds for arm64 Darwin?

Thanks 🙏

Consider %LocalAppData% for Windows

Currently bkt defaults to using the directory returned by std::env::temp_dir() on all platforms. In principle bkt should probably use a "user cache" directory instead, such as ProjectDirs::cache_dir - notably this would isolate cached data between users, which is currently a limitation we workaround on Unix with permissions.

However, in practice using temp_dir() is probably the best choice on Linux/OSX because we want to use a tempfs directory for performance. On Windows (to my knowledge) that isn't likely/possible so we might as well use %LocalAppData% instead which shouldn't be any slower than the Windows temp directory.

Relative cache path does not cache as expected

The following does not seem to cache:

rm -rf asdf
# takes 10 seconds as expected
time cargo run -- --cache-dir asdf -- sleep 10
# also takes 10 seconds
time cargo run -- --cache-dir asdf -- sleep 10

The second invocation should return immediately but does not.

But if I specify the absolute path, it works as expected:

rm -rf asdf
# takes 10 seconds
time cargo run -- --cache-dir $PWD/asdf -- sleep 10
# returns immediately
time cargo run -- --cache-dir $PWD/asdf -- sleep 10

I haven't discovered why this is happening in the code, but the quick solution I came up with was to canonicalize the cache path. At some point the path is getting expanded out. Not sure where.

The following fixes the issue: #8

Clean up cached data linked to old file modtimes

Now that --modtime is supported it's possible for a cached invocation to become functionally unreachable¹ but remain in the cache for an extended period of time. For example:

bkt --ttl=1d --modtime=.git/FETCH_HEAD -- git status

Any cached status will become unreachable whenever FETCH_HEAD is updated (i.e. after git fetch / git pull) but presently bkt can't actually clean up the data until the TTL expires. In principle bkt could clean up all existing cached data linked to a given file whenever its modtime changes.

Furthermore, if this functionality is added such commands could be allowed to not set a TTL at all (if they want) and simply rely on modtime invalidation.

This would be a somewhat sizable refactoring of the Bkt and Cache APIs, but it's probably worth the effort.

yes a file could theoretically be set to a prior modtime, either manually or after a clock change, but this is uncommon and a cache miss in such circumstances is probably acceptable. ↩

Symlinks on Windows require privilege escalation

On Windows bkt fails with this error:

> /bkt.exe -- ipconfig
bkt: Cache write failed: A required privilege is not held by the client. (os error 1314)

This comes from the call to std::os::windows::fs::symlink_file which actually doesn't work in most cases. Specifically, Windows treats symlink creation as a privileged action and there aren't a lot of good ways around it.

Python suggests running as an administrator, enabling Developer Mode, or setting SeCreateSymbolicLinkPrivilege
- Setting the create symlink privilege doesn't always work either; administrators seemingly can't set this for themselves; UAC would need to be disabled

Some users report success calling mklink in a subprocess, but doesn't work for me, at least. For posterity:

#[cfg(windows)]
fn symlink<P: AsRef<Path>, Q: AsRef<Path>>(original: P, link: Q) -> Result<()> {
    let result = Command::new("cmd")
        .arg("/C").arg("mklink")
        .arg(link.as_ref().as_os_str())
        .arg(original.as_ref().as_os_str())
        .output().context("mklink subprocess failed")?;
    if !result.status.success() {
        return Err(Error::msg("Failed to create symlink"))
            .context(format!("Details: {:?}", result));
    }
    Ok(())
}

GoLang appears to have given up on symlinks outright and just does extra copies

Rather than making users jump through these hoops, it might be easiest to just manually reimplement "symlinks" as vanilla text files, so the key file would simply contain the path to the data file and lookup() would read that file to find where the data is stored.

Executable missing

So when I try to run a script in the current directory

PATH=.:$PATH bkt --ttl 24h -- FHQ

also can't run

bkt --ttl 24h -- ./FHQ

I get

thread 'main' panicked at 'Executable missing', /usr/share/cargo/registry/bkt-0.6.1/src/lib.rs:1065:64
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

however

PATH=.:$PATH FHQ

works

SHELL=/bin/zsh

Feature Request: return when value changes (differs from previous invocation)

bkt stores state across invocations, and can diffs easily. I suppose it's not currently diffing data, just conditionally refreshing based on time and flags.

I find myself wanting to only see new data when a source changes, and a tool that caches data seems like the right place make a change.

PRs accepted?

It's possible this feature falls outside the scope of what maintainers want to maintain.

Record cache statistics

It would be nice to log statistics about cached invocations, such as execution durations and hits vs. misses. This could be used to optimize TTLs and observe unexpected cache misses.

Any such logging should be opt-in at the call site - e.g. behind a --log_statistics flag - and ideally also gated by a Cargo feature (though the feature could be enabled by default as long as it doesn't incur too much overhead).

I'm envisioning writing to an append-only file in the cache directory and using a simple schema such as a TSV. Columns might include:

Timestamp
Scope
Command name (i.e. argv-0)
Full command line
cwd
Parent PID
Hit / miss / force / refresh (i.e. executed asynchronously via --warm or --stale)
Exit status
Execution time (0 for hits)
Remaining TTL (0 for misses)
Remaining TTStale (0 for misses)

It would be nice to find some prior art of other CLI tools that log local statistics like this.

Get bkt into the major package managers

I don't have much experience with this, but filing a bug to keep track of this over time. Package managers to consider: https://github.com/junegunn/fzf#using-linux-package-managers.

cli::discard_failures_in_background is flaky

Reported by @tranzystorek-io in #12 (comment) - example failure blaming this line:

---- cli::discard_failures_in_background stdout ----
thread 'cli::discard_failures_in_background' panicked at 'SystemTime { tv_sec: 1648411197, tv_nsec: 770703005 } !> SystemTime { tv_sec: 1648411197, tv_nsec: 770703005 }', tests/cli.rs:227:9

bkt: stdout pipe failed: failed to write whole buffer

Hi,
I recently ran into an error with bkt:

bkt -- find . -type f | head -n1

./file
bkt: Subprocess execution failed: stdout pipe failed: failed to write whole buffer

This only seem to happen with many files/lines, but I don't get the error when piping the stdout to wc -l, so maybe the issue only happens when the process on the right side of the pipe closes the stdout of the bkt process before it could write out it's whole output (e.g. head, fzf).

bkt -- find . -type f | wc -l

The stdout is actually correct in both cases so it sounds like this bkt error is not really necessary to show to the user.

Reproduced with:

bkt version: 0.7.0 (with cargo install bkt)
macOS 13.5 (arm64)
a fairly up to date arch linux (aarch64).

Consider using assert_cmd

https://docs.rs/assert_cmd might help simplify some of the boilerplate in cli.rs. Need to experiment with it further.

[Feature suggestion] Flag to report bkt invocation values, and bkt context

I'd like a flag analogous to report Bkt status values. A one line output

As a baseline, a report of this invocation
Bkt: in_cache: false, ttl: 600s, stale?, warm?, force?

Perhaps repeated use of the flag, or a different flag, can report environment state consistent across invocations
Bkt context: cache_dir: ..., scope, env(Hash): 5a5a5a, command: "[expensive-command]", cwd?

Suggested names: [i]nclude (as in curl) / [v]erbose / [d]ebug / [r]eport? / [s]tatus.

Key off an external file's modtime

I'm not sure how common/useful this would really be, but I can imagine wanting to use a file's modtime as an input to a command's cache key (alongside the cwd or environment). As a toy example:

$ bkt --modtime infrequent_updates.json --ttl=1d -- jq -f infrequent_updates.json '{foo: .bar}'

This assumes some occasional background job writes updates to infrequent_updates.json. We can generally rely on the cached data, so a long TTL is preferable, but we want to pick up updates as quickly as possible.

Users can accomplish this today using --scope like so:

$ bkt --scope="$(stat -c '%n:%Y' infrequent_updates.json)" --cwd --ttl=1d -- jq -f infrequent_updates.json '{foo: .bar}'

But this means launching two additional processes (the command substitution subprocess and stat) on each call. Doing a modtime check in-process would be much lighter weight.

Default cache directory should be user-specific

so i really like bkt alot, however on multiuser system. or when root is using bkt -- somecommand, and after wards user tries bkt -- somesamecommandasother user ending up with above msg :'(

$ bkt -- ....
bkt: Cache lookup failed: Failed to access cache file: Permission denied (os error 13)

strace revelead to me:

openat(AT_FDCWD, "/tmp/bkt-0.5-cache/keys/D6A5253D88DA21DC", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)

not sure if this is a build or runtime option? probably i'm using it wrong and there's options. but would prefer to have it just work without options (thinking)

Cooperative locking

bash-cache provides optional support for cooperative locking via bc::locking_cache, which ensures concurrent calls to the same function will not race. It would be nice to add something similar to bkt.

There's already a simple FileLock implementation, though it doesn't support waiting for the lock to release at present. notify looks useful, but a simple polling loop might be sufficient (and more lightweight) given that the process won't be doing anything else until the lock releases.

Feature request: stream the subprocess' stdout and stderr while it's running

Hi!

Really useful cli tool. I use this mostly for downloading large files in a cronjob.
Something that I'm currently lacking is a strategy for streaming the progress to stdout. Both in interactive and in CI-scripts.

What are your thoughts about this?

Several tests occasionally flake due to missing cwd

---- bkt_tests::with_env stdout ----
thread 'bkt_tests::with_env' panicked at 'assertion failed: `(left == right)`
  left: `"shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory\n"`,
 right: `""`', src/lib.rs:1317:9
stack backtrace:
   0: rust_begin_unwind
             at /usr/src/rustc-1.63.0/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /usr/src/rustc-1.63.0/library/core/src/panicking.rs:142:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /usr/src/rustc-1.63.0/library/core/src/panicking.rs:181:5
   4: bkt::bkt_tests::with_env
             at ./src/lib.rs:1317:9
   5: bkt::bkt_tests::with_env::{{closure}}
             at ./src/lib.rs:1311:5
   6: core::ops::function::FnOnce::call_once
             at /usr/src/rustc-1.63.0/library/core/src/ops/function.rs:248:5
   7: core::ops::function::FnOnce::call_once
             at /usr/src/rustc-1.63.0/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I looked at this a bit, and couldn't make sense of it, it seems to only happen inside the build environment and not with a normal cargo test on an arm64 machine, it also doesn't happen on any other architectures.

Build log: https://ci.debian.net/data/autopkgtest/testing/arm64/r/rust-bkt/34500844/log.gz

cache_tests::stable_hash test fails on s390x

Output:

---- cache_tests::stable_hash stdout ----
thread 'cache_tests::stable_hash' panicked at 'assertion failed: `(left == right)`
  left: `"038D651019410545"`,
 right: `"7D208C81E8236995"`', src/lib.rs:869:9
stack backtrace:
   0: rust_begin_unwind
             at /usr/src/rustc-1.63.0/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /usr/src/rustc-1.63.0/library/core/src/panicking.rs:142:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /usr/src/rustc-1.63.0/library/core/src/panicking.rs:181:5
   4: bkt::cache_tests::stable_hash
             at ./src/lib.rs:869:9
   5: bkt::cache_tests::stable_hash::{{closure}}
             at ./src/lib.rs:868:5
   6: core::ops::function::FnOnce::call_once
             at /usr/src/rustc-1.63.0/library/core/src/ops/function.rs:248:5
   7: core::ops::function::FnOnce::call_once
             at /usr/src/rustc-1.63.0/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

This looks like it might be a big/small endian problem.

Build log: https://ci.debian.net/data/autopkgtest/testing/s390x/r/rust-bkt/34502277/log.gz

BrokenPipe panic when stdout is closed early

Example:

$ bkt -- echo | head -n0
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 32, kind: BrokenPipe, message: "Broken pipe" }', /usr/local/google/home/diamondm/.cargo/registry/src/github.com-1ecc6299db9ec823/bkt-0.3.1/src/main.rs:77:48
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

In more typical use cases (e.g. head -n10) we'll only hit this error with sufficiently large output (IIRC ~60k lines on Linux), since the OS buffers output between processes, but either way bkt shouldn't blow up if its output descriptors close.

bkt should make scope "filename-safe"

My very first use of bkt was to cache the output of a kubectl command. I want the results scoped to a particular cluster, so I did this:

export BKT_SCOPE=$(kubectl config view --minify -o json | jq .clusters[0].cluster.server)

This ends up with BKT_SCOPE having a value like https://api.cluster.example.com:6443, which ultimately causes bkt to fail with:

bkt: Cache write failed: No such file or directory (os error 2)

And this is because it's trying to use the scope literally as part of a pathname:

1142219 rename("/tmp/bkt-0.7-cache/keys/tmp-symlink.TysE1Bh1WR7I7TDH", "/tmp/bkt-0.7-cache/keys/\"https://api.cluster.example.com:6443\".4EBF7E540DA5C7C7") = -1 ENOENT (No such file or directory)

Either...

(a) The documentation should make it clear that scope value is used verbatim in a pathname and must not contain any "path sensitive" characters (and bkt should probably throw an error if it detects things like / or \ in the scope value), or

(b) bkt should sanitize the path (e.g., to only ascii alphanumerics or something), so that the above might result in the path .../keys/https___api_cluster_example_com_6443....

Possible Race When Starting Many Parallel Bkt Instances

I use bkt in my .zshenv, which, among other things, runs on every parallel arm of a GNU parallel invocation. So I'm starting bkt in parallel many times.

Sometimes, I see:

bkt: Cache lookup failed: Failed to remove expired data: No such file or directory (os error 2)

Could this be a race between bkt invocations deleting expired data? Seems like something that could be made idempotent (... by ignoring ENOENT)?

Mechanism to report cache age to callers

In #46 @huyz asked about including some messaging like Command cached 2h 3m ago in bkt's output. My initial reaction was to say bkt should be transparent and shouldn't modify a command's output in any way, but the cache age is a piece of information only bkt knows, so it does seem like a potentially useful piece of data to expose in some way.

I think there's a lot of intricacies with including this information in bkt's output (e.g. in the thread there's discussion of how to format the time delta, and you'd presumably need some sort of format string support to enable different associated text like "Offline for", not to mention localization). So I'm hesitant to go that way, but maybe there's something reasonable we could do with that.

Another option would be a separate flag / command like --stats that prints some structured information about the cached data such as when it was last cached, so that a caller could query that information to compute a time delta and print the messaging they want. This could also be used to help inform when to --warm a cache, for example.

Just spitballing right now so please drop some comments if you would be interested in functionality along these lines (or start a separate discussion if that would be easier).

cannot build rev 0.5.4 without updating cargo.lock

the cargo.lock update to 0.5.4 was not included in the 0.5.4 release

Flag to exit immediately on cache misses

Please, consider adding a flag to return null when there is no caching results, but to request execution of the command in background.

I'm trying to implement a nushell cli wrapper, that would ask for results of cli execution asynchroniously.

I have a list of parameters, that I want to request information about in parallel threads from the app that retrieves such information from some server in the internet. BKT would be helpful here, if it just returned null instantaneously, and in background asked for information from cli. And, on the next call it would return the requested information.

Mechanism for invalidating (parts of) the cache

An easy option is to add an --invalidate flag that, rather than retrieving the command simply deletes the cache key associated with the given command, like so:

$ bkt -- date
Sun 21 Nov 2021 11:17:03 AM PST

$ bkt -- date
Sun 21 Nov 2021 11:17:03 AM PST

$ bkt --invalidate -- date

$ bkt -- date
Sun 21 Nov 2021 11:17:13 AM PST

But it might be preferable to support some sort of more powerful invalidation, such as the ability to invalidate all calls to a given command, e.g.:

$ bkt -- date +%T
11:18:19

$ bkt -- date +%T
11:18:19

$ bkt --invalidate=date

$ bkt -- date +%T
11:18:32

However bkt doesn't currently have any way to introspect the cache like this, short of a linear search. That might be fine, but otherwise either the key space would need to be redesigned or some secondary index would need to be maintained.

Also --scope can be used to invalidate the cache, as long as all callers participate in using the right scope.

panicked at 'overflow when subtracting duration from instant'

I have included bkt in one of our internal tool. I got at least 5 complaints already from users having such error (MacOs users):

thread 'main' panicked at 'overflow when subtracting duration from instant', library/std/src/time.rs:424:33
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

or (distinct column number on this one) Here it's bkt v0.7.1

thread 'main' panicked at 'overflow when subtracting duration from instant', library/std/src/time.rs:424:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Clearing the bkt-cache folder was our quick fix.

I'm sorry I don't have a backtrace yet, but will try to provide it next time it happen. I thought it might be worth to already open that issue.

dimo414 / bkt Goto Github PK

bkt's People

Contributors

Stargazers

Watchers

Forkers

bkt's Issues

Footnotes

Recommend Projects

Recommend Topics

Recommend Org