This is a pretty involved feature request centred around cloud usage for fully versioned pipelines (i.e. where both the code and the data are versioned, through a combination such as Git+Dud). The domain is data engineering rather than pure ML, though ML is quite often one of the stages of a larger pipeline.
Suppose that the business logic described by the pipeline is heavily reliant on the cloud, i.e. successive stages are expected to be performed in the cloud on demand. This necessitates storing the data in the cloud as well. If data usage is heavy, emulating a "local" solution such as a shared disk is impractical due to bandwidth limitations (and in a certain sense also runs counter to cloud-native ideology), so we shall assume that the on-demand workers have to fetch their data independently from object storage.
Dud already backs up the cache to object storage by default; however, that in itself does not accomplish the goal. The issue is that `dud run` currently expects all the work to be done locally, with the remote cache acting as a backup to the local cache and local stage files acting as synchronization objects. To work around this, the user is expected to manually execute the following on an on-demand cloud worker node (a rough sketch of such a wrapper script follows the list):
- (git-)pull the stage files from remote
- (dud-)pull the required inputs
- run the job
- (dud-)commit the outputs to local cache
- (dud-)push the outputs to cloud storage
- (git-)commit the Dud state (stage files)
- (git-)push the stage files to remote
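For concreteness, here is a minimal sketch of what such a manual wrapper script might look like on a worker node. The branch name, commit message, and the exact Dud subcommands used for each step (e.g. fetch/checkout rather than a single pull) are illustrative assumptions, not taken verbatim from Dud's documentation:

```sh
#!/usr/bin/env sh
# Manual per-worker workflow as described above; all names are illustrative.
set -eu

# 1. Get the latest stage files (pipeline state) from the Git remote.
git pull --ff-only origin main

# 2. Fetch the required inputs from the remote (object-storage) cache and
#    link them into the workspace.
dud fetch
dud checkout

# 3. Run the job for this stage.
dud run

# 4. Commit the outputs to the local cache, then push them to object storage.
dud commit
dud push

# 5. Record the updated Dud state (stage files) in Git and publish it, so
#    downstream stages can see the new outputs.
git add .
git commit -m "pipeline: outputs of this stage"
git push origin main
```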
It would be great if Dud could do all of that in `dud run`, perhaps with a suitable set of flags!
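To make the ask concrete, one possible shape for such an invocation is sketched below; every flag name is hypothetical and exists only to illustrate the proposal:

```sh
# Hypothetical flags, purely illustrative of the request:
#   --git-pull   first git-pull the stage files from the remote
#   --fetch      dud-pull the required inputs from the remote cache
#   --commit     dud-commit the outputs to the local cache after the run
#   --push       dud-push the outputs to object storage
#   --git-push   git-commit and git-push the updated stage files
dud run --git-pull --fetch --commit --push --git-push
```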
I appreciate that this runs counter to having Dud as a single-purpose tool, but it would take a lot of weight off the developer in a very common (albeit advanced) use case. `dud run` at the moment is already a composite tool, devoted to pipelining rather than data versioning per se. This is natural, as pipelining works much better with full versioning of both code and data, and the feature request is a natural extension of this functionality.
In this solution, synchronisation between the stages is accomplished using Git -- a bit crude, but in keeping with the spirit of reproducible pipelines. The resulting "pipeline log" can be squashed, so I don't see that as a major issue, but there are certainly alternatives, such as using a cloud-based key-value store to synchronize.
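For example, the per-stage commits produced by a run could be collapsed after the fact with an ordinary squash merge (the branch name and message below are illustrative):

```sh
# Collapse the per-stage "pipeline log" into a single commit on the main branch.
git checkout main
git merge --squash pipeline-run
git commit -m "pipeline: squashed results of the run"
```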
I would be interested in your thoughts on the matter.