
keepsake's Introduction

πŸ“£ This project is not actively maintained. If you'd like to help maintain it, please let us know.


Keepsake

Version control for machine learning.

Keepsake is a Python library that uploads files and metadata (like hyperparameters) to Amazon S3 or Google Cloud Storage. You can get the data back out using the command-line interface or a notebook.

  • Track experiments: Automatically track code, hyperparameters, training data, weights, metrics, Python dependencies β€” everything.
  • Go back in time: Get back the code and weights from any checkpoint if you need to replicate your results or commit to Git after the fact.
  • Version your models: Model weights are stored on your own Amazon S3 or Google Cloud bucket, so it's really easy to feed them into production systems.

How it works

Just add two lines to your training code:

import torch
import keepsake

def train():
    # Save training code and hyperparameters
    experiment = keepsake.init(path=".", params={...})
    model = Model()

    for epoch in range(num_epochs):
        # ...

        torch.save(model, "model.pth")
        # Save model weights and metrics
        experiment.checkpoint(path="model.pth", metrics={...})

Then Keepsake will start tracking everything: code, hyperparameters, training data, weights, metrics, Python dependencies, and so on.
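
The params and metrics arguments are plain Python dictionaries. As a minimal sketch with hypothetical values (record whatever hyperparameters and metrics matter for your project; primary_metric is assumed here to control how the "best" checkpoint is chosen):

import keepsake

# Hypothetical values, for illustration only
experiment = keepsake.init(
    path=".",
    params={"learning_rate": 0.001, "num_epochs": 100},
)

# ... inside the training loop ...
experiment.checkpoint(
    path="model.pth",
    metrics={"train_loss": 0.46, "val_loss": 0.15},
    # Assumption: a (metric, direction) pair tells Keepsake how to rank checkpoints
    primary_metric=("val_loss", "minimize"),
)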

  • Open source & community-built: We’re trying to pull together the ML community so we can build this foundational piece of technology together.
  • You're in control of your data: All the data is stored on your own Amazon S3 or Google Cloud Storage as plain old files. There's no server to run.
  • It works with everything: TensorFlow, PyTorch, scikit-learn, XGBoost, you name it. It's just saving files and dictionaries – export however you want.

Features

Throw away your spreadsheet

Your experiments are all in one place, with filter and sort. Because the data's stored on S3, you can even see experiments that were run on other machines.

$ keepsake ls --filter "val_loss<0.2"
EXPERIMENT   HOST         STATUS    BEST CHECKPOINT
e510303      10.52.2.23   stopped   49668cb (val_loss=0.1484)
9e97e07      10.52.7.11   running   41f0c60 (val_loss=0.1989)

Analyze in a notebook

Don't like the CLI? No problem. You can retrieve, analyze, and plot your results from within a notebook. Think of it like a programmable TensorBoard.
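
For example, a minimal sketch, assuming the Python API also exposes keepsake.experiments.list() alongside the get()/best() calls shown later in this README:

import keepsake
import pandas as pd

rows = []
for exp in keepsake.experiments.list():
    best = exp.best()  # best checkpoint, if a primary metric was recorded
    if best is None:
        continue
    rows.append({
        "experiment": exp.id[:7],
        "learning_rate": exp.params.get("learning_rate"),
        "val_loss": best.metrics.get("val_loss"),
    })

# Sort experiments by validation loss and plot it against the learning rate
df = pd.DataFrame(rows).sort_values("val_loss")
df.plot(x="learning_rate", y="val_loss", kind="scatter")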

Compare experiments

It diffs everything, all the way down to versions of dependencies, just in case that latest TensorFlow version did something weird.

$ keepsake diff 49668cb 41f0c60
Checkpoint:       49668cb     41f0c60
Experiment:       e510303     9e97e07

Params
learning_rate:    0.001       0.002

Python Packages
tensorflow:       2.3.0       2.3.1

Metrics
train_loss:       0.4626      0.8155
train_accuracy:   0.7909      0.7254
val_loss:         0.1484      0.1989
val_accuracy:     0.9607      0.9411

Commit to Git, after the fact

If you eventually want to store your code on Git, there's no need to commit everything as you go. Keepsake lets you get back to any point you called experiment.checkpoint(), so you can commit to Git once you've found something that works.

$ keepsake checkout f81069d
Copying code and weights to working directory...

# save the code to git
$ git commit -am "Use hinge loss"

Load models in production

You can use Keepsake to feed your models into production systems. Connect them back to how they were trained, who trained them, and what their metrics were.

import keepsake
import torch

model = torch.load(keepsake.experiments.get("e45a203").best().open("model.pth"))
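
The same objects also carry the training context, so a deployment script can log where a model came from. A small sketch (attribute names are assumptions based on the API shown above):

import keepsake

experiment = keepsake.experiments.get("e45a203")
checkpoint = experiment.best()

print(experiment.params)   # hyperparameters the model was trained with
print(checkpoint.metrics)  # metrics recorded at that checkpoint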

Install

pip install -U keepsake

Get started

If you prefer training scripts and the CLI, follow our tutorial to learn how Keepsake works.

If you prefer working in notebooks, follow our notebook tutorial on Colab.

If you like to learn concepts first, read our guide about how Keepsake works.

Get involved

Everyone uses version control for software, but it is much less common in machine learning.

Why is this? We spent a year talking to people in the ML community and this is what we found out:

  • Git doesn’t work well with machine learning. It can’t handle large files, it can’t handle key/value metadata like metrics, and it can’t commit automatically in your training script. There are some solutions for this, but they feel like band-aids.
  • It should be open source. There are a number of proprietary solutions, but something so foundational needs to be built by and for the ML community.
  • It needs to be small, easy to use, and extensible. We found people struggling to integrate with β€œAI Platforms”. We want to make a tool that does one thing well and can be combined with other tools to produce the system you need.

We think the ML community needs a good version control system. But, version control systems are complex, and to make this a reality we need your help.

Have you strung together some shell scripts to build this for yourself? Are you interested in the problem of making machine learning reproducible?

Here are some ways you can help out:

Contributing & development environment

Take a look at our contributing instructions.

keepsake's People

Contributors

andreasjansson, bfirsh, dependabot-preview[bot], dependabot[bot], enochkan, gabrielmbmb, gan3sh500, justinchuby, kvthr, murthy95, ronit-j, ryanbloom, samuelstevens, techytushar, thinkmake, vastolorde95, zeke


keepsake's Issues

Heartbeats are broken

After just a few minutes, running experiments are reported as "stopped"

$ replicate ls
═══║ Fetching new data from "gs://andreas-bookrec-2"...
EXPERIMENT  STARTED        STATUS   HOST          USER     LEARNING_RATE  LATEST CHECKPOINT   TEST_LOSS  BEST CHECKPOINT   TEST_LOSS
1971ade     17 hours ago   stopped  35.229.78.80  andreas  0.0001         0fd8422 (step 398)  0.79566                              
eb51e11     6 minutes ago  running                root     0.001          d3d8dd2 (step 1)    1.0395     d3d8dd2 (step 1)  1.0395  
andreas@Andreass-MBP:~/r8/book-rec
$ replicate ls
═══║ Fetching new data from "gs://andreas-bookrec-2"...
EXPERIMENT  STARTED        STATUS   HOST          USER     LEARNING_RATE  LATEST CHECKPOINT   TEST_LOSS  BEST CHECKPOINT   TEST_LOSS
1971ade     17 hours ago   stopped  35.229.78.80  andreas  0.0001         0fd8422 (step 398)  0.79566                              
eb51e11     6 minutes ago  stopped                root     0.001          d3d8dd2 (step 1)    1.0395     d3d8dd2 (step 1)  1.0395  

(eb51e11 is actually still running)

It's really hard to see failures during build

Try to spot the error at a glance:

═══║ Using directory: /Users/andreas/r8/book-rec
═══║ Building Docker image...
═══║ Found CUDA driver on remote host: 440.33.01
═══║ No CUDA version specified in replicate.yaml, using CUDA 10.2 and CuDNN 8
═══║ Using base image: us.gcr.io/replicate/base-ubuntu18.04-python3.8-cuda10.2-cudnn8:0.3
═══║ Running '/usr/local/bin/docker build . --build-arg BUILDKIT_INLINE_CACHE=1 --build-arg BASE_IMAGE=us.gcr.io/replicate/base-ubuntu18.04-python3.8-cuda10.2-cudnn8:0.3 --progress plain --file - --tag replicate-6724bb9308c32a404992f11db4d0dc1c622360fba56c840ded84e6b7a1c0494e --build-arg HAS_GPU=1'
═══║ Uploading /Users/andreas/r8/book-rec to [email protected]:/tmp/replicate/upload/aWV4GzipkssLRZe7sYGn
#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 433B done
#2 DONE 0.0s

#1 [internal] load .dockerignore
#1 transferring context: 2B done
#1 DONE 0.0s

#3 [internal] load metadata for us.gcr.io/replicate/base-ubuntu18.04-python...
#3 DONE 0.2s

#4 [1/6] FROM us.gcr.io/replicate/base-ubuntu18.04-python3.8-cuda10.2-cudnn...
#4 DONE 0.0s

#5 [internal] load build context
#5 transferring context: 47.82kB 0.0s done
#5 DONE 0.0s

#6 [2/6] COPY requirements.txt /tmp/requirements.txt
#6 CACHED

#7 [3/6] RUN pip install -r /tmp/requirements.txt
#7 0.991 Collecting matplotlib
#7 1.021   Downloading matplotlib-3.3.0-1-cp38-cp38-manylinux1_x86_64.whl (11.5 MB)
#7 2.607 Collecting tensorboard==1.14.0
#7 2.618   Downloading tensorboard-1.14.0-py3-none-any.whl (3.1 MB)
#7 2.830 Collecting faiss-cpu==1.6.3
#7 2.842   Downloading faiss_cpu-1.6.3-cp38-cp38-manylinux2010_x86_64.whl (7.2 MB)
#7 4.122 ERROR: Could not find a version that satisfies the requirement pytorch==1.4.0 (from -r /tmp/requirements.txt (line 4)) (from versions: 0.1.2, 1.0.2)
#7 4.122 ERROR: No matching distribution found for pytorch==1.4.0 (from -r /tmp/requirements.txt (line 4))
#7 4.234 WARNING: You are using pip version 20.1.1; however, version 20.2.1 is available.
#7 4.234 You should consider upgrading via the '/root/.pyenv/versions/3.8.4/bin/python3.8 -m pip install --upgrade pip' command.
#7 ERROR: executor failed running [/bin/sh -c pip install -r /tmp/requirements.txt]: runc did not terminate sucessfully
------
 > [3/6] RUN pip install -r /tmp/requirements.txt:
------
failed to solve with frontend dockerfile.v0: failed to build LLB: executor failed running [/bin/sh -c pip install -r /tmp/requirements.txt]: runc did not terminate sucessfully
═══║ Process exited with status 1

Replace pyenv with deadsnakes ppa (or something else?)

We're currently using pyenv to install arbitrary python versions. This is easy, but adds a layer of indirection that may bite us later.

The best(?) alternative is the deadsnakes ppa, but it installs python 3.x under the name python3.x, which may be confusing for users. It's possible to symlink /usr/bin/python -> /usr/bin/python3.x, /usr/bin/pip -> /usr/bin/pip3.x, etc., but that may cause problems with tools like apt that depend on python being Python 2.7.

Command in replicate show is missing quotes

When I do

$ replicate run -H 35.229.78.80 -m /opt/data/ml-25m:/tmp/book-rec/data python train.py -c shallow -p "gpu = True"

I get

Created:           Thu, 01 Oct 2020 12:02:28 CEST
Status:            stopped
Host:              
User:              root
Command:           train.py -c shallow -p gpu = True
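
One way to render the stored command faithfully would be to re-quote each argument before joining, e.g. with shell-style quoting (a sketch in Python, not the CLI's actual implementation):

import shlex

args = ["python", "train.py", "-c", "shallow", "-p", "gpu = True"]
print(shlex.join(args))
# prints: python train.py -c shallow -p 'gpu = True'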

Simple way to install a development version of replicate locally

We used to be able to just run make install in the cli/ directory, but now that the build process is more complicated, that doesn't work. I now run make build from the top-level directory, then pip install -e . in python/, and then for the CLI I have to use

$ ~/r8/replicate/python/build/bin/replicate run python train.py

We should have a make develop command at the top level that runs pip install -e . or python setup.py develop, and symlinks the CLI binary to the right place.

Environment variables (including secret keys) are exposed in `ps` on remote host

ps aux gives me this (with keys scrubbed):

ubuntu   22502  0.0  0.0  13312  3204 ?        Ss   17:32   0:00 bash -c export AWS_SECRET_ACCESS_KEY=*** SENDGRID_API_KEY=*** AWS_ACCESS_KEY_ID=*** REPLICATE_NO_ANALYTICS=1 VSBL_SECRET_ACCESS_KEY=*** VSBL_ACCESS_KEY_ID=*** DOCKER_BUILDKIT=1 IPY_TEST_SIMPLE_PROMPT=1 CI_AWS_ACCESS_KEY_ID=*** INSIDE_EMACS=27.0.91,comint CI_AWS_SECRET_ACCESS_KEY=*** ZOOM_API_SECRET=*** GO111MODULE=on ZOOM_API_KEY=***; cd /tmp/replicate/upload/YBHS7j1OWGdr4w0bm1uu; docker build . --build-arg BUILDKIT_INLINE_CACHE=1 --build-arg BASE_IMAGE=us.gcr.io/replicate/base-ubuntu18.04-python3.8-cuda10.1-cudnn7-pytorch1.4.0:0.3 --progress plain --file - --tag replicate-02079df5641e3b841fddd2bf4bc6be9e021794ad6b84dd614c5f9216bda432ca --build-arg HAS_GPU=1

We should find a better way to forward environment variables, without exposing them like this.
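
One possible approach (a sketch only, not necessarily the right fix) is to send the variables over the SSH connection's stdin rather than embedding them in the remote command line, so they never show up in ps:

import subprocess

env = {"AWS_ACCESS_KEY_ID": "***", "AWS_SECRET_ACCESS_KEY": "***"}

# Build an `export` preamble locally and pipe it to a remote shell via stdin;
# the remote process list then only shows "bash -s", not the secret values.
preamble = "".join(f"export {key}={value!r}\n" for key, value in env.items())
script = preamble + "cd /path/to/upload && docker build .\n"  # hypothetical remote command

subprocess.run(
    ["ssh", "ubuntu@remote-host", "bash", "-s"],  # hypothetical host
    input=script,
    text=True,
    check=True,
)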

replicate delete should confirm

Currently it just deletes whatever you pass to it. It should prompt to ask if you're sure, ideally by first listing what it will delete (with hashes plus hyperparameters/metrics). And then we also need a way to force delete without interaction.

replicate run doesn't exit when docker process fails on remote

Steps to repro:

  • replicate run on remote host with some long-running script
  • In a different shell, log in to remote host, kill the container

Expected result:

  • replicate run process exits

Actual results:

  • replicate run process is still running (hangs)

`replicate ls` output is very wide

[screenshot: replicate ls output that is too wide to fit the terminal]

Not sure what to do about it, really; I'm interested in all the data that's displayed (except maybe user and host, since I'm doing this project alone). So maybe we can just close this ticket unless you have a clever idea, @bfirsh?

On remote `replicate run`, host is missing and user is `root`

This seems new: the experiment I ran yesterday had correct data, but the new one I started today looks like this:

$ replicate ls
═══║ Fetching new data from "gs://andreas-bookrec-2"...
EXPERIMENT  STARTED        STATUS   HOST          USER     LEARNING_RATE  LATEST CHECKPOINT   TEST_LOSS  BEST CHECKPOINT   TEST_LOSS
1971ade     17 hours ago   stopped  35.229.78.80  andreas  0.0001         0fd8422 (step 398)  0.79566                              
eb51e11     8 minutes ago  stopped                root     0.001          7638366 (step 2)    1.0475     d3d8dd2 (step 1)  1.0395  

A few python versions are missing tensorflow

Some Python versions that are supposed to have compatible TensorFlow versions according to https://www.tensorflow.org/install/source#tested_build_configurations don't actually have compatible versions. This seems to happen with new-ish patch versions of Python; I'm guessing TensorFlow only adds support for them after some delay.

Maybe we should build base images with less recent python versions (latest - 1)? Or automatically find the latest python version that actually supports all torch and tensorflow versions it's supposed to support.

Creating new GCS bucket in Colab throws error 400

Authenticating with

from google.colab import auth
auth.authenticate_user()

and then using Replicate with a new gs:// bucket name raises:

═══║ Error creating experiment: Error creating bucket: Failed to create bucket gs://replicate-logo-generation: googleapi: Error 400: Unknown project id: , invalid

It's hard to explain what a "label" is and what a "metric" is

I guess this is a bug, so filing here! It's a design bug.

Writing the documentation and fiddling with the user interface, it's weird that there is a concept of both a "label" and a "metric". The things you pass to commit() are "labels", but if you define a "metrics" section in replicate.yaml, those "labels" get, um, upgraded to "metrics".

That is non-intuitive unless you actually spell it out and people learn it, which is not ideal. Both those concepts are exposed when showing a commit, so they're things users need to understand. It would be much better if we could come up with a word that applies to both, and you could "augment" that thing with extra meaning in replicate.yaml. That way users don't have to learn two concepts.

Either labels or metrics could work as the universal word, but neither is ideal:

  • If they were called "metrics", then semantically some of the things you pass to commit() aren't metrics (e.g. just a string description of something is very much not a metric). Maybe that's ok?

  • If they were called "labels", then the key in replicate.yaml would look a bit like this:

    labels:
      - name: loss
        goal: minimize
    

    Which is a bit weird, because you're defining metrics, but maybe makes sense if you think of them as "labels" that you're augmenting with meaning.

Shrug. Anyway. Not a major design bug but the current thing doesn't seem optimal.

Ability to manually specify experiment ID in `replicate.init`

Two use cases:

  1. Resuming an experiment: If a training node dies, you're in luck because you've used Replicate to save all your progress to the cloud. But there's currently no way to resume an experiment. The simplest way to enable resume would be to let the user set an experiment ID manually

  2. AI Platform: When you launch an AI Platform training job, you have to give it a job name. You probably want this linked to an experiment. You could do that right now using params, but it feels cleaner to have the AI Platform job ID also be the Replicate experiment ID (see the sketch below).
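
To make the proposal concrete, a hypothetical call might look like this (experiment_id is a proposed argument, not an existing one):

import replicate

job_id = "ai-platform-job-2020-10-01"  # e.g. reuse the AI Platform job name

experiment = replicate.init(
    params={"learning_rate": 0.001},
    experiment_id=job_id,  # proposed: pin the ID so a resumed run attaches to the same experiment
)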

experiment.save is exposed in the python API

I was tired and accidentally wrote experiment.save rather than experiment.commit. My linter didn't complain because .save() is a method on Experiment. The error I got was

Traceback (most recent call last):
  File "train.py", line 145, in <module>
    train()
  File "train.py", line 140, in train
    train_loop(model, train_dl, test_dl)
  File "train.py", line 116, in train_loop
    experiment.save(
TypeError: save() got an unexpected keyword argument 'step'

Might be worth prefixing "private" methods with underscore.

Fetching new data is slow

It takes about 9 seconds to fetch new data in my current project:

$ (replicate ls --json | jq '.[].num_checkpoints') 2>&1 | ts
Oct 06 18:36:18 ═══║ Fetching new data from "gs://andreas-bookrec-2"...
Oct 06 18:36:27 399
Oct 06 18:36:27 2952
Oct 06 18:36:27 332
Oct 06 18:36:27 27

Which means that almost everything I do with Replicate takes at least 9 seconds. It makes my whole process feel very sluggish.

In this project I have 4 experiments, with between 26 and 2952 checkpoints.

Allow specifying storage_url in replicate.init()

This is potentially controversial and I'm curious to hear your thoughts.

When using AI Platform, you launch jobs by pointing to the path of a Python package. AI Platform then packages that with sdist locally, uploads it to GCS, and executes a module in that package on the worker node.

This makes it really hard to use replicate.yaml, since we don't know which directory the training script is executed in, nor do we know the exact path to the installed package, so we can't include replicate.yaml in the package and point to it with --project-directory.

The only alternative I can see is for us to add a project_directory argument to replicate.init(). But then we'd have two ways of specifying project directories, which may be confusing and hard to maintain. Still, my hunch is that AI Platform isn't the only place we'll run into this issue, so we should probably support it.

Notebooks are always running

The heartbeat keeps running in a notebook, so experiments never stop until you stop the notebook. Also, there are lots of heartbeat processes.

`replicate ls` fails if experiments are currently being deleted

It's an edge case but filing it anyway, since delete is quite slow and some people surely will run into this:

$ replicate ls
═══║ Fetching new data from "gs://andreas-bookrec-2"...
═══║ Get: path does not exist:
   β”‚ gs://andreas-bookrec-2/metadata/checkpoints/672a061518abb13fa4b32257556f3402ee77c6585054d39e847948bd50b41842.json
