
keepsake's Introduction

πŸ“£ This project is not actively maintained. If you'd like to help maintain it, please let us know.


Keepsake

Version control for machine learning.

Keepsake is a Python library that uploads files and metadata (like hyperparameters) to Amazon S3 or Google Cloud Storage. You can get the data back out using the command-line interface or a notebook.

  • Track experiments: Automatically track code, hyperparameters, training data, weights, metrics, Python dependencies β€” everything.
  • Go back in time: Get back the code and weights from any checkpoint if you need to replicate your results or commit to Git after the fact.
  • Version your models: Model weights are stored on your own Amazon S3 or Google Cloud bucket, so it's really easy to feed them into production systems.

How it works

Just add two lines to your training code:

import torch
import keepsake

def train():
    # Save training code and hyperparameters
    experiment = keepsake.init(path=".", params={...})
    model = Model()

    for epoch in range(num_epochs):
        # ...

        torch.save(model, "model.pth")
        # Save model weights and metrics
        experiment.checkpoint(path="model.pth", metrics={...})

Then Keepsake will start tracking everything: code, hyperparameters, training data, weights, metrics, Python dependencies, and so on.
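
The params and metrics arguments are plain Python dictionaries. As a minimal sketch with hypothetical values (record whatever hyperparameters and metrics matter for your project; primary_metric is assumed here to control how the "best" checkpoint is chosen):

import keepsake

# Hypothetical values, for illustration only
experiment = keepsake.init(
    path=".",
    params={"learning_rate": 0.001, "num_epochs": 100},
)

# ... inside the training loop ...
experiment.checkpoint(
    path="model.pth",
    metrics={"train_loss": 0.46, "val_loss": 0.15},
    # Assumption: a (metric, direction) pair tells Keepsake how to rank checkpoints
    primary_metric=("val_loss", "minimize"),
)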

  • Open source & community-built: We’re trying to pull together the ML community so we can build this foundational piece of technology together.
  • You're in control of your data: All the data is stored on your own Amazon S3 or Google Cloud Storage as plain old files. There's no server to run.
  • It works with everything: TensorFlow, PyTorch, scikit-learn, XGBoost, you name it. It's just saving files and dictionaries – export however you want.

Features

Throw away your spreadsheet

Your experiments are all in one place, with filter and sort. Because the data's stored on S3, you can even see experiments that were run on other machines.

$ keepsake ls --filter "val_loss<0.2"
EXPERIMENT   HOST         STATUS    BEST CHECKPOINT
e510303      10.52.2.23   stopped   49668cb (val_loss=0.1484)
9e97e07      10.52.7.11   running   41f0c60 (val_loss=0.1989)

Analyze in a notebook

Don't like the CLI? No problem. You can retrieve, analyze, and plot your results from within a notebook. Think of it like a programmable TensorBoard.
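
For example, a minimal sketch, assuming the Python API also exposes keepsake.experiments.list() alongside the get()/best() calls shown later in this README:

import keepsake
import pandas as pd

rows = []
for exp in keepsake.experiments.list():
    best = exp.best()  # best checkpoint, if a primary metric was recorded
    if best is None:
        continue
    rows.append({
        "experiment": exp.id[:7],
        "learning_rate": exp.params.get("learning_rate"),
        "val_loss": best.metrics.get("val_loss"),
    })

# Sort experiments by validation loss and plot it against the learning rate
df = pd.DataFrame(rows).sort_values("val_loss")
df.plot(x="learning_rate", y="val_loss", kind="scatter")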

Compare experiments

It diffs everything, all the way down to versions of dependencies, just in case that latest TensorFlow version did something weird.

$ keepsake diff 49668cb 41f0c60
Checkpoint:       49668cb     41f0c60
Experiment:       e510303     9e97e07

Params
learning_rate:    0.001       0.002

Python Packages
tensorflow:       2.3.0       2.3.1

Metrics
train_loss:       0.4626      0.8155
train_accuracy:   0.7909      0.7254
val_loss:         0.1484      0.1989
val_accuracy:     0.9607      0.9411

Commit to Git, after the fact

If you eventually want to store your code on Git, there's no need to commit everything as you go. Keepsake lets you get back to any point you called experiment.checkpoint(), so you can commit to Git once you've found something that works.

$ keepsake checkout f81069d
Copying code and weights to working directory...

# save the code to git
$ git commit -am "Use hinge loss"

Load models in production

You can use Keepsake to feed your models into production systems. Connect them back to how they were trained, who trained them, and what their metrics were.

import keepsake
import torch

model = torch.load(keepsake.experiments.get("e45a203").best().open("model.pth"))
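
The same objects also carry the training context, so a deployment script can log where a model came from. A small sketch (attribute names are assumptions based on the API shown above):

import keepsake

experiment = keepsake.experiments.get("e45a203")
checkpoint = experiment.best()

print(experiment.params)   # hyperparameters the model was trained with
print(checkpoint.metrics)  # metrics recorded at that checkpoint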

Install

pip install -U keepsake

Get started

If you prefer training scripts and the CLI, follow our tutorial to learn how Keepsake works.

If you prefer working in notebooks, follow our notebook tutorial on Colab.

If you like to learn concepts first, read our guide about how Keepsake works.

Get involved

Everyone uses version control for software, but it is much less common in machine learning.

Why is this? We spent a year talking to people in the ML community and this is what we found out:

  • Git doesn’t work well with machine learning. It can’t handle large files, it can’t handle key/value metadata like metrics, and it can’t commit automatically in your training script. There are some solutions for this, but they feel like band-aids.
  • It should be open source. There are a number of proprietary solutions, but something so foundational needs to be built by and for the ML community.
  • It needs to be small, easy to use, and extensible. We found people struggling to integrate with β€œAI Platforms”. We want to make a tool that does one thing well and can be combined with other tools to produce the system you need.

We think the ML community needs a good version control system. But, version control systems are complex, and to make this a reality we need your help.

Have you strung together some shell scripts to build this for yourself? Are you interested in the problem of making machine learning reproducible?

Here are some ways you can help out:

Contributing & development environment

Take a look at our contributing instructions.

keepsake's People

Contributors

andreasjansson, bfirsh, dependabot-preview[bot], dependabot[bot], enochkan, gabrielmbmb, gan3sh500, justinchuby, kvthr, murthy95, ronit-j, ryanbloom, samuelstevens, techytushar, thinkmake, vastolorde95, zeke


keepsake's Issues

Heartbeats are broken

After just a few minutes, running experiments are reported as "stopped"

$ replicate ls
═══║ Fetching new data from "gs://andreas-bookrec-2"...
EXPERIMENT  STARTED        STATUS   HOST          USER     LEARNING_RATE  LATEST CHECKPOINT   TEST_LOSS  BEST CHECKPOINT   TEST_LOSS
1971ade     17 hours ago   stopped  35.229.78.80  andreas  0.0001         0fd8422 (step 398)  0.79566                              
eb51e11     6 minutes ago  running                root     0.001          d3d8dd2 (step 1)    1.0395     d3d8dd2 (step 1)  1.0395  
andreas@Andreass-MBP:~/r8/book-rec
$ replicate ls
═══║ Fetching new data from "gs://andreas-bookrec-2"...
EXPERIMENT  STARTED        STATUS   HOST          USER     LEARNING_RATE  LATEST CHECKPOINT   TEST_LOSS  BEST CHECKPOINT   TEST_LOSS
1971ade     17 hours ago   stopped  35.229.78.80  andreas  0.0001         0fd8422 (step 398)  0.79566                              
eb51e11     6 minutes ago  stopped                root     0.001          d3d8dd2 (step 1)    1.0395     d3d8dd2 (step 1)  1.0395  

(eb51e11 is actually still running)

It's really hard to see failures during build

Try to spot the error at a glance:

═══║ Using directory: /Users/andreas/r8/book-rec
═══║ Building Docker image...
═══║ Found CUDA driver on remote host: 440.33.01
═══║ No CUDA version specified in replicate.yaml, using CUDA 10.2 and CuDNN 8
═══║ Using base image: us.gcr.io/replicate/base-ubuntu18.04-python3.8-cuda10.2-cudnn8:0.3
═══║ Running '/usr/local/bin/docker build . --build-arg BUILDKIT_INLINE_CACHE=1 --build-arg BASE_IMAGE=us.gcr.io/replicate/base-ubuntu18.04-python3.8-cuda10.2-cudnn8:0.3 --progress plain --file - --tag replicate-6724bb9308c32a404992f11db4d0dc1c622360fba56c840ded84e6b7a1c0494e --build-arg HAS_GPU=1'
═══║ Uploading /Users/andreas/r8/book-rec to [email protected]:/tmp/replicate/upload/aWV4GzipkssLRZe7sYGn
#2 [internal] load build definition from Dockerfile
#2 transferring dockerfile: 433B done
#2 DONE 0.0s

#1 [internal] load .dockerignore
#1 transferring context: 2B done
#1 DONE 0.0s

#3 [internal] load metadata for us.gcr.io/replicate/base-ubuntu18.04-python...
#3 DONE 0.2s

#4 [1/6] FROM us.gcr.io/replicate/base-ubuntu18.04-python3.8-cuda10.2-cudnn...
#4 DONE 0.0s

#5 [internal] load build context
#5 transferring context: 47.82kB 0.0s done
#5 DONE 0.0s

#6 [2/6] COPY requirements.txt /tmp/requirements.txt
#6 CACHED

#7 [3/6] RUN pip install -r /tmp/requirements.txt
#7 0.991 Collecting matplotlib
#7 1.021   Downloading matplotlib-3.3.0-1-cp38-cp38-manylinux1_x86_64.whl (11.5 MB)
#7 2.607 Collecting tensorboard==1.14.0
#7 2.618   Downloading tensorboard-1.14.0-py3-none-any.whl (3.1 MB)
#7 2.830 Collecting faiss-cpu==1.6.3
#7 2.842   Downloading faiss_cpu-1.6.3-cp38-cp38-manylinux2010_x86_64.whl (7.2 MB)
#7 4.122 ERROR: Could not find a version that satisfies the requirement pytorch==1.4.0 (from -r /tmp/requirements.txt (line 4)) (from versions: 0.1.2, 1.0.2)
#7 4.122 ERROR: No matching distribution found for pytorch==1.4.0 (from -r /tmp/requirements.txt (line 4))
#7 4.234 WARNING: You are using pip version 20.1.1; however, version 20.2.1 is available.
#7 4.234 You should consider upgrading via the '/root/.pyenv/versions/3.8.4/bin/python3.8 -m pip install --upgrade pip' command.
#7 ERROR: executor failed running [/bin/sh -c pip install -r /tmp/requirements.txt]: runc did not terminate sucessfully
------
 > [3/6] RUN pip install -r /tmp/requirements.txt:
------
failed to solve with frontend dockerfile.v0: failed to build LLB: executor failed running [/bin/sh -c pip install -r /tmp/requirements.txt]: runc did not terminate sucessfully
═══║ Process exited with status 1

Replace pyenv with deadsnakes ppa (or something else?)

We're currently using pyenv to install arbitrary python versions. This is easy, but adds a layer of indirection that may bite us later.

The best(?) alternative is the deadsnakes ppa, but it installs python 3.x under the name python3.x, which may be confusing for users. It's possible to symlink /usr/bin/python -> /usr/bin/python3.x, /usr/bin/pip -> /usr/bin/pip3.x, etc., but that may cause problems with tools like apt that depend on python being Python 2.7.

Command in replicate show is missing quotes

When I do

$ replicate run -H 35.229.78.80 -m /opt/data/ml-25m:/tmp/book-rec/data python train.py -c shallow -p "gpu = True"

I get

Created:           Thu, 01 Oct 2020 12:02:28 CEST
Status:            stopped
Host:              
User:              root
Command:           train.py -c shallow -p gpu = True
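
One way to render the stored command faithfully would be to re-quote each argument before joining, e.g. with shell-style quoting (a sketch in Python, not the CLI's actual implementation):

import shlex

args = ["python", "train.py", "-c", "shallow", "-p", "gpu = True"]
print(shlex.join(args))
# prints: python train.py -c shallow -p 'gpu = True'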

Simple way to install a development version of replicate locally

We used to be able to just run make install in the cli/ directory, but now that the build process is more complicated, that doesn't work. I now run make build from the top-level directory, then pip install -e . in python/, and then for the CLI I have to use

$ ~/r8/replicate/python/build/bin/replicate run python train.py

We should have a make develop command at the top level that runs pip install -e . or python setup.py develop, and symlinks the CLI binary to the right place.

Environment variables (including secret keys) are exposed in `ps` on remote host

ps aux gives me this (with keys scrubbed):

ubuntu   22502  0.0  0.0  13312  3204 ?        Ss   17:32   0:00 bash -c export AWS_SECRET_ACCESS_KEY=*** SENDGRID_API_KEY=*** AWS_ACCESS_KEY_ID=*** REPLICATE_NO_ANALYTICS=1 VSBL_SECRET_ACCESS_KEY=*** VSBL_ACCESS_KEY_ID=*** DOCKER_BUILDKIT=1 IPY_TEST_SIMPLE_PROMPT=1 CI_AWS_ACCESS_KEY_ID=*** INSIDE_EMACS=27.0.91,comint CI_AWS_SECRET_ACCESS_KEY=*** ZOOM_API_SECRET=*** GO111MODULE=on ZOOM_API_KEY=***; cd /tmp/replicate/upload/YBHS7j1OWGdr4w0bm1uu; docker build . --build-arg BUILDKIT_INLINE_CACHE=1 --build-arg BASE_IMAGE=us.gcr.io/replicate/base-ubuntu18.04-python3.8-cuda10.1-cudnn7-pytorch1.4.0:0.3 --progress plain --file - --tag replicate-02079df5641e3b841fddd2bf4bc6be9e021794ad6b84dd614c5f9216bda432ca --build-arg HAS_GPU=1

We should find a better way to forward environment variables, without exposing them like this.
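
One possible approach (a sketch only, not necessarily the right fix) is to send the variables over the SSH connection's stdin rather than embedding them in the remote command line, so they never show up in ps:

import subprocess

env = {"AWS_ACCESS_KEY_ID": "***", "AWS_SECRET_ACCESS_KEY": "***"}

# Build an `export` preamble locally and pipe it to a remote shell via stdin;
# the remote process list then only shows "bash -s", not the secret values.
preamble = "".join(f"export {key}={value!r}\n" for key, value in env.items())
script = preamble + "cd /path/to/upload && docker build .\n"  # hypothetical remote command

subprocess.run(
    ["ssh", "ubuntu@remote-host", "bash", "-s"],  # hypothetical host
    input=script,
    text=True,
    check=True,
)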

replicate delete should confirm

Currently it just deletes whatever you pass to it. It should prompt to ask if you're sure, ideally by first listing what it will delete (with hashes plus hyperparameters/metrics). And then we also need a way to force delete without interaction.

replicate run doesn't exit when docker process fails on remote

Steps to repro:

  • replicate run on remote host with some long-running script
  • In a different shell, log in to remote host, kill the container

Expected result:

  • replicate run process exits

Actual results:

  • replicate run process is still running (hangs)

`replicate ls` output is very wide

[screenshot: replicate ls output that is too wide to fit the terminal]

Not sure what to do about it, really; I'm interested in all the data that's displayed (except maybe user and host, since I'm doing this project alone). So maybe we can just close this ticket unless you have a clever idea, @bfirsh?

On remote `replicate run`, host is missing and user is `root`

This seems new: the experiment I ran yesterday had correct data, but the new one I started today looks like this:

$ replicate ls
═══║ Fetching new data from "gs://andreas-bookrec-2"...
EXPERIMENT  STARTED        STATUS   HOST          USER     LEARNING_RATE  LATEST CHECKPOINT   TEST_LOSS  BEST CHECKPOINT   TEST_LOSS
1971ade     17 hours ago   stopped  35.229.78.80  andreas  0.0001         0fd8422 (step 398)  0.79566                              
eb51e11     8 minutes ago  stopped                root     0.001          7638366 (step 2)    1.0475     d3d8dd2 (step 1)  1.0395  

A few python versions are missing tensorflow

Some Python versions that are supposed to have compatible TensorFlow versions according to https://www.tensorflow.org/install/source#tested_build_configurations don't actually have compatible versions. This seems to happen with new-ish patch versions of Python; I'm guessing TensorFlow only adds support for them after some delay.

Maybe we should build base images with less recent python versions (latest - 1)? Or automatically find the latest python version that actually supports all torch and tensorflow versions it's supposed to support.

Creating new GCS bucket in Colab throws error 400

Authenticating with

from google.colab import auth
auth.authenticate_user()

and then using Replicate with a new gs:// bucket name raises:

═══║ Error creating experiment: Error creating bucket: Failed to create bucket gs://replicate-logo-generation: googleapi: Error 400: Unknown project id: , invalid

It's hard to explain what a "label" is and what a "metric" is

I guess this is a bug, so filing here! It's a design bug.

Writing the documentation and fiddling with the user interface, it's weird that there is a concept of both a "label" and a "metric". The things you pass to commit() are "labels", but if you define a "metrics" section in replicate.yaml, those "labels" get, um, upgraded to "metrics".

That is non-intuitive unless you actually spell it out and people learn it, which is not ideal. Both those concepts are exposed when showing a commit, so they're things users need to understand. It would be much better if we could come up with a word that applies to both, and you could "augment" that thing with extra meaning in replicate.yaml. That way users don't have to learn two concepts.

Either labels or metrics could work as the universal word, but neither is ideal:

  • If they were called "metrics", then semantically some of the things you pass to commit() aren't metrics (e.g. just a string description of something is very much not a metric). Maybe that's ok?

  • If they were called "labels", then the key in replicate.yaml would look a bit like this:

    labels:
      - name: loss
        goal: minimize
    

    Which is a bit weird, because you're defining metrics, but maybe makes sense if you think of them as "labels" that you're augmenting with meaning.

Shrug. Anyway. Not a major design bug but the current thing doesn't seem optimal.

Ability to manually specify experiment ID in `replicate.init`

Two use cases:

  1. Resuming an experiment: If a training node dies, you're in luck because you've used Replicate to save all your progress to the cloud. But there's currently no way to resume an experiment. The simplest way to enable resume would be to let the user set an experiment ID manually

  2. AI Platform: When you launch an AI Platform training job, you have to give it a job name. You probably want this linked to an experiment. You could do that right now using params, but it feels cleaner to have the AI Platform job ID also be the Replicate experiment ID (see the sketch below).
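
To make the proposal concrete, a hypothetical call might look like this (experiment_id is a proposed argument, not an existing one):

import replicate

job_id = "ai-platform-job-2020-10-01"  # e.g. reuse the AI Platform job name

experiment = replicate.init(
    params={"learning_rate": 0.001},
    experiment_id=job_id,  # proposed: pin the ID so a resumed run attaches to the same experiment
)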

experiment.save is exposed in the python API

I was tired and accidentally wrote experiment.save rather than experiment.commit. My linter didn't complain because .save() is a method on Experiment. The error I got was

Traceback (most recent call last):
  File "train.py", line 145, in <module>
    train()
  File "train.py", line 140, in train
    train_loop(model, train_dl, test_dl)
  File "train.py", line 116, in train_loop
    experiment.save(
TypeError: save() got an unexpected keyword argument 'step'

Might be worth prefixing "private" methods with underscore.

Fetching new data is slow

It takes about 9 seconds to fetch new data in my current project:

$ (replicate ls --json | jq '.[].num_checkpoints') 2>&1 | ts
Oct 06 18:36:18 ═══║ Fetching new data from "gs://andreas-bookrec-2"...
Oct 06 18:36:27 399
Oct 06 18:36:27 2952
Oct 06 18:36:27 332
Oct 06 18:36:27 27

Which means that almost everything I do with Replicate takes at least 9 seconds. It makes my whole process feel very sluggish.

In this project I have 4 experiments, with between 26 and 2952 checkpoints.

Allow specifying storage_url in replicate.init()

This is potentially controversial and I'm curious to hear your thoughts.

When using AI Platform, you launch jobs by pointing to the path of a Python package. AI Platform then packages that with sdist locally, uploads it to GCS, and executes a module in that package on the worker node.

This makes it really hard to use replicate.yaml, since we don't know which directory the training script is executed in, nor do we know the exact path to the installed package, so we can't include replicate.yaml in the package and point to it with --project-directory.

The only alternative I can see is for us to add a project_directory argument to replicate.init(). But then we'd have two ways of specifying project directories, which may be confusing and hard to maintain. Still, my hunch is that AI Platform isn't the only place we'll run into this issue, so we should probably support it.

Notebooks are always running

The heartbeat keeps running in a notebook, so experiments never stop until you stop the notebook. Also, there are lots of heartbeat processes.

`replicate ls` fails if experiments are currently being deleted

It's an edge case but filing it anyway, since delete is quite slow and some people surely will run into this:

$ replicate ls
═══║ Fetching new data from "gs://andreas-bookrec-2"...
═══║ Get: path does not exist:
   β”‚ gs://andreas-bookrec-2/metadata/checkpoints/672a061518abb13fa4b32257556f3402ee77c6585054d39e847948bd50b41842.json
