r-three / git-theta

git extension for {collaborative, communal, continual} model development

License: Apache License 2.0

Python 92.40% Shell 7.60%

git-theta's Introduction

Git-Theta

Git-Theta is a Git extension for collaborative, continual, and communal development of machine learning models.

Version control systems like Git enable large distributed teams to collaborate on shared codebases by tracking changes over time and providing tools for merging changes from multiple sources. Git-Theta is a Git extension that aims to provide similar functionality for machine learning model checkpoints by efficiently and meaningfully tracking a model's version history natively through Git. Specifically, rather than treating the checkpoint as a blob of data (as done by other systems for tracking models with Git), Git-Theta

atomically tracks each parameter "group" (e.g. a weight matrix or bias vector in a neural network)
tracks dense or communication-efficient updates like low-rank or sparse changes to parameter groups
allows models to be merged automatically or manually
displays meaningful "diffs" by showing which parameter groups have changed
supports checkpoint formats from most popular machine learning frameworks
enables easy extension of update types, merging methods, and checkpoint formats through a plugin system

Git-Theta is currently under active development and should be used with caution. For feature discussions and debugging help, please join the #git-theta stream in the CCCML Zulip community. If you use Git-Theta as part of a published research project, please cite our paper.

Quick Start

Installing Git LFS

Download and install Git LFS using the instructions from the Git LFS website.

Installing Git-Theta

Install the git-theta Python package:

pip install git-theta

By default, installing git-theta with pip will not install any of the supported machine learning frameworks (PyTorch, TensorFlow, etc.). If you want to install the framework you intend to use when installing git-theta, you can specify it when installing (e.g. by running pip install git-theta[pytorch] for PyTorch).

Configure Git to use Git-Theta when tracking model checkpoints:

git theta install

Tracking a model

Say you have a codebase for training a model along with the model's checkpoint:

my_codebase
├── model.pt
└── train.py

Git-Theta allows you to use Git to track the changes to your code and your model's parameters in tandem. To use Git-Theta to track the model checkpoint, first run

git theta track model.pt

This will create or update the .gitattributes file that tells Git to use Git-Theta to handle the checkpoint file. You can then add and commit the .gitattributes file:

git add .gitattributes
git commit

After tracking the model, you can use regular Git commands (add, commit, push, pull, checkout, status, diff, etc.) as if the checkpoint file were any other file. To add and commit the initial version of the checkpoint, simply run

git add model.pt
git commit

Storing updates efficiently

Additionally, git theta add can be used instead of git add to provide optional extra information, such as the checkpoint format with --checkpoint-type, the Update used to update parameters with --update-type, and the location of auxiliary information/data for the update with --update-path. For example, if the model was updated using LoRA, the low-rank factors can be efficiently stored by Git-Theta by running:

# After training with LoRA and saving the factors to updates.pt...
git theta add model.pt --update-type low-rank --update-path updates.pt
git commit

Merging updates

Git-Theta can also handle merging of models trained with differing updates. For example, if an existing model is further trained on a new branch called alternate-training:

git checkout -b alternate-training
# After performing training...
git add model.pt
git commit

and is separately trained on the main branch:

git checkout main
# After some other training...
git add model.pt
git commit

We can then merge the updates from the alternate-training branch via a standard git merge:

git merge alternate-training

Git-Theta supports various methods for automatically merging models, including parameter averaging. The merge tool shows us each parameter that is different between the two models and asks what merge operation to perform.

Efficiently tracking updates

Git-Theta supports various workflows for efficiently tracking updates to a checkpoint.

Parameter groups

Under the hood, Git-Theta tracks changes to a checkpoint at the parameter group level. A parameter group is a semantically-grouped collection of parameters like a weight matrix or bias vector in a neural network. Parameter groups are determined based on the structure of the checkpoint file itself as specified in the format-specific Checkpoint class. In the simplest case where all of the parameters of a model are updated, Git-Theta will effectively store an entirely new copy of the checkpoint. However, if only a subset of the model's parameter groups are updated, Git-Theta will only store the updates to the changed parameter groups, which saves space and communication costs. Similarly, if a model is updated by adding new parameter groups, Git-Theta will only store the new parameter groups.
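
The effect of tracking at the parameter group level can be illustrated with a minimal sketch. It assumes each parameter group is a flat list of floats and uses an exact SHA-256 hash for comparison; hash_group and changed_groups are hypothetical helpers written for this illustration and are not part of Git-Theta, which uses its own hashing scheme.

```python
import hashlib
import struct


def hash_group(values):
    """Hash a flat list of floats (a stand-in for a parameter group's values)."""
    packed = struct.pack(f"{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


def changed_groups(previous, current):
    """Return the names of parameter groups that are new or whose values changed."""
    prev_hashes = {name: hash_group(vals) for name, vals in previous.items()}
    return sorted(
        name
        for name, vals in current.items()
        if prev_hashes.get(name) != hash_group(vals)
    )


old = {"layer1/weight": [0.1, 0.2], "layer1/bias": [0.0]}
new = {"layer1/weight": [0.1, 0.2], "layer1/bias": [0.5], "layer2/weight": [1.0]}

# Only the modified bias and the newly added layer would need to be stored.
print(changed_groups(old, new))  # ['layer1/bias', 'layer2/weight']
```

In this sketch the unchanged layer1/weight group is skipped entirely, which is where the storage and communication savings come from.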

Parameter-efficient updates

Beyond updating a subset of a model's parameter groups, Git-Theta also natively supports parameter-efficient updates. Examples of parameter-efficient updates include updating a sparse subset of the model's parameters (as in FISH Mask or Diff Pruning) or applying a low-rank update (as in LoRA). There are multiple workflows for efficiently tracking parameter-efficient updates with Git-Theta.

Saving update information as new parameter groups

A simple way to track parameter-efficient updates is to store the information required to produce the update (e.g., the low-rank factors for LoRA or the indices and values for a sparse update) as new parameter groups in the checkpoint file itself. In this case, model code handles creating and applying the update and the checkpoint is saved and loaded as usual.

Pros:

Simple to implement.
Original checkpoint and updates are bundled together and saving and loading is done as usual without special logic.

Cons:

Checkpoint saving may result in unnecessary writes of unchanged parameters.
If many subsequent parameter-efficient updates are made, the number of parameter groups stored in the checkpoint file could become onerous.

After saving update information in the checkpoint, the new checkpoint can be committed simply using git add and git commit as usual.

Applying updates to existing parameter groups before saving

A second option is to apply the updates to the parameter groups before saving them. Git-Theta will treat these updates in the same way it treats updating all parameters in a parameter group, so this approach sacrifices any savings to communication or storage costs that would have been achieved by using a parameter-efficient method.

Pros:

Similar to saving update information as new parameter groups, this is simple to implement and only involves handling a single checkpoint file.
The checkpoint can be used as-is without any special logic for re-applying the update.

Cons:

Sacrifices any communication/storage savings from using a parameter-efficient update.
Checkpoint saving may result in unnecessary writes of unchanged parameters.

After folding the updates into the parameter groups, the model can be saved, added, and committed as usual.

Saving update information externally

Another option is to save parameter-efficient update information in a separate file from the original checkpoint. This maintains storage and communication efficiency at the cost of requiring additional implementation overhead.

Pros:

Only the parameter updates are saved, reducing storage requirements.
Only updated parameters are saved during the training loop, removing wasteful writes.
Makes it easy to work with multiple datasets via different file names or branches.

Cons:

Implementation overhead. Training code needs to be able to segment out and save only the parameters that have changed. Inference code needs to know how to load both the original checkpoint and the update from the new checkpoint as well as how to merge them.
The original checkpoint and parameter updates are decoupled, running the risk that one could be changed without appropriately modifying the other.

Assuming we have already committed the original model, the auxiliary information checkpoint needs to be separately added and committed as normal.

Using Git-Theta to incorporate external update information

To streamline the workflow of saving update information externally, Git-Theta has functionality for applying the update as part of the version control process. This ties together the main model checkpoint and the update checkpoint to prevent them from diverging. In addition, Git-Theta takes care of applying the update so that the model checkpoint can be used as-is after checkout. Git-Theta assumes that the update information checkpoint uses the same format as the original checkpoint and that the names of updates are prefixed by the name of the parameter group they are applied to. For example, if a parameter group called /layer1/weights was updated with a low-rank update, then Git-Theta would look for parameters named /layer1/weights/R and /layer1/weights/C in the update information checkpoint based on the naming conventions in the LowRankUpdate class. The low-rank update can then be efficiently tracked and applied with Git-Theta via

git theta add /path/to/original/checkpoint.ckpt --update-type low-rank --update-path /path/to/updates.ckpt
git commit

Note that using this approach requires using git theta add instead of just git add to allow for additional command line arguments. Updates that involve modifying existing parameters (rather than just completely replacing them) are referred to by Git-Theta as "incremental updates" and are handled via a plugin system (described below).
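
As a rough illustration of the naming convention, the sketch below looks up the /R and /C factors by prefixing them with the parameter group name and folds their product into the original weights. It assumes the update is W + R @ C with R of shape (m, r) and C of shape (r, n); that orientation, the nested-list representation, and the helper names are assumptions made for this example rather than details taken from the LowRankUpdate class.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_low_rank(weights, updates, name):
    """Return weights[name] + R @ C, finding the factors by prefixed name."""
    r = updates[f"{name}/R"]
    c = updates[f"{name}/C"]
    delta = matmul(r, c)
    return [
        [w + d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weights[name], delta)
    ]


checkpoint = {"/layer1/weights": [[1.0, 0.0], [0.0, 1.0]]}
update_ckpt = {
    "/layer1/weights/R": [[1.0], [2.0]],  # shape (2, 1), assumed left factor
    "/layer1/weights/C": [[0.5, 0.5]],    # shape (1, 2), assumed right factor
}

print(apply_low_rank(checkpoint, update_ckpt, "/layer1/weights"))
# [[1.5, 0.5], [1.0, 2.0]]
```

Only the two small factors need to be stored for the update; the full matrix is reconstructed when it is needed.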

Managing model development with Git-Theta

Git-Theta provides a principled and rigorous way to keep track of different versions of a model based on the standard version control workflow.

Tracking the progression of a model

Pre-trained models are increasingly being continually updated to make them applicable to new tasks and domains. For example, a pre-trained language model might be adapted to a new objective, process text in a new domain, and improve its instruction-following capabilities before being fine-tuned on a target task. Git-Theta allows the provenance of these steps to be straightforwardly tracked using Git's built-in functionality. Apart from committing each model to keep track of a checkpoint's history, other Git functionality like tagging can be used to keep track of notable versions. When checking out a particular version of a model, Git-Theta will only download what's required to reconstruct it and won't download any files that have already been cached.

Tracking different versions of a model

Model development is not always straightforward - often we want to try out different versions of a base model, or we might create different versions that are applicable to different tasks. Git-Theta supports this mode of development natively through Git's branch feature: create a new branch (git checkout -b), modify the model, and add and commit it as usual. This provides a straightforward workflow for trying out different ways to update a model. If parameter groups are shared across checkpoints being tracked by Git-Theta (whether they are on the same or different branches), Git-Theta will only store a single copy of each parameter group. Contributors can also develop their own updated versions of a model by forking the base repository.

Merging models

If different versions of a model are created on different branches or repositories, Git-Theta will handle merging them. When git merge is run and there is a merge conflict between two histories of a model, Git-Theta will automatically open its merge tool. Git-Theta's merge tool currently supports basic resolution patterns like choosing the parameters from one of the models or merging parameter groups via averaging. For more sophisticated merges, the environment variable GIT_THETA_MANUAL_MERGE can be set to true when performing the merge operation, i.e.

export GIT_THETA_MANUAL_MERGE=True
git merge ${other-branch}

and the merge tool will write out 3 copies of the model, one for each branch being merged and an additional one that represents the model at the most recent commit in the history of both branches. The merge tool will also specify where to save the merged model. After the merged model has been saved to the specified location, a merge commit can be created as usual.
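
When merging manually from the three copies, one possible per-parameter-group policy is sketched below: keep a group that only one side changed, and average groups that both sides changed. This is an illustrative policy over plain dicts of float lists, not Git-Theta's merge implementation; merge_groups and the example data are hypothetical.

```python
def merge_groups(ancestor, ours, theirs):
    """Three-way merge of {name: list of floats} checkpoints."""
    merged = {}
    for name in sorted(set(ancestor) | set(ours) | set(theirs)):
        a, o, t = ancestor.get(name), ours.get(name), theirs.get(name)
        if o == t:
            choice = o  # both sides agree (including both deleting the group)
        elif o == a:
            choice = t  # only their branch changed this group
        elif t == a:
            choice = o  # only our branch changed this group
        elif o is None or t is None:
            choice = o if t is None else t  # modified on one side, deleted on the other
        else:
            # Both sides changed the group: fall back to element-wise averaging.
            choice = [(x + y) / 2 for x, y in zip(o, t)]
        if choice is not None:
            merged[name] = choice
    return merged


ancestor = {"w": [1.0, 1.0], "b": [0.0]}
ours = {"w": [2.0, 1.0], "b": [0.0]}
theirs = {"w": [4.0, 3.0], "b": [1.0]}

merged = merge_groups(ancestor, ours, theirs)
print(merged)  # {'b': [1.0], 'w': [3.0, 2.0]}
```

The common-ancestor copy is what makes it possible to tell "changed on one side" apart from "changed on both sides".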

Sharp Edges

Git-Theta aims to support all standard Git workflows. However, there are some situations that Git-Theta does not currently support.

Git Rebase

Currently, git rebase is not supported when special update types are used. Additionally, repeated merge-conflict resolution, which is often encountered in a rebase, can be onerous for large models.

Octopus Merges

Currently, Git-Theta's merge utilities are optimized for (and only tested for) 3-way merges where two branches with a shared ancestor commit are merged together. We are working on support for Octopus merges where multiple branches are all combined at once.

Under the hood

This section describes how Git-Theta works in more detail.

Git-Theta's filters

Git offers several points of customization where specialized, model-aware Git-Theta versions of various tools are run. Git has a "working tree" where human-facing files live and a "staging area" where copies of working tree files live before they are stored in Git. When a file is moved from the working tree to the staging area, the "clean filter" is run. When it is moved back, the "smudge filter" is run. Git-Theta provides model-aware versions of these filters.

When a model checkpoint is cleaned (git add):

Git-Theta reads the checkpoint from the working tree using a plug-in system to support different deep-learning frameworks.
Git-Theta converts the checkpoint into a tree of parameter group names that map to parameter values.
Git-Theta records metadata for each parameter group, including a hash of the parameter values.
Git-Theta compares the metadata for the current parameter group with its previous value. If the metadata doesn't match, the parameter is serialized and then saved using Git LFS. The Git LFS metadata is recorded in the metadata file.
The metadata is written to the staging area.

Thus, Git itself only tracks the model metadata; actual values are stored efficiently in Git LFS. Additionally, by checking for matching metadata, only changed parameters are stored.
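
The clean steps above can be sketched in a few lines. The sketch assumes a checkpoint is a nested dict whose leaves are flat lists of floats, uses a plain SHA-256 hash instead of Git-Theta's hashing scheme, and only records which groups would be handed to Git LFS; flatten and clean are hypothetical names for this illustration.

```python
import hashlib
import json


def flatten(tree, prefix=""):
    """Turn a nested dict into {"path/to/group": values}."""
    flat = {}
    for key, value in tree.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def clean(checkpoint, previous_metadata):
    """Return (metadata, names of groups whose values would need to be stored)."""
    metadata, to_store = {}, []
    for name, values in sorted(flatten(checkpoint).items()):
        digest = hashlib.sha256(json.dumps(values).encode()).hexdigest()
        metadata[name] = {"shape": [len(values)], "hash": digest}
        if previous_metadata.get(name) != metadata[name]:
            to_store.append(name)  # in the real system, serialized and sent to Git LFS
    return metadata, to_store


ckpt = {"layer1": {"weight": [0.1, 0.2], "bias": [0.0]}}
meta1, stored1 = clean(ckpt, {})
print(stored1)  # ['layer1/bias', 'layer1/weight']

ckpt["layer1"]["bias"] = [0.5]
meta2, stored2 = clean(ckpt, meta1)
print(stored2)  # ['layer1/bias']
```

Only the metadata dict would be written to the staging area; the second call shows that an unchanged group is not stored again.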

When a model checkpoint is smudged (git checkout):

The Git-Theta metadata file is retrieved from Git.
For each parameter, the Update plug-in system is used to get actual parameter values.
    a. For updates that change all parameter values, the Git LFS metadata is used to get the values directly.
    b. For parameter-efficient updates, Git LFS metadata is used to get update values, previous parameter values are retrieved from Git itself, and the update is applied.
The parameter values are written into the working tree using the checkpoint plug-in system to handle different deep learning frameworks.

When installing Git-Theta with git theta install, the following lines are added to the global ~/.gitconfig:

[filter "theta"]
    clean = git-theta-filter clean %f
    smudge = git-theta-filter smudge %f
    required = true
[merge "theta"]
    name = Merge Models with Git-Theta
    driver = git-theta-merge %O %A %B %P
[diff "theta"]
    command = git-theta-diff

This configuration defines two Git filter drivers for Git-Theta and registers them under the name theta. In addition, it defines merge and diff programs, also named theta. When git theta track path/to/model is run, an entry is added to the .gitattributes file to configure Git to use Git-Theta. The new entry looks like

path/to/model filter=theta merge=theta diff=theta

This tells Git that anytime a file that matches the pattern path/to/model is processed, it should use the filter/merge/diff driver named theta.
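
For illustration, the bookkeeping behind such an entry can be sketched as follows. This is not the implementation of git theta track; the track function here is a hypothetical helper that only appends the attribute line shown above when it is not already present.

```python
import tempfile
from pathlib import Path


def track(repo_root, pattern):
    """Add a Git-Theta attribute line for `pattern` to .gitattributes if missing."""
    line = f"{pattern} filter=theta merge=theta diff=theta"
    attributes = Path(repo_root) / ".gitattributes"
    existing = attributes.read_text().splitlines() if attributes.exists() else []
    if line not in existing:
        existing.append(line)
        attributes.write_text("\n".join(existing) + "\n")
    return line


# Demonstrate against a throwaway directory standing in for a repository.
with tempfile.TemporaryDirectory() as repo:
    line = track(repo, "path/to/model")
    track(repo, "path/to/model")  # the second call does not duplicate the entry
    contents = (Path(repo) / ".gitattributes").read_text()

print(contents, end="")
```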

Incremental updates

Git-Theta supports updates that are based on the previous version of the parameter values. For example, if a few entries of a parameter group are updated, Git-Theta can avoid storing a new copy of the parameter group; instead, it can be computed on the fly during a smudge filter based on the sparse update and the previous value. Such updates are implemented as subclasses of the IncrementalUpdate class. IncrementalUpdates include references to the commit that holds the last parameter value in their metadata. Then, when the new value is needed, the IncrementalUpdate class will fetch the value of the previous parameter from Git and apply the current update. Because only the update is stored rather than a full copy of the parameter group, this can greatly reduce storage costs. Additionally, this can be done recursively, i.e. Git-Theta will continuously fetch previous values and apply IncrementalUpdates until a self-contained update (such as a Dense update that replaces all parameter values with new ones) is hit.
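
The recursive resolution can be sketched with a toy history. The sketch assumes a sparse update replaces individual entries by index and that each stored update names the commit holding the previous value; StoredUpdate, resolve, and the commit names are hypothetical and do not mirror the IncrementalUpdate class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredUpdate:
    kind: str                       # "dense" or "sparse"
    payload: object                 # list of values (dense) or {index: value} (sparse)
    previous: Optional[str] = None  # commit that holds the previous value, if any


def resolve(history, commit):
    """Rebuild a parameter group's values as of `commit`."""
    update = history[commit]
    if update.kind == "dense":
        # A dense update is self-contained, so the recursion stops here.
        return list(update.payload)
    # Otherwise fetch the previous value first, then apply this update on top.
    values = resolve(history, update.previous)
    for index, value in update.payload.items():
        values[index] = value
    return values


history = {
    "c1": StoredUpdate("dense", [1.0, 2.0, 3.0]),
    "c2": StoredUpdate("sparse", {0: 9.0}, previous="c1"),
    "c3": StoredUpdate("sparse", {2: 7.0}, previous="c2"),
}

print(resolve(history, "c3"))  # [9.0, 2.0, 7.0]
```

Commits c2 and c3 each store a single entry, yet the full value at c3 can still be reconstructed by walking back to the dense update at c1.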

Locality-sensitive hashing

To avoid processing parameter groups that have not been changed, Git-Theta needs a way to determine whether a given parameter group's values have changed. Directly testing for equality or comparing bitwise hashes might be overly strict due to numerical instability and noise that could arise from using incremental updates, different hardware, or different software stacks. Instead, Git-Theta uses locality-sensitive hashing (LSH) for parameter hashes. Specifically, it uses an LSH that approximates Euclidean distance and uses the random-pool approach to hash parameters of variable sizes. Git-Theta's LSH uses 16 hash functions and is calibrated so that two parameter groups with a Euclidean distance less than $10^{-8}$ will have the same hash with a probability of at least $0.99$. Additionally, weights with a distance $\in [10^{-8}, 10^{-6}]$ are double-checked with numpy.allclose.
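
A simplified sketch of a Euclidean LSH with a shared random pool is given below. Each hash function projects the input onto a pseudo-random Gaussian direction and buckets the result, and the projection weights are drawn from a fixed pool so that inputs of any length can be hashed. The pool size, bucket width, and indexing scheme are arbitrary choices for this illustration and are not Git-Theta's calibrated values; only the count of 16 hash functions comes from the description above.

```python
import math
import random

NUM_HASHES = 16      # number of hash functions, as stated above
POOL_SIZE = 1024     # hypothetical pool size
BUCKET_WIDTH = 1e-4  # hypothetical bucket width, not Git-Theta's calibration

_rng = random.Random(0)
POOL = [_rng.gauss(0.0, 1.0) for _ in range(POOL_SIZE)]
OFFSETS = [_rng.uniform(0.0, BUCKET_WIDTH) for _ in range(NUM_HASHES)]


def lsh_signature(values):
    """Return a tuple of NUM_HASHES bucket indices for a flat list of floats."""
    signature = []
    for k in range(NUM_HASHES):
        # Reuse the shared pool so vectors of any length can be projected.
        projection = sum(
            v * POOL[(k * 7919 + i) % POOL_SIZE] for i, v in enumerate(values)
        )
        signature.append(math.floor((projection + OFFSETS[k]) / BUCKET_WIDTH))
    return tuple(signature)


short = lsh_signature([0.1, 0.2, 0.3])
long_vector = lsh_signature([0.01 * i for i in range(5000)])
print(len(short), len(long_vector))
```

Nearby inputs tend to land in the same buckets because a small change in the input moves each projection only slightly, whereas a bitwise hash would change completely.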

Plug-ins

Git-Theta makes heavy use of Python plug-ins to enable users to add support for additional checkpoint formats as well as custom merge patterns and incremental updates. Specifically, Git-Theta currently supports plug-ins for the Checkpoint, Update, and Merge classes. Third-party users can register a plug-in by creating a small installable package that defines the plug-in and registers it as an entry point under the name scope git_theta.plugins.(checkpoints|updates|merges). An example plug-in for JSON formatted checkpoints can be found here. Alternatively, plug-ins can be added directly to the git-theta package by adding new subclasses to the appropriate modules, then declaring them in the entry_points dict in setup.py.
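
As a sketch of what such a registration might look like, a third-party package could declare an entry point in its setup.py along these lines. The entry point group git_theta.plugins.checkpoints comes from the description above; the package name, module path, plug-in name, and class name are hypothetical placeholders.

```python
# Hypothetical setup.py fragment for a third-party checkpoint plug-in.
ENTRY_POINTS = {
    # Group name taken from the scope described above.
    "git_theta.plugins.checkpoints": [
        # "<plug-in name> = <module path>:<class>", all placeholders here.
        "my-format = my_git_theta_plugin.checkpoint:MyFormatCheckpoint",
    ],
}

# In a real setup.py this dict would be passed to setuptools, for example:
# setuptools.setup(name="my-git-theta-plugin", entry_points=ENTRY_POINTS)
print(ENTRY_POINTS)
```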

Development Setup

This project uses black for code formatting and isort for import statement ordering. Additionally, it includes CI that checks for compliance. We include pre-commit hooks that will automatically run black and isort against any python files staged for commit. These hooks can be installed with:

$ pip install -r requirements-dev.txt
$ pre-commit install

When one of these tools must reformat your file, it will show as the pre-commit hook failing and your commit will be cancelled. Reformatted source files will appear in your working directory ready to be re-added to staging (git add). Running git commit -m ${msg} again will result in the hooks passing and the commit actually happening. Note: As your initial commit was blocked, you will probably want to use the same message in the commit that actually goes through.

Citation

If you use git-theta in your work, please cite:

@InProceedings{kandpal-etal-2023-git-theta,
    title={Git-Theta: A Git Extension for Collaborative Development of Machine Learning Models},
    author={Kandpal, Nikhil and Lester, Brian and Muqeeth, Mohammed and Mascarenhas, Anisha and Evans, Monty and Baskaran, Vishal and Huang, Tenghao and Liu, Haokun and Raffel, Colin},
    journal={International Conference on Machine Learning, {ICML}},
    year={2023},
    month={july},
    url={https://arxiv.org/abs/2306.04529},
}

git-theta's People

nkandpa2 vishalathreya blester125 mod-cpu eltociear eunchan24 shism2 afaiyaz006

git-theta's Issues

Figure out storage of the initial checkpoint

Probably needs to be able to refer to an external location for the initial checkpoint since we don't want to store it all in git - it's too big. Might look like git LFS. Bonus: Store the random seed that can be used to reconstruct the initial parameter values.

Adding TensorFlow checkpoint plugin

Remove `iterate_*_leaves`

With the change to using flat maps, we are no longer using the iterate_(dict|dir)_leaves functions.

They should be removed. The biggest part of the effort is that most of the new functions like flatten or walk_dir are only tested indirectly through these iterate functions. The tests need to be updated to test the actual functions we use.

Figure out interface (functions vs command-line tool)

We also need to decide what operations we would support. The obvious requirements for a POC, in order of implementation:

Commit
Apply
Revert
Checkout
Log

Beyond that, we would also want to consider:

Merge
Branch

Read about how Git works under the hood

Add any useful links below

Get git-cml to work for Windows

Create a System Diagram

Define all the pieces in the pipeline
Define input and output for each piece
Define functionality of each piece
Identify which pieces are required for the PoC and which pieces can be built later

Force merge conflicts

Have the global checkpoint checksum written out somewhere so that git always flags a merge conflict.

Unify flattened leaf iterations to flattened maps.

Lots of code uses small changes between (sorted) iteration through (value, key) pairs to do things like intersections and unions.

Convert these functions to use flattened maps and things like dict.update methods.

Support `git reset --hard`

After staging a change to a checkpoint with git theta add /path/to/my-model.pt, we should be able to use git reset --hard to destage the changes and blow away working tree modifications, restoring back to the last commit.

Currently this results in a file not found error for ${git_repo}/path/to/my-model.pt file during one of the smudges.

Build the automatic diff tool and/or interface for creating diff files

Run black formatting on `bin/` scripts.

The scripts in our bin/ directory don't end in .py so they seem to get missed by black (I have confirmed they are missed in the pre-commit hook and I am pretty sure they are missed in the CI lint).

Update both the pre-commit hook and the CI to actually format these files. Will probably result in needing a regex as I think that specifying specific files in pre-commit removes the file-type based default change detection.

use tensorstore async for writing out parameter group files

Update/finalize code for taking a diff file and applying it

More robust logging configuration

Currently logging is only done through the basicConfig and everything is done at the debug level.

We should update this. We should also log to a file (whose location is user-controllable), especially for the clean and smudge filters. Some messages should be at debug level and some at info, with a user-configurable way to control the verbosity of logs.

Ideally there would also be a way to see our debug messages without getting the ones from GitPython as some of their debug logs look like errors (the message about CYGWIN for example) and like they come from git-theta as we are currently configuring the root logger.
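
One way this could look is sketched below: a dedicated git_theta logger that does not propagate to the root logger (so GitPython's debug output stays separate), an optional log file, and a verbosity level read from an environment variable. The variable name GIT_THETA_LOG_LEVEL and the configure_logging helper are assumptions made for this sketch, not existing configuration.

```python
import logging
import os


def configure_logging(log_file=None):
    """Configure a namespaced logger instead of the root logger."""
    # Hypothetical environment variable controlling verbosity.
    level_name = os.environ.get("GIT_THETA_LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO  # fall back on unknown level names

    logger = logging.getLogger("git_theta")
    logger.setLevel(level)
    logger.propagate = False  # keep messages out of the root logger
    if not logger.handlers:
        handler = logging.FileHandler(log_file) if log_file else logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = configure_logging()
logger.info("clean filter started")
logger.debug("only shown when the level is set to DEBUG")
```

Because only the git_theta logger is configured, third-party debug messages are not mixed into the output unless their loggers are enabled separately.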

Define File System for version control

Simplify file object and file path string polymorphic functions

Use something like the @file_or_name decorator to remove our many checks for if the input is a file object or a string.
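
A minimal sketch of the idea follows. It is written from scratch for illustration and is not the API of any existing file_or_name library: the decorator opens the first argument when it is a path and passes file objects through untouched, so the wrapped function only ever deals with file objects.

```python
import functools
import io
import os
import tempfile


def file_or_name(mode="r"):
    """Let the first argument be either an open file object or a path."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(file, *args, **kwargs):
            if isinstance(file, (str, bytes)) or hasattr(file, "__fspath__"):
                # A path was given: open it and close it again afterwards.
                with open(file, mode) as handle:
                    return func(handle, *args, **kwargs)
            return func(file, *args, **kwargs)  # already a file object
        return wrapper
    return decorator


@file_or_name(mode="r")
def count_lines(file):
    return sum(1 for _ in file)


from_object = count_lines(io.StringIO("a\nb\n"))

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("x\ny\nz\n")
from_path = count_lines(tmp.name)
os.remove(tmp.name)

print(from_object, from_path)  # 2 3
```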

Write only parameter groups that have changed rather than all parameter groups in the checkpoint

When running git theta add all of the parameter groups are saved in .git_theta/<path to model> even if only a few of the parameter groups were actually modified. Instead of writing the whole model to disk every time we run git theta add, check what has changed and only write those parameters to disk.

Define a Project Structure for the Repo

Meeting Notes (Running Thread)

January 19th, 2022

Summary of work the previous week

Read the proposal and blog post for VCS for collaborative update of models
Created Drive Folder for project

Meeting Summary

Why do we need sparse updates and other communication efficiency strategies
With large models, updating all the parameters can create very large checkpoints that would become infeasible to store (diff history) and communicate
May not be as much of a problem with small models or models that are rarely updated
Merge updates from models not fully in the scope of this project. Next layer after building a version control system
Fall back on some kind of averaging method, or for newly added layers that are not conflicting it would be a simple merge (Eg kNN), mixture of models
What do we do in the case of merge conflicts that cannot be resolved automatically?
Some form of distillation
Last semester tried to see how we could merge different update methods
Evaluation/Downstream tasks
Differentiate the scope of this project as building something similar to Git but not dealing with CI (continuous integration) just yet
Eventually we may also want to know what data and hyperparameters resulted in that model update. But that’s an added layer
If one were to update a large model, wouldn't one also need to be resource-rich to even load these large models for training?
Yes, but there are ways to run them on a single GPU -> DeepSpeedZero
A very basic version of a VCS using Git with a model stored in ONNX format? So every time you update the model, Git saves your version history?
May support some update types and not others - need to explore this
Does Git only store line-level changes or is it more nuanced?

ToDo List

Please take a look at the notebook and see if you can figure out a cleaner way to update a specific parameter value in the ONNX checkpoint. I'm currently doing initializer[1], it would be nice to choose it by parameter name. And also figure out why it's called "6", etc. And possibly also play around with the on-disk format, see whether it's at all usable by git, etc.

Add `.name` as a `@property` to our checkpoint handler objects

We should record a .name value on each of the plugins. This property on a checkpoint object should return a string so that a get_checkpoint function call with this name would return the class of this object.

This will make things like logging what checkpoint type is used (and making sure we use the same one across multiple cleans, etc) much easier, especially when the value is set via an environment variable.

Prevent users from unintentionally running `git add <checkpoint>`

Regular files are staged with git add <file> while checkpoints are staged with git theta add <checkpoint file>. Talking with users, a common mistake is trying to stage some code by running something like git add . and unintentionally staging a checkpoint file in the current directory. We should prevent this behavior.

Hello world git example

Run on a simple .json file (specifying parameter name -> parameter value)
Implement simple workflow for initial model -> make a change to a model -> produce diff file -> checkout commit
Simple example showing applying and rewinding a few changes

Create binary for pretty-printing diffs

This would determine what to print out when git diff is run.

Update README and examples

Function to rename initializers (variables) in ONNX checkpoints

Implement a function to rename initializers (variables) in ONNX checkpoints. May require traversing the graph and updating all references.

Add documentation compilation and integration testing

Create binary for merging

Probably it should just always designate a merge conflict? We also could eventually implement parameter averaging or allowing merges when complementary sets of parameters are updated.

Change PyTorchCheckpoint to PickledDictCheckpoint

Have multiple aliases in the plugins.

Change name (Again)

Change name to git-theta (both repo name and in the code)

Write proof-of-concept of commit/checkout functionality

Consider writing or repurposing gitattributes parsing/manipulation library

If we are doing a lot of manipulating or parsing of gitattributes files, we might want to roll that out into a separate (well-tested) library, or try to rely on another library for that if possible.

Extract part of one ONNX checkpoint (e.g. a layer) and copy it into another

See e.g. https://github.com/onnx/onnx/blob/main/docs/PythonAPIOverview.md#creating-an-onnx-model-using-helper-functions
https://github.com/onnx/onnx/blob/main/docs/PythonAPIOverview.md#extracting-sub-model-with-inputs-outputs-tensor-names

Function for replacing the value of an initializer in an ONNX checkpoint

Given an initializer name and a new value for a variable, replace the value of the variable with the new value. See e.g. https://github.com/bindog/onnx-surgery/blob/master/surgery.py#L118

Function for accessing a particular parameter by name

Probably requires looping through entire set of variables/initializers. https://github.com/bindog/onnx-surgery/blob/master/surgery.py#L113

Updates to git_cml root

Use TensorStore (or something else) for the leaf nodes
Integrate LFS for tracking files in the git_cml root
Parent directory should be full filename

support other update types

As described under https://github.com/r-three/git-cml/blob/main/README.md#design-notes

Allow specifying checkpoint type

rather than assuming it's a pytorch checkpoint

Update `params` module

Currently the params module uses torch as a dependency only to convert the tensor back into a numpy array.

As we are working on supporting multiple checkpoint formats, can we just use numpy for most of these methods?

Add optional framework installs.

As we have moved to plugins for checkpoint handling, we don't need all of the deep learning frameworks installed all the time. Therefore we don't need to install them all, especially given that they can be heavy.

Update the setup.py to include extras_require for various frameworks that install them with git-theta. Also include some target that installs all the frameworks, or at least some of the most popular ones.

Make subdirectories based on full path to model checkpoint

E.g. hyperformer/tracked_outputs/pytorch_model.bin should appear in .git_cml/hyperformer/tracked_outputs/pytorch_model.bin

Update git-theta metadata file format

Currently the metadata file produced by the clean filter looks like this

{
  "model/scoping/to/param/1-weight shape": List[int],
  "model/scoping/to/param/1-weight dtype": str,
  "model/scoping/to/param/1-weight hash": str,
  ...,
  "model/scoping/to/param/2-bias shape": List[int],
  "model/scoping/to/param/2-bias dtype": str,
  "model/scoping/to/param/2-bias hash": str,
  ...
}

To make fetching metadata for a single parameter easier, we are converting to a nested format:

{
  "model/scoping/to/param/1-weight": {
      "tensor_metadata": {
        "shape": List[int],
        "dtype": str,
        "hash": str,
      },
  },
  ...,
  "model/scoping/to/param/2-bias": {
      "tensor_metadata": {
        "shape": List[int],
        "dtype": str,
        "hash": str,
      },
  },
  ...,
}

Tensor metadata is in its own nested dict because we may eventually add other keys like git_theta_metadata for tracking things like update types.

Note: We need a consistent serialization order (lexical sort on keys of each dict) when writing to disk to support diffs.
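
The conversion and the ordering requirement can be sketched as follows. The sketch assumes flat keys have the form "<parameter name> <field>" as in the first example and uses json.dumps with sort_keys=True to get a stable order; nest_metadata and the sample values are hypothetical.

```python
import json


def nest_metadata(flat):
    """Convert {"<name> <field>": value} into {name: {"tensor_metadata": {field: value}}}."""
    nested = {}
    for key, value in flat.items():
        name, field = key.rsplit(" ", 1)
        entry = nested.setdefault(name, {"tensor_metadata": {}})
        entry["tensor_metadata"][field] = value
    return nested


flat = {
    "model/layer/1-weight shape": [2, 3],
    "model/layer/1-weight dtype": "float32",
    "model/layer/1-weight hash": "abc123",
    "model/layer/2-bias shape": [3],
    "model/layer/2-bias dtype": "float32",
    "model/layer/2-bias hash": "def456",
}

nested = nest_metadata(flat)

# sort_keys=True sorts the keys of every nested dict, so the same metadata
# always serializes to the same text regardless of insertion order.
serialized = json.dumps(nested, sort_keys=True, indent=2)
reordered = json.dumps(nest_metadata(dict(reversed(list(flat.items())))), sort_keys=True, indent=2)
print(serialized == reordered)  # True
```

A stable serialization means that a text diff of two metadata files only shows parameters whose metadata actually changed.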

Make checkpoint backend for PyTorch files

Should take a PyTorch checkpoint and basically construct a dict-like object that is keyed by parameter name and whose values are the parameter values.
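A rough sketch of such a dict-like object, assuming the checkpoint has already been loaded into a (possibly nested) dictionary, for example by `torch.load`. The class name `FlatCheckpoint` and the "/"-joined naming scheme are assumptions made for illustration, and plain lists stand in for tensors so the example runs without PyTorch.

```python
from collections.abc import Mapping


class FlatCheckpoint(Mapping):
    """Read-only, dict-like view of a checkpoint keyed by parameter name."""

    def __init__(self, state_dict):
        self._params = dict(self._flatten(state_dict))

    @staticmethod
    def _flatten(tree, prefix=()):
        # Walk nested dicts and yield ("a/b/c", value) pairs for the leaves.
        for key, value in tree.items():
            path = prefix + (str(key),)
            if isinstance(value, Mapping):
                yield from FlatCheckpoint._flatten(value, path)
            else:
                yield "/".join(path), value

    def __getitem__(self, name):
        return self._params[name]

    def __iter__(self):
        # Iterate in sorted order so callers see a stable parameter order.
        return iter(sorted(self._params))

    def __len__(self):
        return len(self._params)


checkpoint = FlatCheckpoint(
    {"encoder": {"weight": [1.0, 2.0], "bias": [0.0]}, "head.weight": [3.0]}
)
names = list(checkpoint)
```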

Make git diff prettier

Involves making a custom difftool

Add basic integration tests

Add a simple test that creates a PyTorch checkpoint and does a few operations on it; set up continuous integration to run it.

Replace init with install and track and support specifying checkpoint format

Replace git cml init with git cml install (only run once) and git cml track (run separately for each file).

git cml track should specify both the file to be tracked and the checkpoint format. There will need to be an attribute/metadata file somewhere in .git_cml that specifies that the checkpoint is a given format.

Create a format for representing incremental changes

As a starting point, this could be:
Operation type (e.g. dense update to a parameter)
Parameter name
New value
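The three fields above could be captured in a small record type. This is only a sketch of the starting point described here: the `ParameterUpdate` class, the `"dense"` operation label, and `apply_updates` are hypothetical names, and lists stand in for tensors.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ParameterUpdate:
    """One incremental change to a single parameter."""

    operation: str       # operation type, e.g. "dense"
    parameter_name: str  # which parameter is being changed
    new_value: Any       # the value the operation carries


def apply_updates(parameters, updates):
    """Return a new parameter dict with the updates applied in order."""
    result = dict(parameters)
    for update in updates:
        if update.operation != "dense":
            # Only the dense case is sketched; other types are out of scope.
            raise ValueError(f"unsupported operation: {update.operation}")
        result[update.parameter_name] = update.new_value
    return result


original = {"layer/weight": [1.0, 2.0], "layer/bias": [0.0]}
updated = apply_updates(
    original, [ParameterUpdate("dense", "layer/bias", [0.5])]
)
```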

Write down designs for possible git integrations

Use TensorStore for tracked files

It should still store the parameters and how they are changed.

Investigate using git for tracking sparse updates and git smudge to apply them.

Instead of tracking/applying sparse updates manually (for example, storing them in a different directory), can we just check in sparse updates and then move backwards through git history to build the real value (applying the updates)?

I have written a recursive smudge: when you smudge a file, the output includes the file's content at each point in the history where it changed (and the commit where the change happened).

#!/bin/bash
# Recursive smudge demo: $1 is the file path, $2 is the commit to start from (default HEAD).

FILE="$1"
COMMIT="${2:-HEAD}"

echo "----------------------------" >> /tmp/smudge.log
echo "${COMMIT}" >> /tmp/smudge.log

# On recursive calls, start from the parent so the same commit is not found again.
if [ "${COMMIT}" != "HEAD" ]; then
  PREV_COMMIT="${COMMIT}~1"
else
  PREV_COMMIT="${COMMIT}"
fi

echo "${PREV_COMMIT}" >> /tmp/smudge.log

echo "I'm running smudge"
# Most recent commit reachable from PREV_COMMIT that touched the file.
LAST_CHANGE=$(git rev-list -1 "${PREV_COMMIT}" -- "${FILE}")
echo "${LAST_CHANGE}" >> /tmp/smudge.log

if [ -z "${LAST_CHANGE}" ]; then
  exit 0
else
  echo "The last time this file changed was ${LAST_CHANGE}"
  git show "${LAST_CHANGE}:${FILE}"
  # Recurse by re-invoking this script on the commit just found.
  "$0" "${FILE}" "${LAST_CHANGE}"
fi

Note, we can't run something like git checkout ${COMMIT} from inside a smudge but we can run things like git show and git rev-list.

We can apply this same idea to parameters. Reading in a sparse update will recurse backwards through history until it hits a dense update. Once the dense update is reached (it just returns the value), each sparse update (read from git) will be applied as we move back up the stack.
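The recursion can be illustrated without git at all. The sketch below assumes the history of one parameter is available as a list, oldest first, and that a sparse update overwrites individual positions; the record layout and the `reconstruct` function are hypothetical, and a real version would read each update via something like git show.

```python
def reconstruct(history, index=None):
    """Rebuild a parameter value by recursing back to the last dense update.

    history is a list of updates, oldest first. Each update is either
    {"type": "dense", "value": [...]} or
    {"type": "sparse", "changes": {position: new_value}}.
    """
    if index is None:
        index = len(history) - 1
    if index < 0:
        raise ValueError("no dense update found in history")

    update = history[index]
    if update["type"] == "dense":
        # Base case: a dense update just returns the full value.
        return list(update["value"])

    # Recurse first, then apply this sparse update on the way back up.
    value = reconstruct(history, index - 1)
    for position, new_value in update["changes"].items():
        value[position] = new_value
    return value


history = [
    {"type": "dense", "value": [0.0, 0.0, 0.0, 0.0]},
    {"type": "sparse", "changes": {1: 5.0}},
    {"type": "sparse", "changes": {3: 2.0, 1: 7.0}},
]
current = reconstruct(history)
```

Older sparse updates are applied before newer ones because each call applies its own changes only after the recursive call returns.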

The main open questions are:

Does this still work when we hit a commit with multiple parents (from a merge, for example)?
Can TensorStore read a tensor when the binary blob (and the metadata file) are byte sequences from git show?
- If it can't, this solution would need to write the blobs to a temporary location, causing an extra read/write per updated parameter. This could be mitigated by only changing updated parameters but could be costly otherwise.
