♾️ CML - Continuous Machine Learning | CI/CD for ML
Home Page: http://cml.dev
License: Apache License 2.0
Users can set a tag prefix to get an inter-branch experiments list and also reports in GitLab; however, none of this will happen if the user does not set up the prefix accordingly.
SSH is no longer among the supported remotes, since DVC's SSH remote is actually backed by SFTP via Paramiko.
The strategy of adding the PEM key was wrong, since the key is actually handled and located by DVC through the config file.
Open questions:
- skip_push, which defaults to true. It skips the whole push process, including both DVC and Git. The report still has to happen (CML only).
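A minimal sketch of how such an option might behave (the skip_push name comes from the discussion above; the function and its wiring are only an illustration, not the actual CML code):

```js
// Hypothetical sketch: honor a skip_push setting that defaults to true.
// `exec` stands for the repo's promise-based shell helper (seen elsewhere on this page).
const push_if_requested = async exec => {
  const { skip_push = 'true' } = process.env;
  if (String(skip_push) === 'false') {
    await exec('dvc push');             // push DVC-tracked data
    await exec('git push origin HEAD'); // push the commit CML created
  }
  // The CML report is still generated either way.
};
```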
Please use something like test -f requirements.txt && pip install -r requirements.txt
@dmpetrov and I have been talking about how we'll build tutorials for dvc-cml. One idea, which I've been building in a repo, is a project where anyone can make a fork and then submit a PR to see the workflow in action.
However, I've found this note on the Settings/Secrets page:
Secrets are not passed to workflows that are triggered by a pull request from a fork. Learn more.
If I understand correctly, this means that if someone in the public (outside DVC) cloned our repo and attempted to make a PR, dvc repro might be triggered BUT the runner would not be able to access credentials, such as the Google Drive credentials needed to push/pull project artifacts. Does this sound correct?
If it's an issue, it seems like we could simply put the credentials in a config file in the repo; I think, with GDrive, this is often alright?
Introduce in the docs an advanced workflow case where one could set different baselines depending on the branch of the job.
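As a rough illustration of what such a docs example could cover (the baseline setting appears elsewhere on this page; the branch detection and the release-branch rule below are assumptions):

```js
// Hypothetical sketch: choose a different diff baseline depending on the branch.
// Assumes the CI exposes the branch via GITHUB_REF (GitHub) or CI_COMMIT_REF_NAME (GitLab).
const branch =
  (process.env.GITHUB_REF || '').replace('refs/heads/', '') ||
  process.env.CI_COMMIT_REF_NAME ||
  '';

// Illustrative rule: release branches compare against a release baseline,
// everything else compares against master.
const baseline = branch.startsWith('release/') ? 'origin/last-release' : 'origin/master';
console.log(`Using baseline ${baseline} for dvc metrics diff`);
```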
simple-git is a wrapper over git that is already in use. All the raw git commands should be replaced with it.
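A minimal sketch of what that replacement could look like, assuming simple-git's promise-based interface (the commands being replaced are illustrative):

```js
// Hypothetical sketch: replace shelled-out git calls with the simple-git wrapper.
const simpleGit = require('simple-git');

const commit_and_push = async message => {
  const git = simpleGit();
  // instead of something like: await exec('git add -A && git commit -m "..." && git push');
  await git.add('.');
  await git.commit(message);
  await git.push('origin', 'HEAD');
};
```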
Use a proper logging library: turn console.log calls into proper logger calls.
@0x2b3bfa0 moved the rest of the items to a separate issue, as @casperdcl suggested in the past weekly meeting (put in a separate issue).
Edit: coming back to this, we should now attack it with the 0.3.0 release. The proposal would be to use winston and a configurable log file. We can also collect the runner heartbeat using the OpenMetrics format.
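A minimal sketch of what a winston-based logger could look like; the environment variable names and file name below are assumptions:

```js
// Hypothetical sketch: a shared winston logger replacing scattered console.log calls.
const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.CML_LOG_LEVEL || 'info', // hypothetical env var for configurability
  format: winston.format.combine(winston.format.timestamp(), winston.format.json()),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: process.env.CML_LOG_FILE || 'cml.log' })
  ]
});

logger.info('runner heartbeat', { uptime: process.uptime() });
```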
Sometimes exec rejects with the error instead of returning it, so throw_err is useless.
Refactoring this may also be a good chance to review easy-git.
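One possible way to make throw_err behave consistently, sketched as a thin wrapper around the existing exec helper (names are illustrative):

```js
// Hypothetical sketch: make throw_err work even when the underlying exec helper
// rejects instead of returning the error.
const exec_safe = async (command, { throw_err = true } = {}) => {
  try {
    return await exec(command); // exec is the repo's existing shell helper
  } catch (err) {
    if (throw_err) throw err;
    return err.message; // swallow the failure and hand back the error text
  }
};
```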
It supports only true/false, while users might need to pull a particular data file, like:
$ dvc pull images
or
$ dvc pull users/ cities/ companies.csv
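A rough sketch of how the option could accept either a boolean or a list of targets (the parsing is only an illustration):

```js
// Hypothetical sketch: let dvc_pull be either 'true'/'false' or a space-separated target list.
// `exec` stands for the repo's promise-based shell helper.
const pull_data = async exec => {
  const { dvc_pull = 'true' } = process.env;
  if (String(dvc_pull) === 'false') return;

  // 'true' pulls everything; anything else is passed through as explicit targets,
  // e.g. "users/ cities/ companies.csv".
  const targets = String(dvc_pull) === 'true' ? '' : dvc_pull;
  await exec(`dvc pull -f ${targets}`.trim());
};
```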
Set up the deploy workflow to depend on the test workflow. While this should be possible, currently it's not, and both workflows have to be joined into one.
Rename it. Maybe _run is not needed?
Metrics are more important than file data; the order has to be changed.
Additional issues could be:
GitLab caches the Docker image for a long time. This forces us to use a versioning strategy and not use latest.
It might not be easy for users to customize our Docker image, which contains the CML code. It is more flexible to have our code as an NPM package.
iterative/setup-dvc#2 (comment)
await exec('dvc pull -f');
from #47
ERROR: unexpected error - Unable to locate credentials <--- THIS IS IN RED
[EXIT NOW - NO REASON TO CONTINUE]
##[error]Process completed with exit code 1. <--- THIS IS IN RED
It's fundamental to test all the available remotes.
If the remote is out of scope, throw a "not implemented" error.
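A minimal sketch of such a guard; the list of supported schemes below is illustrative, not the definitive set:

```js
// Hypothetical sketch: fail fast for remote types CML does not handle.
const SUPPORTED_SCHEMES = ['s3://', 'gs://', 'gdrive://', 'azure://']; // illustrative list

const check_remote_supported = remote_url => {
  if (!SUPPORTED_SCHEMES.some(scheme => remote_url.startsWith(scheme)))
    throw new Error(`Remote not implemented: ${remote_url}`);
};
```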
It would be useful to know which DVC version is running in the container.
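For example, something as small as this in the action would surface it in the job log (a sketch, not the actual implementation):

```js
// Hypothetical sketch: print the DVC version available inside the container.
const { execSync } = require('child_process');
console.log(`DVC version: ${execSync('dvc --version').toString().trim()}`);
```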
I'm trying out the system on GitLab CI now and it all works very easily, except getting tags to generate after each run. I had to create an environment variable, tag_prefix.
It doesn't seem like tag_prefix is an ideal mechanism for coding whether or not the user wants DVC reports generated; at least, the variable name wouldn't signal to me that I need to assign it to allow tags. Is there a better way that I'm missing? I would think that by default, we'd want tags enabled?
Two jobs are triggered because of:
on: [push, pull_request]
GitHub, when a Pull Request exists, skips execution if the same SHA is already running. That has to be implemented in GitLab as well.
Wrap them under the same name.
Please change it to use export. That will imply using compiled JS in the bin files.
The approach with unified GH-GL tags looks really appealing from the GPU optimization point of view, but it complicates the basic solution a lot. The code with the customized gh-runner needs to be extracted into a separate project with its own Docker files.
Also, the readme file needs to be changed correspondingly.
In GitHub, reports are accessible through checks and/or releases, but in GitLab, if tags are not generated, there are no reports at all. Including the output of dvcreport as an artifact would mitigate this issue.
I replicated the experiment in the Wiki (GitHub version). When I made my first commit, the commit messages showed up as dvc repro [ci skip]. I wouldn't expect to see [ci skip] if I didn't include a flag for that, and in fact, I'm sure the CI ran! It might help me avoid confusion if we avoid printing that flag in commit messages except when it is explicitly called by the user.
Since we are deciding to call this tool CML, and not DVC CML, should we rename the reports CML Reports?
Add a .dockerignore to avoid adding node_modules; this may reduce the image size.
I've tried DVC_PULL: false and DVC_PULL: "". In both cases, it still tries to pull data. What should I put?
- name: dvc_action_run
  env:
    ....
    DVC_PULL: ""
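A guess at the cause, based on the destructuring shown further down this page: the code reads dvc_pull from process.env, where values are always strings, so 'false' stays truthy; and if the variable is read under a different case than DVC_PULL, the default of true applies regardless. A sketch of both pitfalls:

```js
// Hypothetical sketch of the two pitfalls (not a confirmed root cause):
const { dvc_pull = true } = process.env;
// 1. If the workflow sets DVC_PULL (uppercase) but the code reads dvc_pull (lowercase),
//    the value is undefined here and the default of true applies.
// 2. Even when the value is read, 'false' is a non-empty string, which is truthy.
const should_pull = Boolean(dvc_pull); // true in both cases above

// A safer check for the string case would be something like:
const should_pull_fixed = String(dvc_pull).toLowerCase() !== 'false';
```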
I am using GDrive for remote storage. This is great on my local machine, but when I go to the runner, it seems to be stuck forever at "Pulling from DVC remote".
I don't have proof, but this seems like an authentication issue. My bet is that on my local machine, the first time I try to access GDrive as a remote I am given a link to visit in my browser, and then I get a validation code that I copy and paste back into the CLI. I'm guessing we are simply not getting past this authentication stage.
I have followed the instructions on the README for cml for GDrive (to my understanding); I copied and pasted the contents of .dvc/tmp/gdrive-user-credentials.json into the value field for the Secret "GDRIVE_USER_CREDENTIALS_DATA".
The repo is here if needed: https://github.com/iterative/mnist_classifier
@DavidGOrtega what do you think?
Encapsulate all the settings inside a file.
const DVC_TITLE = 'DVC Report';
const DVC_TAG_PREFIX = 'dvc_';
const MAX_CHARS = 65000;
const METRICS_FORMAT = '0[.][0000000]';
const {
  baseline = 'origin/master',
  metrics_format = '0[.][0000000]',
  dvc_pull = true
} = process.env;
const repro_targets = getInputArray('repro_targets', ['Dvcfile']);
const metrics_diff_targets = getInputArray('metrics_diff_targets');
.demandOption('output')
.alias('o', 'output')
.default('diff_target', '')
.default('metrics_diff_targets', '')
.array('metrics_diff_targets')
.default('a_rev', 'HEAD~1')
.default('b_rev', 'HEAD')
.help('h')
.alias('h', 'help').argv;
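A rough sketch of what a single settings module could look like, gathering the constants and defaults quoted above (the file name and the getInputArray shape are assumptions):

```js
// settings.js — hypothetical single home for the constants scattered above.
const getInputArray = (name, fallback = []) => {
  const value = (process.env[name] || '').trim();
  return value ? value.split(/\s+/) : fallback;
};

module.exports = {
  DVC_TITLE: 'DVC Report',
  DVC_TAG_PREFIX: 'dvc_',
  MAX_CHARS: 65000,
  METRICS_FORMAT: process.env.metrics_format || '0[.][0000000]',
  baseline: process.env.baseline || 'origin/master',
  dvc_pull: process.env.dvc_pull !== 'false',
  repro_targets: getInputArray('repro_targets', ['Dvcfile']),
  metrics_diff_targets: getInputArray('metrics_diff_targets')
};
```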
I've been able to make changes to my code, and then git commit & git push to initiate model retraining in the mnist example from the Wiki. For some reason, though, I'm not seeing reports visible.
Here's a screenshot from a case where I made a new branch mybranch, changed the learning rate, and pushed. The CI ran, but no sign of a report. Any ideas?
To avoid manual testing we should have e2e testing, some ideas:
It was a bug in DVC which was already fixed (iterative/dvc#3529):
$ dvc metrics diff --show-json
{"mm.json": {"TP": {"old": null, "new": 456}}}
We need to make sure it was fixed in CML as well.
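A small sketch of what an e2e-style check on the CML side could assert, using the JSON shape shown above (the rendering logic is illustrative):

```js
// Hypothetical sketch: make sure `old: null` entries (a freshly added metric)
// don't break report rendering on the CML side.
const { execSync } = require('child_process');
const diff = JSON.parse(execSync('dvc metrics diff --show-json').toString());

for (const [file, metrics] of Object.entries(diff)) {
  for (const [metric, { old, new: current }] of Object.entries(metrics)) {
    const delta = old === null ? 'new metric' : `${current - old}`;
    console.log(`${file} ${metric}: ${current} (${delta})`);
  }
}
```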
iterative/setup-dvc#2 (comment)
const dvc_remote_list = (
await exec('dvc remote list', { throw_err: false })
We need to reflect the fact that CML has made a commit; it is better to use a prefix.
This is a discussion point, not really an issue. I'm thinking about how metrics are displayed:
I definitely want to know that I'm comparing two experiments in which hyperparameters of my model (here, the maximum depth of a random forest classifier, max_depth) changed. But, whereas it makes sense to have a "diff" presented for the accuracy metric, I'm not so sure it matters to have a diff presented for the hyperparameters. It's not a number we're trying to optimize (unlike accuracy diffs), and visually, it makes the display more cluttered.
I might suggest having a separate table for comparing hyperparameters that doesn't present diffs, just a side-by-side comparison, and then a table for comparing the output metrics, where I do care about the diff. Would this be challenging to implement? Maybe, for each distinct metric file, its own table? And then somewhere in project preferences a user could specify whether we want diffs.
Another way of thinking about this is that if I had a spreadsheet of experiments I was trying to compare, I would lay it out this way:
experiment id | parameterA | parameterB | parameterC | accuracy |
---|---|---|---|---|
1bac226 | 24 | 5 | 140 | 0.899 |
f90k153 | 24 | 2 | 140 | 0.9111 |
And then perhaps highlight the row containing the best experiment (assuming that we can specify somewhere whether we want + or - for the metric). If you want the diff explicitly calculated, maybe put it in its own field below the table.
Currently, the remote type is checked in an inconsistent way, by finding string patterns like s3:// in the dvc remote list output. That might be a problem when multiple remotes are defined.
In fact, dvc remote default returns the default remote name, which can be properly resolved to a URL in the dvc remote list output by a simple pattern.
Also, it makes sense to throw an error and exit with a proper message if dvc pull is required but the corresponding settings are not in place.
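A sketch of the suggested approach, assuming dvc remote list prints one name/URL pair per line (parsing details are an assumption):

```js
// Hypothetical sketch: resolve the default remote's URL instead of grepping for s3://.
const { execSync } = require('child_process');

const name = execSync('dvc remote default').toString().trim();
const list = execSync('dvc remote list').toString();

// Find the line for the default remote and take its URL column.
const entry = list.split('\n').find(line => line.trim().startsWith(name));
const remote_url = entry ? entry.trim().split(/\s+/)[1] : null;

if (!remote_url)
  throw new Error('dvc pull is required but no default remote is properly configured');
```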