♾️ CML - Continuous Machine Learning | CI/CD for ML
Home Page: http://cml.dev
License: Apache License 2.0
Users can set a tag prefix to get an inter-branch experiments list and also reports in GitLab; however, none of this will happen if the user does not set up the prefix accordingly.
SSH is no longer among the supported remotes, since DVC's SSH remote is actually backed by SFTP via Paramiko.
The strategy of adding the PEM key was wrong, since the key is actually handled and located by DVC through the config file.
Open questions:
- skip_push, which defaults to true. It skips the whole push process, including both DVC and Git. The report still has to happen (CML only).
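A minimal sketch of how such an option might behave (the skip_push name comes from the discussion above; the function and its wiring are only an illustration, not the actual CML code):

```js
// Hypothetical sketch: honor a skip_push setting that defaults to true.
// `exec` stands for the repo's promise-based shell helper (seen elsewhere on this page).
const push_if_requested = async exec => {
  const { skip_push = 'true' } = process.env;
  if (String(skip_push) === 'false') {
    await exec('dvc push');             // push DVC-tracked data
    await exec('git push origin HEAD'); // push the commit CML created
  }
  // The CML report is still generated either way.
};
```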
Please use something like test -f requirements.txt && pip install -r requirements.txt
@dmpetrov and I have been talking about how we'll build tutorials for dvc-cml. One idea, which I've been building in a repo, is a project where anyone can make a fork and then submit a PR to see the workflow in action.
However, I've found this note on the Settings/Secrets page:
Secrets are not passed to workflows that are triggered by a pull request from a fork. Learn more.
If I understand correctly, this means that if someone in the public (outside DVC) cloned our repo and attempted to make a PR, dvc repro might be triggered BUT the runner would not be able to access credentials, such as the Google Drive credentials needed to push/pull project artifacts. Does this sound correct?
If it's an issue, it seems like we could simply put the credentials in a config file in the repo; I think, with GDrive, this is often alright?
Introduce in the docs an advanced workflow case where one could set different baselines depending on the branch of the job.
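As a rough illustration of what such a docs example could cover (the baseline setting appears elsewhere on this page; the branch detection and the release-branch rule below are assumptions):

```js
// Hypothetical sketch: choose a different diff baseline depending on the branch.
// Assumes the CI exposes the branch via GITHUB_REF (GitHub) or CI_COMMIT_REF_NAME (GitLab).
const branch =
  (process.env.GITHUB_REF || '').replace('refs/heads/', '') ||
  process.env.CI_COMMIT_REF_NAME ||
  '';

// Illustrative rule: release branches compare against a release baseline,
// everything else compares against master.
const baseline = branch.startsWith('release/') ? 'origin/last-release' : 'origin/master';
console.log(`Using baseline ${baseline} for dvc metrics diff`);
```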
simple-git is a wrapper over git that is already in use. All the raw git commands should be replaced with it.
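A minimal sketch of what that replacement could look like, assuming simple-git's promise-based interface (the commands being replaced are illustrative):

```js
// Hypothetical sketch: replace shelled-out git calls with the simple-git wrapper.
const simpleGit = require('simple-git');

const commit_and_push = async message => {
  const git = simpleGit();
  // instead of something like: await exec('git add -A && git commit -m "..." && git push');
  await git.add('.');
  await git.commit(message);
  await git.push('origin', 'HEAD');
};
```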
Use a proper logging library: turn console.log calls into proper logger calls.
@0x2b3bfa0 moved the rest of the items to a separate issue, as @casperdcl suggested in the past weekly meeting (put in a separate issue).
Edit: coming back to this, we should now attack it with the 0.3.0 release. The proposal would be to use winston and a configurable log file. We can also collect the runner heartbeat using the OpenMetrics format.
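A minimal sketch of what a winston-based logger could look like; the environment variable names and file name below are assumptions:

```js
// Hypothetical sketch: a shared winston logger replacing scattered console.log calls.
const winston = require('winston');

const logger = winston.createLogger({
  level: process.env.CML_LOG_LEVEL || 'info', // hypothetical env var for configurability
  format: winston.format.combine(winston.format.timestamp(), winston.format.json()),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: process.env.CML_LOG_FILE || 'cml.log' })
  ]
});

logger.info('runner heartbeat', { uptime: process.uptime() });
```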
Sometimes exec rejects with the error instead of returning it, so throw_err is useless.
Refactoring this may also be a good chance to review easy-git.
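One possible way to make throw_err behave consistently, sketched as a thin wrapper around the existing exec helper (names are illustrative):

```js
// Hypothetical sketch: make throw_err work even when the underlying exec helper
// rejects instead of returning the error.
const exec_safe = async (command, { throw_err = true } = {}) => {
  try {
    return await exec(command); // exec is the repo's existing shell helper
  } catch (err) {
    if (throw_err) throw err;
    return err.message; // swallow the failure and hand back the error text
  }
};
```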
It supports only true/false, while users might need to pull a particular data file, like:
$ dvc pull images
or
$ dvc pull users/ cities/ companies.csv
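A rough sketch of how the option could accept either a boolean or a list of targets (the parsing is only an illustration):

```js
// Hypothetical sketch: let dvc_pull be either 'true'/'false' or a space-separated target list.
// `exec` stands for the repo's promise-based shell helper.
const pull_data = async exec => {
  const { dvc_pull = 'true' } = process.env;
  if (String(dvc_pull) === 'false') return;

  // 'true' pulls everything; anything else is passed through as explicit targets,
  // e.g. "users/ cities/ companies.csv".
  const targets = String(dvc_pull) === 'true' ? '' : dvc_pull;
  await exec(`dvc pull -f ${targets}`.trim());
};
```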
Set up the deploy workflow to depend on the test workflow. While this should be possible, currently it's not, and both workflows have to be joined into one.
Rename it. Maybe _run is not needed?
Metrics are more important than file data; the order has to be changed.
Additional issues could be:
GitLab caches the Docker image for a long time. This forces us to use a versioning strategy and not use latest.
It might not be easy for users to customize our Docker image, which contains the CML code. It is more flexible to have our code as an NPM package.
iterative/setup-dvc#2 (comment)
await exec('dvc pull -f');
from #47
ERROR: unexpected error - Unable to locate credentials <--- THIS IS IN RED
[EXIT NOW - NO REASON TO CONTINUE]
##[error]Process completed with exit code 1. <--- THIS IS IN RED
It's fundamental to test all the available remotes.
If the remote is out of scope, throw a "not implemented" error.
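A minimal sketch of such a guard; the list of supported schemes below is illustrative, not the definitive set:

```js
// Hypothetical sketch: fail fast for remote types CML does not handle.
const SUPPORTED_SCHEMES = ['s3://', 'gs://', 'gdrive://', 'azure://']; // illustrative list

const check_remote_supported = remote_url => {
  if (!SUPPORTED_SCHEMES.some(scheme => remote_url.startsWith(scheme)))
    throw new Error(`Remote not implemented: ${remote_url}`);
};
```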
It would be useful to know which DVC version is running in the container.
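For example, something as small as this in the action would surface it in the job log (a sketch, not the actual implementation):

```js
// Hypothetical sketch: print the DVC version available inside the container.
const { execSync } = require('child_process');
console.log(`DVC version: ${execSync('dvc --version').toString().trim()}`);
```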
I'm trying out the system on GitLab CI now and it all works very easily, except getting tags to generate after each run. I had to create an environment variable, tag_prefix.
It doesn't seem like tag_prefix is an ideal mechanism for coding whether or not the user wants DVC reports generated; at least, the variable name wouldn't signal to me that I need to assign it to allow tags. Is there a better way that I'm missing? I would think that by default, we'd want tags enabled?
Two jobs are triggered because of:
on: [push, pull_request]
GitHub, when a Pull Request exists, skips execution if the same SHA is already running. That has to be implemented in GitLab as well.
Wrap them under the same name.
Please change it to use export. That will imply using compiled JS in the bin files.
The approach with unified GH-GL tags looks really appealing from the GPU optimization point of view, but it complicates the basic solution a lot. The code with the customized gh-runner needs to be extracted into a separate project with its own Docker files.
Also, the readme file needs to be changed correspondingly.
In GitHub, reports are accessible through checks and/or releases, but in GitLab, if tags are not generated, there are no reports at all. Including the output of dvcreport as an artifact would mitigate this issue.
I replicated the experiment in the Wiki (GitHub version). When I made my first commit, the commit messages showed up as dvc repro [ci skip]. I wouldn't expect to see [ci skip] if I didn't include a flag for that, and in fact, I'm sure the CI ran! It might help me avoid confusion if we avoid printing that flag in commit messages except when it is explicitly called by the user.
Since we are deciding to call this tool CML, and not DVC CML, should we rename the reports CML Reports?
Add a .dockerignore to avoid adding node_modules; this may reduce the image size.
I've tried DVC_PULL: false and DVC_PULL: "". In both cases, it still tries to pull data. What should I put?
- name: dvc_action_run
  env:
    ....
    DVC_PULL: ""
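A guess at the cause, based on the destructuring shown further down this page: the code reads dvc_pull from process.env, where values are always strings, so 'false' stays truthy; and if the variable is read under a different case than DVC_PULL, the default of true applies regardless. A sketch of both pitfalls:

```js
// Hypothetical sketch of the two pitfalls (not a confirmed root cause):
const { dvc_pull = true } = process.env;
// 1. If the workflow sets DVC_PULL (uppercase) but the code reads dvc_pull (lowercase),
//    the value is undefined here and the default of true applies.
// 2. Even when the value is read, 'false' is a non-empty string, which is truthy.
const should_pull = Boolean(dvc_pull); // true in both cases above

// A safer check for the string case would be something like:
const should_pull_fixed = String(dvc_pull).toLowerCase() !== 'false';
```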
I am using GDrive for remote storage. This is great on my local machine, but when I go to the runner, it seems to be stuck forever at "Pulling from DVC remote".
I don't have proof, but this seems like an authentication issue. My bet is that on my local machine, the first time I try to access GDrive as a remote I am given a link to visit in my browser, and then I get a validation code that I copy and paste back into the CLI. I'm guessing we are simply not getting past this authentication stage.
I have followed the instructions on the README for cml for GDrive (to my understanding); I copied and pasted the contents of .dvc/tmp/gdrive-user-credentials.json into the value field for the Secret "GDRIVE_USER_CREDENTIALS_DATA".
The repo is here if needed: https://github.com/iterative/mnist_classifier
@DavidGOrtega what do you think?
Encapsulate all the settings inside a file.
const DVC_TITLE = 'DVC Report';
const DVC_TAG_PREFIX = 'dvc_';
const MAX_CHARS = 65000;
const METRICS_FORMAT = '0[.][0000000]';
const {
  baseline = 'origin/master',
  metrics_format = '0[.][0000000]',
  dvc_pull = true
} = process.env;
const repro_targets = getInputArray('repro_targets', ['Dvcfile']);
const metrics_diff_targets = getInputArray('metrics_diff_targets');
.demandOption('output')
.alias('o', 'output')
.default('diff_target', '')
.default('metrics_diff_targets', '')
.array('metrics_diff_targets')
.default('a_rev', 'HEAD~1')
.default('b_rev', 'HEAD')
.help('h')
.alias('h', 'help').argv;
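A rough sketch of what a single settings module could look like, gathering the constants and defaults quoted above (the file name and the getInputArray shape are assumptions):

```js
// settings.js — hypothetical single home for the constants scattered above.
const getInputArray = (name, fallback = []) => {
  const value = (process.env[name] || '').trim();
  return value ? value.split(/\s+/) : fallback;
};

module.exports = {
  DVC_TITLE: 'DVC Report',
  DVC_TAG_PREFIX: 'dvc_',
  MAX_CHARS: 65000,
  METRICS_FORMAT: process.env.metrics_format || '0[.][0000000]',
  baseline: process.env.baseline || 'origin/master',
  dvc_pull: process.env.dvc_pull !== 'false',
  repro_targets: getInputArray('repro_targets', ['Dvcfile']),
  metrics_diff_targets: getInputArray('metrics_diff_targets')
};
```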
I've been able to make changes to my code, and then git commit & git push to initiate model retraining in the mnist example from the Wiki. For some reason, though, I'm not seeing reports visible.
Here's a screenshot from a case where I made a new branch mybranch, changed the learning rate, and pushed. The CI ran, but no sign of a report. Any ideas?
To avoid manual testing we should have e2e testing, some ideas:
It was a bug in DVC which was already fixed (iterative/dvc#3529):
$ dvc metrics diff --show-json
{"mm.json": {"TP": {"old": null, "new": 456}}}
We need to make sure it was fixed in CML as well.
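A small sketch of what an e2e-style check on the CML side could assert, using the JSON shape shown above (the rendering logic is illustrative):

```js
// Hypothetical sketch: make sure `old: null` entries (a freshly added metric)
// don't break report rendering on the CML side.
const { execSync } = require('child_process');
const diff = JSON.parse(execSync('dvc metrics diff --show-json').toString());

for (const [file, metrics] of Object.entries(diff)) {
  for (const [metric, { old, new: current }] of Object.entries(metrics)) {
    const delta = old === null ? 'new metric' : `${current - old}`;
    console.log(`${file} ${metric}: ${current} (${delta})`);
  }
}
```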
iterative/setup-dvc#2 (comment)
const dvc_remote_list = (
await exec('dvc remote list', { throw_err: false })
We need to reflect the fact that CML has made a commit; it is better to use a prefix.
This is a discussion point, not really an issue. I'm thinking about how metrics are displayed:
I definitely want to know that I'm comparing two experiments in which hyperparameters of my model (here, the maximum depth of a random forest classifier, max_depth) changed. But, whereas it makes sense to have a "diff" presented for the accuracy metric, I'm not so sure it matters to have a diff presented for the hyperparameters. It's not a number we're trying to optimize (unlike accuracy diffs), and visually, it makes the display more cluttered.
I might suggest having a separate table for comparing hyperparameters that doesn't present diffs, just a side-by-side comparison, and then a table for comparing the output metrics, where I do care about the diff. Would this be challenging to implement? Maybe, for each distinct metric file, its own table? And then somewhere in project preferences a user could specify whether we want diffs.
Another way of thinking about this is that if I had a spreadsheet of experiments I was trying to compare, I would lay it out this way:
experiment id | parameterA | parameterB | parameterC | accuracy |
---|---|---|---|---|
1bac226 | 24 | 5 | 140 | 0.899 |
f90k153 | 24 | 2 | 140 | 0.9111 |
And then perhaps highlight the row containing the best experiment (assuming that we can specify somewhere whether we want + or - for the metric). If you want the diff explicitly calculated, maybe put it in its own field below the table.
Currently, the remote type is checked in an inconsistent way, by finding string patterns like s3:// in the dvc remote list output. That might be a problem when multiple remotes are defined.
In fact, dvc remote default returns the default remote name, which can be properly resolved to a URL in the dvc remote list output by a simple pattern.
Also, it makes sense to throw an error and exit with a proper message if dvc pull is required but the corresponding settings are not in place.
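A sketch of the suggested approach, assuming dvc remote list prints one name/URL pair per line (parsing details are an assumption):

```js
// Hypothetical sketch: resolve the default remote's URL instead of grepping for s3://.
const { execSync } = require('child_process');

const name = execSync('dvc remote default').toString().trim();
const list = execSync('dvc remote list').toString();

// Find the line for the default remote and take its URL column.
const entry = list.split('\n').find(line => line.trim().startsWith(name));
const remote_url = entry ? entry.trim().split(/\s+/)[1] : null;

if (!remote_url)
  throw new Error('dvc pull is required but no default remote is properly configured');
```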