
cml's People

Contributors

0x2b3bfa0, casperdcl, courentin, dacbd, davidgortega, deepyaman, dependabot[bot], dmpetrov, duijf, elleobrien, francesco086, github-actions[bot], h2oa, iterative-olivaw, jamesmowatt, jorgeorpinel, josemaia, ludelafo, magdapoppins, nipierre, omesser, pandyaved98, restyled-commits, restyled-io[bot], samknightgit, shcheklein, skn0tt, snyk-bot, tasdomas, vincent-leonardo


cml's Issues

No tags by default

Users can set a tag prefix to get an inter-branch experiments list and also reports in GitLab; however, none of this will happen if the user does not set up the prefix accordingly.

  • Create one more workflow parameter/env-variable, tag_prefix, for tagging the commits that we
    create. By default it is empty, which means no tags (see the workflow sketch below).
  • Mention in the GitLab documentation that tag_prefix has to be defined.
  • Mention in the GitHub docs that you can add tag_prefix.
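
A minimal sketch of how this could look in a GitHub Actions workflow, assuming the variable keeps the name tag_prefix and is passed through the step environment like the other settings shown elsewhere on this page:

      - name: dvc_action_run
        env:
          ....
          tag_prefix: 'dvc_'   # leave empty (the default) to create no tags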

Support remote SSH, HTTP and HDFS

SSH is currently out of the supported remotes, since dvc's ssh remote is actually backed by SFTP via paramiko.

The strategy of adding the PEM key was wrong, since the key is actually handled and located by dvc via the config file.
Open questions are:

  • Should we allow multiple SSH remotes?
  • Since all the other remotes work with ENV variables, should dvc support SSH, HTTP and HDFS credentials via ENV variables as well?

be able to skip push

Add a skip_push option that defaults to true.
It skips the whole push process, including dvc and git.
The report still has to be generated; a rough sketch follows.
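
A rough sketch of how the gating could look; the skip_push handling and the report placeholder are illustrative, not the actual implementation:

    // sketch only -- not the real CML code
    const { promisify } = require('util');
    const exec = promisify(require('child_process').exec);

    (async () => {
      const { skip_push = 'true' } = process.env;

      if (skip_push.toLowerCase() !== 'true') {
        await exec('dvc push');
        await exec('git push origin HEAD');
      }

      // the report still has to happen, even when pushing is skipped
      console.log('generating report (placeholder for the actual report step)');
    })();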

public access to a dvc-cml project?

@dmpetrov and I have been talking about how we'll build tutorials for dvc-cml. One idea, which I've been building in a repo, is a project where anyone can make a fork and then submit a PR to see the workflow in action.

However, I've found this note on the Settings/Secrets page:

Secrets are not passed to workflows that are triggered by a pull request from a fork.

If I understand correctly, this means that if someone in the public/outside DVC cloned our repo and attempted to make a PR, dvc repro might be triggered BUT the runner would not be able to access credentials, such as the Google Drive credentials needed to push/pull project artifacts. Does this sound correct?

If it's an issue, it seems like we could simply put the credentials in a config file in the repo. I think, with GDrive, this is often alright?

Better logging

Use a proper logging library

  • where/what/how raising errors & messages #606
  • change every console.log into proper logger calls

@0x2b3bfa0 moved the rest of the items to a separate issue, as @casperdcl suggested in the past weekly meeting.


nice to have

(put in separate issue)

  • wrap specific CI vendor capabilities
  • heartbeats (in openmetrics format?)
  • file configurable
  • integration with studio

Edit:
Coming back to this, we should now attack it with the 0.3.0 release.
The proposal would be to use winston and a configurable log file. We could also collect runner heartbeats in the OpenMetrics format. A sketch follows.
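
A minimal sketch of what a winston-based logger could look like; the CML_LOG_FILE and CML_LOG_LEVEL variable names are only illustrative:

    const winston = require('winston');

    // console transport always on; file transport only when a log file path is configured
    const transports = [new winston.transports.Console()];
    if (process.env.CML_LOG_FILE)
      transports.push(new winston.transports.File({ filename: process.env.CML_LOG_FILE }));

    const logger = winston.createLogger({
      level: process.env.CML_LOG_LEVEL || 'info',
      format: winston.format.simple(),
      transports
    });

    // every console.log call would become one of these
    logger.info('runner started');
    logger.error('failed to publish report');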

promisify exec sometimes rejects

Sometimes exec rejects with the error instead of returning it, so throw_err is useless.
Refactoring this may also be a good chance to review easy-git. A possible wrapper is sketched below.
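
A sketch of a wrapper that keeps throw_err meaningful; note that promisify(child_process.exec) rejects whenever the command exits with a non-zero code:

    const { promisify } = require('util');
    const execp = promisify(require('child_process').exec);

    // names and options are illustrative
    const exec = async (command, { throw_err = true } = {}) => {
      try {
        const { stdout } = await execp(command);
        return stdout.trim();
      } catch (err) {
        // non-zero exit codes land here; honour throw_err instead of always throwing
        if (throw_err) throw err;
        return (err.stdout || '').trim();
      }
    };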

dvc pull: cannot specify a target

It supports only true/false, while users might need to pull a particular data file, like:

$ dvc pull images 

or

$ dvc pull users/ cities/ companies.csv
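
One way this could be handled, sketched with an illustrative dvc_pull variable that accepts either a boolean or a space-separated list of targets:

    const { promisify } = require('util');
    const exec = promisify(require('child_process').exec);

    (async () => {
      const { dvc_pull = 'true' } = process.env;

      if (dvc_pull && dvc_pull !== 'false') {
        // 'true' means pull everything; anything else is treated as a target list
        const targets = dvc_pull === 'true' ? '' : dvc_pull; // e.g. 'users/ cities/ companies.csv'
        await exec(`dvc pull ${targets}`.trim());
      }
    })();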

Better Report

Metrics are more important than file data; the order has to be changed.

Additional issues could be:

  • Last experiments as a list
  • Metrics could also be collapsible to reduce the space
  • Warn if the current branch is being compared with itself (branch == rev)

Create cml NPM package

It might not be easy for users to customize our Docker image, which contains the CML code. It would be more flexible to ship our code as an NPM package.

test remotes

It's essential to test all the available remotes.
If a remote is out of scope, throw a "not implemented" error, as sketched below.
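
A minimal sketch of the fail-fast check; the list of supported schemes here is illustrative:

    // reject unsupported remote URLs up front instead of failing later
    const SUPPORTED = ['s3://', 'gs://', 'gdrive://', 'azure://'];

    const checkRemote = (url) => {
      if (!SUPPORTED.some((scheme) => url.startsWith(scheme)))
        throw new Error(`Remote ${url} is not implemented`);
    };

    checkRemote('s3://my-bucket/dvc-storage');     // ok
    // checkRemote('hdfs://namenode/dvc-storage'); // throws "not implemented"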

GitLab CI Tags

I'm trying out the system on GitLab CI now and it all works very easily, except for getting tags to generate after each run. I had to create an environment variable, tag_prefix.

It doesn't seem like tag_prefix is an ideal mechanism for encoding whether or not the user wants DVC reports generated; at least, the variable name wouldn't signal to me that I need to assign it to enable tags. Is there a better way that I'm missing? I would think that, by default, we'd want tags enabled.

Double jobs

Two jobs are triggered because of:

   on: [push, pull_request]
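
One possible fix, a sketch rather than an agreed solution, is to narrow the triggers so a branch with an open pull request does not fire both events, e.g. only building pushes to master:

   on:
     push:
       branches: [master]
     pull_request: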

Extract self-hosted gpu tags code into a separate repo and docker files

The approach with unified GH-GL tags looks really appealing from the GPU optimization point of view, but it complicates the basic solution a lot. This code, with the customized gh-runner, needs to be extracted into a separate project and Docker files.

Also, the readme file needs to be changed correspondingly.

Generate report as an artefact

In GH, reports are accessible through checks and/or releases, but in GL, if tags are not generated, there are no reports at all.
Including the output of dvcreport as a CI artefact would mitigate this issue; see the sketch below.
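
In GitLab CI the report file could be exposed through the artifacts keyword; a sketch, assuming the existing CML step writes the report to report.md:

    train:
      script:
        - echo "existing CML step goes here, writing report.md"   # placeholder for the real step
      artifacts:
        paths:
          - report.md
        expire_in: 30 days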

[ci skip] flag shows up in commit message when it wasn't called

I replicated the experiment in the Wiki (GitHub version). When I made my first commit, the commit message showed up as dvc repro [ci skip]. I wouldn't expect to see [ci skip] if I didn't include a flag for that, and in fact, I'm sure the CI ran! It would help me avoid confusion if we avoided printing that flag in commit messages except when it is explicitly requested by the user.


DVC Report --> CML Report

Since we are deciding to call this tool CML, and not DVC CML, should we rename the reports CML Reports?

Preserve branch gh-runner

Via #55 our gh-runner is removed; to be able to finish the issue we need a separate repo and Docker Hub repos.

@dmpetrov, would it work to leave the branch in the repo and use a tag in Docker Hub?

  • dvcorg/dvc-cml:custom-runner-latest
  • dvcorg/dvc-cml-gpu:custom-runner-latest

add .dockerignore

Add a .dockerignore so that node_modules is not copied into the image; this may reduce its size. A sketch is below.
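
A minimal .dockerignore along these lines; only node_modules comes from this issue, the other entries are just the usual suspects:

    node_modules
    npm-debug.log
    .git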

Unable to suppress dvc pull

I've tried DVC_PULL: false and DVC_PULL: "". In either case, it still tries to pull data. What should I put?

      - name: dvc_action_run
        env:
          ....
          DVC_PULL: ""
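
Two guesses at the cause, neither confirmed: environment variables always arrive as strings, so a plain truthiness check treats "false" as true; and the settings fragment further down this page destructures lowercase dvc_pull from process.env, while the workflow sets uppercase DVC_PULL. A sketch of an explicit parse that would at least fix the first problem:

    // sketch: parse boolean-ish env values explicitly instead of relying on truthiness
    const parseBool = (value, fallback = true) => {
      if (value === undefined) return fallback;
      return !['', 'false', '0', 'no'].includes(String(value).toLowerCase());
    };

    console.log(parseBool(process.env.DVC_PULL)); // "false" -> false, "" -> false, unset -> true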

gdrive stuck on runner

I am using GDrive for remote storage. This is great on my local machine, but when I go to the runner, it seems to be stuck forever on "Pulling from DVC remote".

I don't have proof, but this seems like an authentication issue. My bet is that on my local machine, the first time I try to access GDrive as a remote I am given a link to visit in my browser, and then I get a validation code that I copy and paste back into the CLI. I'm guessing we are simply not getting past this authentication stage on the runner.

I have followed the instructions on the README for cml for GDrive (to my understanding); I copied and pasted the contents of .dvc/tmp/gdrive-user-credentials.json into the value field for the Secret "GDRIVE_USER_CREDENTIALS_DATA".

The repo is here if needed: https://github.com/iterative/mnist_classifier

@DavidGOrtega what do you think?

Metrics not available


Update the MNIST demo to output floats instead of strings, which are not processed by DVC (example below).
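
An illustrative before/after of a metrics JSON file (the metric name acc is made up); the point is simply that numeric comparison needs numbers, not quoted strings:

    {"acc": "0.9111"}   <- quoted string: not usable for a numeric diff
    {"acc": 0.9111}     <- plain number: can be compared between revisions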

Settings file

Encapsulate all the settings inside a single file. The fragments below show where they currently live; a consolidated sketch follows them.

const DVC_TITLE = 'DVC Report';
const DVC_TAG_PREFIX = 'dvc_';
const MAX_CHARS = 65000;
const METRICS_FORMAT = '0[.][0000000]';

const {
    baseline = 'origin/master',
    metrics_format = '0[.][0000000]',
    dvc_pull = true
  } = process.env;
  const repro_targets = getInputArray('repro_targets', ['Dvcfile']);
  const metrics_diff_targets = getInputArray('metrics_diff_targets');

  .demandOption('output')
  .alias('o', 'output')
  .default('diff_target', '')
  .default('metrics_diff_targets', '')
  .array('metrics_diff_targets')
  .default('a_rev', 'HEAD~1')
  .default('b_rev', 'HEAD')
  .help('h')
  .alias('h', 'help').argv;
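
A sketch of what the encapsulated settings file could look like; the defaults are the ones visible in the fragments above, while the module layout itself is just a proposal:

    // settings.js (proposal)
    const defaults = {
      dvc_title: 'DVC Report',
      tag_prefix: 'dvc_',
      max_chars: 65000,
      metrics_format: '0[.][0000000]',
      baseline: 'origin/master',
      dvc_pull: true,
      repro_targets: ['Dvcfile'],
      diff_target: '',
      a_rev: 'HEAD~1',
      b_rev: 'HEAD'
    };

    // environment variables (always strings) override the defaults
    module.exports = { ...defaults, ...process.env };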

Locating DVC Report

I've been able to make changes to my code, and then git commit & git push to initiate model retraining in the mnist example from the Wiki. For some reason, though, I'm not seeing any reports.

Here's a screenshot from a case where I made a new branch mybranch, changed the learning rate, and pushed. The CI ran, but no sign of a report. Any ideas?

(screenshot not included)

e2e testing

To avoid manual testing we should have e2e tests; some ideas:

  • create a repo during CI testing
  • use external repos and checks to make this one fail on build

Metrics of experiments with different tech implementation

This is a discussion point, not really an issue. I'm thinking about how metrics are displayed:

(screenshot not included)

I definitely want to know that I'm comparing two experiments in which hyperparameters of my model (here, the maximum depth of a random forest classifier max_depth) changed. But, whereas it makes sense to have a "diff" presented for the accuracy metric, I'm not so sure it matters to have a diff present for the hyperparameters. It's not a number we're trying to optimize (unlike accuracy diffs) and visually, it makes the display more cluttered.

I might suggest having a separate table for comparing hyperparameters that doesn't present diffs, just a side-by-side comparison. And then a table for comparing the output metrics, where I do care about the diff. Would this be challenging to implement? Maybe, for each distinct metric file, its own table? And then somewhere in project preferences a user could specify if we want diffs.

Another way of thinking about this is that if I had a spreadsheet of experiments I was trying to compare, I would lay it out this way:

experiment id  parameterA  parameterB  parameterC  accuracy
1bac226        24          5           140         0.899
f90k153        24          2           140         0.9111

And then perhaps highlight the row containing the best experiment (assuming that we can specify somewhere whether higher or lower is better for the metric). If you want the diff explicitly calculated, maybe put it in its own field below the table.

Parse default remote properly

Currently, the remote type is checked in an inconsistent way, by searching for string patterns like s3:// in the dvc remote list output. That can be a problem when multiple remotes are defined.

In fact, dvc remote default returns the default remote name, which can be properly resolved to a URL in the dvc remote list output with a simple lookup (sketched below).

Also, it makes sense to throw an error and exit with a proper message if dvc pull is required but the corresponding settings are not present.
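
A sketch of the lookup; it assumes dvc remote list prints one "name url" pair per line, which is what this issue describes:

    const { promisify } = require('util');
    const exec = promisify(require('child_process').exec);

    const defaultRemoteUrl = async () => {
      const { stdout: name } = await exec('dvc remote default');
      const { stdout: list } = await exec('dvc remote list');

      // find the line for the default remote and return its URL column
      const line = list
        .split('\n')
        .find((row) => row.trim().startsWith(name.trim()));

      if (!line) throw new Error(`Default remote "${name.trim()}" is not configured`);
      return line.trim().split(/\s+/)[1]; // e.g. s3://my-bucket/dvc-storage
    };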
