dvc_pipelines_and_experiments_tutorial's Introduction

Building a maintainable Machine Learning pipeline using DVC

This guide uses the DVC Get Started guide as a starting point and walks you through building maintainable Machine Learning pipelines with DVC.

If you have some time, you can check the full article here (it has more in-depth explanations than this README 😉)

The principles are:

  • Write a python script for each pipeline step
  • Save the parameters each script uses in a yaml file
  • Specify the files each script depends on
  • Specify the files each script generates

In this tutorial we're going to build a model to classify the 20newsgroups dataset.

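Environment: Linux with Python 3, pip and Git installed. The stage commands below use dvc run, which is available in DVC 1.x and 2.x; DVC 3.0 removed it in favor of dvc stage add (same options) followed by dvc repro.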

First: installing DVC as a Python library

$ mkdir dvc_tutorial
$ cd dvc_tutorial
$ python3 -m venv .env
$ source .env/bin/activate
(.env)$ pip3 install dvc
(.env)$ git init
(.env)$ dvc init

1 - Create a params.yaml file

# file params.yaml
prepare:
    categories:
        - comp.graphics
        - sci.space

2 - Create the prepare.py script

Save the prepare.py file (it's available here in this repo) inside src/. Your folder structure should look like this:

├── params.yaml
└── src
    └── prepare.py

3 - Create the prepare.py stage using DVC

The steps for doing that are:

  • Write a python script: prepare.py
  • Save the parameters: categories inside params.yaml
  • Specify the files the script depends on: prepare.py
  • Specify the files the script generates: the folder data/prepared
  • Define the command line instruction to run this step

(.env)$ pip install pyyaml scikit-learn pandas

(.env)$ dvc run -n prepare -p prepare.categories -d src/prepare.py -o data/prepared python3 src/prepare.py
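
To make the mapping between the list above and the command-line options explicit, here is a small sketch that assembles the same command from its parts. The function build_dvc_run is a hypothetical helper written for this illustration, not part of DVC; only the -n, -p, -d and -o options come from the command above.

```python
import shlex


def build_dvc_run(name, command, params=(), deps=(), outs=()):
    """Assemble a `dvc run` command line from the pieces of a stage."""
    args = ["dvc", "run", "-n", name]
    for p in params:
        args += ["-p", p]  # parameters read from params.yaml
    for d in deps:
        args += ["-d", d]  # files the script depends on
    for o in outs:
        args += ["-o", o]  # files or folders the script generates
    # The stage command itself goes last, unquoted, as in the tutorial.
    return shlex.join(args) + " " + command


prepare_cmd = build_dvc_run(
    name="prepare",
    command="python3 src/prepare.py",
    params=["prepare.categories"],
    deps=["src/prepare.py"],
    outs=["data/prepared"],
)
print(prepare_cmd)
# dvc run -n prepare -p prepare.categories -d src/prepare.py -o data/prepared python3 src/prepare.py
```

Each principle maps to one option: -p for parameters, -d for dependencies, -o for outputs, and the trailing text is the command DVC runs for the stage.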

4 - Create the scripts and the stages for all the other steps

The train stage tracks train.alpha with -p, so params.yaml needs a train section with an alpha value (see step 5 for the layout) before the train command runs.

(.env)$ dvc run -n featurize -d src/featurize.py -d data/prepared -o data/features python3 src/featurize.py data/prepared data/features

(.env)$ dvc run -n train -p train.alpha -d src/train.py -d data/features -o model.pkl python3 src/train.py data/features model.pkl

(.env)$ dvc run -n evaluate -d src/evaluate.py -d model.pkl -d data/features --metrics-no-cache scores.json --plots-no-cache plots.json python3 src/evaluate.py model.pkl data/features scores.json plots.json
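
Because each stage lists its dependencies (-d) and outputs (-o), the execution order follows from the files alone. The sketch below illustrates that idea with the standard-library graphlib module (Python 3.9+). It is not DVC's implementation, and STAGES and stage_graph are names made up for this example.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# deps and outs copied from the four `dvc run` commands above
# (for evaluate, the metrics and plots files are treated as outputs).
STAGES = {
    "prepare": {"deps": ["src/prepare.py"], "outs": ["data/prepared"]},
    "featurize": {
        "deps": ["src/featurize.py", "data/prepared"],
        "outs": ["data/features"],
    },
    "train": {"deps": ["src/train.py", "data/features"], "outs": ["model.pkl"]},
    "evaluate": {
        "deps": ["src/evaluate.py", "model.pkl", "data/features"],
        "outs": ["scores.json", "plots.json"],
    },
}


def stage_graph(stages):
    """Map each stage to the set of stages that produce one of its deps."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    return {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }


graph = stage_graph(STAGES)
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['prepare', 'featurize', 'train', 'evaluate']
```

A stage is connected to another only when one stage's output path appears among the other's dependencies, which is why the script files themselves add no edges.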

5 - Change parameters

# file params.yaml
prepare:
    categories:
        - comp.graphics
        - rec.sport.baseball
train:
    alpha: 0.9

6 - Run the pipeline

(.env)$ dvc repro
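
As a rough mental model of what gets re-executed, the sketch below marks a stage as stale when one of its tracked parameters changed, then propagates that to every stage downstream. This is a simplification under stated assumptions: DVC itself decides by comparing the hashes recorded in dvc.lock, so a downstream stage can still be skipped when its inputs turn out unchanged. STAGE_PARAMS, DOWNSTREAM and stages_to_rerun are hypothetical names.

```python
# Which params each stage tracks (-p) and which stages consume its outputs,
# taken from the `dvc run` commands in steps 3 and 4.
STAGE_PARAMS = {
    "prepare": {"prepare.categories"},
    "featurize": set(),
    "train": {"train.alpha"},
    "evaluate": set(),
}
DOWNSTREAM = {
    "prepare": ["featurize"],
    "featurize": ["train", "evaluate"],
    "train": ["evaluate"],
    "evaluate": [],
}


def stages_to_rerun(changed_params):
    """Return stages invalidated by the changed params, in pipeline order."""
    changed = set(changed_params)
    stale = set()
    pending = [s for s, tracked in STAGE_PARAMS.items() if tracked & changed]
    while pending:
        stage = pending.pop()
        if stage not in stale:
            stale.add(stage)
            pending.extend(DOWNSTREAM[stage])
    return [s for s in STAGE_PARAMS if s in stale]


print(stages_to_rerun(["train.alpha"]))
# ['train', 'evaluate']
print(stages_to_rerun(["prepare.categories"]))
# ['prepare', 'featurize', 'train', 'evaluate']
```

In step 5 both prepare.categories and train.alpha change, so under this model the whole pipeline is rerun.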

7 - Compare the metrics

Both commands compare the workspace against the last Git commit (HEAD) by default, so commit the baseline run before changing params.yaml in step 5.

(.env)$ dvc params diff

(.env)$ dvc metrics diff
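
The comparison itself is simple arithmetic over two versions of scores.json. The sketch below shows the idea on two small dictionaries; the metric names and values are made up for illustration and are not results from this tutorial, and metrics_diff is a hypothetical helper rather than a DVC function.

```python
def metrics_diff(old, new):
    """Return {metric: (old, new, change)} for metrics present in both runs."""
    return {
        name: (old[name], new[name], new[name] - old[name])
        for name in new
        if name in old
    }


# Made-up values standing in for two versions of scores.json.
before = {"auc": 0.5, "accuracy": 0.75}
after = {"auc": 0.75, "accuracy": 0.5}

diff = metrics_diff(before, after)
for name, (old_value, new_value, change) in diff.items():
    print(f"{name}: {old_value} -> {new_value} ({change:+})")
# auc: 0.5 -> 0.75 (+0.25)
# accuracy: 0.75 -> 0.5 (-0.25)
```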

8 - Visualize and compare metrics using plots

(.env)$ dvc plots show -y precision -x recall plots.json

(.env)$ dvc plots diff --targets plots.json -y precision
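
For these commands to work, plots.json has to contain records with precision and recall fields. The sketch below writes such a file as an array of objects with consistent keys, wrapped under a key named "prc" as in the DVC Get Started example. That wrapper key, the helper name write_pr_points and the sample values are assumptions; the evaluate.py in the repository may structure its output differently.

```python
import json
import tempfile
from pathlib import Path


def write_pr_points(path, precisions, recalls, key="prc"):
    """Write one {"precision", "recall"} record per point under `key`."""
    points = [{"precision": p, "recall": r} for p, r in zip(precisions, recalls)]
    Path(path).write_text(json.dumps({key: points}, indent=2), encoding="utf-8")
    return points


# Demo with made-up points, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "plots.json"
    write_pr_points(out, precisions=[1.0, 0.5], recalls=[0.0, 1.0])
    loaded = json.loads(out.read_text(encoding="utf-8"))

print(loaded)
# {'prc': [{'precision': 1.0, 'recall': 0.0}, {'precision': 0.5, 'recall': 1.0}]}
```

The field names are what the -x recall and -y precision options refer to, so they must match the keys written by the evaluation script.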
