valohai-utils

Python helper library for the Valohai machine learning platform.

Install

pip install valohai-utils

Execution

Run locally

python mycode.py

Run in the cloud

vh yaml step mycode.py
vh exec run -a mystep

What does valohai-utils do?

  • Generates and updates the valohai.yaml configuration file based on the source code
  • Handles inputs agnostically (single file, multiple files, zip, tar)
  • Parses command-line parameters
  • Compresses outputs
  • Downloads inputs for local experiments
  • Offers a straightforward way to print metrics as Valohai metadata
  • Keeps code parity between local and cloud runs

Parameters

Valohai parameters are variables & hyperparameters that are parsed from the command line. You define parameters in a dictionary:

default_parameters = {
    'iterations': 100,
    'learning_rate': 0.001
}

The dictionary is fed to the valohai.prepare() method.

The values given are defaults. You can override them from the command line or in the Valohai web UI.

Example

import valohai

default_parameters = {
    'iterations': 10,
}

valohai.prepare(step="helloworld", default_parameters=default_parameters)

for i in range(valohai.parameters('iterations').value):
    print("Iteration %s" % i)

Inputs

Valohai inputs are the data files required by the experiment. They are automatically downloaded for you if the data is from a public source. You define inputs with a dictionary:

default_inputs = {
    'input_name': 'http://example.com/1.png'
}

An input can also be a list of URLs or a folder:

default_inputs = {
    'input_name': [
        'http://example.com/1.png',
        'http://example.com/2.png'
    ],
    'input_folder': [
        's3://mybucket/images/*',
        'azure://mycontainer/images/*',
        'gs://mybucket/images/*'
    ]
}

Or it can be an archive full of files (unpacked automagically on demand):

default_inputs = {
    'images': 'http://example.com/myimages.zip'
}

The dictionary is fed to the valohai.prepare() method.

The URL(s) given are defaults. You can override them from the command line or in the Valohai web UI.

Example

import csv
import valohai

default_inputs = {
    'myinput': 'https://pokemon-images-example.s3-eu-west-1.amazonaws.com/pokemon.csv'
}

valohai.prepare(step="test", default_inputs=default_inputs)

with open(valohai.inputs("myinput").path()) as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    for row in reader:
        print(row)

Outputs

Valohai outputs are the files that your step produces as its end result.

When you are ready to save your output file, you can query the correct path from valohai-utils.

Example

from PIL import Image
import valohai

# `in_path`, `width` and `height` come from earlier in your script.
image = Image.open(in_path)
new_image = image.resize((width, height))
out_path = valohai.outputs('resized').path('resized_image.png')
new_image.save(out_path)

Sometimes there are so many outputs that you may want to compress them into a single file.

In this case, once you have all your outputs saved, you can finalize the output with the compress() method.

Example

valohai.outputs('resized').compress("*.png", "images.zip", remove_originals=True)

Logging

You can log metrics using the Valohai metadata system and then render interactive graphs on the web interface. The valohai-utils logger will print JSON logs that Valohai will parse as metadata.

It is important for visualization that the logs for a single epoch are flushed out as a single JSON object.

Example

import valohai

for epoch in range(100):
    with valohai.metadata.logger() as logger:
        logger.log("epoch", epoch)
        logger.log("accuracy", accuracy)
        logger.log("loss", loss)

Example 2

import valohai

logger = valohai.logger()
for epoch in range(100):
    logger.log("epoch", epoch)
    logger.log("accuracy", accuracy)
    logger.log("loss", loss)
    logger.flush()
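
Each flush emits a single JSON object that Valohai parses as metadata, for example (illustrative values):

{"epoch": 0, "accuracy": 0.92, "loss": 0.18}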

Distributed Workloads

valohai.distributed contains a toolset for running distributed tasks on Valohai.

import valohai

if valohai.distributed.is_distributed_task():

    # `master()` reports the same worker on all contexts
    master = valohai.distributed.master()
    master_url = f'tcp://{master.primary_local_ip}:1234'

    # `members()` contains all workers in the distributed task
    member_public_ips = ",".join([
        m.primary_public_ip
        for m
        in valohai.distributed.members()
    ])

    # `me()` has full details about the current worker context
    details = valohai.distributed.me()

    size = valohai.distributed.required_count
    rank = valohai.distributed.rank  # 0, 1, 2, etc. depending on run context

Full example

Preprocess step for resizing image files

This example step will do the following:

  1. Take image files (or an archive containing images) as input.
  2. Resize each image to the size provided by the width & height parameters.
  3. Compress the resized images into a resized/images.zip Valohai output file.

import os
import valohai
from PIL import Image

default_parameters = {
    "width": 640,
    "height": 480,
}
default_inputs = {
    "images": [
        "https://dist.valohai.com/valohai-utils-tests/Example.jpg",
        "https://dist.valohai.com/valohai-utils-tests/planeshark.jpg",
    ],
}

valohai.prepare(step="resize", default_parameters=default_parameters, default_inputs=default_inputs)


def resize_image(in_path, out_path, width, height, logger):
    image = Image.open(in_path)
    logger.log("from_width", image.size[0])
    logger.log("from_height", image.size[1])
    logger.log("to_width", width)
    logger.log("to_height", height)
    new_image = image.resize((width, height))
    new_image.save(out_path)


if __name__ == '__main__':
    for image_path in valohai.inputs('images').paths():
        with valohai.metadata.logger() as logger:
            filename = os.path.basename(image_path)
            resize_image(
                in_path=image_path,
                out_path=valohai.outputs('resized').path(filename),
                width=valohai.parameters('width').value,
                height=valohai.parameters('height').value,
                logger=logger
            )
    valohai.outputs('resized').compress("*", "images.zip", remove_originals=True)

CLI command:

vh yaml step resize.py

This will produce the following valohai.yaml config:

- step:
    name: resize
    image: python:3.7
    command: python ./resize.py {parameters}
    parameters:
    - name: width
      default: 640
      multiple-separator: ','
      optional: false
      type: integer
    - name: height
      default: 480
      multiple-separator: ','
      optional: false
      type: integer
    inputs:
    - name: images
      default:
      - https://dist.valohai.com/valohai-utils-tests/Example.jpg
      - https://dist.valohai.com/valohai-utils-tests/planeshark.jpg
      optional: false

Configuration

There are some environment variables that affect how valohai-utils works when not running within a Valohai execution context.

  • VH_FLAT_LOCAL_OUTPUTS
    • If set, flattens the local outputs directory structure into a single directory. This means that outputs from subsequent runs can clobber old files.
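
For example, to enable the flat layout for a single local run:

VH_FLAT_LOCAL_OUTPUTS=1 python mycode.py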

Development

If you wish to further develop valohai-utils, remember to install development dependencies and write tests for your additions.

Linting

Lints are run via pre-commit.

If you want pre-commit to check your commits via git hooks:

pip install pre-commit
pre-commit install

You can also run the lints manually with pre-commit run --all-files.

Testing

pip install -e . -r requirements-dev.txt
pytest

valohai-utils's Issues

Support .ipynb files in vh yaml step command

You can create a YAML step from a Python file today:

vh yaml step hello.py

Now that valohai-utils supports notebooks, you should be able to do the same with a notebook:

vh yaml step hello.ipynb

valohai-utils should ignore input defaults pointing to local paths when updating the yaml

If you have:

inputs = {
    "classes": ".data/classes.txt",
    "train": ".data/train/train.tfrecord",
    "test": ".data/test/test.tfrecord",
    "weights": ".data/weights/yolov3.tf.data-00000-of-00001",
}

valohai.prepare(step="train", default_inputs=inputs)

You now get:

    inputs:
    - name: classes
      default:
      - .data/classes.txt
      optional: false
    - name: train
      default:
      - .data/train/train.tfrecord
      optional: false
    - name: test
      default:
      - .data/test/test.tfrecord
      optional: false
    - name: weights
      default:
      - .data/weights/yolov3.tf.data-00000-of-00001
      optional: false

You should get:

    inputs:
    - name: classes
      optional: false
    - name: train
      optional: false
    - name: test
      optional: false
    - name: weights
      optional: false

Use cached input files during local runs

Use local cached files from .valohai/inputs/<step>/<input-name>/ if they're available.

Now it tries to download the files from S3 (and of course fails, because there is no authentication) if you define a default s3 URI and run the script locally.

valohai-utils logger to print deployment metrics correctly

With valohai-utils:

logger.log("x", 0.54)
logger.log("y", 0.12)
logger.flush_logs()

The output is:

{"x": 0.54, "y": 0.12}

But in deployment you'd want:

{"vh_metadata": {"x": 0.54, "y": 0.12}}

In the short term, the logger should look for the existence of a VALOHAI_PORT env var or the /valohai/valohai-metadata.json file to figure out that it is running in a deployment, and print the metrics accordingly.

In the long term, a VALOHAI_DEPLOYMENT env var should be injected into the container for this purpose.
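
A minimal sketch of that short-term heuristic (a hypothetical helper, not part of the valohai-utils API):

import os

def is_deployment() -> bool:
    # Hypothetical helper: VALOHAI_PORT or the metadata file signals deployment.
    return bool(os.environ.get("VALOHAI_PORT")) or os.path.exists("/valohai/valohai-metadata.json")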

Allow dash in the input name for local runs

When doing a local run with an input called my-input,

  1. Valohai doesn't understand this and will return None for the input.
  2. There is also no local ./valohai/<step>/my-input/ folder created.

With input name myinput this would work.

valohai-utils should have an output mode where all outputs go to a fixed folder

Currently when you output files, they go to an experiment subfolder. For example:

.valohai/outputs/20210515-173144-383db6/stepname/outputname/yolov3.tf

With this behaviour, it is hard to build and debug pipelines locally, where you want to use the outputs of the previous step as the input of the next step.

There should be a mode, where outputs go to a fixed folder. For example:

.valohai/outputs/stepname/outputname/yolov3.tf

Or even:

.valohai/outputs/outputname/yolov3.tf

This way, you could use a fixed path as your input default and not need to copy-paste new paths constantly.

valohai-utils

  • Parse Python source code for inputs & parameters
  • Implement valohai.prepare() without writing anything to disk
  • Downloading & caching inputs (for local runs)
  • Unpacking and iterating input files
  • Auto-argparsing parameters
  • Output packing
  • Generate valohai.yaml
  • Standardized logging
  • Default image for valohai.yaml
  • Valohai CLI command(s)
  • Refactor global state to singleton
  • Redesign of API (https://valohai.slack.com/archives/C3J44LTL6/p1605277828112500)
  • Write decent readme.md

Original Google doc: https://docs.google.com/document/d/1Jbi2PhLKgSgvjtSsk6e1FvS5eEjCXgO0LCt7oJSYDnE/edit


User story

  • Annoying to get and manage datasets locally:
    • Heavily manual process with varying tools
    • Annoying to manually download
    • Forget to ignore data folder in .gitignore
    • Can’t remember how to use unzip/tar command
    • Forget to delete the local archive, iterate files in code expecting everything to be JPEGs, crashes on .zip
    • Write separate code for local & Valohai (different local paths)
      • Need to know about VH_INPUTS_DIR
      • Need to write unzip code just for Valohai

What I wantz

  • This is my input called "my_input" and this is the URL
    • Classic: write to valohai.yaml an input called my_images with a default URL
    • Jupyhai: a cell tagged as inputs has a variable called my_images
  • One line of code that gets me a path to my files
    • If on a local machine, download & cache
    • If it is a zip, unzip it for me plz

Proposal

  • Call get_input_file_{paths, streams, ...}("my_images", unarchive=True, extensions=(".jpg", ".jpeg", ".gif"))
    • Does the unarchiving happen in the named input root, i.e. the archive name doesn't create an extra directory?
      Assumption: yes, it happens in the named input root directory without an extra directory
    • What happens if there are multiple archives in the input?
      Assumption: unarchive all
    • What happens if the multiple archives have overwriting files?
      Assumption: overwrite silently, but if e.g. 2 input files have the same directories with differently named files, we should maintain all of them
    • Local:
      1. Figure out the URL
         • Classic: look for the my_input input default URL in valohai.yaml
         • Jupyhai: globals()["my_images"]
      2. Download & cache
      3. Auto unzip/untar/etc. if possible
      4. Return:
         • ./.valohai/inputs/my_images/some.jpg
         • ./.valohai/inputs/my_images/some2.jpg
         • ./.valohai/inputs/my_images/some3.jpg
    • Valohai:
      1. Already downloaded by Peon
      2. Return:
         • /valohai/inputs/my_images/some.jpg
         • /valohai/inputs/my_images/some2.jpg
         • /valohai/inputs/my_images/some3.jpg

Duplicate code is needed to properly handle Execution parameters

Currently, when using execution parameters, you need to define them in valohai.yaml and then again in Python, for example with @click or ArgumentParser. This basically duplicates the same information.
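
For illustration, a minimal argparse sketch of the Python half of this duplication (the same name, type, and default already live in valohai.yaml):

import argparse

# Duplicates what valohai.yaml already declares for this step:
# - name: iterations
#   type: integer
#   default: 100
parser = argparse.ArgumentParser()
parser.add_argument("--iterations", type=int, default=100)
args = parser.parse_args()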

If this could be defined only once, it would improve the user experience and remove possible errors caused by mismatched parameter handling.

This is based on feedback from Ari (Skillup).

valohai-utils Pipeline wrapper to Papi should find the default config for the user

There is needless importing hassle for loading the existing config. There should be an easier alternative that loads the existing config under the hood.

Now:

import valohai
import os
import yaml
from valohai_yaml.objs import Config


def read_config(filename):
    if not os.path.isabs(filename):
        filename = os.path.join(os.path.dirname(__file__), filename)
    with open(filename) as f:
        data = yaml.safe_load(f)
    return Config.parse(data)


def main():
    papi = valohai.Pipeline(name="mypipeline", config=read_config("valohai.yaml"))
    # return pipeline

Alternative:

import valohai

def main():
    papi = valohai.Pipeline(name="mypipeline")
    # return pipeline

Updating valohai.yaml with valohai.prepare does not change the file name

If valohai.prepare() has been run in myfirstfile.py to create a step called mystep, updating the step from another file, e.g. mysecondfile.py, does not update the filename in the command section.

That is, even after running vh yaml step mysecondfile.py, the command section in valohai.yaml looks like this:

    command:
    - pip install -r requirements.txt
    - python ./myfirstfile.py {parameters}

Valohai-utils caches inputs with empty defaults poorly

If you manually fill the cache by putting local files in ./valohai/inputs/mystep/images, valohai-utils doesn't load them from the cache if the default is empty.

inputs = {
    "images": "",
}

valohai.prepare(step="mystep", default_inputs=inputs)

vh yaml step notebook.ipynb doesn't work with the new papermill

We are moving to the new Papermill in Jupyhai, so Valohai CLI should do the same.

Today you can generate a step with vh yaml step hello.ipynb

The YAML-generating code for a notebook is also duplicated in two places (Jupyhai, Valohai CLI).

  • Move command generation code to valohai-utils
  • Make valohai-cli and jupyhai use that code

Improved pipeline creation with valohai-utils

It is possible to create very basic pipelines with valohai-utils. It would be great if you could also define:

  • parameter-to-parameter/metadata-to-parameter edges
  • pipeline actions
  • overrides
  • eventually also pipelines (in Tasks) in pipelines

Support files and folders as local inputs

Today I can use a local file as an input, and valohai-utils will feed the file to the code properly. However, this only works for a single file, not for a list of files.

Command-line overriding is not working if the parameter name has a '-'

Overriding parameters on the command line doesn't work if the parameter name contains a - character.

hello.py

import valohai

params = {
    "my-param": "123",
}
valohai.prepare(step="foo", default_parameters=params)
print(valohai.parameters('my-param').value)

Run:

python hello.py --my-param=666
123

hello2.py

import valohai

params = {
    "myparam": "123",
}
valohai.prepare(step="foo", default_parameters=params)
print(valohai.parameters('myparam').value)

Run:

python hello2.py --myparam=666
666

Allow creating parameter choices with valohai-utils

It is possible to define a list of choices for a parameter, which appears as a dropdown menu in the UI. It would be great to be able to define this with valohai-utils and the valohai.prepare() call.
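
A hypothetical shape for this (an assumed API, not something valohai-utils supports today):

import valohai

default_parameters = {
    "optimizer": {
        "default": "adam",
        "choices": ["adam", "sgd", "rmsprop"],  # hypothetical key, rendered as a dropdown in the UI
    },
}
valohai.prepare(step="train", default_parameters=default_parameters)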

Users other than root cannot write to /valohai/inputs

If the Docker container is started with a user other than root, it is not possible to write to /valohai/inputs, as that user does not have write permissions there. This is a problem when using archives as inputs, since valohai-utils tries to unarchive them into /valohai/inputs by default.

Suggested solutions:

  • Allow non-root users to write to /valohai/inputs or even /valohai/*
    OR
  • If the user is not root, do the unarchiving somewhere else where the user has write permission.

Cannot prepare a multiple-style parameter

Some parameters may have an array format, which is noted by the multiple property in valohai_yaml. It would be nice if valohai-utils's prepare() function could handle such a config for parameters.
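
A hypothetical shape prepare() could accept for such a parameter (an assumed API, mirroring valohai_yaml's multiple property):

import valohai

default_parameters = {
    "layers": {
        "default": [32, 64],
        "multiple": "separate",  # hypothetical key mirroring valohai_yaml
    },
}
valohai.prepare(step="train", default_parameters=default_parameters)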

Allow full parameter and input definition via valohai-utils prepare()

Today, you can only define:

  • parameter name, type, and default value
  • input name and default value

Instead of passing just a dict of name and default-value pairs, we should also allow passing a dict of dicts, where you can define all the other properties like optionality, pass-as, keep-directories, etc.

Example:

default_parameters = {
  "learning_rate": {
    "default": 0.001,
    "type": "float",
    "pass-as": "--lr={v}",
    "description": "Initial learning rate"
  }
}

valohai-utils vfs mangles paths with archived inputs

When auto-processing an archived input (archives are auto-processed by default), valohai-utils mangles the paths and names of the archived files.

Often there is information encoded in the path or filename (example: labels/UUID.json) that is now lost.

This code:

import valohai

inputs = {"labels": "/home/juha/ws/yolov3-tf2/labels.zip"}
valohai.prepare(step="test", default_inputs=inputs)
for label_path in valohai.inputs('labels').paths():
    print(label_path)

With this zip contents:

labels/02494272-79e7-4e8b-9d5b-5da07a350b16.json
labels/3f7a8a63-33c0-4654-b203-d4c51b2374dd.json
labels/4247d9db-81f1-46a8-8601-4059e68032b8.json
labels/442fcac9-1623-458d-937e-50b6a7ba44ea.json
labels/7dcae7e8-748a-40da-936f-7c3fa7c70e05.json
labels/82b93121-a4d5-4c71-a7ba-7702155e247b.json
labels/cce9141c-8cb1-4e85-9b01-e01f83cf838c.json
labels/d9abdd08-ab0b-444c-b6f2-a8b6b5209132.json
labels/e1864b66-302b-40a0-88b7-39681a8c2e4b.json
labels/e1f9c295-27cd-4828-b442-1d1b1ff14047.json

Returns this:

/tmp/tmp5vtugf_l.json
/tmp/tmpiuv92505.json
/tmp/tmpdb9crwg7.json
/tmp/tmpb5biboa8.json
/tmp/tmp8ebrfq_s.json
/tmp/tmp_nsqsyy6.json
/tmp/tmp6qrqnjni.json
/tmp/tmp6yc5y0w8.json
/tmp/tmppn0xf1gv.json
/tmp/tmp0w7_bsyd.json

Improve typing

This package doesn't ship a py.typed file, so downstream packages such as valohai-cli are oblivious to the types here.

(On that note, the typing for python_to_yaml_merge_strategy is wrong :-) )
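
For reference, a minimal setuptools sketch of shipping the PEP 561 marker (assuming a setup.py-based build; the py.typed file itself is empty):

from setuptools import setup, find_packages

setup(
    name="valohai-utils",
    packages=find_packages(),
    # Ship the empty py.typed marker so type checkers (PEP 561)
    # discover this package's inline type annotations.
    package_data={"valohai": ["py.typed"]},
)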

.prepare for Jupyhai overrides parameters

Currently, if I have a .prepare call in my Jupyter notebook with default_parameters=..., the values I pass into a copied execution/task from Valohai get overridden by these default values.

Effectively this means that I cannot for example copy a notebook run into a task, as all the task's executions would run with the same parameters.

Allow defining node overrides

Today you can only define pipeline nodes and edges. There is no way to define overrides for those edges, so most pipelines end up having multiple input files passed (the default inputs and the ones from the edge).

Utils can't find local cached input file by filename

When using datums as default input values, you get an error message "No connection adapters were found for datum" even if the file is locally available.

Current workaround: the local file needs to be renamed to the datum UID, without a file extension.

It would be fantastic if utils could find the file based on the filename.

mydata = {
    'train': 'datum://0181663c-4305-2cf0-298e-eb670f5c06dc',
    'test': 'datum://0181663c-4194-d20e-a804-7a8b4bfe1131'
}

.compress() should be able to compress subdirectories too

Today, compress() allows users to compress a set of files into a single file.

valohai.outputs('resized').compress("*.png", "images.zip", remove_originals=True)

However, it doesn't allow you to compress all subfolders and their files from the output. For example:

- myoutput
  - train
    - images
    - labels
  - eval
    - images
    - labels

There are several cases where you'd output files into a specific folder structure, then compress it and upload it to your cloud storage (maintaining the folder structure for when you pull it down in the next execution).
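
A hypothetical call for that (recursive patterns are not supported today):

valohai.outputs('myoutput').compress("**/*", "everything.zip", remove_originals=True)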

Bool parameters are not parsed correctly from the CLI

With this code:

params = {
    'falseflag': True,
    'trueflag': False,
}

valohai.prepare(step='paramtest', default_parameters=params)

print(valohai.parameters('falseflag').value)
print(valohai.parameters('trueflag').value)

This YAML:

    parameters:
    - name: falseflag
      default: false
      description: This is a flag parameter.
      multiple-separator: ','
      pass-false-as: --falseflag=false
      pass-true-as: --falseflag=true
      type: flag
    - name: trueflag
      default: true
      description: This is a flag parameter.
      multiple-separator: ','
      pass-false-as: --trueflag=false
      pass-true-as: --trueflag=true
      type: flag

And this CLI command:

python ./paramtest.py --falseflag=false --trueflag=true

We get this printout:

True
True

The reason is that valohai-utils CLI parsing expects flags, not booleans.

We should move to expecting the pass-false-as / pass-true-as bool-like behavior for both YAML generation and parsing.
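
A minimal sketch of bool-like value parsing (an assumption about the fix, not the library's code):

def parse_flag_value(value: str) -> bool:
    # Accept the explicit values produced by pass-true-as / pass-false-as
    # (e.g. --falseflag=false) instead of treating the option as a bare flag.
    return value.strip().lower() in ("true", "1", "yes")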

Can't define the environment in valohai.prepare()

It would be great if I could define the default environment for the step inside valohai.prepare().

Something like:

valohai.prepare(step="train", image="tensorflow/tensorflow:2.6.0", environment="myorg-aws-eu-west-1-p2.xlarge")

Slash handling when using Windows

A customer using Windows just got an error:

python: can't open file '.\model.py': [Errno 2] No such file or directory

due to having .\model.py instead of ./model.py in their YAML.

Could it be possible for utils to handle cases like this automatically?

valohai-utils input .path() and .paths() should support wildcards

It is quite annoying to find a single file from an input with potentially many files.

Today you are forced to write things like:

import fnmatch
import valohai

for path in fnmatch.filter(valohai.inputs("myinput").paths(), "weights*.pb"):
    print(path)

It should be:

my_file = valohai.inputs("myinput").path("weights*.pb")

Wildcards for local inputs not working

After the big refactor, wildcards for local inputs are not working.

This code:

valohai.prepare(step="train", default_inputs={"myinput": "*.png"})
print(valohai.inputs("myinput").path())

Will output:

ValueError: Can not download file with no URI

Local downloads not working on Windows

If you have this code:

import valohai
valohai.prepare(step="train", default_inputs={"test": "https://upload.wikimedia.org/wikipedia/en/a/a9/Example.jpg"})
print(valohai.inputs("test").path())

And run it on Windows, you get a path like:

.valohai/inputs\\test\\example.jpg

consts.py hardcodes the POSIX path separator, which doesn't work on Windows. This needs to be fixed.
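
A sketch of the fix: build paths with an OS-agnostic API instead of a hardcoded '/' (illustrative, not the actual consts.py code):

from pathlib import Path

# pathlib uses the platform's native separator on both POSIX and Windows.
inputs_root = Path(".valohai") / "inputs" / "test"
print(inputs_root / "example.jpg")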
