valohai-utils

Python helper library for the Valohai machine learning platform.

Install

pip install valohai-utils

Execution

Run locally

python mycode.py

Run in the cloud

vh yaml step mycode.py
vh exec run -a mystep

What does valohai-utils do?

  • Generates and updates the valohai.yaml configuration file based on the source code
  • Handles inputs agnostically (single file, multiple files, zip, tar)
  • Parses command-line parameters
  • Compresses outputs
  • Downloads inputs for local experiments
  • Offers a straightforward way to print metrics as Valohai metadata
  • Keeps code parity between local and cloud runs

Parameters

Valohai parameters are variables & hyperparameters that are parsed from the command line. You define parameters in a dictionary:

default_parameters = {
    'iterations': 100,
    'learning_rate': 0.001
}

The dictionary is fed to the valohai.prepare() method.

The values given are defaults. You can override them from the command line or in the Valohai web UI.

Example

import valohai

default_parameters = {
    'iterations': 10,
}

valohai.prepare(step="helloworld", default_parameters=default_parameters)

for i in range(valohai.parameters('iterations').value):
    print("Iteration %s" % i)

Inputs

Valohai inputs are the data files required by the experiment. They are automatically downloaded for you if the data is from a public source. You define inputs with a dictionary:

default_inputs = {
    'input_name': 'http://example.com/1.png'
}

An input can also be a list of URLs or a folder:

default_inputs = {
    'input_name': [
        'http://example.com/1.png',
        'http://example.com/2.png'
    ],
    'input_folder': [
        's3://mybucket/images/*',
        'azure://mycontainer/images/*',
        'gs://mybucket/images/*'
    ]
}

Or it can be an archive full of files (unpacked automagically on demand):

default_inputs = {
    'images': 'http://example.com/myimages.zip'
}

The dictionary is fed to the valohai.prepare() method.

The URL(s) given are defaults. You can override them from the command line or in the Valohai web UI.

Example

import csv
import valohai

default_inputs = {
    'myinput': 'https://pokemon-images-example.s3-eu-west-1.amazonaws.com/pokemon.csv'
}

valohai.prepare(step="test", default_inputs=default_inputs)

with open(valohai.inputs("myinput").path()) as csv_file:
    reader = csv.reader(csv_file, delimiter=',')
    for row in reader:
        print(row)

Outputs

Valohai outputs are the files that your step produces as its end result.

When you are ready to save your output file, you can query the correct path from valohai-utils.

Example

from PIL import Image
import valohai

# `in_path`, `width` and `height` come from earlier in your script.
image = Image.open(in_path)
new_image = image.resize((width, height))
out_path = valohai.outputs('resized').path('resized_image.png')
new_image.save(out_path)

Sometimes there are so many outputs that you may want to compress them into a single file.

In this case, once you have all your outputs saved, you can finalize the output with the compress() method.

Example

valohai.outputs('resized').compress("*.png", "images.zip", remove_originals=True)

Logging

You can log metrics using the Valohai metadata system and then render interactive graphs on the web interface. The valohai-utils logger will print JSON logs that Valohai will parse as metadata.

It is important for visualization that the logs for a single epoch are flushed out as a single JSON object.

Example

import valohai

for epoch in range(100):
    with valohai.metadata.logger() as logger:
        logger.log("epoch", epoch)
        logger.log("accuracy", accuracy)
        logger.log("loss", loss)

Example 2

import valohai

logger = valohai.logger()
for epoch in range(100):
    logger.log("epoch", epoch)
    logger.log("accuracy", accuracy)
    logger.log("loss", loss)
    logger.flush()
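
Each flush emits a single JSON object that Valohai parses as metadata, for example (illustrative values):

{"epoch": 0, "accuracy": 0.92, "loss": 0.18}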

Distributed Workloads

valohai.distributed contains a toolset for running distributed tasks on Valohai.

import valohai

if valohai.distributed.is_distributed_task():

    # `master()` reports the same worker on all contexts
    master = valohai.distributed.master()
    master_url = f'tcp://{master.primary_local_ip}:1234'

    # `members()` contains all workers in the distributed task
    member_public_ips = ",".join([
        m.primary_public_ip
        for m
        in valohai.distributed.members()
    ])

    # `me()` has full details about the current worker context
    details = valohai.distributed.me()

    size = valohai.distributed.required_count
    rank = valohai.distributed.rank  # 0, 1, 2, etc. depending on run context

Full example

Preprocess step for resizing image files

This example step will do the following:

  1. Take image files (or an archive containing images) as input.
  2. Resize each image to the size provided by the width & height parameters.
  3. Compress the resized images into a resized/images.zip Valohai output file.

import os
import valohai
from PIL import Image

default_parameters = {
    "width": 640,
    "height": 480,
}
default_inputs = {
    "images": [
        "https://dist.valohai.com/valohai-utils-tests/Example.jpg",
        "https://dist.valohai.com/valohai-utils-tests/planeshark.jpg",
    ],
}

valohai.prepare(step="resize", default_parameters=default_parameters, default_inputs=default_inputs)


def resize_image(in_path, out_path, width, height, logger):
    image = Image.open(in_path)
    logger.log("from_width", image.size[0])
    logger.log("from_height", image.size[1])
    logger.log("to_width", width)
    logger.log("to_height", height)
    new_image = image.resize((width, height))
    new_image.save(out_path)


if __name__ == '__main__':
    for image_path in valohai.inputs('images').paths():
        with valohai.metadata.logger() as logger:
            filename = os.path.basename(image_path)
            resize_image(
                in_path=image_path,
                out_path=valohai.outputs('resized').path(filename),
                width=valohai.parameters('width').value,
                height=valohai.parameters('height').value,
                logger=logger
            )
    valohai.outputs('resized').compress("*", "images.zip", remove_originals=True)

CLI command:

vh yaml step resize.py

This will produce the following valohai.yaml config:

- step:
    name: resize
    image: python:3.7
    command: python ./resize.py {parameters}
    parameters:
    - name: width
      default: 640
      multiple-separator: ','
      optional: false
      type: integer
    - name: height
      default: 480
      multiple-separator: ','
      optional: false
      type: integer
    inputs:
    - name: images
      default:
      - https://dist.valohai.com/valohai-utils-tests/Example.jpg
      - https://dist.valohai.com/valohai-utils-tests/planeshark.jpg
      optional: false

Configuration

There are some environment variables that affect how valohai-utils works when not running within a Valohai execution context.

  • VH_FLAT_LOCAL_OUTPUTS
    • If set, flattens the local outputs directory structure into a single directory. This means that outputs from subsequent runs can clobber old files.
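
For example, to enable the flat layout for a single local run:

VH_FLAT_LOCAL_OUTPUTS=1 python mycode.py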

Development

If you wish to further develop valohai-utils, remember to install development dependencies and write tests for your additions.

Linting

Lints are run via pre-commit.

If you want pre-commit to check your commits via git hooks:

pip install pre-commit
pre-commit install

You can also run the lints manually with pre-commit run --all-files.

Testing

pip install -e . -r requirements-dev.txt
pytest

valohai-utils's Issues

Support .ipynb files in vh yaml step command

You can create a YAML step from a Python file today:

vh yaml step hello.py

Now that valohai-utils supports notebooks, you should be able to do the same with a notebook:

vh yaml step hello.ipynb

valohai-utils should ignore input defaults pointing to local paths when updating the yaml

If you have:

inputs = {
    "classes": ".data/classes.txt",
    "train": ".data/train/train.tfrecord",
    "test": ".data/test/test.tfrecord",
    "weights": ".data/weights/yolov3.tf.data-00000-of-00001",
}

valohai.prepare(step="train", default_inputs=inputs)

You now get:

    inputs:
    - name: classes
      default:
      - .data/classes.txt
      optional: false
    - name: train
      default:
      - .data/train/train.tfrecord
      optional: false
    - name: test
      default:
      - .data/test/test.tfrecord
      optional: false
    - name: weights
      default:
      - .data/weights/yolov3.tf.data-00000-of-00001
      optional: false

You should get:

    inputs:
    - name: classes
      optional: false
    - name: train
      optional: false
    - name: test
      optional: false
    - name: weights
      optional: false

Use cached input files during local runs

Use local cached files from .valohai/inputs/<step>/<input-name>/ if they're available.

Now it tries to download the files from S3 (and of course fails, because there is no authentication) if you define a default s3 URI and run the script locally.

valohai-utils logger to print deployment metrics correctly

With valohai-utils:

logger.log("x", 0.54)
logger.log("y", 0.12)
logger.flush_logs()

The output is:

{"x": 0.54, "y": 0.12}

But in deployment you'd want:

{"vh_metadata": {"x": 0.54, "y": 0.12}}

In the short term, the logger should look for the existence of a VALOHAI_PORT env var or the /valohai/valohai-metadata.json file to figure out that it is running in a deployment, and print the metrics accordingly.

In the long term, a VALOHAI_DEPLOYMENT env var should be injected into the container for this purpose.
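
A minimal sketch of that short-term heuristic (a hypothetical helper, not part of the valohai-utils API):

import os

def is_deployment() -> bool:
    # Hypothetical helper: VALOHAI_PORT or the metadata file signals deployment.
    return bool(os.environ.get("VALOHAI_PORT")) or os.path.exists("/valohai/valohai-metadata.json")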

Allow dash in the input name for local runs

When doing a local run with an input called my-input,

  1. Valohai doesn't understand this and will return None for the input.
  2. There is also no local ./valohai/<step>/my-input/ folder created.

With input name myinput this would work.

valohai-utils should have an output mode where all outputs go to a fixed folder

Currently when you output files, they go to an experiment subfolder. For example:

.valohai/outputs/20210515-173144-383db6/stepname/outputname/yolov3.tf

With this behaviour, it is hard to build and debug pipelines locally, where you want to use the outputs of the previous step as the input of the next step.

There should be a mode, where outputs go to a fixed folder. For example:

.valohai/outputs/stepname/outputname/yolov3.tf

Or even:

.valohai/outputs/outputname/yolov3.tf

This way, you could use a fixed path as your input default and not need to copy-paste new paths constantly.

valohai-utils

  • Parse Python source code for inputs & parameters
  • Implement valohai.prepare() without writing anything to disk
  • Downloading & caching inputs (for local runs)
  • Unpacking and iterating input files
  • Auto-argparsing parameters
  • Output packing
  • Generate valohai.yaml
  • Standardized logging
  • Default image for valohai.yaml
  • Valohai CLI command(s)
  • Refactor global state to singleton
  • Redesign of API (https://valohai.slack.com/archives/C3J44LTL6/p1605277828112500)
  • Write decent readme.md

Original Google doc: https://docs.google.com/document/d/1Jbi2PhLKgSgvjtSsk6e1FvS5eEjCXgO0LCt7oJSYDnE/edit


User story

  • Annoying to get and manage datasets locally:
    • Heavily manual process with varying tools
    • Annoying to manually download
    • Forget to ignore data folder in .gitignore
    • Can’t remember how to use unzip/tar command
    • Forget to delete the local archive, iterate files in code expecting everything to be JPEGs, crashes on .zip
    • Write separate code for local & Valohai (different local paths)
      • Need to know about VH_INPUTS_DIR
      • Need to write unzip code just for Valohai

What I wantz

  • This is my input called "my_input" and this is the URL
    • Classic: write to valohai.yaml an input called my_images with a default URL
    • Jupyhai: a cell tagged as inputs has a variable called my_images
  • One line of code that gets me a path to my files
    • If on a local machine, download & cache
    • If it is a zip, unzip it for me plz

Proposal

  • Call get_input_file_{paths, streams, ...}("my_images", unarchive=True, extensions=(".jpg", ".jpeg", ".gif"))
    • Does the unarchiving happen in the named input root, i.e. the archive name doesn't create an extra directory?
      Assumption: yes, it happens in the named input root directory without an extra directory
    • What happens if there are multiple archives in the input?
      Assumption: unarchive all
    • What happens if the multiple archives have overwriting files?
      Assumption: overwrite silently, but if e.g. 2 input files have the same directories with differently named files, we should maintain all of them
    • Local:
      1. Figure out the URL
         • Classic: look for the my_input input default URL in valohai.yaml
         • Jupyhai: globals()["my_images"]
      2. Download & cache
      3. Auto unzip/untar/etc. if possible
      4. Return:
         • ./.valohai/inputs/my_images/some.jpg
         • ./.valohai/inputs/my_images/some2.jpg
         • ./.valohai/inputs/my_images/some3.jpg
    • Valohai:
      1. Already downloaded by Peon
      2. Return:
         • /valohai/inputs/my_images/some.jpg
         • /valohai/inputs/my_images/some2.jpg
         • /valohai/inputs/my_images/some3.jpg

Duplicate code is needed to properly handle Execution parameters

Currently, when using execution parameters, you need to define them in valohai.yaml and then again in Python, for example with @click or ArgumentParser. This basically duplicates the same information.
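
For illustration, a minimal argparse sketch of the Python half of this duplication (the same name, type, and default already live in valohai.yaml):

import argparse

# Duplicates what valohai.yaml already declares for this step:
# - name: iterations
#   type: integer
#   default: 100
parser = argparse.ArgumentParser()
parser.add_argument("--iterations", type=int, default=100)
args = parser.parse_args()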

If this could be defined only once, it would improve the user experience and remove possible errors caused by mismatched parameter handling.

This is based on feedback from Ari (Skillup).

valohai-utils Pipeline wrapper to Papi should find the default config for the user

There is needless importing hassle for loading the existing config. There should be an easier alternative that loads the existing config under the hood.

Now:

import valohai
import os
import yaml
from valohai_yaml.objs import Config


def read_config(filename):
    if not os.path.isabs(filename):
        filename = os.path.join(os.path.dirname(__file__), filename)
    with open(filename) as f:
        data = yaml.safe_load(f)
    return Config.parse(data)


def main():
    papi = valohai.Pipeline(name="mypipeline", config=read_config("valohai.yaml"))
    # return pipeline

Alternative:

import valohai

def main():
    papi = valohai.Pipeline(name="mypipeline")
    # return pipeline

Updating valohai.yaml with valohai.prepare does not change the file name

If valohai.prepare() has been run in myfirstfile.py to create a step called mystep, updating the step from another file, e.g. mysecondfile.py, does not update the filename in the command section.

That is, even after running vh yaml step mysecondfile.py, the command section in valohai.yaml looks like this:

    command:
    - pip install -r requirements.txt
    - python ./myfirstfile.py {parameters}

Valohai-utils caches inputs with empty defaults poorly

If you manually fill the cache by putting local files in ./valohai/inputs/mystep/images, valohai-utils doesn't load them from the cache if the default is empty.

inputs = {
    "images": "",
}

valohai.prepare(step="mystep", default_inputs=inputs)

vh yaml step notebook.ipynb doesn't work with the new papermill

We are moving to the new Papermill in Jupyhai, so Valohai CLI should do the same.

Today you can generate a step with vh yaml step hello.ipynb

The YAML-generating code for a notebook is also duplicated in two places (Jupyhai, Valohai CLI).

  • Move command generation code to valohai-utils
  • Make valohai-cli and jupyhai use that code

Improved pipeline creation with valohai-utils

It is possible to create very basic pipelines with valohai-utils. It would be great if you could also define:

  • parameter-to-parameter/metadata-to-parameter edges
  • pipeline actions
  • overrides
  • eventually also pipelines (in Tasks) in pipelines

Support files and folders as local inputs

Today I can use a local file as an input, and valohai-utils will feed the file to the code properly. However, this only works for a single file, not for a list of files.

Command-line overriding is not working if the parameter name has a '-'

Overriding parameters on the command line doesn't work if the parameter name contains a - character.

hello.py

import valohai

params = {
    "my-param": "123",
}
valohai.prepare(step="foo", default_parameters=params)
print(valohai.parameters('my-param').value)

Run:

python hello.py --my-param=666
123

hello2.py

import valohai

params = {
    "myparam": "123",
}
valohai.prepare(step="foo", default_parameters=params)
print(valohai.parameters('myparam').value)

Run:

python hello2.py --myparam=666
666

Allow creating parameter choices with valohai-utils

It is possible to define a list of choices for a parameter, which appears as a dropdown menu in the UI. It would be great to be able to define this with valohai-utils and the valohai.prepare() call.
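
A hypothetical shape for this (an assumed API, not something valohai-utils supports today):

import valohai

default_parameters = {
    "optimizer": {
        "default": "adam",
        "choices": ["adam", "sgd", "rmsprop"],  # hypothetical key, rendered as a dropdown in the UI
    },
}
valohai.prepare(step="train", default_parameters=default_parameters)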

Users other than root cannot write to /valohai/inputs

If the Docker container is started with a user other than root, it is not possible to write to /valohai/inputs, as that user does not have write permissions there. This is a problem when using archives as inputs, since valohai-utils tries to unarchive them into /valohai/inputs by default.

Suggested solutions:

  • Allow non-root users to write to /valohai/inputs or even /valohai/*
    OR
  • If the user is not root, do the unarchiving somewhere else where the user has write permission.

Cannot prepare a multiple-style parameter

Some parameters may have an array format, which is noted by the multiple property in valohai_yaml. It would be nice if valohai-utils's prepare() function could handle such a config for parameters.
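
A hypothetical shape prepare() could accept for such a parameter (an assumed API, mirroring valohai_yaml's multiple property):

import valohai

default_parameters = {
    "layers": {
        "default": [32, 64],
        "multiple": "separate",  # hypothetical key mirroring valohai_yaml
    },
}
valohai.prepare(step="train", default_parameters=default_parameters)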

Allow full parameter and input definition via valohai-utils prepare()

Today, you can only define:

  • parameter name, type, and default value
  • input name and default value

Instead of passing just a dict of name and default-value pairs, we should also allow passing a dict of dicts, where you can define all the other properties like optionality, pass-as, keep-directories, etc.

Example:

default_parameters = {
  "learning_rate": {
    "default": 0.001,
    "type": "float",
    "pass-as": "--lr={v}",
    "description": "Initial learning rate"
  }
}

valohai-utils vfs mangles paths with archived inputs

When auto-processing an archived input (archives are auto-processed by default), valohai-utils mangles the paths and names of the archived files.

Often there is information encoded in the path or filename (example: labels/UUID.json) that is now lost.

This code:

import valohai

inputs = {"labels": "/home/juha/ws/yolov3-tf2/labels.zip"}
valohai.prepare(step="test", default_inputs=inputs)
for label_path in valohai.inputs('labels').paths():
    print(label_path)

With this zip contents:

labels/02494272-79e7-4e8b-9d5b-5da07a350b16.json
labels/3f7a8a63-33c0-4654-b203-d4c51b2374dd.json
labels/4247d9db-81f1-46a8-8601-4059e68032b8.json
labels/442fcac9-1623-458d-937e-50b6a7ba44ea.json
labels/7dcae7e8-748a-40da-936f-7c3fa7c70e05.json
labels/82b93121-a4d5-4c71-a7ba-7702155e247b.json
labels/cce9141c-8cb1-4e85-9b01-e01f83cf838c.json
labels/d9abdd08-ab0b-444c-b6f2-a8b6b5209132.json
labels/e1864b66-302b-40a0-88b7-39681a8c2e4b.json
labels/e1f9c295-27cd-4828-b442-1d1b1ff14047.json

Returns this:

/tmp/tmp5vtugf_l.json
/tmp/tmpiuv92505.json
/tmp/tmpdb9crwg7.json
/tmp/tmpb5biboa8.json
/tmp/tmp8ebrfq_s.json
/tmp/tmp_nsqsyy6.json
/tmp/tmp6qrqnjni.json
/tmp/tmp6yc5y0w8.json
/tmp/tmppn0xf1gv.json
/tmp/tmp0w7_bsyd.json

Improve typing

This package doesn't ship a py.typed file, so downstream packages such as valohai-cli are oblivious to the types here.

(On that note, the typing for python_to_yaml_merge_strategy is wrong :-) )
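
For reference, a minimal setuptools sketch of shipping the PEP 561 marker (assuming a setup.py-based build; the py.typed file itself is empty):

from setuptools import setup, find_packages

setup(
    name="valohai-utils",
    packages=find_packages(),
    # Ship the empty py.typed marker so type checkers (PEP 561)
    # discover this package's inline type annotations.
    package_data={"valohai": ["py.typed"]},
)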

.prepare for Jupyhai overrides parameters

Currently, if I have a .prepare call in my Jupyter notebook with default_parameters=..., the values I pass into a copied execution/task from Valohai get overridden by these default values.

Effectively this means that I cannot for example copy a notebook run into a task, as all the task's executions would run with the same parameters.

Allow defining node overrides

Today you can only define pipeline nodes and edges. There is no way to define overrides for those edges, so most pipelines end up having multiple input files passed (the default inputs and the ones from the edge).

Utils can't find local cached input file by filename

When using datums as default input values, you get an error message "No connection adapters were found for datum" even if the file is locally available.

Current workaround: the local file needs to be renamed to the datum UID, without a file extension.

It would be fantastic if utils could find the file based on the filename.

mydata = {
    'train': 'datum://0181663c-4305-2cf0-298e-eb670f5c06dc',
    'test': 'datum://0181663c-4194-d20e-a804-7a8b4bfe1131'
}

.compress() should be able to compress subdirectories too

Today, compress() allows users to compress a set of files into a single file.

valohai.outputs('resized').compress("*.png", "images.zip", remove_originals=True)

However, it doesn't allow you to compress all subfolders and their files from the output. For example:

- myoutput
  - train
    - images
    - labels
  - eval
    - images
    - labels

There are several cases where you'd output files into a specific folder structure, then compress it and upload it to your cloud storage (maintaining the folder structure for when you pull it down in the next execution).
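
A hypothetical call for that (recursive patterns are not supported today):

valohai.outputs('myoutput').compress("**/*", "everything.zip", remove_originals=True)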

Bool parameters are not parsed correctly from the CLI

With this code:

params = {
    'falseflag': True,
    'trueflag': False,
}

valohai.prepare(step='paramtest', default_parameters=params)

print(valohai.parameters('falseflag').value)
print(valohai.parameters('trueflag').value)

This YAML:

    parameters:
    - name: falseflag
      default: false
      description: This is a flag parameter.
      multiple-separator: ','
      pass-false-as: --falseflag=false
      pass-true-as: --falseflag=true
      type: flag
    - name: trueflag
      default: true
      description: This is a flag parameter.
      multiple-separator: ','
      pass-false-as: --trueflag=false
      pass-true-as: --trueflag=true
      type: flag

And this CLI command:

python ./paramtest.py --falseflag=false --trueflag=true

We get this printout:

True
True

The reason is that valohai-utils CLI parsing expects flags, not booleans.

We should move to expecting the pass-false-as / pass-true-as bool-like behavior for both YAML generation and parsing.
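
A minimal sketch of bool-like value parsing (an assumption about the fix, not the library's code):

def parse_flag_value(value: str) -> bool:
    # Accept the explicit values produced by pass-true-as / pass-false-as
    # (e.g. --falseflag=false) instead of treating the option as a bare flag.
    return value.strip().lower() in ("true", "1", "yes")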

Can't define the environment in valohai.prepare()

It would be great if I could define the default environment for the step inside valohai.prepare().

Something like:

valohai.prepare(step="train", image="tensorflow/tensorflow:2.6.0", environment="myorg-aws-eu-west-1-p2.xlarge")

Slash handling when using Windows

A customer using Windows just got an error:

python: can't open file '.\model.py': [Errno 2] No such file or directory

due to having .\model.py instead of ./model.py in their YAML.

Could it be possible for utils to handle cases like this automatically?

valohai-utils input .path() and .paths() should support wildcards

It is quite annoying to find a single file from an input with potentially many files.

Today you are forced to write things like:

import fnmatch
import valohai

for path in fnmatch.filter(valohai.inputs("myinput").paths(), "weights*.pb"):
    print(path)

It should be:

my_file = valohai.inputs("myinput").path("weights*.pb")

Wildcards for local inputs not working

After the big refactor, wildcards for local inputs are not working.

This code:

valohai.prepare(step="train", default_inputs={"myinput": "*.png"})
print(valohai.inputs("myinput").path())

Will output:

ValueError: Can not download file with no URI

Local downloads not working on Windows

If you have this code:

import valohai
valohai.prepare(step="train", default_inputs={"test": "https://upload.wikimedia.org/wikipedia/en/a/a9/Example.jpg"})
print(valohai.inputs("test").path())

And run it on Windows, you get a path like:

.valohai/inputs\\test\\example.jpg

consts.py hardcodes the POSIX path separator, which doesn't work on Windows. This needs to be fixed.
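
A sketch of the fix: build paths with an OS-agnostic API instead of a hardcoded '/' (illustrative, not the actual consts.py code):

from pathlib import Path

# pathlib uses the platform's native separator on both POSIX and Windows.
inputs_root = Path(".valohai") / "inputs" / "test"
print(inputs_root / "example.jpg")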
