
netflix / metaflow


:rocket: Build and manage real-life ML, AI, and data science projects with ease!

Home Page: https://metaflow.org

License: Apache License 2.0

Shell 0.06% Python 91.27% Jupyter Notebook 0.49% R 4.19% HTML 1.17% CSS 0.10% JavaScript 0.10% Svelte 2.05% TypeScript 0.55%
machine-learning data-science productivity model-management ai ml ml-platform ml-infrastructure python r

metaflow's Introduction


Metaflow

Metaflow is a human-friendly library that helps scientists and engineers build and manage real-life data science projects. Metaflow was originally developed at Netflix to boost productivity of data scientists who work on a wide variety of projects from classical statistics to state-of-the-art deep learning.

For more information, see Metaflow's website and documentation.

From prototype to production (and back)

Metaflow provides a simple, friendly API that covers foundational needs of ML, AI, and data science projects:

  1. Rapid local prototyping, support for notebooks, and built-in experiment tracking and versioning.
  2. Horizontal and vertical scalability to the cloud, utilizing both CPUs and GPUs, and fast data access.
  3. Managing dependencies and one-click deployments to highly available production orchestrators.

Getting started

Getting up and running is easy. If you don't know where to start, the Metaflow sandbox will have you running and exploring Metaflow in seconds.

Installing Metaflow in your Python environment

To install Metaflow in your local environment, you can install from PyPI:

pip install metaflow

Alternatively, you can install from conda-forge:

conda install -c conda-forge metaflow

If you are eager to try out Metaflow in practice, you can start with the tutorial. After the tutorial, you can learn more about how Metaflow works here.

Deploying infrastructure for Metaflow in your cloud

While you can get started with Metaflow easily on your laptop, the main benefits of Metaflow lie in its ability to scale out to external compute clusters and to deploy to production-grade workflow orchestrators. To benefit from these features, follow this guide to configure Metaflow and the infrastructure behind it appropriately.

Join the Metaflow community, where thousands of data scientists and ML engineers discuss the ins and outs of applied machine learning.

Get in touch

There are several ways to get in touch with us:

Contributing

We welcome contributions to Metaflow. Please see our contribution guide for more details.

metaflow's People

Contributors

akyrola, amerberg, bishax, bsridatta, cclauss, crk-codaio, darinyu, dependabot[bot], dhpollack, dpoznik, emattia, ferras, jackie-ob, jasonge27, jimbudarz, madhur-ob, mdneuzerling, oavdeev, obgibson, pjoshi30, romain-intel, ryan-williams, saikonen, sam-watts, savingoyal, shrinandj, tfurmston, tuulos, tylerpotts, valaydave


metaflow's Issues

Windows Issue with Tutorial: No module named 'fcntl'

Hello metaflow,

I am interested in learning about the metaflow offering but am hitting a snag in the very first tutorial:

λ python 00-helloworld/helloworld.py show
Traceback (most recent call last):
  File "00-helloworld/helloworld.py", line 1, in <module>
    from metaflow import FlowSpec, step
  File "C:\Users\Nick.Franciose\AppData\Local\Programs\Python\Python36\lib\site-packages\metaflow\__init__.py", line 45, in <module>
    from .event_logger import EventLogger
  File "C:\Users\Nick.Franciose\AppData\Local\Programs\Python\Python36\lib\site-packages\metaflow\event_logger.py", line 1, in <module>
    from .sidecar import SidecarSubProcess
  File "C:\Users\Nick.Franciose\AppData\Local\Programs\Python\Python36\lib\site-packages\metaflow\sidecar.py", line 4, in <module>
    import fcntl
ModuleNotFoundError: No module named 'fcntl'

Stack Overflow suggests fcntl is Linux-specific. Is this offering Windows-compatible? If so, do you have a workaround?

Best,
Nick

Integration with Kubeflow pipelines

First things first, thanks for open-sourcing this project! We are working on a very similar project, and Metaflow is going to help us a lot.

Do you plan to create some kind of integration with Kubeflow Pipelines? For us, this would be very helpful for deploying these pipelines in our production environment.

Support for Airflow and Kubernetes

An open-source version of issue #2 -- would love to be able to have Metaflow plugins that support Airflow and Kubernetes!

We currently deploy our machine learning models to Kubernetes as RESTful API-wrapped microservices, then create Airflow DAGs to orchestrate and schedule the execution of all the model components.

Admittedly, I'm not entirely familiar with everything Metaflow offers just yet, but I would love to see seamless integrations with these other awesome open-source tools!

Support for another public cloud - Google Cloud Platform

Currently, Metaflow is set up to work with AWS as the default public cloud. The architecture of Metaflow allows for additional public clouds to be supported.

Adding support for Google Cloud Platform might broaden the potential user base, which could increase the adoption rate. This, in turn, could lead to increased community attention.

continuously updating data

self.days = list(range(NOW + 1))  # NOW is a placeholder for "today"
self.next(self.compute_day, foreach='days')

This will work, but if I rerun it, it will recompute days 0 to NOW. I can easily "hack" around that by interfacing with S3 directly to skip running day 0 if we already have the results, but that breaks local testing.

There are a few ways of doing this (params, the client API, interfacing with S3 directly), but none of them is super elegant.

Is this a pattern that you've discussed? Is it a valid use case for Metaflow, or do you recommend delegating this to Airflow or similar?
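
For what it's worth, here is a hedged sketch of the client-API approach, looking up what a previous successful run already computed and fanning out only over new days (the flow name, the computed_days artifact, and NOW are all hypothetical; handling of an empty days list is left out):

from metaflow import Flow, FlowSpec, step

NOW = 100  # placeholder for "today"


class DailyFlow(FlowSpec):
    @step
    def start(self):
        # Look up which days a previous successful run already processed.
        try:
            self.done = set(Flow('DailyFlow').latest_successful_run.data.computed_days)
        except Exception:
            self.done = set()  # first run: nothing computed yet
        self.days = [d for d in range(NOW + 1) if d not in self.done]
        self.next(self.compute_day, foreach='days')

    @step
    def compute_day(self):
        self.day = self.input  # the day assigned to this foreach task
        self.next(self.join)

    @step
    def join(self, inputs):
        # Persist everything computed so far for the next run to read.
        self.computed_days = inputs[0].done | {i.day for i in inputs}
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == '__main__':
    DailyFlow()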

Instruction to setup on AWS is not clear

I cannot successfully finish the AWS setup. I used the CloudFormation template at https://github.com/Netflix/metaflow-tools/tree/master/aws/cloudformation, and it gives me all the resources listed there.

When I run metaflow configure aws, as I understand it, I need to enter the output resource ARNs there. But I notice:

  1. It's hard for me to map resources to configurations. For example:
  • Please enter the job queue to use for batch: -> Queue name or ARN?

  • Please enter the IAM role to use for the container to get AWS S3 access -> Is it ECSJobRole from the CF outputs?

  • Please enter the URL for your metadata service: -> Is it the ServiceUrl? There's another one, InternalServiceUrl.

  2. Some configuration values cannot be found in the CloudFormation outputs:
  • Please enter the default container image to use -> I cannot find any instruction here. I assume we at least need a Python environment in the base container image?

  • Please enter the container registry -> Should this be <account_id>.dkr.ecr.us-west-2.amazonaws.com?

I would suggest improving the documentation here:
https://docs.metaflow.org/metaflow-on-aws/deploy-to-aws

Improve CTRL-C handling

The current handling of interrupted flows (i.e., when the user hits CTRL-C while a flow is running) has two issues:

  • there is a mandatory 1-second wait for each and every running subprocess (i.e., one per in-flight task)
  • there is the possibility that the subprocesses are killed before they have a chance to clean up (this is particularly important for subprocesses handling Batch)

It should be possible to:

  • wait for all subprocesses to shut down properly (with a timeout)
  • kill any stragglers

This would both eliminate the 1-second minimum time per task and avoid (or at least mitigate) early kills; see the sketch below.
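
A minimal sketch of that shutdown pattern in plain Python (generic subprocess handling, not Metaflow's actual implementation):

import subprocess
import time


def shutdown(procs, timeout=5.0):
    # Ask every in-flight subprocess to shut down gracefully.
    for p in procs:
        p.terminate()
    # Wait for all of them, sharing a single global timeout.
    deadline = time.monotonic() + timeout
    for p in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            p.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            p.kill()  # kill any stragglers that failed to clean up in time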

ModuleNotFoundError: No module named 'fcntl' on Windows

Steps to Reproduce

  1. pip install metaflow
  2. metaflow
Traceback (most recent call last):
  File "C:\Program Files\Python37\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Program Files\Python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\Code\Python\CT83-PC\venv\metaflow\Scripts\metaflow.exe\__main__.py", line 5, in <module>
  File "d:\code\python\ct83-pc\venv\metaflow\lib\site-packages\metaflow\__init__.py", line 45, in <module>
    from .event_logger import EventLogger
  File "d:\code\python\ct83-pc\venv\metaflow\lib\site-packages\metaflow\event_logger.py", line 1, in <module>
    from .sidecar import SidecarSubProcess
  File "d:\code\python\ct83-pc\venv\metaflow\lib\site-packages\metaflow\sidecar.py", line 4, in <module>
    import fcntl
ModuleNotFoundError: No module named 'fcntl'

What I think the problem is

The module fcntl is not available on Windows, which makes it impossible to run Metaflow there.

I am open to suggestions. 🤔

Update 1

Obvious solution: Windows Subsystem for Linux (WSL)

This is what I ended up doing; I used Ubuntu on WSL.

References

https://stackoverflow.com/questions/45228395/error-no-module-named-fcntl
cs01/gdbgui#18
https://stackoverflow.com/questions/1422368/fcntl-substitute-on-windows

UnicodeDecodeError issue with 'MovieStatsFlow' tutorial

After following the 'MovieStatsFlow' tutorial and opening the provided notebook in JupyterLab, I get a UnicodeDecodeError on the cell that fetches the latest successful run.

Exact error: UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 320: ordinal not in range(128)


GDPR and data storage

Hi

How do you deal with GDPR in the internal data stores? They are versioned and stored over time in permanent storage, and some of the data is likely subject to GDPR.

  • Is enabling retention on the S3 buckets safe, or will that delete data that Metaflow actively uses? (This gets complicated once you have a long chain of runs, though.)
  • Can you say something about how you deal with GDPR when using Metaflow + data storage?

Cloudformation template missing some env variables

The template should set the following environment variables for the Metaflow service:

  • MF_USER_IAM_ROLE
  • MF_REGION
  • MF_STS_ENDPOINT

otherwise, the auth endpoint will fail. Currently these are only set when sandbox is set to true; however, they should also be set when a user decides to take the template and run it in their own account.

Issue with tutorial #4

Currently I am using pyenv local for my Python environment

➜  metaflow-tutorials python 04-playlist-plus/playlist.py run
Metaflow 2.0.0 executing PlayListFlow for user:minhtue
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
    Incompatible environment:
    The @conda decorator requires --environment=conda
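
For reference, the documented way to run a flow that uses @conda steps is to pass the environment flag before the command:

python 04-playlist-plus/playlist.py --environment=conda run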

In-memory large dataframe processing

Metaflow tries to make the life of data scientists easier; this sometimes means providing ways to optimize certain common but expensive operations. Processing large dataframes in memory can be difficult and Metaflow could provide ways to do this more efficiently.

A Metaflow slack bot

A Metaflow Slack bot could be used to query the status of currently running runs, inspect the results of past runs, etc. In other words, it is, among other things, a convenient interface to Metaflow's client API.
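
For a sense of what such a bot would wrap, a minimal sketch against the existing client API (the flow name and artifact name are hypothetical):

from metaflow import Flow

# Query the status of the most recent run of a flow.
run = Flow('MyFlow').latest_run
print(run.id,
      'finished' if run.finished else 'still running',
      'successfully' if run.successful else 'unsuccessfully')

# Inspect a result from the latest successful run.
print(Flow('MyFlow').latest_successful_run.data.some_artifact)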

feature request: maintain directory structure for local cache when using metaflow.s3.get_many

metaflow.S3.get_many (and the other get* methods) will download the files to a local cache dir, but they don't maintain the original directory structure.

This is fine when the task needs access to one file at a time (the path can be accessed from the resulting S3Object), but there are use cases where an internal library expects a subdirectory with a specific structure (like shared parquet datasets).
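
Until this is supported natively, one hedged workaround is to rebuild the layout from each S3Object's key, which preserves the original relative path (the s3root, keys, and local directory below are hypothetical):

import os
import shutil

from metaflow import S3

with S3(s3root='s3://my-bucket/datasets/') as s3:
    for obj in s3.get_many(['train/part-0.parquet', 'test/part-0.parquet']):
        # obj.path points at the flat cache file; obj.key keeps the
        # original relative key, so we can mirror the structure locally.
        dest = os.path.join('local_data', obj.key)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy(obj.path, dest)  # copy before the S3 context closes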

Support for the R programming language

Metaflow is currently a Python library. Provide R bindings that would allow a Flow to be written entirely in R and use the Python library as a backend.

GCP Integration

First of all, thank you for open-sourcing this excellent tool!

My team uses GCP, not AWS, so if Metaflow could be integrated with it, that would be great. I'm sure it's on your roadmap, but just putting it out there :)

Flows block the usage of decorators

Usage of custom decorators (for example, for performance tracking of flow steps) is blocked by Metaflow:

class MyFlow(FlowSpec):
    @step
    @track_memory_usage
    @track_time_usage
    def start(self):
        ...
        self.next(self.second_step)


    @step
    @track_memory_usage
    @track_time_usage
    def second_step(self):
        ...

Results in:

2019-12-15 10:06:12.910 [1576400770037993/start/1 (pid 10217)] Elapsed time in <function MyFlow.start at 0x7fd831923510>: 1.057846 s
2019-12-15 10:06:12.910 [1576400770037993/start/1 (pid 10217)] Memory in <function track_time_usage.<locals>.track_time_usage_wrapper at 0x7fd831923598>: 179 -> 241 MB
2019-12-15 10:06:12.916 [1576400770037993/start/1 (pid 10217)] Task finished successfully.
2019-12-15 10:06:12.930 [1576400770037993/split/2 (pid 10229)] Task is starting.
2019-12-15 10:06:17.351 [1576400770037993/split/2 (pid 10229)] <flow MyFlow step second_step> failed:
2019-12-15 10:06:17.351 [1576400770037993/split/2 (pid 10229)] Invalid self.next() transition detected on line 61:
2019-12-15 10:06:17.351 [1576400770037993/split/2 (pid 10229)] Step start specifies a self.next() transition to an unknown step, track_memory_usage_decorator_wrapper.
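
The error suggests Metaflow resolves self.next() transitions via the target function's __name__, so custom decorators that preserve it with functools.wraps may avoid the "unknown step" failure. A hedged sketch of such a decorator (this track_time_usage is my illustration, not the reporter's actual code):

import functools
import time


def track_time_usage(func):
    # functools.wraps copies __name__ (and other metadata) from the
    # wrapped step function, which Metaflow appears to rely on when
    # resolving self.next() transitions.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print('Elapsed time in %s: %f s' % (func.__name__, time.time() - start))
        return result
    return wrapper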

Windows support

Thanks for open-sourcing this library. I was quite excited to take it for a spin, only to get the error "no module named 'fcntl'" and to learn through #10, #23 and #46 that Windows is not supported, and there are no active plans for Windows support.

That is of course fine, but I have a few related questions.

  • I see #10 has a wontfix and #46 has a help wanted label. That raises the question: would you be open to accepting contributions that add Windows support?

  • Do you know what the major technical obstacles to Windows support are?

  • Just now I see on the Installing Metaflow page "Metaflow is available as a Python package for MacOS and Linux." Perhaps if it were followed by a more explicit "Windows is not supported.", fewer people would miss this.

  • Would it be suitable to place this on the roadmap, perhaps stating that there are no Netflix plans but outside contributions are welcome?

Support for LocalDataStore clean-up

It doesn't appear that there is support within the Metaflow framework for cleaning out or purging the .metaflow directory created when running FlowSpecs locally. I imagine such a command might be a useful extension of the CLI, given that data scientists using Metaflow might not always audit their hidden local directories.
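
Until such a command exists, the manual equivalent is simply deleting the hidden directory; a one-line (and destructive) sketch:

import shutil

# WARNING: irreversibly removes all local run metadata and artifacts
# stored under the current working directory.
shutil.rmtree('.metaflow', ignore_errors=True)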

Metaflow 2.0.0 executing HelloFlow Unknown user

I am trying to run metaflow-tutorials on local mac.
After running:

pip install metaflow
metaflow
cd 00-helloworld 
python 00-helloworld/helloworld.py show

It shows the error:

Metaflow could not determine your user name based on environment variables ($USERNAME etc.)

Did I miss a step?
thanks
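
A likely workaround, given that the message says Metaflow derives the user name from environment variables such as $USERNAME: set one explicitly for the invocation, e.g.

USERNAME=$(whoami) python 00-helloworld/helloworld.py show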

Support for stepwise joins

Problem description

Metaflow is unable to handle the following graph; it expects all branches of a split to converge in a single join vertex. [Graph diagram omitted; the code below defines it.]

Code
from metaflow import FlowSpec, step


class SayHelloMetaFlow(FlowSpec):
    @step
    def start(self):
        print('start')
        self.next(self.say, self.hello, self.metaflow)

    @step
    def say(self):
        self.shout = 'say'
        self.next(self.say_hello)

    @step
    def hello(self):
        self.shout = 'hello'
        self.next(self.say_hello)

    @step
    def metaflow(self):
        self.shout = 'metaflow'
        self.next(self.say_hello_metaflow)

    @step
    def say_hello(self, inputs):
        self.shout = f'{inputs.say.shout} {inputs.hello.shout}'
        self.next(self.say_hello_metaflow)

    @step
    def say_hello_metaflow(self, inputs):
        print(inputs.say_hello.shout, inputs.metaflow.shout)
        self.next(self.end)

    @step
    def end(self):
        print('end')


if __name__ == '__main__':
    SayHelloMetaFlow()

Expected output

Metaflow 2.0.0 executing SayHelloMetaFlow for user:...
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
Workflow starting (run-id ...):
[.../start/1 (pid ...)] Task is starting.
[.../start/1 (pid ...)] start
[.../start/1 (pid ...)] Task finished successfully.
[.../say/2 (pid ...)] Task is starting.
[.../hello/3 (pid ...)] Task is starting.
[.../metaflow/4 (pid ...)] Task is starting.
[.../say/2 (pid ...)] Task finished successfully.
[.../hello/3 (pid ...)] Task finished successfully.
[.../say_hello/5 (pid ...)] Task is starting.
[.../say_hello/5 (pid ...)] Task finished successfully.
[.../metaflow/4 (pid ...)] Task finished successfully.
[.../say_hello_metaflow/6 (pid ...)] Task is starting.
[.../say_hello_metaflow/6 (pid ...)] say hello metaflow
[.../say_hello_metaflow/6 (pid ...)] Task finished successfully.
[.../end/7 (pid ...)] Task is starting.
[.../end/7 (pid ...)] end
[.../end/7 (pid ...)] Task finished successfully.
Done!

Actual output

Metaflow 2.0.0 executing SayHelloMetaFlow for user:...
Validating your flow...
    Validity checker found an issue on line 25:
    Step say_hello seems like a join step (it takes an extra input argument) but an incorrect number of steps (hello, say) lead to it. This join was expecting 3 incoming paths, starting from splitted step(s) say, hello, metaflow.

It would be extremely helpful for Metaflow to support DAGs in full.
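
Until then, one hedged workaround under the current semantics is to nest the splits so that every join matches exactly one split; a sketch reusing the step names from the report (the extra say_hello_split step is my addition):

from metaflow import FlowSpec, step


class SayHelloMetaFlow(FlowSpec):
    @step
    def start(self):
        # Outer split: a nested say/hello branch, plus metaflow.
        self.next(self.say_hello_split, self.metaflow)

    @step
    def say_hello_split(self):
        # Inner split: say and hello converge before the outer join.
        self.next(self.say, self.hello)

    @step
    def say(self):
        self.shout = 'say'
        self.next(self.say_hello)

    @step
    def hello(self):
        self.shout = 'hello'
        self.next(self.say_hello)

    @step
    def metaflow(self):
        self.shout = 'metaflow'
        self.next(self.say_hello_metaflow)

    @step
    def say_hello(self, inputs):
        # Joins only the inner say/hello split.
        self.shout = f'{inputs.say.shout} {inputs.hello.shout}'
        self.next(self.say_hello_metaflow)

    @step
    def say_hello_metaflow(self, inputs):
        # Joins the outer split.
        print(inputs.say_hello.shout, inputs.metaflow.shout)
        self.next(self.end)

    @step
    def end(self):
        print('end')


if __name__ == '__main__':
    SayHelloMetaFlow()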

Progress bars for steps (integration of tqdm)

Is there a way to use tqdm inside a step (especially a foreach task)?
I want to have a progress bar for each parallel task.
Currently, I am only able to see progress bars after a task has successfully finished.

[GIF: progress bars only appear after each foreach task completes]

I am not familiar with how logging is handled in Metaflow, but here are some examples from tqdm that could be helpful for making the progress bars work:
https://github.com/tqdm/tqdm/blob/master/examples/parallel_bars.py
https://github.com/tqdm/tqdm/blob/master/examples/redirect_print.py

In general, it looks like messages are only printed once a task has completed.
This makes the use of tqdm pointless.
Is there a way to print to the console immediately?

Here is the source code for the example from the gif:

from metaflow import FlowSpec, step


class HelloFlow(FlowSpec):
    """
    A flow where Metaflow prints 'Hi'.

    Run this flow to validate that Metaflow is installed correctly.

    """
    @step
    def start(self):
        """
        This is the 'start' step. All flows must have a step named 'start' that
        is the first step in the flow.

        """
        print("HelloFlow is starting.")
        self.multi_processing = list(range(4))
        self.next(self.hello, foreach="multi_processing")

    @step
    def hello(self):
        """
        A step with parallel processing that should be monitored with tqdm.

        """

        from tqdm import tqdm
        from time import sleep
        from random import random

        interval = random() * 0.001
        for _ in tqdm(range(10000)):
            sleep(interval)

        self.next(self.join)

    @step
    def join(self, inputs):
        """
        Join our parallel branches and merge results.

        """

        self.next(self.end)

    @step
    def end(self):
        """
        This is the 'end' step. All flows must have an 'end' step, which is the
        last step in the flow.

        """

        print("HelloFlow is all done.")


if __name__ == '__main__':
    HelloFlow()

Prevent versioning of artifacts

I really like the idea and structure of metaflow. For my use case, it looks like it could simultaneously solve a lot of different problems. That said, is there any way to disable versioning and archiving of specific artifacts? If I can guarantee that my upstream data source is versioned and archived appropriately, then I don't necessarily want duplication of all artifacts (because of the storage overhead).

I could just remove certain artifacts after a time, but this would require the cleaning tool to know what should and shouldn't be archived. It would be nicer if there were some syntax to declare an artifact as transient, or at the very least a call we could make at the end of a flow to dispose of artifacts that shouldn't be versioned.

Issue with tutorial #3

Python environment: pyenv local with Python 3.6.8

Metaflow 2.0.0 executing PlayListFlow for user:minhtue

The next version of our playlist generator that uses the statistics
generated from 'Episode 02' to improve the title recommendations.

The flow performs the following steps:

1) Load the genre specific statistics from the MovieStatsFlow.
2) In parallel branches:
- A) Build a playlist from the top grossing films in the requested genre.
- B) Choose a random movie.
3) Join the two to create a movie playlist and display it.

Step start
    Use the Metaflow client to retrieve the latest successful run from our
    MovieStatsFlow and assign them as data artifacts in this flow.
    => bonus_movie, genre_movies

Step bonus_movie
    This step chooses a random title for a different movie genre.
    => join

Step genre_movies
    Select the top performing movies from the user-specified genre.
    => join

Step join
    Join our parallel branches and merge results.
    => end

Step end
    Print out the playlist and bonus movie.

➜  metaflow-tutorials python 03-playlist-redux/playlist.py run
Metaflow 2.0.0 executing PlayListFlow for user:minhtue
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
2019-12-03 16:46:33.660 Workflow starting (run-id 1575420393649410):
2019-12-03 16:46:33.681 [1575420393649410/start/1 (pid 34515)] Task is starting.
2019-12-03 16:46:34.074 [1575420393649410/start/1 (pid 34515)] <flow PlayListFlow step start> failed:
2019-12-03 16:46:34.075 [1575420393649410/start/1 (pid 34515)] Object not found:
2019-12-03 16:46:34.075 [1575420393649410/start/1 (pid 34515)] Using metadata provider: local@/Users/minhtue/workspace/metaflow/metaflow-tutorials
2019-12-03 16:46:34.075 [1575420393649410/start/1 (pid 34515)] Flow('MovieStatsFlow') does not exist
2019-12-03 16:46:34.126 [1575420393649410/start/1 (pid 34515)]
2019-12-03 16:46:34.131 [1575420393649410/start/1 (pid 34515)] Task failed.
2019-12-03 16:46:34.131 Workflow failed.
    Step failure:
    Step start (task-id 1) failed.

Interrupting Workflow Does Not Terminate AWS Batch Job(s)

Steps to Reproduce

  1. Clone the tutorials:
     metaflow tutorials pull
     cd metaflow-tutorials
  2. Configure AWS:
     metaflow configure aws
  3. Start Tutorial Episode 5, "Hello AWS":
     python 05-helloaws/helloaws.py run
  4. As soon as you see the output "Task is starting (status STARTING)...", perform a keyboard interrupt (Ctrl+C) to stop the workflow. Note: because this Hello AWS example runs so quickly, it may be easier to add a time.sleep(10) and interrupt during that delay.

  5. View the AWS Batch job console and notice the job is not terminated.

Non-trivial example

Is there a non-trivial example of a flow where steps are not run directly in the FlowSpec process, but in different Docker containers?

    @step
    def a(self):
        # Step should be processed by a worker running "DockerImageA"
        self.next(self.b)

    @step
    def b(self):
        # Step should be processed by a worker running "DockerImageB"
        self.next(self.end)
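
In the meantime, Metaflow's @batch decorator can run individual steps remotely in per-step container images; a minimal sketch (the image names are hypothetical):

from metaflow import FlowSpec, batch, step


class DockerFlow(FlowSpec):
    @step
    def start(self):
        self.next(self.a)

    # @batch ships just this step to AWS Batch; 'image' selects the
    # container it runs in.
    @batch(image='docker-image-a')
    @step
    def a(self):
        self.next(self.b)

    @batch(image='docker-image-b')
    @step
    def b(self):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == '__main__':
    DockerFlow()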

Mutable default arguments usage in functions and methods

There appear to be several cases of mutable arguments (e.g., dict, list) used as default values in functions or methods.

For example:

  • metaflow/metadata/service.py
  • metaflow/metadata/local.py

This pattern can often yield difficult-to-debug issues.

https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments

pylint metaflow | grep W0102 can reveal the offending locations.

Full list of sites: https://gist.github.com/mpkocher/7d2db19fcde3fc8e728c6143817fc024
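
For context, a minimal illustration of the gotcha (not code from Metaflow itself):

def append_item(item, items=[]):
    # BUG: the default list is created once, at function definition time,
    # so it is shared across every call that relies on the default.
    items.append(item)
    return items

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2]  <- surprising shared state


def append_item_fixed(item, items=None):
    # The conventional fix: use None as a sentinel and allocate per call.
    if items is None:
        items = []
    items.append(item)
    return items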

Using tensorflow.keras models leads to "TypeError: can't pickle _thread._local objects"

I am currently playing around with Metaflow and having problems using it in combination with TensorFlow. I am trying to define, train, and evaluate a model defined with the Keras API in separate steps. The program crashes at the end of the step that defines the model, since Metaflow tries to store the model as an artifact using pickle, which is apparently not supported by TensorFlow models. The error message is "TypeError: can't pickle _thread._local objects".

I do not think this is an issue that can necessarily be fixed in Metaflow, given that pickling is not supported by TensorFlow models in general. However, I was hoping that someone knows a way to use TensorFlow models within Metaflow and could share that knowledge.

If it helps, here is some example code and the traceback produced when running it (this is using tensorflow 2.0.0):

import tensorflow as tf

from metaflow import FlowSpec, step


class ExampleFlow(FlowSpec):
    """Example of a flow using a tensorflow.keras model"""

    @step
    def start(self):
        """Defines a model."""
        self.model = tf.keras.models.Sequential([
                tf.keras.layers.Dense(4, input_shape=(4, ), activation='relu'),
                tf.keras.layers.Dense(1, activation='sigmoid')
        ])
        self.model.compile(
            loss='binary_crossentropy',
            optimizer='adam',
            metrics=['accuracy']
        )
        self.next(self.end)
    
    @step
    def end(self):
        """Uses the model defined in the prior step."""
        self.model.summary()

if __name__ == "__main__":
    ExampleFlow()

Metaflow 2.0.0 executing ExampleFlow for user:mfr
Validating your flow...
The graph looks good!
Running pylint...
Pylint not found, so extra checks are disabled.
2019-12-11 19:28:41.009 Workflow starting (run-id 1576088921002498):
2019-12-11 19:28:41.020 [1576088921002498/start/1 (pid 7847)] Task is starting.
2019-12-11 19:28:43.109 [1576088921002498/start/1 (pid 7847)] 2019-12-11 19:28:43.109000: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-11 19:28:43.136 [1576088921002498/start/1 (pid 7847)] 2019-12-11 19:28:43.135689: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2096165000 Hz
2019-12-11 19:28:43.138 [1576088921002498/start/1 (pid 7847)] 2019-12-11 19:28:43.137785: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5638147fe6d0 executing computations on platform Host. Devices:
2019-12-11 19:28:43.263 [1576088921002498/start/1 (pid 7847)] 2019-12-11 19:28:43.137882: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
2019-12-11 19:28:43.263 [1576088921002498/start/1 (pid 7847)] Internal error
2019-12-11 19:28:43.264 [1576088921002498/start/1 (pid 7847)] Traceback (most recent call last):
2019-12-11 19:28:43.265 [1576088921002498/start/1 (pid 7847)] File "/home/mfr/anaconda3/envs/data-science/lib/python3.7/site-packages/metaflow/cli.py", line 853, in main
2019-12-11 19:28:43.265 [1576088921002498/start/1 (pid 7847)] start(auto_envvar_prefix='METAFLOW', obj=state)
2019-12-11 19:28:43.265 [1576088921002498/start/1 (pid 7847)] File "/home/mfr/anaconda3/envs/data-science/lib/python3.7/site-packages/click/core.py", line 764, in __call__
2019-12-11 19:28:43.265 [1576088921002498/start/1 (pid 7847)] return self.main(*args, **kwargs)
2019-12-11 19:28:43.265 [1576088921002498/start/1 (pid 7847)] File "/home/mfr/anaconda3/envs/data-science/lib/python3.7/site-packages/click/core.py", line 717, in main
2019-12-11 19:28:43.266 [1576088921002498/start/1 (pid 7847)] rv = self.invoke(ctx)
2019-12-11 19:28:43.266 [1576088921002498/start/1 (pid 7847)] File "/home/mfr/anaconda3/envs/data-science/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
2019-12-11 19:28:43.266 [1576088921002498/start/1 (pid 7847)] return _process_result(sub_ctx.command.invoke(sub_ctx))
2019-12-11 19:28:43.266 [1576088921002498/start/1 (pid 7847)] File "/home/mfr/anaconda3/envs/data-science/lib/python3.7/site-packages/click/core.py", line 956, in invoke
2019-12-11 19:28:43.266 [1576088921002498/start/1 (pid 7847)] return ctx.invoke(self.callback, **ctx.params)
2019-12-11 19:28:43.266 [1576088921002498/start/1 (pid 7847)] File "/home/mfr/anaconda3/envs/data-science/lib/python3.7/site-packages/click/core.py", line 555, in invoke
2019-12-11 19:28:43.267 [1576088921002498/start/1 (pid 7847)] return callback(*args, **kwargs)
2019-12-11 19:28:43.749 [1576088921002498/start/1 (pid 7847)] File "/home/mfr/anaconda3/envs/data-science/lib/python3.7/site-packages/click/decorators.py", line 27, in new_func
2019-12-11 19:28:43.750 [1576088921002498/start/1 (pid 7847)] return f(get_current_context().obj, *args, **kwargs)
2019-12-11 19:28:43.750 [1576088921002498/start/1 (pid 7847)] File "/home/mfr/anaconda3/envs/data-science/lib/python3.7/site-packages/metaflow/cli.py", line 430, in step
2019-12-11 19:28:43.750 [1576088921002498/start/1 (pid 7847)] max_user_code_retries)
2019-12-11 19:28:43.750 [1576088921002498/start/1 (pid 7847)] File "/home/mfr/anaconda3/envs/data-science/lib/python3.7/site-packages/metaflow/task.py", line 447, in run_step
2019-12-11 19:28:43.750 [1576088921002498/start/1 (pid 7847)] output.persist(self.flow)
2019-12-11 19:28:43.750 [1576088921002498/start/1 (pid 7847)] File "/home/mfr/anaconda3/envs/data-science/lib/python3.7/site-packages/metaflow/datastore/datastore.py", line 50, in method
2019-12-11 19:28:43.751 [1576088921002498/start/1 (pid 7847)] return f(self, *args, **kwargs)
2019-12-11 19:28:43.751 [1576088921002498/start/1 (pid 7847)] File "/home/mfr/anaconda3/envs/data-science/lib/python3.7/site-packages/metaflow/datastore/datastore.py", line 507, in persist
2019-12-11 19:28:43.751 [1576088921002498/start/1 (pid 7847)] sha, size, encoding = self._save_object(obj, var, force_v4)
2019-12-11 19:28:43.751 [1576088921002498/start/1 (pid 7847)] File "/home/mfr/anaconda3/envs/data-science/lib/python3.7/site-packages/metaflow/datastore/datastore.py", line 431, in _save_object
2019-12-11 19:28:43.751 [1576088921002498/start/1 (pid 7847)] transformable_obj.transform(lambda x: pickle.dumps(x, protocol=2))
2019-12-11 19:28:43.751 [1576088921002498/start/1 (pid 7847)] File "/home/mfr/anaconda3/envs/data-science/lib/python3.7/site-packages/metaflow/datastore/datastore.py", line 68, in transform
2019-12-11 19:28:43.751 [1576088921002498/start/1 (pid 7847)] temp = transformer(self._object)
2019-12-11 19:28:43.752 [1576088921002498/start/1 (pid 7847)] File "/home/mfr/anaconda3/envs/data-science/lib/python3.7/site-packages/metaflow/datastore/datastore.py", line 431, in <lambda>
2019-12-11 19:28:43.752 [1576088921002498/start/1 (pid 7847)] transformable_obj.transform(lambda x: pickle.dumps(x, protocol=2))
2019-12-11 19:28:43.752 [1576088921002498/start/1 (pid 7847)] TypeError: can't pickle _thread._local objects
2019-12-11 19:28:43.752 [1576088921002498/start/1 (pid 7847)]
2019-12-11 19:28:43.754 [1576088921002498/start/1 (pid 7847)] Task failed.
2019-12-11 19:28:43.754 Workflow failed.
Step failure:
Step start (task-id 1) failed.
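
One hedged workaround, assuming the model only needs to cross step boundaries: store the serialized model bytes as the artifact instead of the live Keras object, round-tripping through tf.keras's save/load:

import os
import tempfile

import tensorflow as tf
from metaflow import FlowSpec, step


class ExampleFlow(FlowSpec):
    """Stores serialized model bytes instead of the unpicklable model object."""

    @step
    def start(self):
        model = tf.keras.models.Sequential([
            tf.keras.layers.Dense(4, input_shape=(4,), activation='relu'),
            tf.keras.layers.Dense(1, activation='sigmoid')
        ])
        model.compile(loss='binary_crossentropy', optimizer='adam',
                      metrics=['accuracy'])
        # Serialize to a temp file, then keep the raw bytes as the artifact.
        with tempfile.TemporaryDirectory() as d:
            path = os.path.join(d, 'model.h5')
            model.save(path)
            with open(path, 'rb') as f:
                self.model_bytes = f.read()
        self.next(self.end)

    @step
    def end(self):
        # Rehydrate the model from the stored bytes.
        with tempfile.TemporaryDirectory() as d:
            path = os.path.join(d, 'model.h5')
            with open(path, 'wb') as f:
                f.write(self.model_bytes)
            tf.keras.models.load_model(path).summary()


if __name__ == '__main__':
    ExampleFlow()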

More examples and tutorials for metaflow

Thanks a lot for open-sourcing this great library. Would it be possible to provide more real-world examples of using this tool? It would be really helpful to have a real-world example that goes through a whole data science or machine learning project life cycle: data loading/cleaning, parameter tuning, model deployment, and performance monitoring. Many thanks!

urlparse cannot parse S3 URLs correctly

After setting the S3 bucket, an error occurs:

S3 datastore operation _put_s3_object failed (Parameter validation failed:
Invalid bucket name "": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:s3:[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$"). Retrying 7 more times..

This is due to metaflow/datastore/s3.py, where urlparse cannot figure out the scheme and netloc correctly.

A simple workaround is replacing the following content

try:
    # python2
    from urlparse import urlparse
    import cStringIO
    BytesIO = cStringIO.StringIO
except:
    # python3
    from urllib.parse import urlparse
    import io
    BytesIO = io.BytesIO

with

try:
    # python2
    from urlparse import urlparse as official_urlparse
    import cStringIO
    BytesIO = cStringIO.StringIO
except:
    # python3
    import io
    from urllib.parse import urlparse as official_urlparse
    BytesIO = io.BytesIO

# modified by Kevin
# 07/12/2019
def urlparse(path):
    return official_urlparse(path if path.startswith('s3:') else 's3://' + path)

There should be a more elegant way to fix this issue, though.

Naked exception catching

As I've been looking through the code a bit, I'm running into a lot of "naked" except clauses.

E.g.,

try:
     return json.loads(value)
except:
     self.fail("%s is not a valid JSON object" % value, param, ctx)

There's a difference between except: and except Exception:.

https://docs.python.org/3/library/exceptions.html#exception-hierarchy

The list of potential issues can be obtained by pylint metaflow | grep W0702.

In general, I would humbly suggest addressing some of the low-hanging fruit from pylint, as well as using a formatting tool such as black or autopep8; see the illustration below.
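
For illustration, the practical difference in the quoted snippet (a sketch, not the actual Metaflow code):

import json


def parse_json(value):
    try:
        return json.loads(value)
    except Exception:  # rather than a bare 'except:'
        # Still catches JSONDecodeError/ValueError, but lets
        # KeyboardInterrupt and SystemExit propagate; a bare 'except:'
        # swallows those too, making the process hard to interrupt.
        raise ValueError("%s is not a valid JSON object" % value)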

More documentation about when a task can clone vs. re-run on resume

If I have a flow defined with steps

  • Foo
  • Bar
  • Baz

And I have a run where "Baz" fails...
And I explicitly resume my run at Baz...

...I want to know what causes Foo or Bar to re-run vs. clone, so I can rewrite my code to avoid triggering a re-run.

The current documentation says
https://docs.metaflow.org/metaflow/debugging#resuming-from-an-arbitrary-step

By default, resume resumes from the step that failed, like b above. Sometimes fixing the failed step requires re-execution of some steps that precede it.

That is pretty vague, so the user is left guessing.

Support Flow Visualization

Provide a Metaflow web UI to support flow visualization. Understanding and debugging flows is increasingly important, especially for deep learning. While some important first steps have been made with visualization tools for flows, much more needs to be done to enable data scientists to understand, debug, and tune their flows, and for users to trust the results.

Support HPC & GPU local clusters

Is it possible to use a local HPC or GPU cluster? I understand that Metaflow works perfectly with AWS, but what about when AWS is not an option and other resources are available? Can it be configured to use them?
Thanks,

oriol

Support for hosting artifacts as microservices

Once Metaflow has been used to train a model, it produces artifacts that are typically persisted (for example in S3). A natural extension of this is to provide an easy mechanism to deploy web services that would take these artifacts and serve them in some way so that they can be consumed by downstream applications.

No Module named 'fcntl'

What I did:

  • I installed using pip install metaflow on Windows 10
  • typed metaflow
  • got the error "No module named 'fcntl'"

Support for another public cloud - Microsoft Azure

Currently, Metaflow is set up to work with AWS as the default public cloud. The architecture of Metaflow allows for additional public clouds to be supported.

Adding support for Microsoft Azure might broaden the potential user base, which could increase the adoption rate. This, in turn, could lead to increased community attention.

Support for AWS Step Functions

Metaflow on AWS currently requires a human in the loop to execute and cannot be scheduled automatically. Metaflow could be made to work with AWS Step Functions so that the orchestration of Metaflow steps is handled by AWS.

Support for Slurm?

All,

I'm working on setting up a new DSS-8440 and am evaluating different management options. It appears that Slurm is best for job scheduling. Does metaflow support or have any integration with Slurm? Alternatively, are there any tips for handling machines like this?

Thanks!

can't create tasks in jobqueue

Metaflow 2.0.1 executing DataSelectionFlow for user:neuron
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
2019-12-17 18:24:58.096 Workflow starting (run-id 21):
2019-12-17 18:24:58.831 [21/start/39 (pid 21279)] Task is starting.
2019-12-17 18:24:59.709 [21/start/39 (pid 21279)] INFO:botocore.credentials:Found credentials in shared credentials file: ~/.aws/credentials
2019-12-17 18:25:00.013 [21/start/39 (pid 21279)] An error occurred (ClientException) when calling the SubmitJob operation: JobQueue arn:aws:batch:eu-west-1:<accountid>:job-queue/job-queue-<local username>-metaflow-test not found.
2019-12-17 18:25:00.723 [21/start/39 (pid 21279)] Task failed.
2019-12-17 18:25:00.723 This failed task will not be retried.
    Internal error:
    The end step was not successful by the end of flow.

At the same time this works just fine:

import os
import json
import boto3
client = boto3.client("batch")
queue = json.load(open(os.path.expanduser("~/.metaflowconfig/config.json"), "rb"))["METAFLOW_BATCH_JOB_QUEUE"]
client.list_jobs(jobQueue=queue)

Results in 'HTTPStatusCode': 200...

What am I doing wrong here?

My anonymized config:

{
    "METAFLOW_BATCH_JOB_QUEUE": "arn:aws:batch:eu-west-1:<account id>:job-queue/job-queue-<aws username>-metaflow-test",
    "METAFLOW_DATASTORE_SYSROOT_S3": "s3://<aws username>-metaflow-test-metaflows3bucket-<bucket identifier>",
    "METAFLOW_DATATOOLS_SYSROOT_S3": "s3://<aws username>-metaflow-test-metaflows3bucket-<bucket identifier>/data",
    "METAFLOW_DEFAULT_DATASTORE": "s3",
    "METAFLOW_DEFAULT_METADATA": "service",
    "METAFLOW_ECS_S3_ACCESS_IAM_ROLE": "arn:aws:iam::<account id>:role/<aws username>-metaflow-test-BatchS3TaskRole-<random identifier>",
    "METAFLOW_SERVICE_INTERNAL_URL": "https://<random identifier>.execute-api.eu-west-1.amazonaws.com/api/",
    "METAFLOW_SERVICE_URL": "https://<random identifier>.execute-api.eu-west-1.amazonaws.com/api/"
}

Still have METAFLOW_SERVICE_URL missing after configuration

I followed the guide to set up Metaflow on AWS, but METAFLOW_SERVICE_URL is not part of the configuration flow. When I check ~/.metaflowconfig/config.json, it only has METADATA_SERVICE_URL. It seems the configuration step sets the variable METADATA_SERVICE_URL but not METAFLOW_SERVICE_URL.

Have you setup your AWS credentials? [y/N]: y

Do you want to use AWS S3 as your datastore? [Y/n]: Y
	AWS S3
	Please enter the bucket prefix to use for your flows: metaflow3-metaflows3bucket-pxxxe
	Please enter the bucket prefix to use for your data [metaflow3-metaflows3bucket-pxxxe/data]:

Do you want to use AWS Batch for compute? [Y/n]: y

	AWS Batch
	Please enter the job queue to use for batch: arn:aws:batch:us-west-2:<account_id>:job-queue/job-queue-metaflow3
	Please enter the default container image to use:
	Please enter the default container image to use: continuumio/anaconda
	Please enter the container registry: <account_id>.dkr.ecr.us-west-2.amazonaws.com/metaflow
	Please enter the IAM role to use for the container to get AWS S3 access: arn:aws:iam::<account_id>:role/metaflow3-BatchS3TaskRole-1IXNDQD2ND1AL

Do you want to use a (remote) metadata service? [Y/n]: y
	Metadata service
	Please enter the URL for your metadata service: https://tgp9.execute-api.us-west-2.amazonaws.com/api/

Do you want to use conda for dependency management? [Y/n]: Y
	Conda on AWS S3
	Please enter the bucket prefix for storing conda packages [metaflow3-metaflows3bucket-pxxxe/conda]:

➜  metaflow-tutorials python3 00-helloworld/helloworld.py run
Metaflow 2.0.0 executing HelloFlow for user:shjiaxin
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
    Flow failed:
    Missing Metaflow Service URL. Specify with METAFLOW_SERVICE_URL environment variable

Error on running metaflow on Mac

(py3) [temp]> python -V
Python 3.5.2

(py3) [temp]> metaflow
Traceback (most recent call last):
  File "/Users/.../venv/py3/bin/metaflow", line 5, in <module>
    from metaflow.main_cli import main
  File "/Users/.../venv/py3/lib/python3.5/site-packages/metaflow/main_cli.py", line 243, in <module>
    @click.argument('episode', autocompletion=autocomplete_episodes)
  File "/Users/.../venv/py3/lib/python3.5/site-packages/click/decorators.py", line 151, in decorator
    _param_memo(f, ArgumentClass(param_decls, **attrs))
  File "/Users/.../venv/py3/lib/python3.5/site-packages/click/core.py", line 1699, in __init__
    Parameter.__init__(self, param_decls, required=required, **attrs)
TypeError: __init__() got an unexpected keyword argument 'autocompletion'
