
fairing's Introduction


Kubeflow is the cloud-native platform for machine learning operations: pipelines, training, and deployment.


Documentation

Please refer to the official docs at kubeflow.org.

Working Groups

The Kubeflow community is organized into working groups (WGs), each with associated repositories, that focus on specific pieces of the ML platform.

Quick Links

Get Involved

Please refer to the Community page.

fairing's People

Contributors

aachunella, abcdefgs0324, abhi-g, berndverst, dalfos, fenglixa, hamedhsn, iancoffey, jeffwan, jer0nim0, jinchihe, jlewi, joeliedtke, karthikv2k, lluunn, minkyu-choi, mochiliu3000, nrchakradhar, pshiko, pugangxa, qxiaoq, r2d4, rpasricha, shikha130vv, takmatsu, vjrantal, wbuchwalter, wzhanw, xauthulei, zoyun


fairing's Issues

Import error when file named `notebook.py`

b'  File "/app/notebook.py", line 39, in <module>'
b'    import fairing'
b'  File "/home/jovyan/.local/lib/python3.6/site-packages/fairing/__init__.py", line 8, in <module>'
b'    from fairing.config import config'
b'  File "/home/jovyan/.local/lib/python3.6/site-packages/fairing/config.py", line 11, in <module>'
b'    from fairing import builders'
b'  File "/home/jovyan/.local/lib/python3.6/site-packages/fairing/builders/__init__.py", line 10, in <module>'
b'    from .docker_builder import DockerBuilder'
b'  File "/home/jovyan/.local/lib/python3.6/site-packages/fairing/builders/docker_builder.py", line 20, in <module>'
b'    from .dockerfile import write_dockerfile'
b'  File "/home/jovyan/.local/lib/python3.6/site-packages/fairing/builders/dockerfile.py", line 15, in <module>'
b'    from fairing.notebook_helper import get_notebook_name, is_in_notebook'
b'  File "/home/jovyan/.local/lib/python3.6/site-packages/fairing/notebook_helper.py", line 13, in <module>'
b'    import nbconvert'
b'  File "/opt/conda/lib/python3.6/site-packages/nbconvert/__init__.py", line 6, in <module>'
b'    from . import preprocessors'
b'  File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/__init__.py", line 7, in <module>'
b'    from .csshtmlheader import CSSHTMLHeaderPreprocessor'
b'  File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/csshtmlheader.py", line 17, in <module>'
b'    from notebook import DEFAULT_STATIC_FILES_PATH'
b'  File "/app/notebook.py", line 46, in <module>'
b'    fairing.config.set_builder(builders.AppendBuilder('
b"AttributeError: module 'fairing' has no attribute 'config'"

We should assign the generated notebook file a random name (possibly based on a reproducible, fast hash like md5).
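A minimal sketch of such a naming scheme (the helper name and prefix are illustrative, not fairing's actual code): derive a deterministic module name from an md5 digest of the notebook path, so the generated file can never shadow an installed package like `notebook`.

```python
import hashlib
import os

def generated_module_name(notebook_path):
    """Return a module name for the converted notebook that cannot
    collide with installed packages such as `notebook`.
    Hypothetical helper, not fairing's real implementation."""
    digest = hashlib.md5(notebook_path.encode("utf-8")).hexdigest()[:12]
    base = os.path.splitext(os.path.basename(notebook_path))[0]
    return "fairing_{0}_{1}".format(base, digest)

name = generated_module_name("notebook.ipynb")
```

Because md5 is fast and reproducible, re-running the same notebook yields the same module name, which keeps image layers cacheable.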

/kind bug

Figure out how to monitor TFJobs launched by fairing

We need to figure out how we want to monitor TFJobs launched from a notebook using fairing.

Right now we just tail the logs of the pods. Does this prevent execution of other cells in the notebook?

Should we just provide links to dashboards? Should these links be configurable based on the user's deployment? E.g. on GKE we would provide links to Stackdriver and the GKE workloads dashboard.

Should we use the TFJobs dashboard?

Setup a basic CI/CD system

Let's use the basic tier of Travis CI to get unit tests and a simple integration test running with the local Docker builder.

Support batch prediction

We'd like fairing to support batch prediction.

See also #38 support deploying models.

We'd like to be able to fire off a batch-predict job from a notebook using fairing.

As with online prediction (#38), there are likely two cases:

  1. TF model
  2. Non TF model

In case 1 we can probably use our existing Beam transform for doing batch prediction with a saved model.

In case 2 the user probably needs to write a batch_predict method that can then be invoked.

We should consider whether to use Beam to parallelize the computation, or maybe just fire off a bunch of K8s jobs.

Avoid pinning exact versions of dependencies

Currently all of the PyPI dependencies are pinned to exact versions ("=="). Typically, dependencies are specified with minimum versions (">=") to avoid conflicts with other packages in the environment.
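A sketch of the proposed change in a requirements file (the package names and version numbers here are illustrative, not fairing's actual dependency list):

```
# requirements.txt: pin a floor, not an exact version
kubernetes>=8.0.0   # instead of kubernetes==8.0.0
docker>=3.4.1       # instead of docker==3.4.1
```

A ">=" floor lets pip resolve a version that satisfies both fairing and whatever else is installed in the user's notebook environment.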

Use a builder that does not require creating an image

For teams with less experience in setting up a secure and scalable image registry and promotion pipeline, we would like a mode/option/builder in fairing that allows running the notebook code in distributed TFJobs without building additional images.

How would the notebook code get distributed?
There are multiple solutions:

  1. Saving it to a ConfigMap, as #21 is doing.
  2. Saving it to persistent storage (perhaps set up by an admission controller).
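Option 1 can be sketched as building a standard ConfigMap manifest that carries the converted notebook code, which the TFJob pods then mount as a file (the manifest shape is standard Kubernetes; the function and resource names are illustrative):

```python
def notebook_configmap(name, code):
    """Build a ConfigMap manifest that ships the converted notebook
    code to the training pods, so no new image is needed.
    Sketch only; not fairing's actual implementation."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": {"notebook.py": code},
    }

manifest = notebook_configmap("fairing-code", "print('hello')")
```

The TFJob spec would then mount this ConfigMap at the path the entrypoint expects; note that ConfigMaps are capped at roughly 1 MiB, which is why option 2 (persistent storage) may be needed for larger codebases.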

Builder should return just the image

The builder should return just the image instead of a PodSpec. The container entrypoint and environment variables should be baked into the container manifest.

/area fairing
/priority p1

Support for tensor2tensor

It would be awesome if Fairing were compatible with tensor2tensor and if so to put together examples illustrating this.

Toward understanding compatibility: the default way of training with t2t is to run the main function in t2t_trainer.py, simplified below with logging and parts that don't seem relevant to compatibility omitted.

def main(argv):
  usr_dir.import_usr_dir(FLAGS.t2t_usr_dir)

  # Create HParams.
  if argv:
    set_hparams_from_args(argv[1:])
  hparams = create_hparams()

  if FLAGS.schedule == "run_std_server":
    run_std_server()
  trainer_lib.set_random_seed(FLAGS.random_seed)

  exp_fn = create_experiment_fn()
  exp = exp_fn(create_run_config(hparams), hparams)
  if is_chief():
    save_metadata(hparams)
  execute_schedule(exp)

Using parsed TensorFlow FLAGS attributes, this function first loads a user's registry of problem, model, and hparam-set definitions, calls run_std_server if the worker is a PS, and then constructs and executes an experiment function (which, as you can see, pulls a lot of info from FLAGS to parameterize the experiment, but could be called directly; see https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/trainer_lib.py#L650).

I don't understand the design of Fairing in detail, so I don't yet know whether there is more complexity to consider than just preparing an object that trains a model when obj.train() is called. If it were that simple, you could just have a decorator that accepts the multitude of flags typically passed to t2t_trainer and passes these to a workflow similar to the one shown above (wherein an experiment function is created and executed).

Perhaps another approach to consider would be decorating sub-classes of T2TExperiment? Which brings up the distinction between a single model and an experiment.

Maybe one of you can help clarify this! @wbuchwalter @jlewi @r2d4

JupyterHub support

JupyterHub works differently from single-notebook servers, causing various things to break, such as authentication.

It seems Kubeflow will be dropping JupyterHub support (kubeflow/kubeflow#1769), but we might still want to support it as some users may still want to use it.

Add TensorBoard support

It should be easy for a user to specify that they want a TensorBoard instance started alongside their jobs, e.g.:

@Training(..., tensorboard=True)

This supposes telling the user where to write the tfevents files so that they can be mounted on an existing multi-writer PVC (to support hyperparameter search later on).
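One way the tensorboard=True option could communicate the log directory to user code is via an environment variable, sketched below (the decorator name, flag, and variable name are all hypothetical, not fairing's actual API):

```python
import functools
import os

def training(tensorboard=False, log_dir="/var/log/tfevents"):
    """Hypothetical sketch of a tensorboard=True option: expose the
    tfevents directory to the wrapped training function through an
    environment variable, so a TensorBoard sidecar can mount the same
    PVC path. Not fairing's real decorator."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if tensorboard:
                # Training code reads TF_LOG_DIR to decide where to
                # write summaries; TensorBoard watches the same path.
                os.environ["TF_LOG_DIR"] = log_dir
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@training(tensorboard=True)
def train():
    return os.environ.get("TF_LOG_DIR")
```

The key design question remains where log_dir points: it must be writable by every worker and readable by the TensorBoard instance, hence the multi-writer PVC above.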

Admin Config

An admin should be able to create a ConfigMap that would then be pulled by fairing and act as a global configuration.
This ConfigMap could contain settings such as:

  • Which builders are allowed
  • Which backend is allowed (Kubeflow vs. plain K8s)
  • A restriction on the maximum number of parallel runs
  • etc.

Remove Metaparticle

Metaparticle isn't maintained anymore, so we should probably just use the Kubernetes client directly instead.

Chain preprocessors

It might be interesting to chain multiple pre-processors together

e.g.

ConvertNotebookPreprocessor ->
FunctionPreProcessor ->
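The chain above could be sketched as a composite that feeds each stage's output files into the next stage (class and method names are illustrative, not fairing's actual preprocessor interface):

```python
class ChainedPreprocessor:
    """Run preprocessors in sequence: each stage receives the file list
    produced by the previous stage. Sketch only."""
    def __init__(self, *preprocessors):
        self.preprocessors = preprocessors

    def preprocess(self, files):
        for p in self.preprocessors:
            files = p.preprocess(files)
        return files

class ConvertNotebookPreprocessor:
    def preprocess(self, files):
        # Stand-in: convert each .ipynb to a .py file.
        return [f.replace(".ipynb", ".py") for f in files]

class FunctionPreProcessor:
    def preprocess(self, files):
        # Stand-in: wrap each module with a function entrypoint.
        return ["wrapped_" + f for f in files]

chain = ChainedPreprocessor(ConvertNotebookPreprocessor(),
                            FunctionPreProcessor())
result = chain.preprocess(["train.ipynb"])
```

Because every stage shares the same preprocess(files) shape, the chain itself satisfies the preprocessor interface and could be nested further.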

Add support for serializing current state of a function in a notebook along with its deps

When users are working in a notebook, they would like to submit a function in its current state to a remote cluster for training. This falls under the category of using notebooks as a scratch pad, so executing the notebook / corresponding Python code linearly doesn't make sense for these users. We can use cloudpickle (https://github.com/cloudpipe/cloudpickle) to serialize the functions/classes in the notebook. Cloudpickle was developed in the PySpark project, and its prominent users include PySpark and Ray.

Initially we can restrict serialization support to functions and global variables of standard types, like Python primitives and numpy arrays / pandas dataframes. Then, based on feedback and need, we can extend support to things like TF graphs. For example, Ray lets users add custom serialization support for complex objects.
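The reason cloudpickle is needed rather than the standard library's pickle: plain pickle serializes functions by reference (module name plus qualified name), so objects defined interactively in a notebook typically cannot round-trip. The sketch below only demonstrates the stdlib limitation; cloudpickle works because it serializes the function body itself.

```python
import pickle

# A lambda stands in for any interactively defined function: its
# qualified name ("<lambda>") cannot be looked up by the unpickler,
# so plain pickle refuses to serialize it.
f = lambda x: x + 1

try:
    pickle.dumps(f)
    serializable_with_stdlib = True
except (pickle.PicklingError, AttributeError, TypeError):
    serializable_with_stdlib = False
```

cloudpickle.dumps(f) would succeed here, capturing the code object and closure, which is exactly the property notebook submission needs.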

No module named 'fairing.training'

Getting this when triggering a job:
'Traceback (most recent call last):'
'  File "/app/main.py", line 31, in <module>'
'    from fairing.training import kubeflow'
"ModuleNotFoundError: No module named 'fairing.training'"

Seems like the new version has not been published yet (fairing==0.0.3).

fairing should support deploying models

We'd like to support deploying models using fairing.

A strawman goal would be to be able to walk through MNIST end-to-end entirely in a notebook:

e.g. train the model, deploy the model, and send predictions to it.

Design Document

Write a design doc explaining fairing mechanisms at a high-level.

Support Python2

A sizeable chunk of data scientists are still using Python 2, so we should make fairing compatible with it sooner rather than later.

Running in ipython kernel gives an error

commit hash: 27357f8

:~/fairing$ ipython
Python 3.5.3 (default, Sep 27 2018, 17:25:39) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.3.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import fairing                                                                                         
---------------------------------------------------------------------------
MultipleInstanceError                     Traceback (most recent call last)
<ipython-input-1-011d40e952a6> in <module>
----> 1 import fairing

~/fairing/fairing/__init__.py in <module>
      4     from fairing.runtime_config import config
      5 else:
----> 6     from fairing.config import config
      7 
      8 name = "fairing"

~/fairing/fairing/config.py in <module>
     98         return ret_fn
     99 
--> 100 config = Config()

~/fairing/fairing/config.py in __init__(self)
     43 class Config(object):
     44     def __init__(self):
---> 45         self.reset()
     46 
     47     def reset(self):

~/fairing/fairing/config.py in reset(self)
     46 
     47     def reset(self):
---> 48         if notebook_util.is_in_notebook():
     49             self._preprocessor_name = 'notebook'
     50         else:

~/fairing/fairing/notebook/notebook_util.py in is_in_notebook()
     27 def is_in_notebook():
     28     try:
---> 29         ipykernel.get_connection_info()
     30     except RuntimeError:
     31         return False

~/fairing/venv/lib/python3.5/site-packages/ipykernel/connect.py in get_connection_info(connection_file, unpack, profile)
    126     depending on `unpack`.
    127     """
--> 128     cf = _find_connection_file(connection_file, profile)
    129 
    130     with open(cf) as f:

~/fairing/venv/lib/python3.5/site-packages/ipykernel/connect.py in _find_connection_file(connection_file, profile)
     91     if connection_file is None:
     92         # get connection file from current kernel
---> 93         return get_connection_file()
     94     else:
     95         # connection file specified, allow shortnames:

~/fairing/venv/lib/python3.5/site-packages/ipykernel/connect.py in get_connection_file(app)
     34             raise RuntimeError("app not specified, and not in a running Kernel")
     35 
---> 36         app = IPKernelApp.instance()
     37     return filefind(app.connection_file, ['.', app.connection_dir])
     38 

~/fairing/venv/lib/python3.5/site-packages/traitlets/config/configurable.py in instance(cls, *args, **kwargs)
    421             raise MultipleInstanceError(
    422                 'Multiple incompatible subclass instances of '
--> 423                 '%s are being created.' % cls.__name__
    424             )
    425 

MultipleInstanceError: Multiple incompatible subclass instances of IPKernelApp are being created.



Run notebook locally launch on K8s cluster

I think one of the biggest use cases for fairing will be running a notebook locally but firing off jobs on a cluster.

Does this already work? I assume it's just a matter of configuring fairing to communicate properly with the K8s cluster.

Let's assume the user isn't running in a container; i.e. they just ran Jupyter on their local machine. Can we get the K8s config from their kubeconfig file?

We can close this issue once we have instructions for how to make this work.
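For reference, the lookup a client performs against a kubeconfig file can be sketched in pure Python (the real kubernetes Python client does this via config.load_kube_config(); the sample data below is illustrative):

```python
def api_server_for_current_context(kubeconfig):
    """Resolve the API server URL from a kubeconfig-shaped mapping:
    current-context -> context -> cluster -> server. Sketch only."""
    context_name = kubeconfig["current-context"]
    context = next(c for c in kubeconfig["contexts"]
                   if c["name"] == context_name)
    cluster_name = context["context"]["cluster"]
    cluster = next(c for c in kubeconfig["clusters"]
                   if c["name"] == cluster_name)
    return cluster["cluster"]["server"]

# Minimal kubeconfig-shaped sample (hypothetical cluster).
sample = {
    "current-context": "dev",
    "contexts": [{"name": "dev", "context": {"cluster": "dev-cluster"}}],
    "clusters": [{"name": "dev-cluster",
                  "cluster": {"server": "https://1.2.3.4"}}],
}
server = api_server_for_current_context(sample)
```

So as long as fairing (or the client library it wraps) reads ~/.kube/config when it is not running in-cluster, the local-notebook case should need no extra configuration.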

import fairing fails for gcr.io/kubeflow-images-public/fairing:v0.0.1

I tried spinning up a notebook in Kubeflow 0.3 using image gcr.io/kubeflow-images-public/fairing:v0.0.1

image: gcr.io/kubeflow-images-public/fairing:v0.0.1
imageID: docker-pullable://gcr.io/kubeflow-images-public/fairing@sha256:3cfffe528819a307ebe845b22b4eb9bc0f18f743c1ab19b3a8c0c2f88ab78f34

In the notebook when I try to import fairing I get the following error

ImportErrorTraceback (most recent call last)
<ipython-input-4-011d40e952a6> in <module>()
----> 1 import fairing

ImportError: No module named fairing

But if I start a pod running that image

kubectl run -it jlewi-fairing --restart=Never --image=gcr.io/kubeflow-images-public/fairing:v0.0.1 --command /bin/bash

Import works just fine.

Seems to be different from #31.

training mode/config should be set via functions not class decorators

Currently the class decorator determines the training mode (e.g. local vs. distributed).

@kubeflow.DistributedTraining(worker_count=3, ps_count=1, namespace='kubeflow')
class MyModel(object):
    def train(self):
         ....

The downside of this approach is that the user has to change the class annotation in order to change the mode of training. I think it would be preferable to change how it runs just with parameters to the train function, e.g.:

config = LocalConfig()
myModel.train(config)

config = DistributedConfig(....)
myModel.train(config)

Update examples

The examples are out of date with respect to the new API.

/kind bug
/kind documentation

Create issues to get an initial POC out

It would be great to get an initial POC out as part of 0.3.

What issues need to be resolved in order to get to an initial POC?

Admittedly this doesn't leave much time since we are trying to cut 0.3 this week, although we could always cherry-pick it into minor releases.

Let's create specific issues for all the work that needs to be addressed.

Job log ends with rpc error

I ran the demo in #32.

The logs from the job end with an rpc error

b'At step 1900, loss = 0.1070864349603653'
b'rpc error: code = Unknown desc = Error: No such container: 6a5c382f4a57c40c484d3cd0a7643b58c2219fe5ecaaed66d4e32a1e70012e74'

I believe this error is to be expected because the container will be deleted when the job finishes and if we are tailing the logs the tail will end with an error.

Assuming that's accurate, this behavior will undoubtedly confuse some people.

So we should try to avoid surfacing that error if possible.

Fire off K8s job to train python models

fairing should support firing off K8s jobs to train Python models (e.g. XGBoost / scikit-learn models).

Need to figure out

  1. What the syntax is
  2. How to move data back and forth

Add support for managed backends

In addition to supporting Kubeflow and native Kubernetes job submissions, we should provide the option for users to swap out their deployment for a managed service as well.

Support TF models online predictions with TF Serving

With fairing we'd like to be able to predict locally within the notebook.

For example

model=Model()
model.predict(...)

We need to figure out the exact syntax and model signatures.

For example, do we use tf.Example protos?

Train TF locally within notebook

Related to #39 (don't use decorators).

We'd like the user to be able to train the model locally within the notebook. Right now this is possible, but it requires setting appropriate function decorators.

We'd like this to be more natural e.g.

model = MyModel()
model.train()

Pypi Organization

Figure out how to allow owners to publish new releases of the package on PyPI.

Notebook narratives (question/feature)

Putting together narratives about lines of experimentation involving a series of jobs, loss plots, etc. would be really valuable for both communication as well as personal note-taking (e.g. https://github.com/cwbeitel/tk/blob/master/docs/demo.ipynb). @wbuchwalter shows an example of launching a TFJob from Jupyter notebook here: https://github.com/wbuchwalter/fairing/blob/master/examples/kubeflow-jupyter-notebook/TfJob.ipynb

Does this design require instantiating the model in the notebook each time a new job is run, even if the model does not change? E.g. when hparams are being tuned. This isn't necessarily bad - more verbose but also more clear.

What about running jobs with well-established models such as ones from tensor2tensor? In this case (and perhaps this answers the above question) perhaps people could just instantiate a trivial subclass of one of these models, e.g.

@Train(repository='<your-repository>',
       base_image='tensorflow/tensorflow:latest-py3',
       architecture=BasicArchitecture())
class MyModel(tensor2tensor.models.ImageTransformer2D):
  pass

Dependency Management

We need a better way to handle dependency management.
Ideally we would like to ensure that all Python packages installed locally will also be installed in the target environment. This would solve two issues:

  • Ensure that all the needed dependencies are available (e.g. some people might install additional deps by doing something like !pip install somelib in their notebook)
  • Greatly simplify the development workflow, as it would remove the need to build a custom image on every change to fairing (or to release on Test PyPI).

cc @r2d4
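One way to snapshot the local environment for replication in the target image could look like the following (illustrative sketch, not fairing's implementation; importlib.metadata requires Python 3.8+):

```python
import importlib.metadata

def local_requirements():
    """Snapshot the locally installed distributions as pinned
    requirement strings, so the builder can install the same versions
    in the target image. Sketch only."""
    return sorted(
        "{0}=={1}".format(d.metadata["Name"], d.version)
        for d in importlib.metadata.distributions()
        if d.metadata["Name"]
    )

reqs = local_requirements()
```

The builder would write these lines to a requirements.txt inside the generated Dockerfile, closing the gap between !pip install in the notebook and the image's contents.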

Job errors; fairing not stripping out code

I took the demo notebook in #32 and added a cell with the following commands

# Should be unnecessary if using an image with the credential helpers
!gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}
!gcloud auth configure-docker --quiet

When I submitted the model for training with fairing the output was

Running...
Uploading gcr.io/code-search-demo/fairing-job:4b1fa953164a8dcfeca28cecf0e3e6c8da4eda8eff872d0bccbfecc3ec6c948a
Pushed image gcr.io/code-search-demo/fairing-job:4b1fa953164a8dcfeca28cecf0e3e6c8da4eda8eff872d0bccbfecc3ec6c948a
Training(s) launched.
Waiting for job to start...

b'Traceback (most recent call last):'
b'  File "/app/demo.py", line 32, in <module>'
b"    get_ipython().system('gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}')"
b"NameError: name 'get_ipython' is not defined"
b'rpc error: code = Unknown desc = Error: No such container: 68b744c52794350a08b859877c624d3c6c8952d0bea71c3cf6ea59ea75d7b875'

It looks like the problem might be due to the fact that the shell commands aren't being stripped out as part of converting the notebook to a python file.

Add Support for Hyperparameters Search

I should be able to do something akin to:

@HPTuning(runs=10, ...)

And it should start multiple jobs with different HP combination so I can quickly find the best one.

You can see the previously existing implementation of this here: https://github.com/wbuchwalter/fairing/blob/master/examples/hyperparameter-tuning/main.py.

This implementation is not perfect, however, mainly because the hyperparameters were computed at runtime rather than at deployment time; this is problematic because it doesn't guarantee coverage of the hyperparameter search space.

Consider the following case:
I want to try 3 different values for momentum: [0.1, 0.5, 1].
Ideally we want to start 3 jobs, each with a different value for momentum. However, because we are selecting at runtime, we have to choose randomly among the possibilities, which in the worst case means that all jobs run with the same momentum in parallel.

A better approach would probably be to ask the user to provide the search space as an argument to the decorator, e.g.:

@HPTuning(runs=10, hp={
    'momentum': [0.1, 0.5, 1],
    'learning_rate': log_uniform(0.001, 0.5)
})

And pre-compute the values to ensure that each run has a unique combination, exporting them as environment variables that are picked up at runtime.
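The pre-computation step can be sketched as enumerating the full grid at deploy time and then drawing runs without replacement, which guarantees no duplicate combinations (function names are illustrative; only discrete value lists are handled here, not distributions like log_uniform):

```python
import itertools
import random

def precompute_runs(hp_space, runs):
    """Enumerate every combination of the discrete search space at
    deploy time, then draw `runs` distinct combinations.
    Sketch only; not fairing's actual implementation."""
    grid = [dict(zip(hp_space, values))
            for values in itertools.product(*hp_space.values())]
    random.shuffle(grid)
    return grid[:runs]

space = {"momentum": [0.1, 0.5, 1],
         "learning_rate": [0.001, 0.01]}
assignments = precompute_runs(space, runs=3)
```

Each dict in assignments would then be exported as environment variables for one job, so the three momentum values above are guaranteed to be covered rather than sampled independently by each worker.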
