
fairing's Introduction


Kubeflow is the cloud-native platform for machine learning operations: pipelines, training, and deployment.


Documentation

Please refer to the official docs at kubeflow.org.

Working Groups

The Kubeflow community is organized into working groups (WGs), each with associated repositories, that focus on specific pieces of the ML platform.

Quick Links

Get Involved

Please refer to the Community page.

fairing's People

Contributors

aachunella, abcdefgs0324, abhi-g, berndverst, dalfos, fenglixa, hamedhsn, iancoffey, jeffwan, jer0nim0, jinchihe, jlewi, joeliedtke, karthikv2k, lluunn, minkyu-choi, mochiliu3000, nrchakradhar, pshiko, pugangxa, qxiaoq, r2d4, rpasricha, shikha130vv, takmatsu, vjrantal, wbuchwalter, wzhanw, xauthulei, zoyun


fairing's Issues

Import error when file named `notebook.py`

b'  File "/app/notebook.py", line 39, in <module>'
b'    import fairing'
b'  File "/home/jovyan/.local/lib/python3.6/site-packages/fairing/__init__.py", line 8, in <module>'
b'    from fairing.config import config'
b'  File "/home/jovyan/.local/lib/python3.6/site-packages/fairing/config.py", line 11, in <module>'
b'    from fairing import builders'
b'  File "/home/jovyan/.local/lib/python3.6/site-packages/fairing/builders/__init__.py", line 10, in <module>'
b'    from .docker_builder import DockerBuilder'
b'  File "/home/jovyan/.local/lib/python3.6/site-packages/fairing/builders/docker_builder.py", line 20, in <module>'
b'    from .dockerfile import write_dockerfile'
b'  File "/home/jovyan/.local/lib/python3.6/site-packages/fairing/builders/dockerfile.py", line 15, in <module>'
b'    from fairing.notebook_helper import get_notebook_name, is_in_notebook'
b'  File "/home/jovyan/.local/lib/python3.6/site-packages/fairing/notebook_helper.py", line 13, in <module>'
b'    import nbconvert'
b'  File "/opt/conda/lib/python3.6/site-packages/nbconvert/__init__.py", line 6, in <module>'
b'    from . import preprocessors'
b'  File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/__init__.py", line 7, in <module>'
b'    from .csshtmlheader import CSSHTMLHeaderPreprocessor'
b'  File "/opt/conda/lib/python3.6/site-packages/nbconvert/preprocessors/csshtmlheader.py", line 17, in <module>'
b'    from notebook import DEFAULT_STATIC_FILES_PATH'
b'  File "/app/notebook.py", line 46, in <module>'
b'    fairing.config.set_builder(builders.AppendBuilder('
b"AttributeError: module 'fairing' has no attribute 'config'"

We should assign the generated notebook file a random name (possibly based on a reproducible, fast hash like md5).
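A minimal sketch of such a naming scheme (the helper name and prefix are illustrative, not fairing's actual code): derive a deterministic module name from an md5 digest of the notebook path, so the generated file can never shadow an installed package like `notebook`.

```python
import hashlib
import os

def generated_module_name(notebook_path):
    """Return a module name for the converted notebook that cannot
    collide with installed packages such as `notebook`.
    Hypothetical helper, not fairing's real implementation."""
    digest = hashlib.md5(notebook_path.encode("utf-8")).hexdigest()[:12]
    base = os.path.splitext(os.path.basename(notebook_path))[0]
    return "fairing_{0}_{1}".format(base, digest)

name = generated_module_name("notebook.ipynb")
```

Because md5 is fast and reproducible, re-running the same notebook yields the same module name, which keeps image layers cacheable.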

/kind bug

Figure out how to monitor TFJobs launched by fairing

We need to figure out how we want to monitor TFJobs launched from a notebook using fairing.

Right now we just tail the logs of the pods. Does this prevent execution of other cells in the notebook?

Should we just provide links to dashboards? Should these links be configurable based on the user's deployment? E.g. on GKE we would provide links to Stackdriver and the GKE workloads dashboard.

Should we use the TFJobs dashboard?

Setup a basic CI/CD system

Let's use the basic tier of Travis CI to get unit tests and a simple integration test running with the local Docker builder.

Support batch prediction

We'd like fairing to support batch prediction.

See also #38 support deploying models.

We'd like to be able to fire off a batch-predict job from a notebook using fairing.

As with online prediction (#38), there are likely two cases:

  1. TF model
  2. Non TF model

In case 1 we can probably use our existing Beam transform for doing batch prediction with a saved model.

In case 2 the user probably needs to write a batch_predict method that can then be invoked.

We should consider whether to use Beam to parallelize the computation, or maybe just fire off a bunch of K8s jobs.

Avoid pinning exact versions of dependencies

Currently all of the PyPI dependencies are pinned to exact versions ("=="). Typically, dependencies are specified with minimum versions (">=") to avoid conflicts with other packages in the environment.
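A sketch of the proposed change in a requirements file (the package names and version numbers here are illustrative, not fairing's actual dependency list):

```
# requirements.txt: pin a floor, not an exact version
kubernetes>=8.0.0   # instead of kubernetes==8.0.0
docker>=3.4.1       # instead of docker==3.4.1
```

A ">=" floor lets pip resolve a version that satisfies both fairing and whatever else is installed in the user's notebook environment.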

Use a builder that does not require creating an image

For teams with less experience in setting up a secure and scalable image registry and promotion pipeline, we would like a mode/option/builder in fairing that allows running the notebook code in distributed TFJobs without building additional images.

How would the notebook code get distributed?
There are multiple solutions:

  1. Saving it to a ConfigMap, as #21 is doing.
  2. Saving it to persistent storage (perhaps set up by an admission controller).
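Option 1 can be sketched as building a standard ConfigMap manifest that carries the converted notebook code, which the TFJob pods then mount as a file (the manifest shape is standard Kubernetes; the function and resource names are illustrative):

```python
def notebook_configmap(name, code):
    """Build a ConfigMap manifest that ships the converted notebook
    code to the training pods, so no new image is needed.
    Sketch only; not fairing's actual implementation."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": {"notebook.py": code},
    }

manifest = notebook_configmap("fairing-code", "print('hello')")
```

The TFJob spec would then mount this ConfigMap at the path the entrypoint expects; note that ConfigMaps are capped at roughly 1 MiB, which is why option 2 (persistent storage) may be needed for larger codebases.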

Builder should return just the image

The builder should return just the image instead of a PodSpec. The container entrypoint and environment variables should be baked into the container manifest.

/area fairing
/priority p1

Support for tensor2tensor

It would be awesome if Fairing were compatible with tensor2tensor and if so to put together examples illustrating this.

Toward understanding compatibility: the default way of training with t2t is to run the main function in t2t_trainer.py, simplified below with logging and parts that don't seem relevant to compatibility omitted.

def main(argv):
  usr_dir.import_usr_dir(FLAGS.t2t_usr_dir)

  # Create HParams.
  if argv:
    set_hparams_from_args(argv[1:])
  hparams = create_hparams()

  if FLAGS.schedule == "run_std_server":
    run_std_server()
  trainer_lib.set_random_seed(FLAGS.random_seed)

  exp_fn = create_experiment_fn()
  exp = exp_fn(create_run_config(hparams), hparams)
  if is_chief():
    save_metadata(hparams)
  execute_schedule(exp)

Using parsed TensorFlow FLAGS attributes, this function first loads a user's registry of problem, model, and hparam-set definitions, calls run_std_server if the worker is a PS, and then constructs and executes an experiment function (which, as you can see, pulls a lot of info from FLAGS to parameterize the experiment, but could be called directly; see https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/trainer_lib.py#L650).

I don't understand the design of Fairing in detail, so I don't yet know whether there is more complexity to consider than just preparing an object that trains a model when obj.train() is called. If it were that simple, you could just have a decorator that accepts the multitude of flags typically passed to t2t_trainer and passes these to a workflow similar to the one shown above (wherein an experiment function is created and executed).

Perhaps another approach to consider would be decorating sub-classes of T2TExperiment? Which brings up the distinction between a single model and an experiment.

Maybe one of you can help clarify this! @wbuchwalter @jlewi @r2d4

JupyterHub support

JupyterHub works differently from single-notebook servers, causing various things to break, such as authentication.

It seems Kubeflow will be dropping JupyterHub support (kubeflow/kubeflow#1769), but we might still want to support it as some users may still want to use it.

Add TensorBoard support

It should be easy for a user to specify that they want a TensorBoard instance started alongside their jobs, e.g.:

@Training(..., tensorboard=True)

This supposes telling the user where to write the tfevents files so that they can be mounted on an existing multi-writer PVC (to support hyperparameter search later on).
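One way the tensorboard=True option could communicate the log directory to user code is via an environment variable, sketched below (the decorator name, flag, and variable name are all hypothetical, not fairing's actual API):

```python
import functools
import os

def training(tensorboard=False, log_dir="/var/log/tfevents"):
    """Hypothetical sketch of a tensorboard=True option: expose the
    tfevents directory to the wrapped training function through an
    environment variable, so a TensorBoard sidecar can mount the same
    PVC path. Not fairing's real decorator."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if tensorboard:
                # Training code reads TF_LOG_DIR to decide where to
                # write summaries; TensorBoard watches the same path.
                os.environ["TF_LOG_DIR"] = log_dir
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@training(tensorboard=True)
def train():
    return os.environ.get("TF_LOG_DIR")
```

The key design question remains where log_dir points: it must be writable by every worker and readable by the TensorBoard instance, hence the multi-writer PVC above.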

Admin Config

An admin should be able to create a ConfigMap that would then be pulled by fairing and act as a global configuration.
This ConfigMap could contain settings such as:

  • Which builders are allowed
  • Which backend is allowed (Kubeflow vs. plain K8s)
  • A restriction on the maximum number of parallel runs
  • etc.

Remove Metaparticle

Metaparticle isn't maintained anymore, so we should probably just use the Kubernetes client directly instead.

Chain preprocessors

It might be interesting to chain multiple pre-processors together

e.g.

ConvertNotebookPreprocessor ->
FunctionPreProcessor ->
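The chain above could be sketched as a composite that feeds each stage's output files into the next stage (class and method names are illustrative, not fairing's actual preprocessor interface):

```python
class ChainedPreprocessor:
    """Run preprocessors in sequence: each stage receives the file list
    produced by the previous stage. Sketch only."""
    def __init__(self, *preprocessors):
        self.preprocessors = preprocessors

    def preprocess(self, files):
        for p in self.preprocessors:
            files = p.preprocess(files)
        return files

class ConvertNotebookPreprocessor:
    def preprocess(self, files):
        # Stand-in: convert each .ipynb to a .py file.
        return [f.replace(".ipynb", ".py") for f in files]

class FunctionPreProcessor:
    def preprocess(self, files):
        # Stand-in: wrap each module with a function entrypoint.
        return ["wrapped_" + f for f in files]

chain = ChainedPreprocessor(ConvertNotebookPreprocessor(),
                            FunctionPreProcessor())
result = chain.preprocess(["train.ipynb"])
```

Because every stage shares the same preprocess(files) shape, the chain itself satisfies the preprocessor interface and could be nested further.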

Add support for serializing current state of a function in a notebook along with its deps

When users are working in a notebook, they would like to submit a function in its current state to a remote cluster for training. This falls under the category of using notebooks as a scratch pad, so executing the notebook / corresponding Python code linearly doesn't make sense for these users. We can use cloudpickle (https://github.com/cloudpipe/cloudpickle) to serialize the functions/classes in the notebook. Cloudpickle was developed in the PySpark project, and its prominent users include PySpark and Ray.

Initially we can restrict serialization support to functions and global variables of standard types, like Python primitives and numpy arrays / pandas dataframes. Then, based on feedback and need, we can extend support to things like TF graphs. For example, Ray lets users add custom serialization support for complex objects.
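The reason cloudpickle is needed rather than the standard library's pickle: plain pickle serializes functions by reference (module name plus qualified name), so objects defined interactively in a notebook typically cannot round-trip. The sketch below only demonstrates the stdlib limitation; cloudpickle works because it serializes the function body itself.

```python
import pickle

# A lambda stands in for any interactively defined function: its
# qualified name ("<lambda>") cannot be looked up by the unpickler,
# so plain pickle refuses to serialize it.
f = lambda x: x + 1

try:
    pickle.dumps(f)
    serializable_with_stdlib = True
except (pickle.PicklingError, AttributeError, TypeError):
    serializable_with_stdlib = False
```

cloudpickle.dumps(f) would succeed here, capturing the code object and closure, which is exactly the property notebook submission needs.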

No module named 'fairing.training'

Getting this when triggering a job:
'Traceback (most recent call last):'
'  File "/app/main.py", line 31, in <module>'
'    from fairing.training import kubeflow'
"ModuleNotFoundError: No module named 'fairing.training'"

Seems like the new version has not been published yet (fairing==0.0.3).

fairing should support deploying models

We'd like to support deploying models using fairing.

A strawman goal would be to be able to walk through MNIST end-to-end entirely in a notebook:

e.g. train the model, deploy the model, and send predictions to it.

Design Document

Write a design doc explaining fairing mechanisms at a high-level.

Support Python2

A sizeable chunk of data scientists are still using Python 2, so we should make fairing compatible with it sooner rather than later.

Running in ipython kernel gives an error

commit hash: 27357f8

:~/fairing$ ipython
Python 3.5.3 (default, Sep 27 2018, 17:25:39) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.3.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import fairing                                                                                         
---------------------------------------------------------------------------
MultipleInstanceError                     Traceback (most recent call last)
<ipython-input-1-011d40e952a6> in <module>
----> 1 import fairing

~/fairing/fairing/__init__.py in <module>
      4     from fairing.runtime_config import config
      5 else:
----> 6     from fairing.config import config
      7 
      8 name = "fairing"

~/fairing/fairing/config.py in <module>
     98         return ret_fn
     99 
--> 100 config = Config()

~/fairing/fairing/config.py in __init__(self)
     43 class Config(object):
     44     def __init__(self):
---> 45         self.reset()
     46 
     47     def reset(self):

~/fairing/fairing/config.py in reset(self)
     46 
     47     def reset(self):
---> 48         if notebook_util.is_in_notebook():
     49             self._preprocessor_name = 'notebook'
     50         else:

~/fairing/fairing/notebook/notebook_util.py in is_in_notebook()
     27 def is_in_notebook():
     28     try:
---> 29         ipykernel.get_connection_info()
     30     except RuntimeError:
     31         return False

~/fairing/venv/lib/python3.5/site-packages/ipykernel/connect.py in get_connection_info(connection_file, unpack, profile)
    126     depending on `unpack`.
    127     """
--> 128     cf = _find_connection_file(connection_file, profile)
    129 
    130     with open(cf) as f:

~/fairing/venv/lib/python3.5/site-packages/ipykernel/connect.py in _find_connection_file(connection_file, profile)
     91     if connection_file is None:
     92         # get connection file from current kernel
---> 93         return get_connection_file()
     94     else:
     95         # connection file specified, allow shortnames:

~/fairing/venv/lib/python3.5/site-packages/ipykernel/connect.py in get_connection_file(app)
     34             raise RuntimeError("app not specified, and not in a running Kernel")
     35 
---> 36         app = IPKernelApp.instance()
     37     return filefind(app.connection_file, ['.', app.connection_dir])
     38 

~/fairing/venv/lib/python3.5/site-packages/traitlets/config/configurable.py in instance(cls, *args, **kwargs)
    421             raise MultipleInstanceError(
    422                 'Multiple incompatible subclass instances of '
--> 423                 '%s are being created.' % cls.__name__
    424             )
    425 

MultipleInstanceError: Multiple incompatible subclass instances of IPKernelApp are being created.



Run notebook locally launch on K8s cluster

I think one of the biggest use cases for fairing will be running a notebook locally but firing off jobs on a cluster.

Does this already work? I assume it's just a matter of configuring fairing to communicate properly with the K8s cluster.

Let's assume the user isn't running in a container; i.e. they just ran Jupyter on their local machine. Can we get the K8s config from their kubeconfig file?

We can close this issue once we have instructions for how to make this work.
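For reference, the lookup a client performs against a kubeconfig file can be sketched in pure Python (the real kubernetes Python client does this via config.load_kube_config(); the sample data below is illustrative):

```python
def api_server_for_current_context(kubeconfig):
    """Resolve the API server URL from a kubeconfig-shaped mapping:
    current-context -> context -> cluster -> server. Sketch only."""
    context_name = kubeconfig["current-context"]
    context = next(c for c in kubeconfig["contexts"]
                   if c["name"] == context_name)
    cluster_name = context["context"]["cluster"]
    cluster = next(c for c in kubeconfig["clusters"]
                   if c["name"] == cluster_name)
    return cluster["cluster"]["server"]

# Minimal kubeconfig-shaped sample (hypothetical cluster).
sample = {
    "current-context": "dev",
    "contexts": [{"name": "dev", "context": {"cluster": "dev-cluster"}}],
    "clusters": [{"name": "dev-cluster",
                  "cluster": {"server": "https://1.2.3.4"}}],
}
server = api_server_for_current_context(sample)
```

So as long as fairing (or the client library it wraps) reads ~/.kube/config when it is not running in-cluster, the local-notebook case should need no extra configuration.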

import fairing fails for gcr.io/kubeflow-images-public/fairing:v0.0.1

I tried spinning up a notebook in Kubeflow 0.3 using image gcr.io/kubeflow-images-public/fairing:v0.0.1

image: gcr.io/kubeflow-images-public/fairing:v0.0.1
imageID: docker-pullable://gcr.io/kubeflow-images-public/fairing@sha256:3cfffe528819a307ebe845b22b4eb9bc0f18f743c1ab19b3a8c0c2f88ab78f34

In the notebook when I try to import fairing I get the following error

ImportErrorTraceback (most recent call last)
<ipython-input-4-011d40e952a6> in <module>()
----> 1 import fairing

ImportError: No module named fairing

But if I start a pod running that image

kubectl run -it jlewi-fairing --restart=Never --image=gcr.io/kubeflow-images-public/fairing:v0.0.1 --command /bin/bash

Import works just fine.

Seems to be different from #31.

training mode/config should be set via functions not class decorators

Currently the class decorator determines the training mode (e.g. local vs. distributed).

@kubeflow.DistributedTraining(worker_count=3, ps_count=1, namespace='kubeflow')
class MyModel(object):
    def train(self):
         ....

The downside of this approach is that the user has to change the class annotation in order to change the mode of training. I think it would be preferable to change how it runs just with parameters to the train function, e.g.:

config = LocalConfig()
myModel.train(config)

config = DistributedConfig(....)
myModel.train(config)

Update examples

The examples are out of date with respect to the new API.

/kind bug
/kind documentation

Create issues to get an initial POC out

It would be great to get an initial POC out as part of 0.3.

What issues need to be resolved in order to get to an initial POC?

Admittedly this doesn't leave much time since we are trying to cut 0.3 this week, although we could always cherry-pick it into minor releases.

Let's create specific issues for all the work that needs to be addressed.

Job log ends with rpc error

I ran the demo in #32.

The logs from the job end with an rpc error

b'At step 1900, loss = 0.1070864349603653'
b'rpc error: code = Unknown desc = Error: No such container: 6a5c382f4a57c40c484d3cd0a7643b58c2219fe5ecaaed66d4e32a1e70012e74'

I believe this error is to be expected because the container will be deleted when the job finishes and if we are tailing the logs the tail will end with an error.

Assuming that's accurate, this behavior will undoubtedly confuse some people.

So we should try to avoid surfacing that error if possible.

Fire off K8s job to train python models

fairing should support firing off K8s jobs to train Python models (e.g. XGBoost / scikit-learn models).

Need to figure out

  1. What the syntax is
  2. How to move data back and forth

Add support for managed backends

In addition to supporting Kubeflow and native Kubernetes job submissions, we should provide the option for users to swap out their deployment for a managed service as well.

Support TF models online predictions with TF Serving

With fairing we'd like to be able to predict locally within the notebook.

For example

model=Model()
model.predict(...)

We need to figure out the exact syntax and model signatures.

For example, do we use tf.Example protos?

Train TF locally within notebook

Related to #39 (don't use decorators).

We'd like the user to be able to train the model locally within the notebook. Right now this is possible, but it requires setting appropriate function decorators.

We'd like this to be more natural e.g.

model = MyModel()
model.train()

Pypi Organization

Figure out how to allow owners to publish new releases of the package on PyPI.

Notebook narratives (question/feature)

Putting together narratives about lines of experimentation involving a series of jobs, loss plots, etc. would be really valuable for both communication as well as personal note-taking (e.g. https://github.com/cwbeitel/tk/blob/master/docs/demo.ipynb). @wbuchwalter shows an example of launching a TFJob from Jupyter notebook here: https://github.com/wbuchwalter/fairing/blob/master/examples/kubeflow-jupyter-notebook/TfJob.ipynb

Does this design require instantiating the model in the notebook each time a new job is run, even if the model does not change? E.g. when hparams are being tuned. This isn't necessarily bad - more verbose but also more clear.

What about running jobs with well-established models such as ones from tensor2tensor? In this case (and perhaps this answers the above question) perhaps people could just instantiate a trivial subclass of one of these models, e.g.

@Train(repository='<your-repository>',
       base_image='tensorflow/tensorflow:latest-py3',
       architecture=BasicArchitecture())
class MyModel(tensor2tensor.models.ImageTransformer2D):
  pass

Dependency Management

We need a better way to handle dependency management.
Ideally we would like to ensure that all Python packages installed locally will also be installed in the target environment. This would solve two issues:

  • Ensure that all the needed dependencies are available (e.g. some people might install additional deps by doing something like !pip install somelib in their notebook)
  • Greatly simplify the development workflow, as it would remove the need to build a custom image on every change to fairing (or to release on Test PyPI).

cc @r2d4
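One way to snapshot the local environment for replication in the target image could look like the following (illustrative sketch, not fairing's implementation; importlib.metadata requires Python 3.8+):

```python
import importlib.metadata

def local_requirements():
    """Snapshot the locally installed distributions as pinned
    requirement strings, so the builder can install the same versions
    in the target image. Sketch only."""
    return sorted(
        "{0}=={1}".format(d.metadata["Name"], d.version)
        for d in importlib.metadata.distributions()
        if d.metadata["Name"]
    )

reqs = local_requirements()
```

The builder would write these lines to a requirements.txt inside the generated Dockerfile, closing the gap between !pip install in the notebook and the image's contents.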

Job errors; fairing not stripping out code

I took the demo notebook in #32 and added a cell with the following commands

# Should be unnecessary if using an image with the credential helpers
!gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}
!gcloud auth configure-docker --quiet

When I submitted the model for training with fairing the output was

Running...
Uploading gcr.io/code-search-demo/fairing-job:4b1fa953164a8dcfeca28cecf0e3e6c8da4eda8eff872d0bccbfecc3ec6c948a
Pushed image gcr.io/code-search-demo/fairing-job:4b1fa953164a8dcfeca28cecf0e3e6c8da4eda8eff872d0bccbfecc3ec6c948a
Training(s) launched.
Waiting for job to start...

b'Traceback (most recent call last):'
b'  File "/app/demo.py", line 32, in <module>'
b"    get_ipython().system('gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}')"
b"NameError: name 'get_ipython' is not defined"
b'rpc error: code = Unknown desc = Error: No such container: 68b744c52794350a08b859877c624d3c6c8952d0bea71c3cf6ea59ea75d7b875'

It looks like the problem might be due to the fact that the shell commands aren't being stripped out as part of converting the notebook to a python file.

Add Support for Hyperparameters Search

I should be able to do something akin to:

@HPTuning(runs=10, ...)

And it should start multiple jobs with different HP combination so I can quickly find the best one.

You can see the previously existing implementation of this here: https://github.com/wbuchwalter/fairing/blob/master/examples/hyperparameter-tuning/main.py.

This implementation is not perfect, however, mainly because the hyperparameters were computed at runtime rather than at deployment time; this is problematic because it doesn't guarantee coverage of the hyperparameter search space.

Consider the following case:
I want to try 3 different values for momentum: [0.1, 0.5, 1].
Ideally we want to start 3 jobs, each with a different value for momentum. However, because we are selecting at runtime, we have to choose randomly among the possibilities, which in the worst case means that all jobs run with the same momentum in parallel.

A better approach would probably be to ask the user to provide the search space as an argument to the decorator, e.g.:

@HPTuning(runs=10, hp={
    'momentum': [0.1, 0.5, 1],
    'learning_rate': log_uniform(0.001, 0.5)
})

And pre-compute the values to ensure that each run has a unique combination, exporting them as environment variables that are picked up at runtime.
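The pre-computation step can be sketched as enumerating the full grid at deploy time and then drawing runs without replacement, which guarantees no duplicate combinations (function names are illustrative; only discrete value lists are handled here, not distributions like log_uniform):

```python
import itertools
import random

def precompute_runs(hp_space, runs):
    """Enumerate every combination of the discrete search space at
    deploy time, then draw `runs` distinct combinations.
    Sketch only; not fairing's actual implementation."""
    grid = [dict(zip(hp_space, values))
            for values in itertools.product(*hp_space.values())]
    random.shuffle(grid)
    return grid[:runs]

space = {"momentum": [0.1, 0.5, 1],
         "learning_rate": [0.001, 0.01]}
assignments = precompute_runs(space, runs=3)
```

Each dict in assignments would then be exported as environment variables for one job, so the three momentum values above are guaranteed to be covered rather than sampled independently by each worker.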
