azure / kubeflow-labs

👩‍🔬 Train and Serve TensorFlow Models at Scale with Kubernetes and Kubeflow on Azure

License: Creative Commons Attribution 4.0 International

Languages: Python 96.02%, Smarty 2.15%, Dockerfile 1.83%
Topics: kubernetes, kubeflow, machine-learning, tensorflow, tensorflow-serving, distributed-tensorflow, docker, jupyter-notebook, jupyterhub

kubeflow-labs's Introduction

Labs for Training and Serving TensorFlow Models with Kubernetes and Kubeflow on Azure Container Service (AKS)

Prerequisites

  1. Have a valid Microsoft Azure subscription allowing the creation of an AKS cluster
  2. Docker client installed: Installing Docker
  3. Azure CLI (2.0) installed: Installing the Azure CLI 2.0
  4. Git CLI installed: Installing Git CLI
  5. kubectl installed: Installing Kubectl
  6. Helm installed: Installing Helm CLI (Note: on Windows you can extract the tar file using a tool such as 7-Zip.)
  7. ksonnet installed: Installing ksonnet CLI (a quick version check is sketched just after this list)
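A quick sanity check that the CLI prerequisites above are on the PATH (a minimal sketch; it assumes the Azure CLI 2.0, Helm 2, and ksonnet clients named in the list):

# Print client versions to confirm each tool is installed
docker --version
az --version
git --version
kubectl version --client
helm version --client   # Helm 2 client-only check
ks version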

Clone this repository somewhere so you can easily access the different source files:

git clone https://github.com/Azure/kubeflow-labs

Content Summary

Module | Description
0 - Introduction | Introduction to this workshop. Motivations and goals.
1 - Docker | Docker and containers 101.
2 - Kubernetes | Overview of important Kubernetes concepts.
3 - Helm | Introduction to Helm.
4 - Kubeflow | Introduction to Kubeflow and how to deploy it in your cluster.
5 - JupyterHub | Learn how to run JupyterHub to create and manage Jupyter notebooks using Kubeflow.
6 - TFJob | Introduction to TFJob and how to use it to deploy a simple TensorFlow training job.
7 - Distributed TensorFlow | Learn how to deploy and monitor distributed TensorFlow training jobs with TFJob.
8 - Hyperparameters Sweep with Helm | Using Helm to deploy a large number of training jobs testing different hypotheses, and TensorBoard to monitor and compare the results.
9 - Serving | Using TensorFlow Serving to serve predictions.
10 - Going Further | Links and resources to go further: autoscaling, distributed storage, etc.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Legal Notices

Microsoft and any contributors grant you a license to the Microsoft documentation and other content in this repository under the Creative Commons Attribution 4.0 International Public License, see the LICENSE file, and grant you a license to any code in the repository under the MIT License, see the LICENSE-CODE file.

Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.

Privacy information can be found at https://privacy.microsoft.com/en-us/

Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel or otherwise.

kubeflow-labs's People

Contributors

chzbrgr71, danielfrg, julienstroheker, microsoftopensource, msftgits, ritazh, sanketsudake, sozercan, wbuchwalter, xiaoyongzhu, yanrez


kubeflow-labs's Issues

9-serving updates

  • This command:

mc config host add minio $S3_ENDPOINT $ACCESS_KEY $ACCESS_SECRET_KEY

needs to be updated to:

mc config host add minio http://$S3_ENDPOINT $ACCESS_KEY $ACCESS_SECRET_KEY

  • Add details of how to build the exported MNIST model (a rough sketch follows this list).
  • Add details of how to build or use an off-the-shelf TensorFlow Serving container.
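As context for the first item, a minimal sketch of exporting an MNIST-style model as a SavedModel that TensorFlow Serving can load (illustrative only; the placeholder names x and y and the export path are assumptions, not the labs' actual code):

import tensorflow as tf

# Illustrative softmax-regression graph; the real model comes from the labs' MNIST training code.
x = tf.placeholder(tf.float32, [None, 784], name="x")
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b, name="y")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training would happen here ...
    # Write a versioned SavedModel directory; TF Serving expects <model_name>/<version>/.
    tf.saved_model.simple_save(
        sess,
        export_dir="./export/mnist/1",  # "1" is the model version
        inputs={"x": x},
        outputs={"y": y},
    )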

2-kubernetes: Error with initial job on AKS GPU

I'm walking through the labs and I got an error on my first job using the wbuchwalter/tf-mnist:gpu image. My YAML is shown below (copied from the labs). I created an AKS cluster with the Standard_NC6 VM size, and it looks like the GPU is in place.

When I create the job, the pod shows the following error:

2018-06-18 09:05:14.835740: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

My yaml for the job:

apiVersion: batch/v1
kind: Job # Our training should be a Job since it is supposed to terminate at some point
metadata:
  name: 2-mnist-training # Name of our job
spec:
  template: # Template of the Pod that is going to be run by the Job
    metadata:
      name: 2-mnist-training # Name of the pod
    spec:
      containers: # List of containers that should run inside the pod, in our case there is only one.
      - name: tensorflow
        image: wbuchwalter/tf-mnist:gpu # The image to run, you can replace by your own.
        args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1 # We ask Kubernetes to assign 1 GPU to this container
        volumeMounts:
        - name: nvidia
          mountPath: /usr/local/nvidia
      volumes:
      - name: nvidia
        hostPath:
          path: /usr/local/nvidia
      restartPolicy: OnFailure # restart the pod if it fails
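A quick way to confirm the cluster actually exposes a schedulable GPU resource (a sketch; it assumes the newer nvidia.com/gpu resource name advertised by the NVIDIA device plugin, whereas the job spec above uses the older alpha.kubernetes.io/nvidia-gpu name):

# Show allocatable GPUs per node under the nvidia.com/gpu resource name
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

# Or grep the node description for any NVIDIA-related resources
kubectl describe nodes | grep -i nvidia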

variable undefined

On page 5-JupyterHub:
ks apply ${YOUR_KF_ENV}
should instead be
ks apply default
or something else? The variable isn't defined. Also, the USERNAME variable is undefined (see the last few lines of the page).
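A minimal sketch of how the undefined variable could be set before running the command (assuming the ksonnet app has an environment named default, which ks init creates by default):

# List the environments known to this ksonnet app
ks env list

# Export the environment name and apply it
KF_ENV=default
ks apply ${KF_ENV}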

Why is the PS taken as the master in distributed training?

I'm trying to follow https://github.com/Azure/kubeflow-labs/tree/master/7-distributed-tensorflow to test distributed training, but the result I get is that the PS completes, not the master.

# RUNTIMEID=$(kubectl get tfjob mnist-simple-gpu-dist -o=jsonpath='{.spec.RuntimeId}')
# kubectl get po -lruntime_id=$RUNTIMEID -a
NAME                                        READY     STATUS      RESTARTS   AGE
mnist-simple-gpu-dist-master-0rzp-0-v0kk6   1/1       Running     0          2h
mnist-simple-gpu-dist-ps-0rzp-0-dtuin       0/1       Completed   0          2h
mnist-simple-gpu-dist-worker-0rzp-0-cz3f5   1/1       Running     0          2h 

And the PS logs are:

kubectl logs mnist-simple-gpu-dist-ps-0rzp-0-dtuin
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
2018-06-09 14:19:12.971461: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-06-09 14:19:12.972787: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> mnist-simple-gpu-dist-master-0rzp-0:2222}
2018-06-09 14:19:12.972811: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-06-09 14:19:12.972818: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> mnist-simple-gpu-dist-worker-0rzp-0:2222}
2018-06-09 14:19:12.974524: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
WARNING:tensorflow:From /app/main.py:151: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

WARNING:tensorflow:From /app/main.py:188: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-06-09 14:19:33.448292: I tensorflow/core/distributed_runtime/master_session.cc:1017] Start master session 8cede9eb21bff1b6 with config:
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.1088
Accuracy at step 10: 0.7341
Accuracy at step 20: 0.8266
Accuracy at step 30: 0.8784
Accuracy at step 40: 0.8966
Accuracy at step 50: 0.9095
Accuracy at step 60: 0.9149
Accuracy at step 70: 0.9176
Accuracy at step 80: 0.92
Accuracy at step 90: 0.9217
Adding run metadata for 99
Accuracy at step 100: 0.9283
Accuracy at step 110: 0.9244
Accuracy at step 120: 0.9369
Accuracy at step 130: 0.9415
Accuracy at step 140: 0.9421
Accuracy at step 150: 0.945
Accuracy at step 160: 0.9484
Accuracy at step 170: 0.9511

And here is the TFJob definition:

apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-06-09T14:19:10Z
  generation: 0
  name: mnist-simple-gpu-dist
  namespace: default
  resourceVersion: "5259591"
  selfLink: /apis/kubeflow.org/v1alpha1/namespaces/default/tfjobs/mnist-simple-gpu-dist
  uid: 14169ad0-6bf0-11e8-9b09-00163e085552
spec:
  RuntimeId: 0rzp
  replicaSpecs:
  - replicas: 1
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - command:
          - python
          - /app/main.py
          env:
          - name: TEST_TMPDIR
            value: /training
          image: ritazh/tf-mnist:distributedgpu 
          name: tensorflow
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
          - mountPath: /training
            name: kubeflow-dist-nas-mnist
        restartPolicy: OnFailure
        volumes:
        - name: kubeflow-dist-nas-mnist
          persistentVolumeClaim:
            claimName: kubeflow-dist-nas-mnist
    tfPort: 2222
    tfReplicaType: MASTER
  - replicas: 1
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - command:
          - python
          - /app/main.py
          image: ritazh/tf-mnist:distributedgpu 
          imagePullPolicy: Always
          name: tensorflow
          resources:
            limits:
              nvidia.com/gpu: "1"
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: WORKER
  - replicas: 1
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - command:
          - python
          - /app/main.py
          image: ritazh/tf-mnist:distributed
          imagePullPolicy: Always
          name: tensorflow
          resources: {}
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: PS
  terminationPolicy:
    chief:
      replicaIndex: 0
      replicaName: MASTER
  tfImage: tensorflow/tensorflow:1.3.0
status:
  phase: Running
  reason: ""
  replicaStatuses:
  - ReplicasStates:
      Running: 1
    state: Running
    tf_replica_type: MASTER
  - ReplicasStates:
      Running: 1
    state: Running
    tf_replica_type: WORKER
  - ReplicasStates:
      Succeeded: 1
    state: Succeeded
    tf_replica_type: PS
  state: Running
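For reference, the replica that TFJob treats as the chief (whose completion should end the job) can be read back from the terminationPolicy in the spec above; a quick check (sketch):

kubectl get tfjob mnist-simple-gpu-dist -o=jsonpath='{.spec.terminationPolicy.chief.replicaName}'
# Per the spec above, this prints: MASTER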

tfjob example 2

The image should be the GPU variant instead of the CPU one, since the spec requests GPUs.
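A sketch of what the corrected container spec could look like, assuming the wbuchwalter/tf-mnist image tags used elsewhere in these labs (the example file in question is not quoted in this issue):

containers:
- name: tensorflow
  image: wbuchwalter/tf-mnist:gpu   # GPU build of the image, matching the GPU request below
  resources:
    limits:
      nvidia.com/gpu: 1             # since a GPU is requested, the image must be the GPU variant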

Add TensorBoard

We should add instructions on how to deploy TensorBoard to monitor a job with ksonnet.
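A plain-manifest sketch (not the ksonnet prototype the issue asks for) of what a TensorBoard deployment could look like, assuming the training logs land on the kubeflow-dist-nas-mnist PVC mounted at /training, as in the TFJob shown earlier:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorboard
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorboard
  template:
    metadata:
      labels:
        app: tensorboard
    spec:
      containers:
      - name: tensorboard
        image: tensorflow/tensorflow:1.10.0  # any TF 1.x image that ships the tensorboard binary
        command: ["tensorboard", "--logdir=/training", "--port=6006"]
        ports:
        - containerPort: 6006                # TensorBoard's default port
        volumeMounts:
        - name: training-logs
          mountPath: /training
      volumes:
      - name: training-logs
        persistentVolumeClaim:
          claimName: kubeflow-dist-nas-mnist # assumption: same PVC used by the TFJob above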

4-KubeFlow ksonnet issue

When I run this line as part of the KubeFlow chapter:

ks param set kubeflow-core cloud aks

I end up with the following error:

ERROR could not find component: open C:\users\xxx\my-kubeflow\components\C:\users\xxx\my-kubeflow\components:
The filename, directory name, or volume label syntax is incorrect.

For some reason it's concatenating the directory twice. If I manually edit the libsonnet file, it doesn't get me much further because I get a similar path error on the next command.

What OS was the tutorial run on? I am using a Windows Azure DSVM; does it need to be Linux?
