azure / kubeflow-labs

👩‍🔬 Train and Serve TensorFlow Models at Scale with Kubernetes and Kubeflow on Azure

License: Creative Commons Attribution 4.0 International

Languages: Python 96.02%, Smarty 2.15%, Dockerfile 1.83%
Topics: kubernetes, kubeflow, machine-learning, tensorflow, tensorflow-serving, distributed-tensorflow, docker, jupyter-notebook, jupyterhub

kubeflow-labs's Introduction

Labs for Training and Serving TensorFlow Models with Kubernetes and Kubeflow on Azure Container Service (AKS)

Prerequisites

  1. Have a valid Microsoft Azure subscription allowing the creation of an AKS cluster
  2. Docker client installed: Installing Docker
  3. Azure CLI (2.0) installed: Installing the Azure CLI 2.0
  4. Git CLI installed: Installing Git CLI
  5. kubectl installed: Installing Kubectl
  6. Helm installed: Installing Helm CLI (Note: on Windows you can extract the tar file using a tool such as 7-Zip.)
  7. ksonnet installed: Installing ksonnet CLI (a quick version check is sketched just after this list)
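A quick sanity check that the CLI prerequisites above are on the PATH (a minimal sketch; it assumes the Azure CLI 2.0, Helm 2, and ksonnet clients named in the list):

# Print client versions to confirm each tool is installed
docker --version
az --version
git --version
kubectl version --client
helm version --client   # Helm 2 client-only check
ks version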

Clone this repository somewhere so you can easily access the different source files:

git clone https://github.com/Azure/kubeflow-labs

Content Summary

Module | Description
0 - Introduction | Introduction to this workshop. Motivations and goals.
1 - Docker | Docker and containers 101.
2 - Kubernetes | Overview of important Kubernetes concepts.
3 - Helm | Introduction to Helm.
4 - Kubeflow | Introduction to Kubeflow and how to deploy it in your cluster.
5 - JupyterHub | Learn how to run JupyterHub to create and manage Jupyter notebooks using Kubeflow.
6 - TFJob | Introduction to TFJob and how to use it to deploy a simple TensorFlow training job.
7 - Distributed TensorFlow | Learn how to deploy and monitor distributed TensorFlow training jobs with TFJob.
8 - Hyperparameters Sweep with Helm | Using Helm to deploy a large number of training jobs testing different hypotheses, and TensorBoard to monitor and compare the results.
9 - Serving | Using TensorFlow Serving to serve predictions.
10 - Going Further | Links and resources to go further: autoscaling, distributed storage, etc.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Legal Notices

Microsoft and any contributors grant you a license to the Microsoft documentation and other content in this repository under the Creative Commons Attribution 4.0 International Public License, see the LICENSE file, and grant you a license to any code in the repository under the MIT License, see the LICENSE-CODE file.

Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.

Privacy information can be found at https://privacy.microsoft.com/en-us/

Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel or otherwise.

kubeflow-labs's People

Contributors

chzbrgr71, danielfrg, julienstroheker, microsoftopensource, msftgits, ritazh, sanketsudake, sozercan, wbuchwalter, xiaoyongzhu, yanrez


kubeflow-labs's Issues

9-serving updates

  • This command:

mc config host add minio $S3_ENDPOINT $ACCESS_KEY $ACCESS_SECRET_KEY

needs to be updated to:

mc config host add minio http://$S3_ENDPOINT $ACCESS_KEY $ACCESS_SECRET_KEY

  • Add details of how to build the exported MNIST model (a rough sketch follows this list).
  • Add details of how to build or use an off-the-shelf TensorFlow Serving container.
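As context for the first item, a minimal sketch of exporting an MNIST-style model as a SavedModel that TensorFlow Serving can load (illustrative only; the placeholder names x and y and the export path are assumptions, not the labs' actual code):

import tensorflow as tf

# Illustrative softmax-regression graph; the real model comes from the labs' MNIST training code.
x = tf.placeholder(tf.float32, [None, 784], name="x")
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b, name="y")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training would happen here ...
    # Write a versioned SavedModel directory; TF Serving expects <model_name>/<version>/.
    tf.saved_model.simple_save(
        sess,
        export_dir="./export/mnist/1",  # "1" is the model version
        inputs={"x": x},
        outputs={"y": y},
    )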

2-kubernetes: Error with initial job on AKS GPU

I'm walking through the labs and I got an error on my first job using the wbuchwalter/tf-mnist:gpu image. My YAML is shown below (copied from the labs). I created an AKS cluster with the Standard_NC6 VM size, and it looks like the GPU is in place.

When I create the job, the pod shows the following error:

2018-06-18 09:05:14.835740: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

My yaml for the job:

apiVersion: batch/v1
kind: Job # Our training should be a Job since it is supposed to terminate at some point
metadata:
  name: 2-mnist-training # Name of our job
spec:
  template: # Template of the Pod that is going to be run by the Job
    metadata:
      name: 2-mnist-training # Name of the pod
    spec:
      containers: # List of containers that should run inside the pod, in our case there is only one.
      - name: tensorflow
        image: wbuchwalter/tf-mnist:gpu # The image to run, you can replace by your own.
        args: ["--max_steps", "500"] # Optional arguments to pass to our command. By default the command is defined by ENTRYPOINT in the Dockerfile
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1 # We ask Kubernetes to assign 1 GPU to this container
        volumeMounts:
        - name: nvidia
          mountPath: /usr/local/nvidia
      volumes:
      - name: nvidia
        hostPath:
          path: /usr/local/nvidia
      restartPolicy: OnFailure # restart the pod if it fails
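A quick way to confirm the cluster actually exposes a schedulable GPU resource (a sketch; it assumes the newer nvidia.com/gpu resource name advertised by the NVIDIA device plugin, whereas the job spec above uses the older alpha.kubernetes.io/nvidia-gpu name):

# Show allocatable GPUs per node under the nvidia.com/gpu resource name
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

# Or grep the node description for any NVIDIA-related resources
kubectl describe nodes | grep -i nvidia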

variable undefined

On page 5-JupyterHub:
ks apply ${YOUR_KF_ENV}
should instead be
ks apply default
or something else? The variable isn't defined. Also, the USERNAME variable is undefined (see the last few lines of the page).
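A minimal sketch of how the undefined variable could be set before running the command (assuming the ksonnet app has an environment named default, which ks init creates by default):

# List the environments known to this ksonnet app
ks env list

# Export the environment name and apply it
KF_ENV=default
ks apply ${KF_ENV}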

Why is the PS taken as the master in distributed training?

I'm trying to follow https://github.com/Azure/kubeflow-labs/tree/master/7-distributed-tensorflow to test distributed training, but the result I get is that the PS completes, not the master.

# RUNTIMEID=$(kubectl get tfjob mnist-simple-gpu-dist -o=jsonpath='{.spec.RuntimeId}')
# kubectl get po -lruntime_id=$RUNTIMEID -a
NAME                                        READY     STATUS      RESTARTS   AGE
mnist-simple-gpu-dist-master-0rzp-0-v0kk6   1/1       Running     0          2h
mnist-simple-gpu-dist-ps-0rzp-0-dtuin       0/1       Completed   0          2h
mnist-simple-gpu-dist-worker-0rzp-0-cz3f5   1/1       Running     0          2h 

And the PS logs are:

kubectl logs mnist-simple-gpu-dist-ps-0rzp-0-dtuin
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
2018-06-09 14:19:12.971461: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-06-09 14:19:12.972787: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> mnist-simple-gpu-dist-master-0rzp-0:2222}
2018-06-09 14:19:12.972811: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2018-06-09 14:19:12.972818: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> mnist-simple-gpu-dist-worker-0rzp-0:2222}
2018-06-09 14:19:12.974524: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
WARNING:tensorflow:From /app/main.py:151: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

WARNING:tensorflow:From /app/main.py:188: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-06-09 14:19:33.448292: I tensorflow/core/distributed_runtime/master_session.cc:1017] Start master session 8cede9eb21bff1b6 with config:
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.1088
Accuracy at step 10: 0.7341
Accuracy at step 20: 0.8266
Accuracy at step 30: 0.8784
Accuracy at step 40: 0.8966
Accuracy at step 50: 0.9095
Accuracy at step 60: 0.9149
Accuracy at step 70: 0.9176
Accuracy at step 80: 0.92
Accuracy at step 90: 0.9217
Adding run metadata for 99
Accuracy at step 100: 0.9283
Accuracy at step 110: 0.9244
Accuracy at step 120: 0.9369
Accuracy at step 130: 0.9415
Accuracy at step 140: 0.9421
Accuracy at step 150: 0.945
Accuracy at step 160: 0.9484
Accuracy at step 170: 0.9511

And here is the TFJob definition:

apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  clusterName: ""
  creationTimestamp: 2018-06-09T14:19:10Z
  generation: 0
  name: mnist-simple-gpu-dist
  namespace: default
  resourceVersion: "5259591"
  selfLink: /apis/kubeflow.org/v1alpha1/namespaces/default/tfjobs/mnist-simple-gpu-dist
  uid: 14169ad0-6bf0-11e8-9b09-00163e085552
spec:
  RuntimeId: 0rzp
  replicaSpecs:
  - replicas: 1
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - command:
          - python
          - /app/main.py
          env:
          - name: TEST_TMPDIR
            value: /training
          image: ritazh/tf-mnist:distributedgpu 
          name: tensorflow
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
          - mountPath: /training
            name: kubeflow-dist-nas-mnist
        restartPolicy: OnFailure
        volumes:
        - name: kubeflow-dist-nas-mnist
          persistentVolumeClaim:
            claimName: kubeflow-dist-nas-mnist
    tfPort: 2222
    tfReplicaType: MASTER
  - replicas: 1
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - command:
          - python
          - /app/main.py
          image: ritazh/tf-mnist:distributedgpu 
          imagePullPolicy: Always
          name: tensorflow
          resources:
            limits:
              nvidia.com/gpu: "1"
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: WORKER
  - replicas: 1
    template:
      metadata:
        creationTimestamp: null
      spec:
        containers:
        - command:
          - python
          - /app/main.py
          image: ritazh/tf-mnist:distributed
          imagePullPolicy: Always
          name: tensorflow
          resources: {}
        restartPolicy: OnFailure
    tfPort: 2222
    tfReplicaType: PS
  terminationPolicy:
    chief:
      replicaIndex: 0
      replicaName: MASTER
  tfImage: tensorflow/tensorflow:1.3.0
status:
  phase: Running
  reason: ""
  replicaStatuses:
  - ReplicasStates:
      Running: 1
    state: Running
    tf_replica_type: MASTER
  - ReplicasStates:
      Running: 1
    state: Running
    tf_replica_type: WORKER
  - ReplicasStates:
      Succeeded: 1
    state: Succeeded
    tf_replica_type: PS
  state: Running
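For reference, the replica that TFJob treats as the chief (whose completion should end the job) can be read back from the terminationPolicy in the spec above; a quick check (sketch):

kubectl get tfjob mnist-simple-gpu-dist -o=jsonpath='{.spec.terminationPolicy.chief.replicaName}'
# Per the spec above, this prints: MASTER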

tfjob example 2

The image should be the GPU variant instead of the CPU one, since the spec requests GPUs.
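A sketch of what the corrected container spec could look like, assuming the wbuchwalter/tf-mnist image tags used elsewhere in these labs (the example file in question is not quoted in this issue):

containers:
- name: tensorflow
  image: wbuchwalter/tf-mnist:gpu   # GPU build of the image, matching the GPU request below
  resources:
    limits:
      nvidia.com/gpu: 1             # since a GPU is requested, the image must be the GPU variant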

Add TensorBoard

We should add instructions on how to deploy TensorBoard to monitor a job with ksonnet.
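A plain-manifest sketch (not the ksonnet prototype the issue asks for) of what a TensorBoard deployment could look like, assuming the training logs land on the kubeflow-dist-nas-mnist PVC mounted at /training, as in the TFJob shown earlier:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorboard
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorboard
  template:
    metadata:
      labels:
        app: tensorboard
    spec:
      containers:
      - name: tensorboard
        image: tensorflow/tensorflow:1.10.0  # any TF 1.x image that ships the tensorboard binary
        command: ["tensorboard", "--logdir=/training", "--port=6006"]
        ports:
        - containerPort: 6006                # TensorBoard's default port
        volumeMounts:
        - name: training-logs
          mountPath: /training
      volumes:
      - name: training-logs
        persistentVolumeClaim:
          claimName: kubeflow-dist-nas-mnist # assumption: same PVC used by the TFJob above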

4-KubeFlow ksonnet issue

When I run this line as part of the KubeFlow chapter:

ks param set kubeflow-core cloud aks

I end up with the following error:

ERROR could not find component: open C:\users\xxx\my-kubeflow\components\C:\users\xxx\my-kubeflow\components:
The filename, directory name, or volume label syntax is incorrect.

For some reason it's concatenating the directory twice. If I manually edit the libsonnet file, it doesn't get me much further because I get a similar path error on the next command.

What OS was the tutorial run on? I am using a Windows Azure DSVM; does it need to be Linux?
