
MPI Operator


The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes. Please check out this blog post for an introduction to MPI Operator and its industry adoption.

Installation

You can deploy the operator with default settings by running the following commands:

  • Latest Development Version
kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml
  • Release Version
kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.5.0/deploy/v2beta1/mpi-operator.yaml

Alternatively, follow the getting started guide to deploy Kubeflow.

An alpha version of MPI support was introduced with Kubeflow 0.2.0; you must use Kubeflow 0.2.0 or newer.

You can check whether the MPI Job custom resource is installed via:

kubectl get crd

The output should include mpijobs.kubeflow.org like the following:

NAME                                       AGE
...
mpijobs.kubeflow.org                       4d
...

If it is not included, you can add it as follows using kustomize:

git clone https://github.com/kubeflow/mpi-operator
cd mpi-operator
kustomize build manifests/overlays/kubeflow | kubectl apply -f -

Note that since Kubernetes v1.14, kustomize became a subcommand in kubectl, so you can also run:

kubectl kustomize base | kubectl apply -f -

Since Kubernetes v1.21, you can apply the overlay directly with:

kubectl apply -k manifests/overlays/kubeflow

Creating an MPI Job

You can create an MPI job by defining an MPIJob config file. See TensorFlow benchmark example config file for launching a multi-node TensorFlow benchmark training job. You may change the config file based on your requirements.

cat examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml

Deploy the MPIJob resource to start training:

kubectl apply -f examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml

Monitoring an MPI Job

Once the MPIJob resource is created, you should be able to see pods created to match the specified number of replicas. You can also monitor the job status from the status section. Here is sample output for a successfully completed job.

kubectl get -o yaml mpijobs tensorflow-benchmarks
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  creationTimestamp: "2019-07-09T22:15:51Z"
  generation: 1
  name: tensorflow-benchmarks
  namespace: default
  resourceVersion: "5645868"
  selfLink: /apis/kubeflow.org/v1alpha2/namespaces/default/mpijobs/tensorflow-benchmarks
  uid: 1c5b470f-a297-11e9-964d-88d7f67c6e6d
spec:
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --model=resnet101
            - --batch_size=64
            - --variable_update=horovod
            image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 2
  slotsPerWorker: 2
status:
  completionTime: "2019-07-09T22:17:06Z"
  conditions:
  - lastTransitionTime: "2019-07-09T22:15:51Z"
    lastUpdateTime: "2019-07-09T22:15:51Z"
    message: MPIJob default/tensorflow-benchmarks is created.
    reason: MPIJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2019-07-09T22:15:54Z"
    lastUpdateTime: "2019-07-09T22:15:54Z"
    message: MPIJob default/tensorflow-benchmarks is running.
    reason: MPIJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2019-07-09T22:17:06Z"
    lastUpdateTime: "2019-07-09T22:17:06Z"
    message: MPIJob default/tensorflow-benchmarks successfully completed.
    reason: MPIJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Launcher:
      succeeded: 1
    Worker: {}
  startTime: "2019-07-09T22:15:51Z"
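Two details in the object above are easy to get wrong and cheap to check programmatically: mpirun's -np value should match Worker replicas × slotsPerWorker (1 × 2 = 2 here), and the job's current state is the most recent condition whose status is "True". A minimal sketch, using plain Python dicts that mirror the object above (in practice you would parse the output of kubectl get mpijob -o json):

```python
# Sketch: validate -np against the replica layout and read the job's state.
# The dicts below mirror the MPIJob shown above; in practice you would load
# them from `kubectl get mpijob tensorflow-benchmarks -o json`.

def expected_np(worker_replicas, slots_per_worker):
    """mpirun's -np should equal Worker replicas x slotsPerWorker."""
    return worker_replicas * slots_per_worker

conditions = [
    {"type": "Created", "status": "True", "lastTransitionTime": "2019-07-09T22:15:51Z"},
    {"type": "Running", "status": "False", "lastTransitionTime": "2019-07-09T22:15:54Z"},
    {"type": "Succeeded", "status": "True", "lastTransitionTime": "2019-07-09T22:17:06Z"},
]

def current_state(conditions):
    """Return the type of the most recent condition whose status is 'True'."""
    active = [c for c in conditions if c["status"] == "True"]
    return max(active, key=lambda c: c["lastTransitionTime"])["type"]

print(expected_np(1, 2))        # 2, matching `-np 2` in the launcher command
print(current_state(conditions))  # Succeeded
```

Comparing the ISO-8601 timestamps lexicographically is safe here because they share a fixed format and time zone.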

Training runs for 100 steps and takes a few minutes on a GPU cluster. You can inspect the logs to watch the training progress. When the job starts, access the logs from the launcher pod:

PODNAME=$(kubectl get pods -l training.kubeflow.org/job-name=tensorflow-benchmarks,training.kubeflow.org/job-role=launcher -o name)
kubectl logs -f ${PODNAME}
TensorFlow:  1.14
Model:       resnet101
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global
             64 per device
Num batches: 100
Num epochs:  0.01
Devices:     ['horovod/gpu:0', 'horovod/gpu:1']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   horovod

...

40	images/sec: 154.4 +/- 0.7 (jitter = 4.0)	8.280
40	images/sec: 154.4 +/- 0.7 (jitter = 4.1)	8.482
50	images/sec: 154.8 +/- 0.6 (jitter = 4.0)	8.397
50	images/sec: 154.8 +/- 0.6 (jitter = 4.2)	8.450
60	images/sec: 154.5 +/- 0.5 (jitter = 4.1)	8.321
60	images/sec: 154.5 +/- 0.5 (jitter = 4.4)	8.349
70	images/sec: 154.5 +/- 0.5 (jitter = 4.0)	8.433
70	images/sec: 154.5 +/- 0.5 (jitter = 4.4)	8.430
80	images/sec: 154.8 +/- 0.4 (jitter = 3.6)	8.199
80	images/sec: 154.8 +/- 0.4 (jitter = 3.8)	8.404
90	images/sec: 154.6 +/- 0.4 (jitter = 3.7)	8.418
90	images/sec: 154.6 +/- 0.4 (jitter = 3.6)	8.459
100	images/sec: 154.2 +/- 0.4 (jitter = 4.0)	8.372
100	images/sec: 154.2 +/- 0.4 (jitter = 4.0)	8.542
----------------------------------------------------------------
total images/sec: 308.27
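Because each MPI rank prints its own rate, the aggregate throughput is approximately the sum of the per-rank rates at the final step: 2 × 154.2 ≈ 308.4, which lines up with the reported total of 308.27. As a quick sanity check:

```python
# Quick sanity check: total throughput should roughly equal the sum of the
# per-rank rates at the final step (values taken from the log above).
final_rates = [154.2, 154.2]  # images/sec reported by each of the 2 ranks

total = round(sum(final_rates), 1)
print(total)  # 308.4, close to the 308.27 the benchmark itself reports
```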

For a sample that uses Intel MPI, see:

cat examples/pi/pi-intel.yaml

For a sample that uses MPICH, see:

cat examples/pi/pi-mpich.yaml

Exposed Metrics

| Metric name | Metric type | Description | Labels |
| ----------- | ----------- | ----------- | ------ |
| mpi_operator_jobs_created_total | Counter | Counts number of MPI jobs created | |
| mpi_operator_jobs_successful_total | Counter | Counts number of successful MPI jobs | |
| mpi_operator_jobs_failed_total | Counter | Counts number of failed MPI jobs | |
| mpi_operator_job_info | Gauge | Information about MPIJob | launcher=&lt;launcher-pod-name&gt;, namespace=&lt;job-namespace&gt; |

Join Metrics

With kube-state-metrics, one can join metrics by labels. For example:

kube_pod_info * on(pod,namespace) group_left label_replace(mpi_operator_job_info, "pod", "$0", "launcher", ".*")
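The label_replace call copies each series' launcher label into a new pod label so it can join with kube_pod_info on (pod, namespace). Outside of PromQL, the transformation is just a regex capture applied to one label; the helper below is a hypothetical Python sketch that only mimics PromQL's function of the same name:

```python
import re

# Hypothetical labels of one mpi_operator_job_info series, as described above.
series = {"launcher": "tensorflow-benchmarks-launcher", "namespace": "default"}

def label_replace(labels, dst, replacement, src, regex):
    """Mimic PromQL label_replace: if `regex` fully matches labels[src],
    set labels[dst] to `replacement` with $0..$n expanded from the match."""
    m = re.fullmatch(regex, labels.get(src, ""))
    if m is None:
        return dict(labels)
    out = dict(labels)
    # Expand $0 (whole match) and numbered capture groups.
    out[dst] = re.sub(r"\$(\d+)", lambda g: m.group(int(g.group(1))), replacement)
    return out

joined = label_replace(series, "pod", "$0", "launcher", ".*")
print(joined["pod"])  # tensorflow-benchmarks-launcher
```

After this relabeling, the series carries the same (pod, namespace) key as kube_pod_info for the launcher pod, which is what makes the vector match in the PromQL example work.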

Docker Images

We push Docker images of mpioperator to Docker Hub for every release. You can build the image yourself using make:

make RELEASE_VERSION=dev IMAGE_NAME=registry.example.com/mpi-operator images

This will produce an image with the tag registry.example.com/mpi-operator:dev.

Contributing

Learn more in CONTRIBUTING.

Contributors

alculquicondor, arangogutierrez, carmark, cheyang, czheng94, dependabot[bot], emsixteeen, fisherxu, gaocegege, ggaaooppeenngg, hegaoyuan, jlewi, jq, kuizhiqing, lianghao208, lowang-bh, mimowo, mkkb473, naveensrinivasan, pugangxa, rongou, sheevy, stpabhi, tenzen-y, terrytangyuan, vtlrazin, wackxu, xhejtman, zhujl1991, zw0610


mpi-operator's Issues

Is the command in the launcher and worker pods different?

When I start an MPIJob as follows:

apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: mpijob-horovod
spec:
  replicas: 2
  backoffLimit: 0
  template:
    spec:
      containers:
      - image: uber/horovod:0.13.10-tf1.9.0-torch0.4.0-py3.5
        args:
         - cp -rf /code /temp && cd /temp && mpirun python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod
        command:
         - /bin/bash
         - -c
        imagePullPolicy: IfNotPresent
        name: mpijob-horovod
        resources:
          limits:
            nvidia.com/gpu: 8
        volumeMounts:
        - mountPath: /code
          name: codevolume
          readOnly: true
      restartPolicy: Never
      volumes:
      - hostPath:
          path: /root/Horovod/local/tf_cnn_benchmarks
          type: DirectoryOrCreate
        name: codevolume

I get an error: python: can't open file 'tf_cnn_benchmarks.py': [Errno 2] No such file or directory.
I checked that the /temp dir exists in the launcher pod but not in the workers.
I guess the command cp -rf /code /temp && cd /temp was not executed in the worker pods.

When I change the command to cd /code && mpirun python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod, it works well.

So, is the command in the launcher and worker pods different?

Thanks a lot if you reply.

ksonnet package

Create a new ksonnet package and add it to the kubeflow registry.

launcher pod is not OK when the namespace sets a ResourceQuota

1. I set a ResourceQuota on namespace cv. The YAML is as follows:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-demo
  namespace: cv
spec:
  hard:
    requests.cpu: "3"
    requests.memory: 30Gi
    requests.nvidia.com/gpu: 3
    limits.cpu: "3"
    limits.memory: 30Gi
    limits.nvidia.com/gpu: 3

2. I create an MPIJob as follows, but only the mpijob-0-worker-0 and mpijob-0-worker-1 pods are Running:

[root@ quota]# kubectl get pods -n cv
NAME                READY     STATUS    RESTARTS   AGE
mpijob-0-worker-0   1/1       Running   0          4m
mpijob-0-worker-1   1/1       Running   0          4m
apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: mpijob-0
  namespace: cv
spec:
  backoffLimit: 0
  replicas: 2
  template:
    spec:
      containers:
      - image: uber/horovod:0.13.10-tf1.9.0-torch0.4.0-py3.5
        command:
        - mpirun
        - python
        - tensorflow_mnist.py
        imagePullPolicy: IfNotPresent
        name: mpijob-0
        resources:
          limits:
            cpu: "1"
            memory: 10Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "1"
            memory: 10Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /examples/MNIST-data-0
          name: datavolume
        - mountPath: /examples/MNIST-data-1
          name: datavolume
      restartPolicy: Never
      volumes:
      - hostPath:
          path: /opt/data/mnist
          type: DirectoryOrCreate
        name: datavolume

3. I delete the ResourceQuota on namespace cv that I created in step one, and all the pods are Running:

[root@quota]# kubectl get pods -n cv
NAME                      READY     STATUS    RESTARTS   AGE
mpijob-0-launcher-l97mr   1/1       Running   0          28s
mpijob-0-worker-0         1/1       Running   0          5m
mpijob-0-worker-1         1/1       Running   0          5m

I guess the reason is that the launcher pod doesn't set resource limits and requests?
Could you share the design considerations behind giving the launcher pod QoS=BestEffort?

Error from server: error dialing backend: dial tcp HOSTIP:10250: getsockopt: connection timed out

In recent months, we have occasionally hit the following error when running an 8x8 job, and almost every time the connection timeout is to the same HOSTIP.
I ran telnet HOSTIP 10250 from another host to it, and it seems OK.
Could you give me some idea of how to fix this problem?
Thanks a lot.

Error from server: error dialing backend: dial tcp HOSTIP:10250: getsockopt: connection timed out

--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

the performance when using mpi-operator

I ran tf_cnn_benchmarks.py on two 8-GPU nodes using an MPIJob.
The output is:

Done warm up
Step	Img/sec	total_loss
1	images/sec: 3.7 +/- 0.0 (jitter = 0.0)	9.155
1	images/sec: 3.7 +/- 0.0 (jitter = 0.0)	9.296
...
10	images/sec: 4.9 +/- 2.7 (jitter = 2.2)	9.047
...
50	images/sec: 4.2 +/- 1.5 (jitter = 0.7)	8.723
...
100	images/sec: 3.6 +/- 0.8 (jitter = 0.8)	8.735
----------------------------------------------------------------
total images/sec: 58.10
----------------------------------------------------------------
... (each of the 16 ranks reports the same step rates and a total of 58.07-58.10)


When I run tf_cnn_benchmarks.py with pod host-network, the output is as follows.
Could you give me some suggestions on why the result above is so slow?

Step	Img/sec	total_loss
1	images/sec: 47.9 +/- 0.0 (jitter = 0.0)	9.066
...
10	images/sec: 47.3 +/- 0.7 (jitter = 1.1)	9.068
...
50	images/sec: 46.7 +/- 0.5 (jitter = 1.8)	8.780
...
100	images/sec: 46.6 +/- 0.3 (jitter = 1.9)	8.743
----------------------------------------------------------------
total images/sec: 745.14
----------------------------------------------------------------
... (each of the 16 ranks reports the same step rates and a total of about 745.13)
total images/sec: 745.13
----------------------------------------------------------------
100	images/sec: 46.6 +/- 0.3 (jitter = 2.0)	8.703
----------------------------------------------------------------
total images/sec: 745.12
----------------------------------------------------------------
100	images/sec: 46.6 +/- 0.3 (jitter = 1.9)	8.733
----------------------------------------------------------------
total images/sec: 745.12
----------------------------------------------------------------
100	images/sec: 46.6 +/- 0.3 (jitter = 1.9)	8.717
----------------------------------------------------------------
total images/sec: 745.13
----------------------------------------------------------------
100	images/sec: 46.6 +/- 0.3 (jitter = 2.0)	8.722
----------------------------------------------------------------
total images/sec: 745.14
----------------------------------------------------------------
100	images/sec: 46.6 +/- 0.3 (jitter = 1.9)	8.679
----------------------------------------------------------------
total images/sec: 745.14
----------------------------------------------------------------
100	images/sec: 46.6 +/- 0.3 (jitter = 2.0)	8.754
----------------------------------------------------------------
total images/sec: 745.11
----------------------------------------------------------------
100	images/sec: 46.6 +/- 0.3 (jitter = 1.9)	8.685
----------------------------------------------------------------
total images/sec: 745.14
----------------------------------------------------------------
100	images/sec: 46.6 +/- 0.3 (jitter = 2.0)	8.715
----------------------------------------------------------------
total images/sec: 745.13
----------------------------------------------------------------
100	images/sec: 46.6 +/- 0.3 (jitter = 2.0)	8.641
----------------------------------------------------------------
total images/sec: 745.13
----------------------------------------------------------------
100	images/sec: 46.6 +/- 0.3 (jitter = 1.9)	8.731
----------------------------------------------------------------
total images/sec: 745.13
----------------------------------------------------------------
100	images/sec: 46.6 +/- 0.3 (jitter = 1.9)	8.725
----------------------------------------------------------------
total images/sec: 745.11
----------------------------------------------------------------
100	images/sec: 46.6 +/- 0.3 (jitter = 2.0)	8.719
----------------------------------------------------------------
total images/sec: 745.06
----------------------------------------------------------------

better support for gang scheduling

The current code splits GPUs between the worker StatefulSet and the launcher Job. For gang scheduling, it may be better to allocate all the GPUs to the workers. I'm not sure whether that should be the default, but it should at least be an option.

Should the resource (CPU, memory) limits and requests be copied to the launcher?

When I create an MPIJob that specifies CPU and memory, the launcher Job also requires the same amount of resources. But it's difficult to allocate such a large amount of resources for the launcher.

apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: mpi
spec:
  replicas: 3
  template:
    spec:
      containers:
      - image: mpioperator/tensorflow-benchmarks:latest
        name: tensorflow-benchmarks
        command: ["tail", "-f", "/dev/null"]
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: 55
            memory: 120Gi
          requests:
            nvidia.com/gpu: 1
            cpu: 55
            memory: 120Gi

How should we handle this issue? Some ideas:

  1. Remove requests and limits from the launcher Job.

  2. Make the launcher act as a worker, so that it also participates in the computation.

@rongou @everpeace @jlewi Can you share your thoughts?

validate # GPUs

When specifying GPUs in the simplified version, the number of GPUs should satisfy some conditions. For example, if each node has 4 GPUs, the valid # GPUs should be:

1, 2, 4, 8, 12, ...

We should add validation for this when creating the MPIJob.
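As a sketch of what that validation could check, here is a hypothetical helper (`validGPUCount` is not in the codebase) implementing one rule consistent with the 1, 2, 4, 8, 12, ... example for 4-GPU nodes: a count at most `gpusPerNode` must divide a node evenly, and a larger count must be a whole number of nodes:

```go
package main

import "fmt"

// validGPUCount reports whether a requested GPU count can be scheduled
// cleanly on nodes with gpusPerNode GPUs each. Hypothetical helper for
// illustration; it matches the 1, 2, 4, 8, 12, ... example for 4-GPU nodes.
func validGPUCount(n, gpusPerNode int) bool {
	if n <= 0 || gpusPerNode <= 0 {
		return false
	}
	if n <= gpusPerNode {
		// Must divide a node evenly: 1, 2, 4 on a 4-GPU node.
		return gpusPerNode%n == 0
	}
	// Beyond one node, require whole nodes: 8, 12, 16, ...
	return n%gpusPerNode == 0
}

func main() {
	for _, n := range []int{1, 2, 3, 4, 6, 8, 12} {
		fmt.Printf("%d GPUs valid on 4-GPU nodes: %v\n", n, validGPUCount(n, 4))
	}
}
```

Whether counts like 3 (which would leave a node fragmented) should also be allowed is part of what needs deciding.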

Hard requirement on cluster-level roles

Currently, to deploy the MPI Operator, we have to apply cluster-level roles, e.g. the ClusterRole and ClusterRoleBinding specified in deploy/2-rbac.yaml. However, some infrastructures have rules/restrictions that disallow such cluster-level roles.

The cluster-level roles seem to be needed even when the MPIJob, controller, etc. are all in the same namespace, e.g.

E1220 21:58:10.883867       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Role: roles.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:kube:mpi-operator" cannot list roles.rbac.authorization.k8s.io at the cluster scope
E1220 21:58:10.884831       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.Job: jobs.batch is forbidden: User "system:serviceaccount:kube:mpi-operator" cannot list jobs.batch at the cluster scope
E1220 21:58:10.885790       1 reflector.go:205] github.com/kubeflow/mpi-operator/vendor/k8s.io/client-go/informers/factory.go:86: Failed to list *v1.StatefulSet: statefulsets.apps is forbidden: User "system:serviceaccount:kube:mpi-operator" cannot list statefulsets.apps at the cluster scope
E1220 21:58:10.886972       1 reflector.go:205] github.com/kubeflow/mpi-operator/pkg/client/informers/externalversions/factory.go:62: Failed to list *v1alpha1.MPIJob: mpijobs.kubeflow.org is forbidden: User "system:serviceaccount:kube:mpi-operator" cannot list mpijobs.kubeflow.org at the cluster scope

This is because the current MPI Operator code lists everything at the cluster scope, since we don't know which namespace an MPIJob might be in. Changes are needed in the codebase in order to remove this hard requirement.

Failed to launch MPIJob when specifying a GPU request

I made small changes to the example MPIJob, only adding a GPU request. The YAML is below:

apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: mpi
spec:
  replicas: 3
  template:
    spec:
      containers:
      - image: mpioperator/tensorflow-benchmarks:latest
        name: tensorflow-benchmarks
        command: ["tail", "-f", "/dev/null"]
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1

The launcher Job is not created, and the mpi-operator logs show:

E0811 04:53:50.407244       1 mpi_job_controller.go:369] error syncing 'default/mpi-dist-mpijob': Job.batch "mpi-dist-mpijob-launcher" is invalid: spec.template.spec.containers[0].resources.limits: Required value: Limit must be set for non overcommitable resources

separate out worker and launcher pod specs

The worker pods and the launcher typically have very different resource requirements. Right now we have a single spec for both, and the controller munges the launcher pod spec to make it more resource efficient. We should give users more control by separating out the worker and launcher pod specs. This needs to move the API version to v1alpha2.

See #53

MPI Operator v1alpha2 API Design Proposal

Hi community,

I am proposing the design for the v1alpha2 API version for MPI Operator. You are very welcome to join the discussion here if you have any questions, comments, concerns, or suggestions. Once we have a consensus from the community, we can start working on individual items.

Here are the main API changes before we dive into the detailed API spec (not including specific implementations):

  • Removes deprecated fields that are GPU specific, specifically GPUs and GPUsPerNode. This is the remaining work from #75 and #85.
  • Separates Template into LauncherSpec and WorkerSpec. See #54 and #90.
  • Replaces MPIJobLauncherStatusType with a more generic MPIJobPodStatusType that represents the different states of either the launcher or worker pods.
  • Adds ReplicaStatuses that represents statuses of all the worker replicas and removes WorkerReplicas since it can be inferred from ReplicaStatuses. See #90.

Below is the proposed API spec for v1alpha2:

type MPIJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              MPIJobSpec   `json:"spec,omitempty"`
	Status            MPIJobStatus `json:"status,omitempty"`
}

type MPIJobList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata"`
	Items           []MPIJob `json:"items"`
}

type MPIJobSpec struct {

	// Specifies the desired number of processing units the MPIJob should run on.
	// Mutually exclusive with the `Replicas` field.
	// +optional
	ProcessingUnits *int32 `json:"processingUnits,omitempty"`

	// The maximum number of processing units available per node.
	// Note that this will be ignored if the processing resources are explicitly
	// specified in the MPIJob pod spec.
	// +optional
	ProcessingUnitsPerNode *int32 `json:"processingUnitsPerNode,omitempty"`

	// The processing resource type, e.g. 'nvidia.com/gpu' or 'cpu'.
	// Defaults to 'nvidia.com/gpu'
	// +optional
	ProcessingResourceType string `json:"processingResourceType,omitempty"`

	// Specifies the number of slots per worker used in hostfile.
	// Defaults to the number of processing units per worker.
	// +optional
	SlotsPerWorker *int32 `json:"slotsPerWorker,omitempty"`

	// Run the launcher on the master.
	// Defaults to false.
	// +optional
	LauncherOnMaster bool `json:"launcherOnMaster,omitempty"`

	// Specifies the number of retries before marking this job failed.
	// Defaults to 6.
	// +optional
	BackoffLimit *int32 `json:"backoffLimit,omitempty"`

	// Specifies the duration in seconds relative to the start time that
	// the job may be active before the system tries to terminate it.
	// Note that this takes precedence over `BackoffLimit` field.
	// +optional
	ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`

	// Specifies the desired number of replicas the MPIJob should run on.
	// The `PodSpec` should specify the number of processing units.
	// Mutually exclusive with the `ProcessingUnits` field.
	// +optional
	Replicas *int32 `json:"replicas,omitempty"`

	// Describes the launcher pod that will be created when executing an MPIJob.
	LauncherSpec corev1.PodTemplateSpec `json:"launcherSpec,omitempty"`

	// Describes the worker pods that will be created when executing an MPIJob.
	WorkerSpec corev1.PodTemplateSpec `json:"workerSpec,omitempty"`
}

type MPIJobPodStatusType string

// The current observed state of the corresponding pod (either launcher or worker pods).
const (
	// Active means the corresponding pod is actively running.
	Active MPIJobPodStatusType = "Active"
	// Succeeded means the corresponding pod has succeeded.
	Succeeded MPIJobPodStatusType = "Succeeded"
	// Failed means the corresponding pod has failed its execution.
	Failed MPIJobPodStatusType = "Failed"
)


type MPIJobStatus struct {
	// Current status of the launcher job.
	// +optional
	LauncherStatus MPIJobPodStatusType `json:"launcherStatus,omitempty"`

	// Current statuses of the worker replicas.
	// +optional
	ReplicaStatuses []MPIJobPodStatusType `json:"replicaStatuses,omitempty"`

	// Represents time when the job was acknowledged by the job controller.
	// It is not guaranteed to be set in happens-before order across separate operations.
	// It is represented in RFC3339 form and is in UTC.
	StartTime *metav1.Time `json:"startTime,omitempty"`

	// Represents time when the job was completed. It is not guaranteed to
	// be set in happens-before order across separate operations.
	// It is represented in RFC3339 form and is in UTC.
	CompletionTime *metav1.Time `json:"completionTime,omitempty"`
}
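For illustration, an MPIJob under the proposed spec might look like the following. This is a hypothetical example: the field names (launcherSpec, workerSpec) and values are assumptions mirroring the proposed Go types above, not a finalized format.

```yaml
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  replicas: 2
  slotsPerWorker: 4
  launcherSpec:
    spec:
      containers:
      - name: launcher
        image: mpioperator/tensorflow-benchmarks:latest
        resources:
          limits:
            cpu: 1
            memory: 2Gi
  workerSpec:
    spec:
      containers:
      - name: worker
        image: mpioperator/tensorflow-benchmarks:latest
        resources:
          limits:
            nvidia.com/gpu: 4
```

Separating the two specs lets the launcher request only the small CPU/memory footprint it needs while the workers hold the GPUs, addressing the launcher-resource issue discussed above.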

cc: @rongou @anfeng @jlewi @everpeace @gaocegege @Nivedita-V @madhukarkm @ywskycn @ScorpioCPH @jian-he @cheyang @richardsliu

Feel free to tag others who might be interested.

Slow to start an MPIJob

My Kubernetes cluster has thousands of MPIJobs. When I create a new MPIJob, it takes a long time to start running.

I found the code below in md/mpi-operator/main.go:

kubeflowInformerFactory := informers.NewSharedInformerFactory(kubeflowClient, time.Second*30)

Because of how the informer is implemented, the mpi-operator checks and syncs the status of every MPIJob every 30 seconds. With thousands of MPIJobs, this makes a newly created job slow to start.

After I set the second parameter of NewSharedInformerFactory to a very large value (for example 200000), or to 0 (which means never resync), new jobs start quickly.

I just want to ask: are there any side effects if I make this modification?

Thank you very much

Rank numbering when running on multiple nodes

I run on 8 nodes with 8 GPUs each, and the startup log is as follows:

+ + POD_NAME=p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-1
+ POD_NAME=p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-0shift

+ shift
+ /opt/kube/kubectl exec+  p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-1/opt/kube/kubectl -- exec /bin/sh p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-0 -c -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "694157312" -mca ess_base_vpid 2 -mca ess_base_num_procs "9" -mca orte_node_regex "p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-launcher-c28l2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-0,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-1,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-3,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-4,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-5,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-6,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "694157312.0;tcp://192.168.112.3:53980" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "694157312" -mca ess_base_vpid 1 -mca ess_base_num_procs "9" -mca orte_node_regex "p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-launcher-c28l2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-0,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-1,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-3,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-4,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-5,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-6,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "694157312.0;tcp://192.168.112.3:53980" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-2
+ shift
+ /opt/kube/kubectl exec p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-2 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "694157312" -mca ess_base_vpid 3 -mca ess_base_num_procs "9" -mca orte_node_regex "p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-launcher-c28l2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-0,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-1,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-3,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-4,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-5,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-6,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "694157312.0;tcp://192.168.112.3:53980" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-3
+ shift
+ /opt/kube/kubectl exec p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-3 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "694157312" -mca ess_base_vpid 4 -mca ess_base_num_procs "9" -mca orte_node_regex "p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-launcher-c28l2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-0,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-1,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-3,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-4,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-5,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-6,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "694157312.0;tcp://192.168.112.3:53980" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-5
+ shift
+ /opt/kube/kubectl exec p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-5 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "694157312" -mca ess_base_vpid 6 -mca ess_base_num_procs "9" -mca orte_node_regex "p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-launcher-c28l2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-0,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-1,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-3,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-4,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-5,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-6,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "694157312.0;tcp://192.168.112.3:53980" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-4
+ shift
+ /opt/kube/kubectl exec p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-4 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "694157312" -mca ess_base_vpid 5 -mca ess_base_num_procs "9" -mca orte_node_regex "p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-launcher-c28l2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-0,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-1,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-3,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-4,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-5,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-6,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "694157312.0;tcp://192.168.112.3:53980" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-6
+ shift
+ /opt/kube/kubectl exec p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-6 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "694157312" -mca ess_base_vpid 7 -mca ess_base_num_procs "9" -mca orte_node_regex "p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-launcher-c28l2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-0,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-1,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-3,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-4,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-5,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-6,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "694157312.0;tcp://192.168.112.3:53980" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-7
+ shift
+ /opt/kube/kubectl exec p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-7 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "694157312" -mca ess_base_vpid 8 -mca ess_base_num_procs "9" -mca orte_node_regex "p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-launcher-c28l2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-0,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-1,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-2,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-3,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-4,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-5,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-6,p[1:6]be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-7@0(9)" -mca orte_hnp_uri "694157312.0;tcp://192.168.112.3:53980" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"

I got a missing-rank error:

training/Adam/DistributedAdam_Allreduce/HorovodAllreduce_training_Adam_gradients_AddN_149_0 [missing ranks: 2]

What I want to ask is:
is rank 2 the 3rd GPU in p6be5d36d24d4c7f9087c49150e4b0a4-mpijob-0-worker-0?

rank0.....rank7 in worker-0
rank8.....rank15 in worker-1
rank16....rank23 in worker-2
.....
rank56.....rank63 in worker-7?
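Assuming the default hostfile with 8 slots per worker and Open MPI's default slot-ordered mapping (which the startup log above suggests), that reading is consistent: each worker's slots are filled in order before moving to the next worker, so rank 2 would be the 3rd GPU on worker-0. A small hypothetical helper makes the mapping concrete:

```go
package main

import "fmt"

// rankLocation maps a global MPI rank to (worker index, local slot) under
// slot-ordered assignment: each worker's slots are filled before moving on
// to the next worker. Hypothetical helper for illustration.
func rankLocation(rank, slotsPerWorker int) (worker, slot int) {
	return rank / slotsPerWorker, rank % slotsPerWorker
}

func main() {
	for _, r := range []int{0, 2, 8, 63} {
		w, s := rankLocation(r, 8)
		fmt.Printf("rank %d -> worker-%d, local slot %d\n", r, w, s)
	}
}
```

Note the mapping can differ if mpirun is given a non-default --map-by policy, so checking the actual placement (e.g. via Horovod's rank/local_rank logging) is the reliable way to confirm.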

NCCL Connection refused when running Horovod with MPIJob

I run tf_cnn_benchmarks.py with Horovod in two ways.
(1) With a host-network Docker container, as described in Horovod in Docker:

On the first node:
docker run --rm -it --privileged --network=host -v /root/.ssh:/root/.ssh  -v /root/Horovod/local/tf_cnn_benchmarks:/code uber/horovod:0.13.10-tf1.9.0-torch0.4.0-py3.5  bash -c "/usr/sbin/sshd -p 12345; sleep infinity"

On the second node:
docker run --rm -it --privileged --network=host -v /root/.ssh:/root/.ssh  -v /root/Horovod/local/tf_cnn_benchmarks:/code uber/horovod:0.13.10-tf1.9.0-torch0.4.0-py3.5  mpirun -np 16 -H IP1:8,IP2:8 -bind-to none -map-by slot -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_IB_HCA=mlx5_0 -mca btl_tcp_if_exclude docker0,tunl0,lo -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH -mca plm_rsh_args "-p 12345 -vvvv" python /code/tf_cnn_benchmarks.py --model=resnet101 --batch_size=64 --variable_update=horovod

(2) With an MPIJob as follows:

apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: mpijob-horovod
spec:
  replicas: 2
  backoffLimit: 0
  template:
    spec:
      containers:
      - image: uber/horovod:0.13.10-tf1.9.0-torch0.4.0-py3.5
        args:
         - cd /code && mpirun python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod
        command:
         - /bin/bash
         - -c
        imagePullPolicy: IfNotPresent
        name: mpijob-horovod
        resources:
          limits:
            nvidia.com/gpu: 8
        volumeMounts:
        - mountPath: /code
          name: codevolume
      restartPolicy: Never
      volumes:
      - hostPath:
          path: /root/Horovod/local/tf_cnn_benchmarks
          type: DirectoryOrCreate
        name: codevolume

When I use the image uber/horovod:0.13.10-tf1.9.0-torch0.4.0-py3.5,
both the host-network Docker run and the MPIJob work well.

But when I use the image uber/horovod:0.15.2-tf1.12.0-torch1.0.0-py3.5,
the host-network Docker run still works,
but I get an NCCL Connection refused error with the MPIJob.
The error is as follows:

mpijob-horovod-worker-1:16:935 [0] include/socket.h:361 NCCL WARN Call to connect timeout : Connection refused
mpijob-horovod-worker-1:16:935 [0] NCCL INFO transport/net_socket.cu:139 -> 2
mpijob-horovod-worker-1:16:935 [0] NCCL INFO bootstrap.cu:19 -> 2
mpijob-horovod-worker-1:16:935 [0] NCCL INFO bootstrap.cu:225 -> 2
mpijob-horovod-worker-1:16:935 [0] NCCL INFO init.cu:420 -> 2
mpijob-horovod-worker-1:16:935 [0] NCCL INFO init.cu:557 -> 2

I also tried

cd /code  && mpirun -bind-to none -map-by slot -x NCCL_IB_HCA=mlx5_0 -mca btl_tcp_if_exclude docker0,tunl0,lo  python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod

and got the same error.

I compared the two Docker images uber/horovod:0.13.10-tf1.9.0-torch0.4.0-py3.5 and uber/horovod:0.15.2-tf1.12.0-torch1.0.0-py3.5.

The differences are:
horovod 0.15.2:

CUDNN_VERSION=7.3.1.20-1+cuda9.0
NCCL_VERSION=2.3.5-2+cuda9.0
openmpi-3.1.2.tar.gz

horovod 0.13.10:

CUDNN_VERSION=7.0.5.15-1+cuda9.0
NCCL_VERSION=2.2.12-1+cuda9.0
openmpi-3.0.0.tar.gz

All my test cases run on the same nodes, and I have no idea why I get NCCL Connection refused when I use MPIJob.

Could you give some suggestions? Thanks a lot.

Sometimes this problem appears

  • POD_NAME=tensorflow-benchmarks-16-custom-worker-0
  • shift
  • /opt/kube/kubectl exec tensorflow-benchmarks-16-custom-worker-0 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "633602048" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-benchmarks-16-custom-launcher-s96hl,tensorflow-benchmarks-16-custom-worker-0,tensorflow-benchmarks-16-custom-worker-1@0(3)" -mca orte_hnp_uri "633602048.0;tcp://192.168.201.141:33011" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
  • POD_NAME=tensorflow-benchmarks-16-custom-worker-1
  • shift
  • /opt/kube/kubectl exec tensorflow-benchmarks-16-custom-worker-1 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "633602048" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-benchmarks-16-custom-launcher-s96hl,tensorflow-benchmarks-16-custom-worker-0,tensorflow-benchmarks-16-custom-worker-1@0(3)" -mca orte_hnp_uri "633602048.0;tcp://192.168.201.141:33011" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
    error: unable to upgrade connection: Unauthorized

ORTE was unable to reliably start one or more daemons.
This usually is caused by:

  • not finding the required libraries and/or binaries on
    one or more nodes. Please check your PATH and LD_LIBRARY_PATH
    settings, or configure OMPI with --enable-orterun-prefix-by-default

  • lack of authority to execute on one or more specified nodes.
    Please verify your allocation and authorities.

  • the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
    Please check with your sys admin to determine the correct location to use.

  • compilation of the orted with dynamic libraries when static are required
    (e.g., on Cray). Please check your configure cmd line and consider using
    one of the contrib/platform definitions for your system type.

  • an inability to create a connection back to mpirun due to a
    lack of common network interfaces and/or no route found between
    them. Please check network connectivity (including firewalls
    and network routing requirements).



ORTE does not know how to route a message to the specified daemon
located on the indicated node:

my node: tensorflow-benchmarks-16-custom-launcher-s96hl
target node: tensorflow-benchmarks-16-custom-worker-1

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.

A process or daemon was unable to complete a TCP connection?

I create a job with 2 worker nodes, like this:

NAME                            READY     STATUS    RESTARTS   AGE       IP                NODE
mpijob-horovod-launcher-jgshh   1/1       Running   0          12s       192.168.254.130   bms-ccc8-0001
mpijob-horovod-worker-0         1/1       Running   0          15s       192.168.65.3      bms-ccc8-0003
mpijob-horovod-worker-1         1/1       Running   0          15s       192.168.170.6     bms-ccc8-0002


and get this error:

+ POD_NAME=mpijob-horovod-worker-0
+ shift
+ /opt/kube/kubectl exec mpijob-horovod-worker-0 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "346619904" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "mpijob-horovod-launcher-jgshh,mpijob-horovod-worker-0,mpijob-horovod-worker-1@0(3)" -mca orte_hnp_uri "346619904.0;tcp://192.168.254.130:58711" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=mpijob-horovod-worker-1
+ shift
+ /opt/kube/kubectl exec mpijob-horovod-worker-1 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "346619904" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "mpijob-horovod-launcher-jgshh,mpijob-horovod-worker-0,mpijob-horovod-worker-1@0(3)" -mca orte_hnp_uri "346619904.0;tcp://192.168.254.130:58711" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    mpijob-horovod-worker-1
  Remote host:   mpijob-horovod-launcher-jgshh
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
command terminated with exit code 1
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   mpijob-horovod-launcher-jgshh
  target node:  mpijob-horovod-worker-0

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------

I have run systemctl disable firewalld on every node and checked all the nodes with systemctl status firewalld:

[root@bms-ccc8-0001 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

[root@bms-ccc8-0002 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: inactive (dead)
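Since firewalld appears ruled out, one hedged avenue left by the error text itself is the routed=direct workaround it mentions. Assuming OpenMPI reads its MCA params file from the usual location under the install prefix (the path is an assumption), it could look like:

```
# appended to <prefix>/etc/openmpi-mca-params.conf, or passed as
# `-mca routed direct` on the mpirun command line
routed = direct
```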

Got an error when starting a job with OpenMPI 3.1.2

When I start my own MPIJob using an image with OpenMPI and Caffe installed, my launcher gets the logs below:

+ POD_NAME=caffe-mpi-test-worker-0
+ shift
+ /opt/kube/kubectl exec caffe-mpi-test-worker-0 -- /bin/sh -c  orted -mca ess "env" -mca ess_base_jobid "1741029376" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "caffe-mpi-test-launcher-xn[1:4]ds,caffe-mpi-test-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "1741029376.0;tcp://10.244.1.4:32935" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=caffe-mpi-test-worker-1
+ shift
+ /opt/kube/kubectl exec caffe-mpi-test-worker-1 -- /bin/sh -c  orted -mca ess "env" -mca ess_base_jobid "1741029376" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "caffe-mpi-test-launcher-xn[1:4]ds,caffe-mpi-test-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "1741029376.0;tcp://10.244.1.4:32935" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
Unable to connect to the server: Connection timed out
Unable to connect to the server: Connection timed out
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   caffe-mpi-test-launcher-xn4ds
  target node:  caffe-mpi-test-worker-0

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------

Here is my Dockerfile, which installs OpenMPI 3.1.2:

FROM nvidia/cuda:9.1-cudnn7-devel-ubuntu16.04

...

# uninstall all old versions of OpenMPI and install OpenMPI 3.1.2 from source
RUN apt-get purge -y libopenmpi* openmpi* && \
    cd /openmpi-3.1.2 && \
    ./configure --prefix=/usr/local/openmpi && \
    make -j"$(nproc)" all && \
    make install && \
    cd -
ENV PATH=/usr/local/openmpi/bin:$PATH
ENV LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH

# Configure OpenMPI to run good defaults:
# --bind-to none --map-by slot --mca btl_tcp_if_exclude lo,docker0
RUN echo "hwloc_base_binding_policy = none" >> /usr/local/openmpi/etc/openmpi-mca-params.conf && \
    echo "rmaps_base_mapping_policy = slot" >> /usr/local/openmpi/etc/openmpi-mca-params.conf && \
    echo "btl_tcp_if_exclude = lo,docker0" >> /usr/local/openmpi/etc/openmpi-mca-params.conf && ldconfig

...
#install caffe
...

It seems the launcher pod cannot connect to the worker pods. My network is flannel. How can I fix this error?
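One hedged thing to check on a flannel network: the Dockerfile above excludes only lo and docker0, but flannel traffic typically flows over cni0 and flannel.1 on the host (these interface names are an assumption about this cluster, not taken from the report). Excluding them as well, for both the BTL and the out-of-band channel, might keep OpenMPI off the wrong interfaces:

```
# openmpi-mca-params.conf under the configured --prefix
btl_tcp_if_exclude = lo,docker0,cni0,flannel.1
oob_tcp_if_exclude = lo,docker0,cni0,flannel.1
```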

MPIJob error

When I use mpi-operator, I get an error.
@rongou @everpeace can you help me?

2018-09-04T11:27:10.396528141Z + POD_NAME=mj-mpijob-worker-0
2018-09-04T11:27:10.396588241Z + shift
2018-09-04T11:27:10.3966107Z + /opt/kube/kubectl exec mj-mpijob-worker-0 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2540437504" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "mj-mpijob-launcher-xz82w,mj-mpijob-worker-0,mj-mpijob-worker-1@0(3)" -mca orte_hnp_uri "2540437504.0;tcp://10.99.54.237:54789" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
2018-09-04T11:27:10.403184919Z + POD_NAME=mj-mpijob-worker-1
2018-09-04T11:27:10.403206606Z + shift
2018-09-04T11:27:10.403214612Z + /opt/kube/kubectl exec mj-mpijob-worker-1 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2540437504" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "mj-mpijob-launcher-xz82w,mj-mpijob-worker-0,mj-mpijob-worker-1@0(3)" -mca orte_hnp_uri "2540437504.0;tcp://10.99.54.237:54789" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
2018-09-04T11:27:10.519774782Z error: You must be logged in to the server (Unauthorized)
2018-09-04T11:27:10.523895676Z --------------------------------------------------------------------------
2018-09-04T11:27:10.523923119Z ORTE was unable to reliably start one or more daemons.
2018-09-04T11:27:10.523928998Z This usually is caused by:
2018-09-04T11:27:10.523934092Z
2018-09-04T11:27:10.523938568Z * not finding the required libraries and/or binaries on
2018-09-04T11:27:10.523943765Z   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
2018-09-04T11:27:10.523964661Z   settings, or configure OMPI with --enable-orterun-prefix-by-default
2018-09-04T11:27:10.523969026Z
2018-09-04T11:27:10.523972741Z * lack of authority to execute on one or more specified nodes.
2018-09-04T11:27:10.523976999Z   Please verify your allocation and authorities.
2018-09-04T11:27:10.523981471Z
2018-09-04T11:27:10.523985666Z * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
2018-09-04T11:27:10.523989991Z   Please check with your sys admin to determine the correct location to use.
2018-09-04T11:27:10.523994541Z
2018-09-04T11:27:10.523998398Z *  compilation of the orted with dynamic libraries when static are required
2018-09-04T11:27:10.524002643Z   (e.g., on Cray). Please check your configure cmd line and consider using
2018-09-04T11:27:10.524007166Z   one of the contrib/platform definitions for your system type.
2018-09-04T11:27:10.524029258Z
2018-09-04T11:27:10.524033768Z * an inability to create a connection back to mpirun due to a
2018-09-04T11:27:10.524038063Z   lack of common network interfaces and/or no route found between
2018-09-04T11:27:10.52404222Z   them. Please check network connectivity (including firewalls
2018-09-04T11:27:10.524046688Z   and network routing requirements).
2018-09-04T11:27:10.52405065Z --------------------------------------------------------------------------
2018-09-04T11:27:10.524192237Z --------------------------------------------------------------------------
2018-09-04T11:27:10.524220037Z ORTE does not know how to route a message to the specified daemon
2018-09-04T11:27:10.524225782Z located on the indicated node:
2018-09-04T11:27:10.524230751Z
2018-09-04T11:27:10.524235151Z   my node:   mj-mpijob-launcher-xz82w
2018-09-04T11:27:10.524240959Z   target node:  mj-mpijob-worker-1
2018-09-04T11:27:10.524245479Z
2018-09-04T11:27:10.524249765Z This is usually an internal programming error that should be
2018-09-04T11:27:10.524254233Z reported to the developers. In the meantime, a workaround may
2018-09-04T11:27:10.524258733Z be to set the MCA param routed=direct on the command line or
2018-09-04T11:27:10.524263242Z in your environment. We apologize for the problem.
2018-09-04T11:27:10.524267525Z --------------------------------------------------------------------------
2018-09-04T11:27:10.526544855Z error: You must be logged in to the server (Unauthorized)
2018-09-04T11:27:26.308143618Z W0904 11:27:26.307579 139714005165824 tf_logging.py:125] From /root/code/rev-21feac2bb22438e390434b0557b99a412d8c2828/scripts/tf_cnn_benchmarks/benchmark_cnn.py:1841: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
2018-09-04T11:27:26.308194395Z Instructions for updating:
2018-09-04T11:27:26.308201763Z Please switch to tf.train.MonitoredTrainingSession
2018-09-04T11:27:27.616498875Z 2018-09-04 11:27:27.615954: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-09-04T11:27:29.624772611Z 2018-09-04 11:27:29.624450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
2018-09-04T11:27:29.624821781Z name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
2018-09-04T11:27:29.62482965Z pciBusID: 0000:04:00.0
2018-09-04T11:27:29.624835355Z totalMemory: 11.90GiB freeMemory: 11.73GiB
2018-09-04T11:27:29.624841024Z 2018-09-04 11:27:29.624515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-04T11:27:30.21913606Z 2018-09-04 11:27:30.218790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-04T11:27:30.219193257Z 2018-09-04 11:27:30.218850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2018-09-04T11:27:30.219218242Z 2018-09-04 11:27:30.218860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2018-09-04T11:27:30.219924452Z 2018-09-04 11:27:30.219699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11347 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:04:00.0, compute capability: 6.1)
2018-09-04T11:27:31.565575798Z I0904 11:27:31.564569 139714005165824 tf_logging.py:115] Running local_init_op.
2018-09-04T11:27:31.692932618Z I0904 11:27:31.692514 139714005165824 tf_logging.py:115] Done running local_init_op.
2018-09-04T11:27:36.602705156Z I0904 11:27:36.602036 139714005165824 tf_logging.py:115] Starting standard services.
2018-09-04T11:27:36.721213629Z I0904 11:27:36.720841 139714005165824 tf_logging.py:115] Starting queue runners.
2018-09-04T11:27:36.722333227Z I0904 11:27:36.721974 139692485875456 tf_logging.py:159] global_step/sec: 0
2018-09-04T11:27:43.669449916Z
2018-09-04T11:27:43.669539411Z mj-mpijob-launcher-xz82w:2416:2489 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
2018-09-04T11:27:43.669548044Z mj-mpijob-launcher-xz82w:2416:2489 [0] INFO Using internal Network Socket
2018-09-04T11:27:43.669554081Z mj-mpijob-launcher-xz82w:2416:2489 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
2018-09-04T11:27:43.669559954Z mj-mpijob-launcher-xz82w:2416:2489 [0] INFO NET : Using interface eth0:10.99.54.237<0>
2018-09-04T11:27:43.669567154Z mj-mpijob-launcher-xz82w:2416:2489 [0] INFO NET/Socket : 1 interfaces found
2018-09-04T11:27:43.669572801Z NCCL version 2.2.13+cuda9.0
2018-09-04T11:27:43.797949598Z Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/3DQJJQNGOQ5SUT5E4T6KWHGEQ3:/var/lib/docker/overlay2/l/P7U72T3AKJK7DV4DGGM4YZIUVX:/var/lib/docker/overlay2/l/MGVIIJEFQXYFYEZNEVDSJG5BL3:/var/lib/docker/overlay2/l/NAF525P6GCKRRKNEPUXR7XHPJC:/var/lib/docker/overlay2/l/YETKJSJ7GLMZHM4T4ATHVDYCXG:/var/lib/docker/overlay2/l/GSS6XIGWCZRHWYV6AR5IVRUVVT:/var/lib/docker/overlay2/l/YVLCL4VHUFNG5DXGDKN6MNINC4:/var/lib/docker/overlay2/l/A55OWZ3YPCZDKJRGEGAFZTA7IH:/var/lib/docker/overlay2/l/UJMSIWBDIAXLU'
2018-09-04T11:27:43.798007017Z Unexpected end of /proc/mounts line `NW3OPRXLDLTD2:/var/lib/docker/overlay2/l/EFB7NMYJ7JEPH3DCEUW4NBEDKJ:/var/lib/docker/overlay2/l/YMNO7EHKMS5UJTMNJ3FXCJMGQS:/var/lib/docker/overlay2/l/IM6MALQ2QZGBNSDC7UJP6GNBXD:/var/lib/docker/overlay2/l/WFW5VJGFEOVDZSY4LIILLI2BWD:/var/lib/docker/overlay2/l/FESQO6ZKWPCMDRKHTEITMDNHKS:/var/lib/docker/overlay2/l/YDHECWKEPNNDL6HTNYXHMDSWEW:/var/lib/docker/overlay2/l/2WHRJRNCC7LJUR354QCU3SI23T:/var/lib/docker/overlay2/l/NCJA3SSJCZJB7LZR6TRKJO4FMS:/var/lib/docker/overlay2/l/GHXFKHV4LHRJYEFEQQGZUU6YXI:/var/lib/do'
2018-09-04T11:27:43.798017944Z Unexpected end of /proc/mounts line `cker/overlay2/l/TCWFJTGXV34SYUWR5TTYWWXDPJ:/var/lib/docker/overlay2/l/XKK2K4GGMUEBZFAHTPQZ6OENQW:/var/lib/docker/overlay2/l/NY7TL42WWXUBRY6AFGI77AE4VU:/var/lib/docker/overlay2/l/K3DIH7RMJHICF3BMESDD5Q3VUM:/var/lib/docker/overlay2/l/TWFOUM6G67RQC2GNJYGHWHU4AE:/var/lib/docker/overlay2/l/NE7PNUVRP7ZY2LHKL77N7BLTUT:/var/lib/docker/overlay2/l/3BXLZ3GSRLQMJAHKGGTGC4JN7B,upperdir=/var/lib/docker/overlay2/318812f70e10fab578c788396d70e4c88155a865521bceb4a0909f084f2fe35b/diff,workdir=/var/lib/docker/overlay2/318812f'
2018-09-04T11:27:43.883378556Z mj-mpijob-launcher-xz82w:2416:2489 [0] INFO comm 0x7f10c42f2f30 rank 0 nranks 1
2018-09-04T11:27:43.884257712Z mj-mpijob-launcher-xz82w:2416:2489 [0] INFO Using 256 threads
2018-09-04T11:27:43.884291681Z mj-mpijob-launcher-xz82w:2416:2489 [0] INFO Min Comp Cap 6
2018-09-04T11:27:43.884300231Z mj-mpijob-launcher-xz82w:2416:2489 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072
2018-09-04T11:28:45.221212005Z TensorFlow:  1.10
2018-09-04T11:28:45.221275224Z Model:       resnet101
2018-09-04T11:28:45.22128385Z Dataset:     imagenet (synthetic)
2018-09-04T11:28:45.221290144Z Mode:        training
2018-09-04T11:28:45.22129589Z SingleSess:  False
2018-09-04T11:28:45.221301384Z Batch size:  32 global
2018-09-04T11:28:45.221307Z              32.0 per device
2018-09-04T11:28:45.221312867Z Num batches: 100
2018-09-04T11:28:45.221318487Z Num epochs:  0.00
2018-09-04T11:28:45.221324047Z Devices:     ['horovod/gpu:0']
2018-09-04T11:28:45.221329636Z Data format: NCHW
2018-09-04T11:28:45.22133513Z Optimizer:   sgd
2018-09-04T11:28:45.22134058Z Variables:   horovod
2018-09-04T11:28:45.221346313Z ==========
2018-09-04T11:28:45.22135422Z Generating model
2018-09-04T11:28:45.221359766Z Running warm up
2018-09-04T11:28:45.221365259Z Done warm up
2018-09-04T11:28:45.221370693Z Step	Img/sec	total_loss
2018-09-04T11:28:45.221376873Z 1	images/sec: 128.2 +/- 0.0 (jitter = 0.0)	9.147
2018-09-04T11:28:45.221383283Z 10	images/sec: 66.4 +/- 11.1 (jitter = 1.9)	9.185
2018-09-04T11:28:45.221389586Z 20	images/sec: 64.4 +/- 7.9 (jitter = 2.1)	9.002
2018-09-04T11:28:45.221395542Z 30	images/sec: 65.1 +/- 6.4 (jitter = 2.0)	8.931
2018-09-04T11:28:45.221401529Z 40	images/sec: 65.2 +/- 5.6 (jitter = 1.9)	9.018
2018-09-04T11:28:45.221408149Z 50	images/sec: 65.3 +/- 5.0 (jitter = 2.0)	9.232
2018-09-04T11:28:45.221413935Z 60	images/sec: 64.5 +/- 4.5 (jitter = 1.8)	9.275
2018-09-04T11:28:45.221419789Z 70	images/sec: 64.0 +/- 4.2 (jitter = 1.7)	9.099
2018-09-04T11:28:45.221425509Z 80	images/sec: 63.8 +/- 3.9 (jitter = 2.0)	8.953
2018-09-04T11:28:45.221433185Z 90	images/sec: 64.0 +/- 3.7 (jitter = 1.8)	9.045
2018-09-04T11:28:45.221439009Z 100	images/sec: 64.1 +/- 3.5 (jitter = 1.7)	9.146
2018-09-04T11:28:45.221445882Z ----------------------------------------------------------------
2018-09-04T11:28:45.221451702Z total images/sec: 63.73
2018-09-04T11:28:45.221541691Z ----------------------------------------------------------------

Are there any concerns about replacing sshd with `kubectl exec` in MPI interaction?

When I read the design of MPI-operator, I noticed:

The launcher pod invokes mpirun and communicates with worker pods through MPI. The initial handshake is done through kubectl exec instead of SSH. 

Are there any performance concerns about connecting through the API server? And is it possible to use sshd as an alternative?
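For reference, the launcher logs elsewhere in this thread show what the handshake looks like in practice: mpirun's rsh agent is pointed at /etc/mpi/kubexec.sh, which forwards the orted startup command into each worker pod. A minimal sketch of that idea as a shell function (an assumption for illustration, not the actual script shipped by mpi-operator):

```shell
# Sketch of the kubexec idea: mpirun invokes its rsh agent with the target
# pod name followed by the orted command line, and the agent forwards that
# command into the pod via `kubectl exec`.
kubexec() {
  pod=$1
  shift
  # KUBECTL is overridable here purely for illustration; the logs above show
  # the real binary at /opt/kube/kubectl.
  "${KUBECTL:-/opt/kube/kubectl}" exec "$pod" -- /bin/sh -c "$*"
}
```

Only this initial daemon launch goes through the API server; once each orted is up, it connects back to mpirun directly over TCP (the orte_hnp_uri in the logs), so steady-state MPI traffic should not depend on the API server.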

Launcher and worker statuses do not correctly indicate the underlying states

Launcher keeps crashing:

mpi-test-2-mpijob-launcher-lv2fx   1/1       CrashLoopBackOff   2          1m
mpi-test-2-mpijob-worker-0         1/1       Running            0          1m
mpi-test-2-mpijob-worker-1         1/1       Running            0          1m

However, from the launcher's log, one of the workers is the one that is failing and being killed (later found to be due to OOM):

-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real noticed that process rank 1 with PID 39 on node mpi-test-2-mpijob-worker-1 exited on signal 9 (Killed).
--------------------------------------------------------------------------

Here's the description of the launcher Job, which does not indicate any abnormal events:

Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  14m   job-controller  Created pod: mpi-test-mpijob-launcher-m8kw6

The above problems could potentially be addressed by #12 (currently mpirun does not give us helpful error messages, so maybe PMIx is a better fit here) and #54 (currently only the launcher pod is shown as failing, but the workers are actually failing). There are other solutions too, but I just wanted to link to the existing issues.

Failed to create MPIJob due to invalid values

Hi. I am new here and trying to run the tensorflow-benchmark using argo-events (https://github.com/argoproj/argo-events). When the workflow custom resource tries to launch the MPIJob, it also includes empty arrays, as below:

  1. Parameters with empty arrays, such as spec.arguments and the templates' inputs/outputs (private information removed):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  clusterName: ""
  creationTimestamp: 2019-01-14T04:52:00Z
  generateName: sch201901141352001-mwf20190036-
  generation: 1
  labels:
    pipeline-id: MWF20190036
    schedule-id: SCH201901141352001
  name: sch201901141352001-mwf20190036-cq66b
spec:
  arguments: {}
  entrypoint: workflow-steps
  templates:
  - inputs: {}
    metadata: {}
    name: workflow-steps
    outputs: {}
    steps:
    - - arguments: {}
        name: job20190038
        template: job20190038
  - activeDeadlineSeconds: 36000
    inputs: {}
    metadata:
      labels:
        job-id: JOB20190038
        pipeline-id: MWF20190036
        schedule-id: SCH201901141352001
    name: job20190038
    outputs: {}
    resource:
      action: create
      failureCondition: status.launcherStatus == Failed
      manifest: |
        apiVersion: kubeflow.org/v1alpha1
        kind: MPIJob
        metadata:
          annotations: null
          clusterName: null
          creationTimestamp: null
          deletionGracePeriodSeconds: null
          deletionTimestamp: null
          finalizers: null
          generateName: null
          generation: null
          initializers: null
          labels: null
          name: sch201901141352001-mwf20190036-job20190038-mpijob
          namespace: null
          ownerReferences: null
          resourceVersion: null
          selfLink: null
          uid: null
        spec:
          backoffLimit: null
          gpus: null
          launcherOnMaster: null
          replicas: 2
          template:
            metadata: null
            spec:
              activeDeadlineSeconds: null
              affinity: null
              containers:
              - args:
                - mpirun
                - --bind-to
                - none
                - --map-by
                - slot
                - -x
                - NCCL_DEBUG=INFO
                - -x
                - LD_LIBRARY_PATH
                - -x
                - HOROVOD_MPI_THREADS_DISABLE=1
                - -x
                - PATH
                - -x
                - NCCL_SOCKET_IFNAME=bond0
                - -mca
                - pml
                - ob1
                - python
                - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
                - --model=resnet101
                - --batch_size=64
                - --variable_update=horovod
                command: null
                .....
  2. Error log received (private information removed):
    message: |- The MPIJob "sch201901141352001-mwf20190036-job20190038-mpijob" is invalid: []: Invalid value: map[string]interface {}{"apiVersion":"kubeflow.org/v1alpha1", "kind":"MPIJob", "metadata":map[string]interface {}{"finalizers":interface {}(nil), "name":"sch201901141352001-mwf20190036-job20190038-mpijob", "ownerReferences":interface {}(nil), "selfLink":"", "uid":"21944a2d-17b8-11e9-b830-9c713a20a9b0", "clusterName":"", "labels":interface {}(nil), "resourceVersion":interface {}(nil), "annotations":interface {}(nil), "generateName":interface {}(nil), "creationTimestamp":"2019-01-14T04:52:01Z", "generation":1}, "spec":map[string]interface {}{"backoffLimit":interface {}(nil), "gpus":interface {}(nil), "launcherOnMaster":interface {}(nil), "replicas":2, "template":map[string]interface {}{"metadata":interface {}(nil), "spec":map[string]interface {}{"initContainers":interface {}(nil), "nodeName":interface {}(nil), "serviceAccountName":interface {}(nil), "subdomain":interface {}(nil), "terminationGracePeriodSeconds":interface {}(nil), "tolerations":interface {}(nil), "volumes":interface {}(nil), "activeDeadlineSeconds":interface {}(nil), "dnsPolicy":interface {}(nil), "hostname":interface {}(nil), "priorityClassName":interface {}(nil), "schedulerName":interface {}(nil), "affinity":interface {}(nil), "hostAliases":interface {}(nil),"nodeSelector":interface {}(nil), "restartPolicy":interface {}(nil), "containers":[]interface {}{map[string]interface {}{"volumeMounts":[]interface {}{map[string]interface {}{"mountPropagation":interface {}(nil), .... 
, "securityContext":interface {}(nil), "lifecycle":interface {}(nil), "ports":interface {}(nil), "terminationMessagePath":interface {}(nil)}}, "dnsConfig":interface {}(nil), "priority":interface {}(nil), "securityContext":interface {}(nil), "serviceAccount":interface {}(nil)}}}, "status":interface {}(nil)}: validation failure list: must validate one and only one schema (oneOf) name: sch201901141352001-mwf20190036-cq66b[0].job20190038 phase: Failed startedAt: 2019-01-14T04:52:00Z templateName: job20190038 type: Pod phase: Failed

Is there any way to work around the error for invalid values?
Many Thanks!
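One possible workaround (an assumption based on the validation failure above, which trips on the explicitly serialized nulls and empty maps): generate the embedded MPIJob manifest with all null/empty optional fields omitted rather than written out. A minimal sketch, keeping only the populated fields from the manifest above (the container name and image are assumptions, since those fields were elided in the original):

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: sch201901141352001-mwf20190036-job20190038-mpijob
spec:
  replicas: 2
  template:
    spec:
      containers:
      - name: tensorflow-benchmarks            # assumed container name
        image: mpioperator/tensorflow-benchmarks:latest  # assumed image
        args:
        - mpirun
        - --bind-to
        - none
        - --map-by
        - slot
        - -x
        - NCCL_DEBUG=INFO
        - -x
        - LD_LIBRARY_PATH
        - -x
        - HOROVOD_MPI_THREADS_DISABLE=1
        - -x
        - PATH
        - -x
        - NCCL_SOCKET_IFNAME=bond0
        - -mca
        - pml
        - ob1
        - python
        - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
        - --model=resnet101
        - --batch_size=64
        - --variable_update=horovod
```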

When restartPolicy is set to Never, the launcher pod goes into Error and a new one is started?

[root@examples]# kubectl get pods -o wide
NAME                            READY     STATUS    RESTARTS   AGE       IP              
mpijob-horovod-launcher-6sv6w   1/1       Running   0          3m        172.16.203.152   
mpijob-horovod-worker-0         1/1       Running   0          6m        172.16.203.151   
mpijob-horovod-worker-1         1/1       Running   0          6m        172.16.216.236  

and logs from the launcher pod show that there is a network error between the launcher and worker-1:

[root@offlinetraining-0001 examples]# kubectl logs mpijob-horovod-launcher-6sv6w 
+ POD_NAME=mpijob-horovod-worker-0
+ shift
+ /opt/kube/kubectl exec mpijob-horovod-worker-0 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "1679228928" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "mpijob-horovod-launcher-6sv6w,mpijob-horovod-worker-0,mpijob-horovod-worker-1@0(3)" -mca orte_hnp_uri "1679228928.0;tcp://172.16.203.152:42479" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=mpijob-horovod-worker-1
+ shift
+ /opt/kube/kubectl exec mpijob-horovod-worker-1 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "1679228928" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "mpijob-horovod-launcher-6sv6w,mpijob-horovod-worker-0,mpijob-horovod-worker-1@0(3)" -mca orte_hnp_uri "1679228928.0;tcp://172.16.203.152:42479" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
Unable to connect to the server: dial tcp 10.96.0.1:443: i/o timeout
Unable to connect to the server: dial tcp 10.96.0.1:443: i/o timeout
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   mpijob-horovod-launcher-6sv6w
  target node:  mpijob-horovod-worker-1

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------

After that I saw the launcher pod go into Error and a new pod start.

How can I make the launcher pod stay in Error and not start a new pod when an error occurs?

[root@examples]# kubectl get pods -o wide
NAME                      READY     STATUS    RESTARTS   AGE       IP               
mpijob-horovod-worker-0   1/1       Running   0          7m        172.16.203.151   
mpijob-horovod-worker-1   1/1       Running   0          7m        172.16.216.236   
[root@examples]# 
[root@examples]# kubectl get mpijob
NAME             AGE
mpijob-horovod   11m
[root@examples]# kubectl get pods -o wide
NAME                            READY     STATUS    RESTARTS   AGE       IP               
mpijob-horovod-launcher-2fjrw   1/1       Running   0          1m        172.16.203.153   
mpijob-horovod-worker-0         1/1       Running   0          7m        172.16.203.151   
mpijob-horovod-worker-1         1/1       Running   0          7m        172.16.216.236   

[root@examples]# kubectl get pods -o wide --show-all
NAME                            READY     STATUS      RESTARTS   AGE       IP               NODE
mpijob-horovod-launcher-2fjrw   0/1       Error       0          11m       172.16.203.153 
mpijob-horovod-launcher-5vr5c   0/1       Error       0          23m       172.16.203.148
mpijob-horovod-launcher-6sv6w   0/1       Error       0          15m       172.16.203.152
mpijob-horovod-launcher-cgz8d   0/1       Error       0          17m       172.16.203.150 
mpijob-horovod-launcher-f2597   1/1       Running     0          2m        172.16.203.155
mpijob-horovod-launcher-jfkkf   0/1       Error       0          7m        172.16.203.154   
mpijob-horovod-launcher-s29qt   0/1       Error       0          20m       172.16.203.149 
mpijob-horovod-worker-0         1/1       Running     0          17m       172.16.203.151   
mpijob-horovod-worker-1         1/1       Running     0          17m       172.16.216.236  
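One way to keep the launcher from being recreated after a failure (a sketch only, assuming the v1alpha1 API honors a Job-style `backoffLimit` as other examples in this thread use) is to set it to 0 alongside `restartPolicy: Never`:

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: mpijob-horovod
spec:
  backoffLimit: 0        # no launcher retries after a failure
  replicas: 2
  template:
    spec:
      restartPolicy: Never
      # ... containers as before ...
```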

allow specifying resources explicitly

The current code only allows specifying the total number of GPUs. It's probably useful to let users specify the number of replicas and the GPUs per replica explicitly. This is defined in the proposal but not yet implemented.
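A possible spec shape for this (field names hypothetical, not the implemented API):

```yaml
spec:
  replicas: 4          # number of worker replicas
  gpusPerReplica: 2    # GPUs assigned to each replica
```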

Additional tags for docker images on Dockerhub

Currently, Docker images are built and pushed automatically to mpioperator on Dockerhub. However, the automatically built images have no tag other than "latest". We could tag them with Git commit hashes and then have an official release tag. Otherwise, users will end up maintaining their own images and releases, which is unsustainable in the long term.
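A minimal sketch of such a tagging scheme (image name from this thread; the commit value is hard-coded here for illustration, where CI would derive it from `git rev-parse`):

```shell
# Tag images with the Git commit hash in addition to "latest" (sketch).
IMAGE=mpioperator/mpi-operator
COMMIT=abc1234   # in CI: COMMIT=$(git rev-parse --short HEAD)
echo "docker tag ${IMAGE}:latest ${IMAGE}:${COMMIT}"
```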

how can I launch a multi-node horovod training job with mpi-operator?

after I ran kubectl create -f deploy/
I want to run a test training job.

I have just 2 nodes: one with 1 GPU and the other with 2 GPUs.
I tried examples/tensorflow-benchmarks-imagenet.yaml with a modification, but I get an error.
I changed

spec:
  gpus: 32

to

spec:
  gpus: 2

and error is

# kubectl create -f tensorflow-benchmarks-imagenet.yaml 
The MPIJob "tensorflow-benchmarks-imagenet" is invalid: []: Invalid value: map[string]interface {}{"metadata":map[string]interface {}{"uid":"7eda61e2-a2c7-11e8-80c4-fa163eb0739b", "selfLink":"", "clusterName":"", "name":"tensorflow-benchmarks-imagenet", "namespace":"default", "creationTimestamp":"2018-08-18T09:17:14Z"}, "spec":map[string]interface {}{"template":map[string]interface {}{"spec":map[string]interface {}{"containers":[]interface {}{map[string]interface {}{"name":"tensorflow-benchmarks", "volumeMounts":[]interface {}{map[string]interface {}{"mountPath":"/efs", "name":"efs"}, map[string]interface {}{"name":"models", "mountPath":"/models"}}, "command":[]interface {}{"mpirun", "python", "scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", "--data_format=NCHW", "--batch_size=256", "--model=resnet50", "--optimizer=momentum", "--variable_update=horovod", "--nodistortions", "--gradient_repacking=8", "--num_epochs=90", "--weight_decay=1e-4", "--data_dir=/efs/imagenet/train", "--use_fp16", "--train_dir=/models/resnet50"}, "image":"mpioperator/tensorflow-benchmarks:latest"}}, "volumes":[]interface {}{map[string]interface {}{"name":"efs", "nfs":map[string]interface {}{"server":"fs-ab134502.efs.us-west-2.amazonaws.com", "path":"/", "readOnly":true}}, map[string]interface {}{"emptyDir":map[string]interface {}{}, "name":"models"}}}}, "gpus":2}, "apiVersion":"kubeflow.org/v1alpha1", "kind":"MPIJob"}: validation failure list:
must validate one and only one schema (oneOf)
must validate one and only one schema (oneOf)
spec.gpus in body should be one of [1 2 4]

how can I run examples/tensorflow-benchmarks-imagenet.yaml?

Additionally, I want to test a Horovod example like tensorflow_mnist.py. Can I define the YAML like this?

apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: mpijob-horovod
spec:
  replicas: 2
  template:
    spec:
      containers:
      - image: uber/horovod:0.13.10-tf1.9.0-torch0.4.0-py3.5
        command:
          - mpirun 
          - python
          - tensorflow_mnist.py
        imagePullPolicy: IfNotPresent
        name: mpijob-horovod
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /examples/MNIST-data-0
          name: datavolume
      restartPolicy: Never
      volumes:
      - name: datavolume
        hostPath:
          path: /sfs/data/mnist
          type: DirectoryOrCreate

Failed to create MPI worker statefulset when restartPolicy is Never

When deploying the following mpijob:

apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: mpi-dist-mpijob
  labels:
    app: mpijob
spec:
  LauncherOnMaster: true
  replicas: 2
  template:
    metadata:
      name: mpi-dist-mpijob
      labels:
        app: mpijob
        chart: mpijob-0.2.0
        release: mpi-dist
    spec:
      restartPolicy: Never
      containers:
      - image: "uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5"
        name: mpi
        imagePullPolicy:
        workingDir: /root
        command:
        - "sh"
        - "-c"
        - "mpirun python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64     --variable_update horovod --train_dir=/training_logs --summary_verbosity=3"
        resources:
          requests:
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"

It can't create the MPI workers, reported the error:

error syncing 'default/mpi-dist-mpijob': StatefulSet.apps "mpi-dist-mpijob-worker" is invalid: spec.template.spec.restartPolicy: Unsupported value: "Never": supported values: "Always"

This is caused by the launcher Job and the worker StatefulSet sharing the same pod spec, while StatefulSet only supports restartPolicy: Always.

MPI job needs rich status information

Could we add startTime and completionTime fields to the status? launcherStatus is helpful, but it cannot indicate the job duration.

{
  "startTime": "2019-02-08T22:19:09Z",
  "launcherStatus": "Succeeded",
  "completionTime": "2019-02-08T22:19:47Z"
}

I can open a PR for this issue if you think it's reasonable.
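For reference, the duration implied by the two timestamps above can be computed directly (a sketch assuming GNU date for ISO 8601 parsing):

```shell
# Compute job duration from the proposed status fields (GNU date).
START="2019-02-08T22:19:09Z"
END="2019-02-08T22:19:47Z"
DURATION=$(( $(date -ud "$END" +%s) - $(date -ud "$START" +%s) ))
echo "${DURATION}s"   # prints: 38s
```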

time consumed by the mpirun invocation

hi~
I ran a test with 2 nodes, each with 8 GPUs,
and printed the start time in the python script tf_cnn_benchmarks.py.

case A:

apiVersion: kubeflow.org/v1alpha1
kind: MPIJob
metadata:
  name: mpijob-horovod-2
spec:
  replicas: 2
  backoffLimit: 0
  template:
    spec:
      containers:
      - image: uber/horovod:0.15.2-tf1.12.0-torch1.0.0-py3.5 
        args:
         - cd /code &&  mpirun -x NCCL_DEBUG=INFO python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod
        command:
         - /bin/bash
         - -c
        imagePullPolicy: Always
        name: horovod
        resources:
          limits:
            cpu: 64
            nvidia.com/gpu: 8
            memory: 96000Mi
        volumeMounts:
        - mountPath: /code
          name: codevolume
      restartPolicy: Never
      volumes:
      - hostPath:
          path: /opt/tf_cnn_benchmarks
          type: DirectoryOrCreate
        name: codevolume

output:

+ POD_NAME=mpijob-horovod-2-worker-0
+ shift
+ /opt/kube/kubectl exec mpijob-horovod-2-worker-0 -- /bin/sh -c+      PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "379584512" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "mpijob-horovod-[1:2]-launcher-ncdwv,mpijob-horovod-[1:2]-worker-0,mpijob-horovod-[1:2]-worker-1@0(3)" -mca orte_hnp_uri "379584512.0;tcp://192.168.60.207:59751" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
POD_NAME=mpijob-horovod-2-worker-1
+ shift
+ /opt/kube/kubectl exec mpijob-horovod-2-worker-1 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "379584512" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "mpijob-horovod-[1:2]-launcher-ncdwv,mpijob-horovod-[1:2]-worker-0,mpijob-horovod-[1:2]-worker-1@0(3)" -mca orte_hnp_uri "379584512.0;tcp://192.168.60.207:59751" -mca plm "rsh" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated"
begin to training at 2019-03-29 17:19:00

and the pod time

    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 29 Mar 2019 17:17:18 +0800
      Finished:     Fri, 29 Mar 2019 17:20:05 +0800

case B:
run tf_cnn_benchmarks.py with pod host network.
the command in the worker pod is

/usr/sbin/sshd -p 12345; sleep infinity

and the command in the launcher pod is (IP1 and IP2 are the host IPs where the worker pods are running):

 - cd /code  && mpirun -np 16 -H IP1:8,IP2:8 -bind-to none -map-by slot -x NCCL_SOCKET_IFNAME=ib0 -mca btl_tcp_if_exclude docker0,tunl0,lo -x NCCL_DEBUG=ERROR -x LD_LIBRARY_PATH -mca plm_rsh_args "-p 12345 -vvvv" python tf_cnn_benchmarks.py --model resnet101 --batch_size 64 --variable_update horovod

output:

begin to training at 2019-03-29 16:46:29 
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 29 Mar 2019 16:46:27 +0800
      Finished:     Fri, 29 Mar 2019 16:47:41 +0800   (1 min 14 s)

so, the time spent in the mpirun invocation:
case A (mpijob): 1 min 42 s
case B (host network): 2 s

@rongou
do you have any suggestion about this difference?
I guess it is caused by the overhead of the container network?

mpijob restarts a few hours after launcher completed.

My job definition is like this.

apiVersion: "kubeflow.org/v1alpha1"
kind: "MPIJob"
metadata:
  name: {{ job_name }}
  labels:
    exp_name: {{ exp_name }}
    user: {{ user_name }}

spec:
  backoffLimit: 0
  ......
      restartPolicy: Never

After the launcher job finished, whether Failed or Succeeded, all worker pods terminated normally. However, after around three hours, the whole job automatically restarts. Is this expected? Should I delete the mpijob after each run?

Continuously building Docker images

I notice mpioperator/mpi-operator:latest is built from rongou/mpi-operator, which sometimes lags behind.

  1. Could you release the latest mpi-operator image? I'd like to use features from the latest version. I can use a customized image for now, so this is a low-priority request.

  2. Does the mpi-operator project have CI and CD support? It would be great to have a latest image consistent with upstream repo changes. I would also suggest building a stable tag along with each Kubeflow release.

Reference:
https://hub.docker.com/r/mpioperator/mpi-operator

allow more flexible rbac of mpi jobs

Right now we create a new Role/ServiceAccount/RoleBinding for every MPIJob. We should give cluster admins the option to reuse existing RBAC resources. This needs a code change plus new params in the ksonnet prototype.

From @everpeace:

Sometimes, cluster admins want to manage RBAC-related resources on their own, or to use an existing service account. In that case, admins don't want the prototype to create them. So:

How about introducing rbacCreate and serviceAccountName parameters? This is a popular pattern in Helm charts. If we do this, we need explicit guidance so that admins can refer to the required roles when they choose rbacCreate=false.
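The suggested parameters might look like this in a values/params file (parameter names as proposed above; the service account name is a hypothetical placeholder):

```yaml
rbacCreate: false                # admin manages Role/RoleBinding themselves
serviceAccountName: my-mpi-sa    # existing ServiceAccount to reuse
```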

cannot list poddisruptionbudgets.policy at the cluster scope

Hi, I saw the following in the error logs from the mpi-operator-5cdb797cd-btdg8 pod. What does this mean?

k8s.io/client-go/informers/factory.go:132: Failed to list *v1beta1.PodDisruptionBudget: poddisruptionbudgets.policy is forbidden: User "system:serviceaccount:mpi-operator:mpi-operator" cannot list poddisruptionbudgets.policy at the cluster scope

how does the shift command work in the kubexec configmap

I am running a comparison test on a 3-node × 4-GPU setup that involves both Open MPI and MVAPICH. While the default Open MPI setup runs fine, I have trouble getting MVAPICH to work with mpi-operator. I believe the problem is that MVAPICH fails to recognize the hostfile, since the log shows only 3 shifts happening (versus 12 for Open MPI), one for each pod.
Original hostfile:
pod1 slots=4
pod2 slots=4
pod3 slots=4
I then updated the hostfile to make it compatible with MVAPICH as below, but still only 3 shifts happen and the process just hangs.
Updated hostfile
pod1:4
pod2:4
pod3:4
I even tried the hostfile below, but no luck.
Updated hostfile again:
pod1
pod1
pod1
pod1
pod2
pod2
pod2
pod2
pod3
pod3
pod3
pod3

Here is the mpiexec command I am using:
mpiexec -n 12 -launcher rsh -launcher-exec /etc/mpi/kubexec.sh -f /etc/mpi/hostfile -env <key> <value> ./script.sh
At this point, I am really wondering how this shift logic works under the hood. Does it take the hostname from the hostfile, and how does that happen? Can you give some suggestions here? Thanks. @rongou
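Based on the launcher log shown earlier in this document (`POD_NAME=$1`, a single `shift`, then `kubectl exec`), /etc/mpi/kubexec.sh essentially consumes the hostname argument and forwards the rest of the command line into that pod, so exactly one shift happens per remote launch. The snippet below simulates one invocation (pod name and command are illustrative):

```shell
# Simulated invocation of kubexec.sh: "$1" is the hostname from the hostfile.
set -- mpijob-worker-0 /usr/local/bin/orted -mca ess env
POD_NAME=$1
shift   # drop the hostname; "$@" is now the command to run inside the pod
echo "/opt/kube/kubectl exec $POD_NAME -- /bin/sh -c \"$*\""
# prints: /opt/kube/kubectl exec mpijob-worker-0 -- /bin/sh -c "/usr/local/bin/orted -mca ess env"
```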

slots in hostfile need to be configurable

The slots in the mpi-operator's generated hostfile are not configurable. The current implementation is https://github.com/kubeflow/mpi-operator/blob/master/pkg/controllers/mpi_job_controller.go#L753-#L761, and the sample output is

cat /etc/mpi/hostfile
caffe-wd-mpijob-worker-0 slots=8 max_slots=8
caffe-wd-mpijob-worker-1 slots=8 max_slots=8

Please check slots and max_slots in https://www.ibm.com/support/knowledgecenter/en/SSZTET_10.1.0/smpi02/smpi02_host_list.html. I think we have two choices:

  1. Make slots configurable, with the current behavior as the default.

  2. Don't specify slots and max_slots in the hostfile at all, leaving slots as 1 and max_slots unlimited.

@rongou what's your thought?
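For reference, the current hostfile generation linked above is roughly equivalent to the following (worker name prefix and slot count taken from the sample output; a sketch, not the actual Go code):

```shell
# Reproduce the current hostfile output: slots fixed to the per-worker GPU count.
REPLICAS=2
SLOTS=8
HOSTFILE=$(for i in $(seq 0 $((REPLICAS - 1))); do
  echo "caffe-wd-mpijob-worker-${i} slots=${SLOTS} max_slots=${SLOTS}"
done)
echo "$HOSTFILE"
```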
