Fabric for Deep Learning (FfDL, pronounced fiddle) is a Deep Learning Platform offering TensorFlow, Caffe, PyTorch etc. as a Service on Kubernetes
License: Apache License 2.0
The model was trained with this command:
$CLI_CMD train etc/examples/tf-model/manifest-hostmount.yml etc/examples/tf-model
The hostmount learner pod fails as follows:
Starting Training training-PYCOsfJmg
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/usr/lib/python2.7/zipfile.py", line 1541, in <module>
main()
File "/usr/lib/python2.7/zipfile.py", line 1512, in main
with ZipFile(args[1], 'r') as zf:
File "/usr/lib/python2.7/zipfile.py", line 756, in __init__
self.fp = open(file, modeDict[mode])
IOError: [Errno 2] No such file or directory: '/mnt/results/results/training-PYCOsfJmg/_submitted_code/model.zip'
Done load-model
I noticed that setting up FfDL on macOS is more involved than the steps in the documentation: steps like make minikube, eval $(minikube docker-env) or make docker-build-base are omitted, and it would also help to have instructions on how to install the dependencies. In general, the following instructions should work:
# Install Docker
# Approximately https://docs.docker.com/docker-for-mac/install/
# Install Go
brew install go
brew install glide # Alternative: curl https://glide.sh/get | sh
export GOPATH=$HOME/go
echo "export GOPATH=$HOME/go" >> ~/.profile
export PATH=${GOPATH}/bin:$PATH
echo "export PATH=\$GOPATH/bin:\$PATH" >> ~/.profile
source ~/.profile
# Install Minikube
brew cask install virtualbox # or use installer from https://www.virtualbox.org/wiki/Downloads
brew cask install minikube
brew install kubernetes-helm
# Hyperkit
curl -LO https://storage.googleapis.com/minikube/releases/latest/docker-machine-driver-hyperkit \
&& chmod +x docker-machine-driver-hyperkit \
&& sudo mv docker-machine-driver-hyperkit /usr/local/bin/ \
&& sudo chown root:wheel /usr/local/bin/docker-machine-driver-hyperkit \
&& sudo chmod u+s /usr/local/bin/docker-machine-driver-hyperkit
# Potential Alternative:
# brew install --build-from-source hyperkit
# Clone FfDL
mkdir -p $GOPATH/src/github.com/IBM && cd $_
git clone https://github.com/IBM/FfDL.git && cd FfDL
# Build FfDL
export VM_TYPE=minikube
# Modify Makefile and change MINIKUBE_DRIVER from xhyve to hyperkit
sed -i '' -e "s/MINIKUBE_DRIVER ?= xhyve/MINIKUBE_DRIVER ?= hyperkit/g" Makefile
glide install
make build
make minikube
eval $(minikube docker-env)
make docker-build-base
make docker-build
make deploy
Two minor things to add: helm and kubectl need to be installed as well. And a question: should these instructions go into docs/setup-guide.md? Thanks in advance.
PS regarding troubleshooting: we should also consider adding a docs/troubleshooting.md.
For instance, I have seen the following issues on Minikube:
- make deploy dies after "Initializing...": most likely VM_TYPE=minikube was not set.
- make deploy gets stuck at "Installing helm/tiller...": most likely helm is not installed.
- the prometheus pod's status is CrashLoopBackOff: Readiness probe failed: Get http://192.168.7.81:3000/api/health: dial tcp 192.168.7.81:3000: getsockopt: connection refused
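These failure modes could be caught up front. A minimal pre-flight sketch (the checks mirror the two causes above; `check_ffdl_prereqs` is a hypothetical helper name):

```shell
#!/bin/sh
# Hypothetical pre-flight checks mirroring the failure modes above
check_ffdl_prereqs() {
  if [ "${VM_TYPE:-}" != "minikube" ]; then
    echo "warning: VM_TYPE is not 'minikube' (make deploy may die after 'Initializing...')"
  fi
  if ! command -v helm >/dev/null 2>&1; then
    echo "warning: helm not found on PATH (make deploy may hang at 'Installing helm/tiller...')"
  fi
}

check_ffdl_prereqs
```

Running this before make deploy surfaces both misconfigurations early instead of letting the deploy hang.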
Hi, I've installed FfDL in a completely offline Kubernetes cluster. Everything worked fine and I got the training results, but Grafana showed mostly a 'no data points' hint on its panels.
Four dashboards:
And I can't find any useful Prometheus or Grafana logs.
BTW, I've commented out the env variable 'GF_INSTALL_PLUGINS' in the spec of the 'grafana' container in templates/monitoring/prometheus-deployment.yml, since it would otherwise try to download from the Internet.
Any hint on what's missing? Thanks!
Moderate severity
The moment module before 2.19.3 for Node.js is prone to a regular expression denial of service via a crafted date str...
package-lock.json update suggested:
moment ~> 2.19.3
Currently FfDL can only be deployed in the default namespace. We need to add some configuration to the helm chart and LCM provisioning to allow FfDL to be deployed in any namespace.
Create a volumemanage resource for VCK and monitor it for completion before executing the training job workload. For more details, please refer to https://github.com/IBM/FfDL/blob/vck-patch/etc/examples/vck-integration.md
Our intern Andrew has done a great job on updating and fixing some of our UI bugs. We should review and pull this in to master/ffdl branch when we have time.
The Learner requires the ffdl-controller image, which apparently is not available on public DockerHub.
Also, the log-collector images (e.g. ffdl/tensorboard_extract_1.3-py3:latest) are not available on DockerHub either.
@Tomcli based on your investigation, what's needed on this side?
FfDL Version: 0.1
TensorFlow Version: 1.5.0-gpu-py3
I tried to run a job and got the error below:
132 | 1534166953114 | pciBusID: 0000:00:06.0
133 | 1534166953115 | totalMemory: 22.40GiB freeMemory: 22.29GiB
134 | 1534166953116 | 2018-08-13 13:29:02.319076: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:00:06.0, compute capability: 5.2)
135 | 1534166953117 | /usr/local/bin/train.sh: line 18: 36 Killed python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001 --trainingIters 2000 2>&1
136 | 1534166953118 | Training process finished. Exit code: 137
137 | 1534166953119 | Job exited with error code 137
138 | 1534166953120 | Failed: learner_exit_code: 137
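For reference, exit code 137 is 128 + 9, i.e. the process was killed with SIGKILL; on a memory-limited learner that is typically the kernel OOM killer, which matches the "Killed" line in the log above. A quick demonstration:

```shell
# Exit status 137 = 128 + 9 (SIGKILL), the same status the learner reported
sh -c 'kill -9 $$'
echo "exit code: $?"
```

This suggests raising the job's memory limit in the manifest rather than debugging the training script itself.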
We need to set up a proper pipeline, to build and push the latest version of the Docker images on green master builds.
While originally intended only as a quick help to set up FfDL's dependencies, the DIND scripts have become more widely used than anticipated. Thus, we should probably overhaul and parameterize them. Here are a few notes for improvement:
https://github.com/IBM/FfDL/blob/master/bin/dind_scripts/create_user.sh should have an environment variable for the username, so it can be adapted in one place
https://github.com/IBM/FfDL/blob/master/bin/dind_scripts/experimental_master.sh line 2 and 24 should use $USER, so this is not tied to the user being ffdlr
We should make sure $GOPATH is used consistently, so users who do not use ~/go can adapt it
The scripts should probably check more rigorously whether they were successful and potentially create reports, so tracing issues during setup becomes easier.
The current deploy-plugin target has side effects, i.e. it not only pulls and deploys the S3 driver plugin, but it also sets up a hostmount PV, which is not the standard for development and should probably be done in an independent task. [Also: Isn't that an NFS replacement, whereas the rest is about S3?]
The test-submit Makefile target will not work when used against existing S3 buckets, since it uses placeholder names for the buckets and endpoint. We either need to replace those like we replace the username and password, or we need to halt the experimental_master.sh script and tell the user to adapt the etc/examples/tf-model/manifest.yml file before running make test-submit.
Overall, we might want to turn the scripts into a full installer that queries for username, target directory, etc. We should also add macOS support and potentially allow for more customization (e.g. using an existing container registry rather than a local registry).
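A minimal sketch of the parameterization idea for the top of the DIND scripts (FFDL_USER is a hypothetical variable name; the defaults shown are assumptions):

```shell
#!/bin/sh
# Hypothetical top-of-script parameter block for the DIND scripts
FFDL_USER="${FFDL_USER:-${USER:-ffdlr}}"  # replaces the hard-coded 'ffdlr'
GOPATH="${GOPATH:-$HOME/go}"              # honor a pre-existing GOPATH
export FFDL_USER GOPATH

echo "setting up FfDL for user '$FFDL_USER' under '$GOPATH'"
```

With a block like this at the top of create_user.sh and experimental_master.sh, the username and Go workspace can each be changed in one place.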
Hello, @FfDL
We deploy FfDL in a private environment in which S3 and Swift are not available; only NFS external storage is supported. For the model definition file we can use localstack in the current dev environment; for the training data we wish to use NFS.
The following steps are our adaptions for NFS.
We are confirming the above method; however, a new question has already come up.
If two models are submitted at the same time, both using NFS static external storage at the same mount point, is that a problem?
Would you please confirm the above method and question, or provide us with the right solution?
Thanks
Currently it follows the complete build process, including building the images locally.
Some CLI-specific instructions in this document are based on older versions of the command line. They should be updated to reflect the latest release, which uses ibmcloud instead of bx as the binary.
Examples:
Current:
bx plugin repo-add Bluemix https://plugins.ng.bluemix.net
bx plugin install machine-learning -r bluemix
bx target -o ORG -s SPACE
Simpler version: (no need to add the repo - it's the default; no target required because WML service is now resource managed)
ibmcloud plugin install machine-learning
Current:
bx cf create-service pm-20 lite watson-machine-learning
bx cf create-service-key watson-machine-learning WML-Key
New:
ibmcloud resource service-instance-create watson-machine-learning pm-20 lite us-south
ibmcloud resource service-key-create WML-Key Writer --instance-name watson-machine-learning
ibmcloud resource service-key WML-Key
From here on it should only be necessary to replace bx with ibmcloud...
There's a good chance that other documentation is impacted as well. This is the first doc I've tried to follow.
Using the ffdl CLI to get a model returns YAML or JSON, but in both cases a header line like "Getting model xyz" is included, which breaks parsing.
# ffdl show training-II-h6nxmR --json
Getting model 'training-II-h6nxmR'...
{
"Payload": {
"model_id": "training-II-h6nxmR",
...
Both the JSON/YAML and the message are sent to stdout, so the only way to separate them is to grep:
# ffdl show training-II-h6nxmR --json | egrep -v "^Getting model" | jq .Payload.training.training_status
{
"completed": "1539854246722",
"status": "COMPLETED",
"status_description": "COMPLETED",
"submitted": "1539853988330"
}
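A cleaner long-term fix would be for the CLI to write status messages to stderr so stdout stays machine-parseable; simulated here with a hypothetical ffdl_show stand-in:

```shell
# Hypothetical stand-in for the CLI: status on stderr, payload on stdout
ffdl_show() {
  echo "Getting model 'training-II-h6nxmR'..." >&2
  printf '{"model_id": "training-II-h6nxmR"}\n'
}

ffdl_show 2>/dev/null   # stdout is now pure JSON, no grep needed
```

With this split, piping straight into jq would work and the status line would still be visible in an interactive terminal.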
As mentioned in #45, Kubernetes 1.9.4 changes the secret, configMap, downwardAPI and projected volumes to mount read-only. Since the current learner implementation needs write access to its mounted volume, the temporary solution is to set the feature gate ReadOnlyAPIDataVolumes=false. We should change the learner implementation so that it can work with read-only access on the mounted volume.
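One pattern that works with read-only mounts is to copy the mounted files into a writable location at container start, which the learner entrypoint already does for its entrypoint scripts. A generic sketch (the mount path is illustrative):

```shell
#!/bin/sh
# Sketch: stage files from a read-only mount into a writable scratch dir
RO_MOUNT="${RO_MOUNT:-/entrypoint-files}"  # illustrative read-only mount path
WORK_DIR="$(mktemp -d)"                    # always writable

if [ -d "$RO_MOUNT" ]; then
  cp -r "$RO_MOUNT/." "$WORK_DIR/"
fi
echo "writable working copy in $WORK_DIR"
```

All writes then go to the scratch copy, so the read-only API data volumes never need write access.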
Currently we need an image to test on CPU
When the image is built for a Seldon model using s2i, the minimum required memory for Docker is 8G.
Add these details to https://github.com/IBM/FfDL/blob/master/community/FfDL-Seldon/pytorch-model/README.md
Work on a PR for H2O.ai support to be pulled in.
FfDL should be able to handle direct data downloads from standard sites and upload to S3 compatible object storage
Are there methods in FfDL to support automatic hyperparameter tuning, such as for the learning rate?
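If it is not supported, a crude manual alternative would be submitting one training job per candidate value; sketched here with hypothetical per-rate manifest files:

```shell
# Crude manual sweep: one FfDL job per candidate learning rate
# (the manifest-lr-*.yml files are hypothetical, one per value)
for lr in 0.1 0.01 0.001; do
  # echo makes this a dry run; drop it to actually submit each job
  echo "$CLI_CMD train etc/examples/tf-model/manifest-lr-$lr.yml etc/examples/tf-model"
done
```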
The FfDL Elasticsearch sometimes has an overhead issue when creating the emetrics/logline mapping.
[2018-02-15T18:11:21,334][INFO ][o.e.c.m.MetaDataMappingService] [VPF0eed] [dlaas_learner_data/Of58W91xS-6OlsuQEGByzw] update_mapping [logline]
[2018-02-15T18:11:36,852][INFO ][o.e.m.j.JvmGcMonitorService] [VPF0eed] [gc][556] overhead, spent [258ms] collecting in the last [1s]
When Elasticsearch works properly, it should log the following for mapping update/create:
[2018-02-15T17:52:52,598][INFO ][o.e.c.m.MetaDataMappingService] [R7H6R6o] [dlaas_learner_data/d-NMzvwRT_CXgMwHOBinTg] update_mapping [logline]
[2018-02-15T17:54:51,289][INFO ][o.e.c.m.MetaDataMappingService] [R7H6R6o] [dlaas_learner_data/d-NMzvwRT_CXgMwHOBinTg] create_mapping [emetrics]
It used to be that the FfDL CLI command to follow the logs of an ongoing training job $FFDL_CMD logs --follow ${MODEL_ID}
would tail the training logs until completion of the training job. The logs --follow
process returned control only after the training job was complete. This was a useful feature when chaining up commands to create a semi-automated machine learning pipeline, where subsequent commands require the output data of the training job whose logs are being "followed". We have a small example of such a training pipeline in our ART notebook which is currently broken.
That behavior changed with the merge of PR #79. Now the $FFDL_CMD logs --follow ${MODEL_ID}
process terminates after 4 minutes -- usually before the training job is completed -- which causes the failure of subsequent processes that depend on training output data.
- var ctx context.Context
- var cancel context.CancelFunc
- logr.Debugf("follow is %t", req.Follow)
- if req.Follow {
- ctx, cancel = context.WithTimeout(context.Background(), 10*(time.Hour*24))
- } else {
- ctx, cancel = context.WithTimeout(context.Background(), 5*time.Second)
- }
+ ctx, cancel := context.WithTimeout(context.Background(), time.Minute*4)
defer cancel()
Below is a prematurely aborted logs --follow process with the apparent 4-minute timeout:
$FFDL_CMD train manifest.yml model.zip
Deploying model with manifest 'manifest.yml' and model file 'model.zip'...
Model ID: training-uLQ7ZMDmR
OK
$FFDL_CMD logs --follow training-uLQ7ZMDmR && date
Getting model training logs for 'training-uLQ7ZMDmR'...
Training with training/test data at:
DATA_DIR: /mnt/data/training-data-bbe28e19-4fba-4e29-af5f-564f0e0d3f53
MODEL_DIR: /job/model-code
TRAINING_JOB:
TRAINING_COMMAND: pip3 install keras; python3 convolutional_keras.py --data ${DATA_DIR}/mnist.npz
...
Wed Jun 27 19:19:34 UTC 2018: Running training job
...
Train on 54000 samples, validate on 6000 samples
Epoch 1/1
128/54000 [..............................] - ETA: 5:21 - loss: 2.2977 - acc: 0.1562
256/54000 [..............................] - ETA: 4:37 - loss: 2.2591 - acc: 0.1758
...
45184/54000 [========================>.....] - ETA: 39s - loss: 0.3311 - acc: 0.8972
45312/54000 [========================>.....] - ETA: 39s - loss: 0.3305 - acc: 0.8974
45440/54000 [========================>.....] - ETA: 38s - loss: 0.3299 - acc: 0.8976
Wed Jun 27 12:23:35 PDT 2018
Notice the time stamps:
Wed Jun 27 19:19:34 UTC 2018: Running training job
-> date/time training starts
Wed Jun 27 12:23:35 PDT 2018
-> date/time just after the logs job returns (after 4 min)
When running the caffe-model job with FfDL, databroker_s3 always has issues pulling one of the files, s3://mnist_lmdb_data/train/data.mdb:
Using Object Storage account test at http://s3.default.svc.cluster.local
Download start: Mon Feb 12 19:38:10 UTC 2018
Downloading from bucket mnist_lmdb_data to /job/mnist_lmdb_data
Completed 256.0 KiB/68.8 MiB (213.3 KiB/s) with 4 file(s) remaining
Completed 264.0 KiB/68.8 MiB (217.1 KiB/s) with 4 file(s) remaining
download: s3://mnist_lmdb_data/test/lock.mdb to job/mnist_lmdb_data/test/lock.mdb
Completed 264.0 KiB/68.8 MiB (217.1 KiB/s) with 3 file(s) remaining
Completed 520.0 KiB/68.8 MiB (259.9 KiB/s) with 3 file(s) remaining
Completed 776.0 KiB/68.8 MiB (369.5 KiB/s) with 3 file(s) remaining
Completed 1.0 MiB/68.8 MiB (469.2 KiB/s) with 3 file(s) remaining
Completed 1.3 MiB/68.8 MiB (585.1 KiB/s) with 3 file(s) remaining
Completed 1.5 MiB/68.8 MiB (701.2 KiB/s) with 3 file(s) remaining
Completed 1.8 MiB/68.8 MiB (817.2 KiB/s) with 3 file(s) remaining
Completed 2.0 MiB/68.8 MiB (933.2 KiB/s) with 3 file(s) remaining
Completed 2.3 MiB/68.8 MiB (825.7 KiB/s) with 3 file(s) remaining
Completed 2.5 MiB/68.8 MiB (916.3 KiB/s) with 3 file(s) remaining
Completed 2.8 MiB/68.8 MiB (973.5 KiB/s) with 3 file(s) remaining
Completed 3.0 MiB/68.8 MiB (1.0 MiB/s) with 3 file(s) remaining
Completed 3.3 MiB/68.8 MiB (1.1 MiB/s) with 3 file(s) remaining
Completed 3.5 MiB/68.8 MiB (1.2 MiB/s) with 3 file(s) remaining
Completed 3.8 MiB/68.8 MiB (1.3 MiB/s) with 3 file(s) remaining
Completed 4.0 MiB/68.8 MiB (1.3 MiB/s) with 3 file(s) remaining
Completed 4.3 MiB/68.8 MiB (1.4 MiB/s) with 3 file(s) remaining
Completed 4.5 MiB/68.8 MiB (1.5 MiB/s) with 3 file(s) remaining
Completed 4.8 MiB/68.8 MiB (1.4 MiB/s) with 3 file(s) remaining
Completed 5.0 MiB/68.8 MiB (1.5 MiB/s) with 3 file(s) remaining
Completed 5.3 MiB/68.8 MiB (1.6 MiB/s) with 3 file(s) remaining
Completed 5.5 MiB/68.8 MiB (1.7 MiB/s) with 3 file(s) remaining
Completed 5.6 MiB/68.8 MiB (1.7 MiB/s) with 3 file(s) remaining
Completed 5.9 MiB/68.8 MiB (1.6 MiB/s) with 3 file(s) remaining
Completed 6.1 MiB/68.8 MiB (1.7 MiB/s) with 3 file(s) remaining
Completed 6.4 MiB/68.8 MiB (1.7 MiB/s) with 3 file(s) remaining
Completed 6.6 MiB/68.8 MiB (1.8 MiB/s) with 3 file(s) remaining
Completed 6.9 MiB/68.8 MiB (1.9 MiB/s) with 3 file(s) remaining
Completed 7.1 MiB/68.8 MiB (1.8 MiB/s) with 3 file(s) remaining
Completed 7.4 MiB/68.8 MiB (1.9 MiB/s) with 3 file(s) remaining
Completed 7.6 MiB/68.8 MiB (2.0 MiB/s) with 3 file(s) remaining
Completed 7.9 MiB/68.8 MiB (2.0 MiB/s) with 3 file(s) remaining
Completed 8.1 MiB/68.8 MiB (2.1 MiB/s) with 3 file(s) remaining
Completed 8.4 MiB/68.8 MiB (2.1 MiB/s) with 3 file(s) remaining
Completed 8.6 MiB/68.8 MiB (2.2 MiB/s) with 3 file(s) remaining
Completed 8.9 MiB/68.8 MiB (1.9 MiB/s) with 3 file(s) remaining
Completed 8.9 MiB/68.8 MiB (1.8 MiB/s) with 3 file(s) remaining
download: s3://mnist_lmdb_data/train/lock.mdb to job/mnist_lmdb_data/train/lock.mdb
Killed
Killed
download failed: s3://mnist_lmdb_data/train/data.mdb to job/mnist_lmdb_data/train/data.mdb [Errno 12] Cannot allocate memory
I also tried increasing the job memory and using IBM Cloud Object Storage, and I still hit the same issue. So I believe the issue could be
Hi! thanks for open sourcing this big effort!
Would it be possible to compare this solution to Kubeflow, which contains Seldon-Core, and provide an example?
And finally, if you have some time, a comparison to PipelineAI?
Investigate what changes need to be made to migrate FfDL to the latest minikube version and k8s v1.10.0.
We should add a note to the README that 4GB+ RAM is the recommended configuration for Minikube.
Using Seldon as an optional model deployment platform for FfDL.
Work in progress
Use a tagged image, like ":master-36"; that tagging should be part of CI/CD.
@FfDL
I have read the paper about FfDL: http://learningsys.org/nips17/assets/papers/paper_29.pdf. In the FfDL architecture, the Kubernetes Cluster Manager exists as a Kubernetes controller. Are there any detailed resources about this service? Could you also tell us the future release plan?
We think it would be better if the development schedule were available.
I am getting this error while running the make docker-build command in a Mac environment.
cd build/grpc-health-checker && make install-deps build-x86-64
glide -q install
[WARN] The name listed in the config file (github.ibm.com/deep-learning-platform/grpc-health-checker) does not match the current location (github.com/IBM/FfDL/etc/dlaas-service-base/build/grpc-health-checker)
rm -rf bin/
CGO_ENABLED=0 GOOS=linux go build -ldflags "-s" -a -installsuffix cgo -o bin/grpc-health-checker
docker build -q -f Dockerfile.ubuntu -t dlaas-service-base:ubuntu16.04 .
sha256:6975b033728017afc0f9a2dd9978e76331411d323fef07c9b8600959cccdbf4a
docker tag dlaas-service-base:ubuntu16.04 docker.io/ffdl/dlaas-service-base:ubuntu16.04
docker build -q -f Dockerfile.alpine -t dlaas-service-base:alpine3.3 .
sha256:9e367f913da06d9758a26936bff18c23587e5eebfd424f9793a0cd488742e75b
docker tag dlaas-service-base:alpine3.3 docker.io/ffdl/dlaas-service-base:alpine3.3
(full_img_name=ffdl-metrics; \
cd ./metrics/ && (if [ "minikube" = "minikube" ]; then eval $(minikube docker-env); fi; \
docker build -q -t docker.io/ffdl/$full_img_name .))
Sending build context to Docker daemon 11.26MB
Step 1/6 : FROM dlaas-service-base:ubuntu16.04
pull access denied for dlaas-service-base, repository does not exist or may require 'docker login'
make[1]: *** [.docker-build] Error 1
make: *** [docker-build-metrics] Error 2
Raising this issue as per my conversation with Tommy earlier today.
minikube version: v0.25.2
kubectl version:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T12:22:21Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"", Minor:"", GitVersion:"v1.9.4", GitCommit:"bee2d1505c4fe820744d26d41ecd3fdd4a3d6546", GitTreeState:"clean", BuildDate:"2018-03-21T21:48:36Z", GoVersion:"go1.9.1", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes: 1.9.4
When running a single-learner job with the same Python script, there are no issues; the whole process completes.
When running a multi-learner job (the only thing changed in the manifest is learners: 2), the process fails.
Logs are as follows:
Nicholass-MBP:FfDL npng$ $CLI_CMD list
Getting all models ...
ID Name Framework Training status Submitted Completed
training-C6DTcIMmR h2o3_automl h2o3:latest COMPLETED N/A N/A
training-TOnw5SGiR h2o3_automl h2o3:latest FAILED N/A N/A
2 records found.
Nicholass-MBP:FfDL npng$ $CLI_CMD logs training-TOnw5SGiR
Getting model training logs for 'training-TOnw5SGiR'...
Status: FAILED
Cannot read trained model log: rpc error: code = Unknown desc = NoSuchKey: The specified key does not exist.
status code: 404, request id: , host id: Nicholass-MBP:FfDL npng$
Nicholass-MBP:FfDL npng$ kubectl get pods
NAME READY STATUS RESTARTS AGE
alertmanager-78676b6756-2l2zb 1/1 Running 0 32m
etcd0 1/1 Running 0 32m
ffdl-lcm-dd5f59b55-bm52q 1/1 Running 0 32m
ffdl-restapi-7789dbdf5f-2j4mh 1/1 Running 0 32m
ffdl-trainer-59bd46cfdb-9csqr 1/1 Running 2 32m
ffdl-trainingdata-688bf5f44b-48wqb 1/1 Running 5 32m
ffdl-ui-6545f7dd5b-lpqcd 1/1 Running 0 32m
grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9 0/1 ImagePullBackOff 0 6m
jobmonitor-3be3332c-e2f4-4a2b-4775-3069398a12ba-64f9b94465s7gmh 1/1 Running 0 6m
learner-1-3be3332c-e2f4-4a2b-4775-3069398a12ba-f8d8b8c98-6drgz 0/7 Pending 0 6m
learner-2-3be3332c-e2f4-4a2b-4775-3069398a12ba-979949d49-9jv9f 0/7 Pending 0 6m
mongo-0 1/1 Running 0 32m
prometheus-556d97b566-fmgkp 2/2 Running 0 32m
pushgateway-665b6c4b9-hg85s 2/2 Running 0 32m
storage-0 1/1 Running 0 32m
Nicholass-MBP:FfDL npng$
Nicholass-MBP:FfDL npng$ kubectl describe pod grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9
Name: grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9
Namespace: default
Node: minikube/192.168.99.100
Start Time: Fri, 27 Apr 2018 15:45:33 -0700
Labels: app=grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba
pod-template-hash=3168270778
service=dlaas-parameter-server
training_id=training-TOnw5SGiR
Annotations: <none>
Status: Pending
IP: 172.17.0.16
Controlled By: ReplicaSet/grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd
Containers:
grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba:
Container ID:
Image: docker.io/ffdl/parameter-server:master-97
Image ID:
Port: 50051/TCP
State: Waiting
Reason: ImagePullBackOff
Ready: False
Restart Count: 0
Limits:
cpu: 500m
memory: 1048576k
Requests:
cpu: 500m
memory: 1048576k
Environment:
JOBID: 1111
NUM_LEARNERS: 2
TCP_PORT: 50051
ZK_DIR: training-TOnw5SGiR/parameter-server
ZK_DIR: training-TOnw5SGiR/parameter-server
DLAAS_ETCD_ADDRESS: <set to the key 'DLAAS_ETCD_ADDRESS' in secret 'lcm-secrets'> Optional: false
DLAAS_ETCD_USERNAME: <set to the key 'DLAAS_ETCD_USERNAME' in secret 'lcm-secrets'> Optional: false
DLAAS_ETCD_PASSWORD: <set to the key 'DLAAS_ETCD_PASSWORD' in secret 'lcm-secrets'> Optional: false
DLAAS_ETCD_PREFIX: <set to the key 'DLAAS_ETCD_PREFIX' in secret 'lcm-secrets'> Optional: false
FOR_TEST: 1
DLAAS_JOB_ID: training-TOnw5SGiR
ZNODE_NAME: singleshard
DATA_STORE_AUTHURL: http://s3.default.svc.cluster.local
MODEL_STORE_OBJECTID: dlaas-models/training-TOnw5SGiR.zip
RESULT_STORE_AUTHURL: http://s3.default.svc.cluster.local
RESULT_STORE_TYPE: s3_datastore
RESULT_STORE_USERNAME: test
MODEL_STORE_APIKEY: test
DATA_DIR: h2o3_training_data
DATA_STORE_TYPE: s3_datastore
MODEL_STORE_USERNAME: test
MODEL_DIR: /model-code
GPU_COUNT: 0.000000
RESULT_DIR: h2o3_trained_model
DATA_STORE_OBJECTID: h2o3_training_data
SCHED_POLICY: dense
RESULT_STORE_OBJECTID: h2o3_trained_model/training-TOnw5SGiR
LOG_DIR: /logs
MODEL_STORE_AUTHURL: http://s3.default.svc.cluster.local
MODEL_STORE_TYPE: s3_datastore
DATA_STORE_USERNAME: test
DATA_STORE_APIKEY: test
RESULT_STORE_APIKEY: test
TRAINING_COMMAND: python h2o3_baseline.py --trainDataFile ${DATA_DIR}/higgs_train_10k.csv --target response --memory 1
TRAINING_ID: training-TOnw5SGiR
Mounts:
/etc/certs/ from etcd-ssl-cert-vol (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-nllw4 (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
etcd-ssl-cert-vol:
Type: Secret (a volume populated by a Secret)
SecretName: lcm-secrets
Optional: false
default-token-nllw4:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-nllw4
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m default-scheduler Successfully assigned grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9 to minikube
Normal SuccessfulMountVolume 6m kubelet, minikube MountVolume.SetUp succeeded for volume "etcd-ssl-cert-vol"
Normal SuccessfulMountVolume 6m kubelet, minikube MountVolume.SetUp succeeded for volume "default-token-nllw4"
Normal Pulling 5m (x4 over 6m) kubelet, minikube pulling image "docker.io/ffdl/parameter-server:master-97"
Warning Failed 5m (x4 over 6m) kubelet, minikube Failed to pull image "docker.io/ffdl/parameter-server:master-97": rpc error: code = Unknown desc = Error response from daemon: pull access denied for ffdl/parameter-server, repository does not exist or may require 'docker login'
Warning Failed 5m (x4 over 6m) kubelet, minikube Error: ErrImagePull
Warning Failed 5m (x6 over 6m) kubelet, minikube Error: ImagePullBackOff
Normal BackOff 1m (x21 over 6m) kubelet, minikube Back-off pulling image "docker.io/ffdl/parameter-server:master-97"
I think this is what is blocking the rest of the processes:
Normal Pulling 5m (x4 over 6m) kubelet, minikube pulling image "docker.io/ffdl/parameter-server:master-97"
Warning Failed 5m (x4 over 6m) kubelet, minikube Failed to pull image "docker.io/ffdl/parameter-server:master-97": rpc error: code = Unknown desc = Error response from daemon: pull access denied for ffdl/parameter-server, repository does not exist or may require 'docker login'
Warning Failed 5m (x4 over 6m) kubelet, minikube Error: ErrImagePull
Warning Failed 5m (x6 over 6m) kubelet, minikube Error: ImagePullBackOff
Normal BackOff 1m (x21 over 6m) kubelet, minikube Back-off pulling image "docker.io/ffdl/parameter-server:master-97"
In some cases it takes a long time to train a model, and a "Processing" status is not enough for a good user experience. We will provide training progress and an estimate of the remaining time.
Thanks
What happened:
Hi there, thanks a lot for your work. It's impressive, so I tried to deploy it on local Minikube and local DIND, but neither worked properly. I've been stuck on an issue for a few days, so I'd like to ask you guys for help. By chance I found something similar to my issue in your docs, but under a different condition, which means:
On minikube I encountered the issue which was recorded in the DIND training -- all pods worked as expected:
alertmanager-7bd87d99cc-jhp2b 1/1 Running 0 6h
etcd0 1/1 Running 0 6h
ffdl-lcm-8d555c7bf-dqqhg 1/1 Running 0 6h
ffdl-restapi-7f5c57c77d-k67pm 1/1 Running 0 6h
ffdl-trainer-6777dd5756-xkk65 1/1 Running 0 6h
ffdl-trainingdata-696b99ff5c-tvbtc 1/1 Running 0 6h
ffdl-ui-95d6464c7-bv2sn 1/1 Running 0 6h
jobmonitor-0d296791-2adc-4336-4f01-b280090460c3-cbdb48cfd-qqsvz 1/1 Running 0 1h
learner-0d296791-2adc-4336-4f01-b280090460c3-0 0/1 ContainerCreating 0 1h
lhelper-0d296791-2adc-4336-4f01-b280090460c3-54858658b-p7vfc 2/2 Running 0 1h
mongo-0 1/1 Running 4 6h
prometheus-67fb854b59-c884p 2/2 Running 0 6h
pushgateway-5665768d5c-jdlnl 2/2 Running 0 6h
storage-0 1/1 Running 0 6h
except the learner pod, which is stuck pending forever because of the following warning.
Unable to mount volumes for pod "learner-d3a04eac-a64a-427e-56e5-8366cc84292f-0_default(33f78708-f963-11e8-aa08-0800275e57f0)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-d3a04eac-a64a-427e-56e5-8366cc84292f-0". list of unmounted volumes=[cosinputmount-d3a04eac-a64a-427e-56e5-8366cc84292f cosoutputmount-d3a04eac-a64a-427e-56e5-8366cc84292f]. list of unattached volumes=[cosinputmount-d3a04eac-a64a-427e-56e5-8366cc84292f cosoutputmount-d3a04eac-a64a-427e-56e5-8366cc84292f learner-entrypoint-files jobdata]
and here are the details of the learner pod:
Name: learner-0d296791-2adc-4336-4f01-b280090460c3-0
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: minikube/10.0.2.15
Start Time: Thu, 06 Dec 2018 17:05:52 +0100
Labels: controller-revision-hash=learner-0d296791-2adc-4336-4f01-b280090460c3-999bf4986
service=dlaas-learner
statefulset.kubernetes.io/pod-name=learner-0d296791-2adc-4336-4f01-b280090460c3-0
training_id=training-bFEXXGPmR
user_id=test-user
Annotations: scheduler.alpha.kubernetes.io/nvidiaGPU={ "AllocationPriority": "Dense" }
scheduler.alpha.kubernetes.io/tolerations=[ { "key": "dedicated", "operator": "Equal", "value": "gpu-task" } ]
Status: Pending
IP:
Controlled By: StatefulSet/learner-0d296791-2adc-4336-4f01-b280090460c3
Containers:
learner:
Container ID:
Image: tensorflow/tensorflow:1.5.0-py3
Image ID:
Ports: 22/TCP, 2222/TCP
Host Ports: 0/TCP, 0/TCP
Command:
bash
-c
export PATH=/usr/local/bin/:$PATH; cp /entrypoint-files/*.sh /usr/local/bin/; chmod +x /usr/local/bin/*.sh;
if [ ! -f /job/load-model.exit ]; then
while [ ! -f /job/load-model.start ]; do sleep 2; done ;
date "+%s%N" | cut -b1-13 > /job/load-model.start_time ;
echo "Starting Training $TRAINING_ID"
mkdir -p "$MODEL_DIR" ;
python -m zipfile -e $RESULT_DIR/_submitted_code/model.zip $MODEL_DIR ;
echo $? > /job/load-model.exit ;
fi
echo "Done load-model" ;
if [ ! -f /job/learner.exit ]; then
while [ ! -f /job/learner.start ]; do sleep 2; done ;
date "+%s%N" | cut -b1-13 > /job/learner.start_time ;
for i in ${!ALERTMANAGER*} ${!DLAAS*} ${!ETCD*} ${!GRAFANA*} ${!HOSTNAME*} ${!KUBERNETES*} ${!MONGO*} ${!PUSHGATEWAY*}; do unset $i; done;
export LEARNER_ID=$((${DOWNWARD_API_POD_NAME##*-} + 1)) ;
mkdir -p $RESULT_DIR/learner-$LEARNER_ID ;
mkdir -p $CHECKPOINT_DIR ;bash -c 'train.sh >> $JOB_STATE_DIR/latest-log 2>&1 ; exit ${PIPESTATUS[0]}' ;
echo $? > /job/learner.exit ;
fi
echo "Done learner" ;
if [ ! -f /job/store-logs.exit ]; then
while [ ! -f /job/store-logs.start ]; do sleep 2; done ;
date "+%s%N" | cut -b1-13 > /job/store-logs.start_time ;
echo Calling copy logs.
mv -nf $LOG_DIR/* $RESULT_DIR/learner-$LEARNER_ID ;
ERROR_CODE=$? ;
echo $ERROR_CODE > $RESULT_DIR/learner-$LEARNER_ID/.log-copy-complete ;
bash -c 'exit $ERROR_CODE' ;
echo $? > /job/store-logs.exit ;
fi
echo "Done store-logs" ;
while true; do sleep 2; done ;
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 500m
memory: 1048576k
nvidia.com/gpu: 0
Requests:
cpu: 500m
memory: 1048576k
nvidia.com/gpu: 0
Environment:
LOG_DIR: /job/logs
GPU_COUNT: 0.000000
TRAINING_COMMAND: python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001 --trainingIters 2000
TRAINING_ID: training-bFEXXGPmR
DATA_DIR: /mnt/data/tf_training_data
MODEL_DIR: /job/model-code
RESULT_DIR: /mnt/results/tf_trained_model/training-bFEXXGPmR
DOWNWARD_API_POD_NAME: learner-0d296791-2adc-4336-4f01-b280090460c3-0 (v1:metadata.name)
DOWNWARD_API_POD_NAMESPACE: default (v1:metadata.namespace)
LEARNER_NAME_PREFIX: learner-0d296791-2adc-4336-4f01-b280090460c3
TRAINING_ID: training-bFEXXGPmR
NUM_LEARNERS: 1
JOB_STATE_DIR: /job
CHECKPOINT_DIR: /mnt/results/tf_trained_model/_wml_checkpoints
RESULT_BUCKET_DIR: /mnt/results/tf_trained_model
Mounts:
/entrypoint-files from learner-entrypoint-files (rw)
/job from jobdata (rw)
/mnt/data/tf_training_data from cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 (rw)
/mnt/results/tf_trained_model from cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3 (rw)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
cosinputmount-0d296791-2adc-4336-4f01-b280090460c3:
Type: FlexVolume (a generic volume resource that is provisioned/attached using an exec based plugin)
Driver: ibm/ibmc-s3fs
FSType:
SecretRef: &{cossecretdata-0d296791-2adc-4336-4f01-b280090460c3}
ReadOnly: false
Options: map[debug-level:warn endpoint:http://192.168.99.105:31172 tls-cipher-suite:DEFAULT cache-size-gb:0 chunk-size-mb:52 curl-debug:false kernel-cache:true multireq-max:20 bucket:tf_training_data ensure-disk-free:0 parallel-count:5 region:us-standard s3fs-fuse-retry-count:30 stat-cache-size:100000]
cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3:
Type: FlexVolume (a generic volume resource that is provisioned/attached using an exec based plugin)
Driver: ibm/ibmc-s3fs
FSType:
SecretRef: &{cossecretresults-0d296791-2adc-4336-4f01-b280090460c3}
ReadOnly: false
Options: map[cache-size-gb:0 curl-debug:false endpoint:http://192.168.99.105:31172 parallel-count:2 bucket:tf_trained_model debug-level:warn s3fs-fuse-retry-count:30 stat-cache-size:100000 chunk-size-mb:52 kernel-cache:false ensure-disk-free:2048 region:us-standard tls-cipher-suite:DEFAULT multireq-max:20]
learner-entrypoint-files:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: learner-entrypoint-files
Optional: false
jobdata:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: dedicated=gpu-task:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 1m (x40 over 1h) kubelet, minikube Unable to mount volumes for pod "learner-0d296791-2adc-4336-4f01-b280090460c3-0_default(ce612f9d-f970-11e8-aa08-0800275e57f0)": timeout expired waiting for volumes to attach or mount for pod "default"/"learner-0d296791-2adc-4336-4f01-b280090460c3-0". list of unmounted volumes=[cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3]. list of unattached volumes=[cosinputmount-0d296791-2adc-4336-4f01-b280090460c3 cosoutputmount-0d296791-2adc-4336-4f01-b280090460c3 learner-entrypoint-files jobdata]
dind
Encountered a FAILED error with no hint while training. All the pods were running, but there were no jobmonitor, learner, or lhelper pods.
Deploying model with manifest 'manifest_testrun.yml' and model files in '.'...
Handling connection for 31404
Handling connection for 31404
FAILED
Error 200: OK
What you expected to happen:
FfDL should work properly on either local DIND or Minikube.
Environment:
OS: Darwin local 17.4.0 Darwin Kernel Version 17.4.0:
MINIKUBE:
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:36:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
How to reproduce it (as minimally and precisely as possible):
I was just following README.md with several make instructions:
make deploy-plugin
make quickstart-deploy
make test-push-data-s3
make test-job-submit
Anything else we need to know?:
In situation 2, I followed the above-mentioned steps exactly.
In situation 1, it first reported an NFS error. I remembered reading in one of the Minikube docs that only hostPath-type persistent volumes are supported, so I created a PV and PVC; here are the details.
$ kubectl describe pv hostpathtest
Name: hostpathtest
Labels: <none>
Annotations: pv.kubernetes.io/bound-by-controller=yes
Finalizers: [kubernetes.io/pv-protection]
StorageClass:
Status: Bound
Claim: default/static-volume-1
Reclaim Policy: Retain
Access Modes: RWO
Capacity: 20Gi
Node Affinity: <none>
Message:
Source:
Type: HostPath (bare host directory volume)
Path: /data/hostpath_test
HostPathType:
Events: <none>
$ kubectl describe pvc learner-1
Name: learner-1
Namespace: default
StorageClass:
Status: Bound
Volume: hostpathtest-learner
Labels: type=dlaas-static-volume
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{"volume.beta.kubernetes.io/storage-class":""},"labels":{"type":"dlaas-stat...
pv.kubernetes.io/bind-completed=yes
pv.kubernetes.io/bound-by-controller=yes
volume.beta.kubernetes.io/storage-class=
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 20Gi
Access Modes: RWO
Events: <none>
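For reference, a PV/PVC pair matching the `kubectl describe` output above could be written roughly like this. This is a hedged sketch: the names, path, size, and label come from that output; the rest is standard Kubernetes boilerplate, not the reporter's exact manifest.

```yaml
# Sketch of a hostPath PV/PVC reconstructed from the describe output above.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: hostpathtest
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /data/hostpath_test
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: learner-1
  labels:
    type: dlaas-static-volume
spec:
  storageClassName: ""        # empty class so it binds to the static PV
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
```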
Thanks in advance for any advice, and have a good day.
Note: Currently, GPU workloads on FfDL only work with the Accelerators feature gate.
We need to prepare some sample TensorFlow and Caffe jobs that use GPUs.
lcm/service/lcm/container_helper_extensions.go
I am getting an error with one pod that goes into CrashLoopBackOff status. I have tried a few times with a clean 'helm install' but get the same error.
ffdl-trainingdata-86c5578b75-v884m 0/1 CrashLoopBackOff 5 7m
Any suggestion on how to get past this error and have a clean deployment of this fabric? Thanks
This needs to be clarified in the documentation.
The Font Awesome library wasn't included in the FfDL UI. We should include it and rebuild our UI image.
If I execute the script, I get an error similar to the one below:
root@ffdl2018:~/FfDL/bin# kubectl port-forward pod/$ui_pod $ui_port:8080
error: invalid resource name "pod/": [may not contain '/']
So I tried removing the pod/ prefix, thinking maybe a newer version of kubeadmin-dind handles it differently, but then I get the different error below. Can someone help me with this error message?
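The `invalid resource name "pod/"` message usually means the `$ui_pod` variable was empty, so the argument collapsed to a bare `pod/`. A minimal sketch of that failure mode and a defensive check (variable names taken from the command quoted above; the guard is a suggestion, not the script's current contents):

```shell
#!/bin/sh
# When the pod-name lookup returns nothing, $ui_pod is empty and the
# port-forward target becomes the invalid resource name "pod/".
ui_pod=""                      # what a failed lookup leaves behind
resource="pod/${ui_pod}"
echo "$resource"               # collapses to just "pod/"

# Guard before calling kubectl, so the real cause is reported instead:
if [ -z "$ui_pod" ]; then
  echo "ui_pod is empty; check that the FfDL UI pod is actually running"
fi
```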
Forwarding from 127.0.0.1:31300 -> 8080
Handling connection for 30029
E1031 14:22:28.129745 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:28 socat[11424] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
E1031 14:22:30.160553 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:30 socat[11441] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
E1031 14:22:32.191360 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:32 socat[11492] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
E1031 14:22:34.225286 48277 portforward.go:331] an error occurred forwarding 30029 -> 3000: error forwarding port 3000 to pod 47707ef93dfd507f6f14e9f8adb03b26857f292357fd6102877eef2b52e8a554, uid : exit status 1: 2018/10/31 03:22:34 socat[11493] E connect(5, AF=2 127.0.0.1:3000, 16): Connection refused
Handling connection for 30029
creating data source...
Handling connection for 30029
set up dashboards
Handling connection for 30029
Finished
Could you please tell us how FfDL supports distributed TensorFlow training?
As we know, TensorFlow has worker tasks and parameter-server tasks; when using FfDL, do we need to specify this information explicitly to FfDL?
Thanks
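For context, distributed TensorFlow (in its generic, non-FfDL form) discovers the parameter-server and worker tasks from a cluster spec, commonly passed to each process via the `TF_CONFIG` environment variable. A sketch of what one worker would need to see — host names and ports here are placeholders, and whether FfDL injects this automatically is exactly the open question in this issue:

```shell
#!/bin/sh
# Generic TF_CONFIG for one worker in a 2-worker / 1-ps cluster.
# Host names and ports are placeholders, not FfDL-provided values.
export TF_CONFIG='{
  "cluster": {
    "ps":     ["ps-0:2222"],
    "worker": ["worker-0:2222", "worker-1:2222"]
  },
  "task": {"type": "worker", "index": 0}
}'
echo "$TF_CONFIG"
```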
As a best practice, we shouldn't have resource constants hard-coded in the code base; this makes issues such as #13 hard to fix. Also, making some of these constants configurable in the Helm chart would let users deploy FfDL according to their Kubernetes cluster size.
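As a rough illustration of the proposal, the hard-coded constants could move into the chart's values.yaml along these lines. The key names below are hypothetical — they are not the existing FfDL chart schema — and only show the shape of the change:

```yaml
# Hypothetical values.yaml fragment; key names are illustrative only.
lcm:
  resources:
    milliCPU: 500
    memoryMB: 1024
learner:
  maxLearnersPerJob: 20
  defaultGPUs: 0
```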
From Kubernetes 1.9 onward, the default role doesn't have permission to access and consume cluster resources, so we need to create a new RBAC role for the LCM to view and assign cluster resources to the learners. The new RBAC setup should be done as part of the Helm install.
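A minimal sketch of such an RBAC grant, using standard Kubernetes objects. The ServiceAccount name, resource list, and verbs are assumptions for illustration, not the actual contents of the FfDL chart:

```yaml
# Hypothetical RBAC sketch: lets an LCM service account inspect nodes
# and manage learner pods. Names and rules are illustrative only.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ffdl-lcm
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods", "services"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ffdl-lcm
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ffdl-lcm
subjects:
  - kind: ServiceAccount
    name: lcm
    namespace: default
```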
Currently we have a hard-coded solution on the gpu-dev branch that uses the GPU learner image with the accelerator. We will need to implement a trigger that pulls the correct learner image without the hard-coded solution.
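One way to sketch that trigger (pure illustration — the function and image tags below are hypothetical, not FfDL code): select the learner image tag based on the GPU count the job requests, instead of branching on a hard-coded image.

```shell
#!/bin/sh
# Hypothetical image-selection helper: append a -gpu tag only when the
# job actually requests GPUs. Framework/version/tags are illustrative.
select_learner_image() {
  framework="$1"   # e.g. tensorflow
  version="$2"     # e.g. 1.5.0
  gpus="$3"        # requested GPU count
  if [ "$gpus" -gt 0 ]; then
    echo "${framework}:${version}-gpu"
  else
    echo "${framework}:${version}"
  fi
}

select_learner_image tensorflow 1.5.0 0
select_learner_image tensorflow 1.5.0 2
```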