Comments (5)
Hi @stock99, it looks like the script didn't find the right pod names in your Kubernetes cluster. Can you echo your pod names with the commands below? Thanks.
ui_pod=$(kubectl get pods | grep ffdl-ui | awk '{print $1}')
restapi_pod=$(kubectl get pods | grep ffdl-restapi | awk '{print $1}')
grafana_pod=$(kubectl get pods | grep prometheus | awk '{print $1}')
echo $ui_pod
echo $restapi_pod
echo $grafana_pod
Also, the pod/ format was introduced in kubectl client v1.10.0, so I would recommend updating your kubectl client to v1.10.0 or later.
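For scripts that have to work with both older and newer kubectl clients, one option is to strip any resource prefix before using the pod name. This is a minimal sketch, assuming a hypothetical helper named `strip_pod_prefix` (it is not part of the FfDL scripts); handling a `pods/` prefix as well is an assumption about what other client versions may print.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove a leading "pod/" or "pods/" from a pod reference.
# Names without a prefix are returned unchanged.
strip_pod_prefix() {
  local name="$1"
  name="${name#pod/}"    # drop "pod/" if present
  name="${name#pods/}"   # drop "pods/" if present (assumed older form)
  printf '%s\n' "$name"
}

# Both calls print the bare pod name.
strip_pod_prefix "pod/ffdl-ui-b6cbb98f-c4zpm"
strip_pod_prefix "ffdl-ui-b6cbb98f-c4zpm"
```

The helper uses only shell parameter expansion, so it does not depend on the kubectl version installed.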
from ffdl.
Hi Tomcli,
It looks like the kubectl that comes with the kubeadm-dind installation script isn't the latest one (it is 1.8.x). Even if I install the latest version via snap, the installation script still seems to enforce the use of 1.8.15. Should I adjust any environment variable?
echo $ui_pod
ffdl-ui-b6cbb98f-c4zpm
echo $restapi_pod
ffdl-restapi-84bcb74478-t8df6
echo $grafana_pod
prometheus-5f85fd7695-gb568
kubectl version
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.15", GitCommit:"c2bd642c70b3629223ea3b7db566a267a1e2d0df", GitTreeState:"clean", BuildDate:"2018-07-11T17:59:56Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.15", GitCommit:"c2bd642c70b3629223ea3b7db566a267a1e2d0df", GitTreeState:"clean", BuildDate:"2018-07-11T17:52:15Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
snap list
Name     Version    Rev   Tracking  Publisher     Notes
aws-cli  1.15.71    135   stable    aws✓          classic
core     16-2.35.5  5742  stable    canonical✓    core
helm     2.11.0     63    stable    snapcrafters  classic
kubectl  1.12.1     462   stable    canonical✓    classic
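The output above shows the snap package at 1.12.1 while `kubectl version` reports a v1.8.15 client, which would be consistent with a different kubectl binary appearing earlier on the PATH (an assumption; `type -a kubectl` lists the candidates). Below is a minimal sketch for checking a reported GitVersion against the 1.10.0 threshold. It assumes GNU `sort -V` is available, and the function names `parse_git_version` and `version_ge` are hypothetical.

```shell
#!/usr/bin/env bash
# Extract the numeric GitVersion (e.g. 1.8.15) from a "kubectl version" line on stdin.
# Assumes the version.Info{...} output format shown in this thread.
parse_git_version() {
  sed -n 's/.*GitVersion:"v\([0-9.]*\)".*/\1/p'
}

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU sort's version ordering (-V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example using a shortened copy of the client line from this thread.
line='Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.15"}'
client="$(printf '%s\n' "$line" | parse_git_version)"

if version_ge "$client" "1.10.0"; then
  echo "client $client is new enough"
else
  echo "client $client is older than 1.10.0"
fi
```

With the v1.8.15 line above, the check takes the "older than 1.10.0" branch. A plain string comparison would not work here, because "1.8.15" sorts after "1.10.0" lexically.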
Hi @stock99, I updated the script in #150 so that it can run with K8S 1.8.x. Let me know if you encounter any new issues.
It seems to be OK now after removing 'pod/' from the script. The connection error in the opening post occurred because I fat-fingered one of the export statements in the dind installation.
But then I got an error message from the test routine make test-push-data-s3 && make test-job-submit:
Getting all models ...
Handling connection for 32060
ID Name Framework Training status Submitted Completed
0 records found.
Makefile:213: recipe for target 'test-job-submit' failed
make: *** [test-job-submit] Error 1
======
Attached is the console log: error_log.txt
Can anyone help? I got these error messages when running make test-job-submit:
Downloading Docker images and test training data. This may take a while.
Context "dind" modified.
error: there is no need to specify a resource type as a separate argument when passing arguments in resource/name form (e.g. 'kubectl get resource/<resource_name>' instead of 'kubectl get resource resource/<resource_name>'
Submitting example training job (tf-model)
S3 URL: http://:30381 REST URL: http://localhost:31961
Executing in etc/examples/tf-model: DLAAS_URL=http://localhost:31961 DLAAS_USERNAME=test-user DLAAS_PASSWORD=test /home/chris/FfDL/cli/bin/ffdl-linux train manifest.yml .
sed: can't read : No such file or directory
name: tf_convolutional_network_tutorial
description: Convolutional network model using tensorflow
version: "1.0"
gpus: 0
cpus: 0.5
memory: 1Gb
learners: 1
# Object stores that allow the system to retrieve training data.
data_stores:
  - id: sl-internal-os
    type: mount_cos
    training_data:
      container: tf_training_data
    training_results:
      container: tf_trained_model
    connection:
      auth_url: http://10.192.0.3:30417
      user_name: test
      password: test
framework:
  name: tensorflow
  version: "1.5.0-py3"
  command: >
    python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz
    --trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz
    --testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001
    --trainingIters 2000
  # Change trainingIters to 20000 if you want your model to have over 80% Accuracy rate.
evaluation_metrics:
  type: tensorboard
  in: "$JOB_STATE_DIR/logs/tb"
  # (Eventual) Available event types: 'images', 'distributions', 'histograms', 'images'
  # 'audio', 'scalars', 'tensors', 'graph', 'meta_graph', 'run_metadata'
  # event_types: [scalars]
/home/chris/FfDL/etc/examples/tf-model
Deploying model with manifest 'manifest_testrun.yml' and model files in '.'...
Handling connection for 31961
Handling connection for 31961
FAILED
Error 200: OK
Test job submitted. Track the status via "DLAAS_URL=http://localhost:31961 DLAAS_USERNAME=test-user DLAAS_PASSWORD=test /home/chris/FfDL/cli/bin/ffdl-linux list".
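The empty host in `S3 URL: http://:30381` and the empty file name in `sed: can't read :` would both be consistent with a shell variable expanding to an empty string, possibly as a follow-on from the kubectl error earlier in the log. This is a guess, not a confirmed diagnosis. Below is a minimal sketch of failing fast on empty values; `require_nonempty` and `build_url` are hypothetical names, not functions from the FfDL Makefile.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report an error and fail when a required value is empty.
# $1 = human-readable description, $2 = value to check
require_nonempty() {
  if [ -z "$2" ]; then
    echo "error: $1 is empty" >&2
    return 1
  fi
}

# Hypothetical helper: build an http URL only when host and port are both set.
# $1 = host, $2 = port
build_url() {
  require_nonempty "host" "$1" || return 1
  require_nonempty "port" "$2" || return 1
  printf 'http://%s:%s\n' "$1" "$2"
}

# A complete host/port pair yields a URL.
build_url "localhost" "31961"

# An empty host is rejected instead of producing "http://:30381".
build_url "" "30381" || echo "refused to build a URL with an empty host"
```

Checking values right after they are looked up makes the first failing step visible, rather than letting a later command such as sed fail on an empty argument.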