
Comments (20)

abidmalikwaterloo commented:

ppwwyyxx commented:

I always started the worker first and then started the PS with CUDA_VISIBLE_DEVICES= (an empty value).

yupeng9 commented:

Right, if I start the worker first, then the PS will also show an OOM error. Will CUDA_VISIBLE_DEVICES= disable the GPU devices for the PS?

By the way, if this is required, can someone update the official guide?

tfboyd commented:

tfboyd commented:

@yupeng9
If you are doing distributed TensorFlow on just a few servers, I would check out this example, which includes TensorBoard outputs and other nice features like automatic evaluation. You could also try the Uber project, which is nice for distributed training; I have not personally tested it, but I have seen their results and they are good. We are working on a nicer high-level API in TensorFlow for distributed training, but the above options are currently the best.

yupeng9 commented:

@tfboyd thanks for the information.

Since pushing to the website can take a while, do you mind posting the instructions here once you have them?

I took a look at cifar10. Is there a plan to migrate tf_cnn_benchmarks to include those additional features? A nice thing I see in tf_cnn_benchmarks is that it is more of a general benchmark test bed: it supports multiple models as well as different data sets, and therefore it also allows future additions.

More importantly, the TensorFlow website publishes useful results from this benchmark, so it has great reference value.

DjangoPeng commented:

@yupeng9 What is the process for the distributed testing? I'm starting to run the distributed TensorFlow benchmarks.
@tfboyd It seems like the official guide still has not been updated?

Zhaojp-Frank commented:

+1. Any update on the latest docs for the distributed training steps? Thanks.

tfboyd commented:

I doubt I will update the web page anytime soon. I must have been in a hurry when I typed up that page; I also use my own testing harness that builds the commands, and I likely failed to copy and paste my exact commands from the logs. I did test what is likely the most recent code on AWS two weeks ago, and everything seemed fine with TF 1.4. It was a very small test with 2x p2.8xlarge instances.

I would suggest people not use this code unless they are going to write their own distributed or multi-GPU setup and can understand the variable-management aspects. We use this code to test new ideas and a lot of different variations that are not matrix-tested, meaning option A may not even work with option D, and that will not be documented. I am putting all of my time into helping the team get clean examples published with known accuracy numbers over the next few months.

reedwm commented:

As @ppwwyyxx stated, when running the parameter servers on the same hosts as the workers, you should prefix the parameter server commands with CUDA_VISIBLE_DEVICES= (an empty value). This hides the GPUs from TensorFlow, so it will not use them or allocate memory on them. I haven't tried it myself, but the updated commands should be:

# Run the following commands on host_0 (10.0.0.1):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# Run the following commands on host_1 (10.0.0.2):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1
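
As a quick sanity check (a minimal sketch, not part of the original instructions), you can confirm that an empty CUDA_VISIBLE_DEVICES really does hide the GPUs before launching the PS:

# Should list only CPU devices; without the CUDA_VISIBLE_DEVICES= prefix the GPUs are listed as well.
CUDA_VISIBLE_DEVICES= python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"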

I'm currently blocked by this issue, but afterwards, once I have time, I can update the README (and the website once I figure out how) with the updated commands.

DjangoPeng commented:

@reedwm How about setting CUDA_VISIBLE_DEVICES={0..7} for the corresponding worker, such as GPU 0 for worker 0? The command would be:

CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

reedwm commented:

In the above example, each worker is on a separate machine, since they have different IP addresses (10.0.0.1 and 10.0.0.2). So they will each have their own set of 8 GPUs, and CUDA_VISIBLE_DEVICES should not be set.

If multiple worker processes are run on the same machine, your strategy of setting CUDA_VISIBLE_DEVICES will work. But it's better to run a single worker per machine and have each worker use all the GPUs on the machine.
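
For completeness, here is a rough sketch of the per-GPU-worker strategy on a single host (the host address, ports, and two-GPU count are illustrative, not taken from this thread); the parameter server on that host would still be started with CUDA_VISIBLE_DEVICES= as above:

# Two single-GPU workers on one host (10.0.0.1), each pinned to its own GPU:
CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.1:50002 --task_index=0

CUDA_VISIBLE_DEVICES=1 python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.1:50002 --task_index=1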

DjangoPeng commented:

Yep! I know the trick of setting CUDA_VISIBLE_DEVICES. But I just have 3 machines, with two 1080 Tis per machine. So the recommended cluster specification is 3 parameter servers and 3 workers, with one ps/worker pair per machine. Am I right?

reedwm commented:

Yep, that is correct. On each machine, the worker will have access to both GPUs, and the parameter server will not, since it will be run as CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks ...
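
A rough sketch of that setup on the first of the three machines (the third machine's address, 10.0.0.3, and the ports are placeholders; the other two machines use --task_index=1 and --task_index=2):

# On machine 0 (10.0.0.1): the worker sees both GPUs, the parameter server sees none.
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=2 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000,10.0.0.3:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001,10.0.0.3:50001 --task_index=0

CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=2 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000,10.0.0.3:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001,10.0.0.3:50001 --task_index=0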

Zhaojp-Frank commented:

@reedwm A question about the start order: for example, on the same host A, once I run the above command to start the worker, the shell does not return; instead it keeps running, e.g. trying to start the session. So when should I start the PS? Is any strict order required?
I have hit a number of errors such as 'Attempting to use uninitialized value p' and 'expects a different device'. It would be great to document the start-order info.

DjangoPeng commented:

@Zhaojp-Frank Generally speaking, you had better launch the PS process before worker 0. If no ps is running properly, worker 0 will throw the uninitialized-value error.

reedwm commented:

You should be able to launch the processes in any order. @DjangoPeng, what are the commands you use that sometimes cause an uninitialized error?

abidmalikwaterloo commented:

Do we have to kill the parameter servers manually when the job is done?

reedwm commented:

@abidmalik1967, yes.
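
One rough way to do that (a sketch, assuming the PS was launched with the tf_cnn_benchmarks.py commands shown earlier in this thread) is to match the process by its command line on each machine once the workers have finished:

# Kill the local parameter-server process; it does not exit on its own when training completes.
pkill -f "tf_cnn_benchmarks.py.*--job_name=ps"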

vilmara commented:

Hi @reedwm / @tfboyd, I am running the benchmarks on a multi-node system (2 hosts, each with 4 GPUs) following the instructions below from https://www.tensorflow.org/performance/performance_models#executing_the_script, but I am getting errors (note that I replaced python with python3 and used --num_gpus=4 for each host).

Run the following commands on host_0 (10.0.0.1):

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

Run the following commands on host_1 (10.0.0.2):

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

When the system processes the first command, it throws the following error on each host:

host_0 output:
2018-05-15 18:32:29.136718: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-05-15 18:32:29.136759: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-05-15 18:32:29.136775: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2018-05-15 18:32:37.369403: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error

host_1 output:
2018-05-15 18:32:47.220352: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-05-15 18:32:47.220364: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-05-15 18:32:54.466053: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error

When it runs the second command, it prints the training info and then just prints the lines below without producing any more output; the processes appear to hang on each host:
Running parameter server 0 # in the case of host_0
Running parameter server 1 # in the case of host_1
