
Comments (20)

abidmalikwaterloo commented:

ppwwyyxx commented:

I always started the worker first and then started the PS with CUDA_VISIBLE_DEVICES= (an empty value).

yupeng9 commented:

Right, if I start the worker first, then the PS will also show an OOM error. Will CUDA_VISIBLE_DEVICES= disable the GPU devices for the PS?

By the way, if this is required, can someone update the official guide?

tfboyd commented:

tfboyd commented:

@yupeng9
If you are doing distributed TensorFlow on just a few servers, I would check out this example, which includes TensorBoard outputs and other nice features like automatic evaluation. You could also try the Uber project, which is nice for distributed training; I have not personally tested it, but I have seen their results and they are good. We are working on a nicer high-level API in TensorFlow for distributed training, but the above options are currently the best.

yupeng9 commented:

@tfboyd thanks for the information.

Since pushing to the website can take a while, do you mind posting the instructions here once you have them?

I took a look at cifar10. Is there a plan to migrate tf_cnn_benchmarks to include those additional features? A nice thing I see in tf_cnn_benchmarks is that it is more of a general benchmark test bed: it supports multiple models as well as different data sets, and therefore it also allows future additions.

More importantly, the TensorFlow website publishes useful results from this benchmark, so it has great reference value.

DjangoPeng commented:

@yupeng9 What is the process for the distributed testing? I'm starting to run the distributed TensorFlow benchmarks.
@tfboyd It seems like the official guide still has not been updated?

Zhaojp-Frank commented:

+1. Any update on the latest docs for the distributed training steps? Thanks.

tfboyd commented:

I doubt I will update the web page anytime soon. I must have been in a hurry when I typed up that page; I also use my own testing harness that builds the commands, and I likely failed to copy and paste my exact commands from the logs. I did test what is likely the most recent code on AWS two weeks ago, and everything seemed fine with TF 1.4. It was a very small test with 2x p2.8xlarge instances.

I would suggest people not use this code unless they are going to write their own distributed or multi-GPU setup and can understand the variable-management aspects. We use this code to test new ideas and a lot of different variations that are not matrix-tested, meaning option A may not even work with option D, and that will not be documented. I am putting all of my time into helping the team get clean examples published with known accuracy numbers over the next few months.

reedwm commented:

As @ppwwyyxx stated, when running the parameter servers on the same hosts as the workers, you should prefix the parameter server commands with CUDA_VISIBLE_DEVICES= (an empty value). This hides the GPUs from TensorFlow, so it will not use them or allocate memory on them. I haven't tried it myself, but the updated commands should be:

# Run the following commands on host_0 (10.0.0.1):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# Run the following commands on host_1 (10.0.0.2):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1
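
As a quick sanity check (a minimal sketch, not part of the original instructions), you can confirm that an empty CUDA_VISIBLE_DEVICES really does hide the GPUs before launching the PS:

# Should list only CPU devices; without the CUDA_VISIBLE_DEVICES= prefix the GPUs are listed as well.
CUDA_VISIBLE_DEVICES= python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"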

I'm currently blocked by this issue, but afterwards, once I have time, I can update the README (and the website once I figure out how) with the updated commands.

DjangoPeng commented:

@reedwm How about setting CUDA_VISIBLE_DEVICES={0..7} for the corresponding worker, such as GPU 0 for worker 0? The command would be:

CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

reedwm commented:

In the above example, each worker is on a separate machine, since they have different IP addresses (10.0.0.1 and 10.0.0.2). So they will each have their own set of 8 GPUs, and CUDA_VISIBLE_DEVICES should not be set.

If multiple worker processes are run on the same machine, your strategy of setting CUDA_VISIBLE_DEVICES will work. But it's better to run a single worker per machine and have each worker use all the GPUs on the machine.
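
For completeness, here is a rough sketch of the per-GPU-worker strategy on a single host (the host address, ports, and two-GPU count are illustrative, not taken from this thread); the parameter server on that host would still be started with CUDA_VISIBLE_DEVICES= as above:

# Two single-GPU workers on one host (10.0.0.1), each pinned to its own GPU:
CUDA_VISIBLE_DEVICES=0 python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.1:50002 --task_index=0

CUDA_VISIBLE_DEVICES=1 python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=1 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.1:50002 --task_index=1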

DjangoPeng commented:

Yep! I know the trick of setting CUDA_VISIBLE_DEVICES. But I just have 3 machines, with two 1080 Tis per machine. So the recommended cluster specification is 3 parameter servers and 3 workers, with one ps/worker pair per machine. Am I right?

reedwm commented:

Yep, that is correct. On each machine, the worker will have access to both GPUs, and the parameter server will not, since it will be run as CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks ...
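
A rough sketch of that setup on the first of the three machines (the third machine's address, 10.0.0.3, and the ports are placeholders; the other two machines use --task_index=1 and --task_index=2):

# On machine 0 (10.0.0.1): the worker sees both GPUs, the parameter server sees none.
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=2 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000,10.0.0.3:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001,10.0.0.3:50001 --task_index=0

CUDA_VISIBLE_DEVICES= python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=2 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000,10.0.0.3:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001,10.0.0.3:50001 --task_index=0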

Zhaojp-Frank commented:

@reedwm A question about the start order: for example, on the same host A, once I run the above command to start the worker, the shell does not return; instead it keeps running, e.g. trying to start the session. So when should I start the PS? Is any strict order required?
I have hit a number of errors such as 'Attempting to use uninitialized value p' and 'expects a different device'. It would be great to document the start-order info.

DjangoPeng commented:

@Zhaojp-Frank Generally speaking, you had better launch the PS process before worker 0. If no ps is running properly, worker 0 will throw the uninitialized-value error.

reedwm commented:

You should be able to launch the processes in any order. @DjangoPeng, what are the commands you use that sometimes cause an uninitialized error?

abidmalikwaterloo commented:

Do we have to kill the parameter servers manually when the job is done?

reedwm commented:

@abidmalik1967, yes.
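
One rough way to do that (a sketch, assuming the PS was launched with the tf_cnn_benchmarks.py commands shown earlier in this thread) is to match the process by its command line on each machine once the workers have finished:

# Kill the local parameter-server process; it does not exit on its own when training completes.
pkill -f "tf_cnn_benchmarks.py.*--job_name=ps"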

vilmara commented:

Hi @reedwm / @tfboyd, I am running the benchmarks on a multi-node system (2 hosts, each with 4 GPUs) following the instructions below from https://www.tensorflow.org/performance/performance_models#executing_the_script, but I am getting errors (note that I replaced python with python3 and used --num_gpus=4 for each host).

Run the following commands on host_0 (10.0.0.1):

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

Run the following commands on host_1 (10.0.0.2):

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

python3 tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

When the system processes the first command, it throws the following error on each host:

host_0 output:
2018-05-15 18:32:29.136718: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-05-15 18:32:29.136759: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-05-15 18:32:29.136775: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:1
2018-05-15 18:32:37.369403: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error

host_1 output:
2018-05-15 18:32:47.220352: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
2018-05-15 18:32:47.220364: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-05-15 18:32:54.466053: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error

When it runs the second command, it prints the training info and then just prints the lines below without producing any more output; the processes appear to hang on each host:
Running parameter server 0 # in the case of host_0
Running parameter server 1 # in the case of host_1
