
distributed-tensorflow-guide's Introduction

Distributed TensorFlow Guide

This guide is a collection of distributed training examples (that can act as boilerplate code) and a tutorial of basic distributed TensorFlow. Many of the examples focus on implementing well-known distributed training schemes, such as those available in dist-keras which were discussed in the author's blog post.

Almost all the examples can be run on a single machine with a CPU, and all the examples only use data-parallelism (i.e., between-graph replication).

The motivation for this guide stems from the current state of distributed deep learning. Deep learning papers typically demonstrate successful new architectures on some benchmark, but rarely show how these models can be trained with 1000x the data, which is usually the requirement in industry. Furthermore, most successful distributed cases use state-of-the-art hardware to brute-force massive effective minibatches in a synchronous fashion across high-bandwidth networks; there has been little research showing the potential of asynchronous training (which is why there are a lot of asynchronous examples in this guide). Finally, the lack of documentation for distributed TF was the largest motivator for this project. TF is a great tool that prides itself on its scalability, but unfortunately there are few examples that show how to make your model scale with data size.

The aim of this guide is to aid all interested in distributed deep learning, from beginners to researchers.

Basics Tutorial

See the Basics-Tutorial folder for notebooks demonstrating core concepts used in distributed TensorFlow. The rest of the examples assume understanding of the basics tutorial.

  • Servers.ipynb -- basics of TensorFlow servers
  • Parameter-Server.ipynb -- everything about parameter servers
  • Local-then-Global-Variables.ipynb -- creates a graph locally, then makes global copies of the variables. Useful for graphs that do local updates before pushing global updates (e.g., DOWNPOUR, ADAG, etc.)
  • Multiple-Workers -- contains three notebooks: one parameter-server notebook and two worker notebooks. The exercise shows how global variables are communicated via the parameter server and how local updates can be made by explicitly placing ops on local devices (a minimal sketch of these ideas follows this list)
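A minimal sketch of these concepts, assuming one parameter-server task and one worker task on localhost. The ports and variable names are illustrative, and both servers are started in a single process so the snippet is self-contained; the notebooks run each task in its own process:

import tensorflow as tf

# Cluster with one ps task and one worker task (ports are arbitrary).
cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"],
                                "worker": ["localhost:2223"]})

# For this demo we start both servers here so the ps devices exist
# when the session is created.
ps_server = tf.train.Server(cluster, job_name="ps", task_index=0)
worker_server = tf.train.Server(cluster, job_name="worker", task_index=0)

# The global variable lives on the parameter server.
with tf.device("/job:ps/task:0"):
    global_w = tf.Variable(tf.zeros([10]), name="global_w")

# The local copy lives on the worker; it is updated locally and synced explicitly.
with tf.device("/job:worker/task:0"):
    local_w = tf.Variable(tf.zeros([10]), name="local_w")
    pull_global = tf.assign(local_w, global_w)  # fetch the global state

with tf.Session(worker_server.target) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(pull_global)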

Training Algorithm Examples

The complete list of examples is below. The first example, Non-Distributed-Setup, shows the basic learning problem we want to solve distributively; this example should be familiar to everyone since it doesn't use any distributed code. The second example, Distributed-Setup, shows the same problem being solved with distributed code (i.e., with one parameter server and one worker); a sketch of this kind of setup follows the footnote below. The remaining examples are a mix of synchronous and asynchronous training schemes.

¹ This is the same as the DOWNPOUR example except that it uses SGD on the workers instead of Adagrad.
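For orientation, here is a hedged sketch of the kind of minimal setup the Distributed-Setup example demonstrates: one parameter server and one worker, with variables pinned to the ps via a device setter. The cluster addresses, the toy loss, and the hyperparameters are illustrative and not taken from the repository's code; in the real examples each task runs in its own process:

import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["localhost:2224"],
                                "worker": ["localhost:2225"]})
tf.train.Server(cluster, job_name="ps", task_index=0)    # in-process ps (demo only)
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter places variables on the ps and ops on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    w = tf.Variable(5.0, name="w")
    loss = tf.square(w - 1.0)                            # toy objective
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=True) as sess:
    for _ in range(100):
        sess.run(train_op)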

Running Training Algorithm Examples

All the training examples (except the non-distributed example) live in their own folders. To run one, move to the example's directory and run the bash script.

cd <example_name>/
bash run.sh

In order to completely stop an example, you'll need to kill the python processes associated with it. If you stop training early, there will be python processes for each of the workers in addition to the parameter-server processes. Unfortunately, the parameter-server processes continue to run even after the workers are finished--these always need to be killed manually. To kill all python processes, run pkill.

sudo pkill python

Requirements

  • Python 2.7
  • TensorFlow >= 1.2

Links

Glossary

  • Server -- encapsulates a Session target and belongs to a cluster
  • Coordinator -- coordinates threads
  • Session Manager -- restores sessions, initializes variables, and coordinates threads
  • Supervisor -- manages training threads; bundles a Coordinator, a Saver, and a Session Manager (a superset of the Session Manager)
  • Session Creator -- a factory for creating sessions
  • Monitored Session -- a Session-like object that adds initialization, hooks, and recovery
  • Monitored Training Session -- the only distributed solution for synchronous optimization (see the sketch after this glossary)
  • Sync Replicas -- an optimizer wrapper for synchronous optimization
  • Scaffold -- holds many meta training settings and is passed to the Session Creator
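To show how several of these pieces fit together, here is a hedged sketch of synchronous optimization: a SyncReplicasOptimizer wraps a plain optimizer, and its hook is handed to a MonitoredTrainingSession. This is a single-process toy (one worker, one gradient to aggregate) with illustrative ports and a made-up loss, not code from the repository:

import tensorflow as tf

cluster = tf.train.ClusterSpec({"ps": ["localhost:2226"],
                                "worker": ["localhost:2227"]})
tf.train.Server(cluster, job_name="ps", task_index=0)    # in-process ps (demo only)
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    w = tf.Variable(3.0, name="w")
    loss = tf.square(w)                                  # toy objective
    global_step = tf.Variable(0, trainable=False, name="global_step")
    opt = tf.train.GradientDescentOptimizer(0.1)
    # Sync Replicas: wrap the optimizer so gradients are aggregated across
    # replicas before being applied (a single replica here, for the demo).
    opt = tf.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=1,
                                         total_num_replicas=1)
    train_op = opt.minimize(loss, global_step=global_step)

# The hook initializes the token queue and starts the chief queue runner.
sync_hook = opt.make_session_run_hook(is_chief=True)

# Monitored Training Session: handles initialization, recovery, and hooks.
with tf.train.MonitoredTrainingSession(master=server.target, is_chief=True,
                                       hooks=[sync_hook]) as sess:
    for _ in range(10):
        sess.run(train_op)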

Hooks

Algorithm References

distributed-tensorflow-guide's People

Contributors

lamberce, tmulc18


distributed-tensorflow-guide's Issues

The code runs unsuccessfully in synchronous mode

In Synchronous-SGD, I ran ssgd.py and the following error occurred:
tensorflow.python.framework.errors_impl.InvalidArgumentError: NodeDef missing attr 'reduction_type'
from Op<name=ConditionalAccumulator; signature= -> handle:Ref(string); attr=dtype:type,allowed=[DT_FLOAT, DT_DOUBLE, DT_INT32, DT_UINT8, DT_INT16, ..., DT_UINT16, DT_COMPLEX128, DT_HALF, DT_UINT32, DT_UINT64]; attr=shape:shape; attr=container:string,default=""; attr=shared_name:string,default=""; attr=reduction_type:string,default="MEAN",allowed=["MEAN", "SUM"]; is_stateful=true>; NodeDef: {{node sync_replicas/conditional_accumulator}} = ConditionalAccumulator_class=["loc:@sync_replicas/SetGlobalStep"], container="", dtype=DT_FLOAT, shape=[3,3,3,16], shared_name="conv0/conv:0/grad_accum", _device="/job:ps/replica:0/task:0/device:CPU:0"

How to run run.sh ?

While doing distributed training on multiple computers, do I need to run run.sh on the PS or on each computer separately? By the way, neither of these approaches is working for me. When I run my code on a single computer with multiple workers it works properly, but I don't know how to run it on multiple computers. Is a cluster manager (like Kubernetes) required for this purpose?

Beginner Tutorial

Maybe we should add a tutorial showing how to set up a cluster in TF and launch a session with the cluster.

Update: The example with local and global variables needs some updates. There needs to be functionality for pulling global states and assigning them locally. Additionally, we should get a plot of the graph from TensorBoard to show what the local vs. global ops and variables look like.
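The "pull global state and assign it locally" functionality described above can be sketched roughly as follows (graph construction only; the device strings and variable names are illustrative, and the lists would normally be built from the model's actual variables):

import tensorflow as tf

with tf.device("/job:ps/task:0"):
    global_vars = [tf.Variable(tf.zeros([4]), name="global_w")]

with tf.device("/job:worker/task:0"):
    local_vars = [tf.Variable(tf.zeros([4]), name="local_w")]

# One op that copies every global variable into its local counterpart;
# running it inside a session refreshes the worker's local copies.
pull_global = tf.group(*[tf.assign(local_v, global_v)
                         for local_v, global_v in zip(local_vars, global_vars)])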

Unresolved reference to config in dist_multi_gpu_sing_mach.py

I have a single Exxact machine with 4 GPUs and was looking into how to run my code in a multi-GPU environment. I am getting the following traceback for the file mentioned above after running the .sh file.


I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
Traceback (most recent call last):
  File "dist_mult_gpu_sing_mach.py", line 76, in <module>
    main()
  File "dist_mult_gpu_sing_mach.py", line 29, in main
    task_index=FLAGS.task_index,config=config)
UnboundLocalError: local variable 'config' referenced before assignment
E tensorflow/stream_executor/cuda/cuda_driver.cc:509] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: ubuntu
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: ubuntu
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 384.98.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:363] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  384.98  Thu Oct 26 15:16:01 PDT 2017
GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3) 
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 384.98.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 384.98.0
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:197] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2223, 1 -> localhost:2224}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:211] Started server with target: grpc://localhost:2222
Traceback (most recent call last):
  File "dist_mult_gpu_sing_mach.py", line 76, in <module>
    main()
  File "dist_mult_gpu_sing_mach.py", line 29, in main
    task_index=FLAGS.task_index,config=config)
UnboundLocalError: local variable 'config' referenced before assignment
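For readers hitting the same traceback: this UnboundLocalError is a plain Python scoping problem rather than a TensorFlow one, and it typically means config was only assigned inside a conditional branch that did not execute. A minimal, hypothetical reproduction (not the repo's actual code) and the usual fix:

def main():
    use_gpu = False
    if use_gpu:
        config = "gpu session config"   # config only assigned on this branch
    print(config)                       # UnboundLocalError when use_gpu is False

def main_fixed():
    use_gpu = False
    config = None                       # default before the branch
    if use_gpu:
        config = "gpu session config"
    print(config)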

  

Some clarifications

First of all, thanks a lot for the awesome code collection. I have a couple of (perhaps naive) queries regarding the following lines:

with tf.device('/cpu:0'):
    a = tf.Variable(tf.truncated_normal(shape=[2]), dtype=tf.float32)
    b = tf.Variable(tf.truncated_normal(shape=[2]), dtype=tf.float32)

in the Distributed-Setup example. Since this code snippet is executed on the worker, I assume that the variables a and b remain on the worker. Here are my queries:

  1. Where exactly do we make use of the parameter server to store the variables, since everything seems to be happening on the worker?
  2. In the line "with tf.device('/cpu:0'):", which CPU are we using, the worker's or the ps's?

I am sure these are naive queries, but I am still grappling with distributed TensorFlow, and its documentation does not help my case. Thanks in advance.
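For context on these questions: variables end up on the parameter server only when the device specification says so (for example via an explicit /job:ps device or tf.train.replica_device_setter); an unqualified '/cpu:0' leaves the job unspecified, and the placer typically resolves it to the task the session connects to. A fully qualified version of the snippet would look like this (a hedged, illustrative sketch, not the repo's code):

import tensorflow as tf

with tf.device("/job:ps/task:0/cpu:0"):       # explicitly the ps's CPU
    a = tf.Variable(tf.truncated_normal(shape=[2]), dtype=tf.float32)

with tf.device("/job:worker/task:0/cpu:0"):   # explicitly the worker's CPU
    b = tf.Variable(tf.truncated_normal(shape=[2]), dtype=tf.float32)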

local variable

Need to create local copies of variables based on the ones that exist on the ps.

Local Optimizer

There is an issue with using Adagrad instead of SGD at the worker level: the second worker never starts.

ADAG Naming

Hi,

This is a very nice write-up! I'm the author of ADAG; I recently changed the name of the optimizer to AGN (Accumulated Gradient Normalization), which is described in full here: https://arxiv.org/abs/1710.02368

Just to keep everything consistent :)

Joeri

No way to feed different mini-batch data to the graph for local optimizer to update local variables in the DOWNPOUR example

The example is brilliant and inspires me a lot.
However, it didn't show how to feed data to the graph.
In real practice, I find it impossible to feed a different mini-batch for the local optimizer to compute gradients at each local step within a window. Here's what I mean:
It's easy to feed data to the graph with
session.run([opt, ...], feed_dict={...})
I think that every time session.run(..., feed_dict={...}) is called, the graph uses the same feed_dict to compute gradients with the local optimizer window times. The gradients still turn out to be different, because the previous gradients have already been applied to the local variables. However, the feed_dict is the same batch, which is processed window times and then summed up into one set of gradients to apply to the global variables.
I notice that this is different from the pseudocode described in the paper "Large Scale Distributed Deep Networks", and I quote:

Algorithm 7.1: DownpourSGDClient(α, n_fetch, n_push)

procedure StartAsynchronouslyFetchingParameters(parameters)
    parameters ← GetParametersFromParamServer()

procedure StartAsynchronouslyPushingGradients(accrued_gradients)
    SendGradientsToParamServer(accrued_gradients)
    accrued_gradients ← 0

main
    global parameters, accrued_gradients
    step ← 0
    accrued_gradients ← 0
    while true do
        if (step mod n_fetch) == 0 then
            StartAsynchronouslyFetchingParameters(parameters)
        **data ← GetNextMinibatch()**
        gradient ← ComputeGradient(parameters, data)
        accrued_gradients ← accrued_gradients + gradient
        parameters ← parameters − α ∗ gradient
        if (step mod n_push) == 0 then
            StartAsynchronouslyPushingGradients(accrued_gradients)
        step ← step + 1

In the pseudocode, a mini-batch is fetched before computing gradients on every step, regardless of whether that step is the one that pushes gradients to the global variables; that is why I bolded the data ← GetNextMinibatch() line: each local step uses a different mini-batch.

Is there a way to achieve this?
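One possible pattern that matches the pseudocode (a fresh mini-batch on every local step) is to run the local update once per session call inside a Python loop, and only pull from and push to the global variables every n_fetch / n_push steps. Below is a self-contained, single-process toy sketch of that idea; it is illustrative only and not the repo's DOWNPOUR code (the "global" variable would live on the ps in a real setup):

import numpy as np
import tensorflow as tf

w_global = tf.Variable(0.0)                       # stands in for the ps copy
w_local = tf.Variable(0.0)                        # worker's local copy
accrued = tf.Variable(0.0, trainable=False)       # accumulated gradients

x = tf.placeholder(tf.float32, shape=[None])      # one mini-batch per run
loss = tf.square(w_local - tf.reduce_mean(x))
grad = tf.gradients(loss, [w_local])[0]

# One local step: accumulate the gradient and apply it to the local variable.
local_step_op = tf.group(tf.assign_add(accrued, grad),
                         tf.assign_sub(w_local, 0.1 * grad))
pull_global = tf.assign(w_local, w_global)
apply_accrued = tf.assign_sub(w_global, 0.1 * accrued)
with tf.control_dependencies([apply_accrued]):
    push_global = tf.assign(accrued, 0.0)         # reset after pushing

n_fetch = n_push = 5
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(50):
        if step % n_fetch == 0:
            sess.run(pull_global)                 # parameters <- "param server"
        batch = (np.random.randn(8) + 1.0).astype(np.float32)  # fresh mini-batch
        sess.run(local_step_op, feed_dict={x: batch})           # one local update
        if step % n_push == 0:
            sess.run(push_global)                 # send accrued gradients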
