Comments (13)
Can you try:
def apply_gradients(_):
  optimizer.apply_gradients(zip([g + 0 for g in temp_grads], agent.trainable_variables))
or
def apply_gradients(_):
  optimizer.apply_gradients(zip([g.read_value() for g in temp_grads], agent.trainable_variables))
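For context, here is a minimal self-contained sketch of the situation these two forms address (the ON_READ accumulator setup, shapes, and tiny model are assumptions for illustration, not taken from the seed_rl learner): both g + 0 and g.read_value() turn the SyncOnRead variable into a plain per-replica tensor before it ever reaches apply_gradients.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
  # Hypothetical gradient accumulators, mirroring the ON_READ setup the
  # thread discusses; the weights and shapes are made up.
  weights = [tf.Variable(tf.ones([4]))]
  temp_grads = [
      tf.Variable(
          tf.zeros([4]),
          trainable=False,
          synchronization=tf.VariableSynchronization.ON_READ,
          aggregation=tf.VariableAggregation.NONE)
  ]
  optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

@tf.function
def apply_gradients(_):
  # Reading the variable yields a per-replica tensor, so apply_gradients
  # never has to convert the ON_READ variable itself.
  optimizer.apply_gradients(
      zip([g.read_value() for g in temp_grads], weights))

# strategy.run was called experimental_run_v2 in the TF version used in this thread.
strategy.run(apply_gradients, args=(None,))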
@brieyla1 After applying the gradient fix from above:
optimizer.apply_gradients(zip([g.read_value() for g in temp_grads], agent.trainable_variables))
the following should suffice to run multi-GPU with inference on a separate device:
device_name = any_gpu[0].name if any_gpu else '/device:CPU:0'
num_gpus = len(any_gpu)
if num_gpus < 2:
  # A single GPU or the CPU handles both inference and training.
  strategy = tf.distribute.OneDeviceStrategy(device=device_name)
elif num_gpus == 2:
  # One GPU for inference, one for training.
  # Perhaps not the wisest choice; it may be better to mingle them, so benchmark if in doubt.
  strategy = tf.distribute.OneDeviceStrategy(device=any_gpu[1].name)
else:
  # One GPU for inference, the rest data-parallel for training.
  strategy = tf.distribute.MirroredStrategy(devices=[d.name for d in any_gpu[1:]])
As a replacement for:
Lines 102 to 103 in 5f07ba2
EDIT: having looked through the code, I also believe one should rename num_training_tpus to num_training_devices and set it accordingly for GPUs as well, since it seems to play a role in other places, such as splitting the training batch between devices.
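If that rename is made, the batch-splitting part could look roughly like this (a sketch under the assumption that the learner divides the global batch evenly across training replicas; training_strategy and batch_size are stand-in names, not the actual seed_rl identifiers):

import tensorflow as tf

training_strategy = tf.distribute.MirroredStrategy()  # or OneDeviceStrategy
batch_size = 64  # hypothetical global training batch size

# Assumed rename: num_training_tpus -> num_training_devices, derived from
# whatever strategy ends up doing the training.
num_training_devices = training_strategy.num_replicas_in_sync
assert batch_size % num_training_devices == 0, (
    'The global batch size must divide evenly across training devices.')
per_device_batch_size = batch_size // num_training_devices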
@lespeholt Could you please explain why the problem with gradients occurs?
One needs to do something similar to this part of the code to run with multiple GPUs.
Lines 42 to 52 in eff7aaa
Can you show me the exact change you made to the code?
Here is my change:
Da-Capo@a05ca9b
If I use temp_grads, I get a ValueError, but clip_grads works well, so I wonder whether clip_grads is synchronized correctly.
def apply_gradients(_):
  optimizer.apply_gradients(zip(temp_grads, agent.trainable_variables))
Could not convert from `tf.VariableAggregation` VariableAggregation.NONE to `tf.distribute.ReduceOp` type
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 468, in _apply_op_helper
    preferred_dtype=default_dtype)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1314, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1381, in _tensor_conversion_sync_on_read
    return var._dense_var_to_tensor(dtype=dtype, name=name, as_ref=as_ref)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1371, in _dense_var_to_tensor
    self.get(), dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 322, in get
    return self._get_cross_replica()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1346, in _get_cross_replica
    reduce_util.ReduceOp.from_variable_aggregation(self.aggregation),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/reduce_util.py", line 50, in from_variable_aggregation
    "`tf.distribute.ReduceOp` type" % aggregation)
ValueError: Could not convert from `tf.VariableAggregation` VariableAggregation.NONE to `tf.distribute.ReduceOp` type
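For anyone hitting the same traceback, here is a minimal sketch (assumed setup, not the seed_rl code) of what triggers it: converting a SyncOnRead variable whose aggregation is VariableAggregation.NONE to a tensor in cross-replica context has no ReduceOp to map to, while reading it inside the replica function is fine.

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
  # Non-trainable ON_READ variable with no aggregation, like the
  # gradient accumulators discussed above.
  v = tf.Variable(
      1.0,
      trainable=False,
      synchronization=tf.VariableSynchronization.ON_READ,
      aggregation=tf.VariableAggregation.NONE)

# Inside the replica function (strategy.run / experimental_run_v2) the
# local value can be read without any cross-device reduction.
print(strategy.run(lambda: v.read_value()))

# Outside the replica function, converting the variable to a tensor needs
# an aggregation, and NONE cannot be mapped to a tf.distribute.ReduceOp,
# which is what the ValueError above complains about (at least on the
# TF 2.1 build shown in the traceback).
try:
  tf.convert_to_tensor(v)
except ValueError as e:
  print(e)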
One needs to do something similar to this part of the code to run with multiple GPUs.
Lines 42 to 52 in eff7aaa
@lespeholt I tried a similar thing to train on multiple GPUs. It turns out that with MirroredStrategy we can't split the devices into inference devices and training devices as with TPUs. Do you think this is the correct behaviour? By the way, it works when using a single strategy on all the devices.
temp_grad2 = [g + 0 for g in temp_grads]
Because temp_grads use tf.VariableSynchronization.ON_READ, this operation triggers an on-read event, so temp_grad2 should be the aggregated value across the replicas on the different devices. But in fact temp_grad2 has the same value as temp_grads when I inspect it with tf.print. Why?
Thank you for your answer @lespeholt
Within experimental_run_v2 this shouldn't trigger a synchronization; it only does so outside of experimental_run_v2.
"we can't split the devices into inference devices and training devices as with TPU": what do you mean by "we can't"? What problems are you running into?
Within experimental_run_v2 this shouldn't trigger a synchronization; it only does so outside of experimental_run_v2.
"we can't split the devices into inference devices and training devices as with TPU": what do you mean by "we can't"? What problems are you running into?
This code
Line 540 in eff7aaa
Sorry, it's still not clear to me what you mean by "not working". That line should work just fine with multiple GPUs.
Sorry, it's still not clear to me what you mean by "not working". That line should work just fine with multiple GPUs.
Indeed, I was not clear. That line works fine with GPUs. The issue I encountered is when I tried to do a similar configuration as defined here
Line 41 in 135e561
where the TPUs are grouped into inference and training.
I started by declaring two different MirroredStrategy instances, but that does not seem to be the right way to do this.
By the way, I have an example of multi-GPU training with a single MirroredStrategy that works well, and I hope I can share it soon.
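Until that example is shared, a rough sketch of the "single strategy for everything" layout described above might look like this (assumed structure, placeholder model and optimizer, not jrabary's actual code):

import tensorflow as tf

# All visible GPUs go into one MirroredStrategy, used for both inference
# and training, instead of splitting devices the way the TPU path does.
gpus = tf.config.experimental.list_logical_devices('GPU')
if gpus:
  strategy = tf.distribute.MirroredStrategy(devices=[d.name for d in gpus])
else:
  strategy = tf.distribute.OneDeviceStrategy(device='/device:CPU:0')

with strategy.scope():
  agent = tf.keras.Sequential([tf.keras.layers.Dense(4)])
  optimizer = tf.keras.optimizers.Adam(1e-4)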
@jrabary
Is there any update on the multi-GPU inference + training strategies?
Do you have any example of getting it to work correctly?
I'm running into a few issues myself and I'd love to see a working version of it.
I want to run the code on multiple machines with multiple GPUs, and I would like to know how the code should be changed.
My attempt: use tf.distribute.experimental.MultiWorkerMirroredStrategy()
On 10.17.8.112, in ~/common/utils.py:
import os, json
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': [' 10.17.8.112:6000', '10.17.8.109:6000'],
    },
    'task': {
        'type': 'worker',
        'index': 0
    }
})
multi_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.NCCL)
..........
def init_learner_multi_host(num_training_tpus: int):
  .......
  else:
    strategy = multi_strategy
    enc = lambda x: x
    dec = lambda x, s=None: x if s is None else tf.nest.pack_sequence_as(s, x)
    return MultiHostSettings(
        strategy, [('/cpu', [device_name])], strategy, enc, dec)
On 10.17.8.109, in ~/common/utils.py:
import os, json
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': [' 10.17.8.112:6000', '10.17.8.109:6000'],
    },
    'task': {
        'type': 'worker',
        'index': 1
    }
})
multi_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.NCCL)
..........
def init_learner_multi_host(num_training_tpus: int):
  .......
  else:
    strategy = multi_strategy
    enc = lambda x: x
    dec = lambda x, s=None: x if s is None else tf.nest.pack_sequence_as(s, x)
    return MultiHostSettings(
        strategy, [('/cpu', [device_name])], strategy, enc, dec)
Error:
Unknown: Could not start gRPC server.
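Two things worth ruling out here (possible causes only, not a confirmed diagnosis): the worker address ' 10.17.8.112:6000' in the TF_CONFIG above contains a leading space, and "Could not start gRPC server" also appears when the port is already bound on that host. A small pre-flight sketch that checks both, with made-up helper logic:

import json
import os
import socket

# Strip stray whitespace from the worker addresses before the strategy
# tries to start its gRPC server on this task's address.
tf_config = json.loads(os.environ['TF_CONFIG'])
tf_config['cluster']['worker'] = [
    addr.strip() for addr in tf_config['cluster']['worker']]
os.environ['TF_CONFIG'] = json.dumps(tf_config)

# Check that nothing is already listening on this task's port.
host, port = tf_config['cluster']['worker'][tf_config['task']['index']].split(':')
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
  if s.connect_ex((host, int(port))) == 0:
    raise RuntimeError('%s:%s is already in use' % (host, port))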
# Version: multi-host, multi-GPU: CPU for inference, GPUs for training.
Other details:
import os, json
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['10.17.8.109:6001', '10.17.8.112:6001']},
    'task': {'type': 'worker', 'index': 0}
})
def init_learner_multi_host(num_training_tpus: int):
  ...
  multi_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
  ...
  tf.device('/cpu').__enter__()
  device_name = '/device:CPU:0'
  strategy1 = tf.distribute.OneDeviceStrategy(device=device_name)
  strategy2 = multi_strategy
  enc = lambda x: x
  dec = lambda x, s=None: x if s is None else tf.nest.pack_sequence_as(s, x)
  return MultiHostSettings(strategy1, [('/cpu', [device_name])], strategy2, enc, dec)
# This works!
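Presumably, as in the earlier TF_CONFIG snippets, the second host (10.17.8.112) runs the same code with only the task index changed:

# On 10.17.8.112: identical cluster definition, task index 1.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['10.17.8.109:6001', '10.17.8.112:6001']},
    'task': {'type': 'worker', 'index': 1}
})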