
Comments (13)

lespeholt commented on May 29, 2024

Can you try:

def apply_gradients(_):
  optimizer.apply_gradients(zip([g + 0 for g in temp_grads], agent.trainable_variables))

or

def apply_gradients(_):
  optimizer.apply_gradients(zip([g.read_value() for g in temp_grads], agent.trainable_variables))


Antymon commented on May 29, 2024

@brieyla1 With the gradient fix from above applied:

optimizer.apply_gradients(zip([g.read_value() for g in temp_grads], agent.trainable_variables))

the following should suffice to run multi-GPU with inference on a separate device:

    device_name = any_gpu[0].name if any_gpu else '/device:CPU:0'

    num_gpus = len(any_gpu)
    if num_gpus < 2:
      # a single GPU or the CPU handles both inference and training
      strategy = tf.distribute.OneDeviceStrategy(device=device_name)
    elif num_gpus == 2:
      # one GPU for inference, one for training
      # perhaps not the wisest choice and better to mingle; benchmark if in doubt
      strategy = tf.distribute.OneDeviceStrategy(device=any_gpu[1].name)
    else:
      # one GPU for inference, rest DataParallel for training
      strategy = tf.distribute.MirroredStrategy(devices=[d.name for d in any_gpu[1:]])

As a replacement for:

seed_rl/common/utils.py

Lines 102 to 103 in 5f07ba2

device_name = '/device:GPU:0' if any_gpu else '/device:CPU:0'
strategy = tf.distribute.OneDeviceStrategy(device=device_name)

EDIT: after looking through the code, I also believe one should rename num_training_tpus to num_training_devices and set it accordingly for GPUs, since it seems to play a role in other places, such as splitting the training batch between devices.
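As a rough illustration of that last point (a sketch of mine, not repo code): under a MirroredStrategy the per-replica batch is the global batch divided by the number of replicas, so the count of training devices matters wherever the batch is split.

import tensorflow as tf

# Illustration only: with N replicas in sync, a global batch is split into N
# per-replica batches, so a flag counting training devices in general (not
# just TPU cores) is what the split depends on.
strategy = tf.distribute.MirroredStrategy()  # all visible GPUs, or the CPU
global_batch_size = 64
per_replica_batch = global_batch_size // strategy.num_replicas_in_sync

dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([1024, 8]))
dataset = dataset.batch(global_batch_size, drop_remainder=True)
dist_dataset = strategy.experimental_distribute_dataset(dataset)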

@lespeholt Could you please explain why the problem with gradients occurs?


lespeholt commented on May 29, 2024

One needs to do something similar to this part of the code to run with multiple GPUs.

seed_rl/common/utils.py

Lines 42 to 52 in eff7aaa

if tf.config.experimental.list_logical_devices('TPU'):
  resolver = tf.distribute.cluster_resolver.TPUClusterResolver('')
  topology = tf.tpu.experimental.initialize_tpu_system(resolver)
  strategy = tf.distribute.experimental.TPUStrategy(resolver)
  training_da = tf.tpu.experimental.DeviceAssignment.build(
      topology, num_replicas=num_training_tpus)
  training_strategy = tf.distribute.experimental.TPUStrategy(
      resolver, device_assignment=training_da)
  inference_devices = list(set(strategy.extended.worker_devices) -
                           set(training_strategy.extended.worker_devices))
  return Settings(strategy, inference_devices, training_strategy, tpu_encode,
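As an illustration only (a sketch, not code from the repo), a GPU analogue of that branch might reserve one GPU for inference and mirror training over the rest; whether such a split behaves well on GPUs is exactly what is discussed further below.

import tensorflow as tf

# Untested sketch of a possible GPU analogue: the first GPU serves inference,
# the remaining GPUs form the training strategy.
gpus = tf.config.experimental.list_logical_devices('GPU')
if len(gpus) >= 2:
  strategy = tf.distribute.MirroredStrategy(devices=[g.name for g in gpus])
  inference_devices = [gpus[0].name]
  training_strategy = tf.distribute.MirroredStrategy(
      devices=[g.name for g in gpus[1:]])
  # then e.g.: return Settings(strategy, inference_devices, training_strategy, ...)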

Can you show me the exact change you did to the code?


Da-Capo commented on May 29, 2024

Here is my change:
Da-Capo@a05ca9b
If I use temp_grads directly, I get a ValueError, but clip_grads works well, so I wonder whether clip_grads is synchronized correctly.

def apply_gradients(_):
  optimizer.apply_gradients(zip(temp_grads, agent.trainable_variables))
Could not convert from `tf.VariableAggregation` VariableAggregation.NONE to `tf.distribute.ReduceOp` type
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 468, in _apply_op_helper
    preferred_dtype=default_dtype)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1314, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1381, in _tensor_conversion_sync_on_read
    return var._dense_var_to_tensor(dtype=dtype, name=name, as_ref=as_ref)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1371, in _dense_var_to_tensor
    self.get(), dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 322, in get
    return self._get_cross_replica()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1346, in _get_cross_replica
    reduce_util.ReduceOp.from_variable_aggregation(self.aggregation),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/reduce_util.py", line 50, in from_variable_aggregation
    "`tf.distribute.ReduceOp` type" % aggregation)
ValueError: Could not convert from `tf.VariableAggregation` VariableAggregation.NONE to `tf.distribute.ReduceOp` type
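For reference, the variable setup behind this error can be sketched as follows (an illustration of mine, assuming temp_grads are ON_READ variables left with the default NONE aggregation, which is what the traceback indicates):

import tensorflow as tf

# Illustration only, not seed_rl code. A buffer created with ON_READ
# synchronization keeps the default aggregation NONE. Converting such a
# variable to a tensor in cross-replica context (the path shown in the
# traceback) requires a ReduceOp, and NONE has no corresponding ReduceOp,
# hence the ValueError. Reading the value inside the replica function
# (g + 0 or g.read_value()) avoids that conversion path.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  temp_grad = tf.Variable(
      tf.zeros([4]),
      trainable=False,
      synchronization=tf.VariableSynchronization.ON_READ)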


jrabary commented on May 29, 2024

One needs to do something similar to this part of the code to run with multiple GPUs.


@lespeholt I tried something similar to train on multiple GPUs. It turns out that with MirroredStrategy we can't split the devices into inference devices and training devices as with TPU. Do you think this is the correct behaviour? By the way, it works when using a single strategy that uses all the devices.
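For reference, a rough sketch (mine, not repo code) of that single-strategy variant, reusing the fields of the Settings tuple shown earlier (strategy, inference_devices, training_strategy, ...):

import tensorflow as tf

# Sketch only: one MirroredStrategy over all GPUs serves as both the
# inference and the training strategy, so no devices are reserved.
gpus = tf.config.experimental.list_logical_devices('GPU')
if gpus:
  strategy = tf.distribute.MirroredStrategy(devices=[g.name for g in gpus])
  inference_devices = [g.name for g in gpus]
else:
  strategy = tf.distribute.OneDeviceStrategy(device='/device:CPU:0')
  inference_devices = ['/device:CPU:0']
training_strategy = strategy  # training shares every device with inference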


1576012404 commented on May 29, 2024

temp_grad2 = [g + 0 for g in temp_grads]

Because temp_grads use tf.VariableSynchronization.ON_READ, this operation triggers the on-read path, so temp_grad2 should be the aggregated value of all replicas on the different devices. But in fact temp_grad2 has the same value as temp_grads when I check with tf.print. Why?
Thank you for your answer @lespeholt


lespeholt commented on May 29, 2024

Within experimental_run_v2 this shouldn't trigger a synchronization; it only does so outside of experimental_run_v2.
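A small illustration of the difference (my sketch, not seed_rl code; experimental_run_v2 is strategy.run in newer TF versions):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
  # A SyncOnRead variable with SUM aggregation, similar in spirit to the
  # temp gradient buffers discussed in this thread.
  v = tf.Variable(0., trainable=False,
                  synchronization=tf.VariableSynchronization.ON_READ,
                  aggregation=tf.VariableAggregation.SUM)

def replica_fn():
  v.assign_add(1.)
  return v + 0  # read inside the replica fn: the local value, no synchronization

per_replica = strategy.experimental_run_v2(replica_fn)
aggregated = v + 0  # read outside: converting the variable triggers the SUM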

"we can't split the devices into inference devices and training devices as with TPU" what do you mean with "we can't"? what problems are you running into?


jrabary commented on May 29, 2024

Within experimental_run_v2 this shouldn't trigger a synchronization; it only does so outside of experimental_run_v2.

"we can't split the devices into inference devices and training devices as with TPU" what do you mean with "we can't"? what problems are you running into?

This code:

branch_index = inference_iteration.assign_add(1) % len(inference_devices)

seems not to work with MirroredStrategy on several GPUs if you split your GPUs into inference devices and training devices as with TPU.
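For context, a minimal sketch (mine, not the repo's exact code) of the round-robin dispatch pattern that line implements:

import tensorflow as tf

# Round-robin dispatch over inference devices: a counter is bumped on every
# call and taken modulo the number of devices. CPU stand-ins keep the sketch
# runnable on any machine.
inference_devices = ['/device:CPU:0', '/device:CPU:0']
inference_iteration = tf.Variable(-1, dtype=tf.int64, trainable=False)

def run_inference(x, device):
  with tf.device(device):
    return x * 2.  # stand-in for running the agent's inference on `device`

def dispatch(x):
  branch_index = tf.cast(
      inference_iteration.assign_add(1) % len(inference_devices), tf.int32)
  return tf.switch_case(
      branch_index,
      [lambda d=d: run_inference(x, d) for d in inference_devices])

out = dispatch(tf.constant([1., 2.]))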


lespeholt commented on May 29, 2024

Sorry, it's still not clear to me what you mean by "not working". That line should work just fine with multiple GPUs.


jrabary commented on May 29, 2024

Sorry, it's still not clear to me what you mean by "not working". That line should work just fine with multiple GPUs.

Indeed, I was not clear. That line works fine with GPUs. The issue I encountered is when I tried to do a similar configuration to the one defined here:

def init_learner(num_training_tpus):

where the TPUs are split into inference and training groups.
I started by declaring two different MirroredStrategy instances, but that doesn't seem to be the right way to do it.
By the way, I have an example of multi-GPU training with a single MirroredStrategy that works well, and I hope I can share it soon.


brieyla1 commented on May 29, 2024

@jrabary
Is there any update on the multi-GPU inference + training strategies?
Do you have any example of getting it to work correctly?

I'm running into a few issues myself and I'd love to see a working version of it.


giantvision commented on May 29, 2024

I want to run the code on multiple machines with multiple GPUs, and I want to know how to adapt the code for this.
My attempt: use tf.distribute.experimental.MultiWorkerMirroredStrategy()

10.17.8.112 ~/common/utils.py

import os, json
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': [' 10.17.8.112:6000', '10.17.8.109:6000'],
    },
    'task': {
        'type': 'worker',
        'index': 0
    }
})

multi_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.NCCL)

..........

def init_learner_multi_host(num_training_tpus: int):

.......

  else:
    strategy = multi_strategy
    enc = lambda x: x
    dec = lambda x, s=None: x if s is None else tf.nest.pack_sequence_as(s, x)
    return MultiHostSettings(
        strategy, [('/cpu', [device_name])], strategy, enc, dec)

10.17.8.109 ~/common/utils.py

import os, json
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': [' 10.17.8.112:6000', '10.17.8.109:6000'],
    },
    'task': {
        'type': 'worker',
        'index': 1
    }
})

multi_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.NCCL)

..........

def init_learner_multi_host(num_training_tpus: int):

.......

  else:
    strategy = multi_strategy
    enc = lambda x: x
    dec = lambda x, s=None: x if s is None else tf.nest.pack_sequence_as(s, x)
    return MultiHostSettings(
        strategy, [('/cpu', [device_name])], strategy, enc, dec)

error:
Unknown: Could not start gRPC server.


giantvision commented on May 29, 2024


Version: multi-host, multi-GPU: CPU for inference, GPU for training.

Other details:

import os, json
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['10.17.8.109:6001', '10.17.8.112:6001']},
    'task': {'type': 'worker', 'index': 0}
})

def init_learner_multi_host(num_training_tpus: int):
  ...
  multi_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
  ...

  tf.device('/cpu').enter()
  device_name = '/device:CPU:0'
  strategy1 = tf.distribute.OneDeviceStrategy(device=device_name)
  strategy2 = multi_strategy
  enc = lambda x: x
  dec = lambda x, s=None: x if s is None else tf.nest.pack_sequence_as(s, x)
  return MultiHostSettings(strategy1, [('/cpu', [device_name])], strategy2, enc, dec)

# This works!
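Presumably the second host uses the same cluster spec with the task index set to 1 (an assumption, following the pattern from the earlier comment):

import os, json

# Assumed config for the other worker: identical cluster spec, index 1.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['10.17.8.109:6001', '10.17.8.112:6001']},
    'task': {'type': 'worker', 'index': 1}
})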

