
Comments (13)

lespeholt commented on May 29, 2024

Can you try:

def apply_gradients(_):
  optimizer.apply_gradients(zip([g + 0 for g in temp_grads], agent.trainable_variables))

or

def apply_gradients(_):
  optimizer.apply_gradients(zip([g.read_value() for g in temp_grads], agent.trainable_variables))


Antymon commented on May 29, 2024

@brieyla1 With the gradient fix from above applied:

optimizer.apply_gradients(zip([g.read_value() for g in temp_grads], agent.trainable_variables))

the following should suffice to run multi-GPU with inference on a separate device:

    device_name = any_gpu[0].name if any_gpu else '/device:CPU:0'

    num_gpus = len(any_gpu)
    if num_gpus < 2:
      # a single GPU or the CPU handles both inference and training
      strategy = tf.distribute.OneDeviceStrategy(device=device_name)
    elif num_gpus == 2:
      # one GPU for inference, one for training
      # perhaps not the wisest choice and better to mingle; benchmark if in doubt
      strategy = tf.distribute.OneDeviceStrategy(device=any_gpu[1].name)
    else:
      # one GPU for inference, rest DataParallel for training
      strategy = tf.distribute.MirroredStrategy(devices=[d.name for d in any_gpu[1:]])

As a replacement for:

seed_rl/common/utils.py

Lines 102 to 103 in 5f07ba2

device_name = '/device:GPU:0' if any_gpu else '/device:CPU:0'
strategy = tf.distribute.OneDeviceStrategy(device=device_name)

EDIT: after looking through the code, I also believe one should rename num_training_tpus to num_training_devices and set it accordingly for GPUs, since it seems to play a role in other places, such as splitting the training batch between devices.
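As a rough illustration of that last point (a sketch of mine, not repo code): under a MirroredStrategy the per-replica batch is the global batch divided by the number of replicas, so the count of training devices matters wherever the batch is split.

import tensorflow as tf

# Illustration only: with N replicas in sync, a global batch is split into N
# per-replica batches, so a flag counting training devices in general (not
# just TPU cores) is what the split depends on.
strategy = tf.distribute.MirroredStrategy()  # all visible GPUs, or the CPU
global_batch_size = 64
per_replica_batch = global_batch_size // strategy.num_replicas_in_sync

dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([1024, 8]))
dataset = dataset.batch(global_batch_size, drop_remainder=True)
dist_dataset = strategy.experimental_distribute_dataset(dataset)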

@lespeholt Could you please explain why the problem with gradients occurs?


lespeholt commented on May 29, 2024

One needs to do something similar to this part of the code to run with multiple GPUs.

seed_rl/common/utils.py

Lines 42 to 52 in eff7aaa

if tf.config.experimental.list_logical_devices('TPU'):
  resolver = tf.distribute.cluster_resolver.TPUClusterResolver('')
  topology = tf.tpu.experimental.initialize_tpu_system(resolver)
  strategy = tf.distribute.experimental.TPUStrategy(resolver)
  training_da = tf.tpu.experimental.DeviceAssignment.build(
      topology, num_replicas=num_training_tpus)
  training_strategy = tf.distribute.experimental.TPUStrategy(
      resolver, device_assignment=training_da)
  inference_devices = list(set(strategy.extended.worker_devices) -
                           set(training_strategy.extended.worker_devices))
  return Settings(strategy, inference_devices, training_strategy, tpu_encode,
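As an illustration only (a sketch, not code from the repo), a GPU analogue of that branch might reserve one GPU for inference and mirror training over the rest; whether such a split behaves well on GPUs is exactly what is discussed further below.

import tensorflow as tf

# Untested sketch of a possible GPU analogue: the first GPU serves inference,
# the remaining GPUs form the training strategy.
gpus = tf.config.experimental.list_logical_devices('GPU')
if len(gpus) >= 2:
  strategy = tf.distribute.MirroredStrategy(devices=[g.name for g in gpus])
  inference_devices = [gpus[0].name]
  training_strategy = tf.distribute.MirroredStrategy(
      devices=[g.name for g in gpus[1:]])
  # then e.g.: return Settings(strategy, inference_devices, training_strategy, ...)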

Can you show me the exact change you did to the code?


Da-Capo commented on May 29, 2024

Here is my change:
Da-Capo@a05ca9b
If I use temp_grads directly, I get a ValueError, but clip_grads works well, so I wonder whether clip_grads is synchronized correctly.

def apply_gradients(_):
  optimizer.apply_gradients(zip(temp_grads, agent.trainable_variables))
Could not convert from `tf.VariableAggregation` VariableAggregation.NONE to `tf.distribute.ReduceOp` type
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 468, in _apply_op_helper
    preferred_dtype=default_dtype)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1314, in convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1381, in _tensor_conversion_sync_on_read
    return var._dense_var_to_tensor(dtype=dtype, name=name, as_ref=as_ref)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1371, in _dense_var_to_tensor
    self.get(), dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 322, in get
    return self._get_cross_replica()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/values.py", line 1346, in _get_cross_replica
    reduce_util.ReduceOp.from_variable_aggregation(self.aggregation),
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/distribute/reduce_util.py", line 50, in from_variable_aggregation
    "`tf.distribute.ReduceOp` type" % aggregation)
ValueError: Could not convert from `tf.VariableAggregation` VariableAggregation.NONE to `tf.distribute.ReduceOp` type
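For reference, the variable setup behind this error can be sketched as follows (an illustration of mine, assuming temp_grads are ON_READ variables left with the default NONE aggregation, which is what the traceback indicates):

import tensorflow as tf

# Illustration only, not seed_rl code. A buffer created with ON_READ
# synchronization keeps the default aggregation NONE. Converting such a
# variable to a tensor in cross-replica context (the path shown in the
# traceback) requires a ReduceOp, and NONE has no corresponding ReduceOp,
# hence the ValueError. Reading the value inside the replica function
# (g + 0 or g.read_value()) avoids that conversion path.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  temp_grad = tf.Variable(
      tf.zeros([4]),
      trainable=False,
      synchronization=tf.VariableSynchronization.ON_READ)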


jrabary commented on May 29, 2024

One needs to do something similar to this part of the code to run with multiple GPUs.


@lespeholt I tried something similar to train on multiple GPUs. It turns out that with MirroredStrategy we can't split the devices into inference devices and training devices as with TPU. Do you think this is the correct behaviour? By the way, it works when using a single strategy that uses all the devices.
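For reference, a rough sketch (mine, not repo code) of that single-strategy variant, reusing the fields of the Settings tuple shown earlier (strategy, inference_devices, training_strategy, ...):

import tensorflow as tf

# Sketch only: one MirroredStrategy over all GPUs serves as both the
# inference and the training strategy, so no devices are reserved.
gpus = tf.config.experimental.list_logical_devices('GPU')
if gpus:
  strategy = tf.distribute.MirroredStrategy(devices=[g.name for g in gpus])
  inference_devices = [g.name for g in gpus]
else:
  strategy = tf.distribute.OneDeviceStrategy(device='/device:CPU:0')
  inference_devices = ['/device:CPU:0']
training_strategy = strategy  # training shares every device with inference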


1576012404 commented on May 29, 2024

temp_grad2 = [g + 0 for g in temp_grads]

Because temp_grads use tf.VariableSynchronization.ON_READ, this operation triggers the on-read path, so temp_grad2 should be the aggregated value of all replicas on the different devices. But in fact temp_grad2 has the same value as temp_grads when I check with tf.print. Why?
Thank you for your answer @lespeholt


lespeholt commented on May 29, 2024

Within experimental_run_v2 this shouldn't trigger a synchronization; it only does so outside of experimental_run_v2.
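A small illustration of the difference (my sketch, not seed_rl code; experimental_run_v2 is strategy.run in newer TF versions):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
  # A SyncOnRead variable with SUM aggregation, similar in spirit to the
  # temp gradient buffers discussed in this thread.
  v = tf.Variable(0., trainable=False,
                  synchronization=tf.VariableSynchronization.ON_READ,
                  aggregation=tf.VariableAggregation.SUM)

def replica_fn():
  v.assign_add(1.)
  return v + 0  # read inside the replica fn: the local value, no synchronization

per_replica = strategy.experimental_run_v2(replica_fn)
aggregated = v + 0  # read outside: converting the variable triggers the SUM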

"we can't split the devices into inference devices and training devices as with TPU" what do you mean with "we can't"? what problems are you running into?


jrabary commented on May 29, 2024

Within experimental_run_v2 this shouldn't trigger a synchronization; it only does so outside of experimental_run_v2.

"we can't split the devices into inference devices and training devices as with TPU" what do you mean with "we can't"? what problems are you running into?

This code:

branch_index = inference_iteration.assign_add(1) % len(inference_devices)

seems not to work with MirroredStrategy on several GPUs if you split your GPUs into inference devices and training devices as with TPU.
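For context, a minimal sketch (mine, not the repo's exact code) of the round-robin dispatch pattern that line implements:

import tensorflow as tf

# Round-robin dispatch over inference devices: a counter is bumped on every
# call and taken modulo the number of devices. CPU stand-ins keep the sketch
# runnable on any machine.
inference_devices = ['/device:CPU:0', '/device:CPU:0']
inference_iteration = tf.Variable(-1, dtype=tf.int64, trainable=False)

def run_inference(x, device):
  with tf.device(device):
    return x * 2.  # stand-in for running the agent's inference on `device`

def dispatch(x):
  branch_index = tf.cast(
      inference_iteration.assign_add(1) % len(inference_devices), tf.int32)
  return tf.switch_case(
      branch_index,
      [lambda d=d: run_inference(x, d) for d in inference_devices])

out = dispatch(tf.constant([1., 2.]))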


lespeholt commented on May 29, 2024

Sorry, it's still not clear to me what you mean by "not working". That line should work just fine with multiple GPUs.


jrabary commented on May 29, 2024

Sorry, it's still not clear to me what you mean by "not working". That line should work just fine with multiple GPUs.

Indeed, I was not clear. That line works fine with GPUs. The issue I encountered is when I tried to do a similar configuration to the one defined here:

def init_learner(num_training_tpus):

where the TPUs are split into inference and training groups.
I started by declaring two different MirroredStrategy instances, but that doesn't seem to be the right way to do it.
By the way, I have an example of multi-GPU training with a single MirroredStrategy that works well, and I hope I can share it soon.


brieyla1 commented on May 29, 2024

@jrabary
Is there any update on the multi-GPU inference + training strategies?
Do you have any example of getting it to work correctly?

I'm running into a few issues myself and I'd love to see a working version of it.


giantvision commented on May 29, 2024

I want to run the code on multiple machines with multiple GPUs, and I want to know how to adapt the code for this.
My attempt: use tf.distribute.experimental.MultiWorkerMirroredStrategy()

10.17.8.112 ~/common/utils.py

import os, json
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': [' 10.17.8.112:6000', '10.17.8.109:6000'],
    },
    'task': {
        'type': 'worker',
        'index': 0
    }
})

multi_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.NCCL)

..........

def init_learner_multi_host(num_training_tpus: int):

.......

  else:
    strategy = multi_strategy
    enc = lambda x: x
    dec = lambda x, s=None: x if s is None else tf.nest.pack_sequence_as(s, x)
    return MultiHostSettings(
        strategy, [('/cpu', [device_name])], strategy, enc, dec)

10.17.8.109 ~/common/utils.py

import os, json
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'worker': [' 10.17.8.112:6000', '10.17.8.109:6000'],
    },
    'task': {
        'type': 'worker',
        'index': 1
    }
})

multi_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.NCCL)

..........

def init_learner_multi_host(num_training_tpus: int):

.......

  else:
    strategy = multi_strategy
    enc = lambda x: x
    dec = lambda x, s=None: x if s is None else tf.nest.pack_sequence_as(s, x)
    return MultiHostSettings(
        strategy, [('/cpu', [device_name])], strategy, enc, dec)

error:
Unknown: Could not start gRPC server.


giantvision commented on May 29, 2024


Version: multi-host, multi-GPU: CPU for inference, GPU for training.

Other details:

import os, json
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['10.17.8.109:6001', '10.17.8.112:6001']},
    'task': {'type': 'worker', 'index': 0}
})

def init_learner_multi_host(num_training_tpus: int):
  ...
  multi_strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
  ...

  tf.device('/cpu').enter()
  device_name = '/device:CPU:0'
  strategy1 = tf.distribute.OneDeviceStrategy(device=device_name)
  strategy2 = multi_strategy
  enc = lambda x: x
  dec = lambda x, s=None: x if s is None else tf.nest.pack_sequence_as(s, x)
  return MultiHostSettings(strategy1, [('/cpu', [device_name])], strategy2, enc, dec)

# This works!
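Presumably the second host uses the same cluster spec with the task index set to 1 (an assumption, following the pattern from the earlier comment):

import os, json

# Assumed config for the other worker: identical cluster spec, index 1.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['10.17.8.109:6001', '10.17.8.112:6001']},
    'task': {'type': 'worker', 'index': 1}
})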

