Comments (14)
Multi-GPU training is supported via the option use_horovod. It uses Horovod; please refer to the Horovod documentation for how to set this up, and see the RETURNN code for further related options. This is currently experimental.
from returnn.
First of all, thank you for such detailed framework implementation.
I tried running the 2018-asr-attention code with use_horovod=True in the config and executed it with mpirun as suggested in Horovod's documentation, but I ran into a horovod_signal_error. I installed Horovod with OpenMPI 3.0.0, CUDA 9.0, and the flag HOROVOD_GPU_REDUCE=NCCL.
Following is my script to run:
mpirun -np $1 -H localhost:$1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python returnn/rnn.py returnn_horovod.config
Any suggestions will be great!
I don't know what you mean by horovod_signal_error. Can you post the full error? Have you searched whether other people ran into the same problem?
Try removing the -H localhost:$1; you should not need it.
Also, why are you using Python 2 (via python)? Use Python 3 instead (python3).
Any update here?
I'm closing this now. Please reopen if you still have problems.
Also maybe see #73.
Hi,
I am trying to launch the returnn-experiments/2018-asr-attention example on LibriSpeech data with multiple GPUs. To that end, I tried different configs:
use_tensorflow = True
task = "train"
device = "gpu"
use_horovod = True
and run
mpirun -np $1 -H localhost:$1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python returnn/rnn.py returnn_horovod.config
It fails with an error:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation GetDeviceAttr_1: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:XLA_GPU:0' because no supported kernel for XLA_GPU devices is available.
Registered kernels:
device='GPU'
device='CPU'
[[node GetDeviceAttr_1 (defined at <string>:40) = GetDeviceAttr[_device="/job:localhost/replica:0/task:0/device:XLA_GPU:0"]()]]
The config for which it does not fail is
device = "GPU"
but in this case the training is launched on the CPUs.
I am using commit b3117f9.
Is there something wrong with the way I am doing the configuration?
Thanks
Can you try with the latest version?
Thanks! With the latest version I can make it run.
I have a question though: in the setup below I do not see any speedup from using multiple GPUs. Do you think it might be because of error messages like [0d4622c7aab4:12389] Read -1, expected 218592, errno = 1? Or is this expected behavior?
Many thanks for your support.
My setup:
Horovod 0.15.2, RETURNN 08b5591, CUDA 9, NCCL 2.4
Some Observations:
Device | np | Comment |
---|---|---|
gpu0,gpu1 | 3 | AssertionError: invalid device specified: gpu0,gpu1 |
gpu0,gpu1 | 2 | AssertionError: invalid device specified: gpu0,gpu1 |
gpu0,gpu1 | 1 | AssertionError: invalid device specified: gpu0,gpu1 |
[gpu0,gpu1] | 2 | AssertionError: multiple devices not supported yet for TF |
[gpu0,gpu1] | 1 | AssertionError: multiple devices not supported yet for TF |
gpu | 1 | Training starts |
gpu | 2 | Training starts but lot of messages like - [0d4622c7aab4:12389] Read -1, expected 218592, errno = 1 |
gpu | 3 | Training starts - messages like the above also seen |
Stats
np=2
Stats:
mem_usage:GPU:0: Stats(mean=3.0GB, std_dev=44.1MB, min=1.4GB, max=3.1GB, num_seqs=2277, avg_data_len=1)
pretrain epoch 7, finished after 2277 steps, 0:59:08 elapsed (53.4% computing time)
Stats:
mem_usage:GPU:0: Stats(mean=3.0GB, std_dev=41.8MB, min=1.4GB, max=3.1GB, num_seqs=2277, avg_data_len=1)
pretrain epoch 7, finished after 2277 steps, 0:59:08 elapsed (53.4% computing time)
np=1
Stats:
mem_usage:GPU:0: Stats(mean=3.0GB, std_dev=51.8MB, min=1.4GB, max=3.0GB, num_seqs=4560, avg_data_len=1)
pretrain epoch 6, finished after 4560 steps, 0:54:39 elapsed (98.7% computing time)
The command I use to launch:
mpirun --allow-run-as-root -np 2 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python3 returnn/rnn.py cased2_asr_multigpu.config
What do you mean by Device in the table? The config option device? That should just be set as device="gpu" (not a list or anything else).
What do you mean by np in the table? The number of GPUs?
The messages you mention (Read -1, expected 218592) are, I think, not from RETURNN, so they come from Horovod, TF, MPI, or something else. Have you checked Google?
What is your Horovod reduce type?
What is your dataset? Try HDFDataset, maybe with cache_size=0.
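For illustration, a config fragment along these lines (the file name is a placeholder and the exact dataset options are an assumption; only HDFDataset and cache_size=0 come from the suggestion above):

```python
# Hypothetical RETURNN config fragment: read training data from an HDF file.
# "train.hdf" is a placeholder; cache_size = 0 disables the in-memory cache.
train = {
    "class": "HDFDataset",
    "files": ["train.hdf"],
    "cache_size": 0,
}
```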
In the log you should find 95% computing time or something like that at the end of an epoch. What number do you see there? It should be >=90%, otherwise something is wrong.
Also see #73.
Hi,
Thanks for your answer.
Yes, Device means the config option device.
np in the above table is the np command-line argument to mpirun.
I still have to dig deeper into the Read errors...
The config option horovod_reduce_type is param.
I saw that you recently made some changes to HDFDataset.py, so I pulled them in WP. I am using your latest version, 1d56fe8. The code is here: 83621a7.
I am using a modified clone of the LibriSpeechCorpus dataset. When I change the base class of LibriSpeechCorpus from CachedDataset2 to HDFDataset, I run into errors:
File "/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/GeneratingDataset.py", line 2307, in init_seq_order
line: super(LibriWipoEUCorpus, self).init_seq_order(epoch=epoch, seq_list=seq_list)
locals:
super = <builtin> <class 'super'>
LibriWipoEUCorpus = <global> <class 'GeneratingDataset.LibriWipoEUCorpus'>
self = <local> <LibriWipoEUCorpus 'dev' epoch=None>
init_seq_order = <not found>
epoch = <local> None
seq_list = <local> None
File "/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/CachedDataset.py", line 75, in init_seq_order
line: seq_index = self.get_seq_order_for_epoch(epoch, self._num_seqs, lambda s: self._seq_lengths[s][0])
locals:
seq_index = <not found>
self = <local> <LibriWipoEUCorpus 'dev' epoch=None>
self.get_seq_order_for_epoch = <local> <bound method Dataset.get_seq_order_for_epoch of <LibriWipoEUCorpus 'dev' epoch=None>>
epoch = <local> None
self._num_seqs = <local> 0
s = <not found>
self._seq_lengths = <local> array([], shape=(0, 0), dtype=float64), len = 0
File "/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/Dataset.py", line 275, in get_seq_order_for_epoch
line: assert num_seqs > 0
locals:
num_seqs = <local> 0
AssertionError
I am trying to figure out what is going wrong.
Did you test this particular Dataset on your end?
And yes, one question: the new data loading mechanism that you mention in #73, is that what you have implemented with HDFDataset?
Thanks
No, HDFDataset is not another base class which you can use instead of CachedDataset2. It is a complete dataset implementation on its own. There is the tool tools/hdf_dump.py to convert any dataset (e.g. your LibriSpeechCorpus) to an HDF file, which can then be read by HDFDataset.
I am sorry if it is a silly question, but I seem to have a problem: when I try to run hdf_dump, I get the following error:
File "/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/HDFDataset.py", line 1029, in dump_from_dataset
line: shape += dataset.get_data_shape(data_key)
locals:
shape = <local> [8507]
dataset = <local> <LibriWipoEUCorpus 'train' epoch=1>
dataset.get_data_shape = <local> <bound method Dataset.get_data_shape of <LibriWipoEUCorpus 'train' epoch=1>>
data_key = <local> 'raw'
File "/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/Dataset.py", line 631, in get_data_shape
line: if self.num_outputs[key][1] <= 1:
locals:
self = <local> <LibriWipoEUCorpus 'train' epoch=1>
self.num_outputs = <local> {'raw': {'shape': (), 'dtype': 'string'}, 'classes': [10323, 1], 'data': [40, 2]}
key = <local> 'raw'
KeyError: 1
And as you can see in self.num_outputs = <local> {'raw': {'shape': (), 'dtype': 'string'}, 'classes': [10323, 1], 'data': [40, 2]}, the value of the raw key of self.num_outputs is not consistent with the values of the classes and data keys. Do you think it is a problem in my config file?
Hm, yes, you are right. No, this is not a problem with your config file. It comes from LibriSpeechCorpus, which sets it like this:
self.num_outputs = {
"data": [self.num_inputs, 2], "classes": [self.targets.num_labels, 1], "raw": {"dtype": "string", "shape": ()}}
I'm not really sure at this point whether it's wrong to do this (i.e. there is lots of code which does not expect self.num_outputs to be like this, and LibriSpeechCorpus must be fixed somehow), or whether Dataset.get_data_shape is kind of an exception (i.e. Dataset.get_data_shape should be fixed). I will maybe try to fix this later. Maybe you can make a separate GitHub issue about this.
As a simple workaround, you can maybe just remove that "raw" entry in self.num_outputs there. Probably you don't need it. And I think you also need to remove the "raw" handling from LibriSpeechCorpus._collect_single_seq.
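To illustrate the mismatch: Dataset.get_data_shape indexes the num_outputs entry with [1] (the line `if self.num_outputs[key][1] <= 1:` in the traceback), which works for the list-style entries but raises KeyError: 1 on the dict-style "raw" entry. A minimal standalone reproduction (not actual RETURNN code):

```python
# num_outputs as LibriSpeechCorpus sets it (values copied from the traceback above)
num_outputs = {
    "data": [40, 2],
    "classes": [10323, 1],
    "raw": {"dtype": "string", "shape": ()},
}

# The list-style entries support positional indexing, as Dataset.get_data_shape expects:
assert num_outputs["classes"][1] == 1

# The dict-style "raw" entry does not: [1] is a dict lookup for a key named 1,
# which does not exist, hence the KeyError: 1 seen above.
try:
    num_outputs["raw"][1]
except KeyError as exc:
    print("KeyError:", exc)  # prints: KeyError: 1
```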
Sure, I created another issue for this: #143.
I will try your suggestion and see how it goes.
Thanks!