
Comments (14)

albertz commented on July 17, 2024

Multi-GPU training is supported via the option use_horovod, which uses Horovod. Please refer to the Horovod documentation for how to set this up, and see the RETURNN code for further related options. This is currently experimental.
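
For reference, a minimal sketch of the relevant config options, as they appear in the configs discussed below (a real setup additionally needs the network and dataset definitions, which are omitted here):

use_tensorflow = True
task = "train"
device = "gpu"
use_horovod = True  # enables Horovod-based multi-GPU training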

modernAlcibiades commented on July 17, 2024

First of all, thank you for such a detailed framework implementation.
I tried running the 2018-asr-attention code with use_horovod=True in the config and launched it with mpirun as suggested in Horovod's documentation, but I run into a horovod_signal_error. I installed Horovod with OpenMPI 3.0.0, CUDA 9.0, and the flag HOROVOD_GPU_REDUCE=NCCL.

Here is the command I use to run it:
mpirun -np $1 -H localhost:$1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python returnn/rnn.py returnn_horovod.config
Any suggestions would be great!

albertz commented on July 17, 2024

I don't know what you mean by horovod_signal_error. Can you post the full error? Have you searched whether other people have run into the same problem?
Try removing the -H localhost:$1; you should not need that.
Also, why are you using Python 2 (via python)? Use Python 3 instead (python3).

albertz commented on July 17, 2024

Any update here?

albertz commented on July 17, 2024

I'm closing this now. Please reopen if you still have problems.
Also maybe see #73.

akshatdewan commented on July 17, 2024

Hi,
I am trying to launch the returnn-experiments/2018-asr-attention example on the LibriSpeech data with multiple GPUs. To that end, I tried different configs:

use_tensorflow = True
task = "train"
device = "gpu"
use_horovod = True

and run

mpirun -np $1 -H localhost:$1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python returnn/rnn.py returnn_horovod.config

It fails with an error:

InvalidArgumentError (see above for traceback): Cannot assign a device for operation GetDeviceAttr_1: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:XLA_GPU:0' because no supported kernel for XLA_GPU devices is available.
Registered kernels:
  device='GPU'
  device='CPU'

         [[node GetDeviceAttr_1 (defined at <string>:40)  = GetDeviceAttr[_device="/job:localhost/replica:0/task:0/device:XLA_GPU:0"]()]]

The config for which it does not fail is

device = "GPU"

but in this case the training is launched on the CPU.

I am using commit b3117f9.

Is there something wrong with the way I am doing the configuration?

Thanks

albertz commented on July 17, 2024

Can you try with the latest version?

akshatdewan commented on July 17, 2024

Thanks! With the latest version I can make it run.

I have a question though: in the setup below, I do not see any speedup from using multiple GPUs. Do you think it might be because of error messages like [0d4622c7aab4:12389] Read -1, expected 218592, errno = 1? Or is this expected behavior?

Many thanks for your support.

My setup:
Horovod 0.15.2, RETURNN 08b5591, CUDA 9, NCCL 2.4
Some Observations:

Device       np   Comment
gpu0,gpu1    3    AssertionError: invalid device specified: gpu0,gpu1
gpu0,gpu1    2    AssertionError: invalid device specified: gpu0,gpu1
gpu0,gpu1    1    AssertionError: invalid device specified: gpu0,gpu1
[gpu0,gpu1]  2    AssertionError: multiple devices not supported yet for TF
[gpu0,gpu1]  1    AssertionError: multiple devices not supported yet for TF
gpu          1    Training starts
gpu          2    Training starts, but many messages like: [0d4622c7aab4:12389] Read -1, expected 218592, errno = 1
gpu          3    Training starts; messages like the above also seen

Stats with np=2 (one block per worker):

Stats:
mem_usage:GPU:0: Stats(mean=3.0GB, std_dev=44.1MB, min=1.4GB, max=3.1GB, num_seqs=2277, avg_data_len=1)
pretrain epoch 7, finished after 2277 steps, 0:59:08 elapsed (53.4% computing time)
Stats:
mem_usage:GPU:0: Stats(mean=3.0GB, std_dev=41.8MB, min=1.4GB, max=3.1GB, num_seqs=2277, avg_data_len=1)
pretrain epoch 7, finished after 2277 steps, 0:59:08 elapsed (53.4% computing time)

Stats with np=1:

Stats:
mem_usage:GPU:0: Stats(mean=3.0GB, std_dev=51.8MB, min=1.4GB, max=3.0GB, num_seqs=4560, avg_data_len=1)
pretrain epoch 6, finished after 4560 steps, 0:54:39 elapsed (98.7% computing time)

The command I use to launch:

mpirun --allow-run-as-root -np 2 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python3 returnn/rnn.py cased2_asr_multigpu.config

albertz commented on July 17, 2024

What do you mean by Device in the table? The config option device? That should just be set as device = "gpu" (not a list or anything else).

What do you mean by np in the table? The number of GPUs?

The messages you mention (Read -1, expected 218592) are, I think, not from RETURNN; they come from Horovod, TF, MPI, or something else. Have you checked Google?

What is your Horovod reduce type?

What is your dataset? Try HDFDataset, maybe with cache_size=0; see the sketch below.
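
For reference, a minimal sketch of how the train dataset could then be defined in the config (dict-style definition; train.hdf is a placeholder file name, and the exact name of the cache option should be double-checked against the HDFDataset/CachedDataset code):

train = {"class": "HDFDataset", "files": ["train.hdf"], "cache_byte_size": 0}  # 0 disables the dataset cache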

In the log, at the end of an epoch, you should find something like 95% computing time. What number do you see there? It should be >= 90%, otherwise something is wrong.

Also see #73.

akshatdewan commented on July 17, 2024

Hi,

Thanks for your answer.

Yes, Device means the config option device.

np in the table above is the -np command-line argument to mpirun.

I still have to dig deeper into the Read errors...

The config option horovod_reduce_type is set to param, as in the line below.
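
For reference, as a config line (option name and value as stated in this thread):

horovod_reduce_type = "param"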

I saw that you recently made some changes to HDFDataset.py, so I pulled them in. I am using your latest version 1d56fe8. The code is here: 83621a7.

I am using a modified clone of the LibriSpeechCorpus dataset. When I change the base class of LibriSpeechCorpus from CachedDataset2 to HDFDataset, I run into errors:

  File "/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/GeneratingDataset.py", line 2307, in init_seq_order
    line: super(LibriWipoEUCorpus, self).init_seq_order(epoch=epoch, seq_list=seq_list)
    locals:
      super = <builtin> <class 'super'>
      LibriWipoEUCorpus = <global> <class 'GeneratingDataset.LibriWipoEUCorpus'>
      self = <local> <LibriWipoEUCorpus 'dev' epoch=None>
      init_seq_order = <not found>
      epoch = <local> None
      seq_list = <local> None
  File "/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/CachedDataset.py", line 75, in init_seq_order
    line: seq_index = self.get_seq_order_for_epoch(epoch, self._num_seqs, lambda s: self._seq_lengths[s][0])
    locals:
      seq_index = <not found>
      self = <local> <LibriWipoEUCorpus 'dev' epoch=None>
      self.get_seq_order_for_epoch = <local> <bound method Dataset.get_seq_order_for_epoch of <LibriWipoEUCorpus 'dev' epoch=None>>
      epoch = <local> None
      self._num_seqs = <local> 0
      s = <not found>
      self._seq_lengths = <local> array([], shape=(0, 0), dtype=float64), len = 0
  File "/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/Dataset.py", line 275, in get_seq_order_for_epoch
    line: assert num_seqs > 0
    locals:
      num_seqs = <local> 0
AssertionError

I am trying to figure out what is going wrong.

Did you test this particular Dataset on your end?

And one more question: is the new data-loading mechanism that you mention in #73 what you have implemented with HDFDataset?
Thanks

albertz commented on July 17, 2024

No, HDFDataset is not another base class which you can use instead of CachedDataset2. It is a complete dataset implementation on its own. There is the tool tools/hdf_dump.py to convert any dataset (e.g. your LibriSpeechCorpus) to an HDF file, which can then be read by HDFDataset.
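
For reference, the conversion might be invoked roughly like this (my_setup.config is a placeholder for a config that defines the source dataset; check tools/hdf_dump.py --help for the actual arguments):

python3 returnn/tools/hdf_dump.py my_setup.config train.hdf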

akshatdewan commented on July 17, 2024

I am sorry if it is a silly question, but I seem to have a problem: when I try to run hdf_dump, I get the following error:

  File "/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/HDFDataset.py", line 1029, in dump_from_dataset
    line: shape += dataset.get_data_shape(data_key)
    locals:
      shape = <local> [8507]
      dataset = <local> <LibriWipoEUCorpus 'train' epoch=1>
      dataset.get_data_shape = <local> <bound method Dataset.get_data_shape of <LibriWipoEUCorpus 'train' epoch=1>>
      data_key = <local> 'raw'
  File "/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/Dataset.py", line 631, in get_data_shape
    line: if self.num_outputs[key][1] <= 1:
    locals:
      self = <local> <LibriWipoEUCorpus 'train' epoch=1>
      self.num_outputs = <local> {'raw': {'shape': (), 'dtype': 'string'}, 'classes': [10323, 1], 'data': [40, 2]}
      key = <local> 'raw'
KeyError: 1

As you can see in self.num_outputs = {'raw': {'shape': (), 'dtype': 'string'}, 'classes': [10323, 1], 'data': [40, 2]}, the value of the raw key is not consistent with the values of the classes and data keys. Do you think it is a problem in my config file?

albertz commented on July 17, 2024

Hm, yes, you are right. No, this is not a problem with your config file. It comes from LibriSpeechCorpus, which sets it like this:

self.num_outputs = {
  "data": [self.num_inputs, 2], "classes": [self.targets.num_labels, 1], "raw": {"dtype": "string", "shape": ()}}

I'm not really sure at this point whether it's wrong to set it like this (there is lots of code which does not expect self.num_outputs to look like this, i.e. LibriSpeechCorpus would have to be fixed somehow), or whether Dataset.get_data_shape is the exception (i.e. Dataset.get_data_shape should be fixed). I will maybe try to fix this later. Maybe you can open a separate GitHub issue about this.

As a simple workaround, you can maybe just remove that "raw" entry from self.num_outputs there; you probably don't need it. I think you also need to remove the "raw" handling from LibriSpeechCorpus._collect_single_seq. See the sketch below.
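
That is, the assignment shown above would become something like this (a sketch of the suggested workaround, not a tested fix; the "raw" handling in _collect_single_seq must be removed separately):

self.num_outputs = {
  "data": [self.num_inputs, 2], "classes": [self.targets.num_labels, 1]}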

akshatdewan commented on July 17, 2024

Sure, I created another issue for this: #143.
I will try your suggestion and see how it goes.
Thanks!
