Comments (14)
Multi-GPU training is supported via the option use_horovod. It uses Horovod; please refer to the Horovod documentation for how to set this up, and see the RETURNN code for further related options. This is currently experimental.
from returnn.
First of all, thank you for such detailed framework implementation.
I tried running the 2018-asr-attention code with use_horovod=True in the config and executed it with mpirun as suggested in Horovod's documentation, but I ran into a horovod_signal_error. I installed Horovod with OpenMPI 3.0.0, CUDA 9.0, and the flag HOROVOD_GPU_REDUCE=NCCL.
Following is my script to run:
mpirun -np $1 -H localhost:$1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python returnn/rnn.py returnn_horovod.config
Any suggestions will be great!
I don't know what you mean by horovod_signal_error. Can you post the full error? Have you searched whether other people ran into the same problem?
Try removing the -H localhost:$1; you should not need it.
Also, why are you using Python 2 (via python)? Use Python 3 instead (python3).
Any update here?
I'm closing this now. Please reopen if you still have problems.
Also maybe see #73.
Hi,
I am trying to launch the returnn-experiments/2018-asr-attention example on LibriSpeech data with multiple GPUs. To that end, I tried different configs:
use_tensorflow = True
task = "train"
device = "gpu"
use_horovod = True
and run
mpirun -np $1 -H localhost:$1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python returnn/rnn.py returnn_horovod.config
It fails with an error:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation GetDeviceAttr_1: Could not satisfy explicit device specification '/job:localhost/replica:0/task:0/device:XLA_GPU:0' because no supported kernel for XLA_GPU devices is available.
Registered kernels:
device='GPU'
device='CPU'
[[node GetDeviceAttr_1 (defined at <string>:40) = GetDeviceAttr[_device="/job:localhost/replica:0/task:0/device:XLA_GPU:0"]()]]
The config for which it does not fail is
device = "GPU"
but in this case the training is launched on the CPUs.
I am using commit b3117f9.
Is there something wrong with the way I am doing the configuration?
Thanks
Can you try with the latest version?
Thanks! With the latest version I can make it run.
I have a question though: in the setup below I do not see any speedup from using multiple GPUs. Do you think it might be because of error messages like [0d4622c7aab4:12389] Read -1, expected 218592, errno = 1? Or is this expected behavior?
Many thanks for your support.
My setup:
Horovod 0.15.2, RETURNN 08b5591, CUDA 9, NCCL 2.4
Some Observations:
Device | np | Comment |
---|---|---|
gpu0,gpu1 | 3 | AssertionError: invalid device specified: gpu0,gpu1 |
gpu0,gpu1 | 2 | AssertionError: invalid device specified: gpu0,gpu1 |
gpu0,gpu1 | 1 | AssertionError: invalid device specified: gpu0,gpu1 |
[gpu0,gpu1] | 2 | AssertionError: multiple devices not supported yet for TF |
[gpu0,gpu1] | 1 | AssertionError: multiple devices not supported yet for TF |
gpu | 1 | Training starts |
gpu | 2 | Training starts but lot of messages like - [0d4622c7aab4:12389] Read -1, expected 218592, errno = 1 |
gpu | 3 | Training starts - messages like the above also seen |
Stats
np=2
Stats:
mem_usage:GPU:0: Stats(mean=3.0GB, std_dev=44.1MB, min=1.4GB, max=3.1GB, num_seqs=2277, avg_data_len=1)
pretrain epoch 7, finished after 2277 steps, 0:59:08 elapsed (53.4% computing time)
Stats:
mem_usage:GPU:0: Stats(mean=3.0GB, std_dev=41.8MB, min=1.4GB, max=3.1GB, num_seqs=2277, avg_data_len=1)
pretrain epoch 7, finished after 2277 steps, 0:59:08 elapsed (53.4% computing time)
np=1
Stats:
mem_usage:GPU:0: Stats(mean=3.0GB, std_dev=51.8MB, min=1.4GB, max=3.0GB, num_seqs=4560, avg_data_len=1)
pretrain epoch 6, finished after 4560 steps, 0:54:39 elapsed (98.7% computing time)
The command I use to launch:
mpirun --allow-run-as-root -np 2 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python3 returnn/rnn.py cased2_asr_multigpu.config
What do you mean by Device in the table? The config option device? That should just be set as device="gpu" (not a list or anything else).
What do you mean by np in the table? The number of GPUs?
The messages you mention (Read -1, expected 218592) are, I think, not from RETURNN, so they come from Horovod, TF, MPI, or something else. Have you checked Google?
What is your Horovod reduce type?
What is your dataset? Try HDFDataset, maybe with cache_size=0.
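For illustration, a config fragment along these lines (the file name is a placeholder and the exact dataset options are an assumption; only HDFDataset and cache_size=0 come from the suggestion above):

```python
# Hypothetical RETURNN config fragment: read training data from an HDF file.
# "train.hdf" is a placeholder; cache_size = 0 disables the in-memory cache.
train = {
    "class": "HDFDataset",
    "files": ["train.hdf"],
    "cache_size": 0,
}
```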
In the log you should find 95% computing time or something like that at the end of an epoch. What number do you see there? It should be >=90%, otherwise something is wrong.
Also see #73.
Hi,
Thanks for your answer.
Yes, Device means the config option device.
np in the above table is the np command-line argument to mpirun.
I still have to dig deeper into the Read errors...
The config option horovod_reduce_type is param.
I saw that you recently made some changes to HDFDataset.py, so I pulled them in WP. I am using your latest version, 1d56fe8. The code is here: 83621a7.
I am using a modified clone of the LibriSpeechCorpus dataset. When I change the base class of LibriSpeechCorpus from CachedDataset2 to HDFDataset, I run into errors:
File "/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/GeneratingDataset.py", line 2307, in init_seq_order
line: super(LibriWipoEUCorpus, self).init_seq_order(epoch=epoch, seq_list=seq_list)
locals:
super = <builtin> <class 'super'>
LibriWipoEUCorpus = <global> <class 'GeneratingDataset.LibriWipoEUCorpus'>
self = <local> <LibriWipoEUCorpus 'dev' epoch=None>
init_seq_order = <not found>
epoch = <local> None
seq_list = <local> None
File "/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/CachedDataset.py", line 75, in init_seq_order
line: seq_index = self.get_seq_order_for_epoch(epoch, self._num_seqs, lambda s: self._seq_lengths[s][0])
locals:
seq_index = <not found>
self = <local> <LibriWipoEUCorpus 'dev' epoch=None>
self.get_seq_order_for_epoch = <local> <bound method Dataset.get_seq_order_for_epoch of <LibriWipoEUCorpus 'dev' epoch=None>>
epoch = <local> None
self._num_seqs = <local> 0
s = <not found>
self._seq_lengths = <local> array([], shape=(0, 0), dtype=float64), len = 0
File "/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/Dataset.py", line 275, in get_seq_order_for_epoch
line: assert num_seqs > 0
locals:
num_seqs = <local> 0
AssertionError
I am trying to figure out what is going wrong.
Did you test this particular Dataset on your end?
And yes, one question: the new data loading mechanism that you mention in #73, is that what you have implemented with HDFDataset?
Thanks
No, HDFDataset is not another base class which you can use instead of CachedDataset2. It is a complete dataset implementation on its own. There is the tool tools/hdf_dump.py to convert any dataset (e.g. your LibriSpeechCorpus) to an HDF file, which can then be read by HDFDataset.
I am sorry if it is a silly question, but I seem to have a problem: when I try to run hdf_dump, I get the following error:
File "/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/HDFDataset.py", line 1029, in dump_from_dataset
line: shape += dataset.get_data_shape(data_key)
locals:
shape = <local> [8507]
dataset = <local> <LibriWipoEUCorpus 'train' epoch=1>
dataset.get_data_shape = <local> <bound method Dataset.get_data_shape of <LibriWipoEUCorpus 'train' epoch=1>>
data_key = <local> 'raw'
File "/returnn-experiments/2018-asr-attention/librispeech/full-setup-attention/returnn/Dataset.py", line 631, in get_data_shape
line: if self.num_outputs[key][1] <= 1:
locals:
self = <local> <LibriWipoEUCorpus 'train' epoch=1>
self.num_outputs = <local> {'raw': {'shape': (), 'dtype': 'string'}, 'classes': [10323, 1], 'data': [40, 2]}
key = <local> 'raw'
KeyError: 1
And as you can see in self.num_outputs = <local> {'raw': {'shape': (), 'dtype': 'string'}, 'classes': [10323, 1], 'data': [40, 2]}, the value of the raw key of self.num_outputs is not consistent with the values of the classes and data keys. Do you think it is a problem in my config file?
Hm, yes, you are right. No, this is not a problem with your config file. It comes from LibriSpeechCorpus, which sets it like this:
self.num_outputs = {
"data": [self.num_inputs, 2], "classes": [self.targets.num_labels, 1], "raw": {"dtype": "string", "shape": ()}}
I'm not really sure at this point whether it's wrong to do this (i.e. there is lots of code which does not expect self.num_outputs to be like this, and LibriSpeechCorpus must be fixed somehow), or whether Dataset.get_data_shape is kind of an exception (i.e. Dataset.get_data_shape should be fixed). I will maybe try to fix this later. Maybe you can make a separate GitHub issue about this.
As a simple workaround, you can maybe just remove that "raw" entry in self.num_outputs there. Probably you don't need it. And I think you also need to remove the "raw" handling from LibriSpeechCorpus._collect_single_seq.
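To illustrate the mismatch: Dataset.get_data_shape indexes the num_outputs entry with [1] (the line `if self.num_outputs[key][1] <= 1:` in the traceback), which works for the list-style entries but raises KeyError: 1 on the dict-style "raw" entry. A minimal standalone reproduction (not actual RETURNN code):

```python
# num_outputs as LibriSpeechCorpus sets it (values copied from the traceback above)
num_outputs = {
    "data": [40, 2],
    "classes": [10323, 1],
    "raw": {"dtype": "string", "shape": ()},
}

# The list-style entries support positional indexing, as Dataset.get_data_shape expects:
assert num_outputs["classes"][1] == 1

# The dict-style "raw" entry does not: [1] is a dict lookup for a key named 1,
# which does not exist, hence the KeyError: 1 seen above.
try:
    num_outputs["raw"][1]
except KeyError as exc:
    print("KeyError:", exc)  # prints: KeyError: 1
```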
Sure, I created another issue for this: #143.
I will try your suggestion and see how it goes.
Thanks!