Comments (6)
What do you mean by the training time is the same? Is the perplexity the same at the end of a few epochs? Or do you look at the number of words per second? The number of words per second in the log is given per GPU, so this will be the same. But the loss / perplexity should decrease much faster.
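(As a rough illustration, with made-up numbers: if each of 4 GPUs logs about 3,000 words/s, the run as a whole processes about 12,000 words/s, so a pass over the same data, and therefore the drop in loss/perplexity, should come roughly 4x sooner.)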
Yes, I made a mistake. You are right, multi-GPU training gets better valid ppl and accuracy.
Pretraining on 1 GPU: [training log screenshot]
On 4 GPUs: [training log screenshot]
Another question: in the UNMT model, is there only one encoder and one decoder? Thanks.
You should not handle --local_rank yourself. You can use the following command to train with multi-GPU: https://github.com/facebookresearch/XLM#how-can-i-run-experiments-on-multiple-gpus
export NGPU=8; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py ARGUMENTS
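For context, here is a minimal sketch of the pattern torch.distributed.launch expects (assuming PyTorch with the NCCL backend; this is not XLM's actual train.py): the launcher appends --local_rank=<i> to each process's arguments and sets the rendezvous environment variables, so the script only consumes them:

import argparse
import torch

parser = argparse.ArgumentParser()
# torch.distributed.launch injects --local_rank=<i> into each process,
# which is why you should not set it by hand.
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

if args.local_rank != -1:
    torch.cuda.set_device(args.local_rank)  # one GPU per process
    # MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are set by the launcher
    torch.distributed.init_process_group(backend="nccl", init_method="env://")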
And no, there are two separate models for UNMT: one encoder and one decoder. They are initialized with the same weights, apart from the parameters of the source attention in the decoder, which remain randomly initialized.
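As a sketch of what that initialization amounts to (a hypothetical helper, not XLM's actual reload code; the encoder_attn name is taken from XLM's Transformer implementation):

import torch

def init_from_pretrained(encoder, decoder, pretrained_path):
    # Load the pretrained (e.g. MLM) weights once and share them.
    state = torch.load(pretrained_path, map_location="cpu")

    # The encoder takes every matching parameter.
    encoder.load_state_dict(state, strict=False)

    # The decoder takes the same weights, except the source-attention
    # parameters, which are skipped and so keep their random init.
    decoder_state = {k: v for k, v in state.items() if "encoder_attn" not in k}
    decoder.load_state_dict(decoder_state, strict=False)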
I am using multi-GPU to pre-train the model with export NGPU=8; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py ARGUMENTS, but it just runs the same job on 8 GPUs: the training time is the same as on 1 GPU, so it doesn't speed up the pre-training process. How should I set the parameters to speed up training on multi-GPU?
Looks good :)