Comments (19)
Hi,
Good question!
We follow the training script in Diffusion-LM's repo script/run_train.py. The train_run.py you mentioned uses run_clm.py to train the classifier instead of LM itself.
It's true that we "split data per gpu" when we do sampling. That's because we only want to iterate each test case once and in order. However, when training, we set
shuffle=True
, which means each GPU gets a different batch of data. It functions in the same way as usingDistributedSampler
.
I understood that you implemented different data loader with shuffle=True
keyword in load_data_text
function, in diffuseq/text_datset.py
. However, this works improperly, due to reasons below.
in torch.utils.data.DataLoader
, when using shuffle=True
, DataLoader objects makes torch.utils.data.RandomSampler
instead of torch.utils.data.SequentialSampler
. In this case, without generator
keyword, RandomSampler
makes new torch.Generator()
instance, and its seed is set by torch._C._default_generator
. (see torch source code.) which is dependent on transformer.set_seed(args.seed)
function in train.py
. As a consequence, we will get same data per each process, even though using shuffle=True
keyword.
My PR intended to fix this issues. By setting only generator's seed different, other processes will go on with same seed per each processes!
from diffuseq.
I tested the outputs, and checked that current code returns same data outputs.
Since process 0 uses random generator while initializing random embedding (load_model_emb
), data of process 0 was different, however, other processes' datum were same, in output.
python -m torch.distributed.launch --nproc_per_node=4 --master_port=12233 --use_env run_train.py --diff_steps 2000 --lr 0.0001 --learning_steps 140000 --save_interval 20000 --seed 102 --noise_schedule sqrt --hidden_dim 128 --bsz 2048 --microbatch 64 --dataset dialogue --data_dir datasets/CommonsenseConversation --vocab bert --seq_len 128 --schedule_sampler lossaware --notes dialogue
- modified
train.py
"""
Train a diffusion model on images.
"""
import argparse
import json, torch, os
import numpy as np
from diffuseq.utils import dist_util, logger
from diffuseq.text_datasets import load_data_text
from diffuseq.step_sample import create_named_schedule_sampler
from basic_utils import (
load_defaults_config,
create_model_and_diffusion,
args_to_dict,
add_dict_to_argparser,
load_model_emb,
load_tokenizer
)
from train_util import TrainLoop
from transformers import set_seed
import wandb
### custom your wandb setting here ###
# os.environ["WANDB_API_KEY"] = ""
os.environ["WANDB_MODE"] = "offline"
def create_argparser():
defaults = dict()
defaults.update(load_defaults_config())
parser = argparse.ArgumentParser()
add_dict_to_argparser(parser, defaults) # update latest args according to argparse
return parser
def main():
args = create_argparser().parse_args()
set_seed(args.seed)
dist_util.setup_dist()
logger.configure()
logger.log("### Creating data loader...")
tokenizer = load_tokenizer(args)
model_weight, tokenizer = load_model_emb(args, tokenizer)
data = load_data_text(
batch_size=args.batch_size,
seq_len=args.seq_len,
data_args = args,
loaded_vocab=tokenizer,
model_emb=model_weight # use model's weights as init
)
from torch.distributed import barrier, get_rank
barrier()
print(get_rank(), next(data)[1]['input_ids'][0])
import sys
sys.exit(0)
- Outputs
1 tensor([ 101, 1045, 2219, 2019, 5641, 14007, 16419, 1012, 102, 102,
101, 1018, 7787, 1029, 4638, 1012, 3194, 1999, 2392, 1029,
4638, 1012, 8134, 1037, 18851, 1012, 2039, 22994, 2063, 1012,
102, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0])
2 tensor([ 101, 1045, 2219, 2019, 5641, 14007, 16419, 1012, 102, 102,
101, 1018, 7787, 1029, 4638, 1012, 3194, 1999, 2392, 1029,
4638, 1012, 8134, 1037, 18851, 1012, 2039, 22994, 2063, 1012,
102, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0])
0 tensor([ 101, 25742, 5092, 2497, 2034, 4836, 2385, 2086, 3283, 1010,
2021, 4268, 2024, 3652, 2039, 2007, 1996, 9476, 2157, 2085,
1012, 4268, 3065, 2031, 1037, 3109, 1997, 1037, 11142, 2166,
1012, 102, 102, 101, 2026, 2048, 2095, 2214, 2365, 7425,
2015, 25742, 5092, 2497, 1012, 102, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0])
3 tensor([ 101, 1045, 2219, 2019, 5641, 14007, 16419, 1012, 102, 102,
101, 1018, 7787, 1029, 4638, 1012, 3194, 1999, 2392, 1029,
4638, 1012, 8134, 1037, 18851, 1012, 2039, 22994, 2063, 1012,
102, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0])
from diffuseq.
@Dawn-LX
In default, torch random operation such as torch.randn
uses default random generator torch._C._default_genetator
(it's for CPU and genrrator for GPU also exists), unless we don't add generator keyword. My intend (for PR) was to make dataloader use another generator instead of default one.
Meanwhile, in funcction load_model_emb
, when local rank equals 0 function calls torch.nn.init.normal_(model.weight)
, and this is random operation which uses default generator. (Model initialization is also random process too but it's same in all processes and doesn't make gap)
from diffuseq.
Hi,
Good question!
We follow the training script in Diffusion-LM's repo script/run_train.py. The train_run.py you mentioned uses run_clm.py to train the classifier instead of LM itself.
It's true that we "split data per gpu" when we do sampling. That's because we only want to iterate each test case once and in order. However, when training, we set shuffle=True
, which means each GPU gets a different batch of data. It functions in the same way as using DistributedSampler
.
from diffuseq.
Thank you for your reply! But I am still confused.
- the script/run_train.py in Diffusion-LM's repo also doesn't use DistributedSampler, which means they also have the problem of "data is actually duplicated in each gpu"
- set
shuffle=True
has nothing to do with "each GPU gets a different batch of data".
To verfiy my point, we can turn infinite_loader
off and see how many batch iteration it actually runs.
say, if a signle GPU training script has a dataloader of 800 interations.
Then for 4 GPUs training, the dataloader (with DistributedSampler) will run 200 iterations (for the same batchsize) Note that for DistributedSampler & DistributedDataParallel, the the batchsize of dataloader is directly the batchsize on each GPU.
But with the existing multi-gpu training script, the data is duplicated in each gpu, an it will still run 800 iterations for 4 GPU training
from diffuseq.
If you want to use same seed per nodes, you can consider alternative code like below:
Step 1. Change this function to my code below.
DiffuSeq/diffuseq/text_datasets.py
Line 11 in bea43e1
def load_data_text(
batch_size,
seq_len,
deterministic=False,
data_args=None,
model_emb=None,
split='train',
loaded_vocab=None,
loop=True,
seed=None, # ADD THIS
):
training_data = get_corpus(data_args, seq_len, split=split, loaded_vocab=loaded_vocab)
dataset = TextDataset(
training_data,
data_args,
model_emb=model_emb
)
if seed is not None:
batch_generator = torch.Generator()
batch_generator.manual_seed(hash(seed) + int(os.environ.get("LOCAL_RANK", "0")))
else:
batch_generator = None
data_loader = DataLoader(
dataset,
batch_size=batch_size, # 20,
# drop_last=True,
shuffle=not deterministic,
num_workers=0,
generator=batch_generator, # ADDED
)
if loop:
return infinite_loader(data_loader)
else:
# print(data_loader)
return iter(data_loader)
Step 2. Add seed
argument in training script.
Line 44 in bea43e1
line 44~63
data = load_data_text(
batch_size=args.batch_size,
seq_len=args.seq_len,
data_args = args,
loaded_vocab=tokenizer,
model_emb=model_weight, # use model's weights as init
seed=args.seed
)
next(data)
data_valid = load_data_text(
batch_size=args.batch_size,
seq_len=args.seq_len,
data_args=args,
split='valid',
deterministic=True,
loaded_vocab=tokenizer,
model_emb=model_weight, # using the same embedding wight with tranining data
seed=args.seed
)
from diffuseq.
Hi
@kdha0727 You're right. I just tested the case when
--nproc_per_node=2
. When the number of nodes is greater than 2, the duplication problem does exist. In this situation, usingDistrubutedSampler
is a more convenient implementation solution, instead of manually setting random seed? I will fix this soon.
Yes. my method is fine for infinite loops, however, considering more general cases, DistributedSampler would be more compact solution. Thank you for reviewing!
from diffuseq.
@Dawn-LX In default, torch random operation such as
torch.randn
uses default random generatortorch._C._default_genetator
(it's for CPU and genrrator for GPU also exists), unless we don't add generator keyword. My intend (for PR) was to make dataloader use another generator instead of default one. Meanwhile, in funcctionload_model_emb
, when local rank equals 0 function callstorch.nn.init.normal_(model.weight)
, and this is random operation which uses default generator. (Model initialization is also random process too but it's same in all processes and doesn't make gap)
Thank you very much !
I get it !
from diffuseq.
@Dawn-LX It doesn't mean that process 0 use another seed. In original code all processes use same seed, however, process 0 has ONLY 1 MORE RANDOM OPERATION before RandomSampler
being initialized, and this ONLY 1 MORE RANDOM OPERATION makes different random output (for dataloader's seed) even with same seed.
from diffuseq.
Another info is that in the multi-gpu sampling script (which is recently updated) "sample_seq2seq.py"
Lines 120 to 121 in bea43e1
It has an operation to "split data per gpu".
Howerver, the training scripts ("diffuseq/text_datasets.py" or "train_util.py") do not have such operation to spilt data per gpu, and thus I conject that the data is actually duplicated in each gpu in the existing multi-gpu training script.
from diffuseq.
Another information:
The the script/run_train.py in Diffusion-LM's repo is modified based on OpenAI's improved-diffusion.
We can observe that in OpenAI's improved-diffusion, the dataloader is constructed in improved-diffusion/image_datasets.py
. Although they also not use DistributedSampler
in dataloader
, they directly split the dataset
into each gpu in load_data
by set (before construct dataloader):
shard=MPI.COMM_WORLD.Get_rank(),
num_shards=MPI.COMM_WORLD.Get_size()
refer to openai/improved-diffusion/improved_diffusion/image_datasets.py
In contrast, in Diffusion-LM's load_data_text
function, there is no such operation (the MPI
from mpi4py
is even not used at all).
refer to
Diffusion-LM/improved-diffusion/improved_diffusion/text_datasets.py
from diffuseq.
@summmeer
I will be very appreciate that if you can point out the "split data per gpu" operation (exactly which lines of code) in existing multi-gpu training script
from diffuseq.
@Dawn-LX I experienced the same problem, and solved it by implementing custom seeding function.
def seed_all(seed: "Any", deterministic: "bool" = False) -> "None":
import random
import numpy as np
import torch
if deterministic: # False in training, True in sampling
seed = hash(seed)
torch.backends.cudnn.deterministic = True # NOQA
torch.backends.cudnn.benchmark = False # NOQA
else:
seed = hash(seed) + int(os.environ.get("LOCAL_RANK", "0")) # Make seed differ by node rank
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed) # contains torch.cuda.manual_seed_all
Even though each seed per node differs, model parameters would be synchronized in initialization process refer to this. (by this line)
from diffuseq.
Hi,
@Dawn-LX In our code, if each GPU loaded different batches, for current gradiant update, it is the same with larger batch size (benefit from multi-GPU).
But with the existing multi-gpu training script, the data is duplicated in each gpu, an it will still run 800 iterations for 4 GPU training
If you want to train total 200 iters on 4 GPU, you can set iters as 50 so that 4 GPU will load 50*4 iters of the data. It maybe ture that this way is not strictly the same as DistrubutedSampler
, but on each GPU, the data is not duplicated for each batch.
@kdha0727 Thanks for your clarification, but in code, each node loads different batch of data is we wanted and this fuction is already implemented. I'm a little bit confused by your PR.
from diffuseq.
Hi
@kdha0727 You're right. I just tested the case when --nproc_per_node=2
. When the number of nodes is greater than 2, the duplication problem does exist. In this situation, using DistrubutedSampler
is a more convenient implementation solution, instead of manually setting random seed? I will fix this soon.
from diffuseq.
@summmeer @kdha0727
Thank you for both of your replies and contributions !
- By the explanation in #37 (comment), I'm now undertstand why the original code use
shuffle=True
to achieve "each node loads different batch of data", and not need to explicitly split data to each GPU. - I am also very appreciate the explanation by @kdha0727 at #37 (comment) , which shows simply set
shuffle=True
can not implement different data loader for each gpu, due to the random seed issue.
Thank both of your contributions again!, from which I learned a lot about torch's DataLoader and random seed.
from diffuseq.
Another small question (although might not necessary now),
In load_model_emb
, how (and where) does the process 0 uses (another?) random generator, another random seed different from process ,1,2,3 ?
from diffuseq.
wait, I still confused.
a) I understand that we construct Dataloader
with generator=None
, and all processed use RandomSampler
, which build new torch.Generator() instance, of which the seed is set by seed = int(torch.empty((), dtype=torch.int64).random_().item())
, where the .random_()
uses torch._C._default_genetator
, which is dependent on transformer.set_seed(args.seed)
.
b) On the other hand, process 0 calls torch.nn.init.normal_(model.weight)
, which is also uses torch._C._default_genetator
( and is dependent on transformer.set_seed(args.seed)
. )
So why process 0 has different data batch in dataloader's interation (uses different random seed) ?, In my view, since all processes use the same seed according to the description in a). Why process-0 do things in b) makes it's dataloader uses another seed ? I mean, why does the seed in RandomSampler
has something to do with ``torch.nn.init.normal_(model.weight)` ?
from diffuseq.
@Dawn-LX It doesn't mean that process 0 use another seed. In original code all processes use same seed, however, process 0 has ONLY 1 MORE RANDOM OPERATION before
RandomSampler
being initialized, and this ONLY 1 MORE RANDOM OPERATION makes different random output (for dataloader's seed) even with same seed.
@kdha0727
Thank you! Now it all makes sense to me.
That means in contrast to process 1,2,3, process-0 has different results of this line seed = int(torch.empty((), dtype=torch.int64).random_().item())
when building the generator in RandomSampler
. Because process-0 did ONLY 1 MORE RANDOM OPERATION by torch.nn.init.normal_(model.weight)
.
from diffuseq.
Related Issues (20)
- Issues with decoding and evaluation HOT 2
- Padding during training results in a "Killed"
- BERT parameter
- Try to train the model with another dataset, but get so many [UNK] token.
- a few questions about the 'MBR' decoding strategy. HOT 2
- Version of many packages
- Incorrect self-BLEU Computation
- a question about --local_rank
- Could not find a version that satisfies the requirement torch==1.9.0+cu111
- i face some promble Dataset(2) in "text_datasets.py" HOT 1
- If there is any rule to modify the parameters HOT 1
- Machine Translation Task with DiffuSeq HOT 6
- A question about the loss in V2
- Implementation of using soft absorbing state in the forward process in training. HOT 1
- ddim sampling HOT 2
- DDPM HOT 1
- train
- Where is CommonsenseConversation/test.jsonl ? When I run train. sh and then run run_decode_solver. sh or run_decode. sh, I always can't find test.jsonl HOT 2
- 'grad_norm' is NaN
- Understanding tT_loss
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from diffuseq.