shark-nlp / diffuseq

[ICLR'23] DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models

License: MIT License

Python 99.63% Shell 0.37%

diffusion-models sequence-to-sequence text-generation

diffuseq's Issues

Pretrain DiffuSeq?

Hi,

Thanks for your wonderful work!

Pretraining DiffuSeq on a large unsupervised corpus was the first thing that came to mind after reading this paper, since this is what most LMs do nowadays. The Commonsense Conversation dataset is already around 1 GB in size, so I presume DiffuSeq can scale to large datasets. May I ask why you haven't included such pretraining experiments?

Thank you!

facing an error while trying to execute pip install -r requirements.txt in terminal

When I try to set up the environment with the given command "pip install -r requirements.txt", I get the error below.

ERROR: Could not find a version that satisfies the requirement torch==1.9.0+cu111 (from versions: 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1)
ERROR: No matching distribution found for torch==1.9.0+cu111

Would removing +cu111 from 1.9.0+cu111 and keeping just torch==1.9.0 resolve the issue, or does anything else need to be done for the setup?
Please advise.
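
A note on the version string, offered as a sketch rather than an official fix: `+cu111` is a local version label in the PEP 440 sense, and public index servers such as PyPI do not accept uploads carrying such labels, which is consistent with pip listing only plain versions above. PyTorch publishes its CUDA-specific wheels on its own index, so one option is `pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html`; whether a matching wheel exists for a given OS and Python version cannot be told from the error message. The other option, dropping the label, installs whatever default build PyPI offers for 1.9.0. The helper below illustrates that second option; `strip_local_version` is a hypothetical name and the requirement lines are made up.

```python
def strip_local_version(requirement: str) -> str:
    """Drop a PEP 440 local version label such as '+cu111' from a pinned requirement."""
    name, sep, version = requirement.partition("==")
    if not sep:
        # Not pinned with '==', nothing to strip.
        return requirement.strip()
    public_version = version.split("+", 1)[0]
    return f"{name.strip()}=={public_version.strip()}"


# Made-up requirement lines for illustration only.
lines = ["torch==1.9.0+cu111", "transformers", "numpy==1.21.0"]
cleaned = [strip_local_version(line) for line in lines]
print(cleaned)
```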

Separate weights for word embedding and lm-head?

Hi, thanks for providing the code.

I have a question regarding the word embedding and lm-head. In your code, both share the same weight. I wonder if they can have separate weights?

Thanks for your help!
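
Weight tying can be illustrated without any framework. The sketch below is a toy model, not the repository's code: `TinyLM`, `embed`, `num_parameters`, and the `tie_weights` flag are hypothetical names chosen for illustration. With tying, the output projection reuses the embedding matrix; without it, a second matrix of the same shape is allocated, which doubles the parameters of those two layers.

```python
import random


class TinyLM:
    """Toy model with an embedding table and an output projection (lm-head)."""

    def __init__(self, vocab_size, dim, tie_weights=True, seed=0):
        rng = random.Random(seed)
        self.embedding = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                          for _ in range(vocab_size)]
        if tie_weights:
            # Tied: the lm-head is the very same matrix object as the embedding.
            self.lm_head = self.embedding
        else:
            # Untied: a separate matrix with the same shape.
            self.lm_head = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                            for _ in range(vocab_size)]

    def embed(self, token_id):
        return self.embedding[token_id]

    def get_logits(self, hidden):
        # One logit per vocabulary entry: dot product with each lm-head row.
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.lm_head]


def num_parameters(model):
    """Count scalar parameters, counting a shared matrix only once."""
    matrices = [model.embedding]
    if model.lm_head is not model.embedding:
        matrices.append(model.lm_head)
    return sum(len(row) for matrix in matrices for row in matrix)


tied = TinyLM(vocab_size=5, dim=4, tie_weights=True)
untied = TinyLM(vocab_size=5, dim=4, tie_weights=False)
print(num_parameters(tied), num_parameters(untied))
```

In PyTorch, the tied case is commonly written by assigning the embedding's weight parameter to the output layer's weight; using two independent layers gives the untied case.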

Size of the hidden dimension

Hey,
I was wondering if you have tested the effect of the hidden dimension on the training, and if yes, what were your findings?

Randomly Initialized embeddings?

Hi, thanks for sharing your implementations, it is really helpful and very clean to follow.

I have one question about the embedding used in your model. There are 3 options discussed in the Diffusion-LM paper: 1. fixed randomly initialized embeddings, 2. fixed embeddings initialized from a PLM (like BERT), or 3. end-to-end (E2E) trained embeddings, as in Diffusion-LM.
I did not find which one you used in your paper.

But from your code, it seems that you use randomly initialized embeddings, which is different from Diffusion-LM?
Specifically, I find that the input ids are embedded into 128-d random embeddings in the data loading process:

DiffuSeq/basic_utils.py

Line 71 in 8bfafcb

def load_model_emb(args, tokenizer):

Please correct me if I am wrong.

About Transformer Model

I see that you use self.input_transformers = BertEncoder(config) to create the Transformer model. Is the Transformer model implemented through BertEncoder?
Thanks

Typo? train.py

Hello, I might be horribly mistaken, but I may have spotted a "typo".
The first docstring in train.py says "Train a diffusion model on images."
Is that incorrect?

Problems of running time and microbatch ?

Thanks for your great work. While reproducing your results, we came up with two questions:

The training time of QQP task is much longer than 2 days (48h), which is inconsistent with the statement in this repo:

It will take around 2 days to train a DiffuSeq model on 2 NVIDIA A100 GPUs for QG and QQP

We follow the suggested setup (2 NVIDIA A100 GPUs) and run the following command:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=12233 --use_env run_train.py --diff_steps 2000 --lr 0.0001 --learning_steps 50000 --microbatch 64 --save_interval 10000 --seed 102 --noise_schedule sqrt --hidden_dim 128 --bsz 2048 --dataset qqp --data_dir {datasets/QQP} --vocab bert --seq_len 128 --schedule_sampler lossaware --notes qqp

Here is our wandb output:

as well as the log output:

Note that we have already run this code for around 15 hours but have only reached step 4960, which is far from the 50000 steps you suggest.

We took a deeper look at your code and found that during the training loop you split each batch into several microbatches.
Why do you perform this operation? Is this trick necessary?

Also, we wonder how the whole training process can be finished within 2 days with your suggested command and implementation?
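
The numbers quoted in this issue allow a rough projection. The sketch below assumes that throughput stays constant over the run and that each batch of `bsz` samples is processed in chunks of `microbatch` samples, as the issue describes; whether `bsz` is per process or global is not stated here, so the pass count is only indicative.

```python
# Values taken from the command and the report in this issue.
bsz = 2048
microbatch = 64
steps_done = 4960
hours_elapsed = 15.0
target_steps = 50000

# Forward/backward passes needed to consume one batch (ceiling division).
passes_per_step = -(-bsz // microbatch)

# Linear extrapolation, assuming constant time per step.
hours_per_step = hours_elapsed / steps_done
projected_hours = hours_per_step * target_steps
projected_days = projected_hours / 24

print(passes_per_step, round(projected_hours, 1), round(projected_days, 1))
# 32 151.2 6.3
```

Under these assumptions the run would need roughly 151 hours, about three times the 2 days quoted from the repository.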

When will the code be open-sourced?

Dataset(2) in "text_datasets.py"

Hello, thank you for your contribution to this code.

I'm confused by the code in 'text_datasets.py':

import datasets
from datasets import Dataset as Dataset2

Here, "datasets" is a folder that does not contain any class named "Dataset"?
So, on line 70:

raw_datasets = Dataset2.from_dict(sentence_lst)

cannot run because "Dataset2" does not exist.
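
A possible explanation, stated as an assumption rather than a confirmed answer: `Dataset.from_dict` is provided by the Hugging Face `datasets` library, and if that library is not installed, `import datasets` may still succeed by picking up the local data folder as a namespace package, which has no `Dataset` class. The sketch below uses only the standard library to report where a module name resolves; `describe_module` is a hypothetical helper.

```python
import importlib.util


def describe_module(name):
    """Report how a top-level module name would be resolved by the import system."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return "not importable"
    if spec.origin is None or spec.origin == "namespace":
        # A plain directory without __init__.py is treated as a namespace package.
        locations = list(spec.submodule_search_locations or [])
        return f"namespace package (plain folder) at {locations}"
    return f"regular module at {spec.origin}"


# 'json' is used as a known-good example; 'datasets' is the name from the issue.
print(describe_module("json"))
print(describe_module("datasets"))
```

If the second line reports a namespace package pointing at the repository's data folder, installing the library (for example via the repository's requirements file) would be the first thing to try.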

Does multi-GPU training actually duplicate the data on each GPU?

Hello.

I find that the DataLoader constructed in diffuseq/text_datasets.py does not use PyTorch's DistributedSampler

DiffuSeq/diffuseq/text_datasets.py

Line 47 in bea43e1

data_loader = DataLoader(

, which means the data is actually duplicated on each GPU, e.g., in the function forward_backward in train_util.py

DiffuSeq/train_util.py

Line 235 in bea43e1

def forward_backward(self, batch, cond):

i.e., each GPU is processing the same data, which makes distributed training pointless.

Is my conjecture correct?

Just FYI, the training script train_run.py in Diffusion-LM's repo uses the transformers training script run_clm.py, in which the Trainer uses a DistributedSampler.
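
For readers unfamiliar with what a distributed sampler contributes, the core idea is that each rank reads a disjoint slice of the dataset indices. The sketch below is a simplified illustration with a hypothetical helper `shard_indices`; PyTorch's `DistributedSampler` additionally shuffles per epoch and pads so that every rank receives the same number of samples, which is not modeled here. It does not settle whether the repository's loader ends up feeding different batches to each rank by other means.

```python
def shard_indices(num_samples, rank, world_size):
    """Indices of the samples that the process with this rank would read."""
    return list(range(rank, num_samples, world_size))


world_size = 4
shards = [shard_indices(10, rank, world_size) for rank in range(world_size)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Every index appears in exactly one shard, so no sample is processed twice within an epoch.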

Decoding only [UNK] token ?

Hi,
I am trying to adapt DiffuSeq to a different language. However, after training on this new language, the model continuously produces the [UNK] token.

What could be the reason? What did I do wrong? Do you have any idea about the problem? Did you experience something similar during development?

Kind regards.

On the calculation of Xt partial loss in Zt

https://github.com/Shark-NLP/DiffuSeq/blob/main/diffuseq/gaussian_diffusion.py#:~:text=terms%5B%22mse%22%5D%20%3D%20mean_flat((target%20%2D%20model_output)%20**%202)
https://github.com/Shark-NLP/DiffuSeq/blob/main/diffuseq/utils/nn.py#:~:text=def%20mean_flat(,.shape))))
Dear authors, since you keep the Xt part of Zt unchanged (it is anchored to X0 after each calculation), the Xt part contributes 0 when calculating the MSE loss of Lt-1. Will the loss result be wrong? I tried the mean_flat function, and it counts these zeros in the denominator. Thank you!
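
The difference between the two averages can be made concrete with a toy example. The sketch below uses made-up per-position squared errors and hypothetical helpers `flat_mean` and `masked_mean`; it is a one-dimensional stand-in, not the repository's `mean_flat`. The flat mean equals the masked mean scaled by the fraction of target positions, so the two differ by a per-sample factor rather than in where the gradient flows.

```python
def flat_mean(values):
    """Average over all positions, including those fixed to zero error."""
    return sum(values) / len(values)


def masked_mean(values, mask):
    """Average only over positions where mask is 1."""
    kept = sum(mask)
    return sum(v * m for v, m in zip(values, mask)) / kept


# Made-up squared errors: the first three positions belong to the anchored
# source part (error 0), the last two to the target part.
squared_error = [0.0, 0.0, 0.0, 4.0, 1.0]
target_mask = [0, 0, 0, 1, 1]

flat = flat_mean(squared_error)
masked = masked_mean(squared_error, target_mask)
print(flat, masked)
# 1.0 2.5
```

When source lengths vary across samples, that scaling factor varies too, which weights samples differently; whether this matters in practice is a question for the authors.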

Questions about "decoder_nll"

Hi, thanks for your implementation. Your code is very clean and clear. But I have one question.
In diffuseq/gaussian_diffusion.py, from line 632 to line 636, there are 2 NLLs:
decoder_nll = self._token_discrete_loss(x_start, get_logits, input_ids_x) # embedding regularization
terms["nll"] = self._token_discrete_loss(model_out_x_start, get_logits, input_ids_x, mask=input_ids_mask, truncate=True, t=t)
and I see you use the first one to compute the loss:
terms["loss"] = terms["mse"] + decoder_nll + tT_loss
So I wonder why you use decoder_nll rather than terms["nll"]. I think decoder_nll doesn't use the transformer part of the model. Can it benefit the training process?
I would appreciate it if you could answer this question.

Great work! looking forward to the code release.

Gradient accumulation

Hi,
I have yet another question, this time regarding the training process.
Why is the backward call inside the for loop rather than after all the microbatches are processed?
This looks like gradient accumulation; or what else is the purpose of the microbatches?

https://github.com/Shark-NLP/DiffuSeq/blame/9cddf4eaee82ec5930a68de377953a7b9981acc1/train_util.py#L237-273
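
Calling backward once per microbatch is the usual shape of gradient accumulation: gradients add up across calls, so only one microbatch needs to be held in memory at a time. The sketch below checks the underlying identity on a scalar linear model with made-up data. The weighting by `len(microbatch) / len(batch)` is this sketch's own choice so that the sum matches the full-batch gradient exactly; whether the repository rescales in the same way is not stated in this issue.

```python
def grad_mse(w, xs, ys):
    """Gradient with respect to w of mean((w * x - y) ** 2)."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Made-up data and parameter for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.9, 12.1]
w = 0.5
microbatch = 2

full_batch_grad = grad_mse(w, xs, ys)

accumulated_grad = 0.0
for start in range(0, len(xs), microbatch):
    micro_x = xs[start:start + microbatch]
    micro_y = ys[start:start + microbatch]
    # One "backward" per microbatch, weighted by its share of the batch.
    accumulated_grad += grad_mse(w, micro_x, micro_y) * len(micro_x) / len(xs)

print(full_batch_grad, accumulated_grad)
```

Without the weighting, the accumulated value would be the full-batch gradient multiplied by the number of microbatches, which adaptive optimizers are largely insensitive to.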

Nothing generated from decode

Hi DiffuSeq authors!

I followed the README example from training through decoding, but nothing was generated by decode.

May I know if there's some problem with my training?

In particular, I noticed this error during training:
socket.timeout: _ssl.c:1114: The handshake operation timed out

Problem of Decoding

I tried to run decode.sh on my own dataset, but it fails with the error below. What should I do to solve it?
(PS: I used --nproc_per_node=3 when training, and I tried both 1 and 3 when decoding, but neither works.)

RAM used: 2240.02 MB
RAM used: 2240.02 MB
### End of reading iteration...
  0%|                                                                                                   | 0/101 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "sample_seq2seq.py", line 210, in <module>
    main()
  File "sample_seq2seq.py", line 148, in main
    samples = sample_fn(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 448, in p_sample_loop
    for sample in self.p_sample_loop_progressive(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 517, in p_sample_loop_progressive
    out = self.p_sample(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 373, in p_sample
    out = self.p_mean_variance(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 931, in p_mean_variance
    return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 311, in p_mean_variance
    assert t.shape == (B,)
AssertionError
### End of reading iteration...
  0%|                                                                                                   | 0/101 [00:00<?, ?it/s]### End of reading iteration...
  0%|                                                                                                   | 0/101 [00:06<?, ?it/s]
Traceback (most recent call last):
  File "sample_seq2seq.py", line 210, in <module>
    main()
  File "sample_seq2seq.py", line 148, in main
    samples = sample_fn(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 448, in p_sample_loop
    for sample in self.p_sample_loop_progressive(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 517, in p_sample_loop_progressive
    out = self.p_sample(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 373, in p_sample
    out = self.p_mean_variance(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 931, in p_mean_variance
    return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 311, in p_mean_variance
    assert t.shape == (B,)
AssertionError
  0%|                                                                                                   | 0/101 [00:06<?, ?it/s]
Traceback (most recent call last):
  File "sample_seq2seq.py", line 210, in <module>
    main()
  File "sample_seq2seq.py", line 148, in main
    samples = sample_fn(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 448, in p_sample_loop
    for sample in self.p_sample_loop_progressive(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 517, in p_sample_loop_progressive
    out = self.p_sample(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 373, in p_sample
    out = self.p_mean_variance(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 931, in p_mean_variance
    return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 311, in p_mean_variance
    assert t.shape == (B,)
AssertionError
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 87758 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 87759 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 87757) of binary: /home/anaconda3/envs/kaggle_env/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sample_seq2seq.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-11-26_07:57:08
  host      : gpu-server2
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 87757)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
############################## decoding finished...

About the usage of `batch` in dataset (form of `batch`, `cond`)

It seems that batch, the first element of each (batch, cond) pair yielded by the dataset, is not used in training or sampling. It is the output of TextDataset.model_emb(input_ids); in the TrainLoop class it is sliced into micro and passed to the SpacedDiffusion.training_losses_seq2seq method. However, in that method it is saved as x_start_fix and never used.
What was the batch argument originally intended for?

Why is trained embedding orthogonal?

I loaded the model ema_0.9999_050000.pt that you shared (thanks for sharing) and found that word_embedding.weight is orthogonal, which is weird! This would mean the trained embedding fails to learn semantic relatedness between words and just separates words far apart to tolerate generation errors.
Here is my code:

import torch

pt_path = '/home/workarea/Diffusion/DiffuSeq-Fork/diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test_ori20221113-20_27_29/ema_0.9999_050000.pt'
s = torch.load(pt_path, map_location=torch.device('cpu'))
weight = s['word_embedding.weight']
# row-wise softmax over the Gram matrix of the embedding rows
mm = torch.softmax(torch.mm(weight, weight.transpose(0,1)), dim=-1)
# mean of the diagonal entries of the softmaxed matrix
print(mm.trace()/mm.size(0))
# the result is 1!

Could you please explain this phenomenon? Thanks a lot!
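
One caveat about the check itself, independent of what the checkpoint contains: a softmaxed Gram matrix with diagonal mean close to 1 only shows that each row's self-product dominates its products with other rows, which also happens when the rows simply have large norms. The sketch below uses two made-up, clearly non-orthogonal vectors (cosine similarity 0.8) and hypothetical helpers to show the softmax-based score still coming out at almost exactly 1, while a cosine-based score reveals the overlap.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_diag_mean(rows):
    """Mean diagonal of the row-wise softmax of the Gram matrix (the check above)."""
    total = 0.0
    for i, a in enumerate(rows):
        scores = [dot(a, b) for b in rows]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total += exps[i] / sum(exps)
    return total / len(rows)


def mean_abs_offdiag_cosine(rows):
    """Mean absolute cosine similarity between distinct rows; 0 means orthogonal."""
    norms = [math.sqrt(dot(r, r)) for r in rows]
    n = len(rows)
    values = [abs(dot(rows[i], rows[j])) / (norms[i] * norms[j])
              for i in range(n) for j in range(n) if i != j]
    return sum(values) / len(values)


# Two made-up vectors of norm 10 with cosine similarity 0.8.
rows = [[10.0, 0.0], [8.0, 6.0]]
softmax_score = softmax_diag_mean(rows)
cosine_score = mean_abs_offdiag_cosine(rows)
print(softmax_score, cosine_score)
```

Running a cosine-based check on the real `word_embedding.weight` would therefore be a more direct test of orthogonality than the softmax trace.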

Question for extending paraphrasing to other Dataset

This code really helps. I have a question: can we use the pretrained checkpoint to paraphrase other datasets? How should we convert a dataset to the QQP format?

Resume checkpoint does not include loading embedding?

Hi, may I ask how resuming training from a checkpoint is implemented?
To the best of my understanding, when "args.resume_checkpoint" is specified, nothing loads the saved embedding. Is this a bug?
Also, I found that the embedding is not trained. Did I miss something, or is this intended?
Thank you very much.

License

Unfortunately, I am only allowed to run code on our university's cluster if there is a license attached to the code. Would it be possible to attach the standard MIT license to the repo?

About loss in training_losses_seq2seq() when time step t=0

Thanks for your great work.
I have a question about loss calculation in training_losses_seq2seq() when the sampled time step t=0

DiffuSeq/diffuseq/gaussian_diffusion.py

Lines 612 to 619 in bdc8f0a

 x_t = self.q_sample(x_start, t, noise=noise, mask=input_ids_mask) # reparametrization trick. 

 get_logits = model.model.module.get_logits 

 terms = {} 

 target = x_start 

 model_output = model(x_t, self._scale_timesteps(t), **model_kwargs)

If t=0, the x_t = self.q_sample() line looks incorrect, since it tries to sample $x_0$ from $q(x_t|x_0)$. The model_output is then invalid because x_t is invalid.
It seems that you try to replace the invalid term in the following code.

DiffuSeq/diffuseq/gaussian_diffusion.py

Lines 623 to 626 in bdc8f0a

 model_out_x_start = self._x0_helper(model_output, x_t, t)['pred_xstart'] # predicted_xstart = model_output 

 t0_mask = (t == 0) 

 t0_loss = mean_flat((x_start_mean - model_out_x_start) ** 2) 

 terms["mse"] = th.where(t0_mask, t0_loss, terms["mse"])

But you still use the invalid variable model_output to calculate MSE loss.

Is there anything I misunderstand? Could you please help me and clarify the code? Thanks.
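
One possible reading, offered as an assumption rather than the authors' answer: in many DDPM-style codebases the timestep index is 0-based, so index 0 already corresponds to one step of noising. Under that convention, `q_sample` at t=0 returns a slightly noised sample rather than an undefined one. The sketch below uses a linear beta schedule with 2000 steps purely for illustration (the training command quoted on this page uses a `sqrt` schedule), and `linear_alpha_bars` and this scalar `q_sample` are hypothetical stand-ins for the real functions.

```python
import math


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_i) for a linear beta schedule."""
    alpha_bars = []
    product = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


alpha_bars = linear_alpha_bars(2000)


def q_sample(x0, t, eps):
    """Scalar forward sample: sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps


x_at_index_0 = q_sample(x0=1.0, t=0, eps=0.5)
print(alpha_bars[0], x_at_index_0)
```

Because the first cumulative product is already below 1, the sample at index 0 differs from `x0`, so a model output computed from it is well defined under this convention.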

Problems about running code.

Thanks for your great work! When I run the command

python -m torch.distributed.launch --nproc_per_node=7 --master_port=12233 --use_env run_train.py --diff_steps 2000 --lr 0.0001 --learning_steps 140000 --save_interval 20000 --seed 102 --noise_schedule sqrt --hidden_dim 128 --bsz 2048 --microbatch 64 --dataset dialogue --data_dir datasets/CommonsenseConversation --vocab bert --seq_len 128 --schedule_sampler lossaware --notes dialogue

the output is stuck at "### Training...", with no further logs locally or on wandb. Please tell me what to do about that.

Why no attention mask used?

In vanilla BERT, we should input attention mask to avoid performing attention on padding positions. However, in this code repository, no attention mask is used. Is there any reason for this design?

Thanks!
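
For reference, this is what an attention mask does at the level of a single query: scores for padded positions are set to negative infinity before the softmax, so those positions receive zero weight. The sketch is framework-free, with a hypothetical helper `masked_softmax` and made-up scores; it does not say whether omitting the mask is intentional in the repository.

```python
import math


def masked_softmax(scores, keep):
    """Softmax over scores, giving zero weight to positions where keep is False.

    Assumes at least one position is kept.
    """
    masked = [s if k else float("-inf") for s, k in zip(scores, keep)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up attention scores for three positions; the last one is padding.
weights = masked_softmax([2.0, 1.0, 3.0], [True, True, False])
print(weights)
```

Without the mask, the padded position would receive the largest weight in this example because its raw score is the highest.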

Problem about Running Time on Dialogue dataset

Hi, thanks for your great work.

I am conducting dialogue-related experiments. I can't find a time estimate for the dialogue task (there are descriptions and issues about QG and QQP only). Additionally, it seems that the model should be trained for 140k steps, which means it may take 5+ days even with around 4 A100s.

Would you mind sharing more details about the GPU resource settings and running time for the different tasks? I suppose this may be something we should optimize.

Thanks

Why is the MSE between "x_start" and the output, not between "noise" and the output?

Thanks for releasing the code

in https://github.com/Shark-NLP/DiffuSeq/blob/main/diffuseq/gaussian_diffusion.py#L621

x_start = self._get_x_start(x_start_mean, std)
# print(x_start_mean.shape, x_start.shape)
if noise is None:
    noise = th.randn_like(x_start)

x_t = self.q_sample(x_start, t, noise=noise, mask=input_ids_mask) # reparametrization trick.

get_logits = model.model.module.get_logits

terms = {}

target = x_start
model_output = model(x_t, self._scale_timesteps(t), **model_kwargs)
assert model_output.shape == target.shape == x_start.shape
terms["mse"] = mean_flat((target - model_output) ** 2)

Why are you taking the MSE between "x_start" and the output rather than between "noise" and the output?

In standard diffusion models, it is the MSE between the noise and the output (https://huggingface.co/blog/annotated-diffusion):


def p_losses(denoise_model, x_start, t, noise=None, loss_type="l1"):
    if noise is None:
        noise = torch.randn_like(x_start)

    x_noisy = q_sample(x_start=x_start, t=t, noise=noise)
    predicted_noise = denoise_model(x_noisy, t)

    if loss_type == 'l1':
        loss = F.l1_loss(noise, predicted_noise)
    elif loss_type == 'l2':
        loss = F.mse_loss(noise, predicted_noise)
    elif loss_type == "huber":
        loss = F.smooth_l1_loss(noise, predicted_noise)
    else:
        raise NotImplementedError()

    return loss
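
The two targets are related by the forward equation $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$: given $x_t$, a prediction of $x_0$ determines a prediction of $\epsilon$ and vice versa, and the DDPM literature discusses both parameterizations. The scalar sketch below checks this with made-up numbers; the helper names are hypothetical, and it makes no claim about why the repository chose the $x_0$ target.

```python
import math


def q_sample(x0, eps, alpha_bar):
    """Forward sample: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps


def eps_from_x0(x_t, x0_pred, alpha_bar):
    """Noise implied by a prediction of x0."""
    return (x_t - math.sqrt(alpha_bar) * x0_pred) / math.sqrt(1.0 - alpha_bar)


def x0_from_eps(x_t, eps_pred, alpha_bar):
    """Clean sample implied by a prediction of the noise."""
    return (x_t - math.sqrt(1.0 - alpha_bar) * eps_pred) / math.sqrt(alpha_bar)


# Made-up values for illustration.
x0, eps, alpha_bar = 0.7, -1.3, 0.36
x_t = q_sample(x0, eps, alpha_bar)

# An imperfect x0 prediction and the noise prediction it implies.
x0_pred = 0.5
eps_pred = eps_from_x0(x_t, x0_pred, alpha_bar)

x0_error = abs(x0_pred - x0)
eps_error = abs(eps_pred - eps)
ratio = math.sqrt(alpha_bar) / math.sqrt(1.0 - alpha_bar)
print(x0_error, eps_error, ratio)
```

The errors differ only by the timestep-dependent factor $\sqrt{\bar{\alpha}_t}/\sqrt{1-\bar{\alpha}_t}$, so the two losses amount to different weightings over timesteps.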

Some questions about different losses

Hi,
thank you for releasing such clean code. My questions are based on the revised, ICLR-accepted version of the paper.

Here I have some questions about different losses. If I misunderstand your code, I would really appreciate your correction.

1. It seems you calculate the cross-entropy loss over both x and y rather than just y, which differs from your paper: "Note that although in the first term we only compute the loss w.r.t y0, due to the attention mechanism in the transformer, the reconstruction of y0 also takes x0 into account, thus the gradients from the first term will also affect the learning of x0."

Here, your final loss uses decoder_nll rather than terms["nll"], and the calculation of decoder_nll doesn't use the input mask.

2. A question related to 1: it seems you don't use the rounding loss in the final loss.
You calculate the rounding loss as:
terms["nll"] = self._token_discrete_loss(model_out_x_start, get_logits, input_ids_x, mask=input_ids_mask, truncate=True, t=t)
But the final loss is:

decoder_nll = self._token_discrete_loss(x_start, get_logits, input_ids_x)
terms["loss"] = terms["mse"] + decoder_nll + tT_loss

The rounding loss isn't in the final loss.

3. Why do you calculate decoder_nll? The input to self._token_discrete_loss for decoder_nll is x_start, a noisy word embedding (the word embedding plus Gaussian noise) that should already be very close to the word embedding.
4. Why don't you learn sigma? The DDIM paper says a learnable sigma is beneficial.

Question About top-p sampling

Hello, thanks for sharing your code; it is really helpful.

I notice there is a hyperparameter top-p (the code is here). When we run decode, this hyperparameter is set to -1, so "top-p sampling" is not actually used.

But I still wonder what it is for. Did you use it in your experiments? If we use it, what is an appropriate value? Could you please provide further details or refer me to relevant literature that would help me understand it better?

Thank you in advance for your assistance.

RoFormer model instead of BERT?

Hi there,
have you tried using the RoFormer model for text generation? I want to use it since it captures the relative position of each word.
If you have tried it, do I have to change anything other than loading the different model? Right now my generation is much worse than with BERT, which is counterintuitive! :)

Where are "src" and "trg" decoded?

Hi,
Thank you for the great work.

I am trying to understand where "src" and "trg" from the dataset are decoded. Can you explain it to me if you have time?

I found three usages of tokenizer.decode_token():

https://github.com/Shark-NLP/DiffuSeq/blob/main/sample_seq2seq.py#L202
https://github.com/Shark-NLP/DiffuSeq/blob/main/sample_seq2seq.py#L208-L209

I wonder which of these three usages decode "src" and "trg"?

Thank you.

Questions about TransformerNetModel initialization

Hi:
Thank you for your excellent work and for providing a nice codebase.
First of all, I want to point out a bug. If I initialize with the bert-base-uncased model and word_embedding uses the BERT embedding, whose dimension is 768, while input_dims is still 128, an error is raised in forward().
Secondly, I would like to ask whether none of your experiments were initialized with a pre-trained model. I see that use_plm_init defaults to "no" in the script you provided. I tried to use the BERT model for initialization, but the loss seems abnormal. Have you tried this?

Any help is appreciated.

Pretrained checkpoint for QG

Hi,

Thank you for sharing your amazing work. The repository is very clean and the instructions are easy to follow.
I see that you have recently uploaded the pretrained checkpoint for QQP. Would it be possible to obtain the pretrained checkpoint for the Quasar-T dataset too?

Thank you!

use additional model to guide text generation

Have you tried using an additional model to guide text generation on top of your model, as Diffusion-LM did? Would it improve the performance?

NCCL error

Hi DiffuSeq! I am trying to troubleshoot my bash train.sh run but got stuck with an NCCL error:

############################## size of vocab 30522
### Creating model and diffusion...
### The parameter count is 91225274
### Saving the hyperparameters to diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20230328-22:28:17/training_args.json
### The parameter count is 91225274
### Saving the hyperparameters to diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20230328-22:28:17/training_args.json
### Training...
### The parameter count is 91225274
### Saving the hyperparameters to diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20230328-22:28:17/training_args.json
### Training...
### The parameter count is 91225274
### Saving the hyperparameters to diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20230328-22:28:17/training_args.json
### Training...
wandb: Tracking run with wandb version 0.14.0
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
### Training...
Traceback (most recent call last):
Traceback (most recent call last):
  File "train.py", line 115, in <module>
  File "train.py", line 115, in <module>
Traceback (most recent call last):
  File "train.py", line 115, in <module>
    main()
    main()
  File "train.py", line 92, in main
  File "train.py", line 92, in main
    main()
    TrainLoop(
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 88, in __init__
  File "train.py", line 92, in main
    TrainLoop(
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 88, in __init__
    self._load_and_sync_parameters()
    self._load_and_sync_parameters()
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 141, in _load_and_sync_parameters
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 141, in _load_and_sync_parameters
    TrainLoop(
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 88, in __init__
    dist_util.sync_params(self.model.parameters())
    dist_util.sync_params(self.model.parameters())
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/diffuseq/utils/dist_util.py", line 70, in sync_params
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/diffuseq/utils/dist_util.py", line 70, in sync_params
    self._load_and_sync_parameters()
    dist.broadcast(p, 0)
    dist.broadcast(p, 0)
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 141, in _load_and_sync_parameters
  File "/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1076, in broadcast
  File "/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1076, in broadcast
    dist_util.sync_params(self.model.parameters())
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/diffuseq/utils/dist_util.py", line 70, in sync_params
    dist.broadcast(p, 0)
  File "/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1076, in broadcast
    work = default_pg.broadcast([tensor], opts)
    work = default_pg.broadcast([tensor], opts)
**RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8**
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).**RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).**

    work = default_pg.broadcast([tensor], opts)
**RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8**
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Traceback (most recent call last):
  File "train.py", line 115, in <module>
    main()
  File "train.py", line 92, in main
    TrainLoop(
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 88, in __init__
    self._load_and_sync_parameters()
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 141, in _load_and_sync_parameters
    dist_util.sync_params(self.model.parameters())
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/diffuseq/utils/dist_util.py", line 70, in sync_params
    dist.broadcast(p, 0)
  File "/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1076, in broadcast
    work = default_pg.broadcast([tensor], opts)
**RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).**
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/cltam/test/test_DiffuSeq/DiffuSeq/wandb/offline-run-20230328_222828-w9r0x0wf
wandb: Find logs at: ./wandb/offline-run-20230328_222828-w9r0x0wf/logs
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.00020599365234375 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "92557", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "92558", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 2, "group_rank": 0, "worker_id": "92559", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [2], \"role_rank\": [2], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 3, "group_rank": 0, "worker_id": "92564", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [3], \"role_rank\": [3], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}

Any idea how to troubleshoot this? Thank you!

About data preprocessing

Thank you for your excellent work. I'm interested in your code and plan to train it on my own dataset, but I don't know how I should preprocess the data. Is there anything I should pay attention to?
Could you please give me some hints and share the preprocessing script? Thank you so much.
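
For reference, the training samples quoted in the "sampling issue" thread of this listing are JSON Lines, one object per line with "src" and "trg" keys. A minimal sketch of converting your own pairs into that shape, assuming that is the format the data loader expects. The helper names and the train.jsonl file name are hypothetical and not taken from the repository:

```python
import json
import os
import tempfile


def write_pairs_as_jsonl(pairs, path):
    """Write (source, target) string pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for src, trg in pairs:
            # Key names follow the sample lines quoted in this listing (assumption).
            f.write(json.dumps({"src": src, "trg": trg}, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


pairs = [("N N E T P", "P T E N N"), ("Q E W Q R", "R Q W E Q")]
out_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")  # hypothetical file name
write_pairs_as_jsonl(pairs, out_path)
rows = read_jsonl(out_path)
```

Reading the file back is a cheap sanity check that every line parses before a long training run starts.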

What is the `tT_loss`?

There is a tT_loss term in the final loss:

DiffuSeq/diffuseq/gaussian_diffusion.py

Lines 629 to 630 in 901f860

 out_mean, _, _ = self.q_mean_variance(x_start, th.LongTensor([self.num_timesteps - 1]).to(x_start.device))
 tT_loss = mean_flat(out_mean ** 2)

What is this? I cannot find it in the paper. According to the code, out_mean looks like the mean of $x_T\sim\mathcal{N}(0, I)$ as $T\rightarrow +\infty$ in the forward diffusion process, so out_mean ** 2 should be $\bar{\alpha}_T x_0\rightarrow 0$. Also, there seem to be no learnable parameters in the computation graph of tT_loss?

I wonder what this term is for, what it means, and where it comes from.
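
To make the question concrete: in the standard Gaussian forward process, the mean of $q(x_T \mid x_0)$ is $\sqrt{\bar{\alpha}_T}\,x_0$, so the quoted term is the average of $\bar{\alpha}_T x_0^2$ over the entries of x_start. Below is a minimal plain-Python sketch of that quantity. It assumes a linear beta schedule as a stand-in (not the repository's schedule), and all function names are hypothetical:

```python
import math


def linear_betas(num_timesteps, beta_start=1e-4, beta_end=0.02):
    """Stand-in noise schedule (assumption): linearly spaced betas."""
    step = (beta_end - beta_start) / (num_timesteps - 1)
    return [beta_start + i * step for i in range(num_timesteps)]


def alpha_bar_final(betas):
    """Cumulative product of (1 - beta_t) over all steps, i.e. alpha_bar_T."""
    prod = 1.0
    for b in betas:
        prod *= 1.0 - b
    return prod


def tT_loss(x_start, betas):
    """Mean of (sqrt(alpha_bar_T) * x_0)^2 over the entries of x_start."""
    scale = math.sqrt(alpha_bar_final(betas))
    out_mean = [scale * x for x in x_start]
    return sum(m * m for m in out_mean) / len(out_mean)


x_start = [0.5, -1.0, 2.0, 0.25]  # toy values standing in for embedding entries
betas = linear_betas(1000)
loss = tT_loss(x_start, betas)
```

Under these assumptions, the only quantity a gradient could reach through this term is x_start itself, so it can only act on the model if x_start is produced by learnable embeddings.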

Problem about Commonsense Conversation Dataset

Thanks for your great work!

In your repo I found the dataset sample CommonsenseConversation, but when I tried to download the full dataset I could not get it from this link: http://coai.cs.tsinghua.edu.cn/hml/dataset/#commonsense

I'm curious how you accessed this dataset. I would be very grateful if you could answer my question!

Could treating <Pad> as a regular token make the model learn only the <Pad> information?

Hi
In my project, I discovered that when <Pad> is treated as a regular token, the diffusion model usually
learns the <Pad> information. In other words, the model tends to predict the <Pad> token instead of other words during generation.
How can I avoid this issue?

Only one GPU

If I want to switch to single-GPU training, which parts of the code need to be modified, and how?

NLL for q0-q2 is 0 but for q3 is >2

Hey,
it's me again! :D
My loss, MSE and, in the example below, the NLL are dropping very fast and even reach 0.

Following up on my previous issue (#45), this means that the recovered token embeddings for timesteps in [0, n * 0.25), [n * 0.25, n * 0.5) and [n * 0.5, n * 0.75), with n being the number of diffusion steps, are exactly the same for each timestep.
However, the NLL for q3 ([n * 0.75, n)) is very high (>2), which is also reflected when generating sequences from the respective checkpoints. Did you encounter something similar during your training process? 😊
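
The q0 to q3 splits described above can be illustrated with a short sketch that groups per-sample losses by the quartile of their sampled timestep. It assumes the suffix is computed as int(4 * t / n), and the helper names are hypothetical rather than the repository's:

```python
def quartile_key(name, t, num_timesteps):
    """Return e.g. 'nll_q3' for a timestep t in the last quarter of [0, n)."""
    quartile = int(4 * t / num_timesteps)
    return f"{name}_q{quartile}"


def bucket_losses(name, timesteps, losses, num_timesteps):
    """Average per-sample losses within each timestep quartile."""
    sums, counts = {}, {}
    for t, loss in zip(timesteps, losses):
        key = quartile_key(name, t, num_timesteps)
        sums[key] = sums.get(key, 0.0) + loss
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy batch: two low-noise samples with zero NLL, two high-noise samples with large NLL.
averages = bucket_losses("nll", [10, 700, 1990, 1600], [0.0, 0.0, 3.0, 2.0], 2000)
```

Read this way, a high nll_q3 next to zero nll_q0 to nll_q2 would mean that only samples drawn at the noisiest timesteps still contribute to the NLL.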

sampling issue

Hi all! First, I would like to thank you for sharing the code!
I'd like to apply DiffuSeq to some seq2seq tasks involving protein sequences (amino acid tokens). As a first test, I trained the model to reverse the order of a sequence, e.g. I E T M L (source seq) to L M T E I (target seq). I used a training dataset of 8K samples and a validation dataset of 2K samples.
During training, the validation and training losses decreased to ~0 as expected. Metrics at the last learning step:

| grad_norm | 0.0696 |
| loss | 0.0783 |
| loss_q0 | 0.0783 |
| loss_q1 | 0.0783 |
| loss_q2 | 0.0784 |
| loss_q3 | 0.0783 |
| mse | 0.0715 |
| mse_q0 | 0.0714 |
| mse_q1 | 0.0715 |
| mse_q2 | 0.0715 |
| mse_q3 | 0.0714 |
| nll | 2.76 |
| nll_q0 | 2.76 |
| nll_q1 | 2.76 |
| nll_q2 | 2.76 |
| nll_q3 | 2.76 |
| samples | 3.84e+06 |
| step | 3e+04 |

However, when I generate sequences using the decoder bash script, I obtain this for the last trained step:
{"recover": "", "reference": "[CLS] L M T E I [SEP]", "source": "[CLS] I E T M L [SEP] [SEP]"}
It seems that the model predicted only PAD from the source sequence.
If I decode from intermediate training checkpoints, I obtain this:
{"recover": "[SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP]", "reference": "[CLS] L M T E I [SEP]", "source": "[CLS] I E T M L [SEP] [SEP]"}

This is my first time working on diffusion models applied to sequences, so I don't know if that could be a problem related to hyperparameters. Would you have any thoughts on that?

Some information that may be relevant to this case:

I used the ProtBert tokenizer from Rostlab/prot_bert on Hugging Face, but without initializing from the pretrained model (use_plm_init no);
Example of my training samples:
{"src":"N N E T P","trg":"P T E N N"}
{"src":"Q E W Q R","trg":"R Q W E Q"}
{"src":"G Q M P M","trg":"M P M Q G"}
Training parameters:
--diff_steps 2000
--lr 0.0001
--learning_steps 30000
--save_interval 10000
--seed 102
--noise_schedule sqrt
--bsz 128
--dataset reverse2
--vocab bert
--schedule_sampler lossaware
--notes test-reverse2
--data_dir /home/ribeiroh/Projetos/DiffuSeq/datasets/reverse
--seq_len 18
--config_name Rostlab/prot_bert
--hidden_dim 128
--use_plm_init no

Thank you so much!

rounding issue

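Thanks to the authors for their work. Is the rounding operation in the code explained anywhere in the paper? How should I understand that the result of the rounding operation is the word's index in the vocabulary?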
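
One common reading of "rounding" in embedding-space diffusion is a nearest-neighbour lookup: each denoised vector is replaced by the closest row of the word-embedding table, and the index of that row is the token id. A minimal sketch of that idea, assuming Euclidean distance and a toy 2-dimensional table. The names and values are illustrative and not taken from the repository:

```python
def nearest_token_id(vector, embedding_table):
    """Return the row index of the embedding closest to `vector` (squared L2)."""
    best_id, best_dist = None, float("inf")
    for token_id, emb in enumerate(embedding_table):
        dist = sum((v - e) ** 2 for v, e in zip(vector, emb))
        if dist < best_dist:
            best_id, best_dist = token_id, dist
    return best_id


def round_sequence(vectors, embedding_table):
    """Round every position of a denoised sequence to a token id."""
    return [nearest_token_id(v, embedding_table) for v in vectors]


# Toy vocabulary: row i of the table is the embedding of vocab[i].
vocab = ["[PAD]", "L", "M"]
embedding_table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
denoised = [[0.9, 0.1], [0.1, 0.8], [0.05, -0.02]]

ids = round_sequence(denoised, embedding_table)
tokens = [vocab[i] for i in ids]
```

Because row i of the table is by construction the embedding of vocabulary entry i, the index of the nearest row can be read directly as the word's id.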

Modifications for unconditional text generation

Thank you for your meaningful work and clean code. I would like to ask for some help. I plan to run experiments on unconditional text generation with your code, where there is no source sequence as input. Could you please give me some tips on how and where I should modify the code? Thanks!

Question about tT_loss

I am still confused about issue #17. The content of that issue is reproduced below:

There is a tT_loss term in the final loss:
DiffuSeq/diffuseq/gaussian_diffusion.py
Lines 629 to 630 in 901f860

 out_mean, _, _ = self.q_mean_variance(x_start, th.LongTensor([self.num_timesteps - 1]).to(x_start.device)) 
 tT_loss =  mean_flat(out_mean ** 2)

What is this? I cannot find it in the paper. According to the code, out_mean looks like the mean of $x_T\sim\mathcal{N}(0, I)$ as $T\rightarrow +\infty$ in the forward diffusion process, so out_mean ** 2 should be $\bar{\alpha}_T x_0\rightarrow 0$. Also, there seem to be no learnable parameters in the computation graph of tT_loss? I wonder what this term is for, what it means, and where it comes from.

As for the comment from @yjc4 in #17, I don't think that term explains these issues, because it is clearly dropped between step 1 and step 2 of equation (17) in the paper. Could you please give me some hints? Thanks.

Working with larger datasets

Hi,
thank you for this awesome project.
I want to apply DiffuSeq to a larger dataset (~17M sentences), but tokenization keeps exhausting my RAM, even though I have 200GB available! Is there functionality I am missing that uses cached tokens, or is this a work in progress?
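
One generic way to keep memory flat is to tokenize in small batches and write the ids straight to disk instead of holding the whole corpus in RAM. A minimal sketch of that pattern, assuming JSON Lines input with "src" and "trg" keys as in the sample lines quoted elsewhere in this listing. The function names, the whitespace tokenizer and the file names are hypothetical stand-ins, not the repository's loader:

```python
import json
import os
import tempfile


def iter_jsonl(path):
    """Yield one parsed row at a time so the file is never fully loaded."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def write_batch(batch, out, tokenize):
    """Tokenize one batch and append it to the open output file."""
    for row in batch:
        ids = {"src_ids": tokenize(row["src"]), "trg_ids": tokenize(row["trg"])}
        out.write(json.dumps(ids) + "\n")
    return len(batch)


def tokenize_to_disk(in_path, out_path, tokenize, batch_size=1000):
    """Stream in_path through `tokenize`; at most batch_size rows are in memory."""
    total, batch = 0, []
    with open(out_path, "w", encoding="utf-8") as out:
        for row in iter_jsonl(in_path):
            batch.append(row)
            if len(batch) == batch_size:
                total += write_batch(batch, out, tokenize)
                batch = []
        if batch:  # flush the final partial batch
            total += write_batch(batch, out, tokenize)
    return total


# Toy whitespace tokenizer standing in for a real one (assumption).
toy_vocab = {}


def toy_tokenize(text):
    return [toy_vocab.setdefault(tok, len(toy_vocab)) for tok in text.split()]


tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "train.jsonl")
out_path = os.path.join(tmp_dir, "train.ids.jsonl")
with open(in_path, "w", encoding="utf-8") as f:
    for src, trg in [("N N E T P", "P T E N N"), ("Q E W Q R", "R Q W E Q"), ("G Q M P M", "M P M Q G")]:
        f.write(json.dumps({"src": src, "trg": trg}) + "\n")

count = tokenize_to_disk(in_path, out_path, toy_tokenize, batch_size=2)
first = next(iter_jsonl(out_path))
```

The output file can then be read back lazily in the same way during training, which is one way to get the effect of "cached tokens".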

Thanks again & best!

Unable to reproduce the results

We are unable to reproduce the results of the question generation task in the paper.

The results of the question generation task in the paper are as follows:

We used the provided command and obtained the following results:

The command we used is the one provided in the repo:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=12233 --use_env run_train.py --diff_steps 2000 --lr 0.0001 --learning_steps 40000 --save_interval 2000 --seed 102 --noise_schedule sqrt --hidden_dim 128 --bsz 2048 --microbatch 64 --dataset qg --data_dir {datasets/QG} --vocab bert --seq_len 128 --schedule_sampler lossaware --notes qg

Error when decoding

I trained the dialogue task, but when I try to decode, the following error occurs:

Traceback (most recent call last):
  File "sample_seq2seq.py", line 210, in <module>
    main()
  File "sample_seq2seq.py", line 184, in main
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking arugment for argument mat2 in method wrapper_mm)

q<n> metrics

When it comes to training metrics, I cannot find what distinguishes the different q splits (loss_q1, nll_q0, etc.).
Could you briefly explain them or point to the corresponding paper or paragraph?

	x_t = self.q_sample(x_start, t, noise=noise, mask=input_ids_mask) # reparametrization trick.

	get_logits = model.model.module.get_logits

	terms = {}

	target = x_start
	model_output = model(x_t, self._scale_timesteps(t), **model_kwargs)

	model_out_x_start = self._x0_helper(model_output, x_t, t)['pred_xstart'] # predicted_xstart = model_output
	t0_mask = (t == 0)
	t0_loss = mean_flat((x_start_mean - model_out_x_start) ** 2)
	terms["mse"] = th.where(t0_mask, t0_loss, terms["mse"])

	out_mean, _, _ = self.q_mean_variance(x_start, th.LongTensor([self.num_timesteps - 1]).to(x_start.device))
	tT_loss = mean_flat(out_mean ** 2)
