
diffuseq's People

Contributors

kdha0727, summmeer, yjc4

diffuseq's Issues

Separate weights for word embedding and lm-head?

Hi, thanks for providing the code.

I have a question regarding the word embedding and the lm-head. In your code, both share the same weight. Could they have separate weights instead?

Thanks for your help!

NCCL error

Hi DiffuSeq! I am trying to run bash train.sh but got stuck on an NCCL error:

############################## size of vocab 30522
### Creating model and diffusion...
### The parameter count is 91225274
### Saving the hyperparameters to diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20230328-22:28:17/training_args.json
### The parameter count is 91225274
### Saving the hyperparameters to diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20230328-22:28:17/training_args.json
### Training...
### The parameter count is 91225274
### Saving the hyperparameters to diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20230328-22:28:17/training_args.json
### Training...
### The parameter count is 91225274
### Saving the hyperparameters to diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test-qqp20230328-22:28:17/training_args.json
### Training...
wandb: Tracking run with wandb version 0.14.0
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
### Training...
Traceback (most recent call last):
Traceback (most recent call last):
  File "train.py", line 115, in <module>
  File "train.py", line 115, in <module>
Traceback (most recent call last):
  File "train.py", line 115, in <module>
    main()
    main()
  File "train.py", line 92, in main
  File "train.py", line 92, in main
    main()
    TrainLoop(
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 88, in __init__
  File "train.py", line 92, in main
    TrainLoop(
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 88, in __init__
    self._load_and_sync_parameters()
    self._load_and_sync_parameters()
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 141, in _load_and_sync_parameters
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 141, in _load_and_sync_parameters
    TrainLoop(
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 88, in __init__
    dist_util.sync_params(self.model.parameters())
    dist_util.sync_params(self.model.parameters())
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/diffuseq/utils/dist_util.py", line 70, in sync_params
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/diffuseq/utils/dist_util.py", line 70, in sync_params
    self._load_and_sync_parameters()
    dist.broadcast(p, 0)
    dist.broadcast(p, 0)
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 141, in _load_and_sync_parameters
  File "/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1076, in broadcast
  File "/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1076, in broadcast
    dist_util.sync_params(self.model.parameters())
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/diffuseq/utils/dist_util.py", line 70, in sync_params
    dist.broadcast(p, 0)
  File "/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1076, in broadcast
    work = default_pg.broadcast([tensor], opts)
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Traceback (most recent call last):
  File "train.py", line 115, in <module>
    main()
  File "train.py", line 92, in main
    TrainLoop(
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 88, in __init__
    self._load_and_sync_parameters()
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/train_util.py", line 141, in _load_and_sync_parameters
    dist_util.sync_params(self.model.parameters())
  File "/home/cltam/test/test_DiffuSeq/DiffuSeq/diffuseq/utils/dist_util.py", line 70, in sync_params
    dist.broadcast(p, 0)
  File "/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1076, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
wandb: Waiting for W&B process to finish... (failed 1).
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/cltam/test/test_DiffuSeq/DiffuSeq/wandb/offline-run-20230328_222828-w9r0x0wf
wandb: Find logs at: ./wandb/offline-run-20230328_222828-w9r0x0wf/logs
INFO:torch.distributed.elastic.agent.server.api:[default] worker group successfully finished. Waiting 300 seconds for other agents to finish.
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
/home/cltam/anaconda3/envs/diffuseq/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:70: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.00020599365234375 seconds
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "92557", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "92558", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 2, "group_rank": 0, "worker_id": "92559", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [2], \"role_rank\": [2], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 3, "group_rank": 0, "worker_id": "92564", "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [3], \"role_rank\": [3], \"role_world_size\": [4]}", "agent_restarts": 0}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "cltam-System-Product-Name", "state": "SUCCEEDED", "total_run_time": 20, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 0}}

Any idea how to troubleshoot this? Thank you!

Why no attention mask used?

In vanilla BERT, an attention mask is passed so that attention is not computed over padding positions. In this repository, however, no attention mask is used. Is there a reason for this design?

Thanks!

Dataset(2) in "text_datasets.py"

Hello, thank you for your contribution to this code.

I'm confused by this code in "text_datasets.py":

import datasets
from datasets import Dataset as Dataset2

Here, "datasets" is a folder that does not contain any "Dataset" class, is it?
So line 70,

raw_datasets = Dataset2.from_dict(sentence_lst)

cannot run, because "Dataset2" does not exist.

NLL for q0-q2 is 0 but for q3 is >2

Hey,
it's me again! :D
My loss, MSE and, in the example below, the NLL are dropping very fast and even reach 0.

Coming from my previous issue (#45), this means that the recovered token embeddings for timesteps in [0, n * 0.25), [n * 0.25, n * 0.5) and [n * 0.5, n * 0.75), with n being the number of diffusion steps, are exactly the same at each timestep.
However, the NLL for q3 ([n * 0.75, n]) is very high (>2), which is also reflected when generating sequences from the respective checkpoints. Did you encounter something similar during your training process? 😊

[Screenshot 2023-04-21 at 12.29.08]

Question about tT_loss

I am still confused about issue #17. The content of that issue is reproduced below:

There is a tT_loss term in the final loss:
DiffuSeq/diffuseq/gaussian_diffusion.py
Lines 629 to 630 in 901f860

 out_mean, _, _ = self.q_mean_variance(x_start, th.LongTensor([self.num_timesteps - 1]).to(x_start.device)) 
 tT_loss =  mean_flat(out_mean ** 2) 

What is this? I cannot find it in the paper. According to the code, out_mean looks like the mean of $x_T\sim\mathcal{N}(0, I)$ as $T\rightarrow +\infty$ from the forward diffusion process, and out_mean ** 2 should then be $\bar{\alpha}_T x_0\rightarrow 0$. Also, there seem to be no learnable parameters in the compute graph of tT_loss? I wonder what this term is for, what it means, and where it comes from.

As for the comment from @yjc4 in #17, I don't think it explains this term, because the term is clearly dropped between step 1 and step 2 of equation (17) in the paper. Could you give me some hints? Thanks.

why MSE of "x_start" and output not "noise" and output?

Hi

Thanks for releasing the code

In https://github.com/Shark-NLP/DiffuSeq/blob/main/diffuseq/gaussian_diffusion.py#L621:

x_start = self._get_x_start(x_start_mean, std)
# print(x_start_mean.shape, x_start.shape)
if noise is None:
    noise = th.randn_like(x_start)

x_t = self.q_sample(x_start, t, noise=noise, mask=input_ids_mask) # reparametrization trick.

get_logits = model.model.module.get_logits

terms = {}

target = x_start
model_output = model(x_t, self._scale_timesteps(t), **model_kwargs)
assert model_output.shape == target.shape == x_start.shape
terms["mse"] = mean_flat((target - model_output) ** 2)

Why are you taking the MSE between "x_start" and the output rather than between "noise" and the output?

In standard diffusion models, the MSE is computed between the noise and the output (https://huggingface.co/blog/annotated-diffusion):


def p_losses(denoise_model, x_start, t, noise=None, loss_type="l1"):
    if noise is None:
        noise = torch.randn_like(x_start)

    x_noisy = q_sample(x_start=x_start, t=t, noise=noise)
    predicted_noise = denoise_model(x_noisy, t)

    if loss_type == 'l1':
        loss = F.l1_loss(noise, predicted_noise)
    elif loss_type == 'l2':
        loss = F.mse_loss(noise, predicted_noise)
    elif loss_type == "huber":
        loss = F.smooth_l1_loss(noise, predicted_noise)
    else:
        raise NotImplementedError()

    return loss
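
For context on the two objectives contrasted above: under the usual forward process x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise, a prediction of x_0 can be converted into a prediction of the noise and back, so the two MSE targets differ only by a timestep-dependent weight. The sketch below checks this on scalars. The function names and the alpha_bar value are illustrative assumptions, not code from DiffuSeq or from the linked blog post.

```python
import math


def q_sample(x0, alpha_bar, eps):
    """Forward process on a scalar: noisy sample x_t given clean x0 and noise eps."""
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps


def eps_from_x0(x_t, x0_pred, alpha_bar):
    """Noise implied by a prediction of x0 (hypothetical helper)."""
    return (x_t - math.sqrt(alpha_bar) * x0_pred) / math.sqrt(1.0 - alpha_bar)


def x0_from_eps(x_t, eps_pred, alpha_bar):
    """Clean sample implied by a prediction of the noise (hypothetical helper)."""
    return (x_t - math.sqrt(1.0 - alpha_bar) * eps_pred) / math.sqrt(alpha_bar)
```

With these definitions, the squared error on the noise equals alpha_bar / (1 - alpha_bar) times the squared error on x_0, so predicting x_0 is a reweighted version of predicting the noise rather than a different model family.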

Pretrained checkpoint for QG

Hi,

Thank you for sharing your amazing work. The repository is very clean and the instructions are easy to follow.
I see that you have recently uploaded the pretrained checkpoint for QQP. Would it be possible to obtain the pretrained checkpoint for the Quasar-T dataset too?

Thank you!

Size of the hidden dimension

Hey,
I was wondering if you have tested the effect of the hidden dimension on the training, and if yes, what were your findings?

Does multi-GPU training actually duplicate data on each GPU?

Hello.

I find that the DataLoader constructed in diffuseq/text_datasets.py does not use PyTorch's DistributedSampler:

data_loader = DataLoader(

As a result, the data is actually duplicated on each GPU, e.g., in forward_backward in train_util.py:

def forward_backward(self, batch, cond):

That is, each GPU processes the same data, which makes distributed training pointless.

Is my conjecture correct?

Just FYI, the training script train_run.py in Diffusion-LM's repo uses transformers' training script run_clm.py, in which the Trainer uses DistributedSampler.
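
To make the concern concrete, here is a minimal pure-Python sketch of the kind of index sharding a distributed sampler performs: pad the index list so it divides evenly, then give each rank every world_size-th index. `shard_indices` is a hypothetical helper written for illustration (no shuffling, no PyTorch dependency); it is not taken from DiffuSeq or from PyTorch.

```python
def shard_indices(dataset_len, rank, world_size):
    """Return the dataset indices one rank would see under a strided split."""
    indices = list(range(dataset_len))
    # Pad by wrapping around so every rank receives the same number of samples.
    remainder = len(indices) % world_size
    if remainder:
        indices += indices[: world_size - remainder]
    # Rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[rank::world_size]
```

Without such a split, every rank iterates over the same index list; whether the ranks then see identical batches depends on how shuffling and seeding are set up in each process.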

Error when decoding

I trained the dialogue task; when I try to decode, an error occurs:

Traceback (most recent call last):
  File "sample_seq2seq.py", line 210, in <module>
    main()
  File "sample_seq2seq.py", line 184, in main
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking arugment for argument mat2 in method wrapper_mm)

Question About top-p sampling

Hello, thanks for sharing your code, it is really helpful.

I notice there is a hyperparameter top-p, the code is here. When we run decode, this hyperparameter is set to -1, so we don't actually use "top-p sampling".

But I still wonder what it is for. Did you use it in your experiments? If we use it, what is an appropriate value? Could you please provide further details or point me to relevant literature that would help me understand it better?

Thank you in advance for your assistance.
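
One possible reading, offered as an assumption rather than a statement of what the authors intended: in diffusion samplers of this family, a positive top_p can act as a bound on the Gaussian noise added at each reverse step (entries with magnitude above the bound are redrawn), which is different from nucleus sampling over token probabilities. The sketch below illustrates that truncation in plain Python; `truncated_noise` is a hypothetical name, and the behavior for non-positive values (no truncation) mirrors the -1 default mentioned above.

```python
import random


def truncated_noise(n, top_p, rng):
    """Draw n standard-normal values, redrawing any whose magnitude exceeds top_p."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    if top_p is None or top_p <= 0:
        # Truncation disabled: plain Gaussian noise.
        return noise
    for i in range(n):
        # Rejection step: resample until the value falls inside [-top_p, top_p].
        while abs(noise[i]) > top_p:
            noise[i] = rng.gauss(0.0, 1.0)
    return noise
```

Under this reading, smaller positive values trade diversity for more conservative samples, since less noise is injected per step.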

Pretrain DiffuSeq?

Hi,

Thanks for your wonderful work!

Pretraining DiffuSeq on a large unsupervised corpus is the first thing that came to my mind after reading this paper, since this is what all LMs do nowadays. The Commonsense Conversation dataset is already around 1GB in size, so I presume DiffuSeq has the ability to scale to large datasets. May I ask why you haven't included such pretraining experiments?

Thank you!

About loss in training_losses_seq2seq() when time step t=0

Thanks for your great work.
I have a question about the loss calculation in training_losses_seq2seq() when the sampled time step is t=0.

x_t = self.q_sample(x_start, t, noise=noise, mask=input_ids_mask) # reparametrization trick.
get_logits = model.model.module.get_logits
terms = {}
target = x_start
model_output = model(x_t, self._scale_timesteps(t), **model_kwargs)

If t=0, the x_t = self.q_sample() line is incorrect, since it tries to sample $x_0$ from $q(x_t|x_0)$. The model_output is therefore invalid, since x_t is invalid.
It then seems that you try to replace the invalid term in the following code:

model_out_x_start = self._x0_helper(model_output, x_t, t)['pred_xstart'] # predicted_xstart = model_output
t0_mask = (t == 0)
t0_loss = mean_flat((x_start_mean - model_out_x_start) ** 2)
terms["mse"] = th.where(t0_mask, t0_loss, terms["mse"])

But you still use the invalid variable model_output to calculate the MSE loss.

Is there anything I have misunderstood? Could you please help me and clarify the code? Thanks.

Resume checkpoint does not include loading embedding?

Hi, may I ask how resuming training from a checkpoint is implemented?
To the best of my understanding, when "args.resume_checkpoint" is specified, the saved embedding is never loaded. Is this a bug?
I also found that the embedding is not trained. Did I miss something, or is this intended?
Thank you very much.

sampling issue

Hi all! First I would like to thank you for sharing the code!
I'd like to apply DiffuSeq to some seq2seq tasks involving protein sequences (amino-acid tokens). So first I tested the model on a task that reverses the order of a sequence, e.g. I E T M L (source seq) to L M T E I (target seq). I used a training dataset of 8K samples and a validation dataset of 2K samples.
During training, the validation and training losses decreased to ~0 as expected (last learning step):

| grad_norm | 0.0696 |
| loss | 0.0783 |
| loss_q0 | 0.0783 |
| loss_q1 | 0.0783 |
| loss_q2 | 0.0784 |
| loss_q3 | 0.0783 |
| mse | 0.0715 |
| mse_q0 | 0.0714 |
| mse_q1 | 0.0715 |
| mse_q2 | 0.0715 |
| mse_q3 | 0.0714 |
| nll | 2.76 |
| nll_q0 | 2.76 |
| nll_q1 | 2.76 |
| nll_q2 | 2.76 |
| nll_q3 | 2.76 |
| samples | 3.84e+06 |
| step | 3e+04 |

However, when I generate sequences using the decoder bash script, I obtain this for the last trained step:
{"recover": "", "reference": "[CLS] L M T E I [SEP]", "source": "[CLS] I E T M L [SEP] [SEP]"}
It seems that the model only predicted PAD from the source sequence.
If I decode from intermediate training steps, I obtain this:
{"recover": "[SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP]", "reference": "[CLS] L M T E I [SEP]", "source": "[CLS] I E T M L [SEP] [SEP]"}

This is my first time working on diffusion models applied to sequences, so I don't know if that could be a problem related to hyperparameters. Would you have any thoughts on that?

Here is some information that may be relevant to this case:

  • I used the protBert model from Rostlab/prot_bert on Hugging Face as the tokenizer, but without initializing from the pretrained model (use_plm_init no);
  • Example of my training samples:
    {"src":"N N E T P","trg":"P T E N N"}
    {"src":"Q E W Q R","trg":"R Q W E Q"}
    {"src":"G Q M P M","trg":"M P M Q G"}
  • Training parameters:
    --diff_steps 2000
    --lr 0.0001
    --learning_steps 30000
    --save_interval 10000
    --seed 102
    --noise_schedule sqrt
    --bsz 128
    --dataset reverse2
    --vocab bert
    --schedule_sampler lossaware
    --notes test-reverse2
    --data_dir /home/ribeiroh/Projetos/DiffuSeq/datasets/reverse
    --seq_len 18
    --config_name Rostlab/prot_bert
    --hidden_dim 128
    --use_plm_init no

Thank you so much!

Questions about TransformerNetModel initialization

Hi:
Thank you for your excellent work and for providing a nice codebase.
First of all, I want to point out a bug. If I initialize with the bert-base-uncased model so that word_embedding uses the BERT embedding, whose dimension is 768, while input_dims is still 128, an error is raised in forward().
Secondly, were all your experiments run without initializing from a pre-trained model? I see that use_plm_init defaults to "no" in the script you provided. I tried initializing from the BERT model, but the loss seems abnormal. Have you tried this?

Any help is appreciated.

Problem of Decoding

I tried to run decode.sh on my own dataset, but it fails with the error below. What should I do to solve it?
(PS: I used --nproc_per_node=3 when training, and I tried both 1 and 3 when decoding, but neither works.)

RAM used: 2240.02 MB
RAM used: 2240.02 MB
### End of reading iteration...
  0%|                                                                                                   | 0/101 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "sample_seq2seq.py", line 210, in <module>
    main()
  File "sample_seq2seq.py", line 148, in main
    samples = sample_fn(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 448, in p_sample_loop
    for sample in self.p_sample_loop_progressive(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 517, in p_sample_loop_progressive
    out = self.p_sample(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 373, in p_sample
    out = self.p_mean_variance(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 931, in p_mean_variance
    return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 311, in p_mean_variance
    assert t.shape == (B,)
AssertionError
### End of reading iteration...
  0%|                                                                                                   | 0/101 [00:00<?, ?it/s]### End of reading iteration...
  0%|                                                                                                   | 0/101 [00:06<?, ?it/s]
Traceback (most recent call last):
  File "sample_seq2seq.py", line 210, in <module>
    main()
  File "sample_seq2seq.py", line 148, in main
    samples = sample_fn(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 448, in p_sample_loop
    for sample in self.p_sample_loop_progressive(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 517, in p_sample_loop_progressive
    out = self.p_sample(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 373, in p_sample
    out = self.p_mean_variance(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 931, in p_mean_variance
    return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 311, in p_mean_variance
    assert t.shape == (B,)
AssertionError
  0%|                                                                                                   | 0/101 [00:06<?, ?it/s]
Traceback (most recent call last):
  File "sample_seq2seq.py", line 210, in <module>
    main()
  File "sample_seq2seq.py", line 148, in main
    samples = sample_fn(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 448, in p_sample_loop
    for sample in self.p_sample_loop_progressive(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 517, in p_sample_loop_progressive
    out = self.p_sample(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 373, in p_sample
    out = self.p_mean_variance(
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 931, in p_mean_variance
    return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
  File "/home/DiffuSeq/diffuseq/gaussian_diffusion.py", line 311, in p_mean_variance
    assert t.shape == (B,)
AssertionError
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 87758 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 87759 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 87757) of binary: /home/anaconda3/envs/kaggle_env/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/kaggle_env/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
sample_seq2seq.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-11-26_07:57:08
  host      : gpu-server2
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 87757)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
############################## decoding finished...

On the calculation of Xt partial loss in Zt

https://github.com/Shark-NLP/DiffuSeq/blob/main/diffuseq/gaussian_diffusion.py#:~:text=terms%5B%22mse%22%5D%20%3D%20mean_flat((target%20%2D%20model_output)%20**%202)
https://github.com/Shark-NLP/DiffuSeq/blob/main/diffuseq/utils/nn.py#:~:text=def%20mean_flat(,.shape))))
Dear author, since you keep the Xt part of Zt unchanged (it is anchored to X0 after each calculation), the Xt part contributes 0 when the MSE loss of Lt-1 is calculated. Will the loss be wrong as a result? I tried the mean_flat function, and it includes those zeros in the element count. Thank you!
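
The effect being asked about can be shown with a small sketch: averaging squared errors over all positions (zeros from the anchored source part included) versus averaging over target positions only. Both helper names and the toy numbers are illustrative assumptions, not code from the repository.

```python
def mean_all(sq_err):
    """Mean over every position, including positions whose error is forced to 0."""
    return sum(sq_err) / len(sq_err)


def mean_masked(sq_err, mask):
    """Mean over positions where mask is truthy (e.g. target positions only)."""
    kept = [e for e, m in zip(sq_err, mask) if m]
    return sum(kept) / len(kept)
```

For a sequence whose first half is anchored, the two means differ by a factor of two. Averaging over all positions therefore rescales the loss by the fraction of target positions, and that fraction varies from example to example when source and target lengths vary.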

Randomly Initialized embeddings?

Hi, thanks for sharing your implementations, it is really helpful and very clean to follow.

I have one question about the embedding used in your model. There are 3 options discussed in the Diffusion-LM paper: 1. fixed randomly initialized embeddings, 2. fixed embeddings initialized from a PLM (like BERT), or 3. end-to-end (E2E) trained embeddings as in Diffusion-LM.
I could not find which one you used in your paper.

But from your code, it seems that you use randomly initialized embeddings, which is different from Diffusion-LM?
Specifically, I find that the input ids are embedded into 128-d random embeddings during data loading:

def load_model_emb(args, tokenizer):

Please correct me if I am wrong.

Nothing generated from decode

Hi DiffuSeq authors!

I followed the README example from training through decoding, but decode generated nothing.

[screenshot]

May I know if there is a problem with my training?

[screenshot]

In particular, I noticed this error during training:
socket.timeout: _ssl.c:1114: The handshake operation timed out

Error when running pip install -r requirements.txt in terminal

When I try to set up with the given command "pip install -r requirements.txt", I get the error below.

ERROR: Could not find a version that satisfies the requirement torch==1.9.0+cu111 (from versions: 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1)
ERROR: No matching distribution found for torch==1.9.0+cu111

Could the solution be to remove +cu111 from 1.9.0+cu111 and just keep torch==1.9.0?
Will that resolve the issue, or does anything else need to be done for the setup?
Please advise.

rounding issue

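Thanks to the authors for this work. Is the rounding operation in the code explained anywhere in the paper? How should I understand that the result of the rounding operation is the word's index in the vocabulary?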
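
As a general illustration of what "rounding" usually means for embedding-space diffusion (an assumption about the technique in general, not a description of this repository's exact code): each predicted vector is mapped to the row of the embedding table closest to it, and the index of that row is the token id. `round_to_tokens` below is a hypothetical pure-Python helper.

```python
def round_to_tokens(vectors, embedding_table):
    """Map each vector to the index of its nearest embedding row (squared L2)."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    token_ids = []
    for v in vectors:
        # The row index in the embedding table is the vocabulary index.
        nearest = min(range(len(embedding_table)),
                      key=lambda i: sq_dist(v, embedding_table[i]))
        token_ids.append(nearest)
    return token_ids
```

Because row i of the table is the embedding of vocabulary entry i, the index of the nearest row can be read directly as a word id and passed to the tokenizer for decoding.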

Problem about Running Time on Dialogue dataset

Hi, thanks for your great work.

I am conducting dialogue-related experiments. I cannot find a time estimate for the dialogue task (there are descriptions and issues about QG and QQP). Additionally, it seems that the model should be trained for 140k steps, which means it may run for 5+ days even using around 4 A100s.

Would you share more detailed experience about the GPU resource settings and running times for the different tasks? I suppose this may be something we should optimize.

Thanks

Problems about running code.

Thanks for your great work! When I run this command:

python -m torch.distributed.launch --nproc_per_node=7 --master_port=12233 --use_env run_train.py --diff_steps 2000 --lr 0.0001 --learning_steps 140000 --save_interval 20000 --seed 102 --noise_schedule sqrt --hidden_dim 128 --bsz 2048 --microbatch 64 --dataset dialogue --data_dir datasets/CommonsenseConversation --vocab bert --seq_len 128 --schedule_sampler lossaware --notes dialogue

the output is stuck at "### Training...". There are no further logs, and nothing appears on wandb either. Please tell me what to do about that.

RoFormer model instead of BERT?

Hi there,
have you tried using the RoFormer model for text generation? I want to use it, since it captures the relative position of each word.
If you have tried it, do I have to change anything other than loading the different model? Right now my generation is much worse than with BERT, which is counterintuitive! :)

Problems of running time and microbatch ?

Thanks for your great work. While reproducing your results, we came up with two questions:

  1. The training time of the QQP task is much longer than 2 days (48h), which is inconsistent with the statement in this repo:

It will take around 2 days to train a DiffuSeq model on 2 NVIDIA A100 GPUs for QG and QQP

We follow the suggested setup of 2 NVIDIA A100 GPUs and run the following command:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=12233 --use_env run_train.py --diff_steps 2000 --lr 0.0001 --learning_steps 50000 --microbatch 64 --save_interval 10000 --seed 102 --noise_schedule sqrt --hidden_dim 128 --bsz 2048 --dataset qqp --data_dir {datasets/QQP} --vocab bert --seq_len 128 --schedule_sampler lossaware --notes qqp

Here is our wandb output:
[screenshot]
as well as the log output:
[screenshot]
Note that we have already run this code for around 15 hours but have only reached step 4960, which is far from the 50000 steps you suggest.

  2. We took a deeper look into your code and found that during the training loop you split one batch into several microbatches.
    We wonder why you do this. Is this trick necessary?

Also, we wonder how the whole training process can finish within 2 days with your suggested command and implementation.
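
For readers unfamiliar with the pattern: splitting a batch into microbatches is commonly done to fit a large effective batch into GPU memory by accumulating gradients over several forward/backward passes before one optimizer step. The sketch below shows the slicing and a size-weighted accumulation on plain numbers standing in for per-example gradients; both helpers are hypothetical and only illustrate the arithmetic, not the repository's training loop.

```python
def microbatch_slices(batch_size, microbatch):
    """(start, end) index pairs covering a batch in chunks of at most `microbatch`."""
    return [(start, min(start + microbatch, batch_size))
            for start in range(0, batch_size, microbatch)]


def accumulate_mean_grad(per_example_grads, microbatch):
    """Accumulate microbatch means, weighted by size, into the full-batch mean."""
    n = len(per_example_grads)
    total = 0.0
    for start, end in microbatch_slices(n, microbatch):
        chunk = per_example_grads[start:end]
        micro_mean = sum(chunk) / len(chunk)
        # Weight each microbatch by its share of the batch.
        total += micro_mean * len(chunk) / n
    return total
```

With a batch of 2048 and a microbatch of 64, as in the command above, this slicing gives 32 forward/backward passes per optimizer step, which is one reason a single logged step can take a long time.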

Typo? train.py

Hello, I might be horribly mistaken but I also might have seen a "typo".
The first docstring in train.py says "Train a diffusion model on images."
Is it incorrect?

[screenshot]

About Transformer Model

I see you use self.input_transformers = BertEncoder(config) to create the Transformer model. Is the Transformer model created through BertEncoder?
Thanks

Decoding only [UNK] token ?

Hi,
I am trying to adapt DiffuSeq to a different language. However, after training on this new language, the model continuously produces the [UNK] token.

What could be the reason? What did I do wrong? Do you have any idea about the problem? Did you experience something similar during development?

[screenshot]

Kind regards.
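
One thing that may be worth checking (an assumption, not a confirmed diagnosis) is how much of the new-language data the vocabulary can represent at all, since the training commands quoted in these issues use `--vocab bert`. The sketch below implements plain greedy longest-match WordPiece splitting over a toy vocabulary and reports the share of tokens that fall back to [UNK]. `wordpiece` and `unk_rate` are hypothetical helper names, and a real check would use the actual tokenizer and vocabulary file instead.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry the ## prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no prefix matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def unk_rate(sentences, vocab, unk="[UNK]"):
    """Fraction of produced tokens that are the unknown token."""
    tokens = [p for s in sentences for w in s.split()
              for p in wordpiece(w, vocab, unk)]
    return sum(t == unk for t in tokens) / len(tokens) if tokens else 0.0


if __name__ == "__main__":
    toy_vocab = {"the", "play", "##ing"}  # toy vocabulary for illustration
    print(unk_rate(["the playing"], toy_vocab))
    print(unk_rate(["göz the"], toy_vocab))
```

A high rate on the training data would suggest that the targets themselves are mostly [UNK], in which case the model is only reproducing what it was trained on.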

q<n> metrics

When it comes to training metrics, I cannot work out the difference between the different q splits (loss_q1, nll_q0, etc.).
Could you briefly explain them or point to the corresponding paper or paragraph?
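
For context, in training loops derived from improved-diffusion, suffixes such as `_q0` to `_q3` usually denote the quartile of the diffusion timestep range that a sampled `t` falls into, so each metric is averaged separately over early and late timesteps. Assuming DiffuSeq follows the same convention (this has not been confirmed against its code), the bucketing can be sketched as follows, with `quartile_means` a hypothetical helper name:

```python
from collections import defaultdict


def quartile_means(timesteps, losses, num_timesteps, key="loss"):
    """Average per-example losses by quartile of the timestep range."""
    buckets = defaultdict(list)
    for t, value in zip(timesteps, losses):
        quartile = int(4 * t / num_timesteps)  # 0..3 for 0 <= t < num_timesteps
        buckets[f"{key}_q{quartile}"].append(value)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}


if __name__ == "__main__":
    print(quartile_means([0, 600, 1999], [1.0, 2.0, 3.0], 2000))
```

Under this reading, `loss_q0` summarizes the least noisy timesteps and `loss_q3` the noisiest ones.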

About the usage of `batch` in dataset (form of `batch`, `cond`)

It seems that the batch argument, which is the first element of the (batch, cond) pair, is not used in training or sampling. It is the output of TextDataset.model_emb(input_ids); in the TrainLoop class it is sliced into micro and passed to the SpacedDistribution.training_losses_seq2seq method. However, in that method it is stored as x_start_fix and never used.
What was the batch argument in the dataset originally meant for?

What is the `tT_loss`?

There is a tT_loss term in the final loss:

out_mean, _, _ = self.q_mean_variance(x_start, th.LongTensor([self.num_timesteps - 1]).to(x_start.device))
tT_loss = mean_flat(out_mean ** 2)

What is this? I cannot find it in the paper. According to the code, out_mean looks like the mean of $x_T$ from the forward diffusion process, where $x_T\sim\mathcal{N}(0, I)$ as $T\rightarrow +\infty$, so out_mean ** 2 should be $\bar{\alpha}_T x_0^2\rightarrow 0$. Also, there seem to be no learnable parameters in the compute graph of tT_loss?

I wonder what this term is for, what it means, and where it comes from?
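
To make the quantity concrete, here is a minimal numeric sketch of the term, assuming the standard forward process in which the mean of $q(x_T \mid x_0)$ is $\sqrt{\bar{\alpha}_T}\,x_0$. The linear beta schedule and the helper names are made up for illustration and are not the repository's `sqrt` schedule.

```python
import math


def linear_betas(n, start=1e-4, end=0.02):
    """Illustrative linear noise schedule (an assumption, not the repo's)."""
    return [start + (end - start) * i / (n - 1) for i in range(n)]


def alpha_bar_final(betas):
    """Cumulative product of (1 - beta_t) over all timesteps."""
    prod = 1.0
    for b in betas:
        prod *= 1.0 - b
    return prod


def tT_loss(x_start, betas):
    """Mean of the squared mean of q(x_T | x_0) for one flat vector."""
    scale = math.sqrt(alpha_bar_final(betas))
    out_mean = [scale * v for v in x_start]
    return sum(m * m for m in out_mean) / len(out_mean)


if __name__ == "__main__":
    print(tT_loss([2.0, 0.0], [0.5, 0.5]))
    print(alpha_bar_final(linear_betas(2000)))
```

The value depends on the network only through `x_start`. If `x_start` comes from a trainable embedding layer (an assumption here), the term would act as a small penalty on embedding norms rather than on the transformer weights.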

About data preprocessing

Thank you for your excellent work. I'm interested in your code and plan to train on my own dataset, but I have no idea how I should preprocess the data. Is there anything I should pay attention to?
Could you please give me some hints and share the preprocessing script? Thank you so much.
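
As a starting point, and assuming the loader expects JSON Lines files with one source/target pair per line under the keys `src` and `trg` (an assumption to verify against the repository README and data loader), a custom dataset could be written like this. `write_jsonl` and `read_jsonl` are hypothetical helper names, and the file path is only an example.

```python
import json
from pathlib import Path


def write_jsonl(pairs, path):
    """Write (source, target) string pairs as one JSON object per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for src, trg in pairs:
            src, trg = src.strip(), trg.strip()
            if not src or not trg:
                continue  # skip examples with an empty side
            record = {"src": src, "trg": trg}  # key names are an assumption
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back as a list of dicts, ignoring blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    write_jsonl([("How do I learn Python?", "What is a good way to learn Python?")],
                "datasets/MyData/train.jsonl")
    print(read_jsonl("datasets/MyData/train.jsonl"))
```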

Questions about "decoder_nll"

Hi, thanks for your implementation. Your code is very clean and clear, but I have one question.
It concerns diffuseq/gaussian_diffusion.py, lines 632 to 636, where there are two NLLs:
decoder_nll = self._token_discrete_loss(x_start, get_logits, input_ids_x) # embedding regularization
terms["nll"] = self._token_discrete_loss(model_out_x_start, get_logits, input_ids_x, mask=input_ids_mask, truncate=True, t=t)
and I see that you use the first one to compute the loss:
terms["loss"] = terms["mse"] + decoder_nll + tT_loss
So I wonder why you use decoder_nll rather than terms["nll"]. I think decoder_nll doesn't involve the transformer part of the model, so can it benefit the training process?
I would appreciate it if you could answer this question.

Only one GPU

If I want to switch to single-GPU training, which parts of the code need to be modified, and how?

Some questions about different losses

Hi,
thank you for releasing such clean code. My questions are based on the revised paper accepted at ICLR.

Here I have some questions about different losses. If I misunderstand your code, I would really appreciate your correction.

  1. It seems you calculate the cross-entropy loss over both x and y rather than just y, which is different from your paper "Note that although in the first term we only compute the loss w.r.t y0, due to the attention mechanism in the transformer, the reconstruction of y0 also takes x0 into account, thus the gradients from the first term will also affect the learning of x0."

Here, your final loss uses decoder_nll rather than terms["nll"], and the calculation of decoder_nll doesn't use the input mask.

  2. A question related to 1: it seems you don't use the rounding loss in the final loss.
    You calculate the rounding loss as:
    terms["nll"] = self._token_discrete_loss(model_out_x_start, get_logits, input_ids_x, mask=input_ids_mask, truncate=True, t=t)
    But the final loss is:
decoder_nll = self._token_discrete_loss(x_start, get_logits, input_ids_x)
terms["loss"] = terms["mse"] + decoder_nll + tT_loss

The rounding loss isn't in the final loss.

  3. Why do you calculate decoder_nll? The input to self._token_discrete_loss for decoder_nll is x_start. It is a noisy word embedding (Gaussian noise added to the word embedding), so it should already be very close to the word embedding.

  4. Why don't you learn sigma? The DDIM paper says a learnable sigma is beneficial.
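
The difference between the two NLL terms discussed in these questions can be illustrated with a toy rounding loss. This is a simplified stand-in, not the repository's `_token_discrete_loss`: logits are plain dot products with an embedding table, and a mask value of 1 is assumed to mark positions that count toward the loss.

```python
import math


def token_discrete_loss(x, emb, ids, mask=None):
    """Mean cross-entropy of rounding vectors x back to token ids."""
    total, count = 0.0, 0
    for pos, (vec, tok) in enumerate(zip(x, ids)):
        if mask is not None and mask[pos] == 0:
            continue  # masked-out position, e.g. the source part
        logits = [sum(a * b for a, b in zip(vec, e)) for e in emb]
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[tok]
        count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    emb = [[10.0, 0.0], [0.0, 10.0]]  # toy 2-token embedding table
    ids = [0, 1]
    x_start = [emb[i] for i in ids]  # clean embeddings of the true tokens
    model_out = [[0.0, 10.0], [0.0, 10.0]]  # prediction, wrong at position 0
    print(token_discrete_loss(x_start, emb, ids))
    print(token_discrete_loss(model_out, emb, ids))
    print(token_discrete_loss(model_out, emb, ids, mask=[0, 1]))
```

In this toy setting the loss on `x_start` is near zero regardless of the model, which is consistent with reading it as a regularizer on the embeddings, while the loss on the model output changes depending on whether the mask restricts it to the target positions.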

Unable to reproduce the results

Unable to reproduce the results of the question generation task in the paper.

  • The results of the question generation task in the paper are as follows:

image

  • We use the provided command to reproduce the following results:
    image

We used the command provided in the repo, as follows:

python -m torch.distributed.launch --nproc_per_node=4 --master_port=12233 --use_env run_train.py --diff_steps 2000 --lr 0.0001 --learning_steps 40000 --save_interval 2000 --seed 102 --noise_schedule sqrt --hidden_dim 128 --bsz 2048 --microbatch 64 --dataset qg --data_dir {datasets/QG} --vocab bert --seq_len 128 --schedule_sampler lossaware --notes qg

Modifications for unconditional text generation

Thank you for your meaningful work and nice code. I would like to ask for some help. I plan to run some experiments on unconditional text generation using your code, where there is no source sequence as input. Could you please give me some tips on how and where I should modify the code? Thanks!

License

Hi

Unfortunately, I am only allowed to run code on our university's cluster if there is a license attached to the code. Would it be possible to attach the standard MIT license to the repo?

Why is trained embedding orthogonal?

I loaded the model ema_0.9999_050000.pt that you shared (thanks for sharing) and found that word_embedding.weight is orthogonal, which is weird! This suggests that the trained embedding fails to learn semantic relatedness between words and just separates words far apart to tolerate generation errors.
Here is my code:

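import torch

# Load the shared EMA checkpoint on CPU.
pt_path = '/home/workarea/Diffusion/DiffuSeq-Fork/diffusion_models/diffuseq_qqp_h128_lr0.0001_t2000_sqrt_lossaware_seed102_test_ori20221113-20_27_29/ema_0.9999_050000.pt'
s = torch.load(pt_path, map_location=torch.device('cpu'))
weight = s['word_embedding.weight']
# Row-wise softmax over all pairwise dot products between embeddings.
mm = torch.softmax(torch.mm(weight, weight.transpose(0,1)), dim=-1)
# Mean of the diagonal: the average probability each embedding assigns to itself.
print(mm.trace()/mm.size(0))
# the result is 1!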

Could you please explain this phenomenon? Thanks a lot!
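
A small caveat on the check itself, shown as a toy sketch rather than an analysis of the released checkpoint: a softmax diagonal close to 1 can also arise when embeddings simply have large norms, even if they point in similar directions, so comparing cosine similarities may be more informative. The two vectors and the helper names below are made up for illustration.

```python
import math


def softmax_diag_mean(weight):
    """Mean diagonal of the row-wise softmax of W W^T."""
    total = 0.0
    for i, wi in enumerate(weight):
        logits = [sum(a * b for a, b in zip(wi, wj)) for wj in weight]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        total += exps[i] / sum(exps)
    return total / len(weight)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    toy = [[10.0, 0.0], [8.0, 6.0]]  # both rows have norm 10
    print(softmax_diag_mean(toy))
    print(cosine(toy[0], toy[1]))
```

Here the two rows have a cosine similarity of 0.8, yet the softmax diagonal is still essentially 1 because the dot products differ by 20 before exponentiation.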

Working with larger datasets

Hi ,
thank you for this awesome project.
I want to apply DiffuSeq to a larger dataset (~17M sentences), but tokenization keeps exhausting my RAM, even though I have 200GB available! Is there functionality I am missing that uses cached tokens, or is this a work in progress?

Thanks again & best!
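
One generic way to keep memory bounded, sketched here independently of the repository's data loader, is to read the file lazily and tokenize it in fixed-size chunks instead of materializing everything at once. The sketch assumes JSON Lines input with `src` and `trg` fields; `iter_tokenized_chunks` and the `tokenize` callable are hypothetical names.

```python
import json


def iter_tokenized_chunks(path, tokenize, chunk_size=10000):
    """Yield lists of tokenized examples, holding one chunk in memory at a time."""
    chunk = []
    with open(path, encoding="utf-8") as f:
        for line in f:  # the file object streams lines lazily
            if not line.strip():
                continue
            row = json.loads(line)
            chunk.append({
                "src_ids": tokenize(row["src"]),
                "trg_ids": tokenize(row["trg"]),
            })
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:
        yield chunk  # final, possibly smaller chunk


def toy_tokenize(text):
    """Placeholder tokenizer: word lengths stand in for token ids."""
    return [len(word) for word in text.split()]
```

Each yielded chunk could then be written to disk (for example one file per chunk) so that later runs reuse the cached token ids rather than tokenizing again.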
