
diffusion-bert's People

Contributors

hzfinfdu, txsun1997


diffusion-bert's Issues

self is not defined for discrete_diffusion_predict_fn()

In discrete_diffusion_predict_fn(), self.device is referenced, but self is not defined in that function. In the snippet below, self.device raises the error:

    if predict_x0:
        init_state = SamplingState(x, x, torch.tensor([num_steps], device=self.device))
    else:
        init_state = SamplingState(x, None, torch.tensor([num_steps], device=self.device))

I tried passing device in as a function argument and adjusting the devices of the variables here, but couldn't get it to work.

Please provide an updated discrete_diffusion_predict_fn() that addresses this device inconsistency if possible.
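One workaround that may sidestep the problem (a sketch, not the authors' fix): take the device from the input tensor x itself, so the function no longer needs self:

    # Minimal sketch, assuming x is the input torch.Tensor and SamplingState
    # has the signature shown above: derive the device from x instead of self.
    device = x.device
    if predict_x0:
        init_state = SamplingState(x, x, torch.tensor([num_steps], device=device))
    else:
        init_state = SamplingState(x, None, torch.tensor([num_steps], device=device))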

Inquiry on some details of the method.

As said in the second paragraph of Section 4.3, "We attribute the superior performance of DiffusionBERT to its onetime sampling of all tokens". I wonder the meaning of "onetime sampling of all tokens", does it mean generating all the tokens in a sentence at a time? If it does, it seems to conflict with the demonstration in Table 1. Thank you!

why TypeError?

When I run word_freq.py, the following error occurs:
Traceback (most recent call last):
File "C:\GithubProjects\Diffusion-BERT-main\word_freq.py", line 18, in
for iid in data['input_ids']:
TypeError: string indices must be integers
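For what it's worth, this error usually means data is a plain string rather than a dict-like dataset; a hypothetical illustration (not the repo's actual loading code):

    # Hypothetical illustration: indexing a str with a string key raises
    # exactly this TypeError, which suggests the dataset was loaded as raw
    # text instead of being parsed/tokenized first.
    import json

    data = '{"input_ids": [[101, 102]]}'
    # data['input_ids']        # TypeError: string indices must be integers
    data = json.loads(data)    # parse first, then index
    for iid in data['input_ids']:
        print(iid)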

checkpoint

Thanks for your great work!

Could you also release the trained checkpoints, to make it more convenient to reproduce the experimental results?

How did you calculate the perplexity of DiffusionLM

Greetings,

I am currently working on diffusion for text generation as well. In your paper you include the PPL of DiffusionLM in your results for comparison. I assume you derived this from the model's ELBO loss, right? Could you share more details of the computation? For example, which loss you used, and whether you estimated token-level or sequence-level PPL. It would be great if you could share the code for this part as well.
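For reference, the usual token-level convention (my assumption, not necessarily what the paper did) exponentiates the per-token bound:

    # Sketch of the standard token-level convention: treat the summed ELBO
    # (in nats) as an upper bound on the corpus NLL, then exponentiate the
    # per-token average. Names here are illustrative assumptions.
    import math

    def token_level_ppl(total_elbo_nats, total_tokens):
        return math.exp(total_elbo_nats / total_tokens)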

Thank you very much. Your help is appreciated as we would like to cite this method.

unfinished codebase?

Hello, first, thank you for your work. I find it fascinating!
I was wondering whether the codebase is complete yet, since predict.py and predict_downstream_conditionals.py still have missing imports, etc.

I was hoping to see how the model actually functions after training it for one epoch.
Any plans on updating soon?

When will the code be released?

Hello, authors!
Do you have any plans to release the code soon? Roughly when will that be?

Missing key(s) in state_dict for unconditional

Hi,

When I was trying to load the checkpoint, it gives the following error:

Missing key(s) in state_dict: "bert.embeddings.position_ids", "bert.embeddings.word_embeddings.weight", "bert.embeddings.position_embeddings.weight", "bert.embeddings.token_type_embeddings.weight", "bert.embeddings.LayerNorm.weight", "bert.embeddings.LayerNorm.bias", "bert.encoder.layer.0.attention.self.query.weight", "bert.encoder.layer.0.attention.self.query.bias", "bert.encoder.layer.0.attention.self.key.weight",......

and a lot of other layer infos.

It looks like the state_dict has keys "module.bert..." rather than "bert..." as expected. This seems similar to issue #17, so please kindly help. How would I fix this issue? Thanks in advance.
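A common fix for this situation (assuming the checkpoint was saved from a DistributedDataParallel-wrapped model; this is not an official patch) is to strip the "module." prefix before loading:

    # Sketch: checkpoints saved from a torch DistributedDataParallel model
    # prefix every parameter key with "module."; stripping the prefix lets
    # a plain model load them. 'best.th' and the 'model' key are assumptions.
    import torch

    ckpt = torch.load('best.th', map_location='cpu')
    state_dict = ckpt.get('model', ckpt)
    state_dict = {
        (k[len('module.'):] if k.startswith('module.') else k): v
        for k, v in state_dict.items()
    }
    model.load_state_dict(state_dict)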

P.S. I got the model checkpoints by running DDP_main.py. I saved earlier-stage checkpoints and stopped training because it took too long in eval mode, with warnings like "NAN encountered ... times". Does your training look the same?

No module named 'perplexity'

When I run predict.py, I get the error "No module named 'perplexity'". After checking, I found that, in addition to perplexity, the compute_metric module is also missing from the environment. How can I obtain these libraries?

code?

I don't see any code here. Is there somewhere else to look?

Resuming training via `--load_step`

Thanks for the code release!

Heads up for other users who want to resume training from a checkpoint: you will want to

  1. de-indent DDP_main.py:80 so that all devices can load the checkpoint
  2. load the optimizer and scheduler states at DDP_main.py:146 (see the sketch below)
  3. set the dataloader index to the correct example before actually training

I'm not totally sure this covers everything (logging, for instance), but it should mostly work.
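A rough sketch of step 2 (variable and key names are my assumptions, not the repo's exact code):

    # Assumed restore logic: load optimizer/scheduler state from the same
    # checkpoint dict as the model. The key names here are guesses.
    import torch

    ckpt = torch.load(ckpt_path, map_location='cpu')
    model.load_state_dict(ckpt['model'])
    if 'optimizer' in ckpt:
        optimizer.load_state_dict(ckpt['optimizer'])
    if 'scheduler' in ckpt:
        scheduler.load_state_dict(ckpt['scheduler'])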

Note: there's also a separate issue where your checkpoints may get overwritten between epochs, so be sure you're loading the right file and saving where you intend.

Lower-case in LM1B

Hello!

In the paper you write

All text data are lower-cased to align with the settings of Austin et al. (2021)

But the D3PM paper never states that the LM1B data was lower-cased (and the samples from their model in the appendix contain upper-case characters). So the perplexity comparison seems unfair, because all-lowercased text is easier to model. Am I missing something?

How to evaluate BLEU score on LM1B?

Dear authors,

I understand that you plan to release your code in January. But could you share more details about how you evaluated the BLEU score and PPL on the LM1B dataset? I am also working on diffusion models for text and may cite your paper. Thanks!
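For context, a common way to compute corpus BLEU on generated text (my assumption about the setup, not the authors' script) is via NLTK:

    # Illustrative only: corpus BLEU with NLTK. Whether the paper used this,
    # sacrebleu, or a self-BLEU variant is exactly what this issue asks.
    from nltk.translate.bleu_score import corpus_bleu

    references = [[ref.split()] for ref in reference_sentences]  # hypothetical data
    hypotheses = [hyp.split() for hyp in generated_sentences]
    print(corpus_bleu(references, hypotheses))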

Function missed

This work is fascinating, but I ran into an issue while reading the code.
Line 511 in diffusion_word_freq.py, "schedule_fn = utils.create_learning_rate_scheduler(", calls create_learning_rate_scheduler from utils.py, but I can't find its definition in utils.py. Maybe the code is incomplete?
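As a stopgap, one could substitute a minimal factory of the same name (purely an assumption about its intent, inferred from the name; not the missing function itself):

    # Hypothetical stand-in, NOT the repo's missing utility: linear warmup
    # followed by a constant learning rate, wrapped as a torch scheduler.
    from torch.optim.lr_scheduler import LambdaLR

    def create_learning_rate_scheduler(optimizer, warmup_steps=1000):
        return LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))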
Thanks.

No module named 'perplexity'

When I run predict.py, I get the error "No module named 'perplexity'". How can I obtain this library?

About the Spindle schedule

Hello, I read in your paper about the new Spindle noising approach you propose. As I understand it, one first runs word_freq.py to compute word frequencies, which produces a .pt file. But I found that I can proceed with the later training steps without ever running the word-frequency script, and I don't see the training code loading the .pt frequency file anywhere. My question is: how do I use your Spindle approach to add noise?
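If it helps, the expected wiring presumably looks something like the following (a guess from the description above; file and variable names are assumptions, not repo code):

    # Speculative sketch of how the word-frequency file would feed the
    # Spindle schedule; everything here is an assumption.
    import torch

    word_freq = torch.load('word_freq.pt')    # assumed output of word_freq.py
    word_freq = word_freq / word_freq.sum()   # normalize to a distribution
    # ...then pass word_freq into the noising/diffusion routine.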

Here are some of my problems, please advise

Hi, dear authors:
  1. For the q(x_t | x_0) part of the paper, did you choose not to compute the denominator?
  2. Why is the predicted x_0 left as floating-point values rather than mapped to one-hot?
  3. step = t - 1 and step = t + 1 appear frequently in your code. Do they have a specific meaning? Thank you.

GPT mentioned in Figure 3

Dear authors,

Thanks for open-sourcing your wonderful work.

You mention GPT in Figure 3 when comparing the Pareto fronts of different models ("AR models of the same size"). May I ask whether this is a pre-trained GPT (e.g., GPT2-small) fine-tuned on the LM1B dataset, or a model with a GPT architecture trained from scratch on the LM1B training set?

How to fine-tune it

It seems to be a pre-trained language model. Could I train on a large amount of unconditional text to get a checkpoint, and then fine-tune the model on seq2seq tasks?

Reimplemented Diffusion-LM

Hi,

Thank you for the amazing work.

While reading your paper I noticed that you reimplemented Diffusion-LM. Do you have any plans to open-source that implementation as well?

Thanks.

Missing key(s) in state_dict when testing using predict_downstream_condition.py

python predict_downstream_condition.py --ckpt_path model_name_roberta-base_taskname_qqp_lr_3e-05_seed_42_numsteps_2000_sample_Categorical_schedule_mutual_hybridlambda_0.0003_wordfreqlambda_0.0_fromscratch_False_timestep_none_ckpts/best(38899).th
using standard schedule with num_steps: 2000.
Traceback (most recent call last):
File "predict_downstream_condition.py", line 101, in
model.load_state_dict(ckpt['model'])
File "/opt/conda/envs/diff/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1672, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for RobertaForMaskedLM:
Missing key(s) in state_dict: "roberta.embeddings.position_ids", "roberta.embeddings.word_embeddings.weight", "roberta.embeddings.position_embeddings.weight", "roberta.embeddings.token_type_embeddings.weight", "roberta.embeddings.LayerNorm.weight", "roberta.embeddings.LayerNorm.bias", "roberta.encoder.layer.0.attention.self.query.weight", "roberta.encoder.layer.0.attention.self.query.bias".........................

Do you plan to release the checkpoints?

Hi,

I tried to train a model on a different dataset, but the loss doesn't change much. I wonder if you could release the checkpoints, so I could first load the model and then fine-tune it on my own dataset?

Thanks

How to calculate entropy?

Dear authors,
Thank you for your paper. It was quite illuminating.

Your proposed noise schedule requires the entropy value of each word/token before noising, but I couldn't find how you calculated it. Is it per sentence, per n-gram, per corpus, etc.? Did you use any libraries to calculate it, or was it done manually?
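For what it's worth, one plausible reading (an assumption on my part, not the paper's stated method) is the information content -log p(w) from corpus-level unigram frequencies:

    # Speculative sketch: per-token information content from corpus unigram
    # frequencies. 'corpus' (a list of token lists) is hypothetical.
    import math
    from collections import Counter

    counts = Counter(tok for sent in corpus for tok in sent)
    total = sum(counts.values())
    info = {tok: -math.log(c / total) for tok, c in counts.items()}  # nats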

Thank you for your time.
