yixinl7 / brio
ACL 2022: BRIO: Bringing Order to Abstractive Summarization
Hello, I'm trying to use your results and code in our team's research (for comparison only), but I'm not sure what is and isn't permitted.
Could you please specify which license is used for this project? e.g. MIT, Apache, ...
Thank you.
Hi,
thank you for the good work!
I am re-implementing BRIO and I am a little confused about the model's 'score mode'. How does the scoring mechanism work, given several candidate summaries and the corresponding source text? As I understand it, the output of the model is a sequence of token ids. Does 'the score of one summary' or 'the probability of generating one summary' mean the similarity between the model's output and the candidate summary?
Thank you for your patience!
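For readers with the same question: my understanding is that score mode does not measure similarity between a generated output and a candidate. Instead, the decoder is run over each candidate with teacher forcing, and the candidate's score is its length-normalized log-likelihood under the model given the source. Below is a minimal sketch of that idea with plain transformers; the checkpoint name and the length-penalty exponent alpha are illustrative assumptions, not the repo's exact code.

```python
# Sketch: score a candidate summary as its length-normalized log-likelihood
# under the model, computed with teacher forcing. Not the repo's exact code.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").eval()

def candidate_score(source: str, candidate: str, alpha: float = 1.0) -> float:
    src = tok(source, return_tensors="pt", truncation=True, max_length=1024)
    cand_ids = tok(candidate, return_tensors="pt", truncation=True)["input_ids"]
    with torch.no_grad():
        # Passing labels makes the model decode the candidate with teacher
        # forcing; logits[:, i] then predicts cand_ids[:, i].
        logits = model(input_ids=src["input_ids"],
                       attention_mask=src["attention_mask"],
                       labels=cand_ids).logits
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(-1, cand_ids.unsqueeze(-1)).squeeze(-1)
    # Length-normalize so longer candidates are not unfairly penalized.
    return (token_lp.sum() / cand_ids.size(1) ** alpha).item()
```

Candidates can then be ranked by calling candidate_score(article, cand) for each one; no similarity comparison against a generated output is involved.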
Hello, I've tried initializing the model using the provided example in the README:
model = BRIO('Yale-LILY/brio-cnndm-uncased', tok.pad_token_id, is_pegasus=False)
However, I've been facing some issues when trying to use it for inference: the arguments of the .generate() method all have a value of None. I've tried just putting in some default values as shown here. Here's how it looks:
inputs = tokenizer([article], max_length=max_length, return_tensors="pt", truncation=True)
summary_ids = model.generate(inputs["input_ids"],
early_stopping=False,
max_length=1024,
num_beams=1,
num_beam_groups=1)
The .generate() method does not create any ids: whenever I try to decode the generated summary, I get the error TypeError: 'NoneType' object is not iterable. When checking the type or content of summary_ids, I get None or <class 'NoneType'>. Why is this happening? When loading the pre-trained models straight from HF, I don't have any issues, but this one does not seem to be working.
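For comparison, generating with the checkpoint loaded directly through transformers (which, as noted above, works fine) looks roughly like the sketch below; the generation settings here are illustrative, not the paper's.

```python
# Sketch: generate with the plain transformers model instead of the BRIO
# wrapper. Generation hyperparameters are illustrative.
from transformers import BartTokenizer, BartForConditionalGeneration

tok = BartTokenizer.from_pretrained("Yale-LILY/brio-cnndm-uncased")
model = BartForConditionalGeneration.from_pretrained("Yale-LILY/brio-cnndm-uncased")

article = "..."  # your source document

inputs = tok([article], max_length=1024, return_tensors="pt", truncation=True)
summary_ids = model.generate(inputs["input_ids"], num_beams=4,
                             max_length=140, early_stopping=True)
print(tok.batch_decode(summary_ids, skip_special_tokens=True)[0])
```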
Hi @yixinL7
Could you please provide the code to convert the BRIO saved checkpoints into the BartForConditionalGeneration format?
Thanks!
The size of the embedding in your checkpoint is 50264, while there are 50265 tokens in the original BART model; the last token is <mask>.
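While waiting for an official script, here is a rough sketch of one way to do the conversion. The key layout (a single leading model. prefix on the wrapper's state dict) and the strict=False fallback are assumptions; inspect the checkpoint's keys before relying on this. The 50264-row embedding noted above also means the base model must be resized first.

```python
# Rough conversion sketch: remap a saved BRIO state dict into
# BartForConditionalGeneration. Key names are assumptions; check sd.keys().
import torch
from transformers import BartForConditionalGeneration

sd = torch.load("model_generation.bin", map_location="cpu")  # your BRIO checkpoint
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
# Match the checkpoint's 50264-token embedding (the <mask> row was dropped).
bart.resize_token_embeddings(50264)

# BRIO wraps the scorer as BRIO.model, so strip one leading "model." prefix.
remapped = {k[len("model."):]: v for k, v in sd.items() if k.startswith("model.")}
missing, unexpected = bart.load_state_dict(remapped, strict=False)
print("missing:", missing)
print("unexpected:", unexpected)
```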
Hi Yixin - thanks for sharing the repo and putting the pre-trained models on HuggingFace.
Unfortunately, though, I'm having trouble loading the tokenizer for CNN/DM. Any thoughts / suggestions? Thanks
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/griffin/sum/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 546, in from_pretrained
return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/home/griffin/sum/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1788, in from_pretrained
return cls._from_pretrained(
File "/home/griffin/sum/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1923, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/griffin/sum/lib/python3.8/site-packages/transformers/models/bart/tokenization_bart_fast.py", line 171, in __init__
super().__init__(
File "/home/griffin/sum/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 110, in __init__
fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: No such file or directory (os error 2)
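A possible workaround until the fast-tokenizer file is fixed (assuming the checkpoint shares BART's vocabulary, which the embedding discussion above suggests): force the slow tokenizer, or fall back to the base BART tokenizer.

```python
# Workaround sketch: avoid the missing tokenizer.json by using the slow
# tokenizer, falling back to the base BART tokenizer if that also fails.
from transformers import AutoTokenizer

try:
    tok = AutoTokenizer.from_pretrained("Yale-LILY/brio-cnndm-uncased", use_fast=False)
except Exception:
    tok = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
```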
Hi @yixinL7, I was training a BRIO model on a different dataset (RedditTIFU) and observed conflicting trends between the MLE and ranking losses. I start from a converged MLE checkpoint. As training proceeds, on the validation set the MLE loss keeps increasing while the ranking loss decreases normally. In raw values, the MLE loss increased from 2.34 to 2.5 and the ranking loss decreased from 1.63 to 0.88 after 130k steps. I set the coefficient of the ranking loss to 1.0 (setting it to higher values, e.g. 100, leads to instability).
After training, I picked the checkpoint with the lowest validation loss (MLE + ranking combined) and saw that the generation performance degraded significantly (to around half of the original model's), although the coordination property seemed to have been learned (the score improves with a higher beam size, but is still far from the raw MLE checkpoint). Do you still have the log files from back then? I was trying to figure out the cause, and it would be helpful to hear your opinion on this!
Hi, thank you for sharing your awesome work.
However, I notice that you lowercase the CNNDM data before tokenizing it when using BRIO-CNN (based on BART).
I checked the vocab of the original BART and of your BRIO model (downloaded from the transformers model hub).
They both contain uppercase tokens.
Thank you for your reply
Hi, thanks for this fantastic work.
Here is my question: I tried to use BRIO in another generation task and re-implemented it in Fairseq. However, I find that the performance is relatively poor after incorporating BRIO.
Looking further into the generation results, I find that many outputs are just a single period. Moreover, the candidate scores seem to collapse to nearly identical values after training with the contrastive loss (I set the hyper-parameters following the CNNDM setting in your paper), as in the example shown below:
before training with the contrastive loss (16 candidates, sorted):
[-0.2314, -0.2862, -0.2660, -0.2471, -0.2442, -0.2796, -0.2611, -0.2617, -0.2608, -0.2984, -0.2622, -0.5395, -0.5655, -0.4688, -0.5250, -0.5317],
after:
[-1.1421, -1.1402, -1.1290, -1.1524, -1.1554, -1.1483, -1.1415, -1.1476, -1.1527, -1.1472, -1.1538, -1.1437, -1.1555, -1.1722, -1.1440, -1.1427]
Can you give me any advice?
Line 287 in a32b78e
It should be:
dct = tok.batch_encode_plus(slines, max_length=args.total_len, return_tensors="pt", padding="max_length", truncation=True)
When I try to use the "brio-xsum-cased" model from Huggingface, a "tokenizer files missing" error is produced.
Could you check the configuration files? This is important to me, thanks!
Hello!
Thank you for releasing the code as open source. It's very helpful for studying abstractive summarization.
I have two questions as follows:
▶ The ROUGE score I obtained on XSum
▶ The code I used:
# evaluate the model as a generator on XSum
python main.py --cuda --gpuid 0 --config xsum -e --model_pt xsum_paper/model_generation.bin -g
# calculate ROUGE score
python cal_rouge.py --ref ./xsum/diverse/test.target.tokenized --hyp ./result/xsum_paper/test.out.tokenized -l
Hi - Thanks for the great code. I've been trying to re-implement BRIO in my HuggingFace fork, but have been unable to get it to work.
I'm curious what this line in RankingLoss is doing:
TotalLoss = loss_func(score, score, ones)
One possibility is that I haven't yet included the gold reference as part of the ranking loss, which might explain why the contrastive loss is causing the gold-standard MLE loss to rise so much. I will add that, but I was also curious about the function above. Thank you!!
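For anyone else puzzled by that line: with torch.nn.MarginRankingLoss(0.0) and identical inputs, loss_func(score, score, ones) evaluates to exactly zero, so it just initializes TotalLoss as a zero tensor with the right dtype, device, and autograd graph before the pairwise terms are accumulated. Below is a condensed sketch of the ranking loss it builds up; the per-rank margin scaling follows the paper, but treat the details as a paraphrase, not the repo's verbatim code.

```python
# Condensed sketch of the pairwise ranking loss. `score` is [batch, n_cand]
# with candidates sorted best-first; a candidate must beat one ranked i
# places lower by a margin that grows linearly with the rank gap i.
import torch

def ranking_loss(score: torch.Tensor, margin: float = 0.001) -> torch.Tensor:
    ones = torch.ones_like(score)
    total = torch.nn.MarginRankingLoss(0.0)(score, score, ones)  # zero init
    n = score.size(1)
    for i in range(1, n):
        pos = score[:, :-i]  # higher-ranked candidates
        neg = score[:, i:]   # candidates i ranks lower
        loss_fn = torch.nn.MarginRankingLoss(margin * i)
        total = total + loss_fn(pos, neg, torch.ones_like(pos))
    return total
```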
Hi,
Thank you for the good work.
I noticed that you use del commands to remove tensors that are no longer needed. Is there anything else that also needs to be deleted? Thank you!
I use a summarization model trained from pegasus-base. On an RTX 8000 GPU, it takes several seconds per training example to generate candidates, which is too slow.
When I need to generate candidate summaries for 5,000,000 examples, it takes a very long time.
How can I speed up candidate generation?
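Some generic levers that usually help, none of them BRIO-specific (the settings below are illustrative assumptions): batch several articles per generate() call, run the model in fp16, shorten inputs and outputs, and lower the beam count if you need fewer than 128 candidates. A sketch:

```python
# Sketch: batched, half-precision diverse beam search for candidate
# generation. Model name, batch size, and lengths are illustrative.
import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

tok = PegasusTokenizer.from_pretrained("google/pegasus-large")
model = PegasusForConditionalGeneration.from_pretrained("google/pegasus-large")
model = model.half().cuda().eval()

def gen_candidates(articles, n_cand=16):
    batch = tok(articles, return_tensors="pt", padding=True,
                truncation=True, max_length=512).to("cuda")
    with torch.no_grad():
        out = model.generate(**batch, num_beams=n_cand, num_beam_groups=n_cand,
                             num_return_sequences=n_cand,
                             diversity_penalty=1.0, max_length=64)
    return tok.batch_decode(out, skip_special_tokens=True)
```

With 16 beams instead of 128 and several articles per call, throughput improves substantially, at some cost in candidate diversity.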
Hi,
Thank you for the good work.
As you stated in "Implementation Details": "We use 4 NVIDIA RTX 3090 GPUs for the model training, and the average running time for one epoch is around 20 hours." What is the actual time for the overall training of your entire model? Since you set epochs to 100 in the config file (e.g. for the XSum dataset), it would take a very long time (20 h x 100?) to train all of these epochs.
Thank you very much!
Hi Yixin,
Excellent work! I was wondering if you will also release the checkpoints for the NYT model?
Thanks,
Tanya
Hi, in this part,
Lines 1863 to 1869 in 135f0e5
encoder_hidden_states and attention_mask won't be changed in the decoder, so a new view of them is more memory-efficient than repeat_interleave, because the repeat operation in PyTorch copies the underlying data storage, as documented. index_select with a proper index would be better:
if self.is_scoring_mode:
    batch_size, cand_num, _ = decoder_input_ids.shape
    encoder_hidden_states = encoder_outputs[0]
    # Index [0,...,0, 1,...,1, ...]: each source repeated once per candidate.
    expanded_return_idx = torch.arange(batch_size).view(-1, 1).repeat(1, cand_num).view(-1).to(encoder_hidden_states.device)
    encoder_hidden_states = encoder_hidden_states.index_select(0, expanded_return_idx)
    attention_mask = attention_mask.index_select(0, expanded_return_idx)
    # Flatten the candidate dimension into the batch dimension.
    decoder_input_ids = decoder_input_ids.view(-1, decoder_input_ids.size(-1))
    decoder_attention_mask = decoder_attention_mask.view(-1, decoder_attention_mask.size(-1))
from evaluate import load
bertscore = load("bertscore", cache_dir="../cache")
bertscore.compute(predictions=predictions, references=references, lang="en")
I want to ask why the BERTScore in the paper is only about 30, while this code gives about 90.
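One plausible explanation (an assumption worth verifying against the paper): scores around 30 usually come from the baseline-rescaled variant of BERTScore, while the raw F1 computed above typically lands near 0.9. The metric exposes this via rescale_with_baseline:

```python
# Sketch: baseline-rescaled BERTScore, which maps raw scores (~0.9) into a
# much lower, more discriminative range (~0.3, i.e. "about 30").
from evaluate import load

bertscore = load("bertscore", cache_dir="../cache")
results = bertscore.compute(predictions=predictions, references=references,
                            lang="en", rescale_with_baseline=True)
print(sum(results["f1"]) / len(results["f1"]))
```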
Hello, I want to use your SimCLS repo for a summarization task; thanks for its performance. I know that I should create a structure like your example, including test.source, test.source.tokenized, test.out, and so on.
Following the instructions to generate candidates, I can generate the test.out file from my input, but I wonder how I can generate test.target from my input.
Thank you (I'm a newbie).
Thank you for sharing the great work. Are there any hyper-parameters that should be changed when using the cased dataset? I only changed the dataset and the pre-trained model, but got even poorer results compared to the uncased model. By the way, I'm wondering why the article part in the cased data is still uncased, as in the attached screenshot.
Can the training process handle long Chinese data? I want to process long Chinese documents. Should I first run an extractive step on the data, or can I use it directly?
Hi,
Thanks for your work.
I am wondering if I can first train a BART model without the ranking loss.
Will simply setting rank_weight to 0 accomplish that?
Hello,
Thank you for the great & well-organized repository!
It seems that the test.source and test.target files in the provided zipped XSum directory are identical.
When I downloaded your preprocessed CNNDM data, I found a problem: some articles and summaries do not correspond.
Here is an example (cnndm_cased val 10002.json):
article:Trying to find a way to explain the birds and the bees to children can be a difficult task for any parent . So luckily for this father , a pair of raccoons took it upon themselves to make his job a little easier by giving a little demonstration in the garden . The hilarious footage captured in Seattle begins innocently enough with some excited children looking out of the window at two raccoons scaling their fence . The children watch on excitedly as the male raccoon chases the female from the fence and into the garden . As the youngsters speculate about whether they will jump from the fence -- before a little muffin man ' song interlude -- one of the raccoons descends into the garden closely pursued by the other . One child , sensing the tension , asks :
Can raccoons fight ?
abstract: Carol Woodle is one of the most sought after celebrity look-a-likes in the business . Today Carol , 59 , appears as Oprah at 90th birthday parties , corporate events and women 's shelters . At one appearance she gave away iPads like Oprah 's famous car giveaways saying : ` You get an iPad and you get an iPad and YOU get an iPad . But after her first husband walked out on her and their three young boys she thought her life was over . I was so low I was even hospitalized for 30 days due to malnutrition and depression ' Being Oprah allowed her to put the boys ' through college .
Hi, your code and paper are really interesting!
The README mentions generation with Huggingface, but training is not mentioned.
Is it possible to train a Huggingface model using the BRIO class?
Thank you.
Hi, I have a question about the gen_candidate.py code. When I run it on 1 NVIDIA RTX 3090 GPU, a runtime error (CUDA out of memory) occurs. I therefore modified the code as below; is this okay?
▶ Before (gen_candidate.py, lines 83-85)
with torch.no_grad():
batch = tok.prepare_seq2seq_batch(src_texts=slines, return_tensors="pt").to(device)
gen = model.generate(**batch, num_return_sequences=128, num_beam_groups=16, diversity_penalty=0.1, num_beams=128, length_penalty=0.6)
dec: List[str] = tok.batch_decode(gen, skip_special_tokens=True)
▶ After (gen_candidate.py, lines 83-85)
max_length = 64
with torch.no_grad():
batch = tok.prepare_seq2seq_batch(src_texts=slines, return_tensors="pt", max_length=max_length, pad_to_max_length=True, truncation=True).to(device)
gen = model.generate(**batch, num_return_sequences=128, num_beam_groups=16, diversity_penalty=0.1, num_beams=128, length_penalty=0.6)
dec: List[str] = tok.batch_decode(gen, skip_special_tokens=True)
Hi, @yixinL7
I was wondering how max_len (the max length of the summary) and total_len (the total length of the source article) are measured: in words or in characters?
Thank you, and I will be glad to hear your answer!
Good day Mr. Yixin, thank you so much for your work.
I'm currently trying to use a pretrained MBart model as the baseline with a custom dataset in order to apply BRIO to Vietnamese. I replaced the crucial pieces (using mbart_modeling.py, changing BartScorer to MBartScorer, using a suitable tokenizer, etc.), but the ROUGE scores during training were quite poor.
I tried to tune the three most sensitive parameters as you suggested in #9, but to no avail. The MLE average loss is always around 6.0 at the beginning of training.
Can you give me some advice?
Hi, you have done amazing work on this, and I want to reproduce it outside a linux-64 environment. Could you please include an environment.yml generated with conda env export --from-history so I can try it on another platform? Thank you so much!!
Hi, thanks for this fantastic work.
Here is my question:
When I follow "Example: evaluating the model as a generator on CNNDM" and run
python cal_rouge.py --ref ./cnndm/test.target.tokenized --hyp ./result/cnndm/test.out.tokenized -l
it tells me:
Rouge155.py line 497, in __get_model_filenames_for_id
raise Exception(
Exception: Could not find any model summaries for the system summary with ID 10. Specified model filename pattern was: #ID#.ref
How can I solve this problem? Thank you!
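A guess from similar reports: this Rouge155 error often means the reference and hypothesis files are misaligned (different line counts, so summary ID 10 has no matching #ID#.ref file). A quick sanity check before digging deeper:

```python
# Quick check: both files should contain one tokenized summary per line
# and exactly the same number of lines. Paths match the command above.
ref_lines = open("./cnndm/test.target.tokenized").readlines()
hyp_lines = open("./result/cnndm/test.out.tokenized").readlines()
print(len(ref_lines), len(hyp_lines))
```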
Hi,
I installed your source code successfully and ran the evaluation command to produce the evaluation scores, but now I want to train the BRIO model on the CNNDM dataset with the following command:
python main.py --cuda --gpuid 0 --config cnndm -l
It logs the following to the terminal:
2022-07-20 07:28:34.137223: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Namespace(accumulate_step=1, adding=0, batch_size=1, config='cnndm', cuda=True, dataset='cnndm', datatype='diverse', do_generation=False, do_reranking=False, do_sample=True, epoch=100, eval_interval=1000, evaluate=False, gen_max_len=140, gen_min_len=55, gold_margin=0, gold_weight=0, gpuid=[0], grad_norm=0, is_pegasus=False, length_penalty=2.0, log=True, margin=0.001, max_len=120, max_lr=0.002, max_num=16, mle_weight=0.1, model_pt='', model_type='facebook/bart-large-cnn', no_gold=False, normalize=True, num_beams=4, port=12355, pretrained=None, rank_weight=10, report_freq=100, scale=1, score_mode='log', seed=970903, smooth=0.1, total_len=128, warmup_steps=10000)
BRIO(
(model): BartScorer(
(model): CustomBartModel(
(shared): Embedding(50264, 1024, padding_idx=1)
(encoder): BartEncoder(
(embed_tokens): Embedding(50264, 1024, padding_idx=1)
(embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
(layers): ModuleList(
(0): BartEncoderLayer(
(self_attn): BartAttention(
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(activation_fn): GELUActivation()
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(1)-(11): 11 more BartEncoderLayer blocks, identical to (0) above
)
(layernorm_embedding): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(decoder): BartDecoder(
(embed_tokens): Embedding(50264, 1024, padding_idx=1)
(embed_positions): BartLearnedPositionalEmbedding(1026, 1024)
(layers): ModuleList(
(0): BartDecoderLayer(
(self_attn): BartAttention(
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(activation_fn): GELUActivation()
(self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(encoder_attn): BartAttention(
(k_proj): Linear(in_features=1024, out_features=1024, bias=True)
(v_proj): Linear(in_features=1024, out_features=1024, bias=True)
(q_proj): Linear(in_features=1024, out_features=1024, bias=True)
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
(1)-(11): 11 more BartDecoderLayer blocks, identical to (0) above
)
(layernorm_embedding): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
)
)
(lm_head): Linear(in_features=1024, out_features=50264, bias=False)
)
)
Traceback (most recent call last):
File "main.py", line 548, in
main(args)
File "main.py", line 523, in main
run(0, args)
File "main.py", line 444, in run
loss.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/init.py", line 132, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 314.00 MiB (GPU 0; 10.92 GiB total capacity; 9.50 GiB already allocated; 69.31 MiB free; 10.02 GiB reserved in total by PyTorch)
Exception ignored in: <bound method Recorder.__del__ of <utils.Recorder object at 0x7f917d3a0e48>>
Traceback (most recent call last):
File "/tf/tiennv80/NewsSum/brio/utils.py", line 54, in del
File "/usr/local/lib/python3.6/dist-packages/torch/utils/tensorboard/writer.py", line 1034, in close
File "/usr/local/lib/python3.6/dist-packages/torch/utils/tensorboard/writer.py", line 139, in close
File "/usr/local/lib/python3.6/dist-packages/tensorboard/summary/writer/event_file_writer.py", line 130, in close
File "/usr/local/lib/python3.6/dist-packages/tensorboard/summary/writer/event_file_writer.py", line 185, in close
File "/usr/local/lib/python3.6/dist-packages/tensorboard/summary/writer/event_file_writer.py", line 213, in stop
File "/usr/lib/python3.6/queue.py", line 145, in put
File "/usr/lib/python3.6/threading.py", line 347, in notify
TypeError: 'NoneType' object is not callable
I am running it on a Linux machine with a single GTX 1080 Ti GPU (11 GB).
Could you give me some tips for this error?
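Not an answer from the author, but the Namespace above shows accumulate_step=1 and max_num=16; assuming those values can be overridden on the command line (rather than only in config.py), trading candidates per example for memory is one thing to try on an 11 GB card, e.g.:
python main.py --cuda --gpuid 0 --config cnndm -l --max_num 8 --accumulate_step 2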
Hi Yixin, thank you for this fantastic work.
I am reproducing the BRIO model and would like to understand the difference between the data and data.tokenized files, since there seems to be no code that distinguishes them.