microsoft / unilm
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Home Page: https://aka.ms/GeneralAI
License: MIT License
It's very nice to get the test results, but in practice they are not useful for research, because we shouldn't touch the test set.
So if it is convenient, please release the dev results as well. I know this is a demanding request, since providing a useful guide is already very generous, but I make it because of the high quality of the repo and its maintainers.
Thanks a lot.
Thanks for answering my previous issue.
I have a new (easy) one: do you plan to release the UniLM model in languages other than English?
Thanks in advance for your response
Philippe
Excellent work!
Would it be possible to provide fine-tuned models for Generative QA along with training and inference instructions similar to those provided for Abstractive Summarization and Question Generation?
Hi
I've noticed some authors in common between the two papers (https://arxiv.org/pdf/1905.03197.pdf and https://arxiv.org/pdf/1909.10481v3.pdf). How are these two projects related?
Thanks, and congratulations on such impressive work!
Philippe
I have a very limited dataset of 225 samples. The task is similar to Gigaword headline generation. The statistics for my source and target sequences look like this:
Source:
(After BERT tokenizer)
Target:
(After BERT tokenizer)
I used an 80-20 split: trained on 180 samples and tested on 45. I tried running decoding with different values of max_tgt_length, and got the following results:
What's happening here, and what is a good workaround given the variation in my data?
For pre-training the seq2seq LM, how do you construct the training examples? In particular, given an unannotated corpus, what are the source segment and the target segment?
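For illustration, here is a minimal sketch of one plausible construction, assuming (as for BERT-style pairs) that two adjacent spans of the same document serve as source and target; the helper name, the fixed split ratio, and the packing details are my own assumptions, not the repo's actual preprocessing:

```python
def make_seq2seq_example(sentences, tokenizer, max_len=512, src_ratio=0.75):
    """Hypothetical sketch: pack two adjacent spans of one document as the
    source and target segments of a seq2seq pre-training example."""
    tokens = [t for s in sentences for t in tokenizer.tokenize(s)][:max_len - 3]
    cut = int(len(tokens) * src_ratio)        # assumed split point
    src, tgt = tokens[:cut], tokens[cut:]
    # UniLM-style packing: [CLS] source [SEP] target [SEP], with segment ids
    # distinguishing the two; masked-LM targets would then be sampled from both.
    input_tokens = ["[CLS]"] + src + ["[SEP]"] + tgt + ["[SEP]"]
    segment_ids = [0] * (len(src) + 2) + [1] * (len(tgt) + 1)
    return input_tokens, segment_ids
```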
Using the Docker image to run the code, "pip install --user --editable ." succeeds,
but I can't fine-tune on Gigaword with "run_seq2seq.py".
Hi, I would like to know more about the comparison between the standard language model fine-tuning (teacher forcing) and the masked language model fine-tuning (same objective as pre-training) in your paper. In my opinion, for the text generation task, the most popular fine-tuning approach would be the teacher forcing in language modeling as it mimics the generation process during testing. Thanks!
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'
Hello,
I noticed that BertForSeq2SeqDecoder is slow on CPU, mainly due to the while loop inside the forward method: it iterates N times, where N is the difference between next_pos and output_length, and next_pos increments by 1 at each iteration.
Can you explain what the following are:
Do you have any idea how to optimize the code to get rid of the while loop?
Thanks a lot; any help explaining those variables would be appreciated.
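A note for anyone hitting the same slowdown: the loop is autoregressive (step t needs the token produced at step t-1), so it cannot simply be removed; each iteration can only be made cheaper by caching the hidden states of already-decoded positions, which is presumably what prev_embedding and prev_encoded_layers are for. A minimal, hypothetical sketch of that pattern — model_step and its cache argument are stand-ins, not the repo's actual API:

```python
import torch

@torch.no_grad()
def greedy_decode(model_step, src_ids, max_new_tokens, sep_id):
    """Illustrative incremental decoding with state caching.

    model_step(new_ids, cache) is assumed to run the transformer only over
    the newly added position, reusing cached per-layer states for everything
    decoded so far, and to return (logits, updated_cache)."""
    out, ids, cache = [], src_ids, None
    for _ in range(max_new_tokens):             # sequential by necessity
        logits, cache = model_step(ids, cache)  # only the new position is computed
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        out.append(next_id)
        if (next_id == sep_id).all():           # stop at end-of-sequence
            break
        ids = next_id                           # feed back just the new token
    return torch.cat(out, dim=1)
```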
I'm having trouble reproducing the results on CNN/DM dataset.
I downloaded the data and the fine-tuned model provided in the README, and I followed the commands to predict the test set.
Everything runs fine, but at the end I have the following results:
1 ROUGE-1 Average_R: 0.62689 (95%-conf.int. 0.62269 - 0.63111)
1 ROUGE-1 Average_P: 0.13695 (95%-conf.int. 0.13561 - 0.13828)
1 ROUGE-1 Average_F: 0.22101 (95%-conf.int. 0.21918 - 0.22288)
1 ROUGE-2 Average_R: 0.33142 (95%-conf.int. 0.32673 - 0.33603)
1 ROUGE-2 Average_P: 0.06949 (95%-conf.int. 0.06832 - 0.07078)
1 ROUGE-2 Average_F: 0.11266 (95%-conf.int. 0.11089 - 0.11456)
1 ROUGE-L Average_R: 0.52624 (95%-conf.int. 0.52179 - 0.53061)
1 ROUGE-L Average_P: 0.11465 (95%-conf.int. 0.11345 - 0.11598)
1 ROUGE-L Average_F: 0.18509 (95%-conf.int. 0.18333 - 0.18698)
/root/code/unilm/src/cnndm_model/cnndm_model.bin.test.alp1.0
ROUGE-F(1/2/l): 22.10/11.27/18.51
ROUGE-R(1/2/3/l): 62.69/33.14/52.62
It's weird, because I checked the prediction file (cnndm_model.bin.test.alp1.0.post) and compared it with the one provided in the README, and most of the time there are only a few differences.
Here is a comparison of the last few lines of the file (left is the 'official' one, right is mine)
When running run_seq2seq.py to fine-tune the model on summarization datasets, the program always crashes with "Segmentation fault (core dumped)". The command is as follows:
export CUDA_VISIBLE_DEVICES=0,1,2,3
python biunilm/run_seq2seq.py \
  --do_train --fp16 --amp --num_workers 0 \
  --bert_model ../bert-large-cased/ --new_segment_ids --tokenized_input \
  --output_dir ../summ_model/bert_save \
  --log_dir ../summ_model/bert_log \
  --model_recover_path ../storage/unilmv1-large-cased.bin \
  --max_seq_length 768 --max_position_embeddings 768 \
  --trunc_seg a --always_truncate_tail \
  --max_len_a 568 --max_len_b 200 \
  --mask_prob 0.7 --max_pred 140 \
  --train_batch_size 48 --gradient_accumulation_steps 2 \
  --learning_rate 0.00003 --warmup_proportion 0.1 --label_smoothing 0.1 \
  --num_train_epochs 30
What causes the Segmentation fault error? Thanks for your help!
A docker image for UniLM would be great.
Hi, I have a problem evaluating performance on the QG task. Using your released evaluation scripts, I can reproduce your reported performance; but when I use https://github.com/xinyadu/nqg, I get a different result.
Why do I get this result?
Thank you for your help.
The multi-GPU command line uses only a single GPU, even when the visible devices are 0,1.
Any suggestions?
Thank you for open-sourcing the code.
It seems that there is an issue with the question generation evaluation script. The script produces nearly the same results as reported in the paper, but it contains some post-processing steps that are the main source of the improvement.
Most authors have used nlg-eval to report performance. With nlg-eval, the scores on the test dataset, comparing the released generation output against the gold questions (test.q.tok.txt), are as follows:
Bleu_1: 0.407580
Bleu_2: 0.275720
Bleu_3: 0.201373
Bleu_4: 0.151140
METEOR: 0.161781
ROUGE_L: 0.436765
I am requesting you to please clarify the difference.
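For anyone wanting to reproduce this comparison, a hedged example of the nlg-eval Python API (pip install nlg-eval); the hypothesis file name here is a placeholder for the released generation output:

```python
from nlgeval import compute_metrics

# Placeholder file names: released generation output vs. tokenized gold questions.
metrics = compute_metrics(hypothesis='qg_generated_output.txt',
                          references=['test.q.tok.txt'])
print(metrics)  # Bleu_1..Bleu_4, METEOR, ROUGE_L, CIDEr
```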
It seems you don't use Whole Word Masking in pre-training.
Whole Word Masking has been shown to be useful for BERT, so will you try it for UniLM (and release the pre-trained model)?
Thanks!
I was installing a particular commit of NVIDIA/apex by doing this on Colab:
%%writefile setup.sh
git clone -q https://github.com/NVIDIA/apex.git
cd apex
git reset --hard 1603407bf49c7fc3da74fceb6a6c7b47fece2ef8
cd ..
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex
I'm getting the error below. The problem only occurs with this particular commit; if I install the master branch, it installs fine.
.....
csrc/scale_check_overflow.cpp:14:3: note: in expansion of macro ‘AT_CHECK’
AT_CHECK(grads.type().is_cuda(), "grads must be a CUDA tensor");
^
.....
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Running setup.py install for apex ... error
Cleaning up...
Removing source in /tmp/pip-req-build-wmagyiis
Removed build tracker '/tmp/pip-req-tracker-9l2sbkhe'
ERROR: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-wmagyiis/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-wmagyiis/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-dmtu3t6t/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.
Exception information:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/cli/base_command.py", line 153, in _main
status = self.run(options, args)
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/commands/install.py", line 455, in run
use_user_site=options.use_user_site,
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/req/init.py", line 62, in install_given_reqs
**kwargs
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/req/req_install.py", line 888, in install
cwd=self.unpacked_source_directory,
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/utils/subprocess.py", line 275, in runner
spinner=spinner,
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/utils/subprocess.py", line 242, in call_subprocess
raise InstallationError(exc_msg)
pip._internal.exceptions.InstallationError: Command errored out with exit status 1: /usr/bin/python3 -u -c 'import sys, setuptools,
While running inference on custom data, I get the following error:
RuntimeError: "add_cpu/sub_cpu" not implemented for 'Half'
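For context, an assumption based on the error rather than a confirmed diagnosis: 'Half' (fp16) kernels are not implemented on CPU in this PyTorch version, so the message usually means fp16 weights or inputs are being run on CPU. The usual workaround is to drop --fp16/--amp, or to cast the recovered model back to float32 for CPU inference; a minimal sketch with a hypothetical helper:

```python
import torch

def prepare_cpu_model(model, checkpoint_path):
    """Hypothetical helper: load a checkpoint for CPU inference in float32,
    since fp16 ('Half') ops are unavailable on CPU."""
    state = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state, strict=False)
    return model.float().eval()  # undo any .half() cast before CPU inference
```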
As mentioned in the paper, during fine-tuning you masked some tokens in the summaries and then predicted those tokens. But during inference, when only test data is given, you don't have the summaries. So how does prediction happen at inference time? I mean, how did you give the inputs for inference, and how did the decoding work?
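A sketch of my understanding of the answer (a paraphrase of the paper's description, not the repo's exact code): at inference time the target is generated left to right; at each step a [MASK] token is appended, the model predicts it through the seq2seq attention mask, and the prediction replaces the [MASK] before the next step:

```python
def masked_lm_decode(predict_mask, src_tokens, max_tgt_len):
    """Illustrative greedy decoding for a seq2seq-finetuned masked LM.

    predict_mask(tokens) is a stand-in for a forward pass that returns the
    predicted token at the position of the final [MASK]."""
    prefix = ["[CLS]"] + src_tokens + ["[SEP]"]
    target = []
    for _ in range(max_tgt_len):
        pred = predict_mask(prefix + target + ["[MASK]"])  # predict appended mask
        if pred == "[SEP]":                                # [SEP] ends generation
            break
        target.append(pred)                                # feed prediction back
    return target
```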
Hi. I want to try QG using decode_seq2seq.py. It works when I try the sample data, but when I use other data, I encounter KeyError: 'H.E.':
File "/root/code/unilm/src/pytorch_pretrained_bert/tokenization.py", line 117, in convert_tokens_to_ids
ids.append(self.vocab[token])
KeyError: 'H.E.' # or another out-of-vocabulary token
I am guessing that the provided model is for machines with a CUDA-capable device.
Do you happen to have a pre-trained CPU version of cnndm_model.bin?
@@ -165,7 +165,7 @@ def main():
print(args.model_recover_path)
for model_recover_path in glob.glob(args.model_recover_path.strip()):
logger.info("***** Recover model: %s *****", model_recover_path)
- model_recover = torch.load(model_recover_path)
+ model_recover = torch.load(model_recover_path, map_location="cpu")
DATA_DIR=../cnndm_data
MODEL_RECOVER_PATH=../cnndm_model.bin
EVAL_SPLIT=test
export PYTORCH_PRETRAINED_BERT_CACHE=/tmp/bert-cased-pretrained-cache
# run decoding
python biunilm/decode_seq2seq.py --fp16 --amp --bert_model bert-large-cased --new_segment_ids --mode s2s --need_score_traces \
--input_file ${DATA_DIR}/${EVAL_SPLIT}.src --split ${EVAL_SPLIT} --tokenized_input \
--model_recover_path ${MODEL_RECOVER_PATH} \
--max_seq_length 768 --max_tgt_length 128 \
--batch_size 64 --beam_size 5 --length_penalty 0 \
--forbid_duplicate_ngrams --forbid_ignore_word ".|[X_SEP]"
11/04/2019 15:55:06 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt from cache at /tmp/bert-cased-pretrained-cache/cee054f6aafe5e2cf816d2228704e326446785f940f5451a5b26033516a4ac3d.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=51 error=38 : no CUDA-capable device is detected
Traceback (most recent call last):
File "biunilm/decode_seq2seq.py", line 254, in <module>
main()
File "biunilm/decode_seq2seq.py", line 147, in main
amp_handle = amp.init(enable_caching=True)
File "/home/john/.virtualenvs/unilm/lib/python3.6/site-packages/apex/amp/amp.py", line 65, in init
handle = AmpHandle(enable_caching, verbose)
File "/home/john/.virtualenvs/unilm/lib/python3.6/site-packages/apex/amp/handle.py", line 14, in __init__
self._default_scaler = LossScaler()
File "/home/john/.virtualenvs/unilm/lib/python3.6/site-packages/apex/amp/scaler.py", line 35, in __init__
self._overflow_buf = torch.cuda.IntTensor([0])
File "/home/john/.virtualenvs/unilm/lib/python3.6/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /pytorch/aten/src/THC/THCGeneral.cpp:51
[1] 72305 exit 1 python biunilm/decode_seq2seq.py --fp16 --amp --bert_model bert-large-cased
Without --amp:
Traceback (most recent call last):
File "biunilm/decode_seq2seq.py", line 254, in <module>
main()
File "biunilm/decode_seq2seq.py", line 216, in main
position_ids, input_mask, task_idx=task_idx, mask_qkv=mask_qkv)
File "/home/john/.virtualenvs/unilm/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/john/code/unilm/src/pytorch_pretrained_bert/modeling.py", line 1409, in forward
return self.beam_search(input_ids, token_type_ids, position_ids, attention_mask, task_idx=task_idx, mask_qkv=mask_qkv)
File "/home/john/code/unilm/src/pytorch_pretrained_bert/modeling.py", line 1528, in beam_search
output_all_encoded_layers=True, prev_embedding=prev_embedding, prev_encoded_layers=prev_encoded_layers, mask_qkv=mask_qkv)
File "/home/john/.virtualenvs/unilm/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/john/code/unilm/src/pytorch_pretrained_bert/modeling.py", line 1062, in forward
input_ids, token_type_ids, attention_mask)
File "/home/john/code/unilm/src/pytorch_pretrained_bert/modeling.py", line 1037, in get_extended_attention_mask
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
File "/home/john/.virtualenvs/unilm/lib/python3.6/site-packages/torch/tensor.py", line 371, in __rsub__
return _C._VariableFunctions.rsub(self, other)
RuntimeError: "add_cpu" not implemented for 'Half'
Packages:
pytorch-pretrained-bert 0.4.0
torch 1.1.0
tensorboardX 1.9
apex 0.1
I want to know the evaluation method for the generation tasks in the paper. Do you report the test result at the epoch with the highest validation score, or the result of the last epoch? If it is the latter, is the random seed fixed for each training run? Thank you so much!
In your paper, you generate five million answerable and four million unanswerable examples to improve question answering. Can you provide the generated examples so we can reproduce the results? Thank you very much.
As titled:
export PYTORCH_PRETRAINED_BERT_CACHE=/{tmp_folder}/bert-cased-pretrained-cache
From this command I can't understand where I can find bert-cased-pretrained-cache. I tried to pip install it separately, but there's no bert-cased-pretrained-cache package.
Thanks for open-sourcing the code!
After reading your paper, I have a question about the fine-tuning procedure for abstractive summarization (and more generally any seq2seq task).
I understand the idea: similarly to BERT and to UniLM pre-training, fine-tuning on abstractive summarization masks some tokens and predicts them, in order to learn a bidirectional representation of tokens.
But at inference time, since we don't have access to the whole summary (it is yet to be generated), we can only apply a left-to-right LM.
That seems like a pretty big discrepancy between training and testing.
What I don't understand is that people have already tried to use BERT (trained as a bidirectional encoder) as a left-to-right LM, but the results were really low.
And in your case, the results are very high!
So my questions are:
Did I miss something? Did I misunderstand, and there is in fact no discrepancy?
If I understood right, why do you fine-tune the seq2seq model with a bidirectional LM objective, and not a left-to-right LM?
Thanks for your contribution! Will you consider releasing a model based on BERT-base?
ModuleNotFoundError: No module named 'pytorch_pretrained_bert'
I'm not able to convert raw custom data into the preprocessed format required by the model. Could you please help?
I checked #11 but was not able to implement it.
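In case it helps others, a hedged sketch of the kind of preprocessing the --tokenized_input commands in these issues seem to expect: parallel source/target files with one whitespace-joined, WordPiece-tokenized example per line. The function name and exact file convention are assumptions, not the repo's documented pipeline:

```python
from pytorch_pretrained_bert.tokenization import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-large-cased', do_lower_case=False)

def write_tokenized(pairs, src_path, tgt_path):
    """Write parallel source/target files, one tokenized example per line."""
    with open(src_path, 'w') as fs, open(tgt_path, 'w') as ft:
        for src, tgt in pairs:
            fs.write(' '.join(tokenizer.tokenize(src)) + '\n')
            ft.write(' '.join(tokenizer.tokenize(tgt)) + '\n')
```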
Hello authors, are the results given in the paper the evaluation results of the last epoch, or the best test results on the dev set across epochs? Also, if it is the first case, was the random seed left unfixed? Thanks!
After installing the environment, I ran run_seq2seq.py. When I load the model, a segmentation fault (core dump) occurs. My environment is PyTorch 1.1.0, CUDA 10.1, torchvision 0.3.0. Why does the core dump occur?
Hi
Thanks for sharing this.
I've tried to run the question generation part.
I can manage to make it work, but only with a batch size of 8, because of the 16GB limit of my GPU.
So I wanted to switch to FP16 to increase the batch size and speed up training.
I'm getting this error:
10/16/2019 09:32:57 - INFO - main - device: cuda n_gpu: 1, distributed training: False, 16-bits training: True
10/16/2019 09:32:57 - INFO - pytorch_pretrained_bert.tokenization - loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt from cache at /tmp/bert-cased-pretrained-cache/cee054f6aafe5e2cf816d2228704e326446785f940f5451a5b26033516a4ac3d.e13dbb970cb325137104fb2e5f36fe865f27746c6b526f6352861b1980eb80b1
Loading Train Dataset /root/unilm/data/train
Load 75722 documents
10/16/2019 09:33:02 - INFO - main - enable fp16 with amp
10/16/2019 09:33:02 - INFO - main - ***** Recover model: /root/unilm/models/unilmv1-large-cased.bin *****
10/16/2019 09:33:03 - INFO - pytorch_pretrained_bert.modeling - loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased.tar.gz from cache at /tmp/bert-cased-pretrained-cache/7fb0534b83c42daee7d3ddb0ebaa81387925b71665d6ea195c5447f1077454cd.eea60d9ebb03c75bb36302aa9d241d3b7a04bba39c360cf035e8bf8140816233
10/16/2019 09:33:03 - INFO - pytorch_pretrained_bert.modeling - extracting archive file /tmp/bert-cased-pretrained-cache/7fb0534b83c42daee7d3ddb0ebaa81387925b71665d6ea195c5447f1077454cd.eea60d9ebb03c75bb36302aa9d241d3b7a04bba39c360cf035e8bf8140816233 to temp dir /tmp/tmppli0_vk5
10/16/2019 09:33:14 - INFO - pytorch_pretrained_bert.modeling - Model config {
"attention_probs_dropout_prob": 0.1,
"directionality": "bidi",
"ffn_type": 0,
"fp32_embedding": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 4096,
"label_smoothing": 0.1,
"max_position_embeddings": 512,
"new_pos_ids": false,
"num_attention_heads": 16,
"num_hidden_layers": 24,
"num_qkv": 0,
"pooler_fc_size": 768,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 128,
"pooler_type": "first_token_transform",
"relax_projection": 0,
"seg_emb": false,
"task_idx": 3,
"type_vocab_size": 6,
"vocab_size": 28996
}
10/16/2019 09:33:49 - INFO - pytorch_pretrained_bert.modeling - Weights of BertForPreTrainingLossMask not initialized from pretrained model: ['crit_mask_lm_smoothed.one_hot']
10/16/2019 09:33:50 - INFO - main - ***** CUDA.empty_cache() *****
10/16/2019 09:33:50 - INFO - main - ***** Running training *****
10/16/2019 09:33:50 - INFO - main - Batch size = 4
10/16/2019 09:33:50 - INFO - main - Num steps = 9465
Epoch: 0%| | 0/1 [00:01<?, ?it/s]
Traceback (most recent call last):
File "biunilm/run_seq2seq.py", line 483, in
main()
File "biunilm/run_seq2seq.py", line 461, in main
optimizer.step()
File "/root/.local/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/apex/optimizers/fp16_optimizer.py", line 157, in step
grads_groups_flat.append(_flatten_dense_tensors([p.grad for p in group]))
File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 192, in _flatten_dense_tensors
flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0)
File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 192, in
flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0)
AttributeError: 'NoneType' object has no attribute 'contiguous'
Iter (loss=5.700): 0%|
when running this command line:
python3 biunilm/run_seq2seq.py --do_train --num_workers 0 \
  --bert_model bert-large-cased --new_segment_ids --tokenized_input \
  --data_dir ${DATA_DIR} --src_file train.pa.tok.txt --tgt_file train.q.tok.txt \
  --output_dir ${OUTPUT_DIR}/bert_save \
  --log_dir ${OUTPUT_DIR}/bert_log \
  --model_recover_path ${MODEL_RECOVER_PATH} \
  --max_seq_length 512 --max_position_embeddings 512 \
  --mask_prob 0.7 --max_pred 48 \
  --train_batch_size 8 --gradient_accumulation_steps 2 \
  --learning_rate 0.00002 --warmup_proportion 0.1 --label_smoothing 0.1 \
  --num_train_epochs 1 \
  --amp \
  --fp16
Any clue where this could come from?
Thanks in advance
Philippe
Hello,
First of all, thank you so much for open sourcing your code. This is great work! I have a quick question on the tasks in
https://github.com/microsoft/unilm/blob/master/src/pytorch_pretrained_bert/modeling.py#L1214-L1217
Can you define task_idx? I see from seq2seq_loader.py that task_idx=3 is for the seq2seq LM and left2right LM. I think that task_idx=0 is for the bidirectional LM. However, I am not sure about 1 and 2.
I really appreciate your help!
UniLM team, awesome work!! I am able to generate very good quality questions and am thoroughly impressed: some of the generated questions are simply amazing. Generative models are the future and carry insane potential!!
I have one question though:
I followed your approach of using the [SEP] tag after passages to provide hints for drawing out meaningful questions. However, I am not sure how I can scale this to my dataset. I am thinking of applying NER to my passages and, piggybacking on selected entities, generating [SEP] hints for every passage.
Is there a better and faster approach? Obviously NER does not always pick the hints I would wish to capture, so I may lose a lot of information with the NER approach, but I cannot think of any better alternative. Manually curating [SEP] hints is not feasible for my research work.
Thank you & keep doing the amazing work!
Anshoo
Hi, I find that during seq2seq decoding you pad the source sequence to the length max_src_length with '[PAD]', but during training there is no such padding of the source sequence. Would this introduce any inconsistency?
I am a bit confused about the implementation of Equation 3 of the paper, where the mask matrix M is used. Can you please describe how the mask matrix M is implemented in the code, and in which part of the code Equation 3 is applied?
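For readers with the same question, the additive-mask reading (consistent with the get_extended_attention_mask snippet quoted in another issue above, where blocked entries become -10000.0): M is zero where attention is allowed and a large negative value where it is blocked, and it is added to the attention logits before the softmax. A minimal sketch of the seq2seq variant; shapes and names are illustrative, not the repo's exact code:

```python
import torch

def seq2seq_attention_mask(src_len, tgt_len):
    """Additive mask M for Eq. 3: Attention = softmax(Q K^T / sqrt(d_k) + M).

    Source positions attend bidirectionally within the source; target
    positions attend to all source positions and only leftward within the
    target. Blocked entries get -10000.0 so they vanish after softmax."""
    n = src_len + tgt_len
    allowed = torch.zeros(n, n)
    allowed[:, :src_len] = 1.0                 # every position sees the source
    allowed[src_len:, src_len:] = torch.tril(  # causal within the target
        torch.ones(tgt_len, tgt_len))
    return (1.0 - allowed) * -10000.0          # 0 if allowed, -1e4 if blocked
```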
git version 2.23.0.windows.1
alias=`whoami | cut -d'.' -f2`; docker run -it --rm --runtime=nvidia --ipc=host --privileged -v /home/${alias}:/home/${alias} pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-devel bash
##Returns:
C:\Program Files\Docker\Docker\Resources\bin\docker.exe: Error response from daemon: Mount denied:
The source path "C:/Program Files/Git/home/zzj04;C"
doesn't exist and is not known to Docker.
See 'C:\Program Files\Docker\Docker\Resources\bin\docker.exe run --help'.
I would appreciate it if another user could share the sample commands that reproduce the interesting results included with the arXiv paper. No success so far with the original MS repo, and only limited, repetitious output using the current Hugging Face unilm fork and the same prompt.
Cheers
Seed line from 1984 used in the original paper:
Winston sat back. A sense of complete helplessness had descended upon him.
Published output:
He was now helpless and unmoving. What would that
mean? To become a monk under her mother’s rule. What would it
mean for him to be put at the center of the most dangerous of all of
demons, the demon known as Azrael? He would never know exactly
what that was, or even how it would end, unless he could sense the
strength of Azrael’s mind and soul. It wasn’t only the existence of
his mother, it was all he knew, and every moment. He was trapped
within her embrace, a trapped, torn, battered, crippling presence he
only knew it would mean was eternity. And in the instant that he
was lost his life changed forever, when Azraelis’ spirit touched his
mind, he was engulfed in this dark and mystical, so enchanting and
beautiful force.
Hello, I'm trying to use your model on the QG task; however, fine-tuning takes a lot of time.
It takes 8h per epoch on the SQuAD dataset (no fp16, since my GPU is not supported).
The GPU is a 1080 Ti (11 GB), and I need to set the batch size to 1.
I wonder why this model is so slow to fine-tune (compared to other pre-trained models like GPT and BERT)?
Another question is about the decoding part: each step predicts a word and feeds the updated sequence back into the model to predict the next word. Is this right? (Sorry, the code is too complex for me to follow.)
The link to the CNN/DM data points to an already preprocessed dataset.
How can we reproduce a similar dataset from the official .story files?
What is the hardware setup used in the training here?
We have V100, 2080 RTX, 1080 GTX ti, 1060 GTX, and 960m people here. Hopefully it works on diverse setups.
Thanks for open-sourcing the repo; the code is great and really easy to reproduce thanks to Docker, your detailed explanations in the README, and your fine-tuned checkpoints!
I have a question about predictions post-processing.
I could reproduce the paper's results on the CNN/DM dataset with the command provided in the README. My results are:
R-1 | R-2 | R-L |
---|---|---|
43.06 | 20.42 | 40.32 |
But if I run the same command without the truncation (--trunc_len 0 instead of --trunc_len 70), results are much lower:
R-1 | R-2 | R-L |
---|---|---|
42.05 | 19.90 | 39.44 |
Is this normal? In other codebases I've never seen predictions being truncated, so I'm wondering why it is necessary with UniLM.
I'm also curious to hear your opinion on why the score is lower without truncation.
If I want to train UniLM from scratch for another abstractive summarization task (not in English), how do I do it?
I guess the fine-tuning and inference code from the README can be reused, but I'm not sure how to do the pre-training. Can you share the pre-training code used for the CNN summarization model? Thanks!
In this line of the CNN/DM evaluation code (line 239 at commit d22a233), '1' is replaced by '#'.
I don't understand it. Can someone explain the reason for this post-processing?
Hi, where does the Bleu module come from? I encounter this when running eval:
File "src/qg/eval_on_unilm_tokenized_ref.py", line 4, in <module>
from bleu.bleu import Bleu
ImportError: No module named bleu.bleu
Hello,
Thank you very much for sharing this codebase together with good documentation, much appreciated! :-)
Is there any timeline or some information for the upcoming V2 release?
Regards,
Fabian