bert-nmt / bert-nmt
License: Other
decode_one_() is missing 1 argument
I got the error "Exception: process 1 terminated with signal SIGKILL" when training a model (25M sentences). It seems to be caused by running out of memory; the reason is as follows:
Hi, thank you for your work.
I encountered a problem when resuming my training: the training restarts from the first epoch. The log is below:
| model transformer_s2_vaswani_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 354184960 (num. trained: 245874688)
| training on 1 GPUs
| max tokens per GPU = 3050 and max sentences per GPU = None
Model will load checkpoint from ../model/bert-base-cased/2222/checkpoint_last.pt
| loaded checkpoint ../model/bert-base-cased/2222/checkpoint_last.pt (epoch 8 @ 0 updates)
| loading train data for epoch 0
| ../process/bin train src-trg 1159547 examples
| saved checkpoint ../model/bert-base-cased/2222/checkpoint0.pt (epoch 0 @ 0 updates) (writing took 530.7144210338593 seconds)
Even though I have trained it for 8 epochs it is restarting from the first epoch. Do you have any idea how this happens? Thank you very much.
Warm regards,
Reza Qorib
I have encountered the following error while trying to train bert fused nmt using BERT-Base, Multilingual Cased. Kindly help!
File "/mnt/beegfs/home/abdulrauf/alector/nrpu-July/installations/bert-nmt/train.py", line 315, in <module>
cli_main()
File "/mnt/beegfs/home/abdulrauf/alector/nrpu-July/installations/bert-nmt/train.py", line 311, in cli_main
main(args)
File "/mnt/beegfs/home/abdulrauf/alector/nrpu-July/installations/bert-nmt/train.py", line 89, in main
train(args, trainer, task, epoch_itr)
File "/mnt/beegfs/home/abdulrauf/alector/nrpu-July/installations/bert-nmt/train.py", line 130, in train
log_output = trainer.train_step(samples)
File "/mnt/beegfs/projects/alector/nrpu-July/installations/bert-nmt/fairseq/trainer.py", line 289, in train_step
raise e
File "/mnt/beegfs/projects/alector/nrpu-July/installations/bert-nmt/fairseq/trainer.py", line 266, in train_step
ignore_grad
File "/mnt/beegfs/projects/alector/nrpu-July/installations/bert-nmt/fairseq/tasks/fairseq_task.py", line 232, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/mnt/beegfs/home/abdulrauf/miniconda/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/beegfs/projects/alector/nrpu-July/installations/bert-nmt/fairseq/criterions/label_smoothed_cross_entropy.py", line 38, in forward
net_output = model(**sample['net_input'])
File "/mnt/beegfs/home/abdulrauf/miniconda/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/beegfs/projects/alector/nrpu-July/installations/bert-nmt/fairseq/models/fairseq_model.py", line 241, in forward
bert_encoder_out, _ = self.bert_encoder(bert_input, output_all_encoded_layers=True, attention_mask= 1. - bert_encoder_padding_mask)
File "/mnt/beegfs/home/abdulrauf/miniconda/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/beegfs/projects/alector/nrpu-July/installations/bert-nmt/bert/modeling.py", line 736, in forward
embedding_output = self.embeddings(input_ids, token_type_ids)
File "/mnt/beegfs/home/abdulrauf/miniconda/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/beegfs/projects/alector/nrpu-July/installations/bert-nmt/bert/modeling.py", line 272, in forward
position_embeddings = self.position_embeddings(position_ids)
File "/mnt/beegfs/home/abdulrauf/miniconda/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/beegfs/home/abdulrauf/miniconda/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 118, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/mnt/beegfs/home/abdulrauf/miniconda/envs/bertNMT/lib/python3.6/site-packages/torch/nn/functional.py", line 1454, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
I used your framework to train a transformer model with fairseq-train and then printed the model architecture, but I see that it contains a BERT module.
Does this have any influence?
Hello,
Sorry to ask such a basic question.
I ran preprocess.py and I got bin and idx files.
I don't know how to use .bin and .idx files after this.
Please tell me how to use bin and idx files.
total 1687755
-rw------- 1 root root 534956 Apr 16 05:52 dict.de.txt
-rw------- 1 root root 534956 Apr 16 05:52 dict.en.txt
-rw------- 1 root root 324412 Apr 16 07:25 test.bert.en-de.en.bin
-rw------- 1 root root 72136 Apr 16 07:25 test.bert.en-de.en.idx
-rw------- 1 root root 338516 Apr 16 06:36 test.en-de.de.bin
-rw------- 1 root root 72136 Apr 16 06:36 test.en-de.de.idx
-rw------- 1 root root 324740 Apr 16 06:13 test.en-de.en.bin
-rw------- 1 root root 72136 Apr 16 06:13 test.en-de.en.idx
-rw------- 1 root root 479599392 Apr 16 07:24 train.bert.en-de.en.bin
-rw------- 1 root root 95068360 Apr 16 07:24 train.bert.en-de.en.idx
-rw------- 1 root root 477476928 Apr 16 06:36 train.en-de.de.bin
-rw------- 1 root root 95068360 Apr 16 06:36 train.en-de.de.idx
-rw------- 1 root root 466401152 Apr 16 06:13 train.en-de.en.bin
-rw------- 1 root root 95068360 Apr 16 06:13 train.en-de.en.idx
-rw------- 1 root root 4855892 Apr 16 07:25 valid.bert.en-de.en.bin
-rw------- 1 root root 961456 Apr 16 07:25 valid.bert.en-de.en.idx
-rw------- 1 root root 4838976 Apr 16 06:36 valid.en-de.de.bin
-rw------- 1 root root 961456 Apr 16 06:36 valid.en-de.de.idx
-rw------- 1 root root 4721140 Apr 16 06:13 valid.en-de.en.bin
-rw------- 1 root root 961456 Apr 16 06:13 valid.en-de.en.idx
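As far as the commands elsewhere in these issues suggest, the .bin and .idx files are not opened by hand: the directory that contains them is passed as the first argument to train.py (for example `python train.py $DATAPATH ...`). The sketch below only checks that such a directory looks complete before training. The naming pattern is inferred from the listing above, and the helper names `expected_files` and `missing_files` are hypothetical, not part of bert-nmt.

```python
from pathlib import Path
from typing import List, Sequence


def expected_files(split: str, src: str, tgt: str, with_bert: bool = True) -> List[str]:
    """File names one split should have, following the pattern in the listing above."""
    pair = f"{src}-{tgt}"
    prefixes = [f"{split}.{pair}.{src}", f"{split}.{pair}.{tgt}"]
    if with_bert:
        # Assumption: the extra BERT-tokenized source uses a ".bert." infix.
        prefixes.append(f"{split}.bert.{pair}.{src}")
    return [f"{prefix}.{ext}" for prefix in prefixes for ext in ("bin", "idx")]


def missing_files(data_dir: str, src: str, tgt: str,
                  splits: Sequence[str] = ("train", "valid", "test")) -> List[str]:
    """Return the expected file names that are absent from data_dir."""
    root = Path(data_dir)
    wanted = [f"dict.{src}.txt", f"dict.{tgt}.txt"]
    for split in splits:
        wanted.extend(expected_files(split, src, tgt))
    return [name for name in wanted if not (root / name).exists()]


if __name__ == "__main__":
    print(expected_files("train", "en", "de"))
```

If `missing_files("data-bin/your_dataset", "en", "de")` returns an empty list, the directory at least has the same shape as the one listed above and can be tried as the data argument.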
Hey there -- wanted to start by saying thanks for the excellent work!
I'm trying to run some of the preliminary experiments, but I'm not sure how I'd go about doing the following:
(1): Initializing the Encoder with BERT: The transformer_iwslt_de_en architecture has only 6 encoder layers, but BERT-base has 12 layers. I'm not sure which layers were selected to initialize the NMT encoder. I think I see code for doing this with XLM (https://github.com/bert-nmt/bert-nmt/blob/master/fairseq/models/transformer_from_pretrained_xlm.py), but not with BERT.
(2): I'm also not sure if there's code for running this case: "Leveraging the output of BERT as embedding;" if you could point me in the right direction, I'd really appreciate it.
Thanks a ton!
Omar
edit: My guess is this has something to do with bert_gates and bert_ratio/encoder_ratio, but I'm not entirely sure.
Hi Team,
I really appreciate your research work. I have started trying out your code base, using the update-20-10 branch. When my training script reaches the preprocessing stage, I get the following error trace. Can you please help me resolve this? Have I missed something?
File "bert-nmt/preprocess.py", line 274, in <module>
cli_main()
File "bert-nmt/preprocess.py", line 270, in cli_main
main(args)
File "bert-nmt/preprocess.py", line 191, in main
make_all(args.source_lang, berttokenizer)
File "bert-nmt/preprocess.py", line 176, in make_all
make_dataset(vocab, args.trainpref, "train", lang, num_workers=args.workers)
File "bert-nmt/preprocess.py", line 172, in make_dataset
make_binary_dataset(vocab, input_prefix, output_prefix, lang, num_workers)
File "bert-nmt/preprocess.py", line 138, in make_binary_dataset
offset=0, end=offsets[1]
File "bert-nmt/fairseq/binarizer.py", line 60, in binarize
ids = dict.encode_line(
AttributeError: 'BertTokenizerFast' object has no attribute 'encode_line'
I installed the dependencies using this Dockerfile.
Thanks,
Are the parameters of the pretrained BERT model trainable during BERT-fused NMT training?
Neat work. Congratulations.
In the paper, you mentioned that you are using multi-bleu.perl to evaluate IWSLT’14 En↔De, but in the script provided ("iwslt-interactive.sh"), you are evaluating with sacrebleu.
I tried it myself, but somehow generate.py outperforms interactive.py by about 1 BLEU point. I guess the gap comes from a tokenization problem?
Would you mind updating the evaluation script for iwslt14?
When we want to fine-tune BERT during translation, we turn on --finetune_bert.
However, I noticed that in train.py L54-56 you disable updates to BERT's pooler. Can you explain why you turn off updates to the pooling layer?
Hello! Strictly speaking this is not an issue but a question. I have read through your paper, but I do not understand the purpose and effect of using a pretrained NMT model. In the BERT-fused model, how much of a difference in model performance is there between using a pretrained NMT model and a randomly initialized one, specifically in a low-resource scenario with unlabeled data?
The log is as follows:
0%| | 33792/407873900 [00:29<94:07:20, 1203.63B
Model name 'bert-base-uncased' was not found in model name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased). We assumed 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz' was a path or url but couldn't find any file associated to this path or url.
Traceback (most recent call last):
Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_s2_iwslt_de_en', attention_dropout=0.0, bert_first=True, bert_gates=[1, 1, 1, 1, 1, 1], bert_model_name='bert-base-uncased', bert_output_layer=-1, bert_ratio=1.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data='data-bin/iwslt14.tokenized.de-en', dataset_impl='cached', ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_no_bert=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.3, encoder_attention_heads=4, encoder_bert_dropout=True, encoder_bert_dropout_ratio=0.5, encoder_bert_mixup=False, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=1024, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_ratio=1.0, find_unused_parameters=False, finetune_bert=False, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_cls_sep=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=50000, max_update=150000, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False,
no_token_positional_embeddings=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=True, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/iwed_en_de_0.5', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='en', target_lang='de', task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_from_nmt=True, warmup_init_lr=1e-07, warmup_nmt_file='checkpoint_nmt.pt', warmup_updates=4000, weight_decay=0.0001)
| [en] dictionary: 10152 types
| [de] dictionary: 10152 types
| data-bin/iwslt14.tokenized.de-en valid en-de 7283 examples
File "train.py", line 315, in <module>
cli_main()
File "train.py", line 311, in cli_main
main(args)
File "train.py", line 49, in main
model = task.build_model(args)
File "/home/alex/bert-nmt-master/fairseq/tasks/fairseq_task.py", line 169, in build_model
return models.build_model(args, self)
File "/home/alex/bert-nmt-master/fairseq/models/__init__.py", line 50, in build_model
return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
File "/home/alex/bert-nmt-master/fairseq/models/transformer.py", line 301, in build_model
args.bert_out_dim = bertencoder.hidden_size
AttributeError: 'NoneType' object has no attribute 'hidden_size'
0%| | 33792/407873900 [05:15<1058:16:27, 107.05B/s]
Please help me, thanks.
I want to train BERT-fused NMT using my own BERT model. Could you please guide me through the proper process for doing that?
are they in the same way?
I am following the instructions in the readme regarding using interactive.py:
sed -r 's/(@@ )|(@@ ?$)//g' $bpefile > $bpefile.debpe
$MOSE/scripts/tokenizer/detokenizer.perl -l $src < $bpefile.debpe > $bpefile.debpe.detok
paste -d "\n" $bpefile $bpefile.debpe.detok > $bpefile.in
cat $bpefile.in | python interactive.py -s $src -t $tgt \
--buffer-size 1024 --batch-size 128 --beam 5 --remove-bpe > output.log
Is $bpefile a file containing the source sentences after BPE has been applied to them (e.g. test.en)?
Also, the following execution gives an error as the 'data' argument is missing:
cat $bpefile.in | python interactive.py -s $src -t $tgt --buffer-size 1024 --batch-size 128 --beam 5 --remove-bpe > output.log
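To make the question about $bpefile concrete, here is a minimal Python sketch of what the `sed` and `paste` steps quoted above do to a BPE-segmented file. It uses the same regular expression as the `sed` command. The Moses detokenizer is not reproduced, so the detokenized lines are taken as given, and the function names are hypothetical.

```python
import re
from typing import Iterable, List

# Same pattern as the sed command: drop "@@ " inside a line and a trailing "@@".
_BPE_MARKER = re.compile(r"(@@ )|(@@ ?$)")


def remove_bpe(line: str) -> str:
    """Join BPE sub-word units back into words (the .debpe step)."""
    return _BPE_MARKER.sub("", line)


def interleave(bpe_lines: Iterable[str], detok_lines: Iterable[str]) -> List[str]:
    """Mimic `paste -d "\\n" a b`: line 1 of a, line 1 of b, line 2 of a, ..."""
    merged: List[str] = []
    for bpe, detok in zip(bpe_lines, detok_lines):
        merged.append(bpe)
        merged.append(detok)
    return merged


if __name__ == "__main__":
    bpe = ["he@@ llo wor@@ ld", "good mor@@ ning"]
    # Stand-in for the detokenizer output: here just the de-BPE'd text.
    detok = [remove_bpe(line) for line in bpe]
    for line in interleave(bpe, detok):
        print(line)
```

Read this way, $bpefile.in alternates one BPE line with one plain-text line per sentence, which is consistent with $bpefile being the BPE-processed source file.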
I am a student, and I was very glad to read the paper "INCORPORATION OF BERT INTO NMT". I would appreciate it if you could explain the code in more detail or guide me through running the code you have released.
Hi @bert-nmt
What causes this error, and if the code needs to be modified, what parameters should I add?
File "train.py", line 167
progress.print(stats, tag='train', step=stats['num_updates'])
thanks
I am receiving the error RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows.
for translation of long sentences (full stack trace attached below).
It looks like a sample with 329 src_tokens requires a bert_input tensor of length 568 that is too big.
What can I do to increase this source sentence length constraint?
{'net_input':
{
'src_tokens': tensor([[ 38, 6, 5, 34, 5, ...]]),
'src_lengths': tensor([329]),
'bert_input': tensor([[ 101, 10105, 167, 115, 12100, ...]])
}
}
'src_tokens' tensor length is 329
'bert_input' tensor length is 568
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2463, in __call__
return self.wsgi_app(environ, start_response)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2449, in wsgi_app
response = self.handle_exception(e)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1866, in handle_exception
reraise(exc_type, exc_value, tb)
File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 2446, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1951, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1820, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/usr/local/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
raise value
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.7/site-packages/flask/app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/Users/erikchan/Workspace/nda-ai/postedit_models/tf-serving/flask_app/app/main.py", line 52, in tagatag_v2
outputs = translate_v2(inputs, src_id, targ_id, TAGATAG_MODELS, args, utils, src_dict, tgt_dict)
File "/Users/erikchan/Workspace/nda-ai/postedit_models/tf-serving/flask_app/app/query_tagatag_v2.py", line 58, in translate_v2
translations = task.inference_step(generator, TAGATAG_MODELS, sample)
File "/Users/erikchan/Workspace/nda-ai/postedit_models/tf-serving/flask_app/app/fairseq/tasks/fairseq_task.py", line 246, in inference_step
return generator.generate(models, sample, prefix_tokens=prefix_tokens)
File "/usr/local/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
return func(*args, **kwargs)
File "/Users/erikchan/Workspace/nda-ai/postedit_models/tf-serving/flask_app/app/fairseq/sequence_generator.py", line 152, in generate
bert_outs, _ = model.models[0].bert_encoder(bertinput, output_all_encoded_layers=True, attention_mask=~bert_encoder_padding_mask)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/Users/erikchan/Workspace/nda-ai/postedit_models/tf-serving/flask_app/app/bert/modeling.py", line 736, in forward
embedding_output = self.embeddings(input_ids, token_type_ids)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/Users/erikchan/Workspace/nda-ai/postedit_models/tf-serving/flask_app/app/bert/modeling.py", line 272, in forward
position_embeddings = self.position_embeddings(position_ids)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 114, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/usr/local/lib/python3.7/site-packages/torch/nn/functional.py", line 1484, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at ../aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
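The traceback ends in `self.position_embeddings(position_ids)`, and the standard BERT checkpoints ship a position-embedding table with 512 rows, so a 568-token `bert_input` cannot be embedded as is. One possible workaround, not necessarily what the maintainers recommend, is to clip the BERT input before it reaches the encoder. The sketch below assumes that id 102 is `[SEP]` (check your tokenizer's vocabulary), and the function name is hypothetical.

```python
from typing import List

BERT_MAX_POSITIONS = 512  # rows in the position-embedding table of standard BERT checkpoints
SEP_ID = 102              # assumed id of [SEP]; verify against your vocabulary


def truncate_bert_input(ids: List[int],
                        max_positions: int = BERT_MAX_POSITIONS,
                        sep_id: int = SEP_ID) -> List[int]:
    """Clip a BERT id sequence to max_positions, keeping a trailing [SEP]."""
    if len(ids) <= max_positions:
        return list(ids)
    # Keep the first max_positions - 1 ids and close the sequence with [SEP].
    return list(ids[:max_positions - 1]) + [sep_id]


if __name__ == "__main__":
    long_input = [101] + [2000] * 566 + [102]  # 568 ids, like the failing sample
    clipped = truncate_bert_input(long_input)
    print(len(long_input), len(clipped))
```

Clipping discards the tail of the sentence from BERT's view, so splitting long inputs into shorter segments before translation may be preferable when quality on long sentences matters.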
Hello, for the IWSLT en-zh direction, is the development set you use from IWSLT 2016, and is the test set from IWSLT 2017?
If it is convenient, could you provide the preprocessing script used before BPE? If you are using the prepare-iwslt14.sh script from fairseq, there is an echo "creating train, valid, test ..." step in the preprocessing, and I don't understand its purpose. Is it needed for en-zh?
Thank you very much and look forward to your reply.
@teslacool
Hello,
Sorry to ask such a basic question.
I ran train.ipynb and got "Exception: Cannot load model parameters from checkpoint, please ensure that the architectures match.".
Where can I get a pretrained_nmt_model that matches transformer_s2_iwslt_de_en?
I got model4.pt from
https://github.com/pytorch/fairseq/tree/master/examples/translation
transformer.wmt19.en-de
src = 'en'
tgt = 'de'
bedropout = '0.5'
ARCH = 'transformer_s2_iwslt_de_en'
DATAPATH = 'examples/translation/data_preprocess'
SAVEDIR = 'checkpoints/iwed_' + src + '_' + tgt + '_' + bedropout
your_pretrained_nmt_model = 'wmt19.en-de.joined-dict.ensemble/model4.pt'
!time python train.py $DATAPATH \
-a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 \
--dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9,0.98)' --save-dir $SAVEDIR --share-all-embeddings $warmup \
--encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout | tee -a $SAVEDIR/training.log
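When a checkpoint is rejected with "please ensure that the architectures match", it can help to see which parameters differ before looking for another model. The sketch below compares two name-to-shape mappings and is only a diagnostic aid under stated assumptions: with PyTorch the mappings could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`, and fairseq checkpoints are assumed to keep their weights under a `'model'` key. The parameter names and shapes in the example are illustrative, not taken from a real checkpoint.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def diff_state_shapes(checkpoint: Dict[str, Shape],
                      model: Dict[str, Shape]) -> Dict[str, List]:
    """Report parameters that are missing, unexpected, or differently shaped."""
    missing = sorted(name for name in model if name not in checkpoint)
    unexpected = sorted(name for name in checkpoint if name not in model)
    mismatched = sorted(
        (name, checkpoint[name], model[name])
        for name in checkpoint
        if name in model and checkpoint[name] != model[name]
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


if __name__ == "__main__":
    # Illustrative shapes only: a wide checkpoint against a narrower model.
    checkpoint_shapes = {"encoder.embed_tokens.weight": (40000, 1024)}
    model_shapes = {
        "encoder.embed_tokens.weight": (10152, 512),
        "decoder.layers.0.bert_attn.out_proj.weight": (512, 512),
    }
    report = diff_state_shapes(checkpoint_shapes, model_shapes)
    for kind, entries in report.items():
        print(kind, entries)
```

A long "mismatched" list usually points at different embedding sizes or vocabularies, which would be the expected outcome when a large WMT checkpoint is loaded into an IWSLT-sized architecture.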
I'm trying to follow the 'Data Preprocessing' example and am receiving a UTF-8 decoding error as shown below:
sudo python3 preprocess.py --source-lang en --target-lang de \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir data-bin/wmt17_en_de --joined-dictionary --bert-model-name bert-base-multilingual-cased
Namespace(alignfile=None, bert_model_name='bert-base-multilingual-cased', cpu=False, criterion='cross_entropy', dataset_impl='cached', destdir='data-bin/wmt17_en_de', fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer='nag', padding_factor=8, seed=1, source_lang='en', srcdict=None, target_lang='de', task='translation', tbmf_wrapper=False, tensorboard_logdir='', testpref='examples/translation/wmt17_en_de/test', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, trainpref='examples/translation/wmt17_en_de/train', user_dir=None, validpref='examples/translation/wmt17_en_de/valid', workers=1)
Traceback (most recent call last):
File "preprocess.py", line 274, in <module>
cli_main()
File "preprocess.py", line 270, in cli_main
main(args)
File "preprocess.py", line 75, in main
{train_path(lang) for lang in [args.source_lang, args.target_lang]}, src=True
File "preprocess.py", line 56, in build_dictionary
padding_factor=args.padding_factor,
File "/bert-nmt/fairseq/tasks/fairseq_task.py", line 54, in build_dictionary
Dictionary.add_file_to_dictionary(filename, d, tokenizer.tokenize_line, workers)
File "/bert-nmt/fairseq/data/dictionary.py", line 284, in add_file_to_dictionary
merge_result(Dictionary._add_file_to_dictionary_single_worker(filename, tokenize, dict.eos_word))
File "/bert-nmt/fairseq/data/dictionary.py", line 262, in _add_file_to_dictionary_single_worker
line = f.readline()
File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 7346: invalid continuation byte
My guess is because I am using the bert-base-multilingual-cased model?
Since I only care about EN and DE in this case:
I am changing line 247 in https://github.com/bert-nmt/bert-nmt/blob/master/fairseq/data/dictionary.py FROM:
with open(filename, 'r', encoding='utf-8') as f:
TO:
with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
Does this sound correct?
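Passing `errors='ignore'` makes the error go away by silently dropping the offending bytes. The traceback points at reading the training text in `build_dictionary`, which suggests that the corpus file itself, rather than the choice of BERT model, contains bytes that are not valid UTF-8. A small sketch for locating those lines first (the function name is hypothetical):

```python
from typing import List, Tuple


def find_invalid_utf8_lines(path: str) -> List[Tuple[int, int, str]]:
    """Return (line number, byte offset within the line, reason) for each bad line."""
    bad: List[Tuple[int, int, str]] = []
    with open(path, "rb") as f:  # read raw bytes so decoding is under our control
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                bad.append((lineno, err.start, err.reason))
    return bad


if __name__ == "__main__":
    import sys

    for lineno, offset, reason in find_invalid_utf8_lines(sys.argv[1]):
        print(f"line {lineno}, byte {offset}: {reason}")
```

If only a handful of lines are reported, fixing or removing them in both the source and target files keeps the parallel corpus aligned, which `errors='ignore'` on one side alone does not guarantee.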
I saw transformer_from_pretrained_xlm.py, but it seems that it only initializes the NMT model. I want to train a fused-style XLM-NMT model, as described in the paper, from a pretrained XLM and a pretrained NMT model.
Does the released code support this yet?
I tried to use en-zh data in bertNMT, but I ran into an architecture-mismatch problem.
My env :
cuda 9.0
pytorch 1.0.0
python 3.6
My Preprocess :
<Step 1> tokenize, clean, and generate BPE format
English tokenizer : NLTK
Chinese tokenizer : Jieba
following this guideline : https://github.com/twairball/fairseq-zh-en
<Step 2> generate bert input
makedataforbert.sh
<Step 3> generate binary file
TEXT=examples/translation/fairseq-zh-en/data/wmt17_en_zh
DATADIR=data-bin/wmt17_en_zh
NUM_OPS=32000
fairseq-preprocess \
--source-lang en \
--target-lang zh \
--trainpref $TEXT/train.${NUM_OPS}.bpe \
--validpref $TEXT/valid.${NUM_OPS}.bpe \
--testpref $TEXT/test.${NUM_OPS}.bpe \
--thresholdsrc 3 \
--thresholdtgt 3 \
--destdir $DATADIR
My pretrained model
I pretrained a model with fairseq.
CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_en_zh \
--arch transformer_vaswani_wmt_en_de_big \
--share-decoder-input-output-embed \
--optimizer adam \
--adam-betas '(0.9, 0.98)' \
--clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt \
--warmup-updates 4000 \
--dropout 0.3 \
--weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy \
--label-smoothing 0.1 \
--max-tokens 1024 \
--eval-bleu \
--eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
--eval-bleu-detok moses \
--eval-bleu-remove-bpe \
--eval-bleu-print-samples \
--best-checkpoint-metric bleu \
--maximize-best-checkpoint-metric \
--save-dir checkpoints/fconv_wmt17_en_zh
After that, I chose a checkpoint .pt file to use as my pretrained model.
Training
src=en
tgt=zh
bedropout=0.5
ARCH=transformer_vaswani_wmt_en_de_big
DATAPATH=destdir/
SAVEDIR=checkpoints/wmt17_${src}_${tgt}_${bedropout}
mkdir -p $SAVEDIR
if [ ! -f $SAVEDIR/checkpoint_nmt.pt ]; then cp /home/blue90211/Storage01/fairseq/checkpoints/fconv_wmt17_en_zh/test_best.pt $SAVEDIR/checkpoint_nmt.pt; fi
if [ ! -f "$SAVEDIR/checkpoint_last.pt" ]; then warmup="--warmup-from-nmt --reset-lr-scheduler"; else warmup=""; fi
CUDA_VISIBLE_DEVICES=1 python train.py $DATAPATH \
-a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 \
--dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9,0.98)' --save-dir $SAVEDIR --share-all-embeddings $warmup \
--encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout | tee -a $SAVEDIR/training.log
My log file :
Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_vaswani_wmt_en_de_big', attention_dropout=0.0, bert_first=True, bert_gates=[1, 1, 1, 1, 1, 1], bert_model_name='bert-base-uncased', bert_output_layer=-1, bert_ratio=1.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data='destdir/', dataset_impl='cached', ddp_backend='c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_no_bert=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.3, encoder_attention_heads=16, encoder_bert_dropout=True, encoder_bert_dropout_ratio=0.5, encoder_bert_mixup=False, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_ratio=1.0, find_unused_parameters=False, finetune_bert=False, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_cls_sep=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4000, max_update=150000, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, num_workers=0, 
optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=True, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/wmt17_en_zh_0.5', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='en', target_lang='zh', task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_from_nmt=True, warmup_init_lr=1e-07, warmup_nmt_file='checkpoint_nmt.pt', warmup_updates=4000, weight_decay=0.0001)
| [en] dictionary: 65912 types
| [zh] dictionary: 65912 types
| destdir/ valid en-zh 2001 examples
bert_gates [True, True, True, True, True, True]
TransformerModel(
(encoder): TransformerEncoder(
(embed_tokens): Embedding(65912, 1024, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(1): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(2): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(3): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(4): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(5): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): TransformerDecoder(
(embed_tokens): Embedding(65912, 1024, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(bert_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(1): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(bert_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(2): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(bert_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(3): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(bert_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(4): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(bert_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(5): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(bert_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
)
)
(bert_encoder): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(1): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(2): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(3): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(4): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(5): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(6): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(7): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(8): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(9): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(10): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(11): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
)
)
(pooler): BertPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
)
| model transformer_vaswani_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 375378176 (num. trained: 265895936)
| training on 1 GPUs
| max tokens per GPU = 4000 and max sentences per GPU = None
Model will load checkpoint from checkpoints/wmt17_en_zh_0.5/checkpoint_nmt.pt
Traceback (most recent call last):
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/trainer.py", line 150, in load_checkpoint
self.get_model().load_state_dict(state['model'], strict=False if warmup_from_nmt else True)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/models/fairseq_model.py", line 72, in load_state_dict
return super().load_state_dict(state_dict, strict)
File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for TransformerModel:
size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([29248, 1024]) from checkpoint, the shape in current model is torch.Size([65912, 1024]).
size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([33864, 1024]) from checkpoint, the shape in current model is torch.Size([65912, 1024]).
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 315, in <module>
cli_main()
File "train.py", line 311, in cli_main
main(args)
File "train.py", line 75, in main
extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/checkpoint_utils.py", line 115, in load_checkpoint
warmup_from_nmt=args.warmup_from_nmt,
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/trainer.py", line 153, in load_checkpoint
'Cannot load model parameters from checkpoint, '
Exception: Cannot load model parameters from checkpoint, please ensure that the architectures match.
Should I choose another architecture?
Why do the same parameters and architecture work in fairseq but not in bert-nmt?
Could you give me some suggestions?
Thank you.
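The size mismatch can be read directly off the traceback: the checkpoint stores embedding tables of 29248 and 33864 rows, while the current model expects 65912 rows for both. One plausible reading (an assumption, not confirmed in this thread) is that the checkpoint and the current run were built from different dictionaries, for example separate source and target vocabularies versus one joined vocabulary. A minimal sketch for listing such mismatches before training starts; `find_shape_mismatches` is a hypothetical helper, and the shapes are copied from the error message:

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]

def find_shape_mismatches(checkpoint: Dict[str, Shape],
                          model: Dict[str, Shape]) -> List[str]:
    """Return one readable line per shared parameter whose shape differs."""
    report = []
    for name in sorted(checkpoint.keys() & model.keys()):
        if checkpoint[name] != model[name]:
            report.append(
                f"{name}: checkpoint {checkpoint[name]} vs model {model[name]}"
            )
    return report

# Shapes copied from the RuntimeError in the traceback.
checkpoint_shapes = {
    "encoder.embed_tokens.weight": (29248, 1024),
    "decoder.embed_tokens.weight": (33864, 1024),
}
model_shapes = {
    "encoder.embed_tokens.weight": (65912, 1024),
    "decoder.embed_tokens.weight": (65912, 1024),
}

mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)
for line in mismatches:
    print(line)
```

With a real checkpoint, the two dictionaries could be built as `{k: tuple(v.shape) for k, v in torch.load(path, map_location='cpu')['model'].items()}` and `{k: tuple(v.shape) for k, v in model.state_dict().items()}`; the `'model'` key is the one the traceback shows `load_checkpoint` reading.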
Hello, is there a way to change the dataset for the NMT task? I am trying to use bert-nmt on my own dataset (English to English, description generation), and I would like to know whether there is a better way to plug it in than running a preparation script written for someone else's dataset, such as:
!bash fairseq/examples/translation/prepare-iwslt14.sh
I am currently editing the files in the folder that this script creates. Is that the correct approach, or is there another way to swap in my own dataset (a CSV file) and still use bert-nmt for the task?
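For a CSV dataset, one option is to skip the IWSLT script and write the parallel text files directly, then continue with the usual BPE and preprocess.py steps. The sketch below assumes a CSV with a header row and two text columns; the column names `source` and `target`, the file extensions, and the helper `csv_to_parallel` are all hypothetical.

```python
import csv
from pathlib import Path

def csv_to_parallel(csv_path, out_prefix, src_col="source", tgt_col="target",
                    src_lang="src", tgt_lang="tgt"):
    """Write <out_prefix>.<src_lang> and <out_prefix>.<tgt_lang>, one example per line.

    Returns the number of sentence pairs written.
    """
    out_prefix = Path(out_prefix)
    out_prefix.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(csv_path, newline="", encoding="utf-8") as f_in, \
         open(f"{out_prefix}.{src_lang}", "w", encoding="utf-8") as f_src, \
         open(f"{out_prefix}.{tgt_lang}", "w", encoding="utf-8") as f_tgt:
        for row in csv.DictReader(f_in):
            # Collapse newlines and repeated spaces so each example stays on one line.
            src = " ".join((row[src_col] or "").split())
            tgt = " ".join((row[tgt_col] or "").split())
            if not src or not tgt:
                continue  # skip rows where either side is empty
            f_src.write(src + "\n")
            f_tgt.write(tgt + "\n")
            written += 1
    return written
```

Calling `csv_to_parallel("data.csv", "mydata/train")` would produce `mydata/train.src` and `mydata/train.tgt`; repeating it for validation and test splits gives the prefixes that `--trainpref`, `--validpref`, and `--testpref` expect, with `--source-lang src --target-lang tgt`.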
Ran into the following error while training a BERT-fused NMT model:
/usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
len(cache))
Traceback (most recent call last):
File "train.py", line 315, in <module>
cli_main()
File "train.py", line 307, in cli_main
nprocs=args.distributed_world_size,
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/eee/bert-nmt/train.py", line 274, in distributed_main
main(args, init_distributed=True)
File "/home/eee/bert-nmt/train.py", line 89, in main
train(args, trainer, task, epoch_itr)
File "/home/eee/bert-nmt/train.py", line 130, in train
log_output = trainer.train_step(samples)
File "/home/eee/bert-nmt/fairseq/trainer.py", line 289, in train_step
raise e
File "/home/eee/bert-nmt/fairseq/trainer.py", line 266, in train_step
ignore_grad
File "/home/eee/bert-nmt/fairseq/tasks/fairseq_task.py", line 232, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/eee/bert-nmt/fairseq/criterions/label_smoothed_cross_entropy.py", line 38, in forward
net_output = model(**sample['net_input'])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 442, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/eee/bert-nmt/fairseq/models/transformer.py", line 339, in forward
bert_encoder_out, _ = self.bert_encoder(bert_input, output_all_encoded_layers=True, attention_mask= 1. - bert_encoder_padding_mask)
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 325, in __rsub__
return _C._VariableFunctions.rsub(self, other)
RuntimeError: Subtraction, the `-` operator, with a bool tensor is not supported. If you are trying to invert a mask, use the `~` or `bitwise_not()` operator instead.
Should transformer.py line 339 be changed from:
bert_encoder_out, _ = self.bert_encoder(bert_input, output_all_encoded_layers=True, attention_mask= 1. - bert_encoder_padding_mask)
to:
bert_encoder_out, _ = self.bert_encoder(bert_input, output_all_encoded_layers=True, attention_mask= ~bert_encoder_padding_mask)
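A note on the proposed change: `~` works when the padding mask is a bool tensor, but on a uint8 mask (what comparison ops return in older PyTorch releases) `~` is a bitwise NOT and turns 0 into 255 and 1 into 254. Casting to float before subtracting sidesteps both problems. The following is a minimal sketch under that assumption, not the maintainers' fix; `padding_to_attention_mask` and the toy batch are hypothetical.

```python
import torch

def padding_to_attention_mask(padding_mask: torch.Tensor) -> torch.Tensor:
    """Convert a padding mask (truthy = padding) into a float attention mask (1.0 = real token)."""
    # .float() makes the subtraction valid for both bool and uint8 masks.
    return 1.0 - padding_mask.float()

pad_idx = 0  # hypothetical padding index for this toy batch
tokens = torch.tensor([[5, 6, 7, 0, 0],
                       [8, 9, 0, 0, 0]])
padding_mask = tokens.eq(pad_idx)  # bool in recent PyTorch, uint8 in older releases
attention_mask = padding_to_attention_mask(padding_mask)
print(attention_mask)
```

Applied to the line in question, this would read `attention_mask=1. - bert_encoder_padding_mask.float()`, which keeps the original semantics on either PyTorch version.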
Hi, I am curious about your GPU configuration and the training duration. Could you please share some information about that?
Hello, how do you calculate the BLEU score for Chinese output? Do you segment each sentence into words first?
Looking forward to your reply, thank you very much.
Can you provide some examples?
What is the bpefile?
sed -r 's/(@@ )|(@@ ?$)//g' $bpefile > $bpefile.debpe
$MOSE/scripts/tokenizer/detokenizer.perl -l $src < $bpefile.debpe > $bpefile.debpe.detok
paste -d "\n" $bpefile $bpefile.debpe.detok > $bpefile.in
cat $bpefile.in | python interactive.py -s $src -t $tgt \
--buffer-size 1024 --batch-size 128 --beam 5 --remove-bpe > output.log
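Judging from the `sed` expression, `$bpefile` appears to be a BPE-segmented source file (subword pieces joined by `@@ `); that reading is an assumption rather than something stated in this thread. The sketch below mirrors the `sed` and `paste` steps in Python on two made-up sentences. The Moses detokenizer step is not reproduced, so the de-BPE'd line stands in for the detokenized one, and both helper names are hypothetical.

```python
import re

# Same pattern as the sed command: drop "@@ " inside a line and a trailing "@@".
BPE_MARKER = re.compile(r"(@@ )|(@@ ?$)")

def remove_bpe(line: str) -> str:
    """Join BPE pieces back into whole words."""
    return BPE_MARKER.sub("", line.rstrip("\n"))

def interleave(first, second):
    """Mimic `paste -d "\\n" a b` for equally long inputs: alternate lines from each."""
    out = []
    for a, b in zip(first, second):
        out.append(a)
        out.append(b)
    return out

bpe_lines = ["the qu@@ ick fox", "jump@@ s"]            # made-up BPE input
debpe_lines = [remove_bpe(line) for line in bpe_lines]  # stand-in for the .debpe.detok file
paired = interleave(bpe_lines, debpe_lines)
for line in paired:
    print(line)
```

The resulting file alternates the BPE form and the plain form of each sentence, presumably so that interactive.py can read both versions of every input.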
My environment:
python=3.5
cuda=9.0
pytorch=1.1.0
torchvision=0.3.0
## Step(1). Download WMT16 English-German
## Step(2). makedataforbert.sh
## Step(3).
TEXT=examples/translation/wmt16_en_de_test
python preprocess.py --source-lang en --target-lang de \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir destdir --joined-dictionary --bert-model-name bert-base-uncased
## Step(4). Download WMT16 English-German Model from fairseq
## Step(5).
#!/usr/bin/env bash
nvidia-smi
python3 -c "import torch; print(torch.__version__)"
src=en
tgt=de
bedropout=0.5
ARCH=transformer_vaswani_wmt_en_de_big
DATAPATH=destdir/
SAVEDIR=checkpoints/wmt16_${src}_${tgt}_${bedropout}
mkdir -p $SAVEDIR
if [ ! -f $SAVEDIR/checkpoint_nmt.pt ]; then cp wmt16.en-de.joined-dict.transformer/model.pt $SAVEDIR/checkpoint_nmt.pt; fi
if [ ! -f "$SAVEDIR/checkpoint_last.pt" ]; then warmup="--warmup-from-nmt --reset-lr-scheduler"; else warmup=""; fi
python train.py $DATAPATH \
-a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 \
--dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9,0.98)' --save-dir $SAVEDIR --share-all-embeddings $warmup \
--encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout | tee -a $SAVEDIR/training.log
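The `warmup` line in the script picks between a fresh run and a resumed one: the warm-up flags are passed only when `checkpoint_last.pt` does not exist yet in the save directory. A minimal Python sketch of the same decision, with `warmup_flags` as a hypothetical helper name:

```python
from pathlib import Path

def warmup_flags(save_dir):
    """Mirror the shell test: warm up from the NMT checkpoint only on a fresh run."""
    if (Path(save_dir) / "checkpoint_last.pt").is_file():
        return []  # resuming: no warm-up flags, train.py restores checkpoint_last.pt
    return ["--warmup-from-nmt", "--reset-lr-scheduler"]

print(warmup_flags("checkpoints/wmt16_en_de_0.5"))
```

In the log that follows, `warmup_from_nmt=True` and `reset_lr_scheduler=True` show that the fresh-run branch was taken.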
After that I got the log and error message ...
(bertNMT) blue90211@AI:~/Storage01/bert-nmt$ python train.py $DATAPATH -a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' --adam-betas '(0.9,0.98)' --save-dir $SAVEDIR --share-all-embeddings $warmup --encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout | tee -a $SAVEDIR/training.log
| distributed init (rank 1): tcp://localhost:14689
| distributed init (rank 0): tcp://localhost:14689
| initialized host AI as rank 1
| initialized host AI as rank 0
Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_vaswani_wmt_en_de_big', attention_dropout=0.0, bert_first=True, bert_gates=[1, 1, 1, 1, 1, 1], bert_model_name='bert-base-uncased', bert_output_layer=-1, bert_ratio=1.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data='destdir/', dataset_impl='cached', ddp_backend='c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_no_bert=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:14689', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=2, dropout=0.3, encoder_attention_heads=16, encoder_bert_dropout=True, encoder_bert_dropout_ratio=0.5, encoder_bert_mixup=False, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_ratio=1.0, find_unused_parameters=False, finetune_bert=False, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_cls_sep=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4000, max_update=150000, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, 
num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=True, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/wmt16_en_de_0.5', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='en', target_lang='de', task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_from_nmt=True, warmup_init_lr=1e-07, warmup_nmt_file='checkpoint_nmt.pt', warmup_updates=4000, weight_decay=0.0001)
| [en] dictionary: 32768 types
| [de] dictionary: 32768 types
| destdir/ valid en-de 3000 examples
bert_gates [True, True, True, True, True, True]
TransformerModel(
(encoder): TransformerEncoder(
(embed_tokens): Embedding(32768, 1024, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(1): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(2): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(3): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(4): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(5): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): TransformerDecoder(
(embed_tokens): Embedding(32768, 1024, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(bert_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(1): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(bert_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(2): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(bert_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(3): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(bert_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(4): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(bert_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(5): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(bert_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): LayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
)
)
(bert_encoder): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(30522, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(1): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(2): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(3): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(4): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(5): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(6): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(7): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(8): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(9): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(10): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(11): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): BertLayerNorm()
(dropout): Dropout(p=0.1)
)
)
)
)
(pooler): BertPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
| model transformer_vaswani_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 341438720 (num. trained: 231956480)
| training on 2 GPUs
| max tokens per GPU = 4000 and max sentences per GPU = None
Model will load checkpoint from checkpoints/wmt16_en_de_0.5/checkpoint_nmt.pt
| NOTICE: your device may support faster training with --fp16
| loaded checkpoint checkpoints/wmt16_en_de_0.5/checkpoint_nmt.pt (epoch 31 @ 0 updates)
| loading train data for epoch 31
| destdir/ train en-de 4500966 examples
| saved checkpoint checkpoints/wmt16_en_de_0.5/checkpoint31.pt (epoch 31 @ 0 updates) (writing took 1.965510606765747 seconds)
Traceback (most recent call last):
File "train.py", line 315, in <module>
cli_main()
File "train.py", line 307, in cli_main
nprocs=args.distributed_world_size,
File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
while not spawn_context.join():
File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 114, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/mnt/Storage01/blue90211/bert-nmt/train.py", line 274, in distributed_main
main(args, init_distributed=True)
File "/mnt/Storage01/blue90211/bert-nmt/train.py", line 89, in main
train(args, trainer, task, epoch_itr)
File "/mnt/Storage01/blue90211/bert-nmt/train.py", line 130, in train
log_output = trainer.train_step(samples)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/trainer.py", line 289, in train_step
raise e
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/trainer.py", line 266, in train_step
ignore_grad
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/tasks/fairseq_task.py", line 232, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/criterions/label_smoothed_cross_entropy.py", line 38, in forward
net_output = model(**sample['net_input'])
File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 376, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/models/fairseq_model.py", line 239, in forward
encoder_out = self.encoder(src_tokens, src_lengths=src_lengths, **kwargs)
File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/models/transformer.py", line 564, in forward
x = layer(x, encoder_padding_mask)
File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/models/transformer.py", line 1245, in forward
x, _ = self.self_attn(query=x, key=x, value=x, key_padding_mask=encoder_padding_mask)
File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/modules/multihead_attention.py", line 117, in forward
q, k, v = self.in_proj_qkv(query)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/modules/multihead_attention.py", line 240, in in_proj_qkv
return self._in_proj(query).chunk(3, dim=-1)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/modules/multihead_attention.py", line 277, in _in_proj
return F.linear(input, weight, bias)
File "/tools/anaconda3/envs/bertNMT/lib/python3.6/site-packages/torch/nn/functional.py", line 1408, in linear
output = input.matmul(weight.t())
RuntimeError: cublas runtime error : the GPU program failed to execute at /opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/THC/THCBlas.cu:259
The BERT-fused NMT model uses transformer_s2_iwslt_de_en, which seems to be the same as transformer_iwslt_de_en. Is it fair to assume that using a larger model such as transformer_s2_vaswani_wmt_en_de_big would improve accuracy at the cost of requiring more resources and increasing training and inference time?
The paper mentions a range of [0, 1.0] for the drop-net probability, but the code uses [0, 0.5]. Is there any difference between them? And what happens when the ratio is lower, such as 0.1 or 0.0?
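One possible reading of the two ranges, sketched here as an assumption rather than as the repository's actual code: if the code's ratio is applied to each branch separately (drop the BERT attention with probability ratio, drop the self-attention with probability ratio, otherwise average both), then a code ratio r would correspond to a paper-style rate of 2r, and [0, 0.5] would cover the same settings as [0, 1.0]. The function name drop_net_weights and its parameters are hypothetical.

```python
import random


def drop_net_weights(ratio, training=True, u=None):
    """Return (self_attention_weight, bert_attention_weight) for one pass.

    `ratio` stands in for --encoder-bert-dropout-ratio and is ASSUMED to lie
    in [0, 0.5]; under this reading the paper-style rate would be 2 * ratio.
    `u` lets the caller inject the random draw, which keeps the sketch testable.
    """
    if not 0.0 <= ratio <= 0.5:
        raise ValueError("ratio is assumed to lie in [0, 0.5]")
    if not training:
        return (0.5, 0.5)      # inference: always average both branches
    if u is None:
        u = random.random()    # uniform draw in [0, 1)
    if u < ratio:
        return (1.0, 0.0)      # keep only the self-attention branch
    if u > 1.0 - ratio:
        return (0.0, 1.0)      # keep only the BERT-attention branch
    return (0.5, 0.5)          # keep both branches, averaged


# With ratio 0.0 neither branch is ever dropped; with ratio 0.1 each branch
# is dropped in roughly 10% of training passes.
print(drop_net_weights(0.0, u=0.3))
print(drop_net_weights(0.1, u=0.05))
```

Under this assumption, a ratio of 0.0 would switch the trick off entirely, and 0.1 would drop one of the two branches in about one pass out of five.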
Hi,
I trained an NMT model for a low-resource language in fairseq. An epoch takes 30 seconds on 8 × 2080 Ti GPUs, on a dataset containing 0.3M sentences with the transformer-base architecture.
Using the same data and transformers2 as the architecture, I am getting around 20 minutes per epoch.
I used the train command present here.
Note: enabled args update-freq 16, --fp16 and ddp_backend no_c10d.
Can you please infer from these why the training time is so high? Thanks.
thank you very much
I preprocessed my data for bert-base-uncased and used the transformer_iwslt_de_en architecture to train the vanilla NMT model with the following command:
CUDA_VISIBLE_DEVICES=0 fairseq-train \
data-bin/sample \
--arch transformer_iwslt_de_en --share-decoder-input-output-embed \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
--dropout 0.3 --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 4096 \
--best-checkpoint-metric bleu --maximize-best-checkpoint-metric
Then I tried to train the BERT-fused NMT using:
python train.py $DATAPATH \
-a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 \
--dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9,0.98)' --save-dir $SAVEDIR $warmup \
--cpu --encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout | tee -a $SAVEDIR/training.log
Training is successful and I see the trained model checkpoint. But when I tried to use generate.py using this:
python generate.py data-bin/sample \
--path checkpoints/sample_iwslt_input_output_0.5/checkpoint293.pt --bert-model-name bert-base-uncased \
--beam 5 --remove-bpe --cpu
Generation hangs indefinitely and when I stop it forcefully I get the following error stack:
^CTraceback (most recent call last):
File "generate.py", line 195, in <module>
cli_main()
File "generate.py", line 191, in cli_main
main(args)
File "generate.py", line 109, in main
hypos = task.inference_step(generator, models, sample, prefix_tokens)
File "/home/skhurana/bert-nmt/fairseq/tasks/fairseq_task.py", line 246, in inference_step
return generator.generate(models, sample, prefix_tokens=prefix_tokens)
File "/home/skhurana/bert-nmt/venv/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 43, in decorate_no_grad
return func(*args, **kwargs)
File "/home/skhurana/bert-nmt/fairseq/sequence_generator.py", line 329, in generate
tokens[:, :step + 1], encoder_outs, bert_outs, temperature=self.temperature,
File "/home/skhurana/bert-nmt/venv/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 43, in decorate_no_grad
return func(*args, **kwargs)
File "/home/skhurana/bert-nmt/fairseq/sequence_generator.py", line 596, in forward_decoder
temperature=temperature,
File "/home/skhurana/bert-nmt/fairseq/sequence_generator.py", line 626, in _decode_one
decoder_out = list(model.decoder(tokens, encoder_out, bert_out, incremental_state=self.incremental_states[model]))
File "/home/skhurana/bert-nmt/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/skhurana/bert-nmt/fairseq/models/transformer.py", line 855, in forward
x, extra = self.extract_features(prev_output_tokens, encoder_out, bert_encoder_out, incremental_state)
File "/home/skhurana/bert-nmt/fairseq/models/transformer.py", line 904, in extract_features
self_attn_mask=self.buffered_future_mask(x) if incremental_state is None else None,
File "/home/skhurana/bert-nmt/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/skhurana/bert-nmt/fairseq/models/transformer.py", line 1511, in forward
attn_mask=self_attn_mask,
File "/home/skhurana/bert-nmt/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/skhurana/bert-nmt/fairseq/modules/multihead_attention.py", line 163, in forward
v = torch.cat((prev_value, v), dim=1)
KeyboardInterrupt
Not sure what is happening.
Does the version of fairseq matter if it's different in training the vanilla NMT and the one used here?
I trained a lightconv model via fairseq.
When I used it in bert-nmt, I got this error.
(bertNMT) blue90211@AI02:~/Storage01/bert-nmt$ CUDA_VISIBLE_DEVICES=0 python train.py $DATAPATH -a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' --adam-betas '(0.9,0.98)' --save-dir $SAVEDIR --share-all-embeddings $warmup --encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout --bert-model-name bert-base-uncased | tee -a $SAVEDIR/training.log
Traceback (most recent call last):
Namespace(adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='lightconv', attention_dropout=0.0, bert_first=True, bert_gates=[1, 1, 1, 1, 1, 1], bert_model_name='bert-base-uncased', bert_output_layer=-1, bert_ratio=1.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data='databin/wmt17_enzh_join', dataset_impl='cached', ddp_backend='c10d', decoder_attention_heads=8, decoder_conv_dim=512, decoder_conv_type='dynamic', decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=2048, decoder_glu=True, decoder_input_dim=512, decoder_kernel_size_list=[3, 7, 15, 31, 31, 31], decoder_layers=6, decoder_learned_pos=False, decoder_no_bert=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.3, encoder_attention_heads=8, encoder_bert_dropout=True, encoder_bert_dropout_ratio=0.5, encoder_bert_mixup=False, encoder_conv_dim=512, encoder_conv_type='dynamic', encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=2048, encoder_glu=True, encoder_kernel_size_list=[3, 7, 15, 31, 31, 31, 31], encoder_layers=7, encoder_learned_pos=False, encoder_normalize_before=False, encoder_ratio=1.0, find_unused_parameters=False, finetune_bert=False, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, input_dropout=0.1, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_cls_sep=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4000, max_update=150000, 
memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.0, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=True, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='en', target_lang='zh', task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_from_nmt=True, warmup_init_lr=1e-07, warmup_nmt_file='checkpoint_nmt.pt', warmup_updates=4000, weight_decay=0.0001, weight_dropout=0.0, weight_softmax=True)
| [en] dictionary: 73104 types
| [zh] dictionary: 73104 types
| databin/wmt17_enzh_join valid en-zh 2001 examples
File "train.py", line 315, in <module>
cli_main()
File "train.py", line 311, in cli_main
main(args)
File "train.py", line 49, in main
model = task.build_model(args)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/tasks/fairseq_task.py", line 169, in build_model
return models.build_model(args, self)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/models/__init__.py", line 50, in build_model
return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/models/lightconv.py", line 176, in build_model
return LightConvModel(encoder, decoder)
File "/mnt/Storage01/blue90211/bert-nmt/fairseq/models/lightconv.py", line 53, in __init__
super().__init__(encoder, decoder)
TypeError: __init__() missing 3 required positional arguments: 'bertencoder', 'berttokenizer', and 'mask_cls_sep'
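The TypeError above is what Python raises when a subclass forwards only two arguments to a base-class constructor that requires five. A minimal reproduction, assuming a base class whose constructor takes the three extra arguments named in the message; FusedBaseModel and TwoArgModel are hypothetical stand-ins, not the repository's classes:

```python
class FusedBaseModel:
    """Hypothetical base class that requires BERT components at construction."""

    def __init__(self, encoder, decoder, bertencoder, berttokenizer, mask_cls_sep):
        self.encoder = encoder
        self.decoder = decoder
        self.bertencoder = bertencoder
        self.berttokenizer = berttokenizer
        self.mask_cls_sep = mask_cls_sep


class TwoArgModel(FusedBaseModel):
    """Hypothetical subclass written against an older two-argument base class."""

    def __init__(self, encoder, decoder):
        # Only two of the five required arguments are forwarded.
        super().__init__(encoder, decoder)


def try_build():
    """Build the subclass and return the error text, or None on success."""
    try:
        TwoArgModel("encoder", "decoder")
    except TypeError as err:
        return str(err)
    return None


message = try_build()
print(message)
```

If this is the cause, an architecture that was not adapted to the BERT-fused base class would fail at model construction in the same way, regardless of the data or checkpoint used.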
After using the preprocess.py script you provided, I found that the dictionaries for the two languages are identical. Is this correct?
Hello, I am currently trying to use your code for an NMT task. I am working through the demo using fairseq/examples/translation/prepare-iwslt14.sh and following your instructions on data preprocessing.
After that, I tried to train the BERT-fused NMT model, but I encountered the error below. I am not sure where it comes from. Does it have something to do with my data path?
For the data path, I put all of the BERT-preprocessed files produced by your "preprocess.py" in this path: /content/bert-nmt/bert-nmt-files/Bert-NMT-files
I would sincerely appreciate it if you could take a look at the following error message and give me some ideas on what to modify. Thank you!
Traceback (most recent call last):
Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_s2_iwslt_de_en', attention_dropout=0.0, bert_first=True, bert_gates=[1, 1, 1, 1, 1, 1], bert_model_name='bert-base-uncased', bert_output_layer=-1, bert_ratio=1.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data='/content/bert-nmt/bert-nmt-files/Bert-NMT-files', dataset_impl='cached', ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=512, decoder_embed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_no_bert=False, decoder_normalize_before=False, decoder_output_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.3, encoder_attention_heads=4, encoder_bert_dropout=True, encoder_bert_dropout_ratio=0.5, encoder_bert_mixup=False, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_embed_dim=1024, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, encoder_ratio=1.0, find_unused_parameters=False, finetune_bert=False, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_cls_sep=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4000, max_update=150000, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, 
no_token_positional_embeddings=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints/iwed_${src}${tgt}${bedropout}', save_interval=1, save_interval_updates=0, seed=1, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='en', target_lang='de', task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_from_nmt=False, warmup_init_lr=1e-07, warmup_nmt_file='checkpoint_nmt.pt', warmup_updates=4000, weight_decay=0.0001)
| [en] dictionary: 10152 types
| [de] dictionary: 10152 types
| /content/bert-nmt/bert-nmt-files/Bert-NMT-files valid en-de 7283 examples
File "/content/bert-nmt/train.py", line 315, in <module>
cli_main()
File "/content/bert-nmt/train.py", line 311, in cli_main
main(args)
File "/content/bert-nmt/train.py", line 46, in main
task.load_dataset(valid_sub_split, combine=True, epoch=0)
File "/content/bert-nmt/fairseq/tasks/translation.py", line 213, in load_dataset
bert_model_name = self.bert_model_name
File "/content/bert-nmt/fairseq/tasks/translation.py", line 80, in load_langpair_dataset
srcbert_datasets, srcbert_datasets.sizes, berttokenizer,
AttributeError: 'NoneType' object has no attribute 'sizes'
Hello,
I've been trying to locate the file with the exact bert fused nmt architecture, could you please mention the name?
Thanks a lot for the clarification.
Is it not possible to use a non-fairseq NMT model for the saved checkpoint? I get a KeyError on state['best_loss'] in checkpoint_utils.py when I try to warm up with an OpenNMT transformer model.
Hi, I've been trying to customize the code lately, but I can't get it to work because for my language there is only a pre-trained RoBERTa model. Could you give me some insight on how to do that? Thank you
I am trying to use this model to generate questions from sentences. The model trains, but when I try to test with generate.py, I get this error:
FileNotFoundError: [Errno 2] No such file or directory: 'squaden/dict.en.txt'
My command is
python generate.py squaden --path checkpoints/iwed_en_tgt_0.5/checkpoint_best.pt --batch-size 128 --beam=5 --bert-model-name "bert-base-uncased" --cpu --source-lang en --target-lang tgt
I have also tried using interactive.py, but it gives me an error saying "data is a required argument". When I pass in data as an argument instead of through stdin, it gives me the same error as generate.py. Any advice would be appreciated.
I encountered the following problem: when I used a short Chinese-English corpus to preprocess the data with the command line shown in the figure, there was an error. The error is raised in /root/bert-nmt/fairseq/binarizer.py, line 60, at the call ids = dict.encode_line():
AttributeError: 'BertTokenizerFast' object has no attribute 'encode_line'.
I don't know why this error occurs. Is it because I use Chinese and English sentences? Looking forward to your reply. Thank you very much.
Can you tell me exactly how you obtained the pretrained model? I followed the steps on the README page: I ran prepare-iwslt14.sh (provided in examples) and then makedataforbert.sh. Then I ran preprocess.py, making sure to pass --joined-dictionary and that the encoder/decoder have the same dimensions as the transformer model expected by transformer-s2. My problem is that the transformer model I should pretrain is not training, and the validation loss keeps increasing.
my prepare-iwslt14.sh file:
echo 'Cloning Moses github repository (for tokenization scripts)...'
git clone https://github.com/moses-smt/mosesdecoder.git
echo 'Cloning Subword NMT repository (for BPE pre-processing)...'
git clone https://github.com/rsennrich/subword-nmt.git
SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
LC=$SCRIPTS/tokenizer/lowercase.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
BPEROOT=subword-nmt/subword_nmt
BPE_TOKENS=10000
URL="https://wit3.fbk.eu/archive/2014-01/texts/de/en/de-en.tgz"
GZ=de-en.tgz
if [ ! -d "$SCRIPTS" ]; then
echo "Please set SCRIPTS variable correctly to point to Moses scripts."
exit
fi
src=de
tgt=en
lang=de-en
prep=iwslt14.tokenized.de-en
tmp=$prep/tmp
orig=orig
mkdir -p $orig $tmp $prep
echo "Downloading data from ${URL}..."
cd $orig
wget "$URL"
if [ -f $GZ ]; then
echo "Data successfully downloaded."
else
echo "Data not successfully downloaded."
exit
fi
tar zxvf $GZ
cd ..
echo "pre-processing train data..."
for l in $src $tgt; do
f=train.tags.$lang.$l
tok=train.tags.$lang.tok.$l
cat $orig/$lang/$f | \
grep -v '<url>' | \
grep -v '<talkid>' | \
grep -v '<keywords>' | \
sed -e 's/<title>//g' | \
sed -e 's/<\/title>//g' | \
sed -e 's/<description>//g' | \
sed -e 's/<\/description>//g' | \
perl $TOKENIZER -threads 8 -l $l > $tmp/$tok
echo ""
done
perl $CLEAN -ratio 1.5 $tmp/train.tags.$lang.tok $src $tgt $tmp/train.tags.$lang.clean 1 175
for l in $src $tgt; do
perl $LC < $tmp/train.tags.$lang.clean.$l > $tmp/train.tags.$lang.$l
done
echo "pre-processing valid/test data..."
for l in $src $tgt; do
for o in `ls $orig/$lang/IWSLT14.TED*.$l.xml`; do
fname=${o##*/}
f=$tmp/${fname%.*}
echo $o $f
grep '<seg id' $o |
sed -e 's/<seg id="[0-9]*">\s*//g' |
sed -e 's/\s*<\/seg>\s*//g' |
sed -e "s/\’/'/g" |
perl $TOKENIZER -threads 8 -l $l |
perl $LC > $f
echo ""
done
done
echo "creating train, valid, test..."
for l in $src $tgt; do
awk '{if (NR%23 == 0) print $0; }' $tmp/train.tags.de-en.$l > $tmp/valid.$l
awk '{if (NR%23 != 0) print $0; }' $tmp/train.tags.de-en.$l > $tmp/train.$l
cat $tmp/IWSLT14.TED.dev2010.de-en.$l \
$tmp/IWSLT14.TEDX.dev2012.de-en.$l \
$tmp/IWSLT14.TED.tst2010.de-en.$l \
$tmp/IWSLT14.TED.tst2011.de-en.$l \
$tmp/IWSLT14.TED.tst2012.de-en.$l \
> $tmp/test.$l
done
TRAIN=$tmp/train.en-de
BPE_CODE=$prep/code
rm -f $TRAIN
for l in $src $tgt; do
cat $tmp/train.$l >> $TRAIN
done
echo "learn_bpe.py on ${TRAIN}..."
python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE
for L in $src $tgt; do
for f in train.$L valid.$L test.$L; do
echo "apply_bpe.py to ${f}..."
python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $prep/$f
done
done
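The train/validation split in the script above relies on awk's 1-based line counter NR: every 23rd line goes to the validation set and all other lines go to the training set. The same rule can be sketched in Python; split_train_valid is a hypothetical helper name used only for illustration:

```python
def split_train_valid(lines, every=23):
    """Mimic the awk rule: lines whose 1-based number is a multiple of
    `every` go to the validation set, all other lines to the training set."""
    train, valid = [], []
    for number, line in enumerate(lines, start=1):  # awk's NR starts at 1
        if number % every == 0:
            valid.append(line)
        else:
            train.append(line)
    return train, valid


# Tiny demonstration on 46 numbered lines.
demo_lines = [str(i) for i in range(1, 47)]
demo_train, demo_valid = split_train_valid(demo_lines)
print(len(demo_train), demo_valid)
```

Because the rule depends only on line numbers, applying it separately to the source and target files keeps the sentence pairs aligned, as long as both files have the same number of lines.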
my makedataforbert.sh file (which is properly placed):
#!/usr/bin/env bash
lng=$1
echo "src lng $lng"
for sub in train valid test
do
sed -r 's/(@@ )|(@@ ?$)//g' ${sub}.${lng} > ${sub}.bert.${lng}.tok
../mosesdecoder/scripts/tokenizer/detokenizer.perl -l $lng < ${sub}.bert.${lng}.tok > ${sub}.bert.${lng}
rm ${sub}.bert.${lng}.tok
done
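The sed expression in makedataforbert.sh undoes the BPE segmentation before detokenizing, so that BERT receives ordinary words instead of subword pieces. A small Python sketch of the same substitution; remove_bpe is a hypothetical helper, while the regular expression is copied from the script:

```python
import re

# Same pattern as the sed command: a "@@ " continuation marker anywhere,
# or a trailing "@@" (optionally followed by one space) at the end of the line.
BPE_MARKER = re.compile(r"(@@ )|(@@ ?$)")


def remove_bpe(line):
    """Join BPE subword pieces back into whole words."""
    return BPE_MARKER.sub("", line)


print(remove_bpe("wie@@ der@@ holen wir das"))
```

Each "@@ " marks a piece that continues into the next token, so deleting the marker together with its following space glues the pieces back together.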
my preprocessing command (which I ran from the bert-nmt directory):
TEXT=iwslt14.tokenized.de-en
python preprocess.py --source-lang de --target-lang en \
--trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
--destdir destdir --joined-dictionary --bert-model-name bert-base-uncased
my training command for obtaining a pretrained model (from the fairseq directory):
CUDA_VISIBLE_DEVICES=0 python train.py ../iwslt_de_en --arch transformer_iwslt_de_en --share-decoder-input-output-embed --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 4096 --share-all-embeddings
my logs
`Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, ada
ptive_softmax_dropout=0, arch='transformer_iwslt_de_en', attention_dropout=0.0, bucket_cap_mb=25, clip_norm=0.0, cpu=False, criterion='label_smoothed_cro
ss_entropy', curriculum=0, data='../iwslt_de_en', dataset_impl='cached', ddp_backend='c10d', decoder_attention_heads=4, decoder_embed_dim=512, decoder_em
bed_path=None, decoder_ffn_embed_dim=1024, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_ou
tput_dim=512, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_po
rt=-1, distributed_rank=0, distributed_world_size=1, dropout=0.3, encoder_attention_heads=4, encoder_embed_dim=512, encoder_embed_path=None, encoder_ffn_
embed_dim=1024, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, find_unused_parameters=False, fix_batches_to_gpus=False, fp1
6=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lazy_
load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1000, lr=[0.0005], lr_scheduler='inverse_sqrt', max_epoch=0, m
ax_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=4096, max_update=0, memory_efficient_fp16=F
alse, min_loss_scale=0.0001, min_lr=-1, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, num_worke
rs=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=False, reset_lr_scheduler=False, reset
_meters=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='checkpoints', save_interval=1, save_interval_updates=0, seed=1, senten
ce_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=True, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=
None, task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1
, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=-1, warmup_updates=4000, weight_decay=0.0001)
| [de] dictionary: 10152 types
| [en] dictionary: 10152 types
| ../iwslt_de_en valid de-en 7283 examples
TransformerModel(
(encoder): TransformerEncoder(
(embed_tokens): Embedding(10152, 512, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(1): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(2): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(3): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(4): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(5): TransformerEncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
)
(decoder): TransformerDecoder(
(embed_tokens): Embedding(10152, 512, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(1): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(2): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(3): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(4): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
(5): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=512, out_features=512, bias=True)
)
(encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=512, out_features=1024, bias=True)
(fc2): Linear(in_features=1024, out_features=512, bias=True)
(final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
)
)
)
)
| model transformer_iwslt_de_en, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 36741120 (num. trained: 36741120)
| training on 1 GPUs
| max tokens per GPU = 4096 and max sentences per GPU = None
| no existing checkpoint found checkpoints/checkpoint_last.pt
| loading train data for epoch 0
| ../iwslt_de_en train de-en 160239 examples
| NOTICE: your device may support faster training with --fp16
| epoch 001 | loss 9.987 | nll_loss 9.399 | ppl 675.32 | wps 23729 | ups 6 | wpb 3586.843 | bsz 145.540 | num_updates 1101 | lr 0.0005 | gnorm 0.893 | c$
ip 0.000 | oom 0.000 | wall 172 | train_wall 155
| epoch 001 | valid on 'valid' subset | loss 11.076 | nll_loss 10.470 | ppl 1417.97 | num_updates 1101
| saved checkpoint checkpoints/checkpoint1.pt (epoch 1 @ 1101 updates) (writing took 1.1984376907348633 seconds)
| epoch 002 | loss 9.853 | nll_loss 9.253 | ppl 610.28 | wps 23901 | ups 7 | wpb 3586.843 | bsz 145.540 | num_updates 2202 | lr 0.0005 | gnorm 0.710 | c$
ip 0.000 | oom 0.000 | wall 342 | train_wall 305
| epoch 002 | valid on 'valid' subset | loss 11.158 | nll_loss 10.556 | ppl 1504.99 | num_updates 2202 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint2.pt (epoch 2 @ 2202 updates) (writing took 3.039834976196289 seconds)
| epoch 003 | loss 9.814 | nll_loss 9.211 | ppl 592.59 | wps 24035 | ups 6 | wpb 3586.843 | bsz 145.540 | num_updates 3303 | lr 0.0005 | gnorm 0.677 | c$
ip 0.000 | oom 0.000 | wall 512 | train_wall 455
| epoch 003 | valid on 'valid' subset | loss 11.777 | nll_loss 11.146 | ppl 2265.77 | num_updates 3303 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint3.pt (epoch 3 @ 3303 updates) (writing took 2.8751749992370605 seconds)
| epoch 004 | loss 9.795 | nll_loss 9.190 | ppl 584.09 | wps 24097 | ups 6 | wpb 3586.843 | bsz 145.540 | num_updates 4404 | lr 0.000476515 | gnorm 0.64$
| clip 0.000 | oom 0.000 | wall 681 | train_wall 605
| epoch 004 | valid on 'valid' subset | loss 11.232 | nll_loss 10.616 | ppl 1569.04 | num_updates 4404 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint4.pt (epoch 4 @ 4404 updates) (writing took 2.5432913303375244 seconds)
| epoch 005 | loss 9.778 | nll_loss 9.170 | ppl 576.20 | wps 24204 | ups 7 | wpb 3586.843 | bsz 145.540 | num_updates 5505 | lr 0.000426208 | gnorm 0.61$
| clip 0.000 | oom 0.000 | wall 850 | train_wall 754
| epoch 005 | valid on 'valid' subset | loss 11.640 | nll_loss 11.013 | ppl 2066.59 | num_updates 5505 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint5.pt (epoch 5 @ 5505 updates) (writing took 3.2354724407196045 seconds)
| epoch 006 | loss 9.768 | nll_loss 9.160 | ppl 572.06 | wps 24162 | ups 6 | wpb 3586.843 | bsz 145.540 | num_updates 6606 | lr 0.000389073 | gnorm 0.60$
| clip 0.000 | oom 0.000 | wall 1019 | train_wall 903
| epoch 006 | valid on 'valid' subset | loss 12.176 | nll_loss 11.635 | ppl 3179.38 | num_updates 6606 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint6.pt (epoch 6 @ 6606 updates) (writing took 2.5028364658355713 seconds)
| epoch 007 | loss 9.763 | nll_loss 9.155 | ppl 570.00 | wps 24147 | ups 7 | wpb 3586.843 | bsz 145.540 | num_updates 7707 | lr 0.000360211 | gnorm 0.59$
| clip 0.000 | oom 0.000 | wall 1188 | train_wall 1053
| epoch 007 | valid on 'valid' subset | loss 11.722 | nll_loss 11.073 | ppl 2154.92 | num_updates 7707 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint7.pt (epoch 7 @ 7707 updates) (writing took 2.8729090690612793 seconds)
| epoch 008 | loss 9.759 | nll_loss 9.150 | ppl 568.17 | wps 24257 | ups 7 | wpb 3586.843 | bsz 145.540 | num_updates 8808 | lr 0.000336947 | gnorm 0.590
| clip 0.000 | oom 0.000 | wall 1356 | train_wall 1201
| epoch 008 | valid on 'valid' subset | loss 11.776 | nll_loss 11.143 | ppl 2261.35 | num_updates 8808 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint8.pt (epoch 8 @ 8808 updates) (writing took 2.8396410942077637 seconds)
| epoch 009 | loss 9.753 | nll_loss 9.143 | ppl 565.41 | wps 24271 | ups 7 | wpb 3586.843 | bsz 145.540 | num_updates 9909 | lr 0.000317676 | gnorm 0.618
| clip 0.000 | oom 0.000 | wall 1524 | train_wall 1350
| epoch 009 | valid on 'valid' subset | loss 11.559 | nll_loss 10.923 | ppl 1941.17 | num_updates 9909 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint9.pt (epoch 9 @ 9909 updates) (writing took 2.829507827758789 seconds)
| epoch 010 | loss 9.748 | nll_loss 9.138 | ppl 563.51 | wps 24342 | ups 7 | wpb 3586.843 | bsz 145.540 | num_updates 11010 | lr 0.000301374 | gnorm 0.58
0 | clip 0.000 | oom 0.000 | wall 1692 | train_wall 1499
| epoch 010 | valid on 'valid' subset | loss 11.753 | nll_loss 11.096 | ppl 2188.33 | num_updates 11010 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint10.pt (epoch 10 @ 11010 updates) (writing took 2.639819622039795 seconds)
| epoch 011 | loss 9.744 | nll_loss 9.133 | ppl 561.48 | wps 24320 | ups 7 | wpb 3586.843 | bsz 145.540 | num_updates 12111 | lr 0.000287349 | gnorm 0.57
3 | clip 0.000 | oom 0.000 | wall 1860 | train_wall 1647
| epoch 011 | valid on 'valid' subset | loss 11.856 | nll_loss 11.232 | ppl 2406.01 | num_updates 12111 | best_loss 11.0761
| saved checkpoint checkpoints/checkpoint11.pt (epoch 11 @ 12111 updates) (writing took 2.7617385387420654 seconds)
| epoch 012: 68%|▋| 749/1101 [02:04<00:51, 6.82it/s, loss=9.744, nll_loss=9.134, ppl=561.65, wps=21628, ups=6, wpb=3583.718, bsz=144.725, num_updates=1
`
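The `lr` column in the log above can be reproduced from the command-line settings (`--lr 5e-4`, `--lr-scheduler inverse_sqrt`, `--warmup-updates 4000`, and `warmup_init_lr=-1` in the Namespace dump). The following is a minimal sketch, assuming that a negative `warmup_init_lr` means "start the warmup at the peak rate" and that the rate afterwards decays with the inverse square root of the update count. The function name `inverse_sqrt_lr` is mine, and this is a reading of the logged numbers, not a copy of the fairseq scheduler.

```python
import math


def inverse_sqrt_lr(num_updates: int,
                    peak_lr: float = 5e-4,
                    warmup_updates: int = 4000,
                    warmup_init_lr: float = -1.0) -> float:
    """Learning rate after `num_updates` steps (illustrative sketch)."""
    # Assumption: a negative initial rate means the warmup starts at peak_lr,
    # which makes the warmup phase flat.
    if warmup_init_lr < 0:
        warmup_init_lr = peak_lr
    if num_updates < warmup_updates:
        # Linear warmup from warmup_init_lr to peak_lr.
        step = (peak_lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + num_updates * step
    # After warmup: decay proportional to 1 / sqrt(num_updates).
    return peak_lr * math.sqrt(warmup_updates / num_updates)


if __name__ == "__main__":
    # Update counts taken from the end-of-epoch lines in the log.
    for n in (1101, 4404, 5505, 12111):
        print(n, f"{inverse_sqrt_lr(n):.9f}")
```

Under these assumptions the rate stays at 0.0005 for the first 4000 updates, so the model is trained at the full learning rate from the very first step.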
My problem is that the validation perplexity never drops below 1000, which is far too high.
Did I do anything differently from the intended setup?
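As a sanity check on how to read these numbers: the `ppl` values in the log are consistent with `ppl = 2 ** nll_loss`, i.e. `nll_loss` is reported in bits. A minimal sketch (the function name `perplexity` is illustrative; the value pairs are copied from the epoch 001 lines of the log):

```python
def perplexity(nll_loss_bits: float) -> float:
    """Perplexity for a negative log-likelihood given in bits (base 2)."""
    return 2.0 ** nll_loss_bits


if __name__ == "__main__":
    # (nll_loss, ppl) pairs copied from the epoch 001 train and valid lines.
    logged = [(9.399, 675.32), (10.470, 1417.97)]
    for nll, ppl in logged:
        print(f"nll_loss={nll:.3f}  computed={perplexity(nll):.2f}  logged={ppl:.2f}")
```

The small differences between computed and logged values come from `nll_loss` being rounded to three decimals in the log.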
The makedataforbert.sh script doesn't run out of the box.
To reproduce, run bash makedataforbert.sh 'fr-en.fr':
src lng fr-en.fr
sed: can't read train.fr-en.fr: No such file or directory
makedataforbert.sh: line 7: ../mosesdecoder/scripts/tokenizer/detokenizer.perl: No such file or directory
sed: can't read valid.fr-en.fr: No such file or directory
makedataforbert.sh: line 7: ../mosesdecoder/scripts/tokenizer/detokenizer.perl: No such file or directory
sed: can't read test.fr-en.fr: No such file or directory
makedataforbert.sh: line 7: ../mosesdecoder/scripts/tokenizer/detokenizer.perl: No such file or directory
The reason is that the previous step downloads the files into the translation/iwslt17.de_fr.en.bpe16k folder instead of the folder that contains the makedataforbert script.
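A quick way to see which inputs the script will fail on before running it is to check the paths it expects relative to the working directory. This is a minimal sketch, assuming the script reads `train.<lng>`, `valid.<lng>` and `test.<lng>` from the current directory and calls the detokenizer at the relative path shown in the error output. The helper `missing_inputs` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Relative path taken from the error message printed by makedataforbert.sh.
DETOKENIZER = "../mosesdecoder/scripts/tokenizer/detokenizer.perl"


def missing_inputs(workdir, lng, detokenizer=DETOKENIZER):
    """Return the paths the script would need in `workdir` but cannot find."""
    workdir = Path(workdir)
    expected = [workdir / f"{sub}.{lng}" for sub in ("train", "valid", "test")]
    expected.append(workdir / detokenizer)
    return [str(path) for path in expected if not path.exists()]


if __name__ == "__main__":
    # Run from the directory in which you intend to call the script.
    for path in missing_inputs(".", "fr-en.fr"):
        print("missing:", path)
```

An empty result means every file the script touches is reachable from that directory.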
I tried to use a pretrained XLM with my own code, and it failed to train. So I tried the source code and the pretrained BERT from the URL in your code, and it also produces this error.
Is there anything I should pay attention to here?
FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding. Try lowering the learning rate, using gradient clipping or increasing the batch size.
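For context on what this message means: with fp16 training the loss is multiplied by a scale factor, and that factor is reduced every time the gradients overflow. The error is raised once the factor has shrunk to the minimum. The following toy sketch counts how many consecutive overflows that takes. It assumes an initial scale of 128 (the `fp16_init_scale` value shown in a Namespace dump elsewhere on this page), the 0.0001 minimum from the error message, and a halving on every overflow. It ignores that the scale grows again after a run of successful steps, and it is a simplified illustration, not the library's implementation.

```python
def halvings_until_failure(init_scale: float = 128.0,
                           min_scale: float = 1e-4,
                           factor: float = 2.0) -> int:
    """Count consecutive overflows before the scale reaches the minimum."""
    scale = init_scale
    overflows = 0
    # Assumption: each overflow divides the scale by `factor`, and training
    # aborts as soon as the scale is at or below `min_scale`.
    while scale > min_scale:
        scale /= factor
        overflows += 1
    return overflows


if __name__ == "__main__":
    print(halvings_until_failure())
```

With those assumed defaults the count comes out at 21, so the error indicates a long unbroken run of overflowing steps rather than a single bad batch.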
I could only find a little information about casing in Appendix.1:
for En-De, we lowercase all words, split 7k sentence pairs from the training dataset for validation and concatenate dev2010, dev2012, tst2010, tst2011, tst2012 as the test set.
Does this mean the BLEU score is case-insensitive?
python train.py $DATAPATH \
-a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 \
--dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9,0.98)' --save-dir $SAVEDIR --share-all-embeddings $warmup \
--encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout | tee -a $SAVEDIR/training.log
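Are the options above all of the parameters that can be set? If I want to unfreeze the BERT parameters during training, how should I do that?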
Hello hello,
I am currently experimenting with the BERT-Fuse model. I could successfully run a training with a small dataset. Now that I am attempting a training with a dataset of more than 100 million tokens, the training gets stuck at 100% GPU utilisation after a few epochs. I tried reducing the batch size, but the issue remains. Has anyone faced a similar issue? Do you have any suggestions for solving this, please?
Thank you in advance.
Is there any way I can check the extent to which BERT attention is being used at both the encoder and decoder level? Is there anything in the code that can help me do this? Basically, I want to see whether BERT attention is being used and whether it differs between inference time and training time.