Comments (7)
The learning rate is set to 5e-5.
from bert-nmt.
Did you use fp16?
Can you paste your scripts for running the bert-nmt model, and the training logs?
from bert-nmt.
Sorry, some of the logs were lost; here are a few of the logs and the scripts.
WFS_HOME=/data/offline/tf_wfs_plugin/wfs-tf
LD_LIBRARY_PATH=/data/offline/tf_wfs_plugin/wfs-tf:/data/offline/tf_wfs_plugin/wfs-tf/wfs-pb:/home/qspace/viplib/:/home/qspace/viplib64/::/usr/local/python3/lib:
| distributed init (rank 3): tcp://localhost:13982
| distributed init (rank 1): tcp://localhost:13982
| distributed init (rank 2): tcp://localhost:13982
| distributed init (rank 0): tcp://localhost:13982
| initialized host mmnewyardnodesz29269 as rank 3
| initialized host mmnewyardnodesz29269 as rank 1
| initialized host mmnewyardnodesz29269 as rank 2
| initialized host mmnewyardnodesz29269 as rank 0
Namespace(activation_dropout=0.0, activation_fn='relu', adam_betas='(0.9,0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_s2_vaswani_wmt_en_de_big', attention_dropout=0.0, bert_first=True, bert_gates=[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], bert_model_name='/data/xianfzeng/FUSE-NMT/bert-nmt-master/bert/pretrain_model', bert_output_layer=-1, bert_ratio=1.0, bucket_cap_mb=25, clip_norm=25, cpu=False, criterion='label_smoothed_cross_entropy', curriculum=0, data='/data/xianfzeng/FUSE-NMT/fairseq_bertnmt', dataset_impl='cached', ddp_backend='no_c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_no_bert=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method='tcp://localhost:13982', distributed_no_spawn=False, distributed_port=-1, distributed_rank=0, distributed_world_size=4, dropout=0.2, encoder_attention_heads=16, encoder_bert_dropout=True, encoder_bert_dropout_ratio=0.3, encoder_bert_mixup=False, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=4096, encoder_layers=20, encoder_learned_pos=False, encoder_normalize_before=False, encoder_ratio=1.0, find_unused_parameters=False, finetune_bert=False, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lazy_load=False, left_pad_source='True', left_pad_target='False', log_format=None, log_interval=1, lr=[0.0005], lr_scheduler='inverse_sqrt', mask_cls_sep=False, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=128, max_target_positions=128, max_tokens=8192, max_update=150000, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, required_batch_size_multiple=8, reset_dataloader=True, reset_lr_scheduler=True, reset_meters=False, reset_optimizer=True, restore_file='checkpoint_last.pt', save_dir='/data/xianfzeng/FUSE-NMT/ckpts_bert', save_interval=100000000000, save_interval_updates=5000, seed=1, sentence_avg=False, share_all_embeddings=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, source_lang='en', target_lang='cn', task='translation', tbmf_wrapper=False, tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[10], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_from_nmt=True, warmup_init_lr=1e-07, warmup_nmt_file='/data/xianfzeng/FUSE-NMT/ckpts_bert/ckpt_nmt.pt', warmup_updates=4000, weight_decay=0.0001)
| [en] dictionary: 38038 types
| [cn] dictionary: 47038 types
| /data/xianfzeng/FUSE-NMT/fairseq_bertnmt valid en-cn 1997 examples
berttokenizer: <bert.tokenization.BertTokenizer object at 0x7fc460c26a20>
decoder bert gates: [True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True]
decoder bert_gates [True, True, True, True, True, True]
TransformerS2Model(
(encoder): TransformerS2Encoder(
(embed_tokens): Embedding(38038, 1024, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerS2EncoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(bert_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): FusedLayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(1)-(19): 19 more TransformerS2EncoderLayer modules, identical in structure to (0) above [repeated blocks elided]
)
)
(decoder): TransformerDecoder(
(embed_tokens): Embedding(47038, 1024, padding_idx=1)
(embed_positions): SinusoidalPositionalEmbedding()
(layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(self_attn_layer_norm): FusedLayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(encoder_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(bert_attn): MultiheadAttention(
(out_proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(encoder_attn_layer_norm): FusedLayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=1024, out_features=4096, bias=True)
(fc2): Linear(in_features=4096, out_features=1024, bias=True)
(final_layer_norm): FusedLayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
)
(1)-(5): 5 more TransformerDecoderLayer modules, identical in structure to (0) above [repeated blocks elided]
)
)
(bert_encoder): BertModel(
(embeddings): BertEmbeddings(
(word_embeddings): Embedding(28996, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): FusedLayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): FusedLayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): FusedLayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1)
)
)
(1)-(11): 11 more BertLayer modules, identical in structure to (0) above [repeated blocks elided]
)
)
(pooler): BertPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
)
| model transformer_s2_vaswani_wmt_en_de_big, criterion LabelSmoothedCrossEntropyCriterion
| num. model params: 691826432 (num. trained: 583516160)
| training on 4 GPUs
| max tokens per GPU = 8192 and max sentences per GPU = None
nmt path: /data/xianfzeng/FUSE-NMT/ckpts_bert/ckpt_nmt.pt
Model will load checkpoint from /data/xianfzeng/FUSE-NMT/ckpts_bert/ckpt_nmt.pt
| loaded checkpoint /data/xianfzeng/FUSE-NMT/ckpts_bert/ckpt_nmt.pt (epoch 6 @ 0 updates)
| NOTICE: your device may support faster training with --fp16
| loading train data for epoch 0
| /data/xianfzeng/FUSE-NMT/fairseq_bertnmt train en-cn 22521703 examples
berttokenizer: <bert.tokenization.BertTokenizer object at 0x7fc44fe704a8>
| WARNING: 812 samples have invalid sizes and will be skipped, max_positions=(128, 128), first few sample ids=[18641833, 18790658, 17492500, 3812474, 1208199, 16677192, 11828924, 13219881, 342115, 9696306]
| saved checkpoint /data/xianfzeng/FUSE-NMT/ckpts_bert/checkpoint0.pt (epoch 0 @ 0 updates) (writing took 9.053580284118652 seconds)
| epoch 001: 1 / 6929 loss=16.386, nll_loss=16.411, ppl=87145.25, wps=7321, ups=0, wpb=78118.500, bsz=3168.000, num_updates=2, lr=3.4995e-07, gnorm=13.280, clip=0.000, oom=0.500, wall=434, train_wall=21
| epoch 001: 2 / 6929 loss=16.385, nll_loss=16.409, ppl=87026.18, wps=6938, ups=0, wpb=74748.333, bsz=3109.333, num_updates=3, lr=4.74925e-07, gnorm=17.387, clip=0.333, oom=1.000, wall=444, train_wall=32
| epoch 001: 3 / 6929 loss=16.381, nll_loss=16.405, ppl=86792.37, wps=6771, ups=0, wpb=72535.250, bsz=3072.000, num_updates=4, lr=5.999e-07, gnorm=16.877, clip=0.250, oom=1.500, wall=454, train_wall=42
DATAPATH=/data/xianfzeng/FUSE-NMT/fairseq_bertnmt
#ARCH=transformer_s2_xlm_iwslt_de_en
ARCH=transformer_s2_vaswani_wmt_en_de_big
src=en
tgt=cn
SAVEDIR=/data/xianfzeng/FUSE-NMT/ckpts_bert
#warmup="--warmup-from-nmt --reset-lr-scheduler --reset-optimizer --reset-dataloader"
warmup=""
bedropout=0.3
work_dir=/data/xianfzeng/FUSE-NMT/bert-nmt-master
export PYTHONPATH=$work_dir:$PYTHONPATH
export CUDA_LAUNCH_BLOCKING=1
python $work_dir/fairseq_cli/train.py $DATAPATH \
-a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 \
--dropout 0.2 --max-tokens 4096 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9,0.98)' --save-dir $SAVEDIR $warmup \
--warmup-nmt-file $SAVEDIR/checkpoint_nmt.pt \
--max-source-positions 128 \
--max-target-positions 128 \
--encoder-layers 20 \
--encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout \
--ddp-backend=no_c10d \
--log-interval 1 \
--save-interval-updates 2500 \
--save-interval 100000000000 \
--update-freq 4 \
--fp16 \
--bert-model-name $work_dir/bert/pretrain_model
#--warmup-nmt-file $SAVEDIR/ckpt_nmt.pt
#--bert-gates 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
#--use-bmuf
The loss was very high at the beginning and did not decrease for a long time. I also tried XLM and RoBERTa, for which I changed some of the bert-nmt code, and I always get a gradient error.
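Note that in the script as pasted, warmup is set to the empty string, so the --warmup-from-nmt and --reset-* flags are never passed (the Namespace in the log above, by contrast, shows warmup_from_nmt=True). For reference, here is a sketch of the same launch with the NMT warm start enabled, reusing the commented-out warmup line and the variables defined in the script; nothing else is changed:

warmup="--warmup-from-nmt --reset-lr-scheduler --reset-optimizer --reset-dataloader"
# Same command as above, with $warmup expanded so the warm-start checkpoint
# ($SAVEDIR/checkpoint_nmt.pt) is actually loaded before BERT-fused training.
python $work_dir/fairseq_cli/train.py $DATAPATH \
-a $ARCH --optimizer adam --lr 0.0005 -s $src -t $tgt --label-smoothing 0.1 \
--dropout 0.2 --max-tokens 4096 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 150000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9,0.98)' --save-dir $SAVEDIR $warmup \
--warmup-nmt-file $SAVEDIR/checkpoint_nmt.pt \
--max-source-positions 128 --max-target-positions 128 --encoder-layers 20 \
--encoder-bert-dropout --encoder-bert-dropout-ratio $bedropout \
--ddp-backend=no_c10d --update-freq 4 \
--bert-model-name $work_dir/bert/pretrain_model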
from bert-nmt.
The NMT model can be trained properly.
from bert-nmt.
About the 'FloatingPointError': I have met it before when using --fp16 in fairseq. You can remove this flag and try again, or decrease your learning rate.
If you load the warmup model properly, the loss at epoch 1 should not be so high; it should be close to the loss at the end of the warmup model's training.
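For instance, a minimal sketch of those two mitigations applied to the launch script pasted above (the file name train_bertnmt.sh is hypothetical; --fp16-init-scale and --min-loss-scale are the standard fairseq flags that also appear in the Namespace dump above):

# Assumes the launch script above is saved as train_bertnmt.sh (hypothetical name).
# Mitigation 1: delete the '--fp16 \' line to train in fp32.
# Mitigation 2: lower the peak learning rate from 0.0005 to 0.0001.
sed -i '/--fp16 \\/d; s/--lr 0.0005/--lr 0.0001/' train_bertnmt.sh
# If fp16 must be kept for memory, relaxing the loss scaling can also help:
#   --fp16 --fp16-init-scale 8 --min-loss-scale 0.0001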
from bert-nmt.
(quoting the previous reply about the 'FloatingPointError' and --fp16)
So, if the loss of epoch 1 is very low, like 3.x, does that mean I loaded the model correctly?
from bert-nmt.
(quoting the exchange above)
Yes.
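As a quick sanity check against the console output (a sketch; train.log is a hypothetical file capturing the output pasted above):

# The warm-start load should be reported explicitly, as in the log above:
#   | loaded checkpoint .../ckpt_nmt.pt (epoch 6 @ 0 updates)
grep 'loaded checkpoint' train.log
# And the first epoch-001 lines should already show a low loss (around 3.x)
# rather than the ~16.4 seen in the log above.
grep 'epoch 001' train.log | head -n 3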
from bert-nmt.
Related Issues (20)
- Give an example of how to use "python interactive.py"
- How to translate a sentence with a trained model in your code?
- How to train a vanilla NMT model using Fairseq
- Why is the BLEU so strange
- error when calling interactive.py
- Can this repo support multi-GPU training?
- using bert-nmt for Arabic GEC
- 'NoneType' object has no attribute 'sizes'
- ResourceWarning when running prepare-iwslt14.sh
- how to use multiple GPUs
- RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
- Fairseq repository download position
- What parameters should I modify in options.py to apply multi-GPU training?
- A question about fuse-bert
- Train error!
- Exception: Cannot load model parameters from checkpoint, please ensure that the architectures match.
- fairseq 0.6.2 training does not converge
- AssertionError: Index file doesn't match expected format. Make sure that --dataset-impl is configured properly.
- Exception: Cannot load model parameters from checkpoint, please ensure that the architectures match.
- please help: to run the code