
xlm's People

Contributors

cclauss, ethanjperez, glample, jowagner, jrapin, kubapok, louismartin, sedflix, tagucci, talschuster, victorsanh


xlm's Issues

Cannot get good results if I train from original data and script

Hi, when I ran the translation task, I ran into a problem. I can get similar results if I load your mlm_enfr_1024.pth, but I cannot get good results if I start from your get-data-nmt.sh, for both the de-en and en-fr cases.

details:
Running command: python train.py --exp_name 'my_enfr_mlm' --dump_path './dumped/' --exp_id 'bs.20' --data_path './data/processed/en-fr/' --lgs 'en-fr' --clm_steps '' --mlm_steps 'en,fr' --emb_dim '1024' --n_layers '6' --n_heads '8' --dropout '0.1' --attention_dropout '0.1' --gelu_activation 'true' --batch_size '32' --bptt '256' --optimizer 'adam,lr=0.0001' --epoch_size '200000' --validation_metrics '_valid_mlm_ppl' --stopping_criterion '_valid_mlm_ppl,10'

INFO - 02/20/19 17:14:59 - 0:50:54 - valid_en_mlm_ppl -> 1413.372916
INFO - 02/20/19 17:14:59 - 0:50:54 - log:{"epoch": 0, "valid_en_mlm_ppl": 1413.3729161899485, "valid_en_mlm_acc": 4.681079149544399, "valid_fr_mlm_ppl": 1137.9702763241598, "valid_fr_mlm_acc": 4.591462520170163, "valid_mlm_ppl": 1275.6715962570543, "valid_mlm_acc": 4.636270834857281, "test_en_mlm_ppl": 1377.6397512089368, "test_en_mlm_acc": 4.500805152979066, "test_fr_mlm_ppl": 1547.092026693417, "test_fr_mlm_acc": 4.81150066011442, "test_mlm_ppl": 1462.3658889511769, "test_mlm_acc": 4.656152906546742}
INFO - 02/20/19 18:05:31 - 1:41:26 - valid_en_mlm_ppl -> 2161.567965
INFO - 02/20/19 18:05:31 - 1:41:26 - log:{"epoch": 1, "valid_en_mlm_ppl": 2161.56796481175, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1688.979616470098, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1925.2737906409238, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2062.9860141920476, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2497.6693821048448, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2280.327698148446, "test_mlm_acc": 4.0700737027375675}
INFO - 02/20/19 18:56:00 - 2:31:55 - valid_en_mlm_ppl -> 2245.817440
INFO - 02/20/19 18:56:00 - 2:31:55 - log:{"epoch": 2, "valid_en_mlm_ppl": 2245.8174404810325, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1625.404408585545, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1935.6109245332887, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2138.2897057505943, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2388.5677765876662, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2263.4287411691303, "test_mlm_acc": 4.0700737027375675}
INFO - 02/20/19 19:46:31 - 3:22:26 - valid_en_mlm_ppl -> 2165.622311
INFO - 02/20/19 19:46:31 - 3:22:26 - log:{"epoch": 3, "valid_en_mlm_ppl": 2165.6223114703407, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1680.1268854516293, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1922.874598460985, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2075.5851921823105, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2465.9347158442074, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2270.7599540132587, "test_mlm_acc": 4.0700737027375675}
INFO - 02/20/19 20:37:00 - 4:12:55 - valid_en_mlm_ppl -> 2062.631943
INFO - 02/20/19 20:37:00 - 4:12:55 - log:{"epoch": 4, "valid_en_mlm_ppl": 2062.6319433943568, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1765.4204690043236, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1914.0262061993403, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 1966.636764557332, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2606.315150449565, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2286.4759575034486, "test_mlm_acc": 4.0700737027375675}
INFO - 02/20/19 21:27:28 - 5:03:23 - valid_en_mlm_ppl -> 2151.624741
INFO - 02/20/19 21:27:28 - 5:03:23 - log:{"epoch": 5, "valid_en_mlm_ppl": 2151.624740528933, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1690.7461604349478, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1921.1854504819405, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2054.5326346790675, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2479.448594677353, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2266.9906146782105, "test_mlm_acc": 4.0700737027375675}
INFO - 02/20/19 22:17:56 - 5:53:51 - valid_en_mlm_ppl -> 2155.638091
INFO - 02/20/19 22:17:56 - 5:53:51 - log:{"epoch": 6, "valid_en_mlm_ppl": 2155.6380909977584, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1699.0517872173994, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1927.3449391075787, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2053.9586330892766, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2483.16693279636, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2268.5627829428186, "test_mlm_acc": 4.0700737027375675}
INFO - 02/20/19 23:08:23 - 6:44:18 - valid_en_mlm_ppl -> 2133.608678
INFO - 02/20/19 23:08:23 - 6:44:18 - log:{"epoch": 7, "valid_en_mlm_ppl": 2133.608678409897, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1695.3582695161938, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1914.4834739630455, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2038.1278812563512, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2492.9029435971656, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2265.5154124267583, "test_mlm_acc": 4.0700737027375675}
INFO - 02/20/19 23:58:51 - 7:34:46 - valid_en_mlm_ppl -> 2065.049633
INFO - 02/20/19 23:58:51 - 7:34:46 - log:{"epoch": 8, "valid_en_mlm_ppl": 2065.049632547123, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1770.2985750724292, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1917.6741038097762, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 1973.5921541087191, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2588.5655595835324, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2281.0788568461257, "test_mlm_acc": 4.0700737027375675}
INFO - 02/21/19 00:49:20 - 8:25:15 - valid_en_mlm_ppl -> 2177.331599
INFO - 02/21/19 00:49:20 - 8:25:15 - log:{"epoch": 9, "valid_en_mlm_ppl": 2177.331599451264, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1664.960476646684, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1921.1460380489739, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2081.1290653201354, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2436.2827245826775, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2258.7058949514067, "test_mlm_acc": 4.0700737027375675}
INFO - 02/21/19 01:39:46 - 9:15:41 - valid_en_mlm_ppl -> 2110.860061
INFO - 02/21/19 01:39:46 - 9:15:41 - log:{"epoch": 10, "valid_en_mlm_ppl": 2110.8600607294125, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1716.5880506037283, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1913.7240556665704, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2007.549178045412, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2522.7412353839986, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2265.145206714705, "test_mlm_acc": 4.0700737027375675}
INFO - 02/21/19 02:30:13 - 10:06:08 - valid_en_mlm_ppl -> 2208.660441
INFO - 02/21/19 02:30:13 - 10:06:08 - log:{"epoch": 11, "valid_en_mlm_ppl": 2208.6604406115257, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1656.203270846642, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1932.431855729084, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2111.8613551170783, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2405.011263807759, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2258.4363094624186, "test_mlm_acc": 4.0700737027375675}


INFO - 02/20/19 16:24:05 - 0:00:00 - ============ Monolingual data (en)
INFO - 02/20/19 16:24:05 - 0:00:00 - Loading data from ./data/processed/en-fr/train.en.pth ...
INFO - 02/20/19 16:24:06 - 0:00:01 - 129033877 words (64139 unique) in 5000000 sentences. 0 unknown words (0 unique) covering 0.00% of the data.

INFO - 02/20/19 16:24:08 - 0:00:03 - Loading data from ./data/processed/en-fr/valid.en.pth ...
INFO - 02/20/19 16:24:08 - 0:00:03 - 69727 words (64139 unique) in 3000 sentences. 1 unknown words (1 unique) covering 0.00% of the data.

INFO - 02/20/19 16:24:08 - 0:00:03 - Loading data from ./data/processed/en-fr/test.en.pth ...
INFO - 02/20/19 16:24:09 - 0:00:03 - 76017 words (64139 unique) in 3003 sentences. 0 unknown words (0 unique) covering 0.00% of the data.

INFO - 02/20/19 16:24:09 - 0:00:04 - ============ Monolingual data (fr)
INFO - 02/20/19 16:24:09 - 0:00:04 - Loading data from ./data/processed/en-fr/train.fr.pth ...
INFO - 02/20/19 16:24:09 - 0:00:04 - 130884578 words (64139 unique) in 5000000 sentences. 0 unknown words (0 unique) covering 0.00% of the data.

INFO - 02/20/19 16:24:12 - 0:00:06 - Loading data from ./data/processed/en-fr/valid.fr.pth ...
INFO - 02/20/19 16:24:12 - 0:00:07 - 79585 words (64139 unique) in 3000 sentences. 1 unknown words (1 unique) covering 0.00% of the data.

INFO - 02/20/19 16:24:12 - 0:00:07 - Loading data from ./data/processed/en-fr/test.fr.pth ...
INFO - 02/20/19 16:24:12 - 0:00:07 - 86351 words (64139 unique) in 3003 sentences. 0 unknown words (0 unique) covering 0.00% of the data.

INFO - 02/20/19 16:24:13 - 0:00:08 - ============ Data summary
INFO - 02/20/19 16:24:13 - 0:00:08 - Monolingual data - train - en: 5000000
INFO - 02/20/19 16:24:13 - 0:00:08 - Monolingual data - valid - en: 3000
INFO - 02/20/19 16:24:13 - 0:00:08 - Monolingual data - test - en: 3003
INFO - 02/20/19 16:24:13 - 0:00:08 - Monolingual data - train - fr: 5000000
INFO - 02/20/19 16:24:13 - 0:00:08 - Monolingual data - valid - fr: 3000
INFO - 02/20/19 16:24:13 - 0:00:08 - Monolingual data - test - fr: 3003

Address already in use

I tried to run several multi-GPU programs on a single server, but I encountered this problem:
RuntimeError: Address already in use at /pytorch/torch/lib/THD/process_group/General.cpp:17
So, if I have 4 GPUs on a single server and want to run two programs on GPUs 0,1 and 2,3, how should I set the local_rank and master_port parameters? @glample
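One workaround I can think of (my own sketch, not from the authors): restrict each job to its own GPUs with CUDA_VISIBLE_DEVICES and give every job a distinct rendezvous port through torch.distributed.launch, which also sets local_rank for each process, e.g.:

export NGPU=2
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=$NGPU --master_port=29501 train.py [arguments for job 1]
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=$NGPU --master_port=29502 train.py [arguments for job 2]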

Pretrained word embeddings

First, thanks for sharing your code!

I really appreciate it.

I have a question about pre-trained word embeddings for unsupervised NMT task.

While reviewing the code, I found that you never use pre-trained word embeddings
(since --reload_emb is empty).

If it is true that pre-trained word embeddings were not used, is there a specific reason for not using them?

Thank You!

Memory is not released

Hi, when the program ends, the memory of GPU 0 is released, but the memory of the other GPUs is not. Why is that?

Why SRC < TGT ?

if [ "$SRC" \> "$TGT" ]; then echo "please ensure SRC < TGT"; exit; fi

Hi @glample,
Can you explain why you make the assumption "SRC < TGT"?
I noticed it also in:

if src < tgt and ((src, tgt) in required_para or (tgt, src) in required_para)
_lang1, _lang2 = (lang1, lang2) if lang1 < lang2 else (lang2, lang1)
assert lang1 < lang2
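My own reading of the intent (not confirmed by the authors): ordering the pair gives one canonical key per language pair, so both directions of a dataset are indexed consistently. A tiny illustration:

# illustrative only: both orderings of the pair collapse to the same canonical key
lang1, lang2 = 'fr', 'en'
_lang1, _lang2 = (lang1, lang2) if lang1 < lang2 else (lang2, lang1)
assert (_lang1, _lang2) == ('en', 'fr')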

Reloading model and params from Checkpoint

Hi,
How can I reload the checkpoint and model file in order to continue from the last epoch I reached in a previous (aborted) run? I want to do this in both the pretraining stage and the training stage.

Thanks,
Odel
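One thing that might work (my assumption, not verified against this exact version of the code): train.py seems to accept a --reload_checkpoint argument, so relaunching the same command and pointing it at the checkpoint saved in the experiment's dump folder, e.g.

python train.py [same arguments as the original run] --reload_checkpoint './dumped/<exp_name>/<exp_id>/checkpoint.pth'

ought to resume from the last saved epoch, in both the pretraining and the translation stages.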

Performance of Unsupervised NMT with 5M monolingual data

Hi, @glample . Thank you for your nice contribution.

I have noticed that the demo you released only uses 5M monolingual sentences. I have tried it and it seems it cannot reach the accuracy the paper reports, but I would like to know what accuracy it should reach with 5M monolingual sentences (just for reference). Can you provide some help?

Translation script

Hello! Do you happen to have a translate.py script so that the model can be used to translate new data? I saw the --eval_only parameter, but it seems that the file to be translated has to be named according to the naming conventions specified in the trainer (and the data folder has to contain all the training/validation files too). The evaluator also appears to be using the target language file to get the maximum sentence length, which we shouldn't have access to when translating a new document.

Thanks for your help!

How to save fine-tune models for XNLI task?

Hi,
I ran the XNLI fine-tuning task (with MLM+TLM) and got an average accuracy of 73.5 (compared to 75.1 in your paper). The code generated params.pkl; however, I could not find the fine-tuned model. How do I save the model after fine-tuning (or after every epoch of fine-tuning)?
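A minimal sketch of one way to do it (hypothetical names on my side: model for the pretrained transformer, proj for the XNLI classification head, params.dump_path for the experiment folder; none of these are confirmed to match the repo's variables):

import os
import torch

def save_xnli_checkpoint(model, proj, params, epoch):
    # save both the fine-tuned transformer and the classification head after each epoch
    path = os.path.join(params.dump_path, 'xnli_finetuned_epoch%i.pth' % epoch)
    torch.save({'epoch': epoch, 'model': model.state_dict(), 'proj': proj.state_dict()}, path)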

FP16 training for mt and bt steps.

Hi, I noticed in the code that fp16 training is manually disabled for the machine translation and back-translation updates by assert False statements.

Specifically, I am trying to use the MT step. I commented out the assert statement and added retain_graph=True in the first backward call, but I noticed that after doing this my throughput was actually lower than without fp16 enabled.

Can you help me with correctly setting up the fp16 training for mt step?

The result becomes 0 at the end of the second epoch when I pretrain a model with the MLM objective for Mongolian and Chinese

Hi,@glample

The result becomes 0 at the end of the second epoch when I pretrain a model with the MLM objective for Mongolian and Chinese. Is the preprocessing method inappropriate?

details:
python train.py --exp_name 'my_mnzh_mlm' --dump_path './dumped/' --exp_id '190225' --data_path './data/processed/mn-zh/' --lgs 'mn-zh' --clm_steps '' --mlm_steps 'mn,zh' --emb_dim '1024' --n_layers '6' --n_heads '8' --dropout '0.2' --attention_dropout '0.2' --gelu_activation 'true' --batch_size '16' --bptt '256' --optimizer 'adam,lr=0.0001' --epoch_size '300000' --validation_metrics '_valid_mlm_ppl' --stopping_criterion '_valid_mlm_ppl,10'

INFO - 02/25/19 13:21:37 - 3:07:50 - ============ End of epoch 0 ============
INFO - 02/25/19 13:21:48 - 3:08:01 - epoch -> 0.000000
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_mn_mlm_ppl -> 574.678424
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_mn_mlm_acc -> 17.192429
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_zh_mlm_ppl -> 5591.294827
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_zh_mlm_acc -> 14.550473
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_mlm_ppl -> 3082.986625
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_mlm_acc -> 15.871451
INFO - 02/25/19 13:21:48 - 3:08:01 - test_mn_mlm_ppl -> 436.168551
INFO - 02/25/19 13:21:48 - 3:08:01 - test_mn_mlm_acc -> 13.728215
INFO - 02/25/19 13:21:48 - 3:08:01 - test_zh_mlm_ppl -> 32195.137737
INFO - 02/25/19 13:21:48 - 3:08:01 - test_zh_mlm_acc -> 7.138838
INFO - 02/25/19 13:21:48 - 3:08:01 - test_mlm_ppl -> 16315.653144
INFO - 02/25/19 13:21:48 - 3:08:01 - test_mlm_acc -> 10.433527

INFO - 02/25/19 16:29:17 - 6:15:30 - ============ End of epoch 1 ============
INFO - 02/25/19 16:29:28 - 6:15:41 - epoch -> 1.000000
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_mn_mlm_ppl -> 966.486405
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_mn_mlm_acc -> 7.886435
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_zh_mlm_ppl -> 8967.092445
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_zh_mlm_acc -> 0.000000
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_mlm_ppl -> 4966.789425
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_mlm_acc -> 3.943218
INFO - 02/25/19 16:29:28 - 6:15:41 - test_mn_mlm_ppl -> 808.229061
INFO - 02/25/19 16:29:28 - 6:15:41 - test_mn_mlm_acc -> 12.853917
INFO - 02/25/19 16:29:28 - 6:15:41 - test_zh_mlm_ppl -> 43495.881859
INFO - 02/25/19 16:29:28 - 6:15:41 - test_zh_mlm_acc -> 0.000000
INFO - 02/25/19 16:29:28 - 6:15:41 - test_mlm_ppl -> 22152.055460
INFO - 02/25/19 16:29:28 - 6:15:41 - test_mlm_acc -> 6.426958

Hyperparameters for replicating supervised MT ro -> en result.

Hi,

I am trying to replicate the supervised MT ro -> en baseline of 28.4 mentioned in the paper. I was hoping that you could give me some idea about the hyperparameters for that.
Specifically, could you tell me the number of BPE operations, the learning rate and learning rate schedule, the dropout and attention dropout values, the embedding size of the network, the batch size, and the number of GPUs used during training?

Thanks!

Question About Decoder

How does the decoder know which direction to go (lang1 or lang2) when the input language is lang1? In other words, how does the decoder know which mode it is in, DAE or MT?
In the previous version (UNMT), it used different projection layers. In XLM, self.pred_layer is always the same. @glample

Embeddings for each subword in a sentence

Hi,

thanks for releasing the code for Cross-lingual Language Model Pretraining ❤️

I would like to know if it's possible to encode a whole sentence and get the embeddings for each token (or better, each subword). The notebook only contains an example of how to encode a whole sentence, but could you also provide a way to get the embeddings for each subword?

Thanks :)
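A minimal sketch of what I mean, assuming the 'fwd' call from the sentence-embedding notebook (where model, x, lengths and langs are prepared exactly as for a sentence embedding) returns a (sequence_length, batch, dim) tensor of hidden states:

# names below are assumed to be built as in the provided notebook
tensor = model('fwd', x=x, lengths=lengths, langs=langs, causal=False)
# position 0 of each column is the sentence embedding used in the notebook;
# the other positions would then be the per-subword contextual embeddings
first_sentence_subwords = tensor[:lengths[0], 0]  # (n_subwords, dim) for the first sentence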

loss.backward is blocked

Hi,
Thanks a lot for the awesome project~
I appended an MLP after the XLM sentence embedding to build a QA model. After running for a while (single GPU, ~21000 steps, batch size 8), it blocks on the loss.backward step without any error message. If run on 4 GPUs, it blocks sooner (around step 4200, 4*8 batch size). Could you please give some hint on how I can fix this?
Thanks a lot~

weird code in Evaluator.get_iterator

Hi,

I just found a weird piece of code at:

if len(self.params.langs) > 30:
eval_lgs = set(["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh", "ab", "ay", "bug", "ha", "ko", "ln", "min", "nds", "pap", "pt", "tg", "to", "udm", "uk", "zh_classical"])
eval_lgs = set(["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh"])
subsample = 10 if (data_set == 'test' or lang1 not in eval_lgs) else 5
n_sentences = 600 if (data_set == 'test' or lang1 not in eval_lgs) else 1500

If possible, may I ask about the intuition behind this "hack"?

Thanks.

training a model for languages with great differences

Thanks for your work. I have a question: when training on languages with great differences, such as Chinese-English or English-Kazakh, is it a good choice to share all parameters? I notice that XLM usually shares all parameters.

Subsampling frequent outputs

Hi,

thanks for sharing your code!
I'm just wondering if you have implemented the subsampling of frequent outputs (can't find it in your code) and if it was crucial for the performance.

Cheers,
Stephan

loss on padding

Hi, do we need to ignore the loss on padding tokens when we do back-translation? It seems that the code doesn't ignore padding when computing the loss. Thank you very much.

pred_mask = alen[:, None] < len1[None] - 1 # do not predict anything given the last target word
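An illustrative sketch (mine, not necessarily the repository's exact code) of why a mask built this way already leaves padding out of the loss: positions at or beyond a sentence's true length come out False, so padded targets are never selected.

import torch

len1 = torch.tensor([5, 3])                             # true lengths of two target sentences
alen = torch.arange(len1.max().item(), dtype=torch.long)
pred_mask = alen[:, None] < len1[None] - 1              # (max_len, batch) boolean mask
print(pred_mask[:, 1])                                  # tensor([True, True, False, False, False])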

How can I use multi-GPU to train UNMT

I added --local_rank, but it raises an error.

SLURM job: False
Traceback (most recent call last):
File "train.py", line 322, in
main(params)
File "train.py", line 198, in main
init_distributed_mode(params)
File "XLM/src/slurm.py", line 110, in init_distributed_mode
params.global_rank = int(os.environ['RANK'])
File "/usr/lib/python3.5/os.py", line 725, in getitem
raise KeyError(key) from None
KeyError: 'RANK'
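For what it's worth: the multi-GPU commands used elsewhere in this repo (see e.g. the supervised MT scripts further down this page) go through torch.distributed.launch, and it is that launcher which sets the RANK environment variable read in src/slurm.py and passes --local_rank to each process. Running train.py directly with a hand-set --local_rank leaves RANK unset, which is exactly this KeyError. Something along these lines should work:

export NGPU=4; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py [your usual UNMT arguments]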

Bug with file path in get-data-xnli.sh

Hi, there were a couple of bugs in the "get-data-xnli.sh" script related to file paths. The following is the fix:

  1. comment "mkdir -p $XNLI_PATH" (line 29) -- creating this directory prevents downloading XNLI-1.0.zip

  2. replace
    mkdir -p $PROCESSED_PATH/eval/XNLI
    rm $PROCESSED_PATH/eval/XNLI/* -- this gives the error "cannot remove...no such file..."
    with
    if [ -d $PROCESSED_PATH/eval/XNLI ]; then
    rm -rf $PROCESSED_PATH/eval/XNLI
    fi
    mkdir -p $PROCESSED_PATH/eval/XNLI

How can I get the words embeddings?

Hello!
Thank you for sharing this code!

Is there an easy way to get the embedding of a particular word, like those shown in Table 5 of the paper?
Thank you!
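A rough sketch of one way to read a single token's input embedding out of the lookup table, assuming model and dico are loaded as in the provided notebook and that the dictionary exposes a word2id mapping and the model an embeddings table (both names are my assumptions):

word = 'cat'                                   # the token must exist in the (BPE) vocabulary
word_id = dico.word2id[word]                   # assumed Dictionary attribute
embedding = model.embeddings.weight[word_id]   # assumed name of the input embedding table
print(embedding.shape)                         # (emb_dim,)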

couldn't match SOTA performance on wmt14 EnDe

Dear authors,

I understand this repo isn't primarily intended for supervised MT, but your codebase contains a Transformer encoder-decoder model and, more importantly, is much simpler than the standard supervised MT codebases (e.g. T2T, Fairseq, OpenNMT).

Intending to reproduce the WMT14 En-De SOTA performance, I used the data & BPE from Fairseq and trained the Transformer base (emb_dim=512) with only mt_steps="en-de" on 4x 2080 Ti (with one GPU the result is even lower). I finally got a tokenized BLEU score of 25.63 with beam_size 4 and length_penalty 0.6, which is more than 1 BLEU lower than reported in the Transformer paper.

Training script:
export NGPU=4; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py --exp_name wmt14_ende --dump_path ./dumped/ --data_path ./data/processed/wmt14_de-en/fairseq --lgs 'en-de' --encoder_only false --emb_dim 512 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --tokens_per_batch 6000 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 --epoch_size 200000 --eval_bleu true --stopping_criterion 'valid_en-de_mt_bleu,10' --validation_metrics 'valid_en-de_mt_bleu' --mt_steps "en-de" --gpus '0,1,2,3'

Translate results:

valid_en-de_mt_ppl-> 5.401580
valid_en-de_mt_acc -> 65.806969
valid_en-de_mt_bleu -> 28.990000
test_en-de_mt_ppl -> 5.942769
test_en-de_mt_acc -> 66.605212
test_en-de_mt_bleu -> 25.630000

My intuition is that the model structure is slightly different (gelu, layer_norm, etc.). May I ask whether you have tried the supervised MT WMT14 benchmark, and what your thoughts are on this?

Best.

Question About Performance

The paper reports that the best en-fr BLEU is 33.4. The readme.md shows
'epoch -> 7
valid_fr-en_mt_bleu -> 28.36
valid_en-fr_mt_bleu -> 30.50
test_fr-en_mt_bleu -> 34.02
test_en-fr_mt_bleu -> 36.62'.
Does this difference result from the max_len parameter, which removes long sentences from the parallel test corpus?

Experience OOM error during evaluate_mt()

Dear authors,
Thank you so much for your code. I'm trying to reproduce the supervised MT results on WMT14 en-de. Training works fine with a single GPU (or multiple GPUs). However, I frequently get an OOM error after one epoch, during the evaluate_mt() step. Here's the script I used and the error message:

python train.py --exp_name wmt14_ende --dump_path ./dumped/ --data_path ./data/processed/wmt14_de-en --lgs 'en-de' --encoder_only false --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --tokens_per_batch 2000 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 --epoch_size 200000 --eval_bleu true --stopping_criterion 'valid_en-de_mt_bleu,10' --validation_metrics 'valid_en-de_mt_bleu' --mt_steps "en-de" --gpus '0'
(--gpus just indicates the gpuid to use)

Traceback (most recent call last):
File "train.py", line 325, in
main(params)
File "train.py", line 300, in main
scores = evaluator.run_all_evals(trainer)
File "/nfs/eecs-fserv/share/yangyil/XLM/src/evaluation/evaluator.py", line 181, in run_all_evals
self.evaluate_mt(scores, data_set, lang1, lang2, eval_bleu)
File "/nfs/eecs-fserv/share/yangyil/XLM/src/evaluation/evaluator.py", line 377, in evaluate_mt
word_scores, loss = decoder('predict', tensor=dec2, pred_mask=pred_mask, y=y, get_scores=True)
File "/nfs/stak/users/yangyil/shared/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/nfs/eecs-fserv/share/yangyil/XLM/src/model/transformer.py", line 313, in forward
return self.predict(**kwargs)
File "/nfs/eecs-fserv/share/yangyil/XLM/src/model/transformer.py", line 416, in predict
scores, loss = self.pred_layer(masked_tensor, y, get_scores)
File "/nfs/stak/users/yangyil/shared/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/nfs/eecs-fserv/share/yangyil/XLM/src/model/transformer.py", line 132, in forward
loss = F.cross_entropy(scores, y, reduction='elementwise_mean')
File "/nfs/stak/users/yangyil/shared/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1550, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
File "/nfs/stak/users/yangyil/shared/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 975, in log_softmax
return input.log_softmax(dim)
RuntimeError: CUDA error: out of memory

The OOM always happens within F.cross_entropy(), although cross_entropy doesn't always trigger it. Do you have any ideas for making this more stable?

Another thing: I use pytorch 0.4.1 and didn't experience issue #15, but if I update to 1.0.1, I hit another error, pytorch/pytorch#13273 (_queue_reduction() doesn't take a torch.distributed.ProcessGroupNCCL object).

Best.
Yilin

Adjust learning rate

Hi, I noticed that for both unsupervised NMT training and MLM training the learning rate is 0.0001. Is this the learning rate for training with 8 GPUs? If I use 4 GPUs, how should I adjust the learning rate and warm-up? Thank you very much.

TypeError: cross_entropy() got an unexpected keyword argument 'reduction'

Hi, @glample

I trained with a single GPU and got an error just like the one shown in the title.
Running commands:
First: ./get-data-nmt.sh --src en --tgt fr
which printed:
===== Data summary
Monolingual training data:
en: ./data/processed/en-fr/train.en.pth
fr: ./data/processed/en-fr/train.fr.pth
Monolingual validation data:
en: ./data/processed/en-fr/valid.en.pth
fr: ./data/processed/en-fr/valid.fr.pth
Monolingual test data:
en: ./data/processed/en-fr/test.en.pth
fr: ./data/processed/en-fr/test.fr.pth
Parallel validation data:
en: ./data/processed/en-fr/valid.en-fr.en.pth
fr: ./data/processed/en-fr/valid.en-fr.fr.pth
Parallel test data:
en: ./data/processed/en-fr/test.en-fr.en.pth
fr: ./data/processed/en-fr/test.en-fr.fr.pth
And then run: python train.py --exp_name 'my_enfr_mlm' --dump_path './dumped/' --exp_id 'bs.20' --data_path './data/processed/en-fr/' --lgs 'en-fr' --clm_steps '' --mlm_steps 'en,fr' --emb_dim '1024' --n_layers '6' --n_heads '8' --dropout '0.1' --attention_dropout '0.1' --gelu_activation 'true' --batch_size '8' --bptt '256' --optimizer 'adam,lr=0.0001' --epoch_size '300000' --validation_metrics '_valid_mlm_ppl' --stopping_criterion '_valid_mlm_ppl,10'

and got the error in the title.
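My guess at the cause (not confirmed in this thread): the reduction= keyword of F.cross_entropy only exists from PyTorch 0.4.1 onwards, so an older install raises exactly this TypeError on the call the model makes. A quick check:

import torch
import torch.nn.functional as F

print(torch.__version__)   # needs a version whose cross_entropy supports reduction=
scores, y = torch.randn(4, 10), torch.randint(0, 10, (4,))
F.cross_entropy(scores, y, reduction='elementwise_mean')  # same keyword the repo uses; raises this TypeError if unsupported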

The BLEU decreased when train on Unsupervised NMT

Hi,@glample

I pre-trained a language model and used it for unsupervised NMT training, but the BLEU score gets lower and lower. Is there something wrong?

Details:
The language model:
INFO - 03/14/19 16:57:00 - 23:43:04 - ============ End of epoch 11 ============
INFO - 03/14/19 16:57:06 - 23:43:10 - epoch -> 11.000000
INFO - 03/14/19 16:57:06 - 23:43:10 - valid_mn_mlm_ppl -> 12.698742
INFO - 03/14/19 16:57:06 - 23:43:10 - valid_mn_mlm_acc -> 61.901453
INFO - 03/14/19 16:57:06 - 23:43:10 - valid_zh_mlm_ppl -> 482.045657
INFO - 03/14/19 16:57:06 - 23:43:10 - valid_zh_mlm_acc -> 24.392448
INFO - 03/14/19 16:57:06 - 23:43:10 - valid_mlm_ppl -> 247.372200
INFO - 03/14/19 16:57:06 - 23:43:10 - valid_mlm_acc -> 43.146951
INFO - 03/14/19 16:57:06 - 23:43:10 - test_mn_mlm_ppl -> 34.794975
INFO - 03/14/19 16:57:06 - 23:43:10 - test_mn_mlm_acc -> 52.602524
INFO - 03/14/19 16:57:06 - 23:43:10 - test_zh_mlm_ppl -> 124.785448
INFO - 03/14/19 16:57:06 - 23:43:10 - test_zh_mlm_acc -> 34.501062
INFO - 03/14/19 16:57:06 - 23:43:10 - test_mlm_ppl -> 79.790211
INFO - 03/14/19 16:57:06 - 23:43:10 - test_mlm_acc -> 43.551793

Unsupervised NMT:

python3.6.2 train.py --exp_name unsupMT_mnzh --dump_path ./dumped/ --exp_id '190315' --reload_model './dumped/my_mnzh_mlm/190313/best-valid_mlm_ppl.pth,./dumped/my_mnzh_mlm/190313/best-valid_mlm_ppl.pth' --data_path ./data/processed/mn-zh/ --lgs 'mn-zh' --ae_steps 'mn,zh' --bt_steps 'mn-zh-mn,zh-mn-zh' --word_shuffle 3 --word_dropout 0.1 --word_blank 0.1 --lambda_ae '0:1,100000:0.1,300000:0' --encoder_only false --emb_dim 768 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --tokens_per_batch 1000 --batch_size 16 --max_batch_size 64 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001,weight_decay=0 --epoch_size 300000 --eval_bleu true --stopping_criterion 'valid_mn-zh_mt_bleu,10' --validation_metrics 'valid_mn-zh_mt_bleu'

INFO - 03/15/19 12:54:23 - 3:17:34 - ============ End of epoch 0 ============
INFO - 03/15/19 12:56:06 - 3:19:16 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp0.mn-zh.valid.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.mn-zh.valid.txt : 0.180000
INFO - 03/15/19 12:58:15 - 3:21:25 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp0.zh-mn.valid.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.zh-mn.valid.txt : 2.740000
INFO - 03/15/19 12:58:36 - 3:21:47 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp0.mn-zh.test.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.mn-zh.test.txt : 0.000000
INFO - 03/15/19 12:59:01 - 3:22:12 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp0.zh-mn.test.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.zh-mn.test.txt : 2.160000
INFO - 03/15/19 12:59:01 - 3:22:12 - epoch -> 0.000000
INFO - 03/15/19 12:59:01 - 3:22:12 - valid_mn-zh_mt_ppl -> 6020.106288
INFO - 03/15/19 12:59:01 - 3:22:12 - valid_mn-zh_mt_acc -> 9.684522
INFO - 03/15/19 12:59:01 - 3:22:12 - valid_mn-zh_mt_bleu -> 0.180000
INFO - 03/15/19 12:59:01 - 3:22:12 - valid_zh-mn_mt_ppl -> 146.305114
INFO - 03/15/19 12:59:01 - 3:22:12 - valid_zh-mn_mt_acc -> 40.263721
INFO - 03/15/19 12:59:01 - 3:22:12 - valid_zh-mn_mt_bleu -> 2.740000
INFO - 03/15/19 12:59:01 - 3:22:12 - test_mn-zh_mt_ppl -> 6059.479785
INFO - 03/15/19 12:59:01 - 3:22:12 - test_mn-zh_mt_acc -> 12.168889
INFO - 03/15/19 12:59:01 - 3:22:12 - test_mn-zh_mt_bleu -> 0.000000
INFO - 03/15/19 12:59:01 - 3:22:12 - test_zh-mn_mt_ppl -> 488.040713
INFO - 03/15/19 12:59:01 - 3:22:12 - test_zh-mn_mt_acc -> 34.044409
INFO - 03/15/19 12:59:01 - 3:22:12 - test_zh-mn_mt_bleu -> 2.160000

INFO - 03/16/19 06:23:05 - 20:46:16 - ============ End of epoch 5 ============
INFO - 03/16/19 06:25:41 - 20:48:51 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp5.mn-zh.valid.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.mn-zh.valid.txt : 0.000000
INFO - 03/16/19 06:27:31 - 20:50:41 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp5.zh-mn.valid.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.zh-mn.valid.txt : 0.280000
INFO - 03/16/19 06:27:58 - 20:51:09 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp5.mn-zh.test.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.mn-zh.test.txt : 0.000000
INFO - 03/16/19 06:28:22 - 20:51:33 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp5.zh-mn.test.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.zh-mn.test.txt : 0.920000
INFO - 03/16/19 06:28:22 - 20:51:33 - epoch -> 5.000000
INFO - 03/16/19 06:28:22 - 20:51:33 - valid_mn-zh_mt_ppl -> 9263.390210
INFO - 03/16/19 06:28:22 - 20:51:33 - valid_mn-zh_mt_acc -> 7.963293
INFO - 03/16/19 06:28:22 - 20:51:33 - valid_mn-zh_mt_bleu -> 0.000000
INFO - 03/16/19 06:28:22 - 20:51:33 - valid_zh-mn_mt_ppl -> 195.211674
INFO - 03/16/19 06:28:22 - 20:51:33 - valid_zh-mn_mt_acc -> 36.910448
INFO - 03/16/19 06:28:22 - 20:51:33 - valid_zh-mn_mt_bleu -> 0.280000
INFO - 03/16/19 06:28:22 - 20:51:33 - test_mn-zh_mt_ppl -> 9938.071239
INFO - 03/16/19 06:28:22 - 20:51:33 - test_mn-zh_mt_acc -> 6.666667
INFO - 03/16/19 06:28:22 - 20:51:33 - test_mn-zh_mt_bleu -> 0.000000
INFO - 03/16/19 06:28:22 - 20:51:33 - test_zh-mn_mt_ppl -> 619.158340
INFO - 03/16/19 06:28:22 - 20:51:33 - test_zh-mn_mt_acc -> 32.541759
INFO - 03/16/19 06:28:22 - 20:51:33 - test_zh-mn_mt_bleu -> 0.920000

reloading decoder from mlm_1024.pth

Hi, thanks for your work. When I reload the decoder from the pretrained mlm_1024.pth, the following warnings are raised:

WARNING - 03/03/19 10:07:57 - 0:00:37 - Parameter layer_norm15.0.weight not found.
WARNING - 03/03/19 10:07:57 - 0:00:37 - Parameter layer_norm15.0.bias not found.
WARNING - 03/03/19 10:07:57 - 0:00:37 - Parameter encoder_attn.0.q_lin.weight not found.
WARNING - 03/03/19 10:07:57 - 0:00:37 - Parameter encoder_attn.0.q_lin.bias not found.
WARNING - 03/03/19 10:07:57 - 0:00:37 - Parameter encoder_attn.0.k_lin.weight not found.
...

if dec_path != '':

My training is for unsupervised NMT. Is this normal? How can I fix it? Thank you very much.

truecasing

Hi,

did you do truecasing/lowercasing in your MT experiments? I can't find any sign of this in the code.

Is there any specific reason to do / not do it?

Thanks

Not able to learn with sinusoidal embeddings.

Hi,
I ran the MLM pretraining for en-fr using the default arguments.
I noticed that while the model is able to learn with the learned positional embeddings, with sinusoidal embeddings it completely fails to learn and the validation accuracy stays around 5%.

Did you face similar issues when using sinusoidal embeddings?

Thanks!

RuntimeError: CUDA error: device-side assert triggered

Hello! I have been running your translate.py script and ran into this error on a particular line of my input file, which contains a BPE-ised URL but otherwise nothing special (only 13 subwords long). The error occurs on the following line of code:

decoded, dec_lengths = decoder.generate(encoded, lengths.cuda(), params.tgt_id, max_len=int(1.5 * lengths.max().item() + 10))

Do you have any suggestions about what might be causing this error and how it could be fixed? Thank you very much in advance!

Error when using multi-GPU for training MT only

I tried to train a machine translation model using parallel data only. The script I used for training is as follows:

export NGPU=4; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py \
        --exp_name supMT_deen \
        --dump_path ./checkpoints/ \
        --data_path /unsullied/sharefs/zhaoyuekai/data/WMT/corpus/de-en/processed/ \
        --lgs 'de-en' \
        --mt_steps 'de-en' \
        --lambda_mt '0:1,100000:0.1,300000:0' \
         --encoder_only false \
        --emb_dim 1024 \
        --n_layers 6 \
         --n_heads 8 \
         --dropout 0.1 \
         --attention_dropout 0.1 \
         --gelu_activation true  \
         --tokens_per_batch 2000 \
         --batch_size 32 \
         --bptt 256 \
         --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
         --epoch_size 200000 \
         --eval_bleu true \
         --stopping_criterion 'valid_en-fr_mt_bleu,10' \
         --validation_metrics 'valid_en-fr_mt_bleu'

When training on only one GPU, no error was reported; however, when I tried to train on 4 GPUs, every process raised the following error.

Traceback (most recent call last):
  File "train.py", line 341, in <module>
    main(params)
  File "train.py", line 300, in main
    trainer.mt_step(lang1, lang2, params.lambda_mt)
  File "/unsullied/sharefs/zhaoyuekai/data/XLM/config/XLM.active/src/trainer.py", line 770, in mt_step
    self.optimize(loss, ['encoder', 'decoder'])
  File "/unsullied/sharefs/zhaoyuekai/data/XLM/config/XLM.active/src/trainer.py", line 131, in optimize
    loss.backward()
  File "/home/zhaoyuekai/miniconda3/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/zhaoyuekai/miniconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/zhaoyuekai/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/home/zhaoyuekai/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

RuntimeError: CUDA out of memory. Tried to allocate 498.50 MiB (GPU 0; 7.92 GiB total capacity; 6.74 GiB already allocated; 307.56 MiB free; 3.53 MiB cached)

Hi,@glample

I pretrained a model with the MLM objective for Mongolian and Chinese, but when I used the pretrained model for mn-zh machine translation, I got the error in the title. I tried reducing --batch_size from the default 32 down to 16, 8, 4, 2, and 1, but that didn't help. Do you have any good solutions to share?

The pretrained result is:
INFO - 02/28/19 09:47:19 - 1 day, 1:01:20 - ============ End of epoch 7 ============
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - epoch -> 7.000000
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_mn_mlm_ppl -> 20.055305
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_mn_mlm_acc -> 56.151420
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_zh_mlm_ppl -> 1813.456839
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_zh_mlm_acc -> 28.312303
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_mlm_ppl -> 916.756072
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_mlm_acc -> 42.231861
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_mn_mlm_ppl -> 8.259349
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_mn_mlm_acc -> 65.375485
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_zh_mlm_ppl -> 11569.002599
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_zh_mlm_acc -> 15.452244
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_mlm_ppl -> 5788.630974
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_mlm_acc -> 40.413864

Train on unsupervised MT from the pretrained model
python train.py --exp_name unsupMT_mnzh --dump_path ./dumped/ --reload_model 'best-valid_mlm_ppl.pth,best-valid_mlm_ppl.pth' --data_path ./data/processed/mn-zh/ --lgs 'mn-zh' --ae_steps 'mn,zh' --bt_steps 'mn-zh-mn,zh-mn-zh' --word_shuffle 3 --word_dropout 0.1 --word_blank 0.1 --lambda_ae '0:1,100000:0.1,300000:0' --encoder_only false --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --tokens_per_batch 2000 --batch_size 16 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.999,lr=0.0001 --epoch_size 300000 --eval_bleu true --stopping_criterion 'valid_mn-zh_mt_bleu,10' --validation_metrics 'valid_mn-zh_mt_bleu'
