Comments (27)

cordercorder avatar cordercorder commented on August 23, 2024

Hi, thanks for your attention.

As there are no test sets for zero-shot translation on the TED-59 dataset, we construct 3306 (58 * 57) test sets for zero-shot language pairs by pairing sentences of any two languages via aligned English sentences in the original test sets.
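In other words, the pairing step looks roughly like the following sketch (illustrative only; the file naming and helper functions are assumptions rather than the exact extraction script):

```python
# Pair src -> tgt test sentences through the shared English side of the original
# English-centric test sets. Illustrative only; file names are assumptions.
from pathlib import Path

def read_lines(path):
    return Path(path).read_text(encoding="utf-8").splitlines()

def build_zero_shot_test_set(src, tgt, data_dir="."):
    d = Path(data_dir)
    # map each English test sentence of the en-src pair to its src-side translation
    en_to_src = dict(zip(read_lines(d / f"test.en-{src}.en"),
                         read_lines(d / f"test.en-{src}.{src}")))
    pairs = []
    for en, tgt_sent in zip(read_lines(d / f"test.en-{tgt}.en"),
                            read_lines(d / f"test.en-{tgt}.{tgt}")):
        if en in en_to_src:  # keep sentences whose English pivot occurs in both test sets
            pairs.append((en_to_src[en], tgt_sent))
    return pairs
```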

Are the test sets you use for zero-shot translation on the TED-59 dataset the same as ours?

from nmt-multi.

zanchangtong avatar zanchangtong commented on August 23, 2024

I use the nmt-multi/scripts/ted/data_process/extract_parallel_data_from_centric_corpus.sh to extract zero-shot test sets from the tokenized parallel corpora, and process them with fairseq-preprocess.

from nmt-multi.

zanchangtong avatar zanchangtong commented on August 23, 2024

Could it be the effect of max-tokens=16384? The total batch size is 16384 * 4.

from nmt-multi.

cordercorder avatar cordercorder commented on August 23, 2024

Yes, we also used this script to produce test sets for zero-shot translation on the TED-59 dataset.

I use the nmt-multi/scripts/ted/data_process/extract_parallel_data_from_centric_corpus.sh to extract zero-shot test sets from the tokenized parallel corpora, and process them with fairseq-preprocess.

Maybe. I only use max-tokens=4096 throughout the experiments and the total batch size is 65536 (4096 * 4 * 4, max-tokens * num_process * update-freq).

Could it be the effect of max-tokens=16384? The total batch size is 16384 * 4.
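For clarity, the effective batch size both settings aim at can be written out as a small sketch (the multipliers follow fairseq's max-tokens x number of processes x update-freq convention; the helper function is only illustrative):

```python
# Effective batch size in tokens per optimizer update:
# per-GPU max-tokens x number of processes (GPUs) x update-freq.
def effective_batch_tokens(max_tokens, num_process, update_freq):
    return max_tokens * num_process * update_freq

print(effective_batch_tokens(4096, 4, 4))   # 65536, the setting described above
print(effective_batch_tokens(16384, 4, 1))  # 65536, same total with larger per-step batches
```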

Do you remove the artificial English tag (__en__) prepended to every non-English sentence in the raw dataset? This may also affect model training and bias BLEU.

from nmt-multi.

zanchangtong avatar zanchangtong commented on August 23, 2024

Yes, I use nmt-multi/scripts/ted/data_process/remove_start_token.sh before running multilingual_preprocess.sh. Also, the supervised translation results on the bitext corpora are good.
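For reference, a minimal illustration (not the remove_start_token.sh script itself) of what this step amounts to, i.e. stripping the artificial __en__ token prepended to each non-English sentence:

```python
# Illustrative only: drop a leading "__en__" language token from a line.
TAG = "__en__"

def strip_lang_tag(line, tag=TAG):
    return line[len(tag):].lstrip() if line.startswith(tag) else line

assert strip_lang_tag("__en__ Hallo Welt .") == "Hallo Welt ."
```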

from nmt-multi.

cordercorder avatar cordercorder commented on August 23, 2024

It seems that all the procedures ran well. Maybe the BLEU score discrepancy is caused by the difference in batch size.

from nmt-multi.

zanchangtong avatar zanchangtong commented on August 23, 2024

I have tried max-tokens=4096 while keeping the total batch size unchanged. Disappointingly, the zero-shot performance (8.9078 BLEU) is still significantly lower than what the paper reports.

from nmt-multi.

zanchangtong avatar zanchangtong commented on August 23, 2024

Would it be convenient to provide the zero-shot test sets or the BLEU results of each zero-shot translation task? After comparing the intermediate results, I may be able to find out where the bug is.
Here are the zero-shot BLEU results of my model: zero-shot_8.9bleu_ted59.json.txt.
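For comparison, the per-direction scores in such a file can be averaged with a short sketch like the one below; the JSON layout (a flat mapping from language pair to BLEU score) is an assumption, not the file's documented format:

```python
# Average the per-direction BLEU scores in a results file; assumes a flat
# {"src-tgt": bleu} mapping, which may differ from the actual file layout.
import json

def average_bleu(path):
    with open(path, encoding="utf-8") as f:
        scores = json.load(f)
    return sum(scores.values()) / len(scores)

print(average_bleu("zero-shot_8.9bleu_ted59.json.txt"))
```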

from nmt-multi.

cordercorder avatar cordercorder commented on August 23, 2024

Sure. The file below includes our detailed experiment results for zero-shot translation on the TED-59 dataset.

multilingual_bleu_statistics.zero-shot.json.txt

from nmt-multi.

cordercorder avatar cordercorder commented on August 23, 2024

Here are the training script and logs for Token_src on the TED-59 dataset.

fairseq_train.many-many_4.logs.txt

fairseq_train.many-many_4.sh.txt

from nmt-multi.

cordercorder avatar cordercorder commented on August 23, 2024

I also attempted to upload the translations generated by Token_src trained on the TED-59 dataset so you could debug more easily. Unfortunately, the file size exceeds GitHub's limit. (File size too big: 25MB are allowed, 320 MB were attempted to upload)

from nmt-multi.

cordercorder avatar cordercorder commented on August 23, 2024

I notice that you used the checkpoint_best.pt for testing. Did you select the checkpoint according to the loss on the validation sets?

from nmt-multi.

zanchangtong avatar zanchangtong commented on August 23, 2024

After comparing the training logs, I find that my model's validation loss is higher (4.617 vs 4.482) and my raw dataset size is slightly larger (10109386 vs 10092756).
Are there any missing data filtering operations? I directly extracted the parallel data from the ted_talk dataset.
Here is the detailed information on dataset size: raw_data_size.txt

from nmt-multi.

zanchangtong avatar zanchangtong commented on August 23, 2024

I selected the model based on the validation BLEU score, but the final zero-shot BLEU is still low (9.04 BLEU). Also, the performance is not affected by max-tokens=4096 vs 16384 when the total batch size stays the same.

I notice that you used the checkpoint_best.pt for testing. Did you select the checkpoint according to the loss on the validation sets?

from nmt-multi.

cordercorder avatar cordercorder commented on August 23, 2024

We only removed sentences containing more than 100 sub-words from the training and validation sets of every language pair. There were no extra filtering operations during data preprocessing. It seems that there are also some filtering operations in your data preprocessing pipeline (maybe removing long sentences). Can you provide the detailed script for your data preprocessing?
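Roughly speaking, that filtering step amounts to the minimal sketch below (illustrative only; the file handling and the fact that whitespace-separated tokens are sub-words after BPE are assumptions):

```python
# Keep only sentence pairs in which both sides have at most 100 sub-word tokens.
MAX_SUBWORDS = 100

def filter_long_pairs(src_lines, tgt_lines, max_len=MAX_SUBWORDS):
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        # the corpus is already BPE-segmented, so whitespace-separated tokens are sub-words
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len:
            kept.append((src, tgt))
    return kept
```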

After comparing the training logs, I find that my model's validation loss is higher (4.617 vs 4.482) and my raw dataset size is slightly larger (10109386 vs 10092756). Are there any missing data filtering operations? I directly extracted the parallel data from the ted_talk dataset. Here is the detailed information on dataset size: raw_data_size.txt

Below are the raw script we used for data preprocessing and the dictionary for all languages involved in the TED-59 dataset:
multilingual_preprocess.sh.txt
dict.txt

from nmt-multi.

zanchangtong avatar zanchangtong commented on August 23, 2024

Hey, this zip file includes all my preprocessing scripts. I'm still debugging and trying to reproduce the final result.
ted_preprocess.zip

from nmt-multi.

cordercorder avatar cordercorder commented on August 23, 2024

After comparing our scripts for corpus generation and preprocessing, I find only one small difference between our data preprocessing pipelines. I used the Python script ted_reader.py in this repository to read the parallel corpus of XX -> En from the raw TED-59 dataset. After that, I copied the XX -> En parallel corpus into the reverse direction to construct the En -> XX parallel corpus. In comparison, it seems that you used ted_reader.py to construct the parallel corpora of both the XX -> En and En -> XX directions. Can you check whether your En -> XX parallel corpus is identical to its inverse direction (XX -> En)?

In addition, as the raw TED-59 dataset has been tokenized by Moses and sacreBLEU requires detokenized input, the translations and references need to be detokenized before computing the BLEU score. Did you detokenize the translations and references? If convenient, could you provide the detailed scripts for computing the BLEU score?

from nmt-multi.

zanchangtong avatar zanchangtong commented on August 23, 2024

I checked my training data by sample count and content. The En -> XX data is identical to its inverse direction (XX -> En).
For evaluation, I detokenize the references and system outputs with sacremoses. Here is my evaluation script.
report_bleu.zero-shot.txt
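For reference, a minimal sketch of this detokenize-then-score procedure (not the report_bleu script itself; the language code and file names are placeholders):

```python
# Detokenize Moses-tokenized outputs and references with sacremoses,
# then score with sacreBLEU on detokenized text.
import sacrebleu
from sacremoses import MosesDetokenizer

def detok(path, lang):
    md = MosesDetokenizer(lang=lang)
    with open(path, encoding="utf-8") as f:
        return [md.detokenize(line.split()) for line in f]

hyps = detok("hyp.de", lang="de")  # system outputs for a hypothetical de target
refs = detok("ref.de", lang="de")  # references in the same language
print(sacrebleu.corpus_bleu(hyps, [refs]).score)
```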

from nmt-multi.

zanchangtong avatar zanchangtong commented on August 23, 2024

I find that the batch size (bsz) in my experiments is a little larger than the one reported in your training log, and my training loss (loss) is also larger. For example,

2022-10-18 15:27:22 | INFO | train_inner | epoch 001:    103 / 7544 loss=15.456, nll_loss=15.317, ppl=40833.8, wps=180950, ups=3.43, wpb=52723.9, bsz=1972.6, num_updates=100, lr=1.25975e-05, gnorm=2.382, loss_scale=16, train_wall=32, gb_free=33.8, wall=58
2022-10-18 15:27:47 | INFO | train_inner | epoch 001:    203 / 7544 loss=13.933, nll_loss=13.627, ppl=12652.3, wps=208537, ups=3.91, wpb=53367.6, bsz=2019.5, num_updates=200, lr=2.5095e-05, gnorm=1.159, loss_scale=16, train_wall=23, gb_free=33.9, wall=84
2022-10-18 15:28:12 | INFO | train_inner | epoch 001:    303 / 7544 loss=12.626, nll_loss=12.15, ppl=4544.26, wps=216561, ups=4.02, wpb=53823.8, bsz=2004.4, num_updates=300, lr=3.75925e-05, gnorm=1.044, loss_scale=16, train_wall=22, gb_free=33.8, wall=108

Compared with yours:

2021-07-10 01:02:23 | INFO | train_inner | epoch 001:    103 / 8110 loss=15.282, nll_loss=15.123, ppl=35686.6, wps=52429.6, ups=1.01, wpb=51871.2, bsz=1828.1, num_updates=100, lr=1.25975e-05, gnorm=2.562, loss_scale=16, train_wall=100, gb_free=4.7, wall=176
2021-07-10 01:04:00 | INFO | train_inner | epoch 001:    203 / 8110 loss=13.381, nll_loss=13.001, ppl=8199.27, wps=53128.7, ups=1.02, wpb=52000.2, bsz=1848.1, num_updates=200, lr=2.5095e-05, gnorm=1.145, loss_scale=16, train_wall=91, gb_free=4.8, wall=274
2021-07-10 01:05:38 | INFO | train_inner | epoch 001:    303 / 8110 loss=11.968, nll_loss=11.401, ppl=2703.35, wps=53723.3, ups=1.03, wpb=52318.6, bsz=1913.9, num_updates=300, lr=3.75925e-05, gnorm=0.945, loss_scale=16, train_wall=89, gb_free=4.7, wall=371

The difference in batch size under the same max-tokens setting is a little strange. When I reduce max-tokens to keep the batch size the same, the loss decreases but is still larger than yours:

2022-10-26 04:56:48 | INFO | train_inner | epoch 001:    103 / 8208 loss=15.335, nll_loss=15.181, ppl=37155.5, wps=198170, ups=4.06, wpb=48780.4, bsz=1842.2, num_updates=100, lr=1.25975e-05, gnorm=2.359, loss_scale=16, train_wall=27, gb_free=34.1, wall=47
2022-10-26 04:57:10 | INFO | train_inner | epoch 001:    203 / 8208 loss=13.841, nll_loss=13.522, ppl=11763, wps=222517, ups=4.51, wpb=49321, bsz=1887, num_updates=200, lr=2.5095e-05, gnorm=1.125, loss_scale=16, train_wall=20, gb_free=34.1, wall=69
2022-10-26 04:57:32 | INFO | train_inner | epoch 001:    303 / 8208 loss=12.52, nll_loss=12.028, ppl=4174.92, wps=221221, ups=4.55, wpb=48591.5, bsz=1832.9, num_updates=300, lr=3.75925e-05, gnorm=0.939, loss_scale=16, train_wall=20, gb_free=34.1, wall=91
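To compare such runs side by side, the relevant fields (loss, bsz, wpb, ...) can be pulled out of the train_inner lines with a small sketch like this; the field names follow the log format shown above, everything else is illustrative:

```python
# Extract numeric fields from fairseq "train_inner" log lines for comparison.
import re

FIELD = re.compile(r"(\w+)=([0-9.eE+-]+)")

def parse_train_inner(log_path, fields=("num_updates", "loss", "bsz", "wpb")):
    rows = []
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            if "| train_inner |" not in line:
                continue
            values = dict(FIELD.findall(line))
            rows.append({k: float(values[k]) for k in fields if k in values})
    return rows

# e.g. compare two runs update by update (log file names are placeholders):
# for mine, yours in zip(parse_train_inner("my.log"), parse_train_inner("your.log")):
#     print(mine["num_updates"], mine["loss"], yours["loss"])
```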

from nmt-multi.

cordercorder avatar cordercorder commented on August 23, 2024

Thanks for checking. The evaluation script looks fine.

I checked my training data by sample count and content. The En -> XX data is identical to its inverse direction (XX -> En). For evaluation, I detokenize the references and system outputs with sacremoses. Here is my evaluation script. report_bleu.zero-shot.txt

This is strange. Could you check whether the vocabularies we used are the same? I have shared my vocabulary here. Besides, which version of fairseq did you use?

I find that the batch size (bsz) in my experiments is a little larger than the one reported in your training log, and my training loss (loss) is also larger. For example,

2022-10-18 15:27:22 | INFO | train_inner | epoch 001:    103 / 7544 loss=15.456, nll_loss=15.317, ppl=40833.8, wps=180950, ups=3.43, wpb=52723.9, bsz=1972.6, num_updates=100, lr=1.25975e-05, gnorm=2.382, loss_scale=16, train_wall=32, gb_free=33.8, wall=58
2022-10-18 15:27:47 | INFO | train_inner | epoch 001:    203 / 7544 loss=13.933, nll_loss=13.627, ppl=12652.3, wps=208537, ups=3.91, wpb=53367.6, bsz=2019.5, num_updates=200, lr=2.5095e-05, gnorm=1.159, loss_scale=16, train_wall=23, gb_free=33.9, wall=84
2022-10-18 15:28:12 | INFO | train_inner | epoch 001:    303 / 7544 loss=12.626, nll_loss=12.15, ppl=4544.26, wps=216561, ups=4.02, wpb=53823.8, bsz=2004.4, num_updates=300, lr=3.75925e-05, gnorm=1.044, loss_scale=16, train_wall=22, gb_free=33.8, wall=108

Compared with yours:

2021-07-10 01:02:23 | INFO | train_inner | epoch 001:    103 / 8110 loss=15.282, nll_loss=15.123, ppl=35686.6, wps=52429.6, ups=1.01, wpb=51871.2, bsz=1828.1, num_updates=100, lr=1.25975e-05, gnorm=2.562, loss_scale=16, train_wall=100, gb_free=4.7, wall=176
2021-07-10 01:04:00 | INFO | train_inner | epoch 001:    203 / 8110 loss=13.381, nll_loss=13.001, ppl=8199.27, wps=53128.7, ups=1.02, wpb=52000.2, bsz=1848.1, num_updates=200, lr=2.5095e-05, gnorm=1.145, loss_scale=16, train_wall=91, gb_free=4.8, wall=274
2021-07-10 01:05:38 | INFO | train_inner | epoch 001:    303 / 8110 loss=11.968, nll_loss=11.401, ppl=2703.35, wps=53723.3, ups=1.03, wpb=52318.6, bsz=1913.9, num_updates=300, lr=3.75925e-05, gnorm=0.945, loss_scale=16, train_wall=89, gb_free=4.7, wall=371

The difference in batch size under the same max-tokens setting is a little strange. When I reduce max-tokens to keep the batch size the same, the loss decreases but is still larger than yours:

2022-10-26 04:56:48 | INFO | train_inner | epoch 001:    103 / 8208 loss=15.335, nll_loss=15.181, ppl=37155.5, wps=198170, ups=4.06, wpb=48780.4, bsz=1842.2, num_updates=100, lr=1.25975e-05, gnorm=2.359, loss_scale=16, train_wall=27, gb_free=34.1, wall=47
2022-10-26 04:57:10 | INFO | train_inner | epoch 001:    203 / 8208 loss=13.841, nll_loss=13.522, ppl=11763, wps=222517, ups=4.51, wpb=49321, bsz=1887, num_updates=200, lr=2.5095e-05, gnorm=1.125, loss_scale=16, train_wall=20, gb_free=34.1, wall=69
2022-10-26 04:57:32 | INFO | train_inner | epoch 001:    303 / 8208 loss=12.52, nll_loss=12.028, ppl=4174.92, wps=221221, ups=4.55, wpb=48591.5, bsz=1832.9, num_updates=300, lr=3.75925e-05, gnorm=0.939, loss_scale=16, train_wall=20, gb_free=34.1, wall=91

from nmt-multi.

zanchangtong avatar zanchangtong commented on August 23, 2024

This is my vocabulary, which is the same as the one you shared: dict (3).txt
The commit ID of fairseq is d3890e5. I cloned the latest version of fairseq and used "git checkout d3890e5" to switch to that version.

from nmt-multi.

cordercorder avatar cordercorder commented on August 23, 2024

As the vocabulary is constructed from the corpus after applying the long-sentence filtering, the same vocabulary (same 1-grams and frequencies) may indicate that our training sets are also identical. However, it seems that there are more sentences in your training sets according to your previous reply:

After comparing the training logs, I find that my model's validation loss is higher (4.617 vs 4.482) and my raw dataset size is slightly larger (10109386 vs 10092756). Are there any missing data filtering operations? I directly extracted the parallel data from the ted_talk dataset. Here is the detailed information on dataset size: raw_data_size.txt
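As a quick sanity check of that point, two fairseq dictionaries (one "token count" pair per line) can be compared with a sketch like the following; file names are placeholders:

```python
# Compare two fairseq dict files: identical contents mean identical 1-gram counts.
def load_dict(path):
    entries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                entries[parts[0]] = int(parts[1])
    return entries

a = load_dict("dict.txt")
b = load_dict("dict (3).txt")
print("identical" if a == b else f"{len(set(a.items()) ^ set(b.items()))} differing entries")
```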

I cannot figure out the underlying reasons behind these issues from the existing information, such as the performance gap in zero-shot translation on the TED-59 dataset, the differences in training set size and batch size, and the slightly higher loss during training and validation. Can we communicate through WeChat or QQ for more details?

from nmt-multi.

altctrl00 avatar altctrl00 commented on August 23, 2024

I wonder what the cause of this difference is: my raw dataset size is also 10109386, my validation loss is also above 4.6, and my zero-shot BLEU is 9.63.

from nmt-multi.

cordercorder avatar cordercorder commented on August 23, 2024

Maybe there is some difference between our datasets (including training sets, validation sets, and test sets). I will share them through Baidu Cloud Disk tomorrow (I have to report the research progress to my advisor today).

from nmt-multi.

zanchangtong avatar zanchangtong commented on August 23, 2024

Hi, I reduced the training batch size (to match the bsz value in the training log) with max-tokens=3900. Then I got a zero-shot translation BLEU of 10.38, which is much closer to the value reported in the paper. So I think the likely reason for the performance drop is the difference between the actual batch sizes.

from nmt-multi.

zanchangtong avatar zanchangtong commented on August 23, 2024

Maybe there is some difference between our datasets (including training sets, validation sets, and test sets). I will share them through Baidu Cloud Disk tomorrow (I have to report the research progress to my advisor today).

Thanks, looking forward to the data you share.

from nmt-multi.

cordercorder avatar cordercorder commented on August 23, 2024

After carefully checking the number of parallel sentence pairs in the training sets for all language pairs of the TED-59 dataset, I find that the sizes of our training sets are in fact identical (10109386 sentence pairs in the raw training sets). I apologize for my mistake: I provided the wrong training log of Token_src on the TED-59 dataset. There were some bugs in a preliminary data preprocessing script, which resulted in incorrect training samples, so the raw training set size reported in that log is also incorrect. Fortunately, I found these bugs in later experiments and fixed them. Due to the huge number of experiments, I ran them on different servers. Although I deleted the old experimental data and reran all the experiments affected by these bugs, the training script and log for Token_src still remained on the server that ran the buggy experiments. During this check I found the correct training script and log of Token_src on another server and provide them below (the training process crashed for unknown reasons at epoch 12 and I recovered it from epoch 12, which is why there are two training logs):

fairseq_train.many-many_4.logs.txt
fairseq_train.many-many_4.recover_1.logs.txt
fairseq_train.many-many_4.sh.txt

Surprisingly, although the training script remained the same as before, the training loss became smaller after recovering from epoch 12:

Below is the training status before the training process crashed:

2021-07-17 16:20:21 | INFO | fairseq.trainer | begin training epoch 12
2021-07-17 16:20:21 | INFO | fairseq_cli.train | Start iterating over samples
2021-07-17 16:20:23 | INFO | train_inner | epoch 012:      3 / 7543 loss=4.207, nll_loss=2.394, ppl=5.26, wps=25896.7, ups=0.49, wpb=52708.3, bsz=1972.6, num_updates=82900, lr=0.00010983, gnorm=0.482, loss_scale=16, train_wall=67, gb_free=17.7, wall=60130
2021-07-17 16:21:33 | INFO | train_inner | epoch 012:    103 / 7543 loss=4.196, nll_loss=2.381, ppl=5.21, wps=76157.3, ups=1.43, wpb=53288.6, bsz=2025.3, num_updates=83000, lr=0.000109764, gnorm=0.46, loss_scale=16, train_wall=64, gb_free=17.6, wall=60200
2021-07-17 16:22:43 | INFO | train_inner | epoch 012:    203 / 7543 loss=4.188, nll_loss=2.372, ppl=5.18, wps=76341.2, ups=1.43, wpb=53468.6, bsz=2046.5, num_updates=83100, lr=0.000109698, gnorm=0.464, loss_scale=16, train_wall=63, gb_free=18.6, wall=60270
2021-07-17 16:23:48 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
2021-07-17 16:23:54 | INFO | train_inner | epoch 012:    304 / 7543 loss=4.192, nll_loss=2.378, ppl=5.2, wps=75320.4, ups=1.41, wpb=53293.3, bsz=2024.4, num_updates=83200, lr=0.000109632, gnorm=0.463, loss_scale=8, train_wall=64, gb_free=17.5, wall=60341
2021-07-17 16:25:03 | INFO | train_inner | epoch 012:    404 / 7543 loss=4.187, nll_loss=2.372, ppl=5.18, wps=76525.6, ups=1.45, wpb=52942.4, bsz=1993.7, num_updates=83300, lr=0.000109566, gnorm=0.464, loss_scale=8, train_wall=63, gb_free=17.9, wall=60410
2021-07-17 16:26:12 | INFO | train_inner | epoch 012:    504 / 7543 loss=4.188, nll_loss=2.374, ppl=5.18, wps=76947.1, ups=1.44, wpb=53341.8, bsz=1994.5, num_updates=83400, lr=0.000109501, gnorm=0.458, loss_scale=8, train_wall=63, gb_free=17.6, wall=60480
2021-07-17 16:27:22 | INFO | train_inner | epoch 012:    604 / 7543 loss=4.199, nll_loss=2.385, ppl=5.22, wps=76490.7, ups=1.43, wpb=53617.1, bsz=1987.1, num_updates=83500, lr=0.000109435, gnorm=0.46, loss_scale=8, train_wall=64, gb_free=17.9, wall=60550

Below is the training status after recovering:

2021-07-17 18:42:02 | INFO | fairseq.trainer | begin training epoch 12
2021-07-17 18:42:02 | INFO | fairseq_cli.train | Start iterating over samples
2021-07-17 18:42:07 | INFO | train_inner | epoch 012:      3 / 7543 loss=4.164, nll_loss=2.343, ppl=5.07, wps=70479.9, ups=1.34, wpb=51857.3, bsz=2053.3, num_updates=82900, lr=0.00010983, gnorm=0.486, loss_scale=16, train_wall=4, gb_free=17.7, wall=22
2021-07-17 18:43:22 | INFO | train_inner | epoch 012:    103 / 7543 loss=4.196, nll_loss=2.381, ppl=5.21, wps=71004.5, ups=1.33, wpb=53288.6, bsz=2025.3, num_updates=83000, lr=0.000109764, gnorm=0.46, loss_scale=16, train_wall=67, gb_free=17.6, wall=97
2021-07-17 18:44:37 | INFO | train_inner | epoch 012:    203 / 7543 loss=4.188, nll_loss=2.372, ppl=5.18, wps=71300.6, ups=1.33, wpb=53468.6, bsz=2046.5, num_updates=83100, lr=0.000109698, gnorm=0.464, loss_scale=16, train_wall=66, gb_free=18.6, wall=172
2021-07-17 18:45:47 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
2021-07-17 18:45:53 | INFO | train_inner | epoch 012:    304 / 7543 loss=4.192, nll_loss=2.378, ppl=5.2, wps=69990.1, ups=1.31, wpb=53293.3, bsz=2024.4, num_updates=83200, lr=0.000109632, gnorm=0.463, loss_scale=8, train_wall=69, gb_free=17.5, wall=248
2021-07-17 18:47:08 | INFO | train_inner | epoch 012:    404 / 7543 loss=4.187, nll_loss=2.372, ppl=5.18, wps=70992.5, ups=1.34, wpb=52942.4, bsz=1993.7, num_updates=83300, lr=0.000109566, gnorm=0.464, loss_scale=8, train_wall=67, gb_free=17.9, wall=323
2021-07-17 18:48:22 | INFO | train_inner | epoch 012:    504 / 7543 loss=4.188, nll_loss=2.374, ppl=5.18, wps=71597.7, ups=1.34, wpb=53341.8, bsz=1994.5, num_updates=83400, lr=0.000109501, gnorm=0.458, loss_scale=8, train_wall=67, gb_free=17.6, wall=397
2021-07-17 18:49:37 | INFO | train_inner | epoch 012:    604 / 7543 loss=4.199, nll_loss=2.385, ppl=5.22, wps=71326.2, ups=1.33, wpb=53617.1, bsz=1987.1, num_updates=83500, lr=0.000109435, gnorm=0.46, loss_scale=8, train_wall=67, gb_free=17.9, wall=472

This is strange, as the training status after recovering from the breakpoint should be identical to that before the crash. I am not sure whether this is the reason behind our performance gap in zero-shot translation on the TED-59 dataset.

I have uploaded the datasets to Baidu Cloud Disk; the sharing link is https://pan.baidu.com/s/1EUOH8FWumRvQUZogzoL2MA (extraction code: g9qk).

Note that main_data_bin.zip contains the binary files preprocessed by fairseq-preprocess; spm_corpus.zip contains the corpus after applying BPE; spm_corpus_extract_parallel_data.zip contains the test sets for zero-shot translation, constructed by pairing sentences of any two languages via aligned English sentences in the original test sets; and spm_corpus_extract_parallel_data_bin.zip contains the binary files of those zero-shot test sets.

from nmt-multi.
