
flores's Introduction

FLORES-200 and NLLB Professionally Translated Datasets: NLLB-Seed, NLLB-MD, and Toxicity-200

⚠️ This repository is no longer being updated ⚠️

Newer versions of the FLORES and NLLB-Seed datasets managed by the Open Language Data Initiative are available here:

Quick-access to the original READMEs:

Citation

If you use any of this data in your work, please cite:

@article{nllb2022,
  author    = {NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi,  Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Jeff Wang},
  title     = {No Language Left Behind: Scaling Human-Centered Machine Translation},
  year      = {2022}
}

Changelog

  • 2022-06-30: Released FLORES-200, NLLB-Seed, NLLB-MD, and Toxicity-200

  • 2021-06-04: Released FLORES-101

Licenses

  • FLORES-200: CC-BY-SA 4.0
  • NLLB-SEED: CC-BY-SA 4.0
  • NLLB-MD: CC-BY-NC 4.0
  • Toxicity-200: CC-BY-SA 4.0

flores's People

Contributors

christofbr dadelani dmitryvinn guzmanhe gwenzek huihuifan jeanm jiajunshen jmp84 myleott oanaignat pipibjc sharad461 xettrisomeman


flores's Issues

Reproducing Fully Supervised Baseline

[Following up on email.]

We attempted to reproduce the fully supervised baseline for Ne->En translation with the commands given in the README. However, we were unable to obtain the BLEU score reported in the paper; instead, the model converges to ~5.6 BLEU (at ~24 ppl), in contrast to the reported BLEU of 7.6.

Some more details about the training: we trained with the provided code and evaluated using sacrebleu. We first observed this while training on a single K80 GPU. Since our first email, we have also rerun the model on 4 K80 GPUs (with batch size between 50k and 100k) and got the same results. (In particular, --fp16 is not available on this hardware.) We are wondering if you are able to reproduce these observations or have any suggestions for resolving this discrepancy.

Thanks!

Submission to DynaBench does not show results

Hi, I have a question regarding submitting models to DynaBench.

We followed the instructions in /dynalab/ and submitted a model around 24 hours ago. As the first test, we just submitted the pretrained small model.

The status has been green with the label "created". However, we are not able to see the results. There is a note at the bottom of the page saying "No data available, the model is still evaluating."

Therefore I'm wondering how long it typically takes to see the evaluation results. If ~24 hours is not a normal timespan, could you suggest some directions for troubleshooting on our side?

Thanks in advance for your help!
Danni

how to get the monolingual corpus

The code seems to be written for downloading the parallel corpora. If I want to do some work with UNMT, how can I get the monolingual Common Crawl corpora referred to in the paper? The Nepali and Sinhala corpora cannot be found on the paracrawl.eu homepage.

Request for the correction of Santali script name

Hi, I downloaded FLORES-200 from here. When I reviewed the data for the Santali language, the file was named sat_Beng.devtest (where the script code represents Santali in Bengali script) in both the dev and devtest folders, but all of its contents are in Ol Chiki script (sat_Olck). I therefore request that the file be renamed from sat_Beng.devtest to sat_Olck.devtest.

Mismatch of the size between pretrained model and finetuned model

Hi,

when I check the size of my finetuned model against the pretrained model you provide, I notice they are different.

There are two major dicts in the checkpoint, i.e. "model" and "last_optimizer_state". If both are in fp32, "last_optimizer_state" should be roughly twice as big as "model", since the Adam optimizer stores first and second moments.
For the pretrained model you offered, the sizes are:

  • For pretrained MM100_175M, the size of "model": 336M, the size of "last_optimizer_state": 1.4G
  • For pretrained MM100_615M, the size of "model": 1.2G, the size of "last_optimizer_state": 4.7G

This makes sense, because the pretrained "model" is in fp16 and the "last_optimizer_state" is in fp32, so "last_optimizer_state" should be roughly four times the size of "model".

However, when I finetune the pretrained model, I run into some problems.

  1. The "model" is saved in fp32 instead of fp16, even though I train with --fp16. My training config is as:
DATA=/path/to/data
TOOL=/path/to/fairseq/train.py
PRETRAINED_MODEL=/path/to/flores101_mm100_615M/model.pt
lang_pairs=/path/to/language_pairs.txt

python $TOOL \
    $DATA \
    --dataset-impl mmap \
    --arch transformer_wmt_en_de_big \
    --dropout 0.1 --attention-dropout 0.1 \
    --encoder-embed-dim 1024 --decoder-embed-dim 1024 \
    --encoder-attention-heads 16 --decoder-attention-heads 16 \
    --encoder-ffn-embed-dim 4096 --decoder-ffn-embed-dim 4096 \
    --encoder-normalize-before --decoder-normalize-before \
    --encoder-layers 12 --decoder-layers 12 \
    --share-all-embeddings \
    --restore-file $PRETRAINED_MODEL \
    --task translation_multi_simple_epoch \
    --encoder-langtok "src" --decoder-langtok \
    --lang-pairs $lang_pairs \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --adam-eps 1e-08 --adam-betas '(0.9, 0.98)' \
    --fp16 --fp16-init-scale 128  --fp16-scale-tolerance 0.0  --memory-efficient-fp16 \
    --lr-scheduler inverse_sqrt --lr 8e-04 --warmup-init-lr 1e-07 --warmup-updates 2500 \
    --max-tokens 2048  \
    --save-interval 1  
  1. The size of "model" and "last_optimizer_state" are weird.
  • For finetuned MM100_175M, the size of "model" is 1.7G, the size of "last_optimizer_state" is 1.4G.
  • For finetuned MM100_615M, the size of "model" is 4.3G, the size of "last_optimizer_state" is 4.7G.

The sizes of "model" and "last_optimizer_state" are comparable, which is strange to me. Besides, even though I manually change the float of "model" to half, I can only obtain half size of the "model" that is still different with your pretrained "model". For your convenience, you can check my 615M model at https://dynabench.org/models/250

Do you have any ideas for this?
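
For reference, a minimal sketch (assuming a standard PyTorch checkpoint containing the "model" and "last_optimizer_state" dicts described above; the path is illustrative) for inspecting the on-disk dtypes and total sizes of both sections:

import torch

def report_checkpoint(path):
    ckpt = torch.load(path, map_location="cpu")
    for section in ("model", "last_optimizer_state"):
        tensors = []
        def collect(obj):
            # Walk nested dicts/lists and gather every tensor.
            if torch.is_tensor(obj):
                tensors.append(obj)
            elif isinstance(obj, dict):
                for v in obj.values():
                    collect(v)
            elif isinstance(obj, (list, tuple)):
                for v in obj:
                    collect(v)
        collect(ckpt.get(section, {}))
        total_bytes = sum(t.numel() * t.element_size() for t in tensors)
        dtypes = sorted({str(t.dtype) for t in tensors})
        print(f"{section}: {total_bytes / 1e9:.2f} GB, dtypes={dtypes}")

report_checkpoint("/path/to/flores101_mm100_615M/model.pt")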

Scripts to Scrape Data?

Thanks again for releasing this dataset, it's super-helpful!

We are very interested in using this for the Sinhala and Nepali data itself, but we also want to test methods for cross-lingual transfer learning, where we learn from higher-resourced languages to improve accuracy on the Sinhala and Nepali datasets. Would it be possible to get scripts or instructions on how to scrape the monolingual and comparable data reported in the paper, so we can scrape data for other similar language pairs?

Evaluation of languages of the same family

Note: I have very little time, so I'm optimizing for sharing raw feedback based on visual observation. I had no time this week to evaluate this properly using data and a more quantitative approach; happy to help in a few days.

Hypothesis

For languages that are from the same family (we will use Spanish to Catalan as the example going forward), the FLORES dataset has two potential problems:

a) It does not reflect well how a human translator translates between these language pairs
b) It introduces bias and favors non-rule-based machine translation (e.g. neural systems)

Let me elaborate on both.

It does not reflect well how a human translator translates between these language pairs.

Take as an example a sentence from the dataset in Spanish and Catalan. Assume that you are evaluating Spanish-to-Catalan translation:

Spanish > El 7 de octubre, un motor se separó al despegar, sin dejar heridos. Rusia hizo permanecer en tierra los Il-76 por poco tiempo después de ese accidente.

Catalan > Un motor es va separar durant l'enlairament sense provocar ferits el set d'octubre. Rússia va fer aterrar ràpidament els Il-76 després d'aquell accident.

A human translator will take the Spanish text (source) and make the minimum changes needed to translate it into Catalan (target); this includes preserving the same structure and vocabulary by default when there is no need to change them.

For this sample sentence, this is how a human translator would translate from Spanish > Catalan:

Catalan > El 7 d'octubre, un motor es va separar al despegar, sense deixar ferits. Rússia va fer romandre en terra els Il-76 durant poc temps després d'aquest accident.

The core problem is that when a human translates from English to Spanish or from English to Catalan (languages from different families), they need to make some hard decisions because the languages are very different. Different translators will make different decisions regarding vocabulary, grammatical structure, etc. When you then compare Spanish to Catalan, this is not how a human would translate directly from Spanish to Catalan, since you are pivoting over English, and a direct translation would not make unnecessary grammar or vocabulary changes.

It introduces bias and favors non-rule-based machine translation (e.g. neural systems)

Your evaluation sentences for Spanish and Catalan do not mimic how a human would translate Spanish to Catalan. They have the same meaning, but the structure and vocabulary have changed for no reason.

Rule-based machine translation systems like Apertium are very effective for languages that are from the same family. They apply transformation rules from source to target that mimic what a human would do.

If you use the FLORES dataset to evaluate rule-based systems, you will in general score them lower, even if the translation is more accurate and closer to what a human would produce. This is because the evaluation data is not a translation from Spanish > Catalan; instead, English > Spanish and English > Catalan were produced separately, so the resulting Spanish > Catalan pair has noise introduced by English.

This problem impacts language pairs like Spanish > Galician, Spanish > Catalan, French > Occitan, Spanish > Occitan, etc.

Probing and quantifying this hypothesis

My suggestions

  1. Quantify how many language pairs may be affected by this problem

  2. One way to probe this hypothesis and quantify the problem is to ask a human translator to translate from Spanish to Catalan directly, and then measure (using spBLEU, for example; see the sketch below) how different this direct Spanish-to-Catalan translation is from the current FLORES sentences produced via English -> Catalan and English -> Spanish
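
A rough sketch of this comparison (assuming a sacrebleu version that ships the SentencePiece "spm" tokenizer used for spBLEU; file names are illustrative):

import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

# Human translation produced directly from Spanish into Catalan (hypothetical file)
direct_translation = read_lines("spa_to_cat_direct.txt")
# Released Catalan references, produced via English
flores_reference = read_lines("flores200_dataset/dev/cat_Latn.dev")

score = sacrebleu.corpus_bleu(direct_translation, [flores_reference], tokenize="spm")
print(score)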

Non-matching quotation marks in some dev/devtest sets

There are a lot of doubled (or more than doubled) quotation marks in the FLORES dev and devtest sets.

E.g.:

grep '""' flores101_dataset/dev/*dev

The affected sentences seem to vary - eng.dev has none, tel.dev has 72.

While users can clean the files themselves and the effect on evaluation is probably not too strong, it seemed worth flagging in case there is ever a dataset update.

contamination flores200-dev and flores200-devtest

Hi,
I saw that some sentences appear in both of these splits (flores200-dev and flores200-devtest):

For example for dyu these 3 are in common between dev and devtest:

  • Tournoi kun yèrè nan Sud africain djumlan nan gnannanman, ka Zambien myé lorou man bugô mugan ni wôrô-fohi.
  • San nanni tèmini, brevet (sèbè myé nissodja) tun dila, mi tunbè brevet dugnugnan djonan dini MRI oya bara là.
  • Moldava capitali ayé Chisinau yé. Oya sigui kan yé Romanian, man oyé okan ni fôla dôni. Nga Russi kan béfôla yôrô chiaman la.

I found this type of contamination for these languages (a simple overlap check is sketched after the list):

  • ary_Arab
  • dyu_Latn
  • hat_Latn
  • kam_Latn
  • kaz_Cyrl
  • lin_Latn
  • lit_Latn
  • npi_Deva
  • spa_Latn
  • taq_Tfng
  • urd_Arab
  • ydd_Hebr
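
A simple way to check for this overlap, assuming the standard FLORES-200 layout with dev/<code>.dev and devtest/<code>.devtest files (the root path is illustrative):

from pathlib import Path

def find_overlap(root, code):
    # Read both splits and return the sentences they share.
    dev = Path(root, "dev", f"{code}.dev").read_text(encoding="utf-8").splitlines()
    devtest = Path(root, "devtest", f"{code}.devtest").read_text(encoding="utf-8").splitlines()
    return sorted(set(dev) & set(devtest))

for sentence in find_overlap("flores200_dataset", "dyu_Latn"):
    print(sentence)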

Evaluation Script for all language pairs

Hi,

Thanks for your great benchmark. Is there a script that can concatenate multiple language pairs and evaluate them together while loading the model only once?

questions about unsupervised results in the paper

The unsupervised ne-en and en-ne BLEU scores are 0.5 and 0.1. Did you use back-translation in the unsupervised setting, or just DAE?
By the way, the +multi variant in the unsupervised setting uses XLM; have you tried XLM without +multi?

How to replicate supervised NE-EN baseline?

Hi there,

I'm currently trying to reproduce the baseline supervised results from the README but have so far not been able to do so.

So far using the following command

fairseq-generate \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --path $CHECKPOINT_DIR \
    --beam 5 --lenpen 1.2 \
    --gen-subset valid \
    --remove-bpe=sentencepiece \
    --sacrebleu

I get the following results:

| Translated 2559 sentences (79390 tokens) in 46.7s (54.85 sentences/s, 1701.52 tokens/s)                                                                                     
| Generate valid with beam=5: BLEU = 6.09 38.4/10.1/3.6/1.4 (BP = 0.917 ratio = 0.920 hyp_len = 42313 ref_len = 45975)

Changing --gen-subset valid to --gen-subset test yields:

| Translated 2835 sentences (94317 tokens) in 57.7s (49.17 sentences/s, 1635.67 tokens/s)
| Generate test with beam=5: BLEU = 7.66 40.2/12.0/4.5/1.9 (BP = 0.958 ratio = 0.959 hyp_len = 48970 ref_len = 51076)

These resulting BLEU scores of 6.09 and 7.66 seem to differ from those reported in Table 3 of the paper, which gives a BLEU score of 7.6 for devtest.

Training is performed with

CUDA_VISIBLE_DEVICES=0 fairseq-train \
    data-bin/wiki_ne_en_bpe5000/ \
    --source-lang ne --target-lang en \
    --arch transformer --share-all-embeddings \
    --encoder-layers 5 --decoder-layers 5 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
    --encoder-attention-heads 2 --decoder-attention-heads 2 \
    --encoder-normalize-before --decoder-normalize-before \
    --dropout 0.4 --attention-dropout 0.2 --relu-dropout 0.2 \
    --weight-decay 0.0001 \
    --label-smoothing 0.2 --criterion label_smoothed_cross_entropy \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 \
    --lr 1e-3 --min-lr 1e-9 \
    --max-tokens 4000 \
    --update-freq 4 \
    --max-epoch 100 --save-interval 1 --save-dir $CHECKPOINT_DIR

This command should be the same as that of the README except for a difference in the checkpointing directory.

My current specs:

# Azure VM:
> Standard NC6_Promo (6 vcpus, 56 GiB memory)

# GPU information:
> sudo lspci -k | grep "NVIDIA"
< d26a:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
<	Subsystem: NVIDIA Corporation GK210GL [Tesla K80]

# Fairseq version: 
> pip freeze | grep "fairseq"
< fairseq==0.9.0

I realize some of this may be due to hardware differences, but beyond that I am wondering

  1. Comparing the sentence counts in my evaluation outputs to Table 1 of the paper, it seems like using --gen-subset valid corresponds to using dev (2559 sentences) and --gen-subset test to using devtest (2835 sentences). Is this correct?

    • According to README.md --gen-subset test should be using the test data which, according to Table 1, contains 2924 sentences -- a number which matches neither of my outputs.
  2. What is the version of fairseq the baseline results in the paper were created with?

  3. Could some of this have to do with random initialization of weights/embeddings? Is there a seed I can set to better control this? If so, was a certain seed used to create the results of the paper?

  4. The closest I get to Table 3 performance is with --gen-subset test, which gives me BLEU 7.66. Since this is only 0.06 away from the reported devtest value, could it effectively be a rounding error?

Apologies for a long post and thank you very much in advance!

Scripts for Back-translation?

Hi,
Thanks for releasing the code and datasets.

Are you going to release the scripts you used for back-translation, and in general a single script to run (train) the system in one go?

Thanks again,
Ashim

Contribution: Kabyle Language

Hello,
We are a team of Kabyle volunteers working on open corpora (voice, translations, localization...). We are interested in collecting more translated data for the Kabyle language (Kabyle-English & Kabyle-French). Is there any platform where we can send contributions?

The Cantonese (Yue Chinese, `yue_Hant`) data in FLORES-200 is not Cantonese at all

The Cantonese (Yue Chinese, yue_Hant) data in FLORES-200 is completely wrong. The data is not Cantonese at all, but rather Mandarin Chinese in Traditional Chinese script (zho_Hant), which has only stylistic differences compared to the zho_Hant data in the dataset.

Furthermore, the paper mentioned that the yue_Hant and zho_Hant data tend to be predicted as each other. It turns out that both datasets actually consist exclusively of zho_Hant data; yue_Hant and zho_Hant should actually be very easy to distinguish from each other.

Here is what correct yue_Hant data would look like:

  • eng_Latn: They found the Sun operated on the same basic principles as other stars: The activity of all stars in the system was found to be driven by their luminosity, their rotation, and nothing else.
  • zho_Hant: 他們發現太陽的運作與其他恆星的基本原理相同:系統中所有恆星的活動均受其光度、自轉所推動,就是這麼簡單。
  • yue_Hant (wrong): 他們發現,太陽和其他恆星的運行原理是一樣的:系統中所有恆星的活動都是由它們的亮度、自轉驅動的,而並非其他因素。
  • yue_Hant (corrected): 佢哋發現,太陽其他恆星運行原理分別:系統入面所有恆星活動都淨係佢哋嘅亮度自轉推動,而包括其他因素。

(Bold denotes words that are used exclusively in yue_Hant)

Data Download is outdated

It seems the data download procedure for floresv1 is a bit outdated, as the Global Voices link no longer works. Similarly, the link to download the floresv1 data from GitHub also doesn't seem to work correctly.

Not able to reproduce the semi-supervised results

I tried to reproduce the semi-supervised training on the FLORES en-ne datasets using the reproduce.sh script. I used 4 V100 GPUs and it took around 2 days to finish the experiment. But unfortunately, our score (3.69) is worse than the one reported in the paper (6.8). I basically just ran the script following the instructions in the GitHub repo. The only minor change I made is that during back-translation I reduced max-tokens from 20000 to 6000 for translation (not for training, so it should not matter), because otherwise I easily ran out of GPU memory.

I am just wondering if there's anything wrong with my experimental settings?

Extra zero-width characters in the dataset

There are some instances of zero-width characters in the dataset, for example:

  • Zero-width space (ZWSP / U+200B) at line 236 of dev/tha.dev; lines 191 and 680 of devtest/tha.devtest
  • Repeated zero-width non-joiners (ZWNJ / U+200C) at line 121 of dev/pus.dev (a ZWNJ could be a valid one, but in this case it appears 7 times in a row)

While a ZWSP is less likely to affect semantics (it is more about typesetting, though it is sometimes used by word processors as a word delimiter for languages that do not have spaces between words), a ZWNJ can affect the meaning of words.

Users can clean this up themselves and, given the very small amount, the effect on evaluation is negligible. So there is probably no need to change the dataset (or it is low priority), but it is good to know for anyone who would like to process it. Just a note.
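
For anyone who wants to check their own copy, a small sketch that flags these zero-width characters (ZWSP U+200B, ZWNJ U+200C, ZWJ U+200D); the path is illustrative:

import re

ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d]")

with open("flores101_dataset/dev/tha.dev", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        hits = ZERO_WIDTH.findall(line)
        if hits:
            codepoints = ", ".join(f"U+{ord(c):04X}" for c in hits)
            print(f"line {lineno}: {codepoints}")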

Topics

According to the FLORES-101 paper, "we manually labeled all sentences by a more detailed sub-topic, one of 10 possibilities: crime, disasters, entertainment, geography, health, nature, politics, science, sports, and travel". Table 1 in the paper includes the statistics of these different sub-topics. However, in the metadata files there is a much larger number of sub-topics (306, in fact), such as:

Accident
accidents
accordion/right hand
advanced interactive media
Alchohol
American education/forgotten half/Foster care
American education/Special Needs ADD
...
ancient china/government
Ancient Civilizations/Romans
Ancient_Civilizations/Assyrians
...
big cats
big cats, lion
big cats, ocelot
big cats, tiger
Blended Learning/Blogging
Blended Learning/Field trips
Bugs/Insects_Intro
business
castles of england/tudor castles
castles of english/development of castles
climate
...

Is the 10-class metadata available for download, or are there any recommendations on how to group the existing sub-topics into a smaller number of topics?

The list of the 306 topics can be easily obtained with:

cat metadata_dev* | cut -f 3 | sort | uniq
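
In the absence of the official 10-class labels, one workaround is a keyword-based grouping. The mapping below is purely a hypothetical illustration (not the grouping used in the paper), and the metadata path and column follow the cut -f 3 command above:

import csv

# Hypothetical keyword -> coarse-topic mapping (illustration only).
KEYWORD_TO_TOPIC = {
    "accident": "disasters",
    "big cats": "nature",
    "climate": "nature",
    "castles": "geography",
}

def coarse_topic(raw_label):
    label = raw_label.lower()
    for keyword, topic in KEYWORD_TO_TOPIC.items():
        if keyword in label:
            return topic
    return "unknown"

with open("flores101_dataset/metadata_dev.tsv", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        raw = row[2]  # third column holds the sub-topic, as in `cut -f 3`
        print(raw, "->", coarse_topic(raw))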

English Devtest Line 439: An extra "I" after sentence.

Line 439 of Eng_Latn.devtest is as follows:

For over a thousand years the Christian religion had bound European states together despite differences in language and customs. I

I believe that "I" in the end is a typo in the dataset, as other language doesn't have this problem.

Wrong Text on Spanish Devset Line 536

This is a minor report, but I noticed that line 536 of the Spanish dev set does not match the other languages (I checked against English, German, and Italian).

The Spanish sentence on line 536 of the dev set is about registration for a visa process, while in the other languages it is about placing an alarm clock far away to help you wake up.

ERROR in download-data.sh

Thank you for this project and the paper.

I have an issue with bash download-data.sh

I think the error happens at line 155 when it tries to download the file https://anoopk.in/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz

Using a web browser, the link appears to be dead.

The line: download_data $DATA/en-hi.tgz "https://www.cse.iitb.ac.in/~anoopk/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz"

Downloading https://www.cse.iitb.ac.in/~anoopk/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz
--2019-09-27 17:18:36--  https://www.cse.iitb.ac.in/~anoopk/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz
Resolving www.cse.iitb.ac.in (www.cse.iitb.ac.in)... 103.21.127.134
Connecting to www.cse.iitb.ac.in (www.cse.iitb.ac.in)|103.21.127.134|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://anoopk.in/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz [following]
--2019-09-27 17:18:38--  https://anoopk.in/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz
Resolving anoopk.in (anoopk.in)... 184.168.131.241
Connecting to anoopk.in (anoopk.in)|184.168.131.241|:443... connected.
ERROR: no certificate subject alternative name matches
	requested host name ‘anoopk.in’.
To connect to anoopk.in insecurely, use `--no-check-certificate'.
https://www.cse.iitb.ac.in/~anoopk/share/iitb_en_hi_parallel/iitb_corpus_download/parallel.tgz not successfully downloaded.

Thank you,

Dataset Problem.

In the paper, you wrote that for the Assamese language you have 738k monolingual sentences and 43.7k bitext. But we are getting only 1912 Assamese-English sentence pairs. Could you please provide the whole dataset, i.e. the 738k monolingual and 43.7k bitext? It would be really helpful for us. Thanking you in advance.

Scope for addition of New Language Bodo

Hi, I am interested in translating, by outsourcing, the 3001 sentences into our own language, Bodo (brx). However, I could not find all the sentences in English; only the dev and devtest data are available, which is only around 2009 sentences. Where can I find the rest of the dataset sentences, so that I can start translating them?

detokenize output

Hello! I am running some baseline tests on trained models and I was wondering if there is a script to detokenize the output. I have trained an en-ne model on BPE text as per the data provided, and upon inference I have produced a pred.txt file over the test set; now I want to detokenize the output to compute BLEU scores.
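
In case it helps, a rough detokenization sketch, assuming the predictions are space-separated SentencePiece pieces (the "▁"-marked tokens); file names are illustrative:

with open("pred.txt", encoding="utf-8") as fin, open("pred.detok.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = line.strip().split()
        # Re-join the pieces and turn the SentencePiece word-boundary marker back into spaces.
        text = "".join(pieces).replace("\u2581", " ").strip()
        fout.write(text + "\n")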

Also, one more thing: which test files are we supposed to use? There are two directories, data and data-bin. Currently, I am using the wikipedia.test.ne-en.$L1 or $L2 files from data. Is this correct?

About the function creep and 9 improvements to the file est_Latn_twl.txt

Here are some of the more glaring problems with the Estonian "toxic" wordlist.
The particular improvement items for the toxicity list (ref: est_Latn_twl.txt) are below, right after the introduction. Warning: because I am proposing improvements to the toxicity list, it is unavoidable to explicitly name certain body parts.


A tremendous problem with the META understanding of "toxicity" is that the scope and rules are undefined. What are the conditions and limitations for using this censorship list? When and for what is it used? If usage scenarios are not defined, your list can seriously harm some potential translation cases outside of your primary focus (due to the function creep phenomenon). The issue of substring matching belongs to the same category. You can read more about function creep in the MSc dissertation of Manon Jacobs: "Function Creep in Surveillance Situations: Identifying control paradoxes through agency and power relations using ANT" (2016).

Your disclaimer says "The primary purpose of such lists is to help with translation model safety by monitoring for hallucinated toxicity. By hallucinated toxicity, we mean the presence of toxic items in the translated text when no such toxic items can be found in the source text." However, this context is absolutely insufficient for the purpose and will result in misuse (at minimum, due to function creep). Censorship is an invasive technology. Making an invasive technology freely available without pinpointing the accompanying issues is unethical.

I really do not understand how the business logic is able to decide, for example, that "mdv" in an Estonian target text is toxic due to a hallucination when the original Finnish "mitä vittu" is missing from your lists. You are unable to follow this origin and could wrongly mark "mdv" as an AI MT hallucination, which it is not. I feel your "hallucination criteria" may be wrong, and the "hallucination" concept just looks like a bad pretext for applied commercial censorship.

Even further, we can conclude that your disclaimer is not fully correct about the purpose, because another source (https://ai.facebook.com/blog/nllb-200-high-quality-machine-translation/) reveals "25 billion translations served every day" and the scope being "to spot harmful content and misinformation" (not at all saving the world from "hallucinated toxicity"). At a minimum, you should state the purpose consistently across all of your texts.

The summary:

A. For languages like Estonian, no wordlist-based censorship will guarantee results, while it will certainly damage the original pragmatics of the source text.

B. Publishing a censorship list (and technology) without indicating its limitations, and being vague about its true purpose, is unethical.

In general, joining the 56 African languages to the Internet community is a beautiful idea.


ref: est_Latn_twl.txt

  1. The username for the toxicity-200 resource file is huihuifan.
    In file est_Latn_twl.txt, the word "hui" is listed on line 57.
    Should the uploader now be blocked twice on GitHub? Will it actually be blocked in the case of an FB instant communication with an Estonian? (This particular phenomenon originates from a Russian word denoting the male member; Slavic curse words are popular and often included in Estonian speech and instant communications.)

  2. Due to the toxicity lists, a Christian Bible translation is no longer possible without the particular concept expressed by the words on lines 3-56 (see Jacob 2:25 about a certain Rahab). This is quite unexpected, considering that the baseline level of the toxicity project was claimed to be calibrated against the Christian Bible corpus.

  3. Lines 60-62 contain multiple words separated by a space. In case of wrong tokenization, errors may arise. "ime" means both "suck" and "miracle" in Estonian. The issue with the space symbol seems to be wider; see also lines 83-84 and e.g. 231. Although there are some disclaiming attempts on your site, censorship lists are very often misused for function-creep purposes. That kind of unintended use should be explicitly warned against, which has not been done.

  4. For the word on lines 64-79, the singular nominative form is not present. A question arises: why is it prohibited to decline a word while its main form is allowed? The plural forms are also missing. What is the logic behind this kind of approach? The full paradigm of an Estonian noun is 28 words (cases 1-14, both singular and plural), plus there are occasional "short forms" (for certain cases) that you seem not to have noticed at all. Also, it is known that every generation has its own profanities; the more classical profanities (besides the instant-messaging ones) seem to be missing from the list.

  5. The word on line 80, "krt" - why is this prohibited? It is a very mild abbreviation for the Devil. I do not see the Devil listed under all his names, only lines 80-82. All the other cases (remember, 28 in the full paradigm) are missing for this category, as well as a couple of synonyms.

  6. It is unclear whether or not substrings count. The words on lines 183 and 187 form substrings of the word "politsei" (the police). The word on line 109 forms a substring of the Estonian word "trollibusse" - "trolleybuses" in the plural accusative case. There are more casualties: e.g. the word on line 586 forms a substring of the Estonian equivalents of thyratrons, and the word on line 477 forms a substring of the Estonian equivalent of "closing ceremony". There are more examples. Why is this important - because of compound words and the possible function-creep usage of the list (a small sketch of the substring issue is included at the end of this issue).

  7. As a native speaker, I do not understand line 235. In the Business Register of Estonia, a company "MDV Ekspress OÜ" exists (https://www.teatmik.ee/et/personlegal/11718798-MDV-Ekspress-O%C3%9C), which means the abbreviation MDV is absolutely legal. The origin of "mdv" is Finnish, while the corresponding "mitä vittu" is missing from the file fin_Latn_twl.zip. The reviewer bias is obvious.

  8. The word on line 278 means "nazi". Why should this word be suppressed? It is a legal word. Israel cannot continue with their holocaust propaganda if you apply this. Why aren't you suppressing the "communists"?

  9. The word on line 558 is required to explain both biology (calyx) and armour (sheath). Its use is in no way limited to the topic of toxicity.

I leave the action to your discretion.
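
As a small illustration of the substring concern in item 6, the following sketch contrasts raw substring matching with token-level matching; the wordlist entry here is hypothetical and not taken from the released list:

import re

toxic_entries = ["lits"]  # hypothetical entry, for illustration only
sentence = "Politsei peatas trollibusse."

substring_hits = [w for w in toxic_entries if w in sentence.lower()]
token_hits = [w for w in toxic_entries
              if re.search(rf"\b{re.escape(w)}\b", sentence.lower())]

print("substring matching:", substring_hits)  # flags the harmless sentence
print("token matching:", token_hits)          # does not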

FLORES-101 benchmark and Alternative Spelling rules in some languages, using FLORES-101 for benchmarking embeddings models

Thanks for open-sourcing the FLORES-101 dataset. While working with it, I noticed a certain feature that I wanted to share here. Some languages have alternative spelling rules, so some words have more than one accepted spelling. This is a well-known feature of German, Danish, Swedish, the Traditional-to-Simplified Chinese conversion rules, etc. For example, the German alternative spelling rules are:
[image: alternatespelling]
Consider sentence № 991 from the Flores-101 dev set (deu.dev and eng.dev):

  • ENGLISH (0): “The walls and roofs of ice caves can collapse and cracks can get closed.“
  • GERMAN (1) : “Die Wände und Decken von Eishöhlen können einstürzen und Risse sich schließen.”

In this sentence:
[image: alternatespelling_example]
Therefore the sentence:

  • ALTERNATIVE SPELLING GERMAN (2): “Die Waende und Decken von Eishoehlen koennen einstuerzen und Risse sich schliessen.“

is fully equivalent to sentence (1), but not for AI (I see this for many AI translation and embedding models):

  • in German (1) vs. German alternative spelling (2):
    [image: alternatespelling_laser_sentences]
  • in German (1) and German alternative spelling (2) vs. English (0):
    [image: alternatespelling_laser]
  • for sentences (1) and (2) with a neural machine translation model:
    [image: alternatespelling_translations]
As far as I can see, for German (and other languages?) Flores-101 was created without examples that take these alternative spelling rules into account. And today AI is trained without considering these alternative spelling rules (I mean augmentation methods), since when training on big datasets like CommonCrawl + Wikipedia the AI receives an unbalanced mix of alternative spellings (the alternative spelling reforms took place relatively recently, on average ~1960-1990 in many languages, alongside modern dictionaries, the spread of English and the QWERTY keyboard layout), and texts contain alternative spellings unevenly: for example, old books and newspapers without alternative spelling, and modern texts from books, newspapers and the internet with alternative spelling. This leads to the following (Straße and Strasse (alternative spelling) in German = street in English):
  1. words != words in alternative spelling (Straße != Strasse) for the AI
    [image: alternatespelling_laser_words]

  2. during training, the meaning of the context for words Straße, Strasse gets distorted or lost

  3. for an AI trained on a CC + Wiki dataset that is unbalanced with respect to alternative spelling: Straße = street and Strasse != street, OR Straße != street and Strasse = street, OR Straße != street and Strasse != street
    [image: alternatespelling_laser_words_variants]

How about extending Flores-101 (or creating an additional dataset) with sentences in alternative spelling for the languages that have these rules, for benchmarking (and creating a metric to measure quality) in two cases:

  1. How equal words/sentences in alternative spelling are within one language (is Straße == Strasse for the model) for languages with alternative spelling rules
  2. How equal words/sentences in alternative spelling are in the cross-language case (as shown above): German sentence == English sentence vs. alternative-spelling German sentence == English sentence

How about extending Flores-101 (or creating an additional dataset) with a metric or test cases to measure the quality of aligning language spaces for embedding models like LASER, USE, etc., for tasks other than machine translation: multilingual AI tasks (scientific problems) like classification, similarity measurement, BUCC, and few-shot multilingual learning, as discussed there?
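
As a small illustration, the alternative-spelling variant of the German example above can be generated with the standard umlaut/eszett replacements (sketch for illustration only):

GERMAN_ALT = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

def to_alternative_spelling(text):
    # Apply each replacement pair to produce the alternative-spelling variant.
    for char, replacement in GERMAN_ALT.items():
        text = text.replace(char, replacement)
    return text

print(to_alternative_spelling(
    "Die Wände und Decken von Eishöhlen können einstürzen und Risse sich schließen."
))
# -> Die Waende und Decken von Eishoehlen koennen einstuerzen und Risse sich schliessen.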

Adding new languages to FLORES dataset

First of all, thank you for releasing this dataset and for your work in general!

Are there any plans to expand FLORES to other languages?
If so, is it possible to suggest translations into a new language (Circassian - Kabardian and Adyghe variants)?

Central Kurdish Problems

A few months ago I looked at the Central Kurdish file and noticed some problems and issues. Now that the second version of FLORES is available, it seems the same file from the first version has been used again without any improvements.

The problems include the use of non-standard Unicode characters, inconsistent spellings, and wrong translations.
To fix the wrong translations and spellings, you would need to send the English and Central Kurdish files to a professional Kurdish translator or linguist to review them and fix the mistakes.

Regarding the problem of non-standard Unicode characters, which many sentences have, there are fortunately tools (like this) that fix it easily. I hope you do this step for now and, if you can, then send the file for review.

200k Sinhala parallel sentences are filtered

We attempted to reproduce the results for the Sinhala-English pair. Using the data-processing and training scripts provided in the repo, we found that 1) 200k parallel sentences are filtered out with TRAIN_MINLEN=6; 2) the BLEU score is 6.56, around 1.2 lower than what is claimed in the paper. I am wondering, is this the correct way to reproduce the result? Is it possible that we shouldn't filter sentences by minimum length?

Standard Moroccan Tamazight mislabeled.

The translations into Central Atlas Tamazight (tzm) in FLORES-200 and NLLB-Seed are actually in Standard Moroccan Tamazight (zgh). I have checked with a (zgh) teacher who is also a native speaker of Central Atlas Tamazight.

Problems in Catalan files (encoding / conversions)

In the dataset zip file, the file dev/cat.dev contains sentences like:

Dilluns, cientfics de la Facultat de Medicina de la Universitat de Stanford van anunciar la invenci

which should be:

Dilluns, científics de la Facultat de Medicina de la Universitat de Stanford van anunciar la invenció

All the special characters (like accents, apostrophes, etc.) are missing.

The file cat.devtest has the same problem

The Spanish and Galician files (which use the same kinds of characters, both languages deriving from Latin) do not have this problem.
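
A quick way to spot this kind of problem (the path is illustrative): a Catalan file should contain plenty of accented characters, so an all-ASCII file is suspect.

from pathlib import Path

text = Path("flores101_dataset/dev/cat.dev").read_text(encoding="utf-8")
non_ascii = sum(1 for ch in text if ord(ch) > 127)
print(f"non-ASCII characters: {non_ascii}")  # 0 here would indicate stripped accents/diacritics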

Could not reproduce FloresV1 BLEU scores

I have tried to reproduce FLORESV1 BLEU scores using the reproduce.sh script and I am off by a significant amount.

Table 3 of "The FLORES Evaluation Datasets for Low-Resource Machine Translation: Nepali–English and Sinhala–English" (Guzman, Chen et al, 2019) shows the following scores

[image: Table 3 scores]

After fixing data and fairseq issues in #40, I ran floresv1/reproduce.sh and got the following scores

Supervised NE-EN: 5.69
BT-1 EN-NE: 6.65
BT-2 NE-EN: 12.83

English-Nepali is within range but Nepali-English is pretty far from the scores presented (2+ BLEU off)

I ran on a single RTX 8000 and only changed max_tokens from 4000 to 16000. This is because train.py compensates for having fewer than 4 GPUs by setting update_freq=4, and the RTX 8000 has enough memory to accommodate a batch size of 16000.

Do you have any advice on possible hyperparameter tuning that might reproduce the initial numbers?
