
word-embeddings-for-nmt's Introduction

When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?

This page contains details of the code and the TED talks dataset used for the experiments in the above paper.

The content can also be found at https://github.com/neulab/word-embeddings-for-nmt.

Contents

Software:

We used XNMT at commit 38044b3 for all experiments.

Experiments:
Data Processing:

To perform our experiments, we collected (in early 2017) a common corpus of TED talks that have been translated into many low-resource languages. Under TED's Open Translation Project, transcripts are available for more than 2,400 talks in 109 languages. The figure below shows, for each language (represented by its ISO code), the total number of talks in the original dataset.

Figure: TED Talks statistics (number of talks per language).

To obtain a parallel corpus for our experiments, we preprocessed the dataset with the Moses tokenizer and used hard punctuation symbols to identify valid sentence boundaries for English. To create the train, dev, and test sets, we applied a greedy selection algorithm based on the popularity of the talks, choosing disjoint talks for each split and keeping only talks with translations in more than 50 languages. Finally, we selected a list of 60 languages that had sufficient data for meaningful experiments. The train, dev, and test splits for the most common talks are shown in the table alongside the figure above.
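The paper's preprocessing used the Moses tokenizer; a rough Python approximation using the sacremoses package is sketched below. The exact tokenizer flags and sentence-boundary rules are not specified in this README, so treat this as an illustration, not the authors' pipeline.

    # Rough approximation of the preprocessing described above, using the
    # sacremoses package as a stand-in for the original Moses tokenizer
    # scripts. The exact flags and boundary rules used in the paper are
    # not documented here, so this is illustrative only.
    import re

    from sacremoses import MosesTokenizer

    tokenizer = MosesTokenizer(lang="en")

    def split_sentences(text):
        # Split on hard punctuation (. ! ?) followed by whitespace.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def preprocess(text):
        # Sentence-split, then Moses-tokenize each sentence.
        return [tokenizer.tokenize(sent, return_str=True)
                for sent in split_sentences(text)]

    print(preprocess("Good morning. How are you today?"))
    # -> ['Good morning .', 'How are you today ?']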

  • The train, dev and test splits for the above TED talks: ted_talks.tar.gz.
  • ted_reader.py is a sample Python script for reading the TED talks data. An example is shown in the script's main block; a standalone reading sketch also follows this list.
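
For quick experimentation without ted_reader.py, here is a minimal sketch of reading the TSVs directly. It assumes one aligned sentence per language column per row, with __NULL__ marking a missing translation (both visible in the issues further down); the "ru"/"en" column names are examples.

    # Minimal sketch of reading parallel sentences straight from the TSVs,
    # as an alternative to ted_reader.py. Assumes one aligned sentence per
    # language column per row, with "__NULL__" marking a missing translation.
    import csv

    def read_parallel(path, src_lang, tgt_lang):
        # Yield (src, tgt) pairs, skipping rows where either side is missing.
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                src, tgt = row.get(src_lang), row.get(tgt_lang)
                if src and tgt and "__NULL__" not in (src, tgt):
                    yield src, tgt

    pairs = list(read_parallel("all_ted_train.tsv", "ru", "en"))
    print(len(pairs))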

If you use the dataset or code, please consider citing the paper with the following BibTeX:

BibTex

@inproceedings{Ye2018WordEmbeddings,
  author    = {Qi, Ye and Sachan, Devendra and Felix, Matthieu and Padmanabhan, Sarguna and Neubig, Graham},
  title     = {When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?},
  booktitle = {HLT-NAACL},
  year      = {2018},
}

word-embeddings-for-nmt's People

Contributors

charlotteyeq, neubig


word-embeddings-for-nmt's Issues

About experiment results

Hi, I have recently been doing some experiments on transfer learning in NMT.
I read your paper "When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?" and used your shared TED data.
I notice you get a clear improvement from pre-trained embeddings: for example, ru-en goes from 18.5 to 21.2 BLEU, and it-pt from 14.5 to 19.2.
In your experiments, you seem to use 'a standard 1-layer encoder-decoder model with attention'.
My question is: have you tried the experiment on a stronger model, such as a 4-layer LSTM or a base Transformer? In my experiments, when I train a base Transformer directly, both ru-en and it-pt reach a BLEU of about 21+. When I initialize the embeddings with pre-trained vectors (also fastText trained on Wikipedia data), BLEU drops by about 1 point in both cases.
Is there anything in particular to pay attention to? Or do pre-trained embeddings simply not help on a stronger model?

I tried to send an e-mail but got this bounce:
"553 5.3.0 [email protected]... The email account that you tried to reach does not exist. Please try double-checking the recipient's email address for typos or unnecessary spaces."

The column named "calv" has no data

Hi, thank you for compiling this exciting dataset.
I am trying to conduct some multilingual translation experiments with it.

I am using the files from ted_talks.tar.gz and noticed that they contain a column named calv whose fields only contain __NULL__.
Was this column mistakenly left in the data? calv does not seem to correspond to any language code.
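
A quick way to verify this report is to scan for columns that never contain anything but __NULL__. A minimal sketch, with the file name assumed and the TSV format as described earlier:

    # Scan the TSV for columns that never contain anything but "__NULL__"
    # (the behaviour reported above for the "calv" column).
    import csv

    def empty_columns(path):
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f, delimiter="\t")
            has_data = {col: False for col in reader.fieldnames}
            for row in reader:
                for col, val in row.items():
                    if val and val != "__NULL__":
                        has_data[col] = True
            return [col for col, seen in has_data.items() if not seen]

    print(empty_columns("all_ted_train.tsv"))  # expected to list 'calv'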

Question: About new languages

Hello, thank you for this parallel corpus of TED talks in many languages!
I'm interested in specific languages (such as Hindi or Urdu), but I'm not sure what the process is for retrieving these datasets. I'm aware of the transcription and translation processes as described here and on the WIT^3 web site here, but assuming I have a specific talk in Tamil, like this one, how did you automatically get the translations/transcripts?

My aim is to augment your dataset with more of the missing languages. Specifically, I'm looking for translations of talks in the following languages, for which I have checked the number available at present:

Language               ISO code        Talks
Urdu                   urd             146
Malayalam              mal             43
Hindi                  hin             417
Assamese               asm             1
Bengali                ben             111
Gujarati               guj             36
Kannada                kan             14
Marathi                mar             184
Nepali                 nep             43
Punjabi                pan             9
Tamil                  tam             114
Telugu                 tel             59

Japanese               jpn             2565
Chinese, Simplified    zh-Hans         2597
Chinese, Traditional   zh-Hant / zho   2765

Thank you very much.

__NULL__??

Hello, I was working with your data and found a weird phrase: "__NULL__".

Can you check what this is?

It appears at line 6147 of all_ted_train.tsv.

Thanks,
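
Judging from the calv issue above, __NULL__ is most plausibly the placeholder used when a talk has no translation in a given language column, rather than stray data. A small sketch to inspect the reported row; whether "line 6147" counts the header row is ambiguous, so the offset here is an assumption:

    # Inspect the reported row and list which language columns hold real
    # text versus the "__NULL__" placeholder.
    import csv

    with open("all_ted_train.tsv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for lineno, row in enumerate(reader, start=2):  # header is line 1
            if lineno == 6147:
                present = sorted(c for c, v in row.items() if v and v != "__NULL__")
                missing = sorted(c for c, v in row.items() if v == "__NULL__")
                print("translated in:", present)
                print("missing in  :", missing)
                break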

Extending the dataset

Hello, thank you for this repository of TED talks in many languages.

Since the data was collected (early 2017), many more talks have been added, transcribed, and translated. I intend to extract these new talks and extend the current dataset. Could you publish the script you used to crawl and extract the subtitles for TED talks from their website?
