
word-embeddings-for-nmt's Introduction

When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?

This page contains details of the code and the TED talks dataset used for the experiments in the above paper.

The content can also be found at https://github.com/neulab/word-embeddings-for-nmt.

Contents

Software:

We used XNMT at commit 38044b3 for all experiments.

Experiments:
Data Processing:

To perform our experiments, we collected (in early 2017) a common corpus of TED talks that have been translated into many low-resource languages. Under TED's Open Translation Project, transcripts are available for more than 2,400 talks in 109 languages. The figure below shows, for each language (represented by its ISO code), the total number of talks in the original dataset.

Figure: TED Talks statistics (number of talks per language).

To obtain a parallel corpus for our experiments, we preprocessed the dataset with the Moses tokenizer and used hard punctuation symbols to identify valid sentence boundaries for English. To create the train, dev, and test sets, we applied a greedy selection algorithm based on the popularity of the talks, choosing disjoint talks for each split and keeping only talks with translations in more than 50 languages. Finally, we selected a list of 60 languages that had sufficient data for meaningful experiments. The train, dev, and test splits for the most common talks are shown in the table alongside the figure above.
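The paper's preprocessing used the Moses tokenizer; a rough Python approximation using the sacremoses package is sketched below. The exact tokenizer flags and sentence-boundary rules are not specified in this README, so treat this as an illustration, not the authors' pipeline.

    # Rough approximation of the preprocessing described above, using the
    # sacremoses package as a stand-in for the original Moses tokenizer
    # scripts. The exact flags and boundary rules used in the paper are
    # not documented here, so this is illustrative only.
    import re

    from sacremoses import MosesTokenizer

    tokenizer = MosesTokenizer(lang="en")

    def split_sentences(text):
        # Split on hard punctuation (. ! ?) followed by whitespace.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def preprocess(text):
        # Sentence-split, then Moses-tokenize each sentence.
        return [tokenizer.tokenize(sent, return_str=True)
                for sent in split_sentences(text)]

    print(preprocess("Good morning. How are you today?"))
    # -> ['Good morning .', 'How are you today ?']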

  • The train, dev and test splits for the above TED talks: ted_talks.tar.gz.
  • ted_reader.py is a sample Python script for reading the TED talks data. An example is shown in the script's main block; a standalone reading sketch also follows this list.
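
For quick experimentation without ted_reader.py, here is a minimal sketch of reading the TSVs directly. It assumes one aligned sentence per language column per row, with __NULL__ marking a missing translation (both visible in the issues further down); the "ru"/"en" column names are examples.

    # Minimal sketch of reading parallel sentences straight from the TSVs,
    # as an alternative to ted_reader.py. Assumes one aligned sentence per
    # language column per row, with "__NULL__" marking a missing translation.
    import csv

    def read_parallel(path, src_lang, tgt_lang):
        # Yield (src, tgt) pairs, skipping rows where either side is missing.
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                src, tgt = row.get(src_lang), row.get(tgt_lang)
                if src and tgt and "__NULL__" not in (src, tgt):
                    yield src, tgt

    pairs = list(read_parallel("all_ted_train.tsv", "ru", "en"))
    print(len(pairs))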

If you use the dataset or code, please consider citing the paper with the following BibTeX:

BibTex

@inproceedings{Ye2018WordEmbeddings,
  author    = {Qi, Ye and Sachan, Devendra and Felix, Matthieu and Padmanabhan, Sarguna and Neubig, Graham},
  title     = {When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?},
  booktitle = {HLT-NAACL},
  year      = {2018},
}

word-embeddings-for-nmt's People

Contributors

charlotteyeq, neubig


word-embeddings-for-nmt's Issues

About experiment results

Hi, I have recently been doing some experiments on transfer learning in NMT.
I read your paper "When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?" and used your shared TED data.
I notice you get a clear improvement from pre-trained embeddings: for example, ru-en goes from 18.5 to 21.2 BLEU, and it-pt from 14.5 to 19.2.
In your experiments, you seem to use 'a standard 1-layer encoder-decoder model with attention'.
My question is: have you tried the experiment on a stronger model, such as a 4-layer LSTM or a base Transformer? In my experiments, when I train a base Transformer directly, both ru-en and it-pt reach a BLEU of about 21+. When I initialize the embeddings with pre-trained vectors (also fastText trained on Wikipedia data), BLEU drops by about 1 point in both cases.
Is there anything in particular to pay attention to? Or do pre-trained embeddings simply not help on a stronger model?

I tried to send an e-mail but got this bounce:
"553 5.3.0 [email protected]... The email account that you tried to reach does not exist. Please try double-checking the recipient's email address for typos or unnecessary spaces."

The column named "calv" has no data

Hi, thank you for compiling this exciting dataset.
I am trying to conduct some multilingual translation experiments with it.

I am using the files from ted_talks.tar.gz and noticed that they contain a column named calv whose fields only contain __NULL__.
Was this column mistakenly left in the data? calv does not seem to correspond to any language code.
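
A quick way to verify this report is to scan for columns that never contain anything but __NULL__. A minimal sketch, with the file name assumed and the TSV format as described earlier:

    # Scan the TSV for columns that never contain anything but "__NULL__"
    # (the behaviour reported above for the "calv" column).
    import csv

    def empty_columns(path):
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f, delimiter="\t")
            has_data = {col: False for col in reader.fieldnames}
            for row in reader:
                for col, val in row.items():
                    if val and val != "__NULL__":
                        has_data[col] = True
            return [col for col, seen in has_data.items() if not seen]

    print(empty_columns("all_ted_train.tsv"))  # expected to list 'calv'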

Question: About new languages

Hello, thank you for this parallel corpus of TED talks in many languages!
I'm interested in specific languages (such as Hindi or Urdu), but I'm not sure what the process is for retrieving these datasets. I'm aware of the transcription and translation processes as described here and on the WIT^3 web site here, but assuming I have a specific talk in Tamil, like this one, how did you automatically get the translations/transcripts?

My aim is to augment your dataset with more of the missing languages. Specifically, I'm looking for translations of talks in the following languages, for which I have checked the number available at present:

Language               ISO code        Talks
Urdu                   urd             146
Malayalam              mal             43
Hindi                  hin             417
Assamese               asm             1
Bengali                ben             111
Gujarati               guj             36
Kannada                kan             14
Marathi                mar             184
Nepali                 nep             43
Punjabi                pan             9
Tamil                  tam             114
Telugu                 tel             59

Japanese               jpn             2565
Chinese, Simplified    zh-Hans         2597
Chinese, Traditional   zh-Hant / zho   2765

Thank you very much.

__NULL__??

Hello, I was working with your data and found a weird phrase: "__NULL__".

Can you check what this is?

It appears at line 6147 of all_ted_train.tsv.

Thanks,
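
Judging from the calv issue above, __NULL__ is most plausibly the placeholder used when a talk has no translation in a given language column, rather than stray data. A small sketch to inspect the reported row; whether "line 6147" counts the header row is ambiguous, so the offset here is an assumption:

    # Inspect the reported row and list which language columns hold real
    # text versus the "__NULL__" placeholder.
    import csv

    with open("all_ted_train.tsv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for lineno, row in enumerate(reader, start=2):  # header is line 1
            if lineno == 6147:
                present = sorted(c for c, v in row.items() if v and v != "__NULL__")
                missing = sorted(c for c, v in row.items() if v == "__NULL__")
                print("translated in:", present)
                print("missing in  :", missing)
                break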

Extending the dataset

Hello, thank you for this repository of TED talks in many languages.

Since the data was collected (early 2017), many more talks have been added, transcribed, and translated. I intend to extract these new talks and extend the current dataset. Could you publish the script you used to crawl and extract the subtitles for TED talks from their website?
