
dali's Introduction

Domain Adaptation of Neural Machine Translation by Lexicon Induction

Implemented by Junjie Hu

Contact: [email protected]

If you use the code in this repo, please cite our ACL 2019 paper.

@inproceedings{hu-etal-2019-domain,
    title = "Domain Adaptation of Neural Machine Translation by Lexicon Induction",
    author = "Hu, Junjie and Xia, Mengzhou and Neubig, Graham and Carbonell, Jaime",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1286",
    doi = "10.18653/v1/P19-1286",
    pages = "2989--3001",
}

Installation

  • Create the Anaconda environment
conda env create --file conda-dali-env.txt
  • Install fairseq
cd fairseq && pip install --editable . && cd ..
  • Install fastText
cd fastText && mkdir build && cd build && cmake .. && make && cd ../..
  • Download MUSE's evaluation dictionaries
cd MUSE/data/ && bash get_evaluation.sh && cd ../..

Downloads

The preprocessed data and pre-trained models can be found here. Extract dataset.tar.gz under the dali directory, and extract {data-bin, it-de-en-epoch40, it2emea-de-en}.tar.gz under the dali/outputs directory; a command sketch follows the list below.

  • dataset.tar.gz: train/dev/test data in five domains: it, emea, acquis, koran, subtitles.
  • data-bin.tar.gz: fairseq's binarized data.
  • it-de-en-epoch40.tar.gz: fairseq's transformer model pre-trained on data in the it domain.
  • it2emea-de-en.tar.gz: fairseq's transformer model adapted from it domain to emea domain using DALI-U.
  • S2T+T2S-de-en.lex: the lexicon induced by DALI-U.
  • embed-vec.tar.gz: {it,emea,acquis,koran,subtitles}.{de,en}.vec embeddings trained in five domains respectively, and {de,en}.vec embeddings trained in the combination of five domains.
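
A minimal extraction sketch, assuming the archives were downloaded into the repository root (adjust the paths to wherever you saved them):

     # extract the datasets under dali/
     tar -xzvf dataset.tar.gz
     # extract the binarized data and the pre-trained models under dali/outputs/
     mkdir -p outputs
     for f in data-bin it-de-en-epoch40 it2emea-de-en; do
       tar -xzvf $f.tar.gz -C outputs
     done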

The model pre-trained on the it domain obtains the following BLEU scores on the five domains. After adaptation, BLEU on the emea test set rises from 8.23 to 18.25. The scores differ slightly from those in the paper because we use a different NMT toolkit (fairseq vs. OpenNMT), but we observe similar improvements.

Train domain   In-domain   Out-of-domain
               it          emea    koran   subtitles   acquis
it             58.94       8.23    2.50    6.26        4.34

Demo

  • Preprocess the data in the source (it) domain
bash scripts/preprocess.sh
  • Train the transformer model in the source (it) domain
bash scripts/train.sh [GPU id]
  • Perform DALI's data augmentation

    1.1 (Unsupervised Lexicon Induction) Train the word embeddings

     bash scripts/train-embed.sh
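
    Under the hood this wraps the fastText binary built during installation. A rough single-domain equivalent, with illustrative paths and hyperparameters rather than the script's exact settings:

     # train skip-gram embeddings on the in-domain German text;
     # the -output prefix produces it.de.bin and it.de.vec
     ./fastText/build/fasttext skipgram \
       -input dataset/it/train.bpe.de \
       -output outputs/embed/it.de \
       -dim 300 -epoch 10 -minCount 5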
    

    1.2 (Unsupervised Lexicon Induction) Train the cross-lingual embeddings by supervised lexicon induction (i.e., with a seed lexicon)

     bash scripts/train-muse.sh [path to supervised seed lexicon]
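
    This corresponds to MUSE's supervised mode, which aligns the two monolingual embedding spaces using a seed dictionary. A sketch with illustrative paths (the seed lexicon is the script's first argument):

     python MUSE/supervised.py --src_lang de --tgt_lang en \
       --src_emb outputs/embed/it.de.vec --tgt_emb outputs/embed/it.en.vec \
       --dico_train path/to/seed-lexicon.txt --n_refinement 5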
    

    1.3 (Unsupervised Lexicon Induction) Train the cross-lingual embeddings by unsupervised lexicon induction

     bash scripts/train-muse.sh
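
    Run without a seed lexicon, this uses MUSE's unsupervised mode (adversarial training plus refinement). A sketch with the same illustrative paths:

     python MUSE/unsupervised.py --src_lang de --tgt_lang en \
       --src_emb outputs/embed/it.de.vec --tgt_emb outputs/embed/it.en.vec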
    

    1.4 (Unsupervised Lexicon Induction) Obtain the word translations by nearest-neighbor search

     python3 extract_lexicon.py \
       --src_emb $PWD/outputs/unsupervised-muse/debug/v1/vectors-de.txt \
       --tgt_emb $PWD/outputs/unsupervised-muse/debug/v1/vectors-en.txt \
       --output $PWD/outputs/unsupervised-muse/debug/v1/S2T+T2S-de-en.lex \
       --dico_build "S2T&T2S"
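
    The --dico_build value follows MUSE's dictionary-building convention: "S2T" keeps source-to-target nearest neighbors, "T2S" the reverse, "S2T|T2S" their union, and "S2T&T2S" their intersection, which is the most precise and is what is used here. To trade precision for coverage, one could instead build the union (the output name below is illustrative):

     python3 extract_lexicon.py \
       --src_emb $PWD/outputs/unsupervised-muse/debug/v1/vectors-de.txt \
       --tgt_emb $PWD/outputs/unsupervised-muse/debug/v1/vectors-en.txt \
       --output $PWD/outputs/unsupervised-muse/debug/v1/union-de-en.lex \
       --dico_build "S2T|T2S"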
    

    2.1 (Supervised Lexicon Induction) Obtain the word translations by GIZA++

    bash scripts/extract_lex_giza.sh
    
    3. Perform word-for-word back-translation
     bash scripts/wfw_backtranslation.sh 
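
    Conceptually, this step translates in-domain target-side monolingual text word-for-word through the induced lexicon to build pseudo-parallel training data. A minimal sketch, assuming the lexicon has one "german english" pair per line and a tokenized monolingual file mono.emea.en (both file names are illustrative):

     # build lex[english]=german from the lexicon (first input file), then
     # map each token of the English text to German, copying OOV tokens as-is
     awk 'NR==FNR { lex[$2] = $1; next }
          { for (i = 1; i <= NF; i++) if ($i in lex) $i = lex[$i]; print }' \
       outputs/unsupervised-muse/debug/v1/S2T+T2S-de-en.lex \
       mono.emea.en > pseudo.emea.de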
    
  • Preprocess the data in the target (emea) domain

bash scripts/preprocess-da.sh
  • Adapt the pre-trained model to the target (emea) domain
bash scripts/train-da-opt.sh
  • Translate the test1 set in the emea domain
bash scripts/translate.sh \
  outputs/it-de-en-epoch40/checkpoint_best.pt \
  outputs/it-de-en-epoch40/decode-test1-best.txt \
  outputs/data-bin-join/it \
  test1
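
To score the output (see the BLEU question under Issues below), the hypotheses and references can be pulled out of the generation log by their line prefixes, assuming fairseq's standard generate format (H-<id> carries the hypothesis in the third tab-separated field, T-<id> the reference in the second):

grep '^H-' outputs/it-de-en-epoch40/decode-test1-best.txt | sort -V | cut -f3 > hyp.txt
grep '^T-' outputs/it-de-en-epoch40/decode-test1-best.txt | sort -V | cut -f2 > ref.txt
sacrebleu ref.txt -i hyp.txt

Note that this scores whatever tokenization the log contains, so the numbers are only comparable within this setup.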


dali's Issues

Explaining the dataset files

Hi, can you explain what the dataset files mean? In particular, I'm not sure about the difference between bpe.clean.en and bpe.en (and why clean exists only for the training splits).

About the method of computing BLEU

Hi, I notice that when you use scripts/translate.sh to evaluate the model, you do not use the --sacrebleu option. Does this mean sacrebleu is not used in the BLEU computation? Are all the values in the paper computed the same way as in this codebase?
Thank you.
