
dali's Introduction

Domain Adaptation of Neural Machine Translation by Lexicon Induction

Implemented by Junjie Hu

Contact: [email protected]

If you use the code in this repo, please cite our ACL 2019 paper.

@inproceedings{hu-etal-2019-domain,
    title = "Domain Adaptation of Neural Machine Translation by Lexicon Induction",
    author = "Hu, Junjie and Xia, Mengzhou and Neubig, Graham and Carbonell, Jaime",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1286",
    doi = "10.18653/v1/P19-1286",
    pages = "2989--3001",
}

Installation

  • Create the Anaconda environment
conda env create --file conda-dali-env.txt
  • Install fairseq
cd fairseq && pip install --editable . && cd ..
  • Install fastText
cd fastText && mkdir build && cd build && cmake .. && make && cd ../..
  • Download MUSE's evaluation dictionaries
cd MUSE/data/ && bash get_evaluation.sh && cd ../..

Downloads

The preprocessed data and pre-trained models can be found here. Extract dataset.tar.gz under the dali directory, and extract {data-bin, it-de-en-epoch40, it2emea-de-en}.tar.gz under the dali/outputs directory; a command sketch follows the list below.

  • dataset.tar.gz: train/dev/test data in five domains: it, emea, acquis, koran, subtitles.
  • data-bin.tar.gz: fairseq's binarized data.
  • it-de-en-epoch40.tar.gz: fairseq's transformer model pre-trained on data in the it domain.
  • it2emea-de-en.tar.gz: fairseq's transformer model adapted from it domain to emea domain using DALI-U.
  • S2T+T2S-de-en.lex: the lexicon induced by DALI-U.
  • embed-vec.tar.gz: {it,emea,acquis,koran,subtitles}.{de,en}.vec embeddings trained in five domains respectively, and {de,en}.vec embeddings trained in the combination of five domains.
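
A minimal extraction sketch, assuming the archives were downloaded into the repository root (adjust the paths to wherever you saved them):

     # extract the datasets under dali/
     tar -xzvf dataset.tar.gz
     # extract the binarized data and the pre-trained models under dali/outputs/
     mkdir -p outputs
     for f in data-bin it-de-en-epoch40 it2emea-de-en; do
       tar -xzvf $f.tar.gz -C outputs
     done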

The model pre-trained on the it domain obtains the following BLEU scores on the five domains. After adaptation, BLEU on the emea test set rises from 8.23 to 18.25. The scores differ slightly from those in the paper because we use a different NMT toolkit (fairseq vs. OpenNMT), but we observe similar improvements.

Train domain   In-domain   Out-of-domain
               it          emea    koran   subtitles   acquis
it             58.94       8.23    2.50    6.26        4.34

Demo

  • Preprocess the data in the source (it) domain
bash scripts/preprocess.sh
  • Train the transformer model in the source (it) domain
bash scripts/train.sh [GPU id]
  • Perform DALI's data augmentation

    1.1 (Unsupervised Lexicon Induction) Train the word embeddings

     bash scripts/train-embed.sh
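
    Under the hood this wraps the fastText binary built during installation. A rough single-domain equivalent, with illustrative paths and hyperparameters rather than the script's exact settings:

     # train skip-gram embeddings on the in-domain German text;
     # the -output prefix produces it.de.bin and it.de.vec
     ./fastText/build/fasttext skipgram \
       -input dataset/it/train.bpe.de \
       -output outputs/embed/it.de \
       -dim 300 -epoch 10 -minCount 5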
    

    1.2 (Unsupervised Lexicon Induction) Train the cross-lingual embeddings by supervised lexicon induction (i.e., with a seed lexicon)

     bash scripts/train-muse.sh [path to supervised seed lexicon]
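
    This corresponds to MUSE's supervised mode, which aligns the two monolingual embedding spaces using a seed dictionary. A sketch with illustrative paths (the seed lexicon is the script's first argument):

     python MUSE/supervised.py --src_lang de --tgt_lang en \
       --src_emb outputs/embed/it.de.vec --tgt_emb outputs/embed/it.en.vec \
       --dico_train path/to/seed-lexicon.txt --n_refinement 5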
    

    1.3 (Unsupervised Lexicon Induction) Train the cross-lingual embeddings by unsupervised lexicon induction

     bash scripts/train-muse.sh
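
    Run without a seed lexicon, this uses MUSE's unsupervised mode (adversarial training plus refinement). A sketch with the same illustrative paths:

     python MUSE/unsupervised.py --src_lang de --tgt_lang en \
       --src_emb outputs/embed/it.de.vec --tgt_emb outputs/embed/it.en.vec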
    

    1.4 (Unsupervised Lexicon Induction) Obtain the word translations by nearest-neighbor search

     python3 extract_lexicon.py \
       --src_emb $PWD/outputs/unsupervised-muse/debug/v1/vectors-de.txt \
       --tgt_emb $PWD/outputs/unsupervised-muse/debug/v1/vectors-en.txt \
       --output $PWD/outputs/unsupervised-muse/debug/v1/S2T+T2S-de-en.lex \
       --dico_build "S2T&T2S"
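
    The --dico_build value follows MUSE's dictionary-building convention: "S2T" keeps source-to-target nearest neighbors, "T2S" the reverse, "S2T|T2S" their union, and "S2T&T2S" their intersection, which is the most precise and is what is used here. To trade precision for coverage, one could instead build the union (the output name below is illustrative):

     python3 extract_lexicon.py \
       --src_emb $PWD/outputs/unsupervised-muse/debug/v1/vectors-de.txt \
       --tgt_emb $PWD/outputs/unsupervised-muse/debug/v1/vectors-en.txt \
       --output $PWD/outputs/unsupervised-muse/debug/v1/union-de-en.lex \
       --dico_build "S2T|T2S"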
    

    2.1 (Supervised Lexicon Induction) Obtain the word translations by GIZA++

    bash scripts/extract_lex_giza.sh
    
    3. Perform word-for-word back-translation
     bash scripts/wfw_backtranslation.sh 
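
    Conceptually, this step translates in-domain target-side monolingual text word-for-word through the induced lexicon to build pseudo-parallel training data. A minimal sketch, assuming the lexicon has one "german english" pair per line and a tokenized monolingual file mono.emea.en (both file names are illustrative):

     # build lex[english]=german from the lexicon (first input file), then
     # map each token of the English text to German, copying OOV tokens as-is
     awk 'NR==FNR { lex[$2] = $1; next }
          { for (i = 1; i <= NF; i++) if ($i in lex) $i = lex[$i]; print }' \
       outputs/unsupervised-muse/debug/v1/S2T+T2S-de-en.lex \
       mono.emea.en > pseudo.emea.de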
    
  • Preprocess the data in the target (emea) domain

bash scripts/preprocess-da.sh
  • Adapt the pre-trained model to the target (emea) domain
bash scripts/train-da-opt.sh
  • Translate the test1 set in the emea domain
bash scripts/translate.sh \
  outputs/it-de-en-epoch40/checkpoint_best.pt \
  outputs/it-de-en-epoch40/decode-test1-best.txt \
  outputs/data-bin-join/it \
  test1
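
To score the output (see the BLEU question under Issues below), the hypotheses and references can be pulled out of the generation log by their line prefixes, assuming fairseq's standard generate format (H-<id> carries the hypothesis in the third tab-separated field, T-<id> the reference in the second):

grep '^H-' outputs/it-de-en-epoch40/decode-test1-best.txt | sort -V | cut -f3 > hyp.txt
grep '^T-' outputs/it-de-en-epoch40/decode-test1-best.txt | sort -V | cut -f2 > ref.txt
sacrebleu ref.txt -i hyp.txt

Note that this scores whatever tokenization the log contains, so the numbers are only comparable within this setup.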


dali's Issues

Explaining the dataset files

Hi, can you explain what the dataset files mean? In particular, I'm not sure about the difference between bpe.clean.en and bpe.en (and why clean exists only for the training splits).

About the method of computing BLEU

Hi, I notice that when you use scripts/translate.sh to evaluate the model, you do not use the --sacrebleu option. Does this mean sacrebleu is not used in the BLEU computation? Are all the values in the paper computed the same way as in this codebase?
Thank you.
