DEEP: DEnoising Entity Pre-training for Neural Machine Translation (ACL 2022)

Installation

Here is a list of important tools for installation. We also provide a conda environment file, py39_env.txt.

cd fairseq
pip install --editable ./

Download

Perform SLING's entity linking

After installing SLING, it should be located under $REPO/tools/sling (REPO denotes the path to this repository). Then run the following to perform entity linking on Wikipedia articles.

cd tools/sling
lang=uk    # uk: Ukrainian
version=20221101  # the version we used
./run.sh --download_wikidata --download_wikipedia --wikipedia $version --language $lang

This will generate annotated Ukrainian Wikipedia articles under $REPO/tools/sling/local/data/e/wiki/uk/documents-0000{0-9}-of-00010.rec, which will be used to create the pre-training data.
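Before moving on, it can help to confirm that all ten shards named by the pattern above were written. The following is a minimal sketch, not part of the repository: the function name `count_missing_shards` is hypothetical, and `REPO` is assumed to default to the current directory when unset.

```shell
# Hypothetical helper (not part of this repo): count the SLING output shards
# that are missing from a directory, following the file pattern
# documents-0000{0-9}-of-00010.rec described above.
count_missing_shards() {
  dir="$1"
  missing=0
  for i in 0 1 2 3 4 5 6 7 8 9; do
    shard="$dir/documents-0000${i}-of-00010.rec"
    if [ ! -f "$shard" ]; then
      # Report each missing shard on stderr so stdout carries only the count.
      echo "missing: $shard" >&2
      missing=$((missing + 1))
    fi
  done
  echo "$missing"
}

# Assumption: REPO points at this repository; falls back to the current directory.
REPO="${REPO:-$PWD}"
lang="${lang:-uk}"
count_missing_shards "$REPO/tools/sling/local/data/e/wiki/$lang"
```

A printed count of 0 means every shard is present; any other value is the number of shards still missing.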

Prepare DEEP's Pre-training Data

After installing the tools above, run the following to create DEEP's pre-training data.

bash data-scripts/create_deep_pretraining_data.sh

This will generate two folders, each with one sub-folder per language (e.g., uk_XX):

  • data/Wikipedia/wiki-max512-deep-spm250000/uk_XX/ - Raw text : train-{0-9}.{en_XX,uk_XX,idx,qid}, and valid.{en_XX,uk_XX,idx,qid}
  • data/Wikipedia/wiki-max512-deep-spm250000-bin/uk_XX - Fairseq's binarized data: train-{0-9}.en_XX-uk_XX.{en_XX,uk_XX}.{bin,idx}
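The raw-text pattern above expands to 44 files per language: ten training shards and one validation split, each with four extensions. As a rough sketch, the expansion can be written out and compared against the folder. The helper name `expected_raw_files` is hypothetical and not part of the repository, and the check assumes it is run from the repository root.

```shell
# Hypothetical helper (not part of this repo): print the raw-text file names
# implied by train-{0-9}.{en_XX,<lang>,idx,qid} and valid.{en_XX,<lang>,idx,qid}.
expected_raw_files() {
  target="$1"   # language code of the non-English side, e.g. uk_XX
  for prefix in train-0 train-1 train-2 train-3 train-4 \
                train-5 train-6 train-7 train-8 train-9 valid; do
    for ext in en_XX "$target" idx qid; do
      echo "$prefix.$ext"
    done
  done
}

# Assumption: the current directory is the repository root.
raw_dir="data/Wikipedia/wiki-max512-deep-spm250000/uk_XX"
expected_raw_files uk_XX | while read -r name; do
  [ -f "$raw_dir/$name" ] || echo "missing: $raw_dir/$name"
done
```

No output means every expected raw file exists; the binarized folder is not covered by this check.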

Pre-training on TPU

We pre-train the mBART models on TPUs on Google Cloud Platform (GCP), using the pre-training data created above. We modified the Fairseq repository so that the code can run on GCP's TPUs.

bash train-scripts/pretrain-deep-mbart.sh

Finetune on Downstream MT Task

Here we give an example of fine-tuning our pre-trained models on the TED En-Uk dataset. Replace [GPU ID] with an integer (e.g., 0, 1, ...) indicating which GPU to use.

bash train-scripts/finetune-deep-ted.sh [GPU ID]  
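For instance, to use the first GPU, pass 0 as the argument. The sketch below guards against a non-integer argument before launching. The `is_gpu_id` helper is hypothetical and not part of the repository, and the last step only echoes the command as a dry run instead of starting training.

```shell
# Hypothetical helper (not part of this repo): succeed only when the argument
# is a non-negative integer such as 0, 1, or 12.
is_gpu_id() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
    *) return 0 ;;
  esac
}

gpu="${1:-0}"   # assumption: default to GPU 0 when no argument is given
if is_gpu_id "$gpu"; then
  # Dry run: print the command; remove "echo" to actually launch fine-tuning.
  echo bash train-scripts/finetune-deep-ted.sh "$gpu"
else
  echo "GPU ID must be a non-negative integer, got: $gpu" >&2
fi
```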

Evaluate on Downstream MT Task

bash train-scripts/test_ted_enuk_deep.sh [GPU ID]

Citation

If you find our work interesting and use the code in this repository, please cite our ACL 2022 paper.

@inproceedings{hu-etal-2022-deep,
    title = "{DEEP}: {DE}noising Entity Pre-training for Neural Machine Translation",
    author = "Hu, Junjie  and
      Hayashi, Hiroaki  and
      Cho, Kyunghyun  and
      Neubig, Graham",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.123",
    doi = "10.18653/v1/2022.acl-long.123",
    pages = "1753--1766",
}
