
Data Rejuvenation: Exploiting Inactive Training Examples for Neural Machine Translation

Implementation of our paper "Data Rejuvenation: Exploiting Inactive Training Examples for Neural Machine Translation" (EMNLP 2020). [paper]

Brief Introduction

Large-scale training datasets lie at the core of the recent success of neural machine translation (NMT) models. However, the complex patterns and potential noise in large-scale data make training NMT models difficult. In this work, we identify the inactive training examples that contribute little to the model performance, and show that the existence of inactive examples depends on the data distribution. We further introduce data rejuvenation to improve the training of NMT models on large-scale datasets by exploiting inactive examples. The proposed framework consists of three phases. First, we train an identification model on the original training data and use it to distinguish inactive examples from active examples by their sentence-level output probabilities. Then, we train a rejuvenation model on the active examples and use it to re-label the inactive examples with forward-translation. Finally, the rejuvenated examples and the active examples are combined to train the final NMT model. Experimental results on the WMT14 English-German and English-French datasets show that the proposed data rejuvenation consistently and significantly improves performance for several strong NMT models. Extensive analyses reveal that our approach stabilizes and accelerates the training process of NMT models, resulting in final models with better generalization capability.

Figure 1: The framework of Data Rejuvenation.
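
In code form, the three phases can be summarized as follows. This is a conceptual sketch only: the helper callables (train_nmt, sentence_prob, translate) and the inactive ratio are hypothetical stand-ins for the fairseq-based pipeline described below.

    from typing import Callable, List, Tuple

    Example = Tuple[str, str]  # (source sentence, target sentence)

    def data_rejuvenation(
        train_nmt: Callable[[List[Example]], object],
        sentence_prob: Callable[[object, Example], float],
        translate: Callable[[object, str], str],
        data: List[Example],
        inactive_ratio: float = 0.1,  # hyper-parameter: fraction treated as inactive
    ) -> object:
        # Phase 1: train an identification model and rank the training examples
        # by their sentence-level output probability.
        ident_model = train_nmt(data)
        ranked = sorted(data, key=lambda ex: sentence_prob(ident_model, ex))
        n_inactive = int(len(ranked) * inactive_ratio)
        inactive, active = ranked[:n_inactive], ranked[n_inactive:]
        # Phase 2: train a rejuvenation model on the active examples and
        # re-label the inactive examples with forward-translation.
        rejuv_model = train_nmt(active)
        rejuvenated = [(src, translate(rejuv_model, src)) for src, _ in inactive]
        # Phase 3: train the final NMT model on active + rejuvenated examples.
        return train_nmt(active + rejuvenated)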

Code Base

This implementation is based on fairseq (v0.9.0), with customized modifications to its scripts.

To start, clone this repo and install fairseq first. Run the following pip command in fairseq/:

pip install --editable .

Additional Functionalities:

  • Transformer-based LSTM;
  • Force decoding: force_decode.py;
  • Identification: identify_split.py;

Pipeline

Take the Transformer-Base model and WMT14 En-De dataset as an example.

Identification

  1. Create four folders in fairseq/.

    mkdir dataset
    mkdir data-bin
    mkdir checkpoints
    mkdir results
    

    These four folders are used as follows:

    • fairseq/dataset/: Save raw dataset with BPE.
      wmt14_en_de_base/train.en
      wmt14_en_de_base/train.de
      wmt14_en_de_base/valid.en
      wmt14_en_de_base/valid.de
      wmt14_en_de_base/test.en
      wmt14_en_de_base/test.de
      
    • fairseq/data-bin/: Save the binarized data after pre-processing.
    • fairseq/checkpoints/: Save the checkpoints of models during training.
    • fairseq/results/: Save the output results, including training log, inference output, token-wise probability, etc.
  2. Train an identification NMT model and obtain the token-wise prediction probabilities.

    • Train the NMT model on full training data of WMT14 En-De:
      sh sh_train.sh
      
    • Check the best model:
      fairseq/checkpoints/wmt14_en_de_base/checkpoint_best.pt
      
    • Force-decode the full training data:
      sh sh_forcedecode.sh
      
    • Check the token-wise probability:
      fairseq/results/wmt14_en_de_base/sample_status/status_train_[BestStep].txt
      
  3. Compute the sentence-level probabilities and split the training data into inactive and active examples.

    • Identify and split (see the sketch after this list):
      python identify_split.py
      
    • Check the inactive and active examples:
      fairseq/dataset/wmt14_en_de_base_identified/inactive.en
      fairseq/dataset/wmt14_en_de_base_identified/inactive.de
      fairseq/dataset/wmt14_en_de_base_identified/active.en
      fairseq/dataset/wmt14_en_de_base_identified/active.de
      
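
The sentence-level probability used by identify_split.py is derived from the token-wise probabilities produced by force-decoding. Below is a minimal sketch of the splitting logic, assuming each line of the status file holds whitespace-separated, strictly positive token probabilities for one training sentence (the actual format of status_train_[BestStep].txt may differ) and that the inactive ratio is a tunable hyper-parameter:

    import math

    def sentence_score(token_probs):
        # Sentence-level score: average token log-probability, i.e. the log of
        # the geometric mean of the token probabilities.
        return sum(math.log(p) for p in token_probs) / len(token_probs)

    def identify_split(status_path, src_path, tgt_path, out_dir, inactive_ratio=0.1):
        with open(status_path) as f:
            scores = [sentence_score([float(p) for p in line.split()]) for line in f]
        # Examples with the lowest sentence-level probabilities are inactive.
        order = sorted(range(len(scores)), key=scores.__getitem__)
        n_inactive = int(len(scores) * inactive_ratio)
        splits = {"inactive": order[:n_inactive], "active": order[n_inactive:]}
        with open(src_path) as f:
            src = f.read().splitlines()
        with open(tgt_path) as f:
            tgt = f.read().splitlines()
        for name, indices in splits.items():
            with open(f"{out_dir}/{name}.en", "w") as f_en, \
                 open(f"{out_dir}/{name}.de", "w") as f_de:
                for i in sorted(indices):
                    f_en.write(src[i] + "\n")
                    f_de.write(tgt[i] + "\n")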

Rejuvenation

  1. Train a rejuvenation NMT model and generate translations for the inactive examples.
    • Train the NMT model as normal but on the active examples:
      sh sh_train.sh
      
    • Check the best model:
      fairseq/checkpoints/wmt14_en_de_base_active/checkpoint_best.pt
      
    • Generate translations for the inactive examples (without --remove-bpe, so the BPE segmentation is preserved):
      sh sh_generate_extra.sh
      
    • Check the rejuvenated examples (see the pairing sketch after this list):
      fairseq/results/wmt14_en_de_base_active/inactive/source.txt
      fairseq/results/wmt14_en_de_base_active/inactive/target.txt
      fairseq/results/wmt14_en_de_base_active/inactive/decoding.txt
      
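
Each rejuvenated example is an inactive source sentence paired with its forward-translation. A minimal sketch, assuming source.txt and decoding.txt are line-aligned; the output directory name is illustrative:

    def build_rejuvenated(src_path, hyp_path, out_src, out_tgt):
        # Pair each inactive source sentence with its forward-translation.
        with open(src_path) as f_src, open(hyp_path) as f_hyp, \
             open(out_src, "w") as o_src, open(out_tgt, "w") as o_tgt:
            for src, hyp in zip(f_src, f_hyp):
                o_src.write(src)
                o_tgt.write(hyp)

    build_rejuvenated(
        "results/wmt14_en_de_base_active/inactive/source.txt",
        "results/wmt14_en_de_base_active/inactive/decoding.txt",
        "dataset/wmt14_en_de_base_rejuvenated/rejuvenated.en",
        "dataset/wmt14_en_de_base_rejuvenated/rejuvenated.de",
    )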

*Note*: A strong identification NMT model can take over the job of the rejuvenation NMT model, which saves the effort of training a new model. Examples are the large-batch configured Transformer-Big and Dynamic-Conv models.

Final NMT Model

  1. Train a final NMT model from scratch.
    • Train the NMT model on the combination of active examples and rejuvenated examples (built as sketched after this list):
      sh sh_train.sh
      
    • Check the best model:
      fairseq/checkpoints/wmt14_en_de_base_rejuvenated/checkpoint_best.pt
      
    • Evaluate on the test set:
      sh sh_generate.sh
      
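
The combined training set is simply the concatenation of the active and rejuvenated examples on each side. A minimal sketch with illustrative paths:

    def concat_files(in_paths, out_path):
        # Concatenate the active and rejuvenated halves of the training data.
        with open(out_path, "w") as out:
            for path in in_paths:
                with open(path) as f:
                    out.writelines(f)

    for lang in ("en", "de"):
        concat_files(
            [f"dataset/wmt14_en_de_base_identified/active.{lang}",
             f"dataset/wmt14_en_de_base_rejuvenated/rejuvenated.{lang}"],
            f"dataset/wmt14_en_de_base_rejuvenated/train.{lang}",
        )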

Reference Performance

We evaluate the proposed data rejuvenation approach on various SOTA architectures and two language pairs. Our data rejuvenation consistently and significantly improves translation performance in all cases, demonstrating the effectiveness and universality of the approach. It is worth noting that these improvements come without introducing any additional data or model modifications.

Table 1: Evaluation of translation performance across model architectures and language pairs.

Citation

Please kindly cite our paper if you find it helpful:

@inproceedings{jiao2020data,
  title     = {Data Rejuvenation: Exploiting Inactive Training Examples for Neural Machine Translation}, 
  author    = {Wenxiang Jiao and Xing Wang and Shilin He and Irwin King and Michael R. Lyu and Zhaopeng Tu},
  booktitle = {EMNLP},
  year      = {2020}
}
