
A Basic PyTorch Implementation of Attentional Neural Machine Translation

This is a basic implementation of attentional neural machine translation (Bahdanau et al., 2015; Luong et al., 2015) in PyTorch 0.4. It implements the model described in Luong et al., 2015, and supports label smoothing, beam-search decoding and random sampling. With a 256-dimensional LSTM hidden size, it achieves a BLEU score of 28.13 on the IWSLT 2014 German-English dataset (Ranzato et al., 2015).
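The decoder attends over the encoder hidden states with Luong-style global (multiplicative) attention. Below is a minimal sketch of that attention step, assuming batched PyTorch tensors; the function and variable names are illustrative and are not the exact code in nmt.py.

import torch
import torch.nn.functional as F

def luong_global_attention(dec_hidden, enc_outputs, src_mask):
    """Illustrative Luong-style (multiplicative) attention.

    dec_hidden:  (batch, hidden)          current decoder hidden state
    enc_outputs: (batch, src_len, hidden) encoder hidden states
    src_mask:    (batch, src_len)         1 for real tokens, 0 for padding
    """
    # Dot-product scores between the decoder state and every encoder state.
    scores = torch.bmm(enc_outputs, dec_hidden.unsqueeze(2)).squeeze(2)  # (batch, src_len)
    # Mask out padding positions before the softmax.
    scores = scores.masked_fill(src_mask == 0, float('-inf'))
    alpha = F.softmax(scores, dim=-1)                                    # attention weights
    # Weighted sum of encoder states gives the context vector.
    context = torch.bmm(alpha.unsqueeze(1), enc_outputs).squeeze(1)      # (batch, hidden)
    return context, alpha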

This codebase is used for instructional purposes in Stanford CS224N Natural Language Processing with Deep Learning and CMU 11-731 Machine Translation and Sequence-to-Sequence Models.

File Structure

  • nmt.py: contains the neural machine translation model and training/testing code.
  • vocab.py: a script that extracts vocabulary from training data.
  • util.py: contains utility/helper functions.

Example Dataset

We provide a preprocessed version of the IWSLT 2014 German-English translation task used in (Ranzato et al., 2015) [script]. To download the dataset:

wget http://www.cs.cmu.edu/~pengchey/iwslt2014_ende.zip
unzip iwslt2014_ende.zip

Running the commands above will extract a data/ folder containing the IWSLT 2014 dataset, which has about 150K German-English training sentence pairs. The data/ folder holds a copy of the public release of the dataset. Files with the suffix *.wmixerprep are pre-processed versions of the dataset from Ranzato et al., 2015, with long sentences chopped and rare words replaced by a special <unk> token (see the sketch below). You can use the pre-processed files for training and development (or come up with your own pre-processing strategy), but for testing you must use the original test files, i.e., test.de-en.(de|en).
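For reference, a minimal sketch of the rare-word replacement step is shown below. The cutoff value and function name are illustrative assumptions; this is not the exact pre-processing used to produce the *.wmixerprep files.

from collections import Counter

def replace_rare_words(sentences, min_freq=3, unk='<unk>'):
    """Replace tokens seen fewer than `min_freq` times with <unk>.

    sentences: list of tokenized sentences (lists of strings).
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok if counts[tok] >= min_freq else unk for tok in sent]
            for sent in sentences]

# Example: tokens seen fewer than 2 times are mapped to <unk>.
corpus = [['ich', 'sehe', 'einen', 'hund'], ['ich', 'sehe', 'eine', 'katze']]
print(replace_rare_words(corpus, min_freq=2))
# [['ich', 'sehe', '<unk>', '<unk>'], ['ich', 'sehe', '<unk>', '<unk>']]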

Environment

The code is written in Python 3.6 with a few supporting third-party libraries. We provide a conda environment file to set up Python 3.6 with the required libraries. Simply run

conda env create -f environment.yml

Usage

Each runnable script (nmt.py, vocab.py) is annotated using docopt. Please refer to the source files for complete usage.
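With docopt, a script's usage string doubles as its argument parser. The following is a minimal, hypothetical example of such an annotated script (not the actual docstring of nmt.py or vocab.py):

"""A toy docopt-annotated script.

Usage:
    toy.py train --train-src=<file> --train-tgt=<file> [--batch-size=<int>]

Options:
    --train-src=<file>    Source-side training file.
    --train-tgt=<file>    Target-side training file.
    --batch-size=<int>    Training batch size [default: 32].
"""
from docopt import docopt

if __name__ == '__main__':
    args = docopt(__doc__)  # parses sys.argv against the usage string above
    print(args['--train-src'], int(args['--batch-size']))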

First, we extract a vocabulary file from the training data using the command:

python vocab.py \
    --train-src=data/train.de-en.de.wmixerprep \
    --train-tgt=data/train.de-en.en.wmixerprep \
    data/vocab.json

This generates a vocabulary file data/vocab.json. The script also has options to control the frequency cutoff and the size of the generated vocabulary, which you may experiment with; a sketch of this construction appears below.
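A minimal sketch of how such a frequency-cutoff, size-limited vocabulary can be built is shown below; the function and parameter names are illustrative assumptions rather than the exact API of vocab.py.

from collections import Counter

def build_vocab(sentences, size=50000, freq_cutoff=2):
    """Build a word-to-id mapping from tokenized sentences.

    Keeps at most `size` words, dropping words seen fewer than
    `freq_cutoff` times; reserved tokens occupy the first ids.
    """
    word2id = {'<pad>': 0, '<s>': 1, '</s>': 2, '<unk>': 3}
    counts = Counter(tok for sent in sentences for tok in sent)
    # Most frequent words first, subject to the frequency cutoff.
    kept = [w for w, c in counts.most_common() if c >= freq_cutoff]
    for word in kept[:size]:
        word2id[word] = len(word2id)
    return word2id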

To start training and evaluation, simply run data/train.sh. After training and decoding, we call the official evaluation script multi-bleu.perl to compute the corpus-level BLEU score of the decoded hypotheses against the gold-standard references (see the sanity-check sketch below).
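For a quick sanity check without the Perl script, a corpus-level BLEU score can also be computed in Python, for example with sacrebleu; this is an assumed alternative, not what the training script itself calls.

import sacrebleu

# Decoded hypotheses and the corresponding gold-standard references,
# one sentence per entry.
hypotheses = ['the cat sat on the mat .']
references = ['the cat sat on the mat .']

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f'BLEU = {bleu.score:.2f}')

Note that scores from sacrebleu and multi-bleu.perl are not directly comparable because they handle tokenization differently; use multi-bleu.perl when reporting numbers against the results above.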

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
