harvardnlp / seq2seq-attn Goto Github PK

Sequence-to-sequence model with LSTM encoder/decoders and attention

Home Page: http://nlp.seas.harvard.edu/code

License: MIT License

Lua 75.22% Python 24.78%

seq2seq-attn's Introduction

Sequence-to-Sequence Learning with Attentional Neural Networks

UPDATE: Check-out the beta release of OpenNMT a fully supported feature-complete rewrite of seq2seq-attn. Seq2seq-attn will remain supported, but new features and optimizations will focus on the new codebase.

Torch implementation of a standard sequence-to-sequence model with (optional) attention where the encoder-decoder are LSTMs. Encoder can be a bidirectional LSTM. Additionally has the option to use characters (instead of input word embeddings) by running a convolutional neural network followed by a highway network over character embeddings to use as inputs.

The attention model is from Effective Approaches to Attention-based Neural Machine Translation, Luong et al. EMNLP 2015. We use the global-general-attention model with the input-feeding approach from the paper. Input-feeding is optional and can be turned off.

The character model is from Character-Aware Neural Language Models, Kim et al. AAAI 2016.

There are a lot of additional options on top of the baseline model, mainly thanks to the fantastic folks at SYSTRAN. Specifically, there are functionalities which implement:

Effective Approaches to Attention-based Neural Machine Translation. Luong et al., EMNLP 2015.
Character-based Neural Machine Translation. Costa-Jussa and Fonollosa, ACL 2016.
Compression of Neural Machine Translation Models via Pruning. See et al., COLING 2016.
Sequence-Level Knowledge Distillation. Kim and Rush., EMNLP 2016.
Deep Recurrent Models with Fast Forward Connections for Neural Machine Translation. Zhou et al, TACL 2016.
Guided Alignment Training for Topic-Aware Neural Machine Translation. Chen et al., arXiv:1607.01628.
Linguistic Input Features Improve Neural Machine Translation. Senrich et al., arXiv:1606.02892

See below for more details on how to use them.

This project is maintained by Yoon Kim. Feel free to post any questions/issues on the issues page.

Dependencies

Python

h5py
numpy

Lua

You will need the following packages:

hdf5
nn
nngraph

GPU usage will additionally require:

cutorch
cunn

If running the character model, you should also install:

cudnn
luautf8

Quickstart

We are going to be working with some example data in data/ folder. First run the data-processing code

python preprocess.py --srcfile data/src-train.txt --targetfile data/targ-train.txt
--srcvalfile data/src-val.txt --targetvalfile data/targ-val.txt --outputfile data/demo

This will take the source/target train/valid files (src-train.txt, targ-train.txt, src-val.txt, targ-val.txt) and make some hdf5 files to be consumed by Lua.

demo.src.dict: Dictionary of source vocab to index mappings. demo.targ.dict: Dictionary of target vocab to index mappings. demo-train.hdf5: hdf5 containing the train data. demo-val.hdf5: hdf5 file containing the validation data.

The *.dict files will be needed when predicting on new data.

Now run the model

th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model

This will run the default model, which consists of a 2-layer LSTM with 500 hidden units on both the encoder/decoder. You can also add -gpuid 1 to use (say) GPU 1 in the cluster.

Now you have a model which you can use to predict on new data. To do this we are going to be running beam search

th evaluate.lua -model demo-model_final.t7 -src_file data/src-val.txt -output_file pred.txt
-src_dict data/demo.src.dict -targ_dict data/demo.targ.dict

This will output predictions into pred.txt. The predictions are going to be quite terrible, as the demo dataset is small. Try running on some larger datasets! For example you can download millions of parallel sentences for translation or summarization.

Details

Preprocessing options (`preprocess.py`)

srcvocabsize, targetvocabsize: Size of source/target vocabularies. This is constructed by taking the top X most frequent words. Rest are replaced with special UNK tokens.
srcfile, targetfile: Path to source/target training data, where each line represents a single source/target sequence.
srcvalfile, targetvalfile: Path to source/target validation data.
batchsize: Size of each mini-batch.
seqlength: Maximum sequence length (sequences longer than this are dropped).
outputfile: Prefix of the output file names.
maxwordlength: For the character models, words are truncated (if longer than maxwordlength) or zero-padded (if shorter) to maxwordlength.
chars: If 1, construct the character-level dataset as well. This might take up a lot of space depending on your data size, so you may want to break up the training data into different shards.
srcvocabfile, targetvocabfile: If working with a preset vocab, then including these paths will ignore the srcvocabsize,targetvocabsize.
unkfilter: Ignore sentences with too many UNK tokens. Can be an absolute count limit (if > 1) or a proportional limit (0 < unkfilter < 1).
shuffle: Shuffle sentences.
alignfile, alignvalfile: If provided with filenames that contain 'Pharaoh' format alignment on the train and validation data, source-to-target alignments are stored in the dataset.

Training options (`train.lua`)

Data options

data_file, val_data_file: Path to the training/validation *.hdf5 files created from running preprocess.py.
savefile: Savefile name (model will be saved as savefile_epochX_PPL.t7 after every save_every epoch where X is the X-th epoch and PPL is the validation perplexity at the epoch.
num_shards: If the training data has been broken up into different shards, then this is the number of shards.
train_from: If training from a checkpoint then this is the path to the pre-trained model.

Model options

num_layers: Number of layers in the LSTM encoder/decoder (i.e. number of stacks).
rnn_size: Size of LSTM hidden states.
word_vec_size: Word embedding size.
attn: If = 1, use attention over the source sequence during decoding. If = 0, then it uses the last hidden state of the encoder as the context at each time step.
brnn: If = 1, use a bidirectional LSTM on the encoder side. Input embeddings (or CharCNN if using characters) are shared between the forward/backward LSTM, and hidden states of the corresponding forward/backward LSTMs are added to obtain the hidden representation for that time step.
use_chars_enc: If = 1, use characters on the encoder side (as inputs).
use_chars_dec: If = 1, use characters on the decoder side (as inputs).
reverse_src: If = 1, reverse the source sequence. The original sequence-to-sequence paper found that this was crucial to achieving good performance, but with attention models this does not seem necessary. Recommend leaving it to 0.
init_dec: Initialize the hidden/cell state of the decoder at time 0 to be the last hidden/cell state of the encoder. If 0, the initial states of the decoder are set to zero vectors.
input_feed: If = 1, feed the context vector at each time step as additional input (via concatenation with the word embeddings) to the decoder.
multi_attn: If > 0, then use a multi-attention on this layer of the decoder. For example, if num_layers = 3 and multi_attn = 2, then the model will do an attention over the source sequence on the second layer (and use that as input to the third layer) and the penultimate layer. We've found that this did not really improve performance on translation, but may be helpful for other tasks where multiple attentional passes over the source sequence are required (e.g. for more complex reasoning tasks).
res_net: Use residual connections between LSTM stacks whereby the input to the l-th LSTM layer of the hidden state of the l-1-th LSTM layer summed with hidden state of the l-2th LSTM layer. We didn't find this to really help in our experiments.

Below options only apply if using the character model.

char_vec_size: If using characters, size of the character embeddings.
kernel_width: Size (i.e. width) of the convolutional filter.
num_kernels: Number of convolutional filters (feature maps). So the representation from characters will have this many dimensions.
num_highway_layers: Number of highway layers in the character composition model.

To build a model with guided alignment (implemented similarly to Guided Alignment Training for Topic-Aware Neural Machine Translation (Chen et al. 2016)):

guided_alignment: If 1, use external alignments to guide the attention weights
guided_alignment_weight: weight for guided alignment criterion
guided_alignment_decay: decay rate per epoch for alignment weight

Optimization options

epochs: Number of training epochs.
start_epoch: If loading from a checkpoint, the epoch from which to start.
param_init: Parameters of the model are initialized over a uniform distribution with support (-param_init, param_init).
optim: Optimization method, possible choices are 'sgd', 'adagrad', 'adadelta', 'adam'. For seq2seq I've found vanilla SGD to work well but feel free to experiment.
learning_rate: Starting learning rate. For 'adagrad', 'adadelta', and 'adam', this is the global learning rate. Recommended settings vary based on optim: sgd (learning_rate = 1), adagrad (learning_rate = 0.1), adadelta (learning_rate = 1), adam (learning_rate = 0.1).
layer_lrs: Comma-separated learning rates for encoder, decoder, and generator when using 'adagrad', 'adadelta', or 'adam' for 'optim' option. Layer-specific learning rates cannot currently be used with sgd.
max_grad_norm: If the norm of the gradient vector exceeds this, renormalize to have its norm equal to max_grad_norm.
dropout: Dropout probability. Dropout is applied between vertical LSTM stacks.
lr_decay: Decay learning rate by this much if (i) perplexity does not decrease on the validation set or (ii) epoch has gone past the start_decay_at epoch limit.
start_decay_at: Start decay after this epoch.
curriculum: For this many epochs, order the minibatches based on source sequence length. (Sometimes setting this to 1 will increase convergence speed).
feature_embeddings_dim_exponent: If the additional feature takes N values, then the embbeding dimension will be set to N^exponent.
pre_word_vecs_enc: If using pretrained word embeddings (on the encoder side), this is the path to the *.hdf5 file with the embeddings. The hdf5 should have a single field word_vecs, which references an array with dimensions vocab size by embedding size. Each row should be a word embedding and follow the same indexing scheme as the *.dict files from running preprocess.py. In order to be consistent with beam.lua, the first 4 indices should always be <blank>, <unk>, <s>, </s> tokens.
pre_word_vecs_dec: Path to *.hdf5 for pretrained word embeddings on the decoder side. See above for formatting of the *.hdf5 file.
fix_word_vecs_enc: If = 1, fix word embeddings on the encoder side.
fix_word_vecs_dec: If = 1, fix word embeddings on the decoder side.
max_batch_l: Batch size used to create the data in preprocess.py. If this is left blank (recommended), then the batch size will be inferred from the validation set.

Other options

start_symbol: Use special start-of-sentence and end-of-sentence tokens on the source side. We've found this to make minimal difference.
gpuid: Which GPU to use (-1 = use cpu).
gpuid2: If this is >=0, then the model will use two GPUs whereby the encoder is on the first GPU and the decoder is on the second GPU. This will allow you to train bigger models.
cudnn: Whether to use cudnn or not for convolutions (for the character model). cudnn has much faster convolutions so this is highly recommended if using the character model.
save_every: Save every this many epochs.
print_every: Print various stats after this many batches.
seed: Change the random seed for random numbers in torch - use that option to train alternate models for ensemble
prealloc: when set to 1 (default), enable memory preallocation and sharing between clones - this reduces by a lot the used memory - there should not be any situation where you don't need it. Also - since memory is preallocated, there is not (major) memory increase during the training. When set to 0, it rolls back to original memory optimization.

Decoding options (`beam.lua`)

model: Path to model .t7 file.
src_file: Source sequence to decode (one line per sequence).
targ_file: True target sequence (optional).
output_file: Path to output the predictions (each line will be the decoded sequence).
src_dict: Path to source vocabulary (*.src.dict file from preprocess.py).
targ_dict: Path to target vocabulary (*.targ.dict file from preprocess.py).
feature_dict_prefix: Prefix of the path to the features vocabularies (*.feature_N.dict files from preprocess.py).
char_dict: Path to character vocabulary (*.char.dict file from preprocess.py).
beam: Beam size (recommend keeping this at 5).
max_sent_l: Maximum sentence length. If any of the sequences in srcfile are longer than this it will error out.
simple: If = 1, output prediction is simply the first time the top of the beam ends with an end-of-sentence token. If = 0, the model considers all hypotheses that have been generated so far that ends with end-of-sentence token and takes the highest scoring of all of them.
replace_unk: Replace the generated UNK tokens with the source token that had the highest attention weight. If srctarg_dict is provided, it will lookup the identified source token and give the corresponding target token. If it is not provided (or the identified source token does not exist in the table) then it will copy the source token.
srctarg_dict: Path to source-target dictionary to replace UNK tokens. Each line should be a source token and its corresponding target token, separated by |||. For example

hello|||hallo
ukraine|||ukrainische

This dictionary can be obtained by, for example, running an alignment model as a preprocessing step. We recommend fast_align.

score_gold: If = 1, score the true target output as well.
n_best: If > 1, then it will also output an n_best list of decoded sentences in the following format.

1 ||| sentence_1 ||| sentence_1_score
2 ||| sentence_2 ||| sentence_2_score

gpuid: ID of the GPU to use (-1 = use CPU).
gpuid2: ID if the second GPU (if specified).
cudnn: If the model was trained with cudnn, then this should be set to 1 (otherwise the model will fail to load).
rescore: when set to scorer name, use scorer to find hypothesis with highest score - available 'bleu', 'gleu'
rescore_param: parameter to rescorer - for bleu/gleu ngram length

Using additional input features

Linguistic Input Features Improve Neural Machine Translation (Senrich et al. 2016) shows that translation performance can be increased by using additional input features.

Similarly to this work, you can annotate each word in the source text by using the -|- separator:

word1-|-feat1-|-feat2 word2-|-feat1-|-feat2

It supports an arbitrary number of features with arbitrary labels. However, all input words must have the same number of annotations. See for example data/src-train-case.txt which annotates each word with the case information.

To evaluate the model, the option -feature_dict_prefix is required on evaluate.lua which points to the prefix of the features dictionnaries generated during the preprocessing.

Pruning a model

Compression of Neural Machine Translation Models via Pruning (See et al. 2016) shows that a model can be aggressively pruned while keeping the same performace.

To prune a model - you can use prune.lua which implement class-bind, and class-uniform pruning technique from the paper.

model: the model to prune
savefile: name of the pruned model
gpuid: Which gpu to use. -1 = use CPU. Depends if the model is serialized for GPU or CPU
ratio: pruning rate
prune: pruning technique blind or uniform, by default blind

note that the pruning cut connection with lowest weight in the linear models by using a boolean mask. The size of the file is a little larger since it stores the actual full matrix and the binary mask.

Models can be retrained - typically you can recover full capacity of a model pruned at 60% or even 80% by few epochs of additional trainings.

Switching between GPU/CPU models

By default, the model will always save the final model as a CPU model, but it will save the intermediate models as a CPU/GPU model depending on how you specified -gpuid. If you want to run beam search on the CPU with an intermediate model trained on the GPU, you can use convert_to_cpu.lua to convert the model to CPU and run beam search.

GPU memory requirements/Training speed

Training large sequence-to-sequence models can be memory-intensive. Memory requirements will dependent on batch size, maximum sequence length, vocabulary size, and (obviously) model size. Here are some benchmark numbers on a GeForce GTX Titan X. (assuming batch size of 64, maximum sequence length of 50 on both the source/target sequence, vocabulary size of 50000, and word embedding size equal to rnn size):

(prealloc = 0)

1-layer, 100 hidden units: 0.7G, 21.5K tokens/sec
1-layer, 250 hidden units: 1.4G, 14.1K tokens/sec
1-layer, 500 hidden units: 2.6G, 9.4K tokens/sec
2-layers, 500 hidden units: 3.2G, 7.4K tokens/sec
4-layers, 1000 hidden units: 9.4G, 2.5K tokens/sec

Thanks to some fantastic work from folks at SYSTRAN, turning prealloc on will lead to much more memory efficient training

(prealloc = 1)

1-layer, 100 hidden units: 0.5G, 22.4K tokens/sec
1-layer, 250 hidden units: 1.1G, 14.5K tokens/sec
1-layer, 500 hidden units: 2.1G, 10.0K tokens/sec
2-layers, 500 hidden units: 2.3G, 8.2K tokens/sec
4-layers, 1000 hidden units: 6.4G, 3.3K tokens/sec

Tokens/sec refers to total (i.e. source + target) tokens processed per second. If using different batch sizes/sequence length, you should (linearly) scale the above numbers accordingly. You can make use of memory on multiple GPUs by using -gpuid2 option in train.lua. This will put the encoder on the GPU specified by -gpuid, and the decoder on the GPU specified by -gpuid2.

Evaluation

For translation, evaluation via BLEU can be done by taking the output from beam.lua and using the multi-bleu.perl script from Moses. For example

perl multi-bleu.perl gold.txt < pred.txt

Evaluation of States and Attention

attention_extraction.lua can be used to extract the attention and the LSTM states. It uses the following (required) options:

model: Path to model .t7 file.
src_file: Source sequence to decode (one line per sequence).
targ_file: True target sequence.
src_dict: Path to source vocabulary (*.src.dict file from preprocess.py).
targ_dict: Path to target vocabulary (*.targ.dict file from preprocess.py).

Output of the script are two files, encoder.hdf5 and decoder.hdf5. The encoder contains the states for every layer of the encoder LSTM and the offsets for the start of each source sentence. The decoder contains the states for the decoder LSTM layers and the offsets for the start of gold sentence. It additionally contains the attention for each time step (if the model uses attention).

Pre-trained models

We've uploaded English <-> German models trained on 4 million sentences from Workshop on Machine Translation 2015. Download link is below:

https://drive.google.com/open?id=0BzhmYioWLRn_aEVnd0ZNcWd0Y2c

These models are 4-layer LSTMs with 1000 hidden units and essentially replicates the results from Effective Approaches to Attention-based Neural Machine Translation, Luong et al. EMNLP 2015.

Acknowledgments

Our implementation utilizes code from the following:

Licence

MIT

seq2seq-attn's People

Contributors

Stargazers

Watchers

Forkers

mbartoli ml-ai-nlp-ir vnguyen01 harrywy nieshaoshuai amoliu little1tow hitluobin peratham wanjinchang adrianhust peterzhang2029 qiuyuew crusherx2 baiyancheng20 willitpass hitwsl jhnlp yanadm ml-lab akbari59 mkorpusik ana2s007 jhcmine binbinbian jungikim plsang fedorajzf yukehit xi-qian barbarapinho yangjunpro stevenlol babaozhouy5 chenb67 chagge i55code fatmas1982 igormq nicolas-ivanov ehosseiniasl rollingstone dpanshu neo4reo smopart gregorysenay adi007 systran chenyulinx nhittt imclab zuiwufenghua kevinwenya liuxuan0526 penghuangcn matrix-revolution g10dras meteora9479 terrencew hfxunlp lemaoliu srush wellwang kirk86 frankchu0229 saikswaroop sohuren nguyenducnhaty mnrmja007 shyamalschandra coll3ctions barneyeldinosaurio codeaudit n0madsky kuixu mdsamiulhaquesunny oztc benjamesbabala cc13ny selimam eriche2016 lijian8 0ldm0s mhjabreel clear-datacenter xshhhm cfdenton mdasadul mingstupid amds123 luohongyin pengcheng-wang albertxiebnu vyraun ili3p jlorieux wantongtang wrapperband twistedmove mzhai2

seq2seq-attn's Issues

beam.lua needs a -model argument

also relatedly not giving -outfile should default to stdout

beam.lua should work with cuda or non-cuda

Currently fails when reading a cuda tensor based model. Short-term it should probably "require 'cunn'", long-term we probably should support both modes.

cuda runtime error on ubuntu 16.04 LTS / cuda 8.0

I'm seeing

THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-299/cutorch/lib/THC/THCGeneral.c line=608 error=4 : unspecified launch failure
/home/ankit/torch/install/bin/luajit: cuda runtime error (4) : unspecified launch failure at /tmp/luarocks_cutorch-scm-1-299/cutorch/lib/THC/generic/THCStorage.c:147

when running default training. i'm on ubuntu 16.04 with cuda 8.0 RC. it seems likely that there's a problem with cutorch and cuda 8.0 (the failure happens sometimes before a single batch is evaluated on, sometimes after -- the nondeterminism makes me think it's a fundamental driver problem), but before i either switch back to cuda 7.5 or ubuntu 14.04 or dive into the cutorch code, I was wondering if anything sticks out to you concerning this bug.

Local Attention Model and Content-based Function

Hello guys,

Thanks for sharing this amazing repository with us! It will be very useful for a lot of people :-)

Quick questions:

Which of the content-based functions are you using: dot, general or concat?
Do you guys have any plans of implementing the local attention part of Luong's et al. (2015) paper?

Thanks again for sharing the code

'nn' requirement missing

The package 'nn' is a dependency for 'cunn' but it does not appear on the list in the README. Should probably be added to the list.
Thanks

out of memory during the training

I am trying to train a seq2seq model like it using my own dataset

th train.lua -data_file ../../data/demo-train.hdf5 -val_data_file ../../data/demo-val.hdf5 -savefile model-demo -brnn 1 -attn 1 -multi_attn 3 -gpuid 3 -max_batch_l 64 -lr_decay 0.8 -num_layers 4 -rnn_size 1024 -dropout 0.3 -curriculum 1

At the beginning, it works well. However, after 4800 batch, it threw an error about out of memory.

My machine has 196G memory with Tesla K40m. Since the training has begun, why it would fail during the training.

Epoch: 1, Batch: 4800/15072, Batch size: 64, LR: 1.0000, PPL: 246.53, |Param|: 690.31, |GParam|: 5.21, Training: 1817/1308/509 total/source/target tokens/sec
Epoch: 1, Batch: 4850/15072, Batch size: 64, LR: 1.0000, PPL: 244.21, |Param|: 691.18, |GParam|: 5.53, Training: 1816/1311/505 total/source/target tokens/sec
Epoch: 1, Batch: 4900/15072, Batch size: 64, LR: 1.0000, PPL: 241.79, |Param|: 692.06, |GParam|: 5.30, Training: 1815/1314/501 total/source/target tokens/sec
Epoch: 1, Batch: 4950/15072, Batch size: 64, LR: 1.0000, PPL: 239.34, |Param|: 692.97, |GParam|: 5.76, Training: 1814/1317/496 total/source/target tokens/sec
Epoch: 1, Batch: 5000/15072, Batch size: 64, LR: 1.0000, PPL: 236.97, |Param|: 693.87, |GParam|: 6.32, Training: 1813/1321/492 total/source/target tokens/sec
Epoch: 1, Batch: 5050/15072, Batch size: 64, LR: 1.0000, PPL: 234.39, |Param|: 694.76, |GParam|: 5.80, Training: 1812/1324/488 total/source/target tokens/sec
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-2002/cutorch/lib/THC/generic/THCStorage.cu line=40 error=2 : out of memory
/home/work/dev/torch/distro/install/bin/luajit: ...v/torch/distro/install/share/lua/5.1/nngraph/nesting.lua:34: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-2002/cutorch/lib/THC/generic/THCStorage.cu:40
stack traceback:
[C]: in function 'resizeAs'
...v/torch/distro/install/share/lua/5.1/nngraph/nesting.lua:34: in function 'resizeNestedAs'
...v/torch/distro/install/share/lua/5.1/nngraph/gmodule.lua:37: in function 'getTotalGradOutput'
...v/torch/distro/install/share/lua/5.1/nngraph/gmodule.lua:404: in function 'neteval'
...v/torch/distro/install/share/lua/5.1/nngraph/gmodule.lua:454: in function 'updateGradInput'
...v/torch/distro/install/share/lua/5.1/nngraph/gmodule.lua:420: in function 'neteval'
...v/torch/distro/install/share/lua/5.1/nngraph/gmodule.lua:454: in function 'updateGradInput'
...ork/dev/torch/distro/install/share/lua/5.1/nn/Module.lua:31: in function 'backward'
train.lua:535: in function 'train_batch'
train.lua:745: in function 'train'
train.lua:1071: in function 'main'
train.lua:1074: in main chunk
[C]: in function 'dofile'
...rch/distro/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405810

Option for n-best output translations

moses format is something like

1 ||| sent
2 ||| sent

How to run char-cnn model in this code?

I think this code is not ready for the char-cnn model. Because some options are not used in the main codes, e.g. '-char_vec_size', '-kernel_width' and '-num_kernels'.

Char2Char, Performance, Speed, Training tips?

Thanks for the repository!
I have a few questions regarding performance (speed and evaluation) and training on word and char level.

You truncate the length to 50 on word, what about on chars?
What was your BLEU findings? On word and char level? (5 BLEU on chars after 5 epochs WMT'15, 15 BLEU on words after 5 epochs Europarl, etc.?)
In the training speed you mention tokens, how many batches is a token? e.g. for 64 batch_size and 50 seq_len, is a 20,000 tokens/sec = 20,000 / (batch_size x 2 x seq_len) = 3.125 batches/second?
What was your biggest challenges/takeaways with going to chars instead of using words?
How did you regularize the model?

Thanks!

Pretrained Model

Hi all,

I was running the test in pretrained model, I received the following error:

th beam.lua -model trained_models/en-to-de-model.t7 -src_file data/german/src-val.txt -output_file german_pred.txt -src_dict trained_models/en.dict -targ_dict trained_models/de.dict
loading trained_models/en-to-de-model.t7...
done!
SENT 1: Parliament Does Not Support Amendment Freeing Tymoshenko
/home/zzhong/torch/install/bin/luajit: beam.lua:163: attempt to perform arithmetic on field 'input_feed' (a nil value)
stack traceback:
beam.lua:163: in function 'generate_beam'
beam.lua:632: in function 'main'
beam.lua:667: in main chunk
[C]: in function 'dofile'
...hong/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x004064d0

Cheers,
Zhong

error loading module 's2sa.models' ... ambiguous syntax (function call x new statement)

I get this error when running the model:

 ...allations/torch/install/share/lua/5.1/trepl/init.lua:384: error loading module 's2sa.models' from file './s2sa/models.lua':
        ./s2sa/models.lua:85: ambiguous syntax (function call x new statement) near '('
stack traceback:
        [C]: in function 'error'
        ...allations/torch/install/share/lua/5.1/trepl/init.lua:384: in function 'require'
        train.lua:6: in main chunk
        [C]: in function 'dofile'
        .../torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: ?

The line in question is:

   x = nn.JoinTable(2):usePrealloc("dec_inputfeed_join",
                                             {{opt.max_batch_l, opt.word_vec_size},{opt.max_batch_l, opt.rnn_size}})
                                ({x, inputs[1+offset]}) -- batch_size x (word_vec_size + rnn_size)

And the same issue is present in a couple of places in s2sa.models. The issue is usePrealloc returns self and then we call again a function on this returned self. Changing this to:

 x = nn.JoinTable(2):usePrealloc("dec_inputfeed_join",{{opt.max_batch_l, opt.word_vec_size},{opt.max_batch_l, opt.rnn_size}})
x = x({x, inputs[1+offset]}) -- batch_size x (word_vec_size + rnn_size)

creates other issues later on.

optimization with optim package

Hi ,great job.
It seems that the optimization part is done manually , i wonder whether can we combine optim package (or some other automatic way)with it or not, if yes, how to do that, i am a new comer to torch, thank you in advance.

Creation of vocabularies

Hi,

I observed that the preprocess.py script uses both training and validation corpora to generate the word dictionaries. Generally the source and target vocabularies are created only from training corpora. What's the intuition behind using both because you will never update the embeddings of the words only appearing in validation set?

bad argument #2 to '?' (end index out of bound) error

I followed all the steps specified in your README.md to try to implement a baseline RNN Attention Encoder Decoder model by Luong et al. 2015. I had no problems until I reached the actual training part.
When I give the train.lua script, I get this error. How do I fix it to run the model as it should?

I have Nvidia GeForce 650M 2 GB GPU with 384 cores. I have cuda 7.5 and cudnn 4. Please help

th train.lua -data_file data/demo-train.hdf5 -val_data_file data/d
emo-val.hdf5 -savefile demo-model
using CUDA on GPU 1...  
loading data... 
done!   
Source vocab size: 50004, Target vocab size: 50004  
Source max sent len: 50, Target max sent len: 52    
Number of additional features on source side: 0 
Switching on memory preallocation   
Number of parameters: 54338004 (active: 54338004)   
/home/hans/torch/install/bin/luajit: bad argument #2 to '?' (end index out of bound)
stack traceback:
    [C]: at 0x7f558238c530
    [C]: in function '__index'
    train.lua:394: in function 'train_batch'
    train.lua:745: in function 'train'
    train.lua:1071: in function 'main'
    train.lua:1074: in main chunk
    [C]: in function 'dofile'
    ...hans/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405d50

Dropout rate

Hi,
Thanks for the great code.
I am wondering is there any tricks when trying to tune the dropout probability of LSTM?

Thanks

Shards

Hi,

May you show an example on how shards is used?

Also, is there momentum in the code? Thanks!

Cheers,
Zhong

preprocess fails on sample data

python preprocess.py --srcfile data/src-train.txt --targetfile data/targ-train.txt
fails with the following output:
Number of sentences in training: 10000
Traceback (most recent call last):
File "preprocess.py", line 343, in
sys.exit(main(sys.argv[1:]))
File "preprocess.py", line 340, in main
get_data(args)
File "preprocess.py", line 257, in get_data
args.seqlength, max_word_l, args.chars)
File "preprocess.py", line 79, in make_vocab
enumerate(itertools.izip(open(srcfile,'r'), open(targetfile,'r'))):
TypeError: coercing to Unicode: need string or buffer, NoneType found

single directory for dictionaries

It seems potentially buggy to have to specify the source and target and character dictionary at decoding. Maybe we can just point to a directory with the dicts having fixed names? (this is what moses does)

LinearNoBias not define when train on demo data

I have setup cuda and torch on an aws gpu instances (ubuntu 14.04). I encountered this error when I tried to train on demo data:

dependent packages are installed as in readme. I tried to add require "util" in train but it did not work. No other code changes were made yet.

ubuntu@ip-xx-xx-xx-xx:/mnt/seq2seq-attn$ th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model
loading data...
done!
Source vocab size: 28721, Target vocab size: 42787
Source max sent len: 50, Target max sent len: 52
/usr/local/bin/luajit: ./models.lua:89: attempt to call field 'LinearNoBias' (a nil value)
stack traceback:
./models.lua:89: in function 'make_lstm'
train.lua:545: in function 'main'
train.lua:584: in main chunk
[C]: in function 'dofile'
/usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406260

Please let me know if i should attach more information.

Describe memory used

We should make it more clear how much memory is needed. For instance something like "if you have 2gb gpu use this setting".

Readme update -output_file rather than -out_file

On the Readme, I believe the argument for making predictions should be -output_file rather than just -out_file. The code worked when I typed in the former rather than the latter. -out_file is the argument under the Quickstart section.

Data parallel training

I noticed that the model allows for "model parallel" training. Does the code automatically support "data parallel" training(without explicitly assigning data to different GPUs)?

My goal is to see if I can get approximately linear speed up using the 16 GPU machines from AWS.

Optimization issues

Greetings!
I've tried both sgd and adagrad optimizers provided by defaul and with both of them I failed to train any good models.

While training with sgd I took the default params and the model converged after 15 epochs with the perplexity of 74 on the training set and 125 on the validation set. Learning rate dropped almost to 0 at this point. Apparently, the vanilla sgd with the proposed learning decay strategy is not the best choice...

Hence I hooked up adagrad for training 4 slightly different models (there is a difference in rnn_size, num of layers, dropout and in the usage of bidir_lstm for the encoder, however for all the models starting learning rate is 1 and learning decay 0.5, even though I assume the latter plays no role when using adagrad). After training the models on my GPUs for almost a day I still get perplexity values that at best have 14 digits :)

Here is the sample of the training logs:

Train   17676577270078  
Valid   3.9688841882487e+19
saving checkpoint to no_feed_epoch16.00_39688841882487062528.00.t7
Epoch: 17, Batch: 250/8274, Batch size: 64, LR: 1.0000, PPL: 4725532035834.91, |Param|: 67331.00, |GParam|: 137.76, Training: 4320/1044/3276 total/source/target tokens/sec
Epoch: 17, Batch: 500/8274, Batch size: 64, LR: 1.0000, PPL: 3359570332666.66, |Param|: 67335.08, |GParam|: 215.59, Training: 4315/1035/3279 total/source/target tokens/sec
Epoch: 17, Batch: 750/8274, Batch size: 64, LR: 1.0000, PPL: 5181787313557.71, |Param|: 67339.10, |GParam|: 21.83, Training: 4314/1036/3277 total/source/target tokens/sec

That's a bit too much. Am I doing something wrong or is this an expected behaviour?

Lastly, are you planning to incorporate other optimisers from torch/optim package?

Describe evaluation

Let's add a recommendation for computing BLEU score on the output as well, and give a pointer to another library.

Train from issue

Hi, sorry for not replying! I am still having the same issue where loading a model gives the empty tensor error. have you had this issue still?

Training Seq2Seq models for modelling conversations

Hello,

This is not an issue but a call for some guidance. Has anyone tried using this amazing project to train seq2seq models for generating conversations like [https://arxiv.org/pdf/1506.05869.pdf]

Thanks

Character vectors

Hello,
I was wondering what model or method is used in this code in order to obtain the character vectors?
Are these character vectors available online? (like word2vec or GloVe for word vectors)
Thank you.

Using this for sorting

Hi, I generated my own dataset with string sequences of random numbers and associated sorted string sequences or floats. I think I trained the network correctly - just followed the instructions on the page. Does it make sense to try to train this network to learn how to sort float values?

When I ran the command:
th evaluate.lua -model demo-model_final.t7 -src_file data/src-val.txt -output_file pred.txt
-src_dict data/demo.src.dict -targ_dict data/demo.targ.dict

Nothing was written to predict.txt, so I'm guessing I did something wrong or this network cannot learn to sort.

One questions on the problem of attention model

Hi, Kim. Currently i am working on attention model too. But I want to add some regularization term on the attention weights to the loss function, then how to train such model model in torch? can you give me some advice? because the attention weights are also affected by the parameters of the model, i am confused. Very grateful if you give me some advices?

Error while computing probabilities of gold data using a model without attention

Hi there,

I seem to be having a problem when trying to compute the probabilities of the gold data under some model trained without attention. While I'm able to decode and obtain predictions, when scoring the gold data (i.e., by using the option parameter -targ_file data/targ-val.txt on the demo data), the model produces the following error.

As a note, everything works smoothly if I train with attention.

command

th beam.lua -model demo-model_epoch1.00_3095.37.t7  -src_file data/src-val.txt -targ_file data/targ-val.txt  -output_file pred.txt -src_dict data/demo.src.dict -targ_dict data/demo.targ.dict  -gpuid 1

output

loading demo-model_epoch1.00_3095.37.t7...  
done!   
loading GOLD labels at data/targ-val.txt    
SENT 1: Parliament Does Not Support Amendment Freeing Tymoshenko    
/auto/rcf-40/al_227/torch/install/bin/luajit: ...f-40/al_227/torch/install/share/lua/5.1/nn/JoinTable.lua:39: bad argument #1 to 'copy' (sizes do not match at /tmp/luarocks_cutorch-scm-1-3925/cutorch/lib/THC/generic/THCTensorCopy.cu:10)
stack traceback:
    [C]: in function 'copy'
    ...f-40/al_227/torch/install/share/lua/5.1/nn/JoinTable.lua:39: in function 'func'
    ...0/al_227/torch/install/share/lua/5.1/nngraph/gmodule.lua:311: in function 'neteval'
    ...0/al_227/torch/install/share/lua/5.1/nngraph/gmodule.lua:346: in function 'forward'
    beam.lua:325: in function 'generate_beam'
    beam.lua:637: in function 'main'
    beam.lua:672: in main chunk
    [C]: in function 'dofile'
    ..._227/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00405800

Cheers,
Angeliki

Train from logical error

I am trying to resume training from checkpoint file and even though it says loaded model, the perplexity restarts at weight initialization level and the accuracy of translation when I use evaluate.lua also seems to indicate that the model is simply reinitializing the vectors instead of loading from checkpoint.

Is this an issue with the API? What am I doing wrong?

.......
Epoch: 4, Batch: 11850/11961, Batch size: 16, LR: 0.1000, PPL: 2565.87, |Param|: 5479.77, |GParam|: 44.02, Training: 134/65/69 total/source/target tokens/sec   
Epoch: 4, Batch: 11900/11961, Batch size: 16, LR: 0.1000, PPL: 2573.56, |Param|: 5480.11, |GParam|: 46.07, Training: 134/65/69 total/source/target tokens/sec   
Epoch: 4, Batch: 11950/11961, Batch size: 16, LR: 0.1000, PPL: 2580.50, |Param|: 5480.42, |GParam|: 90.12, Training: 134/65/69 total/source/target tokens/sec   
Train   2582.1220978721 
Valid   2958.3082902242 
saving checkpoint to demo-model_epoch4.00_2958.31.t7    
Script started on Monday 24 October 2016 08:55:52 AM IST
hans@hans-Lenovo-IdeaPad-Y500:~/seq2seq-attn-master$ th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model
using CUDA on GPU 1...
loading data...
done!
Source vocab size: 50004, Target vocab size: 150004
Source max sent len: 50, Target max sent len: 52
Number of additional features on source side: 0
Switching on memory preallocation
loading demo-model_epoch4.00_2958.31.t7...
Number of parameters: 84236504 (active: 84236504)
Epoch: 5, Batch: 50/11961, Batch size: 16, LR: 0.0500, PPL: 375825299.43, |Param|: 5407.84, |GParam|: 503.37, Training: 131/61/69 total/source/target tokens/sec
Epoch: 5, Batch: 100/11961, Batch size: 16, LR: 0.0500, PPL: 145308733.29, |Param|: 5407.19, |GParam|: 130.81, Training: 132/63/69 total/source/target tokens/sec
Epoch: 5, Batch: 150/11961, Batch size: 16, LR: 0.0500, PPL: 85249666.69, |Param|: 5406.86, |GParam|: 1190.36, Training: 133/64/69 total/source/target tokens/sec

cool egyptian name

Traditionally mt systems are often named after egyptian stuff. I found you a good secondary source detailing the process http://www.buzzfeed.com/FeverGlam/the-ancient-egyptians-were-total-badasses-amaq

cudnn and combined char+word model

Hi,

Thanks for building this nice tool!

Do you have any plans to incorporate cudnn's LSTM implementation (e.g. from cudnn.torch) for speed-up?
Is there an option to use a combined char+word model in the same way as in the original language modelling work?

Blank the defaults for -targfile and fail if no -srcfile

0d source

Hey Yoon, I'm trying out your model for some tasks and its working great!

SENT 1639:
/home/work/torch/install/bin/luajit: ./s2sa/beam.lua:131: bad argument #1 to 'size' (dimension 1
out of range of 0D tensor at /home/work/torch/pkg/torch/generic/Tensor.c:19)
stack traceback:
[C]: in function 'size'
./s2sa/beam.lua:131: in function 'generate_beam'
./s2sa/beam.lua:770: in function 'search'
evaluate.lua:12: in function 'main'
evaluate.lua:30: in main chunk
[C]: in function 'dofile'
...work/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d50

Setting source_l to 0 doesn't fix it as the 0d source will get a view later. I don't know if it makes a difference but I'm using your defaults + score gold
For now I'm just outputting an empty table for max_hyp if source:dim() == 0 so it passes that example.

Also using 2 gpu's fails on beam. I get this issue: torch/cutorch#434
Adding cutorch.setKernelPeerToPeerAccess(true) fixes it.

Create Character level data-set using shards

I was trying to create character level dataset by adding 'chars' flag 1. My training data consists of several source and target files because I am planning to use shards . When I am running preprocess.py I need to specify both train and validation data. The problem is I have several set of train files but one set of validation files. So I am not sure how to specify one set of validation file with each set of training file.
I am doing something like this
python preprocess.py --srcfile ../Train/src-train2.txt --targetfile ../Train/targ-train2.txt --srcvalfile ../Val/src-valid.txt --targetvalfile ../Val/targ-valid.txt --chars 1 --outputfile data/train2

Also I realize that in each case it's creating new dictionary for vocabulary but I think it should use same vocabulary for all shard?
Is there any other way to perform this task?

Thanks

GRU implementation

Hi team. thanks for great work.

I'm currently trying to construct seq2seq model with Bahdanau style, featured by bidirectional encoder - decoder using GRU cell.

This repository, however, doesn't seem to have GRU implementation, which now i'm trying to add.
(If you already have, it would be a lot of help for time saving..!)

Is it just the models.lua/make_lstm function that i need to modify?

And one more question is about bidirectional encoder:

The description of bidirectional encoder options' saying: "hidden states of the corresponding forward/backward LSTMs are added to obtain the hidden representation for that time step."

does it mean it literally adds two hidden states, not the normal concatenation scheme?

and if so, would it be compatible to other codes if i simply change the hidden representation into the concatenation of two hidden states?

thanks for the reply in advance

Distribute a model

Let's distribute a WMT model (word, char). We can host it on google drive.

Multi-GPU training

Hi s2s team,

There is a multi-GPU problem, I tried to set DIABLE_CHECK_GPU, it does not work either. Please let me know what would help. Thanks!

using CUDA on GPU 1...
using CUDA on second GPU 2...
loading data...
done!
Source vocab size: 28721, Target vocab size: 42787
Source max sent len: 50, Target max sent len: 52
Number of parameters: 66948287
/home//util/torch/install/bin/luajit: /home/util.lua:46: Assertion `THCudaTensor_checkGPU(state, 4, r_, t, m1, m2)' failed. at /tmp/luarocks_cutorch-scm-1-7585/cutorch/lib/THC/THCTensorMathBlas.cu:79
stack traceback:
[C]: in function 'addmm'
/home/util.lua:46: in function 'func'
.../util/torch/install/share/lua/5.1/nngraph/gmodule.lua:333: in function 'neteval'
.../util/torch/install/share/lua/5.1/nngraph/gmodule.lua:368: in function 'forward'
train.lua:367: in function 'train_batch'
train.lua:622: in function 'train'
train.lua:871: in function 'main'
train.lua:874: in main chunk
[C]: in function 'dofile'
...util/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

Cheers,
Zhong

Invalid device ordinal

Thank you for a marvelous library!
I'm trying to train a demo model on a GPU, however I get the following error:

ubuntu@testing:~/nicolas/seq2seq-attn$ th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model -gpuid 0
using CUDA on GPU 0...  
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-1657/cutorch/init.c line=711 error=10 : invalid device ordinal
/home/ubuntu/torch/install/bin/luajit: train.lua:770: cuda runtime error (10) : invalid device ordinal at /tmp/luarocks_cutorch-scm-1-1657/cutorch/init.c:711
stack traceback:
    [C]: in function 'setDevice'
    train.lua:770: in function 'main'
    train.lua:874: in main chunk
    [C]: in function 'dofile'
    ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

Any suggestions on how it can be fixed?

All the Lua packages (hdf5, nngraph, cutorch, cunn) seem to be installed properly.

BLEU score 16.44

Hi yoonkim,

May you help to upload an example of the pre-word-vec-enc and -dec, and the shards input? I think it would be helpful for everyone.

Also I have run the pretrained model on the validation set given in your code. It is 16.44, may I ask which is your test set? Is it the same as the paper "Effective Approaches to Attention-based Neural Machine Translation"? Also, which model in the Table 1 of the paper that the pre-trained model replicates? Not the ensembled right?

Is it the "Base+reverse+dropout+global attention" or "Base+reverse+dropout+global attention+feed input"? The later is 18.1 in the original paper.

One last question: have you run your code on English French translation, for example Bengio's paper? or any other language pairs?

Thank you and enjoy the 4th July holiday!

Cheers,
Zhong

Allow fixed/initialized word embeddings

Would be nice to allow people to start/stay with their own word embeddings for small domain problems.

Speed numbers

give rough "token per second" numbers as well.

Understanding the data preprocessing

I am sure that i am missing something simple here but please bear with me :)

What is the philosophy behind loading the pkl files of movietriple corpus (which are in the form of indices) and then using format_data() function in preprocess.py to update the word indices and then converting them to file having words and then again using get_data() function to convert them back to indices?

I was thinking of using the txt files of the corpus and pass them to get_data() function directly. This would create the vocab as well as do the conversion to indices. What would i miss here?

why I cannot use gpu 0?

rzai@rzai00:/prj/seq2seq-attn-1$ th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model -gpuid 0 -num_layers 4 -rnn_size 500
using CUDA on GPU 0...
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-6130/cutorch/init.c line=719 error=10 : invalid device ordinal
/home/rzai/torch/install/bin/luajit: train.lua:957: cuda runtime error (10) : invalid device ordinal at /tmp/luarocks_cutorch-scm-1-6130/cutorch/init.c:719
stack traceback:
[C]: in function 'setDevice'
train.lua:957: in function 'main'
train.lua:1074: in main chunk
[C]: in function 'dofile'
...rzai/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
rzai@rzai00:/prj/seq2seq-attn-1$

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 519 C /home/rzai/torch/install/bin/luajit 253MiB |
| 0 639 C /home/rzai/torch/install/bin/luajit 247MiB |
| 0 1332 G /usr/lib/xorg/Xorg 183MiB |
| 0 2449 G compiz 59MiB |
| 1 519 C /home/rzai/torch/install/bin/luajit 6317MiB |
| 1 639 C /home/rzai/torch/install/bin/luajit 981MiB |
+-----------------------------------------------------------------------------+
rzai@rzai00:~/prj/seq2seq-attn-1$

Visualize attention weights

Hi! I was wondering if you guys plan on adding support for easy retrieval of the attention weights during decoding? This would be really helpful for qualitative analysis.

Thanks!

train_from not working for old models

Hi,

I have a trained model which was trained using an old repo version (without attn). Now that I am trying to fine tune that model, it gives me this error -
/home/turing/torch/install/bin/luajit: /home/turing/torch/install/share/lua/5.1/nn/MM.lua:25: second input tensor must be 2D
stack traceback:
[C]: in function 'assert'
/home/turing/torch/install/share/lua/5.1/nn/MM.lua:25: in function 'func'
...e/turing/torch/install/share/lua/5.1/nngraph/gmodule.lua:311: in function 'neteval'
...e/turing/torch/install/share/lua/5.1/nngraph/gmodule.lua:346: in function 'func'
...e/turing/torch/install/share/lua/5.1/nngraph/gmodule.lua:311: in function 'neteval'
...e/turing/torch/install/share/lua/5.1/nngraph/gmodule.lua:346: in function 'forward'
train.lua:422: in function 'train_batch'
train.lua:633: in function 'train'
train.lua:892: in function 'main'
train.lua:895: in main chunk
[C]: in function 'dofile'
...ring/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

Can you suggest where is the error coming from?

Include a License

We should explicitly include a MIT license file, as well as a lua/python .gitignore.

Loading checkpoint model seems to fail

Seems that loading the checkpoint model fails, even though it says it loads. Any thoughts?

Austins-MacBook-Pro:src2anno austinshin$ python run_model.py
loading data...
done!
Source vocab size: 8814, Target vocab size: 15669
Source max sent len: 50, Target max sent len: 52
loading djc.t7...
Number of parameters: 13278519
/Users/austinshin/torch/install/bin/luajit: bad argument #1 to '?' (empty tensor at /Users/austinshin/torch/pkg/torch/generic/Tensor.c:888)
stack traceback:
[C]: at 0x02ebead0
[C]: in function '__index'
/Users/austinshin/torch/install/share/lua/5.1/nn/MM.lua:51: in function 'updateGradInput'
...stinshin/torch/install/share/lua/5.1/nngraph/gmodule.lua:386: in function 'neteval'
...stinshin/torch/install/share/lua/5.1/nngraph/gmodule.lua:420: in function 'updateGradInput'
...stinshin/torch/install/share/lua/5.1/nngraph/gmodule.lua:386: in function 'neteval'
...stinshin/torch/install/share/lua/5.1/nngraph/gmodule.lua:420: in function 'updateGradInput'
/Users/austinshin/torch/install/share/lua/5.1/nn/Module.lua:31: in function 'backward'
train.lua:370: in function 'train_batch'
train.lua:479: in function 'train'
train.lua:644: in function 'main'
train.lua:647: in main chunk
[C]: in function 'dofile'
...shin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x0102b56bc0

model ensemble

Is model ensemble part available? How can we use the code to do model ensemble?

harvardnlp / seq2seq-attn Goto Github PK

seq2seq-attn's Introduction

Sequence-to-Sequence Learning with Attentional Neural Networks

Dependencies

Python

Lua

Quickstart

Details

Preprocessing options (preprocess.py)

Training options (train.lua)

Decoding options (beam.lua)

Using additional input features

Pruning a model

Switching between GPU/CPU models

GPU memory requirements/Training speed

Evaluation

Evaluation of States and Attention

Pre-trained models

Acknowledgments

Licence

seq2seq-attn's People

Contributors

Stargazers

Watchers

Forkers

seq2seq-attn's Issues

Recommend Projects

Recommend Topics

Recommend Org

Preprocessing options (`preprocess.py`)

Training options (`train.lua`)

Decoding options (`beam.lua`)