
Prime

News

2019/12/10: We have renamed the model from MUSE (parallel MUlti-Scale attEntion) to PRIME (PaRallel Intersected Multi-scale AttEntion).

Introduction

Core Code:

Relevant links:

About the paper:

TL;DR: A simple module consistently outperforms self-attention and Transformer models on major NMT datasets, achieving state-of-the-art performance.

We ask three questions:

  • Is attention alone good enough?
  • Is parallel representation learning applicable to sequence data and tasks?
  • How can we design a module that combines the inductive biases of both convolution and self-attention?

We find that stand-alone self-attention has shortcomings, and we present a new module that maps the input to a hidden space and performs three operations in parallel: self-attention, convolution, and nonlinearity. Simply stacking this module outperforms all previous models, including the Transformer (Vaswani et al., 2017), on major NMT tasks under standard settings.
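The parallel design described above can be sketched in plain Python. This is an illustrative toy, not the repo's implementation: real PRIME uses learned multi-head attention, dynamic convolution, and trained projections, and all names below are hypothetical. The key idea shown is that one shared projection feeds all three branches, whose outputs are then summed:

```python
import math

def matvec(W, x):
    # Apply a projection matrix W (list of rows) to vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def self_attention(seq):
    # Toy single-head dot-product self-attention over a list of vectors.
    out = []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in seq]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[d] for w, v in zip(weights, seq))
                    for d in range(len(q))])
    return out

def depthwise_conv(seq, kernel=(0.25, 0.5, 0.25)):
    # Toy depthwise convolution with a fixed kernel and zero padding.
    pad = len(kernel) // 2
    dim = len(seq[0])
    padded = [[0.0] * dim] * pad + seq + [[0.0] * dim] * pad
    return [[sum(k * padded[i + j][d] for j, k in enumerate(kernel))
             for d in range(dim)]
            for i in range(len(seq))]

def pointwise_ffn(seq):
    # Toy position-wise nonlinearity (ReLU).
    return [[max(0.0, v) for v in x] for x in seq]

def prime_block(seq, W_shared):
    # One shared projection feeds all three branches (the "shared
    # projection" the paper argues is key); branch outputs are summed.
    h = [matvec(W_shared, x) for x in seq]
    branches = [self_attention(h), depthwise_conv(h), pointwise_ffn(h)]
    return [[sum(b[i][d] for b in branches) for d in range(len(h[0]))]
            for i in range(len(h))]

seq = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]   # identity "shared projection" for the demo
out = prime_block(seq, W)
print(len(out), len(out[0]))   # 2 2
```

In the real model the projection maps to a lower-dimensional space and the branch outputs are recombined before the next layer; this sketch only mirrors the parallel-branch structure.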

Key features:

  • Designs a multi-branch schema that evolves self-attention, and is the first to successfully combine convolution and self-attention in one module for sequence tasks, via the proposed shared projection.
  • Achieves SOTA on three major translation datasets: WMT14 En-Fr, WMT14 En-De, and IWSLT14 De-En.
  • Learns sequence representations in parallel, and thus has potential for acceleration.

Results:

  1. Outperforms previous models on large NMT datasets, and also scales down to small datasets and the base model setting.
  2. The shared projection is key to combining convolution and self-attention; the model generates better long sequences and has potential for acceleration.
Task           Size   Test (BLEU)
IWSLT14 De-En  Base   36.3
WMT14 En-De    Large  29.9
WMT14 En-Fr    Large  43.5

Requirements and Installation

  • PyTorch version >= 1.0.0
  • Python version >= 3.6
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • Tested with torch==1.3.1 and CUDA 10.0

Installing from source

To install from source and develop locally:

pip install --editable . --user

We provide pre-trained models and detailed training and evaluation examples in examples/parallel_intersected_multi-scale_attention(Prime)/README.md.

Citation

Please cite as:

@article{zhao2019muse,
  title={MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning},
  author={Zhao, Guangxiang and Sun, Xu and Xu, Jingjing and Zhang, Zhiyuan and Luo, Liangchen},
  journal={arXiv preprint arXiv:1911.09483},
  year={2019}
}

Notes

The code is based on fairseq-0.6.2.

Contributors

jingjingxupku, zhaoguangxiang


Issues

muse code?

Nice work here, and I really love the results of this paper. I'm just wondering: is the MUSE code already in this repo?

TypeError: argument of type 'NoneType' is not iterable

Traceback (most recent call last):
  File "train.py", line 311, in <module>
    cli_main()
  File "train.py", line 306, in cli_main
    main(args)
  File "train.py", line 49, in main
    model = task.build_model(args)
  File "/home/xgzhu/MUSE/fairseq/tasks/fairseq_task.py", line 169, in build_model
    return models.build_model(args, self)
  File "/home/xgzhu/MUSE/fairseq/models/__init__.py", line 50, in build_model
    return ARCH_MODEL_REGISTRY[args.arch].build_model(args, task)
  File "/home/xgzhu/MUSE/fairseq/models/transformer.py", line 188, in build_model
    encoder = TransformerCombineEncoder(args, src_dict, encoder_embed_tokens)
  File "/home/xgzhu/MUSE/fairseq/models/combine_transformer.py", line 57, in __init__
    for i in range(args.encoder_layers)
  File "/home/xgzhu/MUSE/fairseq/models/combine_transformer.py", line 57, in <listcomp>
    for i in range(args.encoder_layers)
  File "/home/xgzhu/MUSE/fairseq/models/combine_transformer.py", line 157, in __init__
    dropout=args.attention_dropout, cur_attn_type='es'
  File "/home/xgzhu/MUSE/fairseq/modules/multihead_attention.py", line 93, in __init__
    num_heads=dynamic_num_heads, weight_dropout=0.1, )
  File "/home/xgzhu/MUSE/fairseq/modules/dynamic_convolution.py", line 73, in __init__
    self.weight_linear = Linear(self.query_size, num_heads * kernel_size * 1, bias=bias)
  File "/home/xgzhu/MUSE/fairseq/modules/linear.py", line 7, in Linear
    init_method = args.init_method if 'init_method' in args else 'xavier'
TypeError: argument of type 'NoneType' is not iterable
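The crash comes from the last frame: `'init_method' in args` is evaluated while `args` is `None`, and `None` does not support the `in` operator. A hedged sketch of a defensive rewrite of that line (`resolve_init_method` is a hypothetical helper for illustration, not the repo's code):

```python
from argparse import Namespace

def resolve_init_method(args):
    # The failing line -- `'init_method' in args` -- raises TypeError when
    # args is None. getattr with a default tolerates both a missing
    # attribute and args being None in a single guard.
    return getattr(args, 'init_method', 'xavier')

print(resolve_init_method(None))                             # xavier
print(resolve_init_method(Namespace(init_method='normal')))  # normal
```

A root-cause fix would instead make sure the call site in dynamic_convolution.py passes the args namespace through to the Linear helper.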

Reproducing IWSLT14-de-en results

Hi there,
Thanks so much for the great work!
I'm currently trying to reproduce IWSLT14-de-en (Prime model) results on a single P100 GPU. I follow the exact script at https://github.com/lancopku/Prime/blob/master/examples/parallel_intersected_multi-scale_attention(Prime)/README.md.
However, I'm unable to reproduce the results: perplexity is above 100 after training finishes, and the BLEU score is below 30.

Do you have any suggestions? What is the expected perplexity / curve?

Spelling in the paper appendix

In one of the example sentences in the appendix, the letters ä, ö, and ß are missing. Please correct the sentence

und deswegen haben wir uns entschlossen in berlin eine halle zu bauen,in der wir sozusagen die elektrischen verhltnisse der insel im mastabeins zu drei ganz genau abbilden knnen.

to:

und deswegen haben wir uns entschlossen in berlin eine halle zu bauen, in der wir sozusagen die elektrischen verhältnisse der insel im maßstab eins zu drei ganz genau abbilden können.

In LaTeX, these letters can be encoded as {\ss} (ß), {\"o} (ö), and {\"a} (ä).
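As an illustration, the corrected sentence could be written in ASCII-only LaTeX like this (a sketch; with `\usepackage[utf8]{inputenc}` the umlauts could also be typed directly):

```latex
% ASCII-only encoding of the corrected appendix sentence
und deswegen haben wir uns entschlossen in berlin eine halle zu bauen,
in der wir sozusagen die elektrischen verh{\"a}ltnisse der insel im
ma{\ss}stab eins zu drei ganz genau abbilden k{\"o}nnen.
```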

Cheers!

anyone running into 'nan'

I am adapting the TransformerCombineEncoder for a seq-to-seq job, but I get 'nan' after some steps. Has anyone run into this?
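One common source of NaNs when repurposing an attention encoder is a fully masked attention row: every score is -inf, so an unguarded softmax computes 0/0. A minimal stdlib sketch of the failure mode and a guard (illustrative only, not the repo's code -- whether this is the cause here is an assumption):

```python
import math

def masked_softmax(scores):
    """Softmax over attention scores, where masked positions are -inf."""
    m = max(scores)
    if m == float("-inf"):
        # Every position is masked: exp(-inf - -inf) is NaN and the
        # normalizer is 0, so an unguarded softmax yields NaN here.
        # Fall back to zero weights instead of propagating NaN.
        return [0.0] * len(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

neg_inf = float("-inf")
print(masked_softmax([1.0, neg_inf]))      # [1.0, 0.0]
print(masked_softmax([neg_inf, neg_inf]))  # [0.0, 0.0]
```

If this matches your symptom, check whether your padding mask can mask out an entire row (e.g. an all-pad source sentence); exploding gradients are the other usual suspect, and gradient clipping helps there.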

IWSLT'14 DE-EN Numbers

Hi,

I followed all the commands mentioned in https://github.com/lancopku/Prime/blob/master/examples/parallel_intersected_multi-scale_attention(Prime)/README.md#iwslt14-de-en and ran training for 20000 steps. The BLEU score for the best checkpoint was 35.07, and the BLEU score for the average of the last 10 checkpoints was 35.78; PPL was above 4.7. The repo mentions that the BLEU score for the best checkpoint is around 35.7. Is there a mistake in my setup, or do I have to tune the length penalty and beam size to reach the reported numbers? It would be helpful if you could clarify. Thanks!
