
WeTS: A Benchmark for Translation Suggestion

Translation Suggestion (TS), which provides alternatives for specific words or phrases in a document translated by machine translation (MT), has been proven to play a significant role in post-editing (PE). WeTS is a benchmark dataset for TS annotated by expert translators. It contains train/dev/test corpora for four translation directions: English2German, German2English, Chinese2English, and English2Chinese.


Contents

Data


WeTS is a benchmark dataset for TS in which every example is annotated by expert translators. As far as we know, this is the first gold corpus for TS. Statistics for WeTS are listed in the following table:

| Translation Direction | Train  | Valid | Test  |
| --------------------- | ------ | ----- | ----- |
| English2German        | 14,957 | 1,000 | 1,000 |
| German2English        | 11,777 | 1,000 | 1,000 |
| English2Chinese       | 15,769 | 1,000 | 1,000 |
| Chinese2English       | 21,213 | 1,000 | 1,000 |

For the corpus in each direction, the data is organized as follows:

direction.split.src: the source-side sentences
direction.split.mask: the masked translation sentences; the placeholder is "<MASK>"
direction.split.tgt: the reference suggestions (the English2Chinese test set has three references per example)

Here, direction is one of En2De, De2En, Zh2En, En2Zh, and split is one of train, dev, test.
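
The three files for a given direction and split are line-aligned. For illustration only, here is what a hypothetical English2German triple could look like; the directory layout, $data_dir, and the sentences themselves are made up and not taken from the corpus:

```Bash
# Peek at the first aligned triple of the En2De training split.
# $data_dir is a placeholder for wherever the corpus is unpacked;
# the example sentences in the comments are invented.
head -n 1 $data_dir/En2De/En2De.train.src    # The weather is nice today .
head -n 1 $data_dir/En2De/En2De.train.mask   # Das Wetter ist heute <MASK> .
head -n 1 $data_dir/En2De/En2De.train.tgt    # schön
```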

Models


We release the pre-trained NMT models that were used to generate the MT sentences. These models can also be used to generate a synthetic corpus for TS, which improves the final performance dramatically. A detailed description of how the synthetic corpus is generated can be found in our paper.

The released models can be downloaded at:

Download the models

The password is "2iyk".

For inference with a released model, run:

sh inference_<direction>.sh

where direction is one of: en2de, de2en, en2zh, zh2en.
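
For example, to run inference for all four directions in one pass (a sketch that assumes each script is self-contained and the corresponding model checkpoint has already been downloaded and unpacked where the script expects it):

```Bash
# Run the released inference script for every translation direction.
# Each inference_<direction>.sh is assumed to read its own inputs
# and write its own MT output.
for direction in en2de de2en en2zh zh2en; do
    sh "inference_${direction}.sh"
done
```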

Get Started


data preprocessing

sh process.sh 
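
Judging from the commands quoted in the Issues section below, process.sh applies BPE segmentation with apply_bpe.py (presumably the subword-nmt script, which is not shipped in this repo) and then binarizes the data for fairseq. A minimal sketch of the BPE step for one direction; $bpe_codes and $data_dir are placeholder paths you have to set yourself:

```Bash
# Apply the shared BPE codes to the source, masked-translation and
# suggestion files of one training split. $bpe_codes and $data_dir are
# assumed paths, not part of the released scripts.
python apply_bpe.py -c $bpe_codes < $data_dir/en2de/en2de.train.src  > $data_dir/en2de/en2de.train.src.bpe
python apply_bpe.py -c $bpe_codes < $data_dir/en2de/en2de.train.mask > $data_dir/en2de/en2de.train.mask.bpe
python apply_bpe.py -c $bpe_codes < $data_dir/en2de/en2de.train.tgt  > $data_dir/en2de/en2de.train.tgt.bpe
```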

pre-training

Code for the first-phase pre-training is not included in this repo, as we directly used the XLM codebase (https://github.com/facebookresearch/XLM) with little modification. We also did not observe significant gains from the first-phase pre-training.

The second-phase pre-training:

```Bash
sh pretraining.sh
```

fine-tuning

```Bash
sh finetuning.sh
```

The code in this repo is mainly forked from fairseq (https://github.com/pytorch/fairseq.git).
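
Since the code is a fairseq fork, a fairseq-style editable install from a clone of this repository should be enough to pick up the custom task and scripts (a sketch, assuming Python and PyTorch are already installed):

```Bash
# From the root of a clone of this repository:
# install the fork in editable mode, fairseq-style.
pip install --editable .
```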

Citation


Please cite the following paper if you find the resources in this repository useful.

@article{yang2021wets,
  title={WeTS: A Benchmark for Translation Suggestion},
  author={Yang, Zhen and Zhang, Yingxue and Li, Ernan and Meng, Fandong and Zhou, Jie},
  journal={arXiv preprint arXiv:2110.05151},
  year={2021}
}

LICENCE


See LICENCE


Issues

Some questions about process.py

Hi :)
Thank you for providing the tutorial code.
Some of the code in process.py seems to be wrong.

I think the code below needs to be modified. Please check whether this modification is correct.

  • original process.py
# process.py
...
# bpe
python apply_bpe.py -c $bpe_codes <$data_dir/en2cn/en2cn.train.src > $data_dir/en2cn/en2cn.train.src.bpe.cn 
python apply_bpe.py -c $bpe_codes <$data_dir/en2cn/en2cn.train.mask > $data_dir/en2cn/en2cn.train.src.bpe.en
python apply_bpe.py -c $bpe_codes <$data_dir/en2cn/en2cn.train.tgt > $data_dir/en2cn/en2cn.train.tgt.en

python apply_bpe.py -c $bpe_codes <$data_dir/en2cn/en2cn.valid.src > $data_dir/en2cn/en2cn.valid.src.bpe.cn 
python apply_bpe.py -c $bpe_codes <$data_dir/en2cn/en2cn.valid.mask > $data_dir/en2cn/en2cn.valid.src.bpe.en
python apply_bpe.py -c $bpe_codes <$data_dir/en2cn/en2cn.valid.tgt > $data_dir/en2cn/en2cn.valid.tgt.en


# build vocab
touch $src_vocab
python $codes_dir/build_vocab.py $data_dir/en2cn/en2cn.train.src.bpe.cn $data_dir/en2cn.train.src.bpe.en $src_vocab 5 
...
  • After modification
# process.py
...
# bpe
python apply_bpe.py -c $bpe_codes <$data_dir/en2cn/en2cn.train.src > $data_dir/en2cn/en2cn.train.src.bpe.en 
python apply_bpe.py -c $bpe_codes <$data_dir/en2cn/en2cn.train.mask > $data_dir/en2cn/en2cn.train.src.bpe.cn
python apply_bpe.py -c $bpe_codes <$data_dir/en2cn/en2cn.train.tgt > $data_dir/en2cn/en2cn.train.tgt.bpe.cn

python apply_bpe.py -c $bpe_codes <$data_dir/en2cn/en2cn.valid.src > $data_dir/en2cn/en2cn.valid.src.bpe.en 
python apply_bpe.py -c $bpe_codes <$data_dir/en2cn/en2cn.valid.mask > $data_dir/en2cn/en2cn.valid.src.bpe.cn
python apply_bpe.py -c $bpe_codes <$data_dir/en2cn/en2cn.valid.tgt > $data_dir/en2cn/en2cn.valid.tgt.bpe.cn


# build vocab
touch $src_vocab
python $codes_dir/build_vocab.py $data_dir/en2cn/en2cn.train.src.bpe.cn $data_dir/en2cn/en2cn.train.src.bpe.en $src_vocab 5 
...

Another question: the <unk> replacement ratio reported by fairseq preprocessing seems too high. Is this normal?

Namespace(alignfile=None, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='../WMT22_TS/WeTS/data-bin', fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=True, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer='nag', padding_factor=8, seed=1, source_lang='en', srcdict='../WMT22_TS/WeTS/src.vocab', target_lang='cn', task='input_suggestion', tbmf_wrapper=False, tensorboard_logdir='', testpref=None, tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, trainpref='../WMT22_TS/WeTS/train_and_dev_0425/NaiveTs/en2cn/en2cn.train.src.bpe', user_dir=None, validpref='../WMT22_TS/WeTS/train_and_dev_0425/NaiveTs/en2cn/en2cn.dev.src.bpe', workers=10)
| [en] Dictionary: 34751 types
| [en] ../WMT22_TS/WeTS/train_and_dev_0425/NaiveTs/en2cn/en2cn.train.src.bpe.en: 14759 sents, 785715 tokens, 0.0% replaced by <unk>
| [en] Dictionary: 34751 types
| [en] ../WMT22_TS/WeTS/train_and_dev_0425/NaiveTs/en2cn/en2cn.dev.src.bpe.en: 2733 sents, 161151 tokens, 0.853% replaced by <unk>
| [cn] Dictionary: 34751 types
| [cn] ../WMT22_TS/WeTS/train_and_dev_0425/NaiveTs/en2cn/en2cn.train.src.bpe.cn: 14759 sents, 919988 tokens, 4.68% replaced by <unk>
| [cn] Dictionary: 34751 types
| [cn] ../WMT22_TS/WeTS/train_and_dev_0425/NaiveTs/en2cn/en2cn.dev.src.bpe.cn: 2733 sents, 191096 tokens, 4.47% replaced by <unk>
| Wrote preprocessed data to ../WMT22_TS/WeTS/data-bin
Namespace(alignfile=None, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='../WMT22_TS/WeTS/data-bin', fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=True, optimizer='nag', padding_factor=8, seed=1, source_lang='cn', srcdict='../WMT22_TS/WeTS/tgt.vocab', target_lang=None, task='input_suggestion', tbmf_wrapper=False, tensorboard_logdir='', testpref=None, tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, trainpref='../WMT22_TS/WeTS/train_and_dev_0425/NaiveTs/en2cn/en2cn.train.tgt.bpe', user_dir=None, validpref='../WMT22_TS/WeTS/train_and_dev_0425/NaiveTs/en2cn/en2cn.dev.tgt.bpe', workers=10)
| [cn] Dictionary: 11039 types
| [cn] ../WMT22_TS/WeTS/train_and_dev_0425/NaiveTs/en2cn/en2cn.train.tgt.bpe.cn: 14759 sents, 103263 tokens, 0.0% replaced by <unk>
| [cn] Dictionary: 11039 types
| [cn] ../WMT22_TS/WeTS/train_and_dev_0425/NaiveTs/en2cn/en2cn.dev.tgt.bpe.cn: 2733 sents, 17299 tokens, 5.13% replaced by <unk>
| Wrote preprocessed data to ../WMT22_TS/WeTS/data-bin

Missing apply_bpe.py

Hi
I am trying to reproduce your code following the instructions in README.md.
I found that "apply_bpe.py", which is required to execute "process.sh", does not exist in the repository.
Could you please provide more information about the subword tokenization method?

Additionally, I cannot find the sentencepiece model or BPE model that is suited to the pre-trained model you have released. Is there any released file that I have not found yet?

Thank you
