pseudodata-for-gec

This is the official repository of the following paper:

An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction
Shun Kiyono, Jun Suzuki, Masato Mita, Tomoya Mizumoto, Kentaro Inui
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), 2019

Requirements

  • Python 3.6 or higher
  • PyTorch (version 1.0.1.post2 is recommended)
  • blingfire (for preprocessing - sentence splitting)
  • spaCy (for preprocessing - tokenization)
  • subword-nmt (for splitting the data into subwords)
  • fairseq (I used commit ID: 3658fa3 for all experiments. I strongly recommend sticking with the same commit ID.)
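
A minimal setup sketch based on the list above (the virtual-environment name and the spaCy model are my own choices, not prescribed by this repository; the exact PyTorch build may differ per platform):

# create an isolated environment (the name "gec-env" is arbitrary)
python3 -m venv gec-env && . gec-env/bin/activate

# libraries listed above (the exact PyTorch build may differ per platform / CUDA version)
pip install torch==1.0.1.post2 blingfire spacy subword-nmt
python -m spacy download en_core_web_sm  # an English spaCy model for tokenization (assumed choice)

# fairseq at the commit ID used for the experiments
git clone https://github.com/pytorch/fairseq.git
cd fairseq
git checkout 3658fa3
pip install --editable .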

Resources

Reproducing the CoNLL2014/JFLEG/BEA-test Result

  • Download the test set (CoNLL-2014, JFLEG, or BEA-2019) from its official distribution site.
  • Split each source sentence into subwords using this BPE code file (an apply-bpe sketch is given after the decoding script below).
  • Run the following command; output.txt is the decoded result.
#! /bin/sh
set -xe

cd /path/to/cloned/fairseq

# PATHs
CHECKPOINT="/path/to/downloaded/model.pt"  # available at https://github.com/butsugiri/gec-pseudodata#resources
SRC_BPE="/path/to/src_file"  # the source file must already be split into subwords
DATA_DIR="/path/to/vocab_dir"  # i.e., the `vocab` dir in this repository

# Decoding
cat $SRC_BPE | python -u interactive.py ${DATA_DIR} \
    --path ${CHECKPOINT} \
    --source-lang src_bpe8000 \
    --target-lang trg_bpe8000 \
    --buffer-size 1024 \
    --batch-size 12 \
    --log-format simple \
    --beam 5 \
    --remove-bpe \
    | tee temp.txt

# extract the hypothesis lines ("H-<id>"), restore the original sentence order by ID, and keep only the text
cat temp.txt | grep -e "^H" | cut -f1,3 | sed 's/^..//' | sort -n -k1  | cut -f2 > output.txt
rm temp.txt
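
If the source file is not yet in subwords (second bullet above), subword-nmt can apply the downloaded BPE code file to it; a minimal sketch with placeholder file names:

# split tokenized source sentences into subwords with the BPE code file
subword-nmt apply-bpe -c /path/to/bpe_code_file < test.src.tok > test.src.bpe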

The model pretlarge+SSE (finetuned) should achieve a score of F0.5=62.03.
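
For the CoNLL-2014 test set, F0.5 is conventionally computed with the NUS M2 scorer, which is not bundled in this repository; a sketch, assuming you have downloaded the scorer and the official combined annotation file (paths and file names follow the official releases):

# score the system output against the official M2 annotations (NUS M2 scorer)
./m2scorer/m2scorer output.txt official-2014.combined.m2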

Generating Pseudo Data from Monolingual Corpus

Preprocessing

  • ssplit_and_tokenize.py applies sentence splitting (blingfire) and tokenization (spaCy)
  • remove_dirty_examples.py removes noisy examples (details are described in the script)
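
The command-line interface of these two scripts is not spelled out here, so the pipeline below is only a guess that assumes both read plain text from stdin and write to stdout; check each script's --help before running:

# NOTE: reading from stdin and writing to stdout is an assumption, not documented here
cat raw_monolingual.txt \
    | python ssplit_and_tokenize.py \
    | python remove_dirty_examples.py \
    > monolingual_corpus.tok
# afterwards, apply the BPE code file (see the apply-bpe sketch above) to obtain monolingual_corpus.bpe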

DirectNoise

  • cat monolingual_corpus.bpe | python count_unigram_freq.py > freq_file
  • python normalize_unigram_freq.py --norm 100 < freq_file > norm_freq_file
  • python generate_pseudo_samples.py -uf norm_freq_file -po 0.2 -pm 0.7 --single_mistake 0 --seed 2020 > proc_file
  • feed proc_file to fairseq preprocessing (preprocess.py / fairseq-preprocess) to binarize the training data, as sketched below
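
Chained into one script, the steps above look roughly like this (file names are placeholders; how the monolingual corpus is fed to generate_pseudo_samples.py and how proc_file is split for fairseq preprocessing are not specified here, so check the scripts before running):

#! /bin/sh
set -xe

# 1. count unigram frequencies over the subword-split monolingual corpus
cat monolingual_corpus.bpe | python count_unigram_freq.py > freq_file

# 2. normalize the frequencies
python normalize_unigram_freq.py --norm 100 < freq_file > norm_freq_file

# 3. generate pseudo source/target pairs with DirectNoise
#    (the corpus input is not shown in the original command; check the script's --help)
python generate_pseudo_samples.py -uf norm_freq_file -po 0.2 -pm 0.7 \
    --single_mistake 0 --seed 2020 > proc_file

# 4. split proc_file into source/target files and binarize them with fairseq's
#    preprocess.py (fairseq-preprocess); the exact split depends on proc_file's format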

Citing

If you use resources in this repository, please cite our paper.

@InProceedings{kiyono-etal-2019-empirical,
    title = "An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction",
    author = "Kiyono, Shun  and
      Suzuki, Jun  and
      Mita, Masato  and
      Mizumoto, Tomoya  and
      Inui, Kentaro",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1119",
    pages = "1236--1242",
    abstract = "The incorporation of pseudo data in the training of grammatical error correction models has been one of the main factors in improving the performance of such models. However, consensus is lacking on experimental configurations, namely, choosing how the pseudo data should be generated or used. In this study, these choices are investigated through extensive experiments, and state-of-the-art performance is achieved on the CoNLL-2014 test set (F0.5=65.0) and the official test set of the BEA-2019 shared task (F0.5=70.2) without making any modifications to the model architecture."
}
