
automatic-corpus-generation's Introduction

A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check (EMNLP 2018)

This repository contains scripts that automatically generate sentences containing spelling errors. The location of each error and its correction are marked without any human intervention. We also provide a generated Dataset of 271,329 sentences (min_length=4, max_length=140, average_length=42.5, total_error=381,962, average_error=1.4) and a Confusionset for future research on Chinese Spelling Check (CSC).
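The sketch below illustrates how error locations and corrections can be marked automatically, and how the summary statistics are computed. It assumes that a generated sentence has the same length as its correct source and differs only by substituted characters. The function names (`mark_errors`, `corpus_stats`), the 1-based position convention, and the toy sentences are assumptions for illustration and are not taken from the repository's scripts.

```python
from typing import List, Tuple

Error = Tuple[int, str, str]  # (1-based position, wrong char, correct char)


def mark_errors(wrong: str, correct: str) -> List[Error]:
    """Compare two equal-length sentences character by character."""
    if len(wrong) != len(correct):
        raise ValueError("sentences must have the same length")
    return [
        (i, w, c)
        for i, (w, c) in enumerate(zip(wrong, correct), start=1)
        if w != c
    ]


def corpus_stats(pairs: List[Tuple[str, str]]) -> dict:
    """Summarise sentence lengths and error counts for (wrong, correct) pairs."""
    lengths = [len(wrong) for wrong, _ in pairs]
    total_error = sum(len(mark_errors(w, c)) for w, c in pairs)
    return {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "average_length": round(sum(lengths) / len(lengths), 1),
        "total_error": total_error,
        "average_error": round(total_error / len(pairs), 1),
    }


# Toy pairs invented for this example (not taken from the dataset).
pairs = [
    ("我今天很高心", "我今天很高兴"),
    ("他在学校上刻", "他在学校上课"),
]
errors = mark_errors(*pairs[0])
stats = corpus_stats(pairs)

# The reported average_error follows from the reported totals:
# 381,962 errors over 271,329 sentences.
reported_average = round(381_962 / 271_329, 1)
```

Applied to the reported totals, the same arithmetic gives 381,962 / 271,329, which rounds to the stated average_error of 1.4.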

Note: The Dataset and Confusionset will be continuously updated.

Main Libraries

OCR-based Method

[Figure: OCR-based method]

ASR-based Method

[Figure: ASR-based method]

Basic Model

After generating the dataset with our proposed method, you can try any model you like on CSC. Here, we implement a PyTorch-based BiLSTM model, many details of which can be further optimized.

  • For training, run python main_train.py. Training details are printed to the screen.

  • For testing, run python main_test.py.

Note: You can tune the hyper-parameters or add more generated data to improve the model's performance.

Confusionset

For a given character, a confusionset is the set of characters that are visually or phonologically similar to it. For example, 哨:宵诮梢捎俏咪尚悄少销消硝赵逍屑吵噹躺稍峭鞘肖. Confusion sets are widely used in CSC. As a "byproduct" of our proposed method, we construct one for every correct character involved by collecting all of its incorrect variants. We also release this confusionset for future research on CSC.
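As a rough illustration of the collection step described above, the sketch below gathers, for every correct character, the incorrect variants observed in (wrong, correct) sentence pairs, and reads or writes the `character:variants` line format shown in the example. The function names and toy sentences are assumptions for illustration; the repository's own scripts may organise this differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_confusionset(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each correct character to the set of wrong characters seen in its place."""
    confusion: Dict[str, Set[str]] = defaultdict(set)
    for wrong, correct in pairs:
        if len(wrong) != len(correct):
            continue  # assumption: skip pairs that are not aligned one-to-one
        for w, c in zip(wrong, correct):
            if w != c:
                confusion[c].add(w)
    return dict(confusion)


def parse_line(line: str) -> Tuple[str, Set[str]]:
    """Parse one 'character:variants' line, e.g. '哨:宵诮梢'."""
    char, variants = line.strip().split(":", 1)
    return char, set(variants)


def format_line(char: str, variants: Set[str]) -> str:
    """Serialise an entry to the 'character:variants' form (sorted for stability)."""
    return f"{char}:{''.join(sorted(variants))}"


# Toy pairs invented for this example (not taken from the dataset).
pairs = [
    ("我今天很高心", "我今天很高兴"),
    ("他很高星", "他很高兴"),
]
confusion = build_confusionset(pairs)
char, variants = parse_line("哨:宵诮梢")
```

Storing the variants as a set keeps repeated substitutions from inflating an entry, and sorting on output makes the serialised file stable across runs.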

Testing Datasets


SIGHAN Bake-off 2013: http://ir.itc.ntnu.edu.tw/lre/sighan7csc.html

SIGHAN Bake-off 2014: http://ir.itc.ntnu.edu.tw/lre/clp14csc.html

SIGHAN Bake-off 2015: http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html

Note: All of the datasets above were originally written in Traditional Chinese. Because our generated datasets are in Simplified Chinese, we have converted the originals to Simplified Chinese; the converted versions can be found in the Data folder. We used OpenCC for the conversion.

Citation

If you find the implementation useful, please cite the following paper: A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check

@InProceedings{Reimers:2018:EMNLP,
  author    = {Dingmin Wang and Yan Song and Jing Li and Jialong Han and Haisong Zhang},
  title     = {{A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check}},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  month     = {11},
  year      = {2018},
  address   = {Brussels, Belgium},
}

Contact

Drop me (Dingmin Wang) an email at wangdimmy (AT) gmail.com if you have any questions.

