TVsub: DCU-Tencent Chinese-English Dialogue Corpus

The data are used in our AAAI-18 paper Translating Pro-Drop Languages with Reconstruction Models.

The corpus provides parallel data from the dialogue domain with larger-context information, and is intended for research purposes. More than two million sentence pairs were extracted from the subtitles of television episodes.

Within the corpus, sentences are generally short, and the Chinese side contains many examples of dropped pronouns (DPs). The corpus was therefore initially designed for the pro-drop language translation task, and the related paper (Translating Pro-Drop Languages with Reconstruction Models) was accepted at the AAAI 2018 conference.

The corpus can also be used for other translation tasks, such as larger-context MT (Exploiting Cross-Sentence Context for Neural Machine Translation; Learning to Remember Translation History with a Continuous Cache).

Novelty

The differences from other existing bilingual subtitle corpora are as follows:

  • We extract only the subtitles of television episodes, not those of movies. The vocabulary in movies is sparser than that in TV series. To avoid long-tail problems, we use TV series data for MT tasks.

  • We pre-processed the extracted data using a number of in-house scripts, including sentence boundary detection and bilingual sentence alignment. As a result, we obtained a cleaner, better-aligned, high-quality corpus.

  • We keep the larger-context information rather than shuffling the sentences. You can therefore mine useful discourse information from the previous or following sentences for MT.

  • We randomly selected two complete television episodes as the tuning set and another two episodes as the test set, and manually created multiple references for them.

  • To make it possible to re-implement our AAAI-18 paper (Translating Pro-Drop Languages with Reconstruction Models), we also release the +DP corpus, in which the Chinese sentences are automatically labelled with DPs using alignment information.
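
Because the sentences keep their original order, each sentence can be paired with its neighbours as context. The following is a minimal sketch, not the method used in the cited papers: the function name `with_context`, the `window` parameter, and the example sentences are all hypothetical, and it assumes an episode has already been loaded as an ordered list of strings.

```python
from typing import List, Tuple


def with_context(sentences: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Pair each sentence with up to `window` preceding sentences.

    `sentences` must be in their original (document) order.
    """
    examples = []
    for i, current in enumerate(sentences):
        start = max(0, i - window)  # clamp at the start of the episode
        examples.append((sentences[start:i], current))
    return examples


if __name__ == "__main__":
    # Made-up example sentences, not taken from the corpus.
    episode = ["Where is he?", "Left an hour ago.", "Did he say why?"]
    for context, current in with_context(episode, window=2):
        print(context, "->", current)
```

The first sentence of an episode gets an empty context list, so downstream code should handle that case explicitly.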

Getting Started

Please clone the repo, because we may release updated versions of the data in the future.

git clone https://github.com/longyuewangdcu/tvsub.git

The folder structure is as follows:

++ tvsub (root)
++++ data
++++++ original corpus
++++++++ train
++++++++ dev
++++++++ test
++++++ preprocessed corpus
++++++++ train
++++++++ dev
++++++++ test
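
A parallel split could be read along the following lines. This is only a sketch under stated assumptions: it assumes each split stores the Chinese and English sides in two plain-text UTF-8 files with one sentence per line, aligned by line number. The README does not specify the file names or format, so `read_parallel` and the paths passed to it are hypothetical; check the actual files in the repo before relying on it.

```python
import tempfile
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, Tuple


def read_parallel(zh_path: Path, en_path: Path) -> Iterator[Tuple[str, str]]:
    """Yield (Chinese, English) sentence pairs from two line-aligned files.

    Assumption: one sentence per line, same number of lines in both files.
    """
    with open(zh_path, encoding="utf-8") as zh, open(en_path, encoding="utf-8") as en:
        for zh_line, en_line in zip_longest(zh, en):
            if zh_line is None or en_line is None:
                raise ValueError("files have different numbers of lines")
            yield zh_line.rstrip("\n"), en_line.rstrip("\n")


if __name__ == "__main__":
    # Demo on two tiny temporary files with made-up content.
    with tempfile.TemporaryDirectory() as tmp:
        zh_file = Path(tmp) / "demo.zh"
        en_file = Path(tmp) / "demo.en"
        zh_file.write_text("你好\n谢谢\n", encoding="utf-8")
        en_file.write_text("Hello\nThank you\n", encoding="utf-8")
        for zh_sent, en_sent in read_parallel(zh_file, en_file):
            print(zh_sent, "|||", en_sent)
```

Raising on a length mismatch, rather than silently truncating, makes misaligned files visible early.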

Data Details

The following table lists the statistics of the corpus.

[Image: data_details]

Authors

Publications

If you use the data, please cite the following paper:

Longyue Wang, Zhaopeng Tu, Shuming Shi, Tong Zhang, Yvette Graham, Qun Liu. (2018). "Translating Pro-Drop Languages with Reconstruction Models", Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI 2018).

@inproceedings{wang2018aaai,
  title={Translating Pro-Drop Languages with Reconstruction Models},
  author={Wang, Longyue and Tu, Zhaopeng and Shi, Shuming and Zhang, Tong and Graham, Yvette and Liu, Qun},
  year={2018},
  publisher = {{AAAI} Press},
  booktitle={Proceedings of the Thirty-Second {AAAI} Conference on Artificial Intelligence},
  address={New Orleans, Louisiana, USA},
  pages={1--9}
}

The data were crawled from the subtitle websites http://assrt.net and http://www.zimuzu.tv. If you use the TVsub corpus, please add these links to your website and publications.

License

The data may be used for research purposes only.

Please read the License Agreement before you use the data.

Acknowledgments

The released data is part of the contribution of our AAAI-18 paper.

The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. This work was done while Longyue Wang was interning at Tencent AI Lab.
