PG-19 Language Modelling Benchmark

This repository contains the PG-19 language modeling benchmark. It includes a set of books, extracted from the Project Gutenberg library [1], that were published before 1919. It also contains metadata of book titles and publication dates.

Full dataset download link

PG-19 is over double the size of the Billion Word benchmark [2] and contains documents that are 20X longer, on average, than the WikiText long-range language modelling benchmark [3].

Books are partitioned into a train, validation, and test set. Book metadata is stored in metadata.csv, where each row contains (book_id, short_book_title, publication_date).
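As a minimal sketch of reading the metadata, the snippet below parses rows of (book_id, short_book_title, publication_date) with Python's standard csv module. It assumes the file is comma-separated with no header row and keeps every field as a string; the `BookMeta` type, the `read_metadata` helper, and the sample rows are hypothetical and only illustrate the three-column layout.

```python
import csv
import io
from typing import NamedTuple


class BookMeta(NamedTuple):
    """One metadata row; field names follow the column order described above."""
    book_id: str
    short_book_title: str
    publication_date: str


def read_metadata(lines):
    """Parse an iterable of CSV lines into BookMeta records.

    Assumption: no header row, and exactly three leading columns per record.
    """
    books = []
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        books.append(BookMeta(row[0], row[1], row[2]))
    return books


# Illustrative rows, made up for this sketch (not taken from the dataset).
sample = io.StringIO('101,"Example Title, Vol. 1",1900\n102,Another Example,1850\n')
books = read_metadata(sample)
print(books[0].short_book_title)
```

To read the real file, one could pass an open file handle instead, for example `open("metadata.csv", newline="", encoding="utf-8")`, adjusting the path to wherever the dataset was downloaded.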

Unlike prior benchmarks, we do not constrain the vocabulary size (i.e. by mapping rare words to an UNK token) but instead release the data as an open-vocabulary benchmark. The only processing applied to the text is the removal of boilerplate license text and the mapping of offensive discriminatory words, as specified by Ofcom [4], to placeholder tokens. Users are free to model the data at the character level, at the subword level, or via any mechanism that can model an arbitrary string of text.
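One simple way to model an arbitrary string without an UNK token is to work on UTF-8 bytes. The sketch below is only an illustration of an open-vocabulary encoding, not a scheme prescribed by the benchmark; the helper names `encode_bytes` and `decode_bytes` are hypothetical.

```python
def encode_bytes(text):
    """Map any string to a list of integer ids in the range 0..255."""
    return list(text.encode("utf-8"))


def decode_bytes(ids):
    """Invert encode_bytes, recovering the original string exactly."""
    return bytes(ids).decode("utf-8")


# Rare words and non-ASCII characters need no special handling.
text = "An antiquated word: café, naïveté."
ids = encode_bytes(text)
print(len(text), len(ids))
print(decode_bytes(ids) == text)
```

The vocabulary is fixed at 256 symbols, at the cost of longer sequences than a word-level or subword-level tokenisation would produce.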

To compare models, we propose to continue measuring word-level perplexity: compute the total negative log-likelihood of the dataset (under any chosen subword vocabulary or character-based scheme), divide it by the number of word-level tokens given in the dataset statistics table below, and exponentiate the result.
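The calculation can be sketched as follows, assuming the model's total negative log-likelihood over a split has already been summed across all of its own modelling units (characters, bytes, or subwords). The function names are hypothetical, and the numbers in the example are placeholders rather than results from any model.

```python
import math


def word_level_perplexity(total_nll_nats, num_word_tokens):
    """exp(total negative log-likelihood / number of word-level tokens).

    total_nll_nats: sum of -log p(unit) in nats over every unit the model
        predicts, regardless of how the text was tokenised.
    num_word_tokens: the split's word-level token count from the dataset
        statistics table (not the number of subword or character units).
    """
    if num_word_tokens <= 0:
        raise ValueError("num_word_tokens must be positive")
    return math.exp(total_nll_nats / num_word_tokens)


def bits_to_nats(total_bits):
    """Convert a total loss reported in bits into nats."""
    return total_bits * math.log(2)


# Placeholder example: a total loss of 30 bits spread over 10 word tokens
# is 3 bits per word.
ppl = word_level_perplexity(bits_to_nats(30.0), 10)
print(round(ppl, 6))
```

Because the denominator is the fixed word-level token count rather than the model's own unit count, models with different tokenisations remain directly comparable.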

One could use this dataset for benchmarking long-range language models, or use it to pre-train for other natural language processing tasks which require long-range reasoning, such as LAMBADA [5] or NarrativeQA [6]. We would not recommend using this dataset to train a general-purpose language model, e.g. for applications to a production-system dialogue agent, due to the dated linguistic style of old texts and the inherent biases present in historical writing.

Dataset Statistics

              Train           Validation   Test
Books         28,602          50           100
Num. Tokens   1,973,136,207   3,007,061    6,966,499
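For a rough sense of document length, the average number of tokens per book in each split can be derived from the statistics above. This is a small arithmetic sketch; the `stats` dictionary simply restates the table, and the variable names are illustrative.

```python
# (books, tokens) per split, copied from the dataset statistics table.
stats = {
    "Train": (28_602, 1_973_136_207),
    "Validation": (50, 3_007_061),
    "Test": (100, 6_966_499),
}

# Average tokens per book for each split.
avg_tokens = {split: tokens / books for split, (books, tokens) in stats.items()}

for split, avg in avg_tokens.items():
    print(f"{split}: {avg:,.0f} tokens per book on average")
```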

Bibtex

@article{raecompressive2019,
author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
          Hillier, Chloe and Lillicrap, Timothy P},
title = {Compressive Transformers for Long-Range Sequence Modelling},
journal = {arXiv preprint},
url = {https://arxiv.org/abs/1911.05507},
year = {2019},
}

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property        value
name            The PG-19 Language Modeling Benchmark
alternateName   PG-19
url
sameAs          https://github.com/deepmind/pg19
description     This repository contains the PG-19 dataset. It includes a set of books extracted from the Project Gutenberg books project (https://www.gutenberg.org), that were published before 1919. It also contains metadata of book titles and publication dates.
provider        property   value
                name       DeepMind
                sameAs     https://en.wikipedia.org/wiki/DeepMind
license         property   value
                name       Apache License, Version 2.0
                url
citation        https://identifiers.org/arxiv:1911.05507

Contact

If you have any questions, please contact Jack Rae.

References

  • [1] https://www.gutenberg.org
  • [2] Chelba et al. "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling" (2013)
  • [3] Merity et al. "Pointer Sentinel Mixture Models" (2016)
  • [4] Ofcom offensive language guide
  • [5] Paperno et al. "The LAMBADA dataset: Word prediction requiring a broad discourse context" (2016)
  • [6] Kočiský et al. "The NarrativeQA Reading Comprehension Challenge" (2018)
