
the-pile's Introduction

The Pile Replication Code

The official website for the Pile is here.

The Pile is a large, diverse, open-source language modelling dataset that consists of many smaller datasets combined. The objective is to obtain text from as many modalities as possible to ensure that models trained on the Pile have much broader generalization abilities.

This repository is for replicating or making variants of the Pile. IF YOU ARE HERE TO USE THE PILE DATASET, THIS REPO IS PROBABLY NOT WHAT YOU ARE LOOKING FOR. A copy of the Pile can be downloaded here.

Component Raw Size Weight Epochs Effective Size Mean Document Size
Pile-CC 227.12 GiB 18.11% 1.0 227.12 GiB 4.33 KiB
PubMed Central 90.27 GiB 14.40% 2.0 180.55 GiB 30.55 KiB
Books3 100.96 GiB 12.07% 1.5 151.44 GiB 538.36 KiB
OpenWebText2 62.77 GiB 10.01% 2.0 125.54 GiB 3.85 KiB
ArXiv 56.21 GiB 8.96% 2.0 112.42 GiB 46.61 KiB
Github 95.16 GiB 7.59% 1.0 95.16 GiB 5.25 KiB
FreeLaw 51.15 GiB 6.12% 1.5 76.73 GiB 15.06 KiB
StackExchange 32.20 GiB 5.13% 2.0 64.39 GiB 2.16 KiB
USPTO Backgrounds 22.90 GiB 3.65% 2.0 45.81 GiB 4.08 KiB
PubMed Abstracts 19.26 GiB 3.07% 2.0 38.53 GiB 1.30 KiB
Gutenberg (PG-19) 10.88 GiB 2.17% 2.5 27.19 GiB 398.73 KiB
OpenSubtitles 12.98 GiB 1.55% 1.5 19.47 GiB 30.48 KiB
Wikipedia (en) 6.38 GiB 1.53% 3.0 19.13 GiB 1.11 KiB
DM Mathematics 7.75 GiB 1.24% 2.0 15.49 GiB 8.00 KiB
Ubuntu IRC 5.52 GiB 0.88% 2.0 11.03 GiB 545.48 KiB
BookCorpus2 6.30 GiB 0.75% 1.5 9.45 GiB 369.87 KiB
EuroParl 4.59 GiB 0.73% 2.0 9.17 GiB 68.87 KiB
HackerNews 3.90 GiB 0.62% 2.0 7.80 GiB 4.92 KiB
YoutubeSubtitles 3.73 GiB 0.60% 2.0 7.47 GiB 22.55 KiB
PhilPapers 2.38 GiB 0.38% 2.0 4.76 GiB 73.37 KiB
NIH ExPorter 1.89 GiB 0.30% 2.0 3.79 GiB 2.11 KiB
Enron Emails 0.88 GiB 0.14% 2.0 1.76 GiB 1.78 KiB
Total - - - 1254.20 GiB 5.91 KiB

(Epochs refers to the number of epochs elapsed after 1.2TB)

Usage

Install:

pip install -e .

To replicate the Pile:

python the_pile/pile.py --interleave_output 30 --using pile_reprod

Use the pass 2 script here to complete shuffling.
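The pass 2 script itself is not reproduced here; conceptually, it finishes the job by shuffling document order within each interleaved chunk. A rough illustration of that idea using lm_dataformat (paths are placeholders, each chunk is assumed to fit in memory; this is a sketch, not the actual pass 2 script):

import glob
import random
from lm_dataformat import Archive, Reader

# Placeholder glob for the chunks written by --interleave_output.
for chunk_path in glob.glob("pile_output/*.jsonl.zst"):
    docs = list(Reader(chunk_path).stream_data(get_meta=True))
    random.shuffle(docs)  # pass 2: shuffle documents within the chunk

    ar = Archive("pile_shuffled")  # placeholder output directory
    for text, meta in docs:
        ar.add_data(text, meta=meta)
    ar.commit()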

Other

To force download all data:

python the_pile/pile.py --force_download

To generate fasttext training data for CC filtering (OWT2 only):

sudo apt install build-essential
python the_pile/pile.py --using owt2 --make_fasttext 
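The generated file can then be used to train the fastText classifier for CC filtering, roughly as below (the input and output file names are placeholders, not necessarily what --make_fasttext actually writes):

import fasttext

# Train a supervised classifier on the labelled data produced above;
# "fasttext_train.txt" and "cc_filter.bin" are placeholder names.
model = fasttext.train_supervised(input="fasttext_train.txt")
model.save_model("cc_filter.bin")

# Score a candidate document (returns labels and probabilities).
print(model.predict("some candidate Common Crawl document"))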

Manual Download Components

The following components need to be downloaded manually. Either download them or comment them out in pile.py.

  • Bibliotik: books3.tar.gz needs to be in the current directory. Download temporarily unavailable.

Workflow

To propose a new dataset be added to the Pile, open an issue. Your issue should include a description of the dataset, its size, what language(s) it is in, a link to the data, and any other relevant information. If a project manager approves your proposal, they will change its label to Datasets and add it to Project: Datasets. Datasets that we elect not to include in the current version of the Pile will receive a Deferred or Declined label. While we welcome multilingual datasets and plan on including non-English datasets in the future, the initial release of the Pile will be English-only, and all submissions of non-English datasets will be deferred.

To claim responsibility for implementing an unclaimed dataset, leave a comment on one of our unassigned issues. Once a dataset has been assigned to you, make the necessary changes to datasets.py and pile.py in a fork and submit a pull request. If needed, you can also submit a script for processing the data, as shown here.

To raise an issue that is not proposing a new dataset, open an issue with the tag Feature Request or Bug as appropriate.

Data ready for final implementation should meet the following criteria:

  • The data must be in lm_dataformat format.
  • The data must be shuffled (see the packaging sketch below).
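A minimal sketch of what meeting both criteria can look like with lm_dataformat (toy documents and an in-memory shuffle; a real dataset would need an external shuffle instead):

import random
from lm_dataformat import Archive

# Toy documents standing in for a real dataset.
docs = [
    ("First example document.", {"source": "my_dataset"}),
    ("Second example document.", {"source": "my_dataset"}),
]

random.shuffle(docs)  # the Pile expects pre-shuffled data

ar = Archive("my_dataset_output")  # placeholder output directory
for text, meta in docs:
    ar.add_data(text, meta=meta)
ar.commit()  # writes a compressed chunk in lm_dataformat layout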

In preparation for the initial release, we are no longer accepting additions to the master branch. If you would like to contribute a dataset, please submit the pull request to the Version2 branch.

the-pile's People

Contributors

leogao2, mgrankin, researcher2, sdtblck, stellaathena, thoppe, trisongz

the-pile's Issues

Europarl

Transcripts from EU Parliament meetings from 1996 to 2011. Contains approximately 4.5 GB of text.

Languages: French, Italian, Spanish, Portuguese, Romanian, English, Dutch, German, Danish, Swedish, Bulgarian, Czech, Polish, Slovak, Slovene, Finnish, Hungarian, Estonian, Latvian, Lithuanian, and Greek.

Link: www.statmt.org/europarl/

Fanfiction.net

Fanfiction.net is the largest repository of fanfiction on the internet.

Separate functions for downloading pre-processed datasets and for downloading & processing

I think it would be a good idea, where possible, to have separate functions in the Pile for downloading a pre-processed version of the dataset from a hosted copy (i.e. a single wget) and for running the entire replication step.

The StackExchange dataset, for example, requires a large amount of storage for processing and is pretty slow. It's good that people are able to recreate the pipeline (and the StackExchange data on archive.org is updated fairly regularly, so it will grow over time), but in general it would be better to host the data somewhere.

There are also some functions that only provide download, and no replication steps. I think these two things should be separated where possible.

So, we should first provide a hosted dataset, then fall back to full downloading and pre-processing, if, say, we can no longer host.

Move processing code to this repo

Having a whole bunch of repositories scattered across GitHub for processing code is no bueno. We should really make a directory in this repo for housing them. If people want to keep theirs off-repo, that's fine, but I really don't see why we shouldn't house them here.

I've assigned people who have been loud about this in the past to this issue.

arXiv

arXiv is a preprint repository containing mathematics, computer science, and physics research papers.

Estimated Size: 75 GB

PUBMED (biomedical abstracts)

PubMed comprises "more than 30 million citations for biomedical literature from MEDLINE, life science journals, and online books". The data are stored on a publicly accessible FTP server and are free for public use. The compressed XML is about 28 GB, although much of that is boilerplate. Though the metadata is useful for other purposes, it looks like The-Pile™ would benefit most from the abstracts and titles only.

Example

<ArticleTitle>Side chain packing below the fusion peptide strongly modulates triggering of the Hendra virus F protein.</ArticleTitle>

<AbstractText> Triggering of the Hendra virus fusion (F) protein is required to initiate the conformational changes which drive membrane fusion, but the factors which control triggering remain poorly understood. Mutation of a histidine predicted to lie near the fusion peptide to alanine greatly reduced fusion despite wild-type cell surface expression levels, while asparagine substitution resulted in a moderate restoration in fusion levels. Slowed kinetics of six-helix bundle formation, as judged by sensitivity to heptad repeat B-derived peptides, was observed for all H372 mutants. These data suggest that side chain packing beneath the fusion peptide is an important regulator of Hendra virus F triggering. </AbstractText>

Note that this is different from PMC (PubMed Central). That database contains full text of recent articles. While PubMed abstracts are comprehensive, they do not contain the full text. I'll submit a separate request for that dataset.

I downloaded the entire set last year, so I already have a crawler ready to go.
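A rough sketch of pulling just the titles and abstracts out of one baseline file with the standard library (element names follow the example above; the filename and the PubmedArticle wrapper element are assumptions):

import gzip
import xml.etree.ElementTree as ET

# Placeholder name for one of the compressed baseline files on the FTP server.
with gzip.open("pubmed_baseline_sample.xml.gz", "rt", encoding="utf-8") as f:
    tree = ET.parse(f)

for article in tree.iter("PubmedArticle"):  # assumed per-record element
    title = article.findtext(".//ArticleTitle", default="")
    abstract = " ".join(
        (node.text or "").strip() for node in article.iter("AbstractText")
    )
    if title and abstract:
        print(title)
        print(abstract)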

CORD-19

Official description:

In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 200,000 scholarly articles, including over 100,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.

Kaggle URL: https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

Literotica

A data set of user-submitted erotic literature.

Small Flag

To assist with data exploration and testing (both ours and other people's), we should add a "small" flag that causes it to download a small amount of data (size TBD... 10M per data source? --Stella)

The Eye

The Eye is a platform dedicated to archiving any and all kinds of data.

They say they have 140 TB in total in assorted formats, and a good fraction seems to be in text format.

https://the-eye.eu/public/

Unfortunately, because all of their size estimates seem to be "pending update", it is difficult to give an exact estimate of how much of this is textual.

Congressional Records

URL: https://www.govinfo.gov/help/crecb#about

Size: Estimated 6-8 GB text uncompressed

The Congressional Record is the official record of the proceedings and debates of the United States Congress. It is published daily when Congress is in session. The Congressional Record began publication in 1873 and is still published today.

At the end of each session of Congress, all of the daily editions are collected, re-paginated, and re-indexed into a permanent, bound edition. This permanent edition, referred to as the Congressional Record (Bound Edition), is made up of one volume per session of Congress, with each volume published in multiple parts, each part containing approximately 10 to 20 days of Congressional proceedings. The primary ways in which the bound edition differs from the daily edition are continuous pagination; somewhat edited, revised, and rearranged text; and the dropping of the prefixes H, S, and E before page numbers.

What is available?

Volumes 144 (1998) and prior are made available as digitized versions of the Congressional Record (Bound Edition) created as a result of a partnership between GPO and the Library of Congress. These volumes include all parts of the official printed edition.

There is an API for accessing the records that seems straightforward, once you get past the idea of collections and packages:

https://api.govinfo.gov/docs/

The data are all in PDF, so it would require some parsing but it looks like the documents are already OCR'd.

Example: https://www.govinfo.gov/content/pkg/CRECB-2001-pt1/pdf/CRECB-2001-pt1.pdf

Preliminary experiments with pdfbox show good extraction. Example:

Mr. BYRD. Yes, exactly, one of which
happens to appear to target a facility
for a district represented by a Member
of the House from Texas. We do not
know what that facility is, but it has
been slipped into this measure.
Mr. SARBANES. I say to the distin-
guished Senator, I was not even aware
of that one. That one has not yet risen
to the level of being covered in these
newspaper stories.
Mr. BYRD. I think that is where I got
a glimmer of it, somewhere in a news-
paper story.

They are in column format, so there will be a lot of words broken by hyphens ("distin- guished"). I don't think that will be a problem for the LM, though.
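Rejoining those hyphenated line breaks is a small regex fix; a minimal sketch (naive about genuine hyphenated compounds, which would need a dictionary check):

import re

raw = "I say to the distin-\nguished Senator, I was not even aware"

# Join words split across lines: "distin-\nguished" -> "distinguished".
clean = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
print(clean)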

Additionally, the Congressional Records can be pulled from 1998 forward, but these are already digitized and are on a different API access endpoint.

Enron Emails

Official description:

This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.

Project URL: https://www.cs.cmu.edu/~./enron/

Paperity

Multidisciplinary open-access research aggregator.

Size: 1.5 million papers (rough estimate: about 70GiB, give or take)

Overlap with other sets: There is probably some overlap with arXiv and PubMedCentral, but Paperity seems to have a lot of papers from subjects not currently in any of our sets.

Quote from website:

We can provide dumps of Paperity data, full and incremental, for use in external services and applications. For more information please contact us at: services (at) paperity.org.

TODO: contact paperity

https://paperity.org/

Screenplays (Subtitles don't contain info about who says & does what)

We should consider adding a screenplay dataset in addition to OpenSubtitles, because subtitles don't contain contextual information like where the actors go, what they do, how they behave, ...

This would contain valuable info about social interactions and situations.


I could write a scraper for several screenplay sites if there's interest. :)

United Nations Publications

Languages: English, French, Spanish, Arabic, Russian, Chinese (should have translations for all of these)
Date ranges: 1946-2020
Size: 700,000 publications
Link to UN digital library.

Outstanding questions:

  • How many of these are downloadable through the portal?
  • Are all of the documents available in all languages?
  • What is the total corpus size (in bytes) we should expect from this?

bioRxiv

bioRxiv is a preprint repository for biology research. It can be downloaded from here.

Coastal Zone Information Center Collection

Language: English
Date ranges: 1951-1998
Size: ~ 5,000 long form documents
Text size: 800 MB

https://www.govinfo.gov/app/collection/czic

The Coastal Zone Information Center (CZIC) collection on this site provides access to nearly 5,000 coastal related documents that GPO received from the National Oceanic and Atmospheric Administration (NOAA) Central Library.

The collection provides almost 30 years of data and information crucial to the understanding of U.S. coastal management and NOAA's mission to sustain healthy coasts. These documents were originally submitted to the NOAA Office of Ocean and Coastal Resource Management (OCRM) by state coastal zone management programs in accordance with the Coastal Zone Management Act (CZMA) of 1972. These historic documents were provided to GPO by the content originator, and digitized for public use. For optimal viewing and for the original look and feel of the original documents, view the PDF versions.

Sample text:

       Begins with an analysis of environmental and cultural dynamics affecting
       prehistoric resources. An evaluation of the resources in each county follows,
       and an examination of the different stresses on prehistoric resources (from
       shore erosion, residential development, wetlands destruction, etc.). The
       management strategy discusses current programs, land use planning at dif-
       ferent levels of government, and a county by county approach. Concludes
       with an appendix on principal existing legislation which concerns prehistoric
       resources.

Biodiversity Heritage Library

Language: primarily English, with a few thousand works total in German, French, Spanish, Dutch, Portuguese, and Latin
Date ranges: Primarily pre-1923
Size: Unclear. A large number of full length books, so likely > 1GB.

The Biodiversity Heritage Library has a very large collection (~250,000) of pre-OCR'd historical books and documents on natural history topics. https://about.biodiversitylibrary.org/tools-and-services/developer-and-data-tools/

The individual .txt file links are listed in the ItemTextURL column of this TSV (warning: this link leads to a 40+MB file) https://www.biodiversitylibrary.org/data/hosted/item.txt

My primary concern is with the quality of the OCR.
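If the OCR quality checks out, collecting the text is mostly a matter of walking that TSV; a minimal sketch using requests, assuming ItemTextURL is a tab-separated column in item.txt (the early break just keeps the sketch cheap):

import csv
import requests

# Warning: item.txt is 40+ MB and the full crawl is much larger.
TSV_URL = "https://www.biodiversitylibrary.org/data/hosted/item.txt"

rows = requests.get(TSV_URL).text.splitlines()
for i, row in enumerate(csv.DictReader(rows, delimiter="\t")):
    url = (row.get("ItemTextURL") or "").strip()
    if url:
        text = requests.get(url).text
        print(url, len(text))
    if i >= 10:  # stop early; this is only a sketch
        break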

FreeLaw Project

Looks similar to #27:

URL https://www.courtlistener.com/api/bulk-info/

Free Law Project seeks to provide free access to primary legal materials, develop legal research tools, and support academic research on legal corpora. We work diligently with volunteers to expand our efforts at building an open source, open access, legal research ecosystem. Currently Free Law Project sponsors the development of CourtListener, Juriscraper, and RECAP. We currently have 423 courts that can be accessed with our APIs.

For each court there appears to be a file collecting all information on each case heard. A sample download of "Court of Appeals for the First Circuit" with 35K entries is about 500 MB. The data seems to be organized with a useful field of "text" or "html", the latter of which can be reduced with pandoc. There is definitely overlap with #27, though it's unclear how much. An example:

https://www.courtlistener.com/opinion/4242578/sandquist-v-lebo-automotive-inc/?q=Sandquist%20v.%20Lebo%20Automotive&type=o&order_by=score%20desc&stat_Precedential=on&court=cal

https://cite.case.law/cal-5th/1/233/

According to their site, it looks like they might be a strict subset. By the numbers it looks to be about half the size: "3,676,348: Number of precedential opinions in CourtListener." vs. the claimed 6M for case.law. The upshot is that this service is free to access without an account, covers all states, and parsing can begin immediately.
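A sketch of how a single bulk record might be flattened to plain text, preferring the "text" field and falling back to pandoc for "html" ("opinion.json" is a placeholder for one entry from a bulk court download):

import json
import subprocess

def opinion_text(record):
    # Prefer the plain "text" field; otherwise flatten "html" with pandoc.
    if record.get("text"):
        return record["text"]
    result = subprocess.run(
        ["pandoc", "-f", "html", "-t", "plain"],
        input=record.get("html") or "",
        capture_output=True, text=True, check=True,
    )
    return result.stdout

with open("opinion.json") as f:  # placeholder path
    print(opinion_text(json.load(f))[:500])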

Caselaw Access Project

Description: US caselaw collected by Harvard Law School. Four states are available for download, and people who get "researcher" accounts have access to the entire data set.

Size: 900 MB compressed -> ~4.5 GB uncompressed (open data). Full dataset size unknown.

Language: English

URL: https://case.law/

CORE: Academic papers

Already downloaded to Hetzner; just need to convert to text.

Medium priority, low hanging fruit

PhilPapers

Language: primarily but not exclusively English
Date ranges: 1600s-2020
Size: 260,000+ works indexed and 52,000+ available to download via PhilArchive

PhilPapers is an international, interactive academic database of journal articles for professionals and students in philosophy. It is maintained by the Centre for Digital Philosophy at the University of Western Ontario.

I think we could at least add the section of open-access works, which should all be downloadable as PDFs from https://philarchive.org/. There may also be other works available for download on the main site. In addition, we could probably scrape abstracts for the rest.

African Journals Online

An archive of over 800 academic journals on a wide variety of subjects written by and for African scientists. It’s in a mix of languages, mostly African languages or English. The website advertises

The site has 15 132 Issues containing 183 699 Abstracts with 177 532 Full Text Articles for download of which 117 018 are Open Access

However they’re journal count is out-of-date (they say 500) so I suspect that there’ll actually be much more content than these numbers imply.

I suspect that a lot of them will not be duplicative of scientific articles found in other archives because western academia kinda just ignores the scientific output of Africa. It’s very hard to get permission to submit to arXiv as an African, for example, as many countries’ universities are not on the auto-approve list.

Website: https://www.ajol.info/index.php/ajol

Bibliotik

A scrape of public libraries provided by the-eye.

NIH Abstract text for awarded grants

The NIH (National Institutes of Health) provides a record of all the abstracts of publicly funded grants on ExPorter. There are two main URLs:

https://exporter.nih.gov/ExPORTER_Catalog.aspx?sid=0&index=1
https://exporter.nih.gov/CRISP_Catalog.aspx?sid=0&index=1

The latter contains some overlapping legacy data. The text needs some minimal preprocessing, but is otherwise in good shape. Example:

DESCRIPTION (provided by applicant): Promising results from prophylactic HPV vaccine trials support using these vaccines in cervical cancer prevention programs in the-future. Since vaccine coverage rarely if ever reaches 100%, population-level effectiveness of a prophylactic vaccine designed to prevent a sexually transmitted infection, such as an HPV vaccine, depends not only on the efficacy of the vaccine, but also on the incidence and duration of infection in both men and women. Although much has been learned about the epidemiology of human papillomavirus (HPV) infections in women, little is known about the incidence, determinants, and natural history of HPV infections in men. Research in men has been hampered, in part, by an inability to obtain adequate genital samples for HPV DNA testing. As discussed in this proposal, we developed a sensitive and acceptable method for sample collection and now propose to use this method in a prospective natural history study with the following aims. Among young men, (1) determine the incidence of infection with any type of HPV, oncogenic HPV, specific HPV types including HPV 16 and HPV6/11, and HPV 16 variants; (2) define risk 'factors for incident HPV infection, including lifetime and recent number of sex partners, circumcision status, condom use, frequency of vaginal intercourse, and courtship behavior; and (3) describe the natural history of HPV infection in men as measured by duration and levels of HPV DNA, HPV type-specific seroconversion, duration of antibodies, and development of genital warts. Our long-term goal is development of cost-effective approaches to the prevention of HPV-related cancers.

It's not the largest dataset (estimated about 2 GB compressed?) but it's easy to get and the text is high-quality.
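The minimal preprocessing is mostly stripping the boilerplate "DESCRIPTION (provided by applicant):" prefix and collapsing whitespace; a sketch, assuming the abstracts arrive as CSV with an ABSTRACT_TEXT column (the filename, column name, and encoding are assumptions):

import csv
import re

def clean_abstract(text):
    # Drop the standard applicant-description prefix and collapse whitespace.
    text = re.sub(r"^DESCRIPTION \(provided by applicant\):\s*", "", text)
    return re.sub(r"\s+", " ", text).strip()

# Placeholder name for one of the ExPorter abstract dumps.
with open("exporter_abstracts.csv", newline="", encoding="latin-1") as f:
    for row in csv.DictReader(f):
        abstract = clean_abstract(row.get("ABSTRACT_TEXT", ""))  # assumed column
        if abstract:
            print(abstract[:100])
            break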

RePEc

Language: predominantly English
Date ranges: 1997-2020
Size: Claims 2M downloadable articles, 800K working papers, 26K books, and 59K chapters

Research Papers in Economics (RePEc) is a collaborative effort of hundreds of volunteers in many countries to enhance the dissemination of research in economics. The heart of the project is a decentralized database of working papers, preprints, journal articles, and software components.

We would be extracting the text components only. From what I've seen, it's PDFs.

http://www.repec.org/

USPTO Patent

Language: English
Date ranges: 1976-2000 (APS), 2001-2020 (XML)
Size: > 10GB.

Could either do patent applications or patent grants. Most relevant sections are likely "Background", followed by "Summary" / "Brief Description" and then "Detailed Description". May want to exclude claims and discussion of drawings.

https://bulkdata.uspto.gov/

BookCorpus download not working

When I try to download the bookcorpus dataset, my connection keeps getting closed, and it eventually gives up:

Connecting to battle.shawwn.com (battle.shawwn.com)|2606:4700:3033::681b:80c6|:443... connected.                                                                                                                   
HTTP request sent, awaiting response... 206 Partial Content                                                                                                                                                        
Length: 2404269430 (2.2G), 2402333974 (2.2G) remaining [application/gzip]                                                                                                                                          
Saving to: ‘books1.tar.gz’                                                                                                                                                                                         
                                                                                                                                                                                                                   
books1.tar.gz                0%[                                      ]   1.95M   221KB/s    in 0.5s                                                                                                               
                                                                                                                                                                                                                   
2020-09-25 11:34:47 (221 KB/s) - Connection closed at byte 2042976. Retrying.                                                                                                                                      
                                                                                                                                                                                                                   
--2020-09-25 11:34:57--  (try:20)  https://battle.shawwn.com/sdb/books1/books1.tar.gz                                                                                                                              
Connecting to battle.shawwn.com (battle.shawwn.com)|2606:4700:3033::681b:80c6|:443... connected.                                                                                                                   
HTTP request sent, awaiting response... 206 Partial Content
Length: 2404269430 (2.2G), 2402226454 (2.2G) remaining [application/gzip]
Saving to: ‘books1.tar.gz’

books1.tar.gz                0%[                                      ]   2.05M   222KB/s    in 0.5s

2020-09-25 11:34:58 (222 KB/s) - Connection closed at byte 2150496. Giving up.

Is anyone else having this problem?
