
common-pile's Introduction

Licensed Pile

Repo to hold code and track issues for the collection of permissively licensed data

Tips

You can look at Dolma-formatted data via command-line tools like so.

cat ${file}.jsonl.gz | gunzip | jq -s ${command}

jq -s ("slurp") is used to read the JSONL input (one valid JSON object per line) into a single array that can be indexed, rather than expecting the whole input to be a single JSON document.

Then you can use jq syntax to look for specific things, e.g.:

Look at the text for item 1115: cat ${file}.jsonl.gz | gunzip | jq -s '.[1115].text'

Look at the text for the item with id "12" (note that position in the file is not correlated with id): cat ${file}.jsonl.gz | gunzip | jq -s '.[] | select(.id == "12").text'

Note: You can also use gunzip -c ${file}.jsonl.gz | jq -s ${command}, which is slightly faster (it reduces the amount of data flowing through pipes). However, if you forget the -c flag you will decompress the file in place and delete the compressed version, i.e. you will need to run gzip ${file}.jsonl to undo it.
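
A couple of other quick checks that come in handy (these assume the standard Dolma fields, e.g. source, are present in each record):

# Count the records in a shard
gunzip -c ${file}.jsonl.gz | jq -s 'length'

# Print the id and source of every record, one JSON object per line
gunzip -c ${file}.jsonl.gz | jq -c '{id, source}'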

Capped-parallelism in bash script

Sometimes we want to download/process multiple files in parallel, up to a limited number of jobs, in a bash script. Below is an example code snippet (used in courtlistener/get_data.sh). Note that jobs -r lists the jobs currently running in the shell, so piping it through wc -l gives the number of active jobs.

max_jobs=8
for file in "${files[@]}"; do
    download_and_process "$file" &

    # Limit the number of parallel jobs
    if (( $(jobs -r | wc -l) >= max_jobs )); then
        wait -n
    fi
done

# Wait for any remaining jobs to finish
wait
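
An alternative sketch, in case it is useful elsewhere, is to let xargs manage the worker pool instead of tracking jobs by hand (the function must be exported so the child shells can see it; the pool size of 8 is arbitrary):

export -f download_and_process
printf '%s\n' "${files[@]}" | xargs -P 8 -I {} bash -c 'download_and_process "$1"' _ {}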

Development

We use git pre-commit hooks to format code and keep style consistent.

  • Install the pre-commit library with pip install pre-commit.
  • Install the pre-commit hooks with pre-commit install from the repository root.

Now the hooks will run whenever git commit is run. If one of the hooks reformats a file, the commit is blocked. Inspect the changes, re-add the files with git add, and then re-run your commit command; this time the commit will actually be created.
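
For example, a typical round trip looks something like this (the commit message is a placeholder):

git commit -m "Add new source"   # a hook reformats a file, so the commit is blocked
git diff                         # inspect what the hook changed
git add -u                       # re-stage the reformatted files
git commit -m "Add new source"   # hooks pass this time and the commit is created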


common-pile's Issues

Wikis that have dumps (e.g. wikimedia ones)

Several datasets come from Wikimedia sources, which provide data dumps. These dumps contain wikitext markup that can be parsed with libraries like wtf_wikipedia.
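
For reference, the per-project dumps follow a predictable URL pattern on dumps.wikimedia.org; a sketch for English Wikipedia (other projects such as enwikinews, enwikibooks, enwikisource, and enwikiversity follow the same pattern):

# Articles-only dump
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
# The meta-current dump additionally includes talk and other non-article namespaces
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-current.xml.bz2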

  • Wikipedias (Encyclopedia)
    • ~10m pages
    • ~100 languages
    • Moderated
    • meta-dump includes "talk" pages
  • Wikinews (News)
    • ~10k pages
    • ~10 languages
    • Moderated
  • Wikibooks (Textbooks)
    • ~100k pages
    • ~10 languages
    • Moderated
  • Wikitravel/Wikivoyage (Travel guides)
    • ~100k pages
    • ~10 languages
    • Moderated
  • Wikisource (public-domain source texts)
    • ~100k pages
    • ~10 languages
    • Moderated
  • Wikiversity (Learning Resources/Textbooks)
    • ~100k pages
    • ~ 10 languages
    • Moderated
  • Wikipedia Talk Pages (Dialog-ish)
    • ~1m pages?
    • ~100 languages?
    • Not Moderated
    • Some hate speech/noise

Internet Archive Podcasts

There are many podcasts published on Internet Archive under permissive licenses that can be transcribed with an ASR system like whisperX.

Below are the Internet Archive search queries for English podcasts under different licenses (see IA Search Guide for how to filter by license):

I have not tested and cannot verify whisperX transcription quality on non-English audio, but here are the search queries for permissively licensed podcasts in all languages:

With these search queries we can follow these instructions to bulk download the podcasts and then pass them through an ASR system.
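
As a rough sketch of that flow, assuming the ia CLI from the internetarchive Python package and the whisperx command-line entry point (the search query below is an illustrative placeholder for the license-filtered queries referenced above, and the flags may differ by version):

# List matching item identifiers (substitute the actual license-filtered query)
ia search 'mediatype:audio AND subject:podcast' --itemlist > podcast_items.txt

# Download each item's audio and transcribe it with whisperX
while read -r item; do
    ia download "$item" --glob="*.mp3"
    whisperx "$item"/*.mp3 --model large-v2 --output_dir transcripts/
done < podcast_items.txt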

YouTube Transcripts

Videos on YouTube can optionally be published under a CC-BY license. We can identify these videos with the YouTube API, download them, and transcribe them with an ASR system like whisperX.
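
A rough sketch of that pipeline, assuming the YouTube Data API v3 search endpoint with its videoLicense=creativeCommon filter and yt-dlp for the download step (the query, API key, and output handling are placeholders):

# List CC-licensed videos matching a query (YT_API_KEY is a placeholder)
curl -s "https://www.googleapis.com/youtube/v3/search?part=snippet&type=video&videoLicense=creativeCommon&q=lecture&maxResults=50&key=${YT_API_KEY}" \
  | jq -r '.items[].id.videoId' \
  | while read -r video_id; do
      # Download audio only; ASR (e.g. whisperX) would run on the result
      yt-dlp -x --audio-format mp3 "https://www.youtube.com/watch?v=${video_id}"
    done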

Legal Documents

Domain: Legal

  • Pile of Law

  • Case Law Access Project

  • US Congressional Documents
    Digitized records of congressional proceedings. See here: https://www.govinfo.gov/app/collection/cdoc/118/sdoc/all
    Some of the data is text and some is only available as PDF; from a quick look, there are a decent number of tables in the PDFs (which generally don't have text versions available).

Previous Datasets

  • Visualdialog Dataset (Dialog Q&A)
  • Taskmaster Dataset (Dialog Q&A)

Regulations.gov

Federal regulation proposals and user comments.

  • Get the list of documents via the Regulations.gov API (see the sketch below)
  • Download the docs (bulk download or individually?)
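
A minimal sketch of the first step, assuming the v4 documents endpoint and X-Api-Key header (the search filter and page size are illustrative and should be checked against the API docs; REGS_API_KEY is a placeholder):

# -g stops curl from globbing the [] characters in the query parameters
curl -sg -H "X-Api-Key: ${REGS_API_KEY}" \
  "https://api.regulations.gov/v4/documents?filter[searchTerm]=energy&page[size]=250" \
  | jq -r '.data[].id'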

Hansard(s)

A lot of Commonwealth countries provide official transcripts of parliamentary debates going back many years. The work on the Canadian one already seems to be done (couldn't find a license; we can ask them), and the UK Hansard can easily be scraped.

List of News Sources

Similar to #17

CC BY

  • 360info
  • Meduza
  • Alt News
  • SciDev.Net
  • Agenzia Fides
  • Factly
  • Milwaukee Neighborhood News Service
  • Global Voices
  • Tasnim News Agency
  • Mekong Eye
  • Africa is a Country
  • Balkan Diskurs
  • Minority Africa
  • PanARMENIAN.Net
  • ZimFact
  • The Solutions Journalism Exchange
  • Freedom of the Press Foundation
  • New Canadian Media
  • Project Multatuli

CC BY-SA

  • Propastop
  • The Public Record
  • EduCeleb
  • Liberty TV
  • Oxpeckers

Public Domain

  • Voice of America
  • Caravanserai

ArXiv Articles

Domain: Scientific Articles

Not all articles posted on ArXiv are permissively licensed; we should only use the CC-licensed ones.

Need to parse the LaTeX into plain text.
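
One lightweight option is pandoc, which can read a subset of LaTeX and emit plain text; this is a sketch only, since real arXiv sources (custom macros, multiple files) often need a more robust pipeline:

# paper.tex is a placeholder for a flattened LaTeX source file
pandoc paper.tex -f latex -t plain -o paper.txt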

@craffel has already parsed some of these documents, so we can re-use that code and those processed articles. We should re-run the data collection since new papers are always being posted to arXiv.

The S2ORC dataset also includes scientific articles, so we may need to deduplicate between the two datasets.

New Public Domain Books

We are going to make a dataset of public domain books that people don't currently know are in the public domain.

Current estimated count: 443,333 books; another 200,000 have incomplete metadata and may end up being included.

GitHub Repo for work: https://github.com/EleutherAI/pd-books

Meta: Issue labels

Can someone make a set of labels to apply to issues? Like, needs a contributor, etc.

Datasets from SILOLM

Domain: Various

We may need to double check their pre-processing and decide if we want to update them.

  • Pile of Law
  • Case Law Access Project
  • Github from RedPajama
  • HackerNews
  • Ubuntu IRC
  • Stackexchange from RedPajama
  • Stackoverflow corpus from Kaggle
  • Deepmind Mathematics
  • AMPS dataset
  • ArXiv abstracts
  • S2ORC (scientific papers)
  • Gutenberg Corpus (PG19)
  • MOT corpus (News)
  • Wikinews
  • Wikipedia from RedPajama

Biomedical preprints

Stability AI has collected and preprocessed bioRxiv, MedRxiv, chemRxiv, and possibly some other sources of biomedical text. @MicPie can speak more to this.

Global News

Domain: Multilingual news

See globalvoices.org

Need to scrape and parse the HTML

Open Arizona

Open Arizona is a portal of open-access titles from the University of Arizona Press. We are adding new content to the site monthly. Recently, we added titles in anthropology, archaeology, border studies, and Latin American studies. Please check back frequently to see our latest offerings.

https://open.uapress.arizona.edu/

Collection of 124 contemporary books published under CC BY-NC-ND 4.0

Stack Exchange

Domain: Technical Question-Answer and Discussion
~10m pages
English

@craffel said he has a stack exchange parser to use.

The Silo LM paper also used Stack Exchange; we need to look at their data preprocessing and decide whether to use theirs or re-do it.

Design Considerations:

  • Questions can have multiple answers
    • Do you create multiple (q, a) pairs with the same question for each answer?
    • Do you create a single (q, a1, a2, ...) document?
      • Do you sort by ranking or similar?
  • Where do we put the comments in a document?
  • Stack Exchange posts can be edited, changing text/author information opaquely, so some comments become non sequiturs when the content they referenced changes

Patent Data

Domain: Patents

Can we use the Google Patents data for this?

It might be possible to use C4/Common Crawl data for this, as patents.google.com is one of the most represented domains in C4.

SEC Data

@StellaAthena, do you know who has been working on this source or if it's currently unassigned?

Famous Datasets Tracker

A number of well-known datasets contain a large percentage of the corpora we want to include. The purpose of this issue is to make it easy to track our coverage of these datasets. A checked box means we're finished with that dataset, either because the code is done or because we've rejected it. See the attached issue or note for further details.

The Pile

  • Pile-CC #22
  • PubMed Central #8
  • Books3 Skipping for licensing reasons
  • OpenWebText2 #22
  • arXiv #4
  • GitHub #14
  • Free Law #21
  • Stack Exchange #3
  • USPTO Backgrounds #9
  • PubMed Abstracts
  • Gutenberg (PG-19) #2
  • OpenSubtitles Skipping for licensing reasons #23
  • Wikipedia #1
  • DeepMind Mathematics Skipping, potentially going to do our own version.
  • Ubuntu Freenode IRC #5
  • BookCorpus2 Skipping for licensing reasons
  • EuroParl
  • Hacker News #6
  • YouTube Subtitles Skipping for licensing reasons
  • PhilPapers Skipping for licensing reasons
  • NIH ExPorter
  • Enron Emails

The Stack

Red Pajamas

  • arXiv #4
  • Books skipping for licensing reasons
  • C4 #22
  • Common Crawl #22
  • GitHub #14
  • Stack Exchange #3
  • Wikipedia #1

Silo LM

  • Pile of Law #21
  • Case Law Access Project #21
  • Github from RedPajama #14
  • HackerNews #6
  • Ubuntu IRC #5
  • Stackexchange from RedPajama #3
  • Stackoverflow corpus from Kaggle #3
  • DeepMind Mathematics Skipping for now, potentially going to do our own version.
  • AMPS dataset #30
  • ArXiv abstracts #4
  • S2ORC (scientific papers) #26
  • Gutenberg Corpus (PG19) #2
  • MOT corpus (News) #44
  • Wikinews #1 #7
  • Wikipedia from RedPajama #1

Ballotpedia

Domain: Encyclopedia of American Politics

We need to find a dump or write a scraper. Will the data be available in wikitext format?

GitHub data

Domain: Code/Technical Discussion

Permissively licensed repos can be used for code.
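
A rough sketch of finding candidate repos with the GitHub search API's license qualifier (pagination, rate limits, and authentication are ignored here; the license and fields are illustrative):

curl -s -H "Accept: application/vnd.github+json" \
  "https://api.github.com/search/repositories?q=license:mit&per_page=100" \
  | jq -r '.items[].clone_url'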

We need to check if the issues, discussions, etc. are licensed under the same terms as the code repo.

  • Code
  • Issues

CC Tagged Content

These can be noisy, as the CC tag may be a false positive (e.g. someone includes the CC badge image in a comment).

  • Common Crawl pages with a CC tag
  • YouTube Video Transcripts with a CC license

Hacker News

Hackernews from The Pile was used in Silo LM https://github.com/kernelmachine/silo-lm/blob/d40d2dafe0ec0d3ad856161af5a496be21d423f9/README.md?plain=1#L102

It may have changed since The Pile was created, but the TOS for HackerNews says no scraping:

In connection with your use of the Site you will not engage in or use any data mining, robots, scraping or similar data gathering or extraction methods.

HackerNews is listed as SW (MIT/Apache license) in the Silo LM paper, but the only thing I found in the TOS was about the rights granted to YC with respect to submitted content, nothing about rights granted to others:

By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed.

We should look into The Pile subset to see what data is there; it seems like they could just be following "Show HN" links to find code to include?

Food Content

Domain: Recipes/Food Blogs

In general, recipes (as lists of ingredients) aren't copyrightable, but the prose that accompanies them often is.

OER commons

https://oercommons.org/

Collection of CC-licensed (at least I think all of them are CC-licensed) course materials of various kinds. It might be hard to scrape, and not all of the materials contain a good amount of clean text, but there's a lot of it, and a good amount of it might be useful given how useful textbooks seem to be.

Hathi Trust

https://www.hathitrust.org/member-libraries/resources-for-librarians/data-resources/research-datasets/

HathiTrust makes public domain work available for bulk download on request for non-commercial research purposes.

There are four dataset variants, based on where the researcher is located and whether volumes digitized by Google are included.


Public domain text, excluding Google-digitized volumes

  1. Researchers in the U.S. -> 814,045 public domain and Creative Commons-licensed volumes (480GB)
  2. Researchers outside the U.S. -> 610,575 volumes (351GB)

All public domain text, including Google-digitized volumes

  1. Researchers in the U.S. -> 6,649,535 public domain and Creative Commons-licensed volumes (5.4TB)
  2. Researchers outside the U.S. -> 4,316,648 volumes (3.4TB)

I think the version that includes Google-digitized volumes is a superset of #18 .

Meta: Allowable licenses

We should decide on the set of licenses that we will allow using data from. Here is a list of possible licenses, not necessarily the ones we should allow:

  • Public domain/CC0
  • CC-by / by-nc / by-nc-nd / by-nc-sa / by-nd / by-sa
  • GNU free document license (GFDL)
  • Various code licenses (MIT, BSD, GPL, LGPL, etc.).

PubMed

Domain: Medical Articles

Only some of the articles are Creative Commons licensed.

The XML dump itself has <AbstractText> for some articles.
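
A rough sketch of pulling abstracts out of one baseline XML file (the file name is illustrative; the full list is at https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/ and xmllint comes from libxml2):

wget https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed24n0001.xml.gz
gunzip -c pubmed24n0001.xml.gz | xmllint --xpath '//AbstractText/text()' - | head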

Not sure if this would apply to our dataset release:

Users who republish or redistribute the data (services, products or raw data) agree to:
-- maintain the most current version of all distributed data, or

NASA ADS

NASA Astrophysics Data System

Printables and Similar Sites

Domain: Descriptions of 3d models for 3d printers

Each model includes license information on the page and the TOS says that the text description counts as part of the "print documents" so it should be under the same license.

The TOS of printables.com says no scraping, but we might be able to email them to get a dump.

  • printables.com
  • thingiverse.com
  • makerworld.com (Bambu Labs; they have a feud with Printables and are probably why Printables added the no-scraping clause to its TOS)

Wikis that don't have dumps (i.e. non-wikimedia ones) and need scraping

Domain: Encyclopedia (Wikia)/various (Fandom wiki's)
~10m pages
~10 languages
Partially Moderated, Has some hate speech/noise

Would need to scrape the pages, as there isn't a dump. There should be a URL parameter that allows one to get the wikitext version of a page. These wikis will probably also have "talk" pages; those should be included.
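
On MediaWiki-based wikis the raw wikitext is typically available via the action=raw parameter; a sketch (the wiki and page names are placeholders):

curl -s "https://some-game.fandom.com/wiki/Some_Page?action=raw"
# Talk pages live under the Talk: namespace
curl -s "https://some-game.fandom.com/wiki/Talk:Some_Page?action=raw"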

Marked "done" on the sheet, need to get the code/dump from Adam

Fandom wikis are wikis created by users for information about things like video games, etc.

  • Wikia
  • Fandom Wikis
    • Example
    • Example Talk Pages
  • Ballotpedia (Encyclopedia about American Politics)
  • localwiki (Encyclopedia about "local knowledge" in the US)
  • Proofwiki

Meta: Data format

We should decide on what format to store the processed data in and what metadata to store. In the past, for each "document" from a source (e.g. a webpage, an arxiv paper, a book, etc.), I separately stored:

  • The "raw" data (e.g. HTML, LaTeX, whatever)
  • Some reasonably preprocessed data as plain-ish text
  • Metadata, including
    • date
    • URL
    • author list
    • source
    • license

We probably need to revise the above as we consider more sources.
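
For illustration, a single record under a scheme like this might look as follows (the field names are hypothetical placeholders, not a decided format):

{
  "id": "example-0001",
  "raw": "<html>...</html>",
  "text": "Plain-ish text extracted from the raw document.",
  "metadata": {
    "date": "2023-07-01",
    "url": "https://example.com/page",
    "authors": ["Jane Doe"],
    "source": "example-source",
    "license": "CC BY 4.0"
  }
}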

We also should decide on a storage format. One reasonable option would be to store it as arrow files, using the same format as Hugging Face datasets.

Finally, we should decide on whether we will do a canonical train/validation/test split and, if so, how to do it and store it.

American Stories

Name: American Stories
Size: The paper reports 65.6 billion tokens total, and according to the authors only 75% of the documents are "Legible." This may project to ~ 50 billion tokens of usable text.
License: Public Domain
Description:

The American Stories dataset is a collection of full article texts extracted from historical U.S. newspaper images. It includes nearly 20 million scans from the public domain Chronicling America collection maintained by the Library of Congress. The dataset is designed to address the challenges posed by complex layouts and low OCR quality in existing newspaper datasets. It was created using a novel deep learning pipeline that incorporates layout detection, legibility classification, custom OCR, and the association of article texts spanning multiple bounding boxes. It employs efficient architectures specifically designed for mobile phones to ensure high scalability. The dataset offers high-quality data that can be utilized for various purposes. It can be used to pre-train large language models and improve their understanding of historical English and world knowledge. The dataset can also be integrated into retrieval-augmented language models, making historical information more accessible, including interpretations of political events and details about people's ancestors. Additionally, the structured article texts in the dataset enable the use of transformer-based methods for applications such as detecting reproduced content. This significantly enhances accuracy compared to relying solely on existing OCR techniques. The American Stories dataset serves as an invaluable resource for developing multimodal layout analysis models and other multimodal applications. Its vast size and silver quality make it ideal for innovation and research in this domain.
Thoughts: This dataset seems pretty rough. Old text is pretty dubious all the time, but this one probably needs substantial cleaning before we can use it.

Earnings Call Transcripts

Every quarter, publicly traded companies have an earnings call where they discuss the financial results of the quarter. According to this ruling, publication of earnings call transcripts is considered fair use, since there is little copyrightable material in an earnings call.

However, the transcripts do seem to be a good source of high-quality text. For instance, see Apple's 2024 Q1 earnings call. A transcript contains quite a bit of factual information about a company, real-world events that impacted the company, financial information, and also contains Q&A dialogue.

Some napkin math: transcripts on average seem to be about 10K tokens (which checks out, since it's usually a 30-60 minute phone call), companies do them 4 times per year, and there are around 5k-10k publicly traded companies for which there are transcripts on the popular finance sites. Collecting 10 years' worth of these transcripts would give 2-4B tokens of high-quality text.

Work would be needed to figure out the best way to collect this data. Also, it looks like some of these transcripts will be in the SEC data (#31), but I have no clue whether this is all of them or just some one-off transcripts.

OpenStax

https://openstax.org

Probably very small in comparison to other datasets, but the textbooks are published under Creative Commons Attribution 4.0.

Google Books

The CC-licensed books from Google Books.

The dataset isn't huge. There were already agreements to release the data, but that was a while ago.

Ubuntu IRC

Domain: Technical Chat
Moderated
Data is text

"The content of all Ubuntu channels, whether official logs or otherwise, are considered to be in the public domain."

There are a lot of interleaved dialogues in datasets like this, and they are hard to disentangle. Some prior work (IIRC) manually annotated conversations within Ubuntu IRC; the code from that paper is at https://jkk.name/irc-disentanglement/

Are there other IRC chats/forums like this? For example ArchWiki? The Linux Kernel Development Mailing List (might be too much of a "toxic" source, though)?

Efficient Reshard Tool

Some of the preprocessing can remove all of a document's content (and some documents in some sources are blank to begin with). We should have an efficient way to remove these documents (there is an example in the dolma docs) and reshard the data so that the resulting shards are balanced.
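
A minimal sketch of the filter-and-reshard step using jq and GNU split (the shard size and paths are arbitrary; a real tool would probably live alongside the dolma-based pipeline):

mkdir -p resharded
gunzip -c shards/*.jsonl.gz \
  | jq -c 'select(.text != null and .text != "")' \
  | split -l 100000 -d --additional-suffix=.jsonl - resharded/shard_
gzip resharded/shard_*.jsonl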
