
common-pile's Introduction

Licensed Pile

Repo to hold code and track issues for the collection of permissively licensed data

Tips

You can look at Dolma-formatted data via command-line tools like so.

cat ${file}.jsonl.gz | gunzip | jq -s ${command}

jq -s ("slurp") is used to read the JSONL input (one valid JSON object per line) into a single array that can be indexed, rather than expecting the whole input to be a single JSON document.

Then you can use jq syntax to look for specific things, e.g.:

Look at the text for item 1115: cat ${file}.jsonl.gz | gunzip | jq -s '.[1115].text'

Look at the text for the item with id "12" (note that position in the file is not correlated with id): cat ${file}.jsonl.gz | gunzip | jq -s '.[] | select(.id == "12").text'

Note: You can also use gunzip -c ${file}.jsonl.gz | jq -s ${command}, which is slightly faster (it reduces the amount of data flowing through pipes). However, if you forget the -c flag you will decompress the file in place and delete the compressed version, i.e. you will need to run gzip ${file}.jsonl to undo it.
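
A couple of other quick checks that come in handy (these assume the standard Dolma fields, e.g. source, are present in each record):

# Count the records in a shard
gunzip -c ${file}.jsonl.gz | jq -s 'length'

# Print the id and source of every record, one JSON object per line
gunzip -c ${file}.jsonl.gz | jq -c '{id, source}'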

Capped-parallelism in bash script

Sometimes we want to download/process multiple files in parallel, up to a limited number of jobs, in a bash script. Below is an example code snippet (used in courtlistener/get_data.sh). Note that jobs -r lists the jobs currently running in the shell, so piping it through wc -l gives the number of active jobs.

max_jobs=8
for file in "${files[@]}"; do
    download_and_process "$file" &

    # Limit the number of parallel jobs
    if (( $(jobs -r | wc -l) >= max_jobs )); then
        wait -n
    fi
done

# Wait for any remaining jobs to finish
wait
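
An alternative sketch, in case it is useful elsewhere, is to let xargs manage the worker pool instead of tracking jobs by hand (the function must be exported so the child shells can see it; the pool size of 8 is arbitrary):

export -f download_and_process
printf '%s\n' "${files[@]}" | xargs -P 8 -I {} bash -c 'download_and_process "$1"' _ {}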

Development

We use git pre-commit hooks to format code and keep style consistent.

  • Install the pre-commit library with pip install pre-commit.
  • Install the pre-commit hooks with pre-commit install from the repository root.

Now the hooks will run whenever git commit is run. If one of the hooks reformats a file, the commit is blocked. Inspect the changes, re-add the files with git add, and then re-run your commit command; this time the commit will actually be created.
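
For example, a typical round trip looks something like this (the commit message is a placeholder):

git commit -m "Add new source"   # a hook reformats a file, so the commit is blocked
git diff                         # inspect what the hook changed
git add -u                       # re-stage the reformatted files
git commit -m "Add new source"   # hooks pass this time and the commit is created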


common-pile's Issues

Wikis that have dumps (e.g. wikimedia ones)

Several datasets come from Wikimedia sources, which provide data dumps. These dumps contain wikitext markup that can be parsed with libraries like wtf_wikipedia.
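
For reference, the per-project dumps follow a predictable URL pattern on dumps.wikimedia.org; a sketch for English Wikipedia (other projects such as enwikinews, enwikibooks, enwikisource, and enwikiversity follow the same pattern):

# Articles-only dump
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
# The meta-current dump additionally includes talk and other non-article namespaces
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-meta-current.xml.bz2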

  • Wikipedias (Encyclopedia)
    • ~10m pages
    • ~100 languages
    • Moderated
    • meta-dump includes "talk" pages
  • Wikinews (News)
    • ~10k pages
    • ~10 languages
    • Moderated
  • Wikibooks (Textbooks)
    • ~100k pages
    • ~10 languages
    • Moderated
  • Wikitravel/Wikivoyage (Travel guides)
    • ~100k pages
    • ~10 languages
    • Moderated
  • Wikisource (public-domain source texts)
    • ~100k pages
    • ~10 languages
    • Moderated
  • Wikiversity (Learning Resources/Textbooks)
    • ~100k pages
    • ~ 10 languages
    • Moderated
  • Wikipedia Talk Pages (Dialog-ish)
    • ~1m pages?
    • ~100 languages?
    • Not Moderated
    • Some hate speech/noise

Internet Archive Podcasts

There are many podcasts published on Internet Archive under permissive licenses that can be transcribed with an ASR system like whisperX.

Below are the Internet Archive search queries for English podcasts under different licenses (see IA Search Guide for how to filter by license):

I have not tested and cannot verify whisperX transcription quality on non-English audio, but here are the search queries for permissively licensed podcasts in all languages:

With these search queries we can follow these instructions to bulk download the podcasts and then pass them through an ASR system.
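
As a rough sketch of that flow, assuming the ia CLI from the internetarchive Python package and the whisperx command-line entry point (the search query below is an illustrative placeholder for the license-filtered queries referenced above, and the flags may differ by version):

# List matching item identifiers (substitute the actual license-filtered query)
ia search 'mediatype:audio AND subject:podcast' --itemlist > podcast_items.txt

# Download each item's audio and transcribe it with whisperX
while read -r item; do
    ia download "$item" --glob="*.mp3"
    whisperx "$item"/*.mp3 --model large-v2 --output_dir transcripts/
done < podcast_items.txt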

YouTube Transcripts

Videos on YouTube can optionally be published under a CC-BY license. We can identify these videos with the YouTube API, download them, and transcribe them with an ASR system like whisperX.
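
A rough sketch of that pipeline, assuming the YouTube Data API v3 search endpoint with its videoLicense=creativeCommon filter and yt-dlp for the download step (the query, API key, and output handling are placeholders):

# List CC-licensed videos matching a query (YT_API_KEY is a placeholder)
curl -s "https://www.googleapis.com/youtube/v3/search?part=snippet&type=video&videoLicense=creativeCommon&q=lecture&maxResults=50&key=${YT_API_KEY}" \
  | jq -r '.items[].id.videoId' \
  | while read -r video_id; do
      # Download audio only; ASR (e.g. whisperX) would run on the result
      yt-dlp -x --audio-format mp3 "https://www.youtube.com/watch?v=${video_id}"
    done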

Legal Documents

Domain: Legal

  • Pile of Law

  • Case Law Access Project

  • US Congressional Documents
    Digitized records of congressional proceedings. See here: https://www.govinfo.gov/app/collection/cdoc/118/sdoc/all
    Some of the data is text and some is only available as PDF; from a quick look, there are a decent number of tables in the PDFs (which generally don't have text versions available).

Previous Datasets

  • Visualdialog Dataset (Dialog Q&A)
  • Taskmaster Dataset (Dialog Q&A)

Regulations.gov

Federal regulation proposals and user comments.

  • Get the list of documents via the Regulations.gov API (see the sketch below)
  • Download the docs (bulk download or individually?)
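
A minimal sketch of the first step, assuming the v4 documents endpoint and X-Api-Key header (the search filter and page size are illustrative and should be checked against the API docs; REGS_API_KEY is a placeholder):

# -g stops curl from globbing the [] characters in the query parameters
curl -sg -H "X-Api-Key: ${REGS_API_KEY}" \
  "https://api.regulations.gov/v4/documents?filter[searchTerm]=energy&page[size]=250" \
  | jq -r '.data[].id'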

Hansard(s)

A lot of Commonwealth countries provide official transcripts of parliamentary debates going back many years. The work on the Canadian one already seems to be done (couldn't find a license; we can ask them), and the UK Hansard can easily be scraped.

List of News Sources

Similar to #17

CC BY

  • 360info
  • Meduza
  • Alt News
  • SciDev.Net
  • Agenzia Fides
  • Factly
  • Milwaukee Neighborhood News Service
  • Global Voices
  • Tasnim News Agency
  • Mekong Eye
  • Africa is a Country
  • Balkan Diskurs
  • Minority Africa
  • PanARMENIAN.Net
  • ZimFact
  • The Solutions Journalism Exchange
  • Freedom of the Press Foundation
  • New Canadian Media
  • Project Multatuli

CC BY-SA

  • Propastop
  • The Public Record
  • EduCeleb
  • Liberty TV
  • Oxpeckers

Public Domain

  • Voice of America
  • Caravanserai

ArXiv Articles

Domain: Scientific Articles

Not all articles posted on ArXiv are permissively licensed; we should only use the CC-licensed ones.

Need to parse the LaTeX into plain text.
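
One lightweight option is pandoc, which can read a subset of LaTeX and emit plain text; this is a sketch only, since real arXiv sources (custom macros, multiple files) often need a more robust pipeline:

# paper.tex is a placeholder for a flattened LaTeX source file
pandoc paper.tex -f latex -t plain -o paper.txt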

@craffel has already parsed some of these documents, so we can re-use that code and those processed articles. We should re-run the data collection since new papers are always being posted to arXiv.

The S2ORC dataset also includes scientific articles, so we may need to deduplicate between the two datasets.

New Public Domain Books

We are going to make a dataset of public domain books that people don't currently know are in the public domain.

Current estimated count: 443,333 books; another 200,000 have incomplete metadata and may end up being included.

GitHub Repo for work: https://github.com/EleutherAI/pd-books

Meta: Issue labels

Can someone make a set of labels to apply to issues? Like, needs a contributor, etc.

Datasets from SILOLM

Domain: Various

We may need to double check their pre-processing and decide if we want to update them.

  • Pile of Law
  • Case Law Access Project
  • Github from RedPajama
  • HackerNews
  • Ubuntu IRC
  • Stackexchange from RedPajama
  • Stackoverflow corpus from Kaggle
  • Deepmind Mathematics
  • AMPS dataset
  • ArXiv abstracts
  • S2ORC (scientific papers)
  • Gutenberg Corpus (PG19)
  • MOT corpus (News)
  • Wikinews
  • Wikipedia from RedPajama

Biomedical preprints

Stability AI has collected and preprocessed bioRxiv, MedRxiv, chemRxiv, and possibly some other sources of biomedical text. @MicPie can speak more to this.

Global News

Domain: Multilingual news

See globalvoices.org

Need to scrape and parse the HTML

Open Arizona

Open Arizona is a portal of open-access titles from the University of Arizona Press. We are adding new content to the site monthly. Recently, we added titles in anthropology, archaeology, border studies, and Latin American studies. Please check back frequently to see our latest offerings.

https://open.uapress.arizona.edu/

Collection of 124 contemporary books published under CC BY-NC-ND 4.0

Stack Exchange

Domain: Technical Question-Answer and Discussion
~10m pages
English

@craffel said he has a stack exchange parser to use.

The Silo LM paper also used Stack Exchange; we need to look at their data preprocessing and decide whether to use theirs or re-do it.

Design Considerations:

  • Questions can have multiple answers
    • Do you create multiple (q, a) pairs with the same question for each answer?
    • Do you create a single (q, a1, a2, ...) document?
      • Do you sort by ranking or similar?
  • Where do we put the comments in a document?
  • Stack Exchange posts can be edited, changing text/author information opaquely, so some comments become non sequiturs when the content they referenced changes

Patent Data

Domain: Patents

Can we use the Google Patents data for this?

It might be possible to use C4/Common Crawl data for this, as patents.google.com is one of the most represented domains in C4.

SEC Data

@StellaAthena, do you know who has been working on this source or if it's currently unassigned?

Famous Datasets Tracker

A number of well-known datasets contain a large percentage of the corpora we want to include. The purpose of this issue is to make it easy to track our coverage of these datasets. A checked box means we're finished with that dataset, either because the code is done or because we've rejected it. See the attached issue or note for further details.

The Pile

  • Pile-CC #22
  • PubMed Central #8
  • Books3 Skipping for licensing reasons
  • OpenWebText2 #22
  • arXiv #4
  • GitHub #14
  • Free Law #21
  • Stack Exchange #3
  • USPTO Backgrounds #9
  • PubMed Abstracts
  • Gutenberg (PG-19) #2
  • OpenSubtitles Skipping for licensing reasons #23
  • Wikipedia #1
  • DeepMind Mathematics Skipping, potentially going to do our own version.
  • Ubuntu Freenode IRC #5
  • BookCorpus2 Skipping for licensing reasons
  • EuroParl
  • Hacker News #6
  • YouTube Subtitles Skipping for licensing reasons
  • PhilPapers Skipping for licensing reasons
  • NIH ExPorter
  • Enron Emails

The Stack

Red Pajamas

  • arXiv #4
  • Books skipping for licensing reasons
  • C4 #22
  • Common Crawl #22
  • GitHub #14
  • Stack Exchange #3
  • Wikipedia #1

Silo LM

  • Pile of Law #21
  • Case Law Access Project #21
  • Github from RedPajama #14
  • HackerNews #6
  • Ubuntu IRC #5
  • Stackexchange from RedPajama #3
  • Stackoverflow corpus from Kaggle #3
  • DeepMind Mathematics Skipping for now, potentially going to do our own version.
  • AMPS dataset #30
  • ArXiv abstracts #4
  • S2ORC (scientific papers) #26
  • Gutenberg Corpus (PG19) #2
  • MOT corpus (News) #44
  • Wikinews #1 #7
  • Wikipedia from RedPajama #1

Ballotpedia

Domain: Encyclopedia of American Politics

We need to find a dump or write a scraper. Will the data be available in wikitext format?

GitHub data

Domain: Code/Technical Discussion

Permissively licensed repos can be used for code.
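
A rough sketch of finding candidate repos with the GitHub search API's license qualifier (pagination, rate limits, and authentication are ignored here; the license and fields are illustrative):

curl -s -H "Accept: application/vnd.github+json" \
  "https://api.github.com/search/repositories?q=license:mit&per_page=100" \
  | jq -r '.items[].clone_url'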

We need to check if the issues, discussions, etc. are licensed under the same terms as the code repo.

  • Code
  • Issues

CC Tagged Content

These can be noisy, as the CC tag may be a false positive (e.g. someone includes the CC badge image in a comment).

  • Common Crawl pages with a CC tag
  • YouTube Video Transcripts with a CC license

Hacker News

Hackernews from The Pile was used in Silo LM https://github.com/kernelmachine/silo-lm/blob/d40d2dafe0ec0d3ad856161af5a496be21d423f9/README.md?plain=1#L102

It may have changed since The Pile was created, but the TOS for HackerNews says no scraping:

In connection with your use of the Site you will not engage in or use any data mining, robots, scraping or similar data gathering or extraction methods.

HackerNews is listed as SW (MIT/Apache license) in the Silo LM paper, but the only thing I found in the TOS was about the rights granted to YC with respect to submitted content, nothing about rights granted to others:

By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed.

We should look into The Pile subset to see what data is there; it seems like they could just be following "Show HN" links to find code to include?

Food Content

Domain: Recipes/Food Blogs

In general, recipes (as lists of ingredients) aren't copyrightable, but the prose that accompanies them often is.

OER commons

https://oercommons.org/

Collection of CC-licensed (at least I think all of them are CC-licensed) course materials of various kinds. It might be hard to scrape, and not all of the materials contain a good amount of clean text, but there's a lot of it, and a good amount of it might be useful given how useful textbooks seem to be.

Hathi Trust

https://www.hathitrust.org/member-libraries/resources-for-librarians/data-resources/research-datasets/

HathiTrust makes public domain work available for bulk download on request for non-commercial research purposes.

There are four dataset variants, based on where the researcher is located and whether volumes digitized by Google are included.


Public domain text, excluding Google-digitized volumes

  1. Researchers in the U.S. -> 814,045 public domain and Creative Commons-licensed volumes (480GB)
  2. Researchers outside the U.S. -> 610,575 volumes (351GB)

All public domain text, including Google-digitized volumes

  1. Researchers in the U.S. -> 6,649,535 public domain and Creative Commons-licensed volumes (5.4TB)
  2. Researchers outside the U.S. -> 4,316,648 volumes (3.4TB)

I think the version that includes Google-digitized volumes is a superset of #18 .

Meta: Allowable licenses

We should decide on the set of licenses that we will allow using data from. Here is a list of possible licenses, not necessarily the ones we should allow:

  • Public domain/CC0
  • CC-by / by-nc / by-nc-nd / by-nc-sa / by-nd / by-sa
  • GNU free document license (GFDL)
  • Various code licenses (MIT, BSD, GPL, LGPL, etc.).

PubMed

Domain: Medical Articles

Only some of the articles are Creative Commons licensed.

The XML dump itself has <AbstractText> for some articles.
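
A rough sketch of pulling abstracts out of one baseline XML file (the file name is illustrative; the full list is at https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/ and xmllint comes from libxml2):

wget https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed24n0001.xml.gz
gunzip -c pubmed24n0001.xml.gz | xmllint --xpath '//AbstractText/text()' - | head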

Not sure if this would apply to our dataset release:

Users who republish or redistribute the data (services, products or raw data) agree to:
-- maintain the most current version of all distributed data, or

NASA ADS

NASA Astrophysics Data System

Printables and Similar Sites

Domain: Descriptions of 3d models for 3d printers

Each model includes license information on the page and the TOS says that the text description counts as part of the "print documents" so it should be under the same license.

The TOS of printables.com says no scraping, but we might be able to email them to get a dump.

  • printables.com
  • thingiverse.com
  • makerworld.com (Bambu Labs; they have a feud with Printables and are probably why Printables added the no-scraping clause to its TOS)

Wikis that don't have dumps (i.e. non-wikimedia ones) and need scraping

Domain: Encyclopedia (Wikia)/various (Fandom wiki's)
~10m pages
~10 languages
Partially Moderated, Has some hate speech/noise

Would need to scrape the pages, as there isn't a dump. There should be a URL parameter that allows one to get the wikitext version of a page. These wikis will probably also have "talk" pages; those should be included.
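
On MediaWiki-based wikis the raw wikitext is typically available via the action=raw parameter; a sketch (the wiki and page names are placeholders):

curl -s "https://some-game.fandom.com/wiki/Some_Page?action=raw"
# Talk pages live under the Talk: namespace
curl -s "https://some-game.fandom.com/wiki/Talk:Some_Page?action=raw"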

Marked "done" on the sheet, need to get the code/dump from Adam

Fandom wikis are wikis created by users for information about things like video games, etc.

  • Wikia
  • Fandom Wikis
    • Example
    • Example Talk Pages
  • Ballotpedia (Encyclopedia about American Politics)
  • localwiki (Encyclopedia about "local knowledge" in the US)
  • Proofwiki

Meta: Data format

We should decide on what format to store the processed data in and what metadata to store. In the past, for each "document" from a source (e.g. a webpage, an arxiv paper, a book, etc.), I separately stored:

  • The "raw" data (e.g. HTML, LaTeX, whatever)
  • Some reasonably preprocessed data as plain-ish text
  • Metadata, including
    • date
    • URL
    • author list
    • source
    • license

We probably need to revise the above as we consider more sources.
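
For illustration, a single record under a scheme like this might look as follows (the field names are hypothetical placeholders, not a decided format):

{
  "id": "example-0001",
  "raw": "<html>...</html>",
  "text": "Plain-ish text extracted from the raw document.",
  "metadata": {
    "date": "2023-07-01",
    "url": "https://example.com/page",
    "authors": ["Jane Doe"],
    "source": "example-source",
    "license": "CC BY 4.0"
  }
}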

We also should decide on a storage format. One reasonable option would be to store it as arrow files, using the same format as Hugging Face datasets.

Finally, we should decide on whether we will do a canonical train/validation/test split and, if so, how to do it and store it.

American Stories

Name: American Stories
Size: The paper reports 65.6 billion tokens total, and according to the authors only 75% of the documents are "Legible." This may project to ~ 50 billion tokens of usable text.
License: Public Domain
Description:

The American Stories dataset is a collection of full article texts extracted from historical U.S. newspaper images. It includes nearly 20 million scans from the public domain Chronicling America collection maintained by the Library of Congress. The dataset is designed to address the challenges posed by complex layouts and low OCR quality in existing newspaper datasets. It was created using a novel deep learning pipeline that incorporates layout detection, legibility classification, custom OCR, and the association of article texts spanning multiple bounding boxes. It employs efficient architectures specifically designed for mobile phones to ensure high scalability. The dataset offers high-quality data that can be utilized for various purposes. It can be used to pre-train large language models and improve their understanding of historical English and world knowledge. The dataset can also be integrated into retrieval-augmented language models, making historical information more accessible, including interpretations of political events and details about people's ancestors. Additionally, the structured article texts in the dataset enable the use of transformer-based methods for applications such as detecting reproduced content. This significantly enhances accuracy compared to relying solely on existing OCR techniques. The American Stories dataset serves as an invaluable resource for developing multimodal layout analysis models and other multimodal applications. Its vast size and silver quality make it ideal for innovation and research in this domain.
Thoughts: This dataset seems pretty rough. Old text is pretty dubious all the time, but this one probably needs substantial cleaning before we can use it.

Earnings Call Transcripts

Every quarter, publicly traded companies have an earnings call where they discuss the financial results of the quarter. According to this ruling, publication of earnings call transcripts is considered fair use, since there is little copyrightable material in an earnings call.

However, the transcripts do seem to be a good source of high-quality text. For instance, see Apple's 2024 Q1 earnings call. A transcript contains quite a bit of factual information about a company, real-world events that impacted the company, financial information, and also contains Q&A dialogue.

Some napkin math: transcripts on average seem to be about 10K tokens (which checks out, since it's usually a 30-60 minute phone call), companies do them 4 times per year, and there are around 5k-10k publicly traded companies for which there are transcripts on the popular finance sites. Collecting 10 years' worth of these transcripts would give 2-4B tokens of high-quality text.

Work would be needed to figure out the best way to collect this data. Also, it looks like some of these transcripts will be in the SEC data (#31), but I have no clue whether this is all of them or just some one-off transcripts.

OpenStax

https://openstax.org

Probably very small in comparison to other datasets, but the textbooks are published under Creative Commons Attribution 4.0.

Google Books

The CC-licensed books from Google Books.

The dataset isn't huge. There were already agreements to release the data, but that was a while ago.

Ubuntu IRC

Domain: Technical Chat
Moderated
Data is text

"The content of all Ubuntu channels, whether official logs or otherwise, are considered to be in the public domain."

There are a lot of interleaved dialogues in datasets like this, and they are hard to disentangle. Some prior work (IIRC) manually annotated conversations within Ubuntu IRC; the code from that paper is at https://jkk.name/irc-disentanglement/

Are there other IRC chats/forums like this? For example ArchWiki? The Linux Kernel Development Mailing List (might be too much of a "toxic" source, though)?

Efficient Reshard Tool

Some of the preprocessing can remove all of a document's content (and some documents in some sources are blank to begin with). We should have an efficient way to remove these documents (there is an example in the dolma docs) and reshard the data so that the resulting shards are balanced.
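
A minimal sketch of the filter-and-reshard step using jq and GNU split (the shard size and paths are arbitrary; a real tool would probably live alongside the dolma-based pipeline):

mkdir -p resharded
gunzip -c shards/*.jsonl.gz \
  | jq -c 'select(.text != null and .text != "")' \
  | split -l 100000 -d --additional-suffix=.jsonl - resharded/shard_
gzip resharded/shard_*.jsonl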
