Exact spec still WIP, but TODOs are basically: Athena query to

Collect 70B S2 tokens

kyleclo commented on September 16, 2024

Collect 70B S2 tokens

rodneykinney commented on September 16, 2024

Quick estimate for open-access papers in S2orc:

Titles + Abstracts = 14.4B characters
Body Text = 468B characters
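
A rough sketch of turning the character counts above into a token estimate, for comparison against the 70B target. The characters-per-token ratio is a placeholder assumption, not a figure from this thread; it depends on the tokenizer and should be measured on a sample. The name `estimate_tokens` is hypothetical.

```python
# Character counts quoted above for open-access papers in S2ORC.
TITLE_ABSTRACT_CHARS = 14_400_000_000   # 14.4B characters
BODY_TEXT_CHARS = 468_000_000_000       # 468B characters

# ASSUMPTION: placeholder ratio, not from the thread. Measure it with the
# actual tokenizer on a sample before relying on the result.
ASSUMED_CHARS_PER_TOKEN = 4.0


def estimate_tokens(num_chars, chars_per_token=ASSUMED_CHARS_PER_TOKEN):
    """Convert a character count into an approximate token count."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return num_chars / chars_per_token


total_chars = TITLE_ABSTRACT_CHARS + BODY_TEXT_CHARS

if __name__ == "__main__":
    print(f"total characters: {total_chars / 1e9:.1f}B")
    print(f"estimated tokens: {estimate_tokens(total_chars) / 1e9:.1f}B")
```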

from olmo.

soldni commented on September 16, 2024

Collected a first version of the corpus. Steps I followed are here, but a summary is as follows:

Data info:

- Corpus is located at s3://ai2-s2-research-public/lucas/s2orc_oa_2022_01_03
- It consists of 30 gzipped JSONL files.
- Each line is a JSON object with the following fields:
  - id: the corpus ID of the paper in Semantic Scholar. To look up the paper, use https://api.semanticscholar.org/CorpusID:<id>
  - text: the text of the paper. Sections are separated by double newlines, i.e. \n\n
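
A minimal sketch of reading one of these files, assuming it has already been downloaded from the S3 prefix to a local path. The function names `iter_papers` and `lookup_url` are hypothetical; only the `id` and `text` fields and the double-newline section separator come from the description above.

```python
import gzip
import json


def iter_papers(path):
    """Yield (id, sections) pairs from one gzipped JSONL file.

    `path` is a local copy of a corpus file (hypothetical location).
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines defensively
            record = json.loads(line)
            # Sections are separated by double newlines.
            sections = record["text"].split("\n\n")
            yield record["id"], sections


def lookup_url(paper_id):
    """Build the Semantic Scholar lookup URL for a corpus ID."""
    return f"https://api.semanticscholar.org/CorpusID:{paper_id}"
```

For example, `for paper_id, sections in iter_papers("part-00.jsonl.gz"): ...` walks one file, where the file name is a placeholder since the thread does not list the actual names.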

The current set of filters is:

- Language is `en` as identified by pycld3.
- Number of whitespace-separated tokens is at least 50.
  - Abstracts with fewer than 50 tokens are typically parsing errors.
- Number of whitespace-separated tokens is at most 50,000.
  - Past 50k tokens, documents are typically large books, vocabularies, number-heavy reports, etc., which are not worth keeping.
- The most frequent token matches the regex `^[A-Za-z][a-z]+$`.
  - Documents that have parsing errors or are number-heavy usually have a non-alphabetic token as the most frequent, e.g. `.` or `\n`.
- For documents that have at least 500 tokens, the most frequent token is at most 7.5% of the total number of tokens.
  - Estimates for English put the frequency of the top word in a document at 5-10% of the total number of tokens; splitting the difference gives 7.5%.
- For documents that have fewer than 500 tokens, the most frequent token is at most 30% of the total number of tokens.
  - For shorter documents, the frequency estimates above are not as reliable, so the threshold is a more generous 30%.
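
The filters above can be sketched roughly as follows. This is an illustration, not the code used to build the corpus: `passes_filters` is a hypothetical name, tokenization is assumed to be Python's `str.split()`, and language detection is left as an optional callable (for example a wrapper around pycld3) so the sketch runs without that dependency.

```python
import re
from collections import Counter

MIN_TOKENS = 50
MAX_TOKENS = 50_000
SHORT_DOC_TOKENS = 500          # boundary between "short" and "long" documents
MAX_TOP_FRACTION_LONG = 0.075   # documents with at least 500 tokens
MAX_TOP_FRACTION_SHORT = 0.30   # documents with fewer than 500 tokens
ALPHA_TOKEN = re.compile(r"^[A-Za-z][a-z]+$")


def passes_filters(text, detect_language=None):
    """Return True if `text` passes the filters listed above.

    `detect_language` is an optional, hypothetical callable that maps text to
    a language code; when it is None the language check is skipped.
    """
    if detect_language is not None and detect_language(text) != "en":
        return False

    # ASSUMPTION: whitespace tokenization via str.split().
    tokens = text.split()
    n = len(tokens)
    if n < MIN_TOKENS or n > MAX_TOKENS:
        return False

    # The most frequent token must look like an alphabetic word.
    top_token, top_count = Counter(tokens).most_common(1)[0]
    if not ALPHA_TOKEN.match(top_token):
        return False

    # The most frequent token must not dominate the document.
    if n >= SHORT_DOC_TOKENS:
        limit = MAX_TOP_FRACTION_LONG
    else:
        limit = MAX_TOP_FRACTION_SHORT
    return top_count / n <= limit
```

Keeping the thresholds as module-level constants makes it easy to retune them if the 7.5% and 30% cutoffs turn out to be too strict or too loose.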