Comments (2)

rodneykinney commented on September 16, 2024

Quick estimate for open-access papers in S2ORC:

Titles + Abstracts = 14.4B characters
Body Text = 468B characters

from olmo.

soldni commented on September 16, 2024

Collected a first version of the corpus. Steps I followed are here, but a summary is as follows:

Data info:

  • Corpus is located at s3://ai2-s2-research-public/lucas/s2orc_oa_2022_01_03
  • It consists of 30 gzipped JSONL files.
  • Each line is a JSON object with the following fields:
    • id: the corpus ID of the paper in Semantic Scholar. If you want to look up the paper, use https://api.semanticscholar.org/CorpusID:<id>
    • text: the text of the paper. Sections are separated by double newlines, i.e. \n\n
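
For anyone who wants to poke at the files, here is a minimal sketch of reading one of them once it has been copied locally. The function names and the local path are hypothetical; only the `id` and `text` fields, the gzipped JSONL layout, the double-newline section separator, and the lookup URL pattern come from the description above.

```python
import gzip
import json
from typing import Iterator


def read_corpus_file(path: str) -> Iterator[dict]:
    """Yield one JSON object per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def split_sections(text: str) -> list[str]:
    """Split a paper's text into sections on double newlines."""
    return [section for section in text.split("\n\n") if section.strip()]


def lookup_url(corpus_id) -> str:
    """Build the Semantic Scholar lookup URL for a corpus ID."""
    return f"https://api.semanticscholar.org/CorpusID:{corpus_id}"
```

Usage would look like `for doc in read_corpus_file("part-00.jsonl.gz"): print(lookup_url(doc["id"]), len(split_sections(doc["text"])))`, where the file name is a placeholder for a locally downloaded shard.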

The current set of filters is:

  • language is en as identified by pycld3
  • number of whitespace-separated tokens is at least 50
    • documents with fewer than 50 tokens are typically abstracts with parsing errors.
  • number of whitespace-separated tokens is at most 50,000
    • past 50k tokens, you typically get large books, vocabulary lists, number-heavy reports, etc., which are not worth keeping.
  • the most frequent token matches the regex ^[A-Za-z][a-z]+$
    • documents that have parsing errors or are number-heavy usually have a non-alphabetic token as the most frequent one, e.g. . or \n.
  • for documents that have at least 500 tokens, the most frequent token is at most 7.5% of the total number of tokens.
    • estimates for English put the frequency of the most common word in a document at 5-10% of the total number of tokens. Splitting the difference and going with 7.5%.
  • for documents that have fewer than 500 tokens, the most frequent token is at most 30% of the total number of tokens.
    • for shorter documents, the frequency estimates above are not as reliable, so going with a more generous 30%.
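
The filters above can be sketched roughly as a single predicate. This is not the code that produced the corpus, just an illustration under a few assumptions: tokens come from `str.split()`, the language code is computed elsewhere (e.g. by pycld3) and passed in, and the function and constant names are made up for this example.

```python
import re
from collections import Counter

# Most frequent token must look like an ordinary word.
WORD_RE = re.compile(r"^[A-Za-z][a-z]+$")

MIN_TOKENS = 50
MAX_TOKENS = 50_000
SHORT_DOC_TOKENS = 500      # below this, use the looser threshold
MAX_TOP_SHARE_LONG = 0.075  # documents with at least 500 tokens
MAX_TOP_SHARE_SHORT = 0.30  # documents with fewer than 500 tokens


def passes_filters(text: str, language: str) -> bool:
    """Return True if a document survives all of the filters listed above."""
    if language != "en":
        return False

    tokens = text.split()  # assumption: plain whitespace tokenization
    n = len(tokens)
    if n < MIN_TOKENS or n > MAX_TOKENS:
        return False

    top_token, top_count = Counter(tokens).most_common(1)[0]
    if not WORD_RE.match(top_token):
        return False

    max_share = MAX_TOP_SHARE_LONG if n >= SHORT_DOC_TOKENS else MAX_TOP_SHARE_SHORT
    return top_count / n <= max_share
```

One caveat with this sketch: because `str.split()` discards whitespace, a token such as \n can never be the most frequent one here, so the original pipeline presumably tokenized slightly differently.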

Final counts:

  • Number of whitespace-separated tokens: 72,582,009,602
  • Number of documents: 74,772,626
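
As a quick sanity check on these counts, the average document length works out to a bit under 1,000 whitespace-separated tokens. The variable names below are just for illustration; the two numbers are copied from the list above.

```python
total_tokens = 72_582_009_602
num_documents = 74_772_626

# Mean number of whitespace-separated tokens per document.
avg_tokens_per_doc = total_tokens / num_documents
print(f"{avg_tokens_per_doc:.1f} tokens per document on average")
```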
