
Comments (8)

jamiesully commented on September 25, 2024

I think the preferred way to download bulk metadata is the OAI interface rather than the API (see https://arxiv.org/help/oa/index). It is actually much easier than running many API requests, especially for the larger arXiv categories (such as hep-th with ~115,000 entries, which I was able to download in about an hour).

I rewrote a version of 'fetch_papers.py' to fetch the OAI XML and parse it to conform to the API dictionary structure, so that the two methods can be used interchangeably. I could upload it to the fork I made on GitHub, but I am more or less illiterate when it comes to programming, so it's not 'pretty'.
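For readers who want a picture of the approach, here is a minimal sketch of parsing an OAI `ListRecords` response into API-like dictionaries. It is not the code from the fork. The arXiv metadata namespace, the element names (`id`, `created`, `updated`, `keyname`, `forenames`, `categories`, `abstract`), the `metadataPrefix=arXiv` value, and the output keys (`summary`, `tags`, `published`, and so on) are all assumptions chosen for illustration; check them against the OAI help page linked above before relying on them.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV_NS = "{http://arxiv.org/OAI/arXiv/}"  # assumed namespace of the arXiv metadata format


def build_list_records_url(base_url, set_spec=None, resumption_token=None):
    """Build a ListRecords URL; a resumption token replaces all other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "arXiv"}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urllib.parse.urlencode(params)


def _text(node, tag):
    """Return the whitespace-normalized text of a child element, or ''."""
    child = node.find(ARXIV_NS + tag)
    if child is None or not child.text:
        return ""
    return " ".join(child.text.split())


def parse_list_records(xml_text):
    """Return (entries, resumption_token) for one ListRecords response."""
    root = ET.fromstring(xml_text)
    entries = []
    for meta in root.iter(ARXIV_NS + "arXiv"):
        authors = []
        for author in meta.iter(ARXIV_NS + "author"):
            name = (_text(author, "forenames") + " " + _text(author, "keyname")).strip()
            authors.append({"name": name})
        entries.append({
            # Hypothetical keys, picked to resemble an API feed entry.
            "id": _text(meta, "id"),
            "title": _text(meta, "title"),
            "summary": _text(meta, "abstract"),
            "authors": authors,
            "tags": [{"term": c} for c in _text(meta, "categories").split()],
            "published": _text(meta, "created"),
            "updated": _text(meta, "updated") or _text(meta, "created"),
        })
    token_node = root.find(".//" + OAI_NS + "resumptionToken")
    token = None
    if token_node is not None and token_node.text:
        token = token_node.text.strip()
    return entries, token
```

A harvester would fetch `build_list_records_url(...)`, pass the body to `parse_list_records`, and repeat with the returned token until it is `None`, pausing between requests. Because large sets come back in big pages, this needs far fewer round trips than paging through the search API.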

from arxiv-sanity-preserver.

karelin commented on September 25, 2024

@jamiesully Please upload the new version. I'm interested and can do some kind of code review/polishing.

@karpathy Andrej, you're talking about daily updates of papers in the ML categories. But when one starts a new arxiv-sanity installation, the load will be much larger, especially for physics-related categories. I'm afraid you've closed the issue too soon.

Btw, I expect that it is not fetching metadata but downloading the PDFs that will put the most load on arXiv servers. They propose using S3 (payment required) for large numbers of papers; see https://arxiv.org/help/bulk_data_s3

jamiesully commented on September 25, 2024

Yes, apologies for being unclear. I was considering the case of initializing the database, which requires a lot of calls to the API.

@karelin The files are now up on the master branch of the fork I made. There were also some minor tweaks I had to make elsewhere to handle the legacy arXiv IDs of the form 'CAT/XXXXXXX'.
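To illustrate what handling both ID styles can involve, here is a small sketch that splits an identifier into a base ID and a version. It is not the code from the fork. The regular expressions reflect my understanding of the two formats (new style like `1512.08756v2`, legacy style like `hep-th/9901001`), and `split_arxiv_id` and `safe_filename` are hypothetical helper names.

```python
import re

# New-style IDs: YYMM.NNNN or YYMM.NNNNN, with an optional version suffix.
NEW_STYLE = re.compile(r"^(\d{4}\.\d{4,5})(?:v(\d+))?$")
# Legacy IDs: archive (optionally with a subject class) / 7 digits, optional version.
OLD_STYLE = re.compile(r"^([a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v(\d+))?$")


def split_arxiv_id(raw):
    """Return (base_id, version) where version is an int or None."""
    raw = raw.strip()
    # Accept full abstract URLs as well as bare identifiers.
    if "/abs/" in raw:
        raw = raw.split("/abs/", 1)[1]
    for pattern in (NEW_STYLE, OLD_STYLE):
        match = pattern.match(raw)
        if match:
            version = int(match.group(2)) if match.group(2) else None
            return match.group(1), version
    raise ValueError("unrecognized arXiv identifier: %r" % raw)


def safe_filename(base_id):
    """Legacy IDs contain '/', which cannot appear in a file name."""
    return base_id.replace("/", "_")
```

The slash in legacy IDs is the part most likely to cause trouble, since code that splits a URL on its last `/` or uses the ID directly as a file name will silently mangle it.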

Moredread commented on September 25, 2024

IANAL, but using the search API should be OK according to https://arxiv.org/help/bulk_data. As we need recent PDFs, "Bulk Full-Text Access" isn't really an option, but PDF downloads are allowed by the robots.txt.

I think the page is mainly targeted at enumerating all pages, or scraping (meta-)data from the webpage, instead of using the provided APIs. Downloading every PDF might be the only grey area, I guess.

karpathy commented on September 25, 2024

In practice, arxiv-sanity only makes on average about 1-3 API requests to arxiv per day (to fetch the latest papers in the ML categories), so it's nowhere close to spamming their servers with requests.

karelin commented on September 25, 2024

@jamiesully Thanks for the update. I'll try to examine it in a couple of days. And it's very good that you've addressed the old arXiv ID format! That is very important for many paper types.

Moredread commented on September 25, 2024

@karelin you're right, bootstrapping via the search API is inefficient and slow. Missed that.

Dabbrivia commented on September 25, 2024

I can confirm that after 1000 PDFs have been downloaded during initialization, arxiv.org bans your IP and you start getting a 403 Forbidden response. This should be addressed if this code is to be used for new installations, IMHO.
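One possible way to address this is to pace the downloads and stop as soon as the server keeps answering 403. The sketch below is only an illustration: `polite_download` is a hypothetical helper, the `fetch` callable (returning a `(status, body)` pair) is supplied by the caller, and the delay and retry values are placeholders, since arXiv's actual thresholds are not stated in this thread.

```python
import time


def polite_download(urls, fetch, sleep=time.sleep, delay=3.0, max_retries=3, backoff=2.0):
    """Download urls one at a time, pausing between requests.

    fetch(url) must return (status_code, body). On a 403 the wait is
    multiplied by `backoff` and the request retried; if the 403 persists
    after `max_retries` retries, the run stops and the partial results
    are returned so that the block is not made worse.
    """
    results = {}
    for url in urls:
        wait = delay
        for attempt in range(max_retries + 1):
            status, body = fetch(url)
            if status == 200:
                results[url] = body
                break
            if status != 403:
                results[url] = None  # some other failure: record and move on
                break
            if attempt == max_retries:
                return results  # still blocked: give up on the whole run
            wait *= backoff
            sleep(wait)
        sleep(delay)  # fixed pause between consecutive papers
    return results
```

Passing `fetch` and `sleep` in as arguments keeps the pacing logic testable without touching the network. For a full bootstrap of a large category, the S3 bulk access mentioned earlier in the thread may still be the more appropriate route.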
