Here: https://arxiv.org/help/ro

Legal? about arxiv-sanity-preserver (closed)

karpathy commented on September 25, 2024

Legal?

from arxiv-sanity-preserver.

Comments (8)

jamiesully commented on September 25, 2024

I think the preferred way to download bulk metadata is through the OAI interface rather than the API (see https://arxiv.org/help/oa/index). It is actually much easier than running many API requests, especially for the larger arXiv categories (such as hep-th, with ~115,000 entries, which I was able to download in about an hour).

I rewrote a version of 'fetch_papers.py' to fetch the OAI XML and parse it to conform with the API dictionary structure, so that one can use the two methods interchangeably. I could upload it to the fork I made on GitHub, but I am more or less illiterate when it comes to programming, so it's not 'pretty'.
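For readers who want a feel for the approach, here is a minimal sketch of parsing one OAI-PMH `ListRecords` page into API-like dictionaries. It is not the rewritten 'fetch_papers.py' mentioned above. The endpoint URL, the function names, and the choice of output keys (`id`, `title`, `summary`, `authors`, `published`, `updated`, `tags`) are assumptions made for illustration; check them against the arXiv OAI help page and against whatever the rest of the code expects.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed endpoint; verify against the arXiv OAI help page before use.
OAI_BASE = "http://export.arxiv.org/oai2"
OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARX = "{http://arxiv.org/OAI/arXiv/}"


def list_records_url(set_spec=None, token=None):
    """Build a ListRecords URL (hypothetical helper).

    In OAI-PMH a resumption token is an exclusive argument, so when one
    is given it replaces metadataPrefix and set.
    """
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "arXiv"}
        if set_spec:
            params["set"] = set_spec
    return OAI_BASE + "?" + urlencode(params)


def _text(node, tag):
    """Return the whitespace-normalised text of a direct child, or ''."""
    return " ".join((node.findtext(ARX + tag) or "").split())


def parse_oai_page(xml_text):
    """Parse one response page; return (entries, resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    entries = []
    for meta in root.iter(ARX + "arXiv"):
        authors = [
            {"name": (_text(a, "forenames") + " " + _text(a, "keyname")).strip()}
            for a in meta.iter(ARX + "author")
        ]
        entries.append({
            "id": _text(meta, "id"),
            "title": _text(meta, "title"),
            "summary": _text(meta, "abstract"),
            "authors": authors,
            "published": _text(meta, "created"),
            # Records without an <updated> element fall back to <created>.
            "updated": _text(meta, "updated") or _text(meta, "created"),
            "tags": [{"term": c} for c in _text(meta, "categories").split()],
        })
    token = (root.findtext(".//" + OAI + "resumptionToken") or "").strip()
    return entries, token or None
```

A harvesting loop would fetch `list_records_url(set_spec=...)`, parse the page, and keep requesting `list_records_url(token=...)` until the returned token is `None`, pausing between requests.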

karelin commented on September 25, 2024

@jamiesully Please upload the new version. I'm interested and can do some kind of code review/polishing.

@karpathy Andrej, you're talking about daily updates of papers in the ML categories. But when one starts a new arxiv-sanity installation, the load will be much larger, especially for physics-related stuff. I'm afraid you've closed the issue too soon.

Btw, I expect that it is not fetching metadata but downloading the PDFs that will put the most load on arXiv servers. They propose using S3 servers (payment necessary) for large amounts of papers, see here: https://arxiv.org/help/bulk_data_s3

jamiesully commented on September 25, 2024

Yes, apologies for being unclear. I was considering the case of initializing the database, which requires a lot of calls to the API.

@karelin The files are now up on the master branch of the fork I made. There were also some minor tweaks I had to make elsewhere to handle the legacy arXiv IDs of the form 'CAT/XXXXXXX'.
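To illustrate why legacy IDs need special handling, here is a small sketch that accepts both the new-style form (such as `1512.08756v2`) and the legacy 'CAT/XXXXXXX' form, and turns either into a file name without a slash. This is not the code from the fork; `split_arxiv_id` and `pdf_filename` are hypothetical helpers, and the regular expressions are an assumption about which ID shapes occur in practice.

```python
import re

# New-style IDs: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN.
_NEW_ID = re.compile(r"^(\d{4}\.\d{4,5})(?:v(\d+))?$")
# Legacy IDs: archive[.SC]/YYMMNNN, e.g. hep-th/9901001 or math.GT/0309136.
_OLD_ID = re.compile(r"^([a-z]+(?:-[a-z]+)?(?:\.[A-Z]{2})?/\d{7})(?:v(\d+))?$")


def split_arxiv_id(raw):
    """Return (base_id, version_or_None) for a bare ID or an /abs/ URL."""
    ident = raw.strip()
    if "/abs/" in ident:
        # Drop everything up to and including the /abs/ path segment.
        ident = ident.split("/abs/", 1)[1]
    for pattern in (_NEW_ID, _OLD_ID):
        m = pattern.match(ident)
        if m:
            version = int(m.group(2)) if m.group(2) else None
            return m.group(1), version
    raise ValueError("unrecognised arXiv identifier: %r" % raw)


def pdf_filename(raw):
    """Build a flat file name; the slash in legacy IDs becomes an underscore."""
    base, version = split_arxiv_id(raw)
    suffix = "v%d" % version if version is not None else ""
    return base.replace("/", "_") + suffix + ".pdf"
```

The slash is the main trap: code that takes everything after the last `/` of an abstract URL works for new-style IDs but silently drops the category from legacy ones.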

Moredread commented on September 25, 2024

IANAL, but using the search API should be OK by https://arxiv.org/help/bulk_data. As we need recent PDFs, "Bulk Full-Text Access" isn't really an option, but PDF downloads are allowed by the robots.txt.

I think the page is mainly targeted at enumerating all pages, or scraping (meta-)data from the webpage, instead of using the provided APIs. The only grey area might be downloading every PDF, I guess.

karpathy commented on September 25, 2024

In practice, arxiv-sanity only makes on average about 1-3 API requests to arxiv per day (to fetch the latest papers in the ML categories), so it's nowhere close to spamming their servers with requests.

karelin commented on September 25, 2024

@jamiesully Thanks for the update. I'll try to examine it in a couple of days. And it's very good that you've addressed the old arXiv ID format! It's very important for many paper types.

Moredread commented on September 25, 2024

@karelin you're right, bootstrapping via the search API is inefficient and slow. Missed that.

Dabbrivia commented on September 25, 2024

I can confirm that after 1000 PDFs have been downloaded during initialization, arxiv.org bans your IP and you start getting a 403 Forbidden response. This should be addressed if this code is to be used on a new installation, IMHO.
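One way this could be addressed is to pause between downloads and stop cleanly on the first 403, so the run can be resumed later instead of hammering the server while banned. The sketch below is only an illustration: `polite_download` and `fetch_pdf` are hypothetical names, and the 5-second default delay is an assumption, not a documented arXiv limit.

```python
import time
import urllib.error
import urllib.request


def fetch_pdf(url, timeout=30):
    """Download one URL and return its bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def polite_download(urls, fetch=fetch_pdf, delay=5.0, sleep=time.sleep):
    """Fetch URLs one at a time, sleeping `delay` seconds between requests.

    Returns (done, pending): `done` maps each fetched URL to its content,
    and `pending` lists the URLs not fetched because a 403 was received.
    Other HTTP errors are re-raised unchanged.
    """
    urls = list(urls)
    done = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # pause before every request except the first
        try:
            done[url] = fetch(url)
        except urllib.error.HTTPError as err:
            if err.code == 403:
                # Likely blocked: stop now and report what is left.
                return done, urls[i:]
            raise
    return done, []
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without touching the network; `pending` can be saved to disk and fed back in on a later run.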
