Comments (8)
I think the preferred way to download bulk metadata is the OAI interface, not the API (see https://arxiv.org/help/oa/index). It is actually much easier than running many API requests, especially for larger categories of the arXiv (such as hep-th with ~115,000 entries, which I was able to download in about an hour).
I rewrote a version of 'fetch_papers.py' to fetch the OAI XML and parse it to conform with the API dictionary structure, so that one can use the two methods interchangeably. I could upload it to the fork I made on GitHub, but I am more or less illiterate when it comes to programming, so it's not 'pretty'.
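For readers who want a concrete picture of the approach described above, here is a minimal sketch (not the code from the fork) of turning an OAI-PMH response in arXiv's `arXiv` metadata format into dictionaries shaped like Atom API entries. The namespace URIs, the element names, and the output keys (`summary`, `tags`, and so on) are assumptions based on my reading of the two formats and should be checked against a real response. The `SAMPLE` record is made-up placeholder data.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces for OAI-PMH 2.0 and arXiv's "arXiv" metadata format.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV_NS = "{http://arxiv.org/OAI/arXiv/}"


def parse_oai_records(xml_text):
    """Convert every <arXiv> metadata block into an API-like dict."""
    root = ET.fromstring(xml_text)
    entries = []
    for meta in root.iter(ARXIV_NS + "arXiv"):

        def text(tag):
            # Collapse internal whitespace, since titles often wrap lines.
            node = meta.find(ARXIV_NS + tag)
            return " ".join(node.text.split()) if node is not None and node.text else ""

        authors = []
        for a in meta.iter(ARXIV_NS + "author"):
            fore = a.findtext(ARXIV_NS + "forenames", default="")
            key = a.findtext(ARXIV_NS + "keyname", default="")
            authors.append({"name": (fore + " " + key).strip()})

        created = text("created")
        entries.append({
            "id": text("id"),
            "title": text("title"),
            "summary": text("abstract"),
            "authors": authors,
            "published": created,
            # Records that were never revised have no <updated> element.
            "updated": text("updated") or created,
            "tags": [{"term": c} for c in text("categories").split()],
        })
    return entries


def resumption_token(xml_text):
    """Return the token for the next page of results, or None on the last page."""
    node = ET.fromstring(xml_text).find(".//" + OAI_NS + "resumptionToken")
    return node.text if node is not None and node.text else None


# Placeholder record, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
<ListRecords><record><metadata>
<arXiv xmlns="http://arxiv.org/OAI/arXiv/">
<id>hep-th/9901001</id><created>1999-01-01</created>
<authors><author><keyname>Doe</keyname><forenames>Jane</forenames></author></authors>
<title>An  example
 title</title><categories>hep-th gr-qc</categories>
<abstract>Placeholder abstract.</abstract>
</arXiv></metadata></record>
<resumptionToken>tok|1001</resumptionToken>
</ListRecords></OAI-PMH>"""

entries = parse_oai_records(SAMPLE)
next_token = resumption_token(SAMPLE)
```

A harvesting loop would request a `ListRecords` page, parse it as above, and repeat with the returned resumption token until none is left, pausing between requests.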
from arxiv-sanity-preserver.
@jamiesully Please upload the new version. I'm interested and can do some kind of code review/polishing.
@karpathy Andrej, you're talking about daily updates of papers in the ML categories. But when one starts a new arxiv-sanity installation, the load will be much larger, especially for physics-related stuff. I'm afraid you've closed the issue too soon.
Btw, I expect that it is not fetching metadata but downloading the PDFs that will put the most load on the arXiv servers. They propose using S3 (payment necessary) for large numbers of papers, see here: https://arxiv.org/help/bulk_data_s3
from arxiv-sanity-preserver.
Yes, apologies for being unclear. I was considering the case of initializing the database, which requires a lot of calls to the API.
@karelin The files are now up on the master branch of the fork I made. There were also some minor tweaks I had to make elsewhere to handle the legacy arXiv IDs of the form 'CAT/XXXXXXX'.
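As an illustration of what handling both ID styles can involve (a sketch under my own assumptions, not the tweaks made in the fork), the snippet below splits an arXiv identifier into its base ID and version. The regular expressions and the helper names `split_arxiv_id` and `filename_stem` are hypothetical. Because legacy IDs contain a slash, any code that uses the ID as a filename has to replace it.

```python
import re

# New-style IDs: YYMM.NNNN or YYMM.NNNNN, optionally followed by vN.
NEW_ID = re.compile(r"^(\d{4}\.\d{4,5})(?:v(\d+))?$")
# Legacy IDs: archive[.SC]/YYMMNNN, e.g. hep-th/9901001 or math.GT/0309136.
OLD_ID = re.compile(r"^([a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v(\d+))?$")


def split_arxiv_id(raw):
    """Return (base_id, version); version is None when no vN suffix is present."""
    # Accept either a bare ID or a full abstract URL.
    raw = raw.rsplit("/abs/", 1)[-1]
    for pattern in (NEW_ID, OLD_ID):
        m = pattern.match(raw)
        if m:
            return m.group(1), int(m.group(2)) if m.group(2) else None
    raise ValueError("unrecognised arXiv id: %r" % raw)


def filename_stem(base_id):
    """Make an ID safe to use in a file name (legacy IDs contain '/')."""
    return base_id.replace("/", "_")
```

For example, `split_arxiv_id("http://arxiv.org/abs/hep-th/9901001v2")` separates the legacy ID from its version, and `filename_stem` can then be applied to the base ID before writing a PDF or text file to disk.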
from arxiv-sanity-preserver.
IANAL, but using the search API should be OK according to https://arxiv.org/help/bulk_data. As we need recent PDFs, "Bulk Full-Text Access" isn't really an option, but PDF downloads are allowed by the robots.txt.
I think the page is mainly targeted at enumerating all pages, or scraping (meta-)data from the webpage, instead of using the provided APIs. Only downloading every PDF might be a grey area I guess.
from arxiv-sanity-preserver.
In practice, arxiv-sanity only makes on average about 1-3 API requests to arxiv per day (to fetch the latest papers in the ML categories), so it's nowhere close to spamming their servers with requests.
from arxiv-sanity-preserver.
@jamiesully Thanks for the update. I'll try to examine it in a couple of days. And it's very good that you've addressed the old arXiv ID format! It is very important for many paper types.
from arxiv-sanity-preserver.
@karelin you're right, bootstrapping via the search API is inefficient and slow. I missed that.
from arxiv-sanity-preserver.
I can confirm that after about 1000 PDFs have been downloaded during initialization, arxiv.org bans your IP and you start getting a 403 Forbidden response. This should be addressed if this code is to be used on a new installation, IMHO.
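One way this could be addressed is to throttle the PDF downloads and stop as soon as a 403 appears. The sketch below is only an illustration: `download_politely`, the 5 second delay, and the 500-file cap are hypothetical values chosen for the example, not limits published by arXiv or used by arxiv-sanity.

```python
import time
import urllib.error
import urllib.request


def fetch_pdf(url):
    """Download one PDF and return its bytes."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def download_politely(urls, fetch=fetch_pdf, sleep=time.sleep,
                      delay=5.0, max_per_run=500):
    """Fetch URLs one by one, pausing between requests.

    Stops at the first 403 so that a ban is not made worse by retrying,
    and caps the number of files per run (both values are assumptions).
    """
    results = {}
    for count, url in enumerate(urls):
        if count >= max_per_run:
            break
        if count:
            sleep(delay)  # pause before every request except the first
        try:
            results[url] = fetch(url)
        except urllib.error.HTTPError as err:
            if err.code == 403:
                break  # likely rate-limited or banned: give up for now
            # Any other HTTP error: skip this paper and carry on.
    return results
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without touching the network. A later run can then resume with the URLs that are missing from the returned dictionary.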
from arxiv-sanity-preserver.