ropensci / arxiv
Programmatic interface to the arXiv API
Home Page: https://docs.ropensci.org/aRxiv
License: Other
hey @kbroman - We want all rOpenSci pkgs to consistently keep track of changes in a NEWS file, following https://github.com/ropensci/onboarding/blob/master/packaging_guide.md#-news, thanks!
In test-search.R, line 13, we get occasional errors, maybe 1 time in 10.
1. Error: empty results don't give an error (@test-search.R#13) ---------------------------------------------------
1: PCDATA invalid Char value 8
The arXiv API has some limitations; e.g., even with repeated requests of different slices, you may not be able to get all manuscripts matching a particular search. See, for example, this response on the arxiv-api google group, which notes that an initial search is cached and subsequent calls will just give subsets of that initial search.
They suggest using slices of time, but then they suggest that the OAI-PMH interface would be better for larger downloads.
Haven't looked at the OAI-PMH interface yet. I suspect it's better but more complicated.
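The "slices of time" suggestion could be sketched as below. This is only an illustration: date_slice_queries is a hypothetical helper, not part of aRxiv, and it assumes the submittedDate:[YYYYMMDDHHMM TO YYYYMMDDHHMM] range syntax can be combined with another search term using AND.

```r
# Hypothetical helper (not part of aRxiv): build one query per day between
# `from` and `to`, each restricted to a one-day submittedDate range.
date_slice_queries <- function(base_query, from, to) {
  days <- seq(as.Date(from), as.Date(to), by = "day")
  starts <- format(days[-length(days)], "%Y%m%d0000")  # start of each slice
  ends <- format(days[-1], "%Y%m%d0000")               # start of the next day
  sprintf("%s AND submittedDate:[%s TO %s]", base_query, starts, ends)
}

queries <- date_slice_queries("cat:hep-ph", "2018-01-24", "2018-01-26")
print(queries)
```

Each element could then be passed to arxiv_search() in turn. Adjacent slices share an endpoint, so removing duplicate ids from the combined results would probably be needed.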
For example, "cs.SI" is not in the list of categories in arxiv_cats, but does exist on the actual arXiv and is possible to search for.
The arXiv API user manual says to include a 3 second delay between API requests:
In cases where the API needs to be called multiple times in a row, we encourage you to play nice and incorporate a 3 second delay in your code. The detailed examples below illustrate how to do this in a variety of languages.
This seems unnecessarily long and will really slow down the package tests.
I'm using 3 seconds as the default, but then in the tests and examples I'm using a 0.5 second delay. Am I wrong to speed up the tests in this way?
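One way to make the delay configurable, with 3 seconds as the default and a shorter value in tests, might look like this. It is a minimal sketch: rate_limited is a hypothetical wrapper and says nothing about how aRxiv actually implements its delay.

```r
# Hypothetical wrapper: returns a version of `f` whose successive calls are
# spaced at least `delay` seconds apart.
rate_limited <- function(f, delay = 3) {
  last_call <- NULL  # time the previous call finished, NULL before first call
  function(...) {
    if (!is.null(last_call)) {
      elapsed <- as.numeric(difftime(Sys.time(), last_call, units = "secs"))
      if (elapsed < delay) Sys.sleep(delay - elapsed)
    }
    # Record the finish time even if f() throws an error.
    on.exit(last_call <<- Sys.time())
    f(...)
  }
}

# In tests, a shorter delay can be chosen explicitly.
polite_identity <- rate_limited(identity, delay = 0.5)
polite_identity("first call runs immediately")
polite_identity("second call waits about half a second")
```

Because the wrapper only sleeps for the remainder of the interval, time already spent parsing results counts toward the delay.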
The arXiv API server was down much of the day. I revised things to try to avoid problems at CRAN in this situation, but I guess this wasn't totally successful, as I got the following report from Prof. Brian Ripley:
This ran its checks once, failed when run with --run-donttest and then again without. With the error:
* checking tests ...
Running ‘testthat.R’ [2s/26s]
ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
Component "authors": Lengths (3, 1) differ (string compare on first 1)
Component "affiliations": Lengths (3, 1) differ (string compare on first 1)
Component "link_abstract": Lengths (3, 1) differ (string compare on first 1)
Component "link_pdf": Lengths (3, 1) differ (string compare on first 1)
Component "link_doi": Lengths (3, 1) differ (string compare on first 1)
Component "comment": Lengths (3, 1) differ (string compare on first 1)
Component "journal_ref": Lengths (3, 1) differ (string compare on first 1)
Component "doi": Lengths (3, 1) differ (string compare on first 1)
Component "primary_category": Lengths (3, 1) differ (string compare on first 1)
Component "categories": Lengths (3, 1) differ (string compare on first 1)
Error: Test failures
Execution halted
So the issue is not just if the server is down, but if it is unreliable.
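A retry wrapper is one possible way to soften an unreliable server. The sketch below uses only base R; with_retry is a hypothetical name, and it only handles calls that raise an error, not calls that silently return fewer records than expected, as in the report above.

```r
# Hypothetical helper: call f() up to `times` times, waiting `wait` seconds
# between attempts, and return the first result that is not an error.
with_retry <- function(f, times = 3, wait = 1) {
  for (attempt in seq_len(times)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt < times) Sys.sleep(wait)
  }
  stop("all ", times, " attempts failed; last error: ", conditionMessage(result))
}

# Stand-in for an unreliable request: fails twice, then succeeds.
n_calls <- 0
flaky_request <- function() {
  n_calls <<- n_calls + 1
  if (n_calls < 3) stop("server unavailable")
  "ok"
}
with_retry(flaky_request, times = 3, wait = 0)
```

In package tests this could be combined with testthat::skip_on_cran(), so that checks depending on the live API do not run on CRAN at all.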
hi @kbroman - would you mind using a NEWS file, and tagging each CRAN release with the relevant NEWS section for that tag in GitHub Releases (https://github.com/ropensci/arxiv/releases)?
It would be easier to see what's been changed in each version :)
Hi and thanks for this very nice package (it made my day!).
I'm trying to scrape the last, say, 15k papers from the hep-ph category, with:
res <- arxiv_search(
  "cat:hep-ph",
  limit = 15000,
  batchsize = 1000,
  sort_by = "submitted", ascending = FALSE
)
However, the number of rows in the returned dataframe varies from query to query (usually it is around 10k, but once I also got 1k)... I would love to provide a reproducible example but could not come up with one.
I'm not sure whether this is due to aRxiv or arXiv 😃 Have you ever noticed something similar? Might it have something to do with your comments on #14?
Thanks,
Valerio
For example:
> arxiv_count("stat.AP")
[1] 3140
attr(,"search_info")
    query id_list                                  time
"stat.AP"      "" "2014-09-07 02:09:11 America/Chicago"
I think I should give it a class and have a print method.
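A class with a print method might be sketched as follows. The class name arxiv_count_sketch and the constructor are hypothetical, chosen only to show the idea of keeping the search_info attribute while hiding it from the printed output.

```r
# Hypothetical constructor: a count carrying a search_info attribute and a class.
new_count_sketch <- function(n, query, id_list = "") {
  structure(
    n,
    search_info = c(query = query, id_list = id_list, time = format(Sys.time())),
    class = "arxiv_count_sketch"
  )
}

# Print only the number; the attribute stays available via attr().
print.arxiv_count_sketch <- function(x, ...) {
  print(as.vector(unclass(x)))  # as.vector() drops the attributes
  invisible(x)
}

count <- new_count_sketch(3140, "stat.AP")
print(count)
attr(count, "search_info")[["query"]]
```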
Certain records seem to cause a crash. We have narrowed it down to this query, which should retrieve all records submitted in the one-minute period from 22:16 to 22:17 on January 24, 2018:
dfy <- arxiv_search(query = "submittedDate:[201801242216 TO 201801242217]", limit = 15000, batchsize = 2000)
which returns an error of:
Error in attr(results, "search_info") <- search_attributes(query, id_list, :
attempt to set an attribute on NULL
We can isolate the record, which appears to be this one: https://arxiv.org/abs/1610.04266
If we search by title instead, the same error appears:
dfy <- arxiv_search(query = "ti:Fourfolds", limit = 1200, batchsize = 300)
We therefore think that the record may be corrupt (e.g., a hidden, unintentional column delimiter).
A similar error occurs on this single-day range, though we have not isolated the individual record causing the error:
dfy <- arxiv_search(query = "submittedDate:[201612030000 TO 201612040000]", limit = 15000, batchsize = 2000)
Does the query need to be modified? Can the query auto-skip corrupt records? Should arxiv be notified?
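The narrowing-down step described above could be automated by bisecting the time range. The sketch below is generic and assumption-laden: find_failing and the fetch argument are hypothetical, times are plain numbers (e.g., seconds), and fetch(start, end) stands for any function that throws an error when its range contains a problem record.

```r
# Hypothetical helper: return the smallest sub-ranges of [start, end]
# (no wider than min_width) on which fetch() raises an error.
find_failing <- function(fetch, start, end, min_width = 60) {
  ok <- tryCatch({ fetch(start, end); TRUE }, error = function(e) FALSE)
  if (ok) return(list())                        # nothing wrong in this range
  if (end - start <= min_width) return(list(c(start, end)))
  mid <- start + floor((end - start) / 2)
  c(find_failing(fetch, start, mid, min_width),
    find_failing(fetch, mid, end, min_width))
}

# Stand-in for a real search: fails whenever the range contains time 1000.
fake_fetch <- function(start, end) {
  if (start <= 1000 && 1000 < end) stop("attempt to set an attribute on NULL")
  invisible(NULL)
}
find_failing(fake_fetch, 0, 4096, min_width = 64)
```

The ranges that do not appear in the result could still be downloaded normally, which amounts to skipping the problem records.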
I can take care of this soon.
Need to revise the part of the vignette that talks about the arxiv_cats dataset, since I revised the structure of that dataset to have the abbreviations in a column called category. (Ugh; I'd just uploaded a new version of the package to CRAN.)
> test_check("aRxiv")
Loading required package: aRxiv
arxiv_errors : ...
arxiv_search in batches : 1
cleaning the records : Error in arxiv_count(query) : arXiv error: incorrect id format for NA
Calls: test_check ... all.equal.numeric -> attr.all.equal -> mode -> omit_attr -> arxiv_count
Execution halted
Would you mind taking a look please?
Hey @cpsievert, This might be a good use case to try out your XML2R package. It's one of our few sources that only returns XML with no option for JSON.
Oops; I forgot the roxygen2 stuff for the S3 method I created.
It would be nice to provide some helper functions to serialize the results into a data.frame, especially since the fields returned are often (but not always) standard. For example:
Arrrr> library(aRxiv)
Arrrr> z <- arxiv_search(id_list = "1403.3048,1402.2633,1309.1192")
Arrrr> sapply(z, length)
entry entry entry
   17    17    15
and fields returned are also not super consistent.
Arrrr> sapply(z, names)
$entry
[1] "id" "updated" "published" "title"
[5] "summary" "author" "author" "author"
[9] "author" "doi" "link" "comment"
[13] "journal_ref" "link" "link" "primary_category"
[17] "category"
$entry
[1] "id" "updated" "published" "title"
[5] "summary" "author" "author" "author"
[9] "author" "author" "author" "author"
[13] "comment" "link" "link" "primary_category"
[17] "category"
$entry
[1] "id" "updated" "published" "title"
[5] "summary" "author" "doi" "link"
[9] "comment" "journal_ref" "link" "link"
[13] "primary_category" "category" "category"
This helper function could take a rbind.fill approach to get an even data.frame returned, or you could consult the API and get a complete list of field names and construct a standard data.frame into which search results can be coerced. Feel free to discard the idea -- just throwing out a suggestion.
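A base-R version of the rbind.fill idea might look like the sketch below. records_to_df is a hypothetical helper, and it assumes each record is a flat named list of character values; real API entries may be nested (authors with affiliations, for instance) and would need flattening first.

```r
# Hypothetical helper: turn a list of ragged records into one data.frame.
# Repeated fields (e.g., several "author" entries) are collapsed with `sep`;
# fields missing from a record become NA.
records_to_df <- function(records, sep = "|") {
  fields <- unique(unlist(lapply(records, names)))
  rows <- lapply(records, function(rec) {
    vals <- vapply(fields, function(f) {
      hits <- unlist(rec[names(rec) == f])
      if (length(hits) == 0) NA_character_ else paste(hits, collapse = sep)
    }, character(1))
    as.data.frame(as.list(vals), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Made-up records with different field sets, mimicking the output above.
records <- list(
  list(id = "a", title = "First", author = "X", author = "Y", doi = "10.1/x"),
  list(id = "b", title = "Second", author = "Z")
)
records_to_df(records)
```

Collapsing repeated fields into one delimited string keeps one row per manuscript, at the cost of having to split the string again when individual authors are needed.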
Hi guys.
Executing a very simple count:
q <- 'ti:"COVID-19"'
z <- arxiv_count(query = q)
I get only 65 results, whereas the same query on the arXiv web site produces 1144 results.
What am I doing wrong?
Tx a lot
Manlio
I like the name aRxiv, because of the R, but I wonder if it will be confusing: aRxiv vs arXiv.
If we went with plain arxiv, there might be less confusion about the capitalization.
The other alternative is rarxiv. Like rplos etc. But maybe that's too many r's.
Dears,
I was using your package to download some bibliographic references from arXiv.
The package is very nice but now I have a problem.
I'm collecting references from several sources of information on a single EndNote file.
Question, is there a way to export the aRxiv dataframe that I have created into a RIS or EndNote file?
Sorry if the question is trivial but I'm not experienced with R.
Cheers
Fulvio
PS: below you can find the code I used
count1 <- arxiv_count(
  '(ecologi* OR aggreg*) AND (fallac* OR bias*)'
)
count2 <- arxiv_count(
  '("cross-level" OR "cross level") AND (inferenc* OR extrapolat* OR interpretat*)'
)
arXiv1 <- arxiv_search('(ecologi* OR aggreg*) AND (fallac* OR bias*)', limit = as.numeric(count1))
arXiv2 <- arxiv_search('("cross-level" OR "cross level") AND (inferenc* OR extrapolat* OR interpretat*)', limit = as.numeric(count2))
arXiv <- rbind(arXiv1, arXiv2)
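As far as this thread shows, the package has no RIS export, but a data frame can be written out by hand. The following is a rough sketch only: df_to_ris is a hypothetical helper, and it assumes the data frame has columns named title, authors (names separated by "|"), submitted, and link_abstract. The record type defaults to the generic RIS type GEN.

```r
# Hypothetical helper: convert a search-result data frame to RIS lines.
df_to_ris <- function(df, type = "GEN", sep = "|") {
  one_record <- function(i) {
    authors <- strsplit(df$authors[i], sep, fixed = TRUE)[[1]]
    c(paste0("TY  - ", type),
      paste0("TI  - ", df$title[i]),
      paste0("AU  - ", authors),                       # one AU line per author
      paste0("PY  - ", substr(df$submitted[i], 1, 4)), # year from the date
      paste0("UR  - ", df$link_abstract[i]),
      "ER  - ",
      "")
  }
  unlist(lapply(seq_len(nrow(df)), one_record))
}

# Made-up example row, only to show the output format.
example <- data.frame(
  title = "An example title",
  authors = "A. One|B. Two",
  submitted = "2018-01-24 22:16:00",
  link_abstract = "http://arxiv.org/abs/0000.00000",
  stringsAsFactors = FALSE
)
cat(df_to_ris(example), sep = "\n")
```

The result could be saved with writeLines(df_to_ris(arXiv), "arxiv.ris") and imported into EndNote from there.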
Fix use of class(), using inherits() rather than class(blah) == "blah".
arxiv_count.R
arxiv_search.R
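The reason for preferring inherits() can be seen in a small example. The class name search_result below is made up for illustration.

```r
# An object with more than one class, as is common for S3 objects.
res <- structure(list(), class = c("search_result", "list"))

# Comparing class() directly gives one logical value per class, so using it
# inside if() is unsafe (and an error in recent versions of R).
comparison <- class(res) == "search_result"
print(comparison)

# inherits() always returns a single TRUE or FALSE.
is_result <- inherits(res, "search_result")
print(is_result)
```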
The tests of sorted results are giving sporadic errors, in which arxiv_search seems to retrieve fewer than the expected number of results.
I put some example results here. That's for three successive runs of test(), with no other changes. The first two gave errors (but not exactly the same errors), while the last run was clean.
Note that the error message
Lengths (2, 1) differ (string compare on first 1)
means that the expected result had length 2 but the code was giving a result with length 1.