ropensci / arxiv Goto Github PK

60.0 10.0 10.0 335 KB

Programmatic interface to the Arxiv API

Home Page: https://docs.ropensci.org/aRxiv

License: Other

Makefile 0.85% R 99.15%

arxiv-api arxiv arxiv-org arxiv-analytics r rstats r-package

arxiv's Issues

Do we really need to wait 3 sec between API requests?

The arXiv API user manual says to include a 3 second delay between API requests:

In cases where the API needs to be called multiple times in a row, we encourage you to play nice and incorporate a 3 second delay in your code. The detailed examples below illustrate how to do this in a variety of languages.

This seems unnecessarily long and will really slow down the package tests.

I'm using 3 seconds as the default, but then in the tests and examples I'm using a 0.5 second delay. Am I wrong to speed up the tests in this way?
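
A minimal sketch of how a configurable delay could be enforced between successive requests. The helper name `make_throttle` and its argument are hypothetical and not part of aRxiv; the idea is just to sleep only for whatever part of the delay has not already elapsed.

```r
# Hypothetical helper (not part of aRxiv): returns a function that, when
# called before each request, waits until `delay_sec` seconds have passed
# since the previous call.
make_throttle <- function(delay_sec = 3) {
  last_call <- NULL
  function() {
    if (!is.null(last_call)) {
      elapsed <- as.numeric(difftime(Sys.time(), last_call, units = "secs"))
      if (elapsed < delay_sec) Sys.sleep(delay_sec - elapsed)
    }
    last_call <<- Sys.time()  # remember when this request was released
    invisible(NULL)
  }
}

# Usage: 3 seconds for real use, something shorter in tests.
# throttle <- make_throttle(delay_sec = 0.5)
# throttle(); first_request(); throttle(); second_request()
```

With this shape, the tests can pass a shorter delay without changing the default that ordinary users get.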

Is something wrong?

Hi guys.
Executing a very simple count:
q='ti:"COVID-19"'
z=arxiv_count(query=q)
I get only 65 results, whereas the same query on the arXiv web site produces 1144 results.
What am I doing wrong?

Thanks a lot

Manlio

arxiv_count: the "search_info" attribute clutters the output

For example:

> arxiv_count("stat.AP")
[1] 3140
attr(,"search_info")
                                query                               id_list                                  time
                            "stat.AP"                                    "" "2014-09-07 02:09:11 America/Chicago"

I think I should give it a class and have a print method.
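
A rough sketch of that idea, assuming a hypothetical class name `arxiv_count_sketch` and constructor `new_count_sketch` (neither is the package's actual code). The print method shows only the count, while the `search_info` attribute stays available through `attr()`.

```r
# Hypothetical constructor: attach search info and a class to a count.
new_count_sketch <- function(n, query, id_list = "") {
  attr(n, "search_info") <- c(query = query, id_list = id_list,
                              time = format(Sys.time()))
  class(n) <- "arxiv_count_sketch"
  n
}

# Print method: drop all attributes (including the class) before printing,
# so only the bare number is shown.
print.arxiv_count_sketch <- function(x, ...) {
  y <- x
  attributes(y) <- NULL
  print(y, ...)
  invisible(x)
}

cnt <- new_count_sketch(3140, "stat.AP")
print(cnt)                 # prints just the count
attr(cnt, "search_info")   # the details are still there on request
```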

Change the name to arxiv (lower-case)?

I like the name aRxiv, because of the R, but I wonder if it will be confusing: aRxiv vs arXiv.

If we went with plain arxiv, there might be less confusion about the capitalization.

The other alternative is rarxiv. Like rplos etc. But maybe that's too many r's.

Searches with submittedDate ranges give varying results

The tests of sorted results are giving sporadic errors, in which arxiv_search seems to retrieve fewer results than expected.

I put some example results here. That's for three successive runs of test(), with no other changes. The first two gave errors (but not exactly the same errors), while the last run was clean.

Note that the error message

 Lengths (2, 1) differ (string compare on first 1)

means that the expected result had length 2 but the code was giving a result with length 1.

Test failure with dev version of httr

  > test_check("aRxiv")
  Loading required package: aRxiv
  arxiv_errors : ...
  arxiv_search in batches : 1
  cleaning the records : Error in arxiv_count(query) : arXiv error: incorrect id format for NA
  Calls: test_check ... all.equal.numeric -> attr.all.equal -> mode -> omit_attr -> arxiv_count
  Execution halted

Would you mind taking a look please?

Converting an aRxiv object into a RIS or EndNote file

Dear all,
I was using your package to download some bibliographic references from arXiv.
The package is very nice, but now I have a problem.
I'm collecting references from several sources of information in a single EndNote file.
Is there a way to export the aRxiv data frame that I have created to a RIS or EndNote file?
Sorry if the question is trivial, but I'm not experienced with R.
Cheers
Fulvio
PS: below is the code I used

count1 <- arxiv_count(
'(ecologi* OR aggreg*) AND (fallac* OR bias*)'
)

count2 <- arxiv_count(
'("cross-level" OR "cross level") AND (inferenc* OR extrapolat* OR interpretat*)'
)

arXiv1 <- arxiv_search('(ecologi* OR aggreg*) AND (fallac* OR bias*)', limit = as.numeric(count1))

arXiv2 <- arxiv_search('("cross-level" OR "cross level") AND (inferenc* OR extrapolat* OR interpretat*)', limit = as.numeric(count2))

arXiv <- rbind(arXiv1, arXiv2)
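
One possible approach, sketched under assumptions: write the data frame out as RIS text, which EndNote can typically import. The function `df_to_ris` is hypothetical and not part of aRxiv. It assumes the search result has character columns `title`, `authors`, `submitted`, `abstract` and `link_abstract`, and that multiple authors are pasted together with `|` as the separator; check both against your own data frame before relying on it.

```r
# Hypothetical helper: turn each row of a search-result data frame into one
# RIS record. Column names and the "|" author separator are assumptions.
df_to_ris <- function(df, sep = "|") {
  vapply(seq_len(nrow(df)), function(i) {
    authors <- trimws(strsplit(df$authors[i], sep, fixed = TRUE)[[1]])
    lines <- c(
      "TY  - UNPB",                                     # unpublished work
      paste0("TI  - ", df$title[i]),
      if (length(authors)) paste0("AU  - ", authors),   # one AU line each
      paste0("PY  - ", substr(df$submitted[i], 1, 4)),  # year of submission
      paste0("AB  - ", df$abstract[i]),
      paste0("UR  - ", df$link_abstract[i]),
      "ER  - "
    )
    paste(lines, collapse = "\n")
  }, character(1))
}

# Toy data frame standing in for a search result (made-up values).
toy <- data.frame(title = "A toy title", authors = "Ann Author|Bob Builder",
                  submitted = "2018-01-24 22:16:30", abstract = "Some text.",
                  link_abstract = "https://example.org/abs/1",
                  stringsAsFactors = FALSE)
cat(df_to_ris(toy), sep = "\n\n")
```

For the combined data frame above, something like `writeLines(df_to_ris(arXiv), "arxiv.ris")` would produce a file to import.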

put NEWS bits in releases

hey @kbroman - We want all rOpenSci pkgs to consistently keep track of changes, following https://github.com/ropensci/onboarding/blob/master/packaging_guide.md#-news

you already keep a NEWS file, thanks!
thanks for git tagging!
Could you please use the releases tab on this repo to include the associated NEWS items for each tag/version ? thanks 😄

`nrow(arxiv_search())` is unpredictable

Hi and thanks for this very nice package (it made my day!).

I'm trying to scrape the last, say, 15k papers from the hep-ph category, with:

res <- arxiv_search(
	"cat:hep-ph",
	limit = 15000,
	batchsize = 1000,
	sort_by = "submitted", ascending = FALSE
	)

However, the number of rows in the returned dataframe varies from query to query (usually it is around 10k, but once I also got 1k)... I would love to provide a reproducible example but could not come up with one.

I'm not sure whether this is due to aRxiv or arXiv 😃 Have you ever noticed something similar? Might it have something to do with your comments on #14?

Thanks,
Valerio

Need to fix vignette for change in arxiv_cats

Need to revise the part of the vignette that talks about the arxiv_cats dataset, since I revised the structure of that dataset to have the abbreviations in a column called category. (Ugh; I'd just uploaded a new version of the package to CRAN.)

Serialize into a data.frame?

It would be nice to provide some helper functions to serialize the results into a data.frame, especially since the fields returned are often (but not always) standard. For example:

Arrrr> library(aRxiv)
Arrrr> z <- arxiv_search(id_list = "1403.3048,1402.2633,1309.1192")
Arrrr> sapply(z, length)
entry entry entry
   17    17    15

and fields returned are also not super consistent.

Arrrr> sapply(z, names)
$entry
 [1] "id"               "updated"          "published"        "title"
 [5] "summary"          "author"           "author"           "author"
 [9] "author"           "doi"              "link"             "comment"
[13] "journal_ref"      "link"             "link"             "primary_category"
[17] "category"

$entry
 [1] "id"               "updated"          "published"        "title"
 [5] "summary"          "author"           "author"           "author"
 [9] "author"           "author"           "author"           "author"
[13] "comment"          "link"             "link"             "primary_category"
[17] "category"

$entry
 [1] "id"               "updated"          "published"        "title"
 [5] "summary"          "author"           "doi"              "link"
 [9] "comment"          "journal_ref"      "link"             "link"
[13] "primary_category" "category"         "category"

This helper function could take an rbind.fill approach to return an evenly shaped data.frame, or you could consult the API for a complete list of field names and construct a standard data.frame into which search results can be coerced. Feel free to discard the idea -- just throwing out a suggestion.
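
A base-R sketch of the fill approach, using a hypothetical helper `records_to_df` and made-up toy records rather than real API output. Repeated fields (such as several `author` entries) are pasted together with a separator, and fields missing from a record become `NA`.

```r
# Hypothetical helper: coerce a list of records (named lists whose names may
# repeat or be missing) into a data frame with one row per record.
records_to_df <- function(records, sep = "|") {
  all_fields <- unique(unlist(lapply(records, names)))
  rows <- lapply(records, function(rec) {
    vals <- vapply(all_fields, function(f) {
      hits <- unlist(rec[names(rec) == f])
      if (length(hits) == 0) NA_character_ else paste(hits, collapse = sep)
    }, character(1))
    as.data.frame(as.list(vals), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Toy records with uneven fields (made-up values).
recs <- list(
  list(id = "1", title = "First",  author = "A", author = "B", doi = "x/1"),
  list(id = "2", title = "Second", author = "C")
)
records_to_df(recs)
```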

use case for XML2R?

Hey @cpsievert, This might be a good use case to try out your XML2R package. It's one of our few sources that only returns XML with no option for JSON.

NEWS and releases

hi @kbroman - would you mind using a NEWS file, and tagging each CRAN release with the relevant NEWS section for that tag in github Releases https://github.com/ropensci/arxiv/releases ?

would be easier to see what's been changed in each version :)

print method for result of arxiv_count

Oops; I forgot the roxygen2 stuff for the S3 method I created.

Continued pain regarding the tests

The arXiv API server was down much of the day. I revised things to try to avoid problems at CRAN in this situation, but I guess this wasn't totally successful, as I got the following report from Prof. Brian Ripley:

This ran its checks once, failed when run with --run-donttest and then again without. With the error:

* checking tests ...
 Running ‘testthat.R’ [2s/26s]
ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
 Component "authors": Lengths (3, 1) differ (string compare on first 1)
 Component "affiliations": Lengths (3, 1) differ (string compare on first 1)
 Component "link_abstract": Lengths (3, 1) differ (string compare on first 1)
 Component "link_pdf": Lengths (3, 1) differ (string compare on first 1)
 Component "link_doi": Lengths (3, 1) differ (string compare on first 1)
 Component "comment": Lengths (3, 1) differ (string compare on first 1)
 Component "journal_ref": Lengths (3, 1) differ (string compare on first 1)
 Component "doi": Lengths (3, 1) differ (string compare on first 1)
 Component "primary_category": Lengths (3, 1) differ (string compare on first 1)
 Component "categories": Lengths (3, 1) differ (string compare on first 1)

 Error: Test failures
 Execution halted

So the issue is not just if the server is down, but if it is unreliable.
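
One way to make the tests more tolerant of a flaky server, as a sketch only: wrap each API call in a retry helper and skip the test when every attempt fails. The helper `with_retries` is hypothetical; in a testthat test a `NULL` return could be followed by `testthat::skip()`. Note that this only handles calls that raise an error, not calls that silently return fewer records.

```r
# Hypothetical helper: call f() up to max_tries times, returning its value on
# the first success, or NULL if every attempt raised an error.
with_retries <- function(f, max_tries = 3, wait_sec = 0) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (i < max_tries) Sys.sleep(wait_sec)  # pause before the next attempt
  }
  NULL
}

# Stand-in for an unreliable request: fails twice, then succeeds.
n_calls <- 0
flaky <- function() {
  n_calls <<- n_calls + 1
  if (n_calls < 3) stop("server unavailable")
  "ok"
}
with_retries(flaky, max_tries = 5)
```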

Fix use of class()

Fix use of class(), using inherits() rather than class(blah) == "blah".

arxiv_count.R
arxiv_search.R
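
A small illustration of why the comparison is fragile, using made-up class names: when an object has more than one class, `class(x) == "..."` returns one logical value per class, whereas `inherits()` always returns a single `TRUE` or `FALSE`.

```r
# Toy object with two classes (class names are made up for illustration).
x <- structure(list(), class = c("child_sketch", "parent_sketch"))

class(x) == "parent_sketch"    # a length-2 logical vector, not a single value
inherits(x, "parent_sketch")   # a single logical, safe to use inside if()

# In recent versions of R, if (class(x) == "parent_sketch") is an error
# because the condition has length greater than one.
if (inherits(x, "parent_sketch")) message("x is a parent_sketch")
```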

Should aRxiv be using the OAI-PMH interface rather than the simpler API?

The arXiv API has some limitations; e.g., even with repeated requests of different slices, you may not be able to get all manuscripts matching a particular search. See, for example, this response on the arxiv-api google group, which notes that an initial search is cached and subsequent calls will just give subsets of that initial search.

They suggest using slices of time, but then they suggest that the OAI-PMH interface would be better for larger downloads.

Haven't looked at the OAI-PMH interface yet. I suspect it's better but more complicated.
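
The time-slice suggestion could be sketched roughly as follows. The helper `date_slice_queries` is hypothetical; it only builds query strings in the `submittedDate:[from TO to]` form and does not contact the API. Adjacent slices share a boundary timestamp, so results from successive slices may need de-duplicating by id.

```r
# Hypothetical helper: split [start, end] into consecutive slices and build
# one submittedDate range query per slice, optionally ANDed with a base query.
date_slice_queries <- function(start, end, by = "1 month", base_query = NULL) {
  start <- as.Date(start)
  end <- as.Date(end)
  breaks <- seq(start, end, by = by)
  if (tail(breaks, 1) < end) breaks <- c(breaks, end)  # cover the remainder
  from <- format(head(breaks, -1), "%Y%m%d0000")
  to <- format(tail(breaks, -1), "%Y%m%d0000")
  q <- sprintf("submittedDate:[%s TO %s]", from, to)
  if (!is.null(base_query)) q <- paste(base_query, "AND", q)
  q
}

date_slice_queries("2018-01-01", "2018-03-01", base_query = "cat:hep-ph")
```

Each resulting string could then be passed as the `query` argument of `arxiv_search()`, one slice at a time.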

Sporadic test fail

In test-search.R line 13, we get occasional errors, like 1 time in 10, maybe.

1. Error: empty results don't give an error (@test-search.R#13) ---------------------------------------------------
1: PCDATA invalid Char value 8
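
"Char value 8" is a backspace control character, which is not allowed in XML 1.0 text. A possible workaround, sketched here as an assumption rather than what the package does, is to strip such characters from the raw response text before parsing. The helper name `strip_control_chars` is hypothetical.

```r
# Hypothetical helper: remove control characters that XML 1.0 forbids
# (everything below 0x20 except tab, newline and carriage return).
strip_control_chars <- function(txt) {
  gsub("[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F]", "", txt, perl = TRUE)
}

dirty <- "<title>Some\btitle</title>"   # "\b" is the backspace character
strip_control_chars(dirty)
```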

Corrupt record handling

Certain records seem to cause a crash. We have narrowed it down to this query, which should retrieve all records submitted in a one-minute period of 22:16 to 22:17 on January 24, 2018.

dfy<-arxiv_search(query = "submittedDate:[201801242216 TO 201801242217]", limit = 15000, batchsize=2000)

which returns an error of:

> Error in attr(results, "search_info") <- search_attributes(query, id_list,  : 
>   attempt to set an attribute on NULL
>

We can isolate the record, which appears to be this one:
https://arxiv.org/abs/1610.04266

If we search by title instead, the same error appears:
dfy <- arxiv_search(query = "ti:Fourfolds", limit = 1200, batchsize = 300)
We therefore suspect that the record is corrupt (e.g., it contains a hidden, unintended column delimiter).

A similar error occurs on this single-day range, though we have not isolated the individual record causing the error:
dfy <- arxiv_search(query = "submittedDate:[201612030000 TO 201612040000]", limit = 15000, batchsize = 2000)
Does the query need to be modified? Can the query auto-skip corrupt records? Should arXiv be notified?
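
Until the cause is fixed, one way to narrow down which time window holds the offending record is to bisect the range and keep only the sub-windows that still fail. This is a sketch: `find_bad_windows` and the `fetch` argument are hypothetical, and `fetch(from, to)` is assumed to be any function that raises an error when the search for that window fails (for example a wrapper around `arxiv_search()`). The toy `fetch` below just simulates one bad record.

```r
# Hypothetical helper: recursively halve [from, to) and return the smallest
# failing windows (no wider than min_width seconds) as a data frame.
find_bad_windows <- function(from, to, fetch, min_width = 60) {
  ok <- tryCatch({ fetch(from, to); TRUE }, error = function(e) FALSE)
  if (ok) return(NULL)
  if (as.numeric(difftime(to, from, units = "secs")) <= min_width) {
    return(data.frame(from = from, to = to))
  }
  mid <- from + (to - from) / 2
  rbind(find_bad_windows(from, mid, fetch, min_width),
        find_bad_windows(mid, to, fetch, min_width))
}

# Simulated fetch: fails whenever the window contains one "bad" timestamp.
bad_time <- as.POSIXct("2018-01-24 22:16:30", tz = "UTC")
toy_fetch <- function(from, to) {
  if (from <= bad_time && bad_time < to) stop("attempt to set an attribute on NULL")
  invisible(TRUE)
}

find_bad_windows(as.POSIXct("2018-01-24 22:00:00", tz = "UTC"),
                 as.POSIXct("2018-01-24 23:04:00", tz = "UTC"),
                 toy_fetch)
```

The windows that come back clean can still be downloaded, which amounts to skipping the failing minute rather than losing the whole day.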

The list of categories in the package is out of date

For example, "cs.SI" is not in the list of categories in arxiv_cats, but it does exist on the actual arXiv and can be searched for.

Add Windows CI

I can take care of this soon.
