arxiv's Introduction

rOpenSci

Project Status: Abandoned

This repository has been archived. The former README is now in README-NOT.md.

arxiv's People

Contributors

diana-ly, karthik, kbroman, sckott, stevenysw

arxiv's Issues

Sporadic test fail

The test at line 13 of test-search.R fails occasionally, perhaps 1 time in 10:

1. Error: empty results don't give an error (@test-search.R#13) ---------------------------------------------------
1: PCDATA invalid Char value 8
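
The message "PCDATA invalid Char value 8" suggests that the response occasionally contains a backspace control character (character value 8), which XML 1.0 does not allow. One possible workaround, sketched here under the assumption that the raw response is available as a character string before it is parsed, is to strip the disallowed control characters first. The helper name `strip_invalid_xml_chars` is hypothetical and not part of aRxiv.

```r
# Remove control characters that XML 1.0 forbids (everything below 0x20
# except tab, newline, and carriage return).
strip_invalid_xml_chars <- function(txt) {
  gsub("[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F]", "", txt, perl = TRUE)
}

raw_xml <- "<title>Some\bTitle</title>"  # "\b" is the backspace, char value 8
clean_xml <- strip_invalid_xml_chars(raw_xml)
```

Tabs and newlines are left alone, so the layout of the response is otherwise unchanged.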

use case for XML2R?

Hey @cpsievert, this might be a good use case for trying out your XML2R package. arXiv is one of our few sources that returns only XML, with no option for JSON.

Continued pain regarding the tests

The arXiv API server was down much of the day. I revised things to try to avoid problems at CRAN in this situation, but I guess this wasn't totally successful, as I got the following report from Prof. Brian Ripley:

This ran its checks once, failed when run with --run-donttest and then again without. With the error:

* checking tests ...
 Running ‘testthat.R’ [2s/26s]
ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
 Component "authors": Lengths (3, 1) differ (string compare on first 1)
 Component "affiliations": Lengths (3, 1) differ (string compare on first 1)
 Component "link_abstract": Lengths (3, 1) differ (string compare on first 1)
 Component "link_pdf": Lengths (3, 1) differ (string compare on first 1)
 Component "link_doi": Lengths (3, 1) differ (string compare on first 1)
 Component "comment": Lengths (3, 1) differ (string compare on first 1)
 Component "journal_ref": Lengths (3, 1) differ (string compare on first 1)
 Component "doi": Lengths (3, 1) differ (string compare on first 1)
 Component "primary_category": Lengths (3, 1) differ (string compare on first 1)
 Component "categories": Lengths (3, 1) differ (string compare on first 1)

 Error: Test failures
 Execution halted

So the problem arises not only when the server is down, but also when it is unreliable.
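
One way to make the tests tolerate an unreliable server, sketched here as an assumption rather than as what the package does, is to retry a request a few times and give up quietly if it never returns the expected shape. The helper `with_retries` and the stand-in function `flaky` are hypothetical.

```r
# Call fetch() up to max_tries times; return the first result that passes
# is_ok(), or NULL if every attempt errors or fails the check.
with_retries <- function(fetch, is_ok, max_tries = 3, delay = 1) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(fetch(), error = function(e) NULL)
    if (!is.null(result) && is_ok(result)) return(result)
    if (i < max_tries) Sys.sleep(delay)
  }
  NULL
}

# Stand-in for an unreliable API call: fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("server unavailable")
  data.frame(id = c("a", "b", "c"))
}

res <- with_retries(flaky, function(x) nrow(x) == 3, max_tries = 5, delay = 0)
```

In a testthat test, a `NULL` result could be turned into `skip("arXiv API unreliable")` so that an unreliable server leads to a skipped test instead of a failure.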

Converting an aRxiv object into a RIS or EndNote file

Dear all,
I have been using your package to download bibliographic references from arXiv.
The package is very nice, but I have run into a problem.
I'm collecting references from several sources into a single EndNote file.
Is there a way to export the aRxiv data frame that I have created to a RIS or EndNote file?
Sorry if the question is trivial; I'm not experienced with R.
Cheers,
Fulvio
PS: The code I used is below.

# Count the matches for each query first, so the count can be used as the limit
count1 <- arxiv_count(
'(ecologi* OR aggreg*) AND (fallac* OR bias*)'
)

# Phrases use straight double quotes (the curly quotes were a copy-paste artifact)
count2 <- arxiv_count(
'("cross-level" OR "cross level") AND (inferenc* OR extrapolat* OR interpretat*)'
)

arXiv1 <- arxiv_search('(ecologi* OR aggreg*) AND (fallac* OR bias*)', limit = as.numeric(count1))

arXiv2 <- arxiv_search('("cross-level" OR "cross level") AND (inferenc* OR extrapolat* OR interpretat*)', limit = as.numeric(count2))

# Stack the two result sets into one data frame
arXiv <- rbind(arXiv1, arXiv2)
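
A minimal sketch of one way to write such a data frame out in RIS format, which EndNote can usually import. It assumes the data frame has `title`, `abstract`, `authors`, and `link_abstract` columns and that multiple authors are separated by `|` within a single string. Both are assumptions to check against the actual result. The function `to_ris` is hypothetical, not part of aRxiv, and the record type `UNPB` (unpublished work) is just one reasonable choice for preprints.

```r
# Convert each row of the data frame into one RIS record.
to_ris <- function(df) {
  squash <- function(x) gsub("\\s+", " ", x)  # RIS values should be on one line
  records <- vapply(seq_len(nrow(df)), function(i) {
    authors <- strsplit(df$authors[i], "|", fixed = TRUE)[[1]]
    lines <- c(
      "TY  - UNPB",
      paste0("AU  - ", authors),              # one AU line per author
      paste0("TI  - ", squash(df$title[i])),
      paste0("AB  - ", squash(df$abstract[i])),
      paste0("UR  - ", df$link_abstract[i]),
      "ER  - "
    )
    paste(lines, collapse = "\n")
  }, character(1))
  paste(records, collapse = "\n\n")
}

# Made-up record, for illustration only
example <- data.frame(
  title = "An example title",
  abstract = "A short\nabstract.",
  authors = "A. Author|B. Author",
  link_abstract = "http://example.org/abs/1",
  stringsAsFactors = FALSE
)
ris <- to_ris(example)
# writeLines(ris, "arxiv.ris")  # then import the file into EndNote
```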

Something wrong?

Hi guys.
Executing a very simple count:
q='ti:"COVID-19"'
z=arxiv_count(query=q)
I get only 65 results, whereas the same query on the arXiv website produces 1144 results.
What am I doing wrong?

Thanks a lot,

Manlio

Should aRxiv be using the OAI-PMH interface rather than the simpler API?

The arXiv API has some limitations; e.g., even with repeated requests for different slices, you may not be able to get all manuscripts matching a particular search. See, for example, this response on the arxiv-api Google group, which notes that an initial search is cached and subsequent calls will just give subsets of that initial search.

They suggest using slices of time, but then they suggest that the OAI-PMH interface would be better for larger downloads.

I haven't looked at the OAI-PMH interface yet. I suspect it's better but more complicated.
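
The "slices of time" suggestion could be sketched as follows: build one `submittedDate` range query per period and run them separately. The helper `date_slices` is hypothetical. The range syntax is the one used in other issues here, and whether both endpoints are inclusive is not verified, so results from adjacent slices may need de-duplicating by id.

```r
# Build one submittedDate range query per slice between two dates.
date_slices <- function(from, to, by = "1 month") {
  breaks <- seq(as.Date(from), as.Date(to), by = by)
  # Make sure the final slice reaches the end date
  if (breaks[length(breaks)] < as.Date(to)) breaks <- c(breaks, as.Date(to))
  starts <- breaks[-length(breaks)]
  ends <- breaks[-1]
  sprintf("submittedDate:[%s0000 TO %s0000]",
          format(starts, "%Y%m%d"), format(ends, "%Y%m%d"))
}

queries <- date_slices("2018-01-01", "2018-03-01")
# Each element could be combined with other terms, e.g.
# paste("cat:hep-ph AND", queries[1])
```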

Need to fix vignette for change in arxiv_cats

Need to revise the part of the vignette that talks about the arxiv_cats dataset, since I revised the structure of that dataset to have the abbreviations in a column called category. (Ugh; I'd just uploaded a new version of the package to CRAN.)

Searches with submittedDate ranges give varying results

The tests of sorted results are giving sporadic errors, in which arxiv_search seems to retrieve fewer than the expected number of results.

I put some example results here. That's for three successive runs of test(), with no other changes. The first two gave errors (but not exactly the same errors), while the last run was clean.

Note that the error message

 Lengths (2, 1) differ (string compare on first 1)

means that the expected result had length 2 but the code was giving a result with length 1.
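
The wording of that message matches what base R's `all.equal()` reports for character vectors of different lengths, with the two lengths given in argument order. A small illustration with made-up vectors, assuming the longer vector is passed first:

```r
expected <- c("first", "second")  # length 2
actual <- "first"                 # length 1

# all.equal() returns a message (not TRUE) when the lengths differ;
# only the first min(2, 1) = 1 element is compared as a string.
msg <- all.equal(expected, actual)
print(msg)
```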

Serialize into a data.frame?

It would be nice to provide some helper functions to serialize the results into a data.frame, especially since the fields returned are often (but not always) standard. For example, the number of fields varies between entries:

Arrrr> library(aRxiv)
Arrrr> z <- arxiv_search(id_list = "1403.3048,1402.2633,1309.1192")
Arrrr> sapply(z, length)
entry entry entry
   17    17    15

The fields returned are also not entirely consistent:

Arrrr> sapply(z, names)
$entry
 [1] "id"               "updated"          "published"        "title"
 [5] "summary"          "author"           "author"           "author"
 [9] "author"           "doi"              "link"             "comment"
[13] "journal_ref"      "link"             "link"             "primary_category"
[17] "category"

$entry
 [1] "id"               "updated"          "published"        "title"
 [5] "summary"          "author"           "author"           "author"
 [9] "author"           "author"           "author"           "author"
[13] "comment"          "link"             "link"             "primary_category"
[17] "category"

$entry
 [1] "id"               "updated"          "published"        "title"
 [5] "summary"          "author"           "doi"              "link"
 [9] "comment"          "journal_ref"      "link"             "link"
[13] "primary_category" "category"         "category"

This helper function could take an rbind.fill approach to get an even data.frame returned, or you could consult the API to get a complete list of field names and construct a standard data.frame into which search results can be coerced. Feel free to discard the idea; I'm just throwing out a suggestion.
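
A base-R sketch of the rbind.fill-style idea, under the simplifying assumption that each entry is a named list of single strings (real parsed XML may be nested and would need flattening first). Repeated fields such as `author` are collapsed into one string, and fields missing from an entry become `NA`. The function `entries_to_df` and the sample data are hypothetical.

```r
entries_to_df <- function(entries, sep = "|") {
  # Collapse repeated fields within each entry into a single string
  flat <- lapply(entries, function(e) {
    vals <- vapply(e, as.character, character(1))
    vapply(split(vals, names(vals)), paste, character(1), collapse = sep)
  })
  # Union of all field names seen across entries
  fields <- unique(unlist(lapply(flat, names)))
  rows <- lapply(flat, function(f) {
    row <- f[fields]  # fields missing from this entry become NA
    names(row) <- fields
    row
  })
  as.data.frame(do.call(rbind, unname(rows)), stringsAsFactors = FALSE)
}

# Made-up entries with repeated and missing fields
z <- list(
  entry = list(id = "1", title = "A", author = "X", author = "Y", doi = "doi-a"),
  entry = list(id = "2", title = "B", author = "Z")
)
df <- entries_to_df(z)
```

Here `df` has one row per entry and one column per distinct field name, so entries with differing fields still line up.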

Do we really need to wait 3 sec between API requests?

The arXiv API user manual says to include a 3 second delay between API requests:

In cases where the API needs to be called multiple times in a row, we encourage you to play nice and incorporate a 3 second delay in your code. The detailed examples below illustrate how to do this in a variety of languages.

This seems unnecessarily long and will really slow down the package tests.

I'm using 3 seconds as the default, but then in the tests and examples I'm using a 0.5 second delay. Am I wrong to speed up the tests in this way?
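
A sketch of how the delay could be kept configurable, so that the default follows the manual while tests pass a shorter value. Whether a shorter delay is acceptable is a question for arXiv; the code only shows the mechanism. The wrapper `make_throttled` is hypothetical.

```r
# Wrap a function so that successive calls are at least `delay` seconds apart.
make_throttled <- function(f, delay = 3) {
  last_call <- NULL
  function(...) {
    if (!is.null(last_call)) {
      wait <- delay - as.numeric(difftime(Sys.time(), last_call, units = "secs"))
      if (wait > 0) Sys.sleep(wait)
    }
    on.exit(last_call <<- Sys.time())  # record when this call finished
    f(...)
  }
}

# Demonstration with a short delay and a trivial function
slow_identity <- make_throttled(identity, delay = 0.2)
t0 <- Sys.time()
a <- slow_identity(1)
b <- slow_identity(2)  # waits until 0.2 s have passed since the first call
elapsed <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
```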

Change the name to arxiv (lower-case)?

I like the name aRxiv, because of the R, but I wonder if it will be confusing: aRxiv vs arXiv.

If we went with plain arxiv, there might be less confusion about the capitalization.

The other alternative is rarxiv. Like rplos etc. But maybe that's too many r's.

`nrow(arxiv_search())` is unpredictable

Hi and thanks for this very nice package (it made my day!).

I'm trying to scrape the last, say, 15k papers from the hep-ph category, with:

res <- arxiv_search(
  "cat:hep-ph",
  limit = 15000,
  batchsize = 1000,
  sort_by = "submitted", ascending = FALSE
)

However, the number of rows in the returned dataframe varies from query to query (usually it is around 10k, but once I also got 1k)... I would love to provide a reproducible example but could not come up with one.

I'm not sure whether this is due to aRxiv or arXiv 😃 Have you ever noticed something similar? Might it have something to do with your comments on #14?

Thanks,
Valerio

Corrupt record handling

Certain records seem to cause a crash. We have narrowed it down to this query, which should retrieve all records submitted in a one-minute period of 22:16 to 22:17 on January 24, 2018.

dfy<-arxiv_search(query = "submittedDate:[201801242216 TO 201801242217]", limit = 15000, batchsize=2000)

which returns the error:

Error in attr(results, "search_info") <- search_attributes(query, id_list,  :
  attempt to set an attribute on NULL

We can isolate the record, which appears to be this one:
https://arxiv.org/abs/1610.04266

Searching by title gives the same error:
dfy <- arxiv_search(query = "ti:Fourfolds", limit = 1200, batchsize = 300)
We therefore think that the record is corrupt (e.g., it contains a hidden, unintentional column delimiter).

A similar error occurs for this one-day range, though we have not isolated the individual record causing it:
dfy <- arxiv_search(query = "submittedDate:[201612030000 TO 201612040000]", limit = 15000, batchsize = 2000)
Does the query need to be modified? Can the query automatically skip corrupt records? Should arXiv be notified?
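
Narrowing a failing range down to a single minute, as described above, can be automated by bisection. The sketch below assumes a hypothetical `fetch(start, end)` function that raises an error whenever the half-open range from `start` up to (but not including) `end` contains a problem record, with times expressed as integer minute offsets. In practice `fetch` would build a `submittedDate` query and call `arxiv_search`.

```r
# Recursively split a failing range until it is at most min_width wide.
# start and end are assumed to be whole numbers (e.g. minutes).
find_failing_ranges <- function(fetch, start, end, min_width = 1) {
  ok <- tryCatch({ fetch(start, end); TRUE }, error = function(e) FALSE)
  if (ok) return(list())
  if (end - start <= min_width) return(list(c(start, end)))
  mid <- start + floor((end - start) / 2)
  c(find_failing_ranges(fetch, start, mid, min_width),
    find_failing_ranges(fetch, mid, end, min_width))
}

# Stand-in for the real search: fails if the range covers minute 37
fake_fetch <- function(start, end) {
  if (start <= 37 && 37 < end) stop("attempt to set an attribute on NULL")
  invisible(NULL)
}

bad <- find_failing_ranges(fake_fetch, 0, 64)
```

The ranges returned in `bad` could then be excluded from later downloads, so that a single problem record does not abort the whole search.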

Fix use of class()

Fix use of class(), using inherits() rather than class(blah) == "blah".

  • arxiv_count.R
  • arxiv_search.R
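
A short illustration of why the change matters: when an object has more than one class, `class(x) == "blah"` yields one logical value per class, and using that in `if()` is an error in R 4.2.0 and later. The object and the class name `my_error` below are made up for the example.

```r
# An object carrying two classes
x <- structure("boom", class = c("my_error", "try-error"))

cmp <- class(x) == "try-error"   # one value per class: FALSE TRUE
inh <- inherits(x, "try-error")  # a single TRUE, safe to use in if()

if (inherits(x, "try-error")) message("request failed")
```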

Test failure with dev version of httr

  > test_check("aRxiv")
  Loading required package: aRxiv
  arxiv_errors : ...
  arxiv_search in batches : 1
  cleaning the records : Error in arxiv_count(query) : arXiv error: incorrect id format for NA
  Calls: test_check ... all.equal.numeric -> attr.all.equal -> mode -> omit_attr -> arxiv_count
  Execution halted

Would you mind taking a look please?
