ropensci / rdryad

R client for Dryad web services

Home Page: https://docs.ropensci.org/rdryad

License: Other

R 94.95% Makefile 1.84% Rich Text Format 3.20%

dryad-api dryad data doi oai-pmh r rstats r-package

rdryad's Introduction

rdryad

rdryad is a package to interface with the Dryad data repository.

This package will be superseded by {deposits}. See Issue #39.

General Dryad API documentation: https://datadryad.org/api/v2/docs/

rdryad docs: https://docs.ropensci.org/rdryad/

Installation

Install rdryad from CRAN:

install.packages("rdryad")

Development version:

remotes::install_github("ropensci/rdryad")

library('rdryad')

rdryad's People

Contributors

Stargazers

Watchers

Forkers

tengelkes jlehtoma mja aammd alrutten nmepscor mpadge

rdryad's Issues

solr fxns: add new dspace archived parameter to queries

&DSpaceStatus:Archived

No corresponding file found

Hi,
First, thanks for building rdryad =) it's nice to have a package to directly manage Dryad repositories.

I stumbled upon an interesting dataset from Dryad. The DOI of the article is 10.1111/ecog.01986, and I can find it on Dryad:

q = rdryad::d_solr_search(q = "10.1111/ecog.01986")
q$handle
# [1] "10255/dryad.116170"

I get the handle back but can't seem to download data from it:

rdryad::download_url("10255/dryad.116170")
# Error: No output from search

Looking at the query's all_ac field, there is another handle mentioned: 10255/dryad.116171

rdryad::download_url("10255/dryad.116171")
# [1] "http://datadryad.org/bitstream/handle/10255/dryad.116171/DryadArchive.zip?sequence=1"

This one works. Do you know why? Is it specific to Dryad's architecture? Is there a way to programmatically extract this second handle?
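One possible way to pull the second handle out programmatically is a plain regular-expression match. This is only a sketch: extract_handles is a hypothetical helper (not part of rdryad), and the example string is an assumed stand-in for whatever the all_ac field actually contains.

```r
# Hypothetical helper, not part of rdryad: find every Dryad handle
# (10255/dryad.<digits>) in free text or in a list of strings.
extract_handles <- function(x) {
  x <- as.character(unlist(x))
  hits <- regmatches(x, gregexpr("10255/dryad\\.[0-9]+", x))
  unique(unlist(hits))
}

# Assumed shape of the all_ac field; the real content may differ.
all_ac <- "10255/dryad.116170 some title 10255/dryad.116171"

handles <- extract_handles(all_ac)

# Handles other than the one returned in q$handle
other <- setdiff(handles, "10255/dryad.116170")
```

Each remaining handle could then be tried with download_url() in turn.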

lapply error with datasets.

I've come across this with multiple datasets. Here's the dataset I want:
http://datadryad.org/resource/doi:10.5061/dryad.1664

dryaddat <- download_url("10255/dryad.1664")
Error in out[lapply(out, str_detect, pattern = "sequence=1") == "TRUE"][[1]] : subscript out of bounds

keep

get a download url

download_url

OAI-PMH functions

download_dryadmetadata -> dr_get_records
dr_identify -> same
dr_listidentifiers -> same
dr_listmetadataformats -> same
dr_listsets -> same

download a file, return path, simple

dryad_getfile

Solr interface functions

d_solr_facet
d_solr_group
d_solr_highlight
d_solr_mlt
d_solr_search
d_solr_stats

remove

getalldryad_metadata
search_dryad

Changes in Dryad's ZIP download feature

Dryad has recently changed our internal process for generating ZIP files that allow download of an entire dataset.

As a result of these changes, the vast majority of downloads will start faster. However, for larger datasets (currently above 200MB), a single-ZIP download will not be available. In these cases, dryad_download will not work, and users should be directed to use dryad_files_download instead.

use oai pkg for OAI work

replace httr with crul

Uploading to Dryad

Hi,

I would like to upload files to Dryad using rdryad. Do you think that will be possible in the near future?

Thank you.

readme xhtml

From Kurt Hornik

These have README.md files which when converted to (X)HTML using a
current version of pandoc show problems when validated using W3C Markup
Validator, see below.

Most of these problems are caused by using images without giving a name
(so the required alt attribute for <img> is not provided), or using <br>
instead of <br/>.

Pls fix these problems in your README.md files for your next release: in
all cases I inspected, the fixes were obvious and confirmation using
pandoc and W3C markup validator seemed unnecessary.

Please also visit your package check web page at http://cran.r-project.org/web/checks/check_results_PACKAGENAME.html to see if other problems need to be addressed as well.

README fixes

These packages contain README.md files with invalid HTML output created
by pandoc 1.12.4.2 according to W3C-validator.

I attach the HTML errors and warnings found below, and will put copies
of the corresponding HTML files up at
http://www.r-project.org/nosvn/pandoc.

Please investigate the problems and fix as needed.

Afaics, many of the problems are caused by adding "raw" HTML elements in
the README.md files and not realizing that the default output format
"html" is XHTML 1 (and not HTML 5). E.g., a raw <br>
results in an

end tag for "br" omitted, but OMITTAG NO was specified

error.

Best
-k

rdryad.html:
  Valid: FALSE (errors: 1, warnings: 0)
  Errors:
    line  col  message
      43   14  there is no attribute "border"

More tests

CC0 to MIT

Name conflict

identify is a function in base graphics. Can we rename it in the next version to avoid masking a base function?

replace ReadImages with some other package

Message from Brian Ripley:

"ReadImages has been orphaned and will be archived shortly: the 'maintainer' never updated it for R 2.14.0 and never gave credit for work he included.

Packages:

HistData ImageMetrics Momocs RXKCD RcmdrPlugin.SCDA SCVA geomorph rdryad

in theory make use of it (only HistData does in its checks). Please make alternative arrangements (e.g. read.jpeg can be replaced by readJPEG in package jpeg) by the end of January."

Can't pass arguments to dryad_datasets

All these calls return the same results:

dryad_datasets()
dryad_datasets(per_page = 25, page = 2)
dryad_datasets(per_page = 200)

This means that the page and per_page parameters (see the Dryad API docs) are not being passed on to the request. This should be a simple fix according to @mpadge
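A minimal sketch of how the two parameters could be collected into a query list before the request is made. build_query is a hypothetical helper, not rdryad's actual code; the commented crul call shows where such a list would go, assuming the /api/v2/datasets endpoint.

```r
# Hypothetical helper, not rdryad's internals: keep only the
# pagination arguments the caller actually supplied.
build_query <- function(page = NULL, per_page = NULL) {
  args <- list(page = page, per_page = per_page)
  Filter(Negate(is.null), args)
}

query <- build_query(page = 2, per_page = 25)

# Not run (needs the crul package and network access):
# con <- crul::HttpClient$new("https://datadryad.org")
# res <- con$get("api/v2/datasets", query = query)
```

Dropping NULL entries means a bare dryad_datasets() call would still fall back to the API's own defaults.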

Unable to download files from dryad

Hi, I'm trying to download individual files from a published dataset.

From the linked dryad website, I copied file ids for files of interest, however I encountered problems while trying to download them.

One file failed completely:

> dryad_files_download(33893)
Error in file(file): invalid 'description' argument
Traceback:

1. dryad_files_download(33893)
2. Map(function(x, y) each_files_download(x, y, ...), ids, paths)
3. mapply(FUN = f, ..., SIMPLIFY = FALSE)
4. (function (x, y) 
 . each_files_download(x, y, ...))(dots[[1L]][[1L]], dots[[2L]][[1L]])
5. each_files_download(x, y, ...)
6. file(file)

Another file downloaded successfully, but its content is missing:

> dryad_files_download(33892)
[[1]]
[1] "/home/jena/.cache/R/rdryad/33892.docx"

$ ls -l /home/jena/.cache/R/rdryad/33892.docx
-rw-r--r-- 1 jena jena 6 pro 14 12:32 /home/jena/.cache/R/rdryad/33892.docx
$ head /home/jena/.cache/R/rdryad/33892.docx
PK���

Here is a screenshot showing some extra characters after the "PK":
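A quick way to flag a truncated download like this one is to check both the size and the leading bytes of the file. This is a heuristic sketch only: plausible_zip is a hypothetical helper, and it relies on two facts about the ZIP format (which .docx uses): a non-empty archive starts with the bytes PK\003\004, and no valid archive is shorter than 22 bytes.

```r
# Hypothetical helper: heuristic check that a file could be a
# complete ZIP-based file (e.g. .docx). It does not validate the archive.
plausible_zip <- function(path) {
  if (!file.exists(path)) return(FALSE)
  # 22 bytes is the size of the smallest possible ZIP file
  if (file.size(path) < 22) return(FALSE)
  sig <- readBin(path, what = "raw", n = 4L)
  identical(sig, as.raw(c(0x50, 0x4b, 0x03, 0x04)))
}

# A 6-byte file that starts with "PK", like the one in this report
truncated <- tempfile(fileext = ".docx")
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00)), truncated)
plausible_zip(truncated)
```

For the 6-byte file above the check returns FALSE, so the download could be retried or reported.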

Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: elementary OS 5.1.7 Hera

Matrix products: default
BLAS/LAPACK: /home/jena/miniconda3/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=cs_CZ.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=cs_CZ.UTF-8        LC_COLLATE=cs_CZ.UTF-8    
 [5] LC_MONETARY=cs_CZ.UTF-8    LC_MESSAGES=cs_CZ.UTF-8   
 [7] LC_PAPER=cs_CZ.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=cs_CZ.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rdryad_1.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5      magrittr_2.0.1  rappdirs_0.3.1  uuid_0.1-4     
 [5] R6_2.5.0        rlang_0.4.8     hoardr_0.5.2    tools_3.6.1    
 [9] htmltools_0.5.0 ellipsis_0.3.1  digest_0.6.27   httpcode_0.3.0 
[13] tibble_3.0.4    lifecycle_0.2.0 crayon_1.3.4    zip_2.1.1      
[17] IRdisplay_0.7.0 repr_1.1.0      base64enc_0.1-3 vctrs_0.3.5    
[21] triebeard_0.3.0 IRkernel_1.1.1  curl_4.3        crul_1.0.0     
[25] evaluate_0.14   mime_0.9        pbdZMQ_0.3-3.1  compiler_3.6.1 
[29] pillar_1.4.7    urltools_1.7.3  jsonlite_1.7.1  pkgconfig_2.0.3

Suggestion: Progress Counter

Just a minor suggestion. It would be nice to have some kind of "progress counter" output on the console. I just downloaded a very large dataset with rdryad::dryad_download() and it took a long time to complete. I started wondering whether I was having connection issues or my R session had broken. A percentage counter would help me check whether the download had stalled.

rdryad::dryad_download() cannot be used in a CRAN package

Hello!
It seems that one cannot use rdryad::dryad_download() in an R package, because rdryad::dryad_download() can only download to ~/Library/Caches/R. I used it in the vignette of my mvSLOUCH R package, and received the following from CRAN:
The CRAN policy only allows writing in file areas via tools::R_user_dir() (which differ by OS). On macOS, ~/Library/Caches/R is not one of those, yet we see there
drwxr-xr-x 4 ripley staff 128 8 Nov 06:54 rdryad/
from
mvSLOUCH via rdryad (which does not create this in its own checks).
This makes it much harder to sweep up after you.

I would suggest adding the possibility to specify a target download destination for rdryad::dryad_download().

Best wishes
Krzysztof Bartoszek

How to get files' ids?

Hi, sorry for the basic question, but I don't know how to get file ids so that I can download individual files from a Dryad dataset.

I tried looking at our published dataset with:

> dryad_dataset("10.5061/dryad.7nt8f")
# truncated output
$`10.5061/dryad.7nt8f`$id
[1] 6817

However, if I try to use that id to get files, it shows a different DOI for this id:

> dryad_files(6817)
# truncated output
$`6817`$`_links`$`stash:dataset`$href
[1] "/api/v2/datasets/doi%3A10.5061%2Fdryad.nf757"

i.e. the returned DOI is 10.5061/dryad.nf757 instead of 10.5061/dryad.7nt8f.

So how do I get:

the proper ids for my dataset, to be used in functions like dryad_files?
a link to a particular file (e.g. Appendix S2.txt in the DOI link above)?
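The mismatch described above can be checked in code by decoding the DOI embedded in the returned href and comparing it with the DOI that was requested. A minimal sketch: doi_from_href is a hypothetical helper (not part of rdryad) and expects a single href string of the form shown in the output.

```r
# Hypothetical helper: recover the DOI from a "stash:dataset" href.
# Takes one string at a time.
doi_from_href <- function(href) {
  encoded <- sub("^.*/datasets/", "", href)
  decoded <- utils::URLdecode(encoded)
  sub("^doi:", "", decoded)
}

href <- "/api/v2/datasets/doi%3A10.5061%2Fdryad.nf757"
returned_doi <- doi_from_href(href)
returned_doi
# "10.5061/dryad.nf757"

# Does the id point back to the dataset that was asked for?
same_dataset <- identical(returned_doi, "10.5061/dryad.7nt8f")
```

Here same_dataset is FALSE, which confirms that id 6817 is not a valid argument for dryad_files in this case.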

Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: elementary OS 5.1.7 Hera

Matrix products: default
BLAS/LAPACK: /home/jena/miniconda3/lib/libopenblasp-r0.3.12.so

locale:
 [1] LC_CTYPE=cs_CZ.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=cs_CZ.UTF-8        LC_COLLATE=cs_CZ.UTF-8    
 [5] LC_MONETARY=cs_CZ.UTF-8    LC_MESSAGES=cs_CZ.UTF-8   
 [7] LC_PAPER=cs_CZ.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=cs_CZ.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rdryad_1.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5      magrittr_2.0.1  rappdirs_0.3.1  uuid_0.1-4     
 [5] R6_2.5.0        rlang_0.4.8     hoardr_0.5.2    tools_3.6.1    
 [9] htmltools_0.5.0 ellipsis_0.3.1  digest_0.6.27   httpcode_0.3.0 
[13] tibble_3.0.4    lifecycle_0.2.0 crayon_1.3.4    zip_2.1.1      
[17] IRdisplay_0.7.0 repr_1.1.0      base64enc_0.1-3 vctrs_0.3.5    
[21] triebeard_0.3.0 IRkernel_1.1.1  curl_4.3        crul_1.0.0     
[25] evaluate_0.14   mime_0.9        pbdZMQ_0.3-3.1  compiler_3.6.1 
[29] pillar_1.4.7    urltools_1.7.3  jsonlite_1.7.1  pkgconfig_2.0.3

really hard to install rdryad package in ubuntu 12.04

This is because it depends on libcurl and many other packages compiled from other languages, which install.packages() will not download automatically. It would be better to document this, as the RCurl package does in its system requirements (http://cran.r-project.org/web/packages/RCurl/index.html), like the following:

"libcurl (version 7.14.0 or higher) http://curl.haxx.se. On Linux systems, you will often have to explicitly install libcurl-devel to have the header files and the libcurl library."

Switch to new solrium setup

Ryan at Dryad says

http://wiki.datadryad.org/External_Metadata_Use#ROpenSci

Appearance:
This is a rich command-line tool. It is possible to extract any metadata available for any data package on the site that is available through OAI-PMH.

Potential Problems:
When querying for total data packages and total data files, the numbers that are returned are substantially higher than what appears on the homepage for datadryad.org. Ryan guesses this is because the number from R includes datasets that have been deleted, and R is failing to account for those that have been tagged deleted.

Recommendations:
We should communicate the above problem to the developers of ROpenSci.

Update to new oai::id_entify fxn for oai CRAN push

new Dryad API

https://datadryad.org/api/v2/docs/

Not able to download files with r dryad

Hi!

I am running the commands from your examples and they don't seem to work. For instance, when running the following command that you offer as an example:

x <- dryad_files('10.5061/dryad.1758')
x
[1] "http://datadryad.org"

Instead of showing:

dryad_files('10.5061/dryad.1758')
#> [1] "http://datadryad.org/bitstream/handle/10255/dryad.1759/dataset.csv?sequence=1"
#> [2] "http://datadryad.org/bitstream/handle/10255/dryad.1759/README.txt?sequence=2"

Thanks!

fxns to get doi from handle and vice versa

doi2handle and handle2doi

Update to solrium from solr

Arg for specifying target directory for files retrieved by dryad_download?

Maybe I am missing it, but I suggest adding an argument to dryad_download() that allows the user to specify the cache/file storage directory. For example,

mydir <- tempdir()
rdryad::dryad_download("10.5061/dryad.1nm650h", path=mydir)
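Until such an argument exists, one workaround is to copy the cached files into a directory of your choice after downloading. The sketch below uses only base R; copy_to_dir is a hypothetical helper, and the commented call assumes that dryad_download() returns the paths of the cached files.

```r
# Hypothetical helper: copy already-downloaded files into `path`
# and return the new locations.
copy_to_dir <- function(files, path) {
  dir.create(path, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(path, basename(files))
  ok <- file.copy(files, dest, overwrite = TRUE)
  if (!all(ok)) {
    stop("could not copy: ", paste(files[!ok], collapse = ", "))
  }
  dest
}

# Not run (needs rdryad and network access; return value is an assumption):
# cached <- unlist(rdryad::dryad_download("10.5061/dryad.1nm650h"))
# local  <- copy_to_dir(cached, file.path(tempdir(), "dryad"))

# Self-contained demonstration with a stand-in file
src <- tempfile(fileext = ".csv")
writeLines("a,b\n1,2", src)
out <- copy_to_dir(src, file.path(tempdir(), "dryad-demo"))
```

Copying does not remove the files from the package cache, so this does not by itself address the CRAN policy on where files are written.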

stringi dependency

When I execute library(rdryad) after installation I receive the following error message:

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
there is no package called ‘stringi’
Error: package or namespace load failed for ‘rdryad’

Works fine after I run install.packages("stringi").

Updating a metadata download

I wrote a new function called updatealldryad_metadata. The idea is that if you already have a downloaded dryadmetadata.csv, you can run update to get newer records and not wait forever. The function will let you overwrite the file or create a new one. If you don't specify a new filename, it just appends date_time to the current filename.

Issue: It seems like there is no way to just get the identifiers and do a diff. This is likely due to the fact that in the getalldryad_metadata function, it downloads everything, then removes records with no metadata. This seems to throw off the diff. I will work on this again over the weekend but if you guys have any quick fixes, that would be great.

Error: Internal Server Error (HTTP 500)

> dryad_download(dois = "10.5061/dryad.f385721n")
Error: Internal Server Error (HTTP 500)

Any reason why the example code is not working?

> devtools::session_info()
─ Session info ───────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31)
 os       macOS Ventura 13.6.1
 system   x86_64, darwin20
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Detroit
 date     2024-02-01
 rstudio  2023.12.1+402 Ocean Storm (desktop)
 pandoc   NA

─ Packages ───────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 cachem        1.0.8   2023-05-01 [1] CRAN (R 4.3.0)
 callr         3.7.3   2022-11-02 [1] CRAN (R 4.3.0)
 cli           3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
 crayon        1.5.2   2022-09-29 [1] CRAN (R 4.3.0)
 crul          1.4.0   2023-05-17 [1] CRAN (R 4.3.0)
 curl          5.2.0   2023-12-08 [1] CRAN (R 4.3.0)
 devtools      2.4.5   2022-10-11 [1] CRAN (R 4.3.0)
 digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.0)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.3.0)
 fansi         1.0.5   2023-10-08 [1] CRAN (R 4.3.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
 fs            1.6.3   2023-07-20 [1] CRAN (R 4.3.0)
 glue          1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
 hoardr        0.5.4   2024-01-23 [1] CRAN (R 4.3.2)
 htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.0)
 htmlwidgets   1.6.4   2023-12-06 [1] CRAN (R 4.3.0)
 httpcode      0.3.0   2020-04-10 [1] CRAN (R 4.3.0)
 httpuv        1.6.14  2024-01-26 [1] CRAN (R 4.3.2)
 jsonlite      1.8.7   2023-06-29 [1] CRAN (R 4.3.0)
 later         1.3.2   2023-12-06 [1] CRAN (R 4.3.0)
 lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
 memoise       2.0.1   2021-11-26 [1] CRAN (R 4.3.0)
 mime          0.12    2021-09-28 [1] CRAN (R 4.3.0)
 miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.3.0)
 pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
 pkgbuild      1.4.2   2023-06-26 [1] CRAN (R 4.3.0)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
 pkgload       1.3.3   2023-09-22 [1] CRAN (R 4.3.0)
 prettyunits   1.2.0   2023-09-24 [1] CRAN (R 4.3.0)
 processx      3.8.2   2023-06-30 [1] CRAN (R 4.3.0)
 profvis       0.3.8   2023-05-02 [1] CRAN (R 4.3.0)
 promises      1.2.1   2023-08-10 [1] CRAN (R 4.3.0)
 ps            1.7.5   2023-04-18 [1] CRAN (R 4.3.0)
 purrr         1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
 rappdirs      0.3.3   2021-01-31 [1] CRAN (R 4.3.0)
 Rcpp          1.0.11  2023-07-06 [1] CRAN (R 4.3.0)
 rdryad      * 1.0.0   2020-06-25 [1] CRAN (R 4.3.0)
 remotes       2.4.2.1 2023-07-18 [1] CRAN (R 4.3.2)
 rlang         1.1.2   2023-11-04 [1] CRAN (R 4.3.0)
 rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
 shiny         1.8.0   2023-11-17 [1] CRAN (R 4.3.0)
 stringi       1.8.3   2023-12-11 [1] CRAN (R 4.3.0)
 stringr       1.5.1   2023-11-14 [1] CRAN (R 4.3.0)
 tibble        3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
 triebeard     0.4.1   2023-03-04 [1] CRAN (R 4.3.0)
 urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.3.0)
 urltools      1.7.3   2019-04-14 [1] CRAN (R 4.3.0)
 usethis       2.2.2   2023-07-06 [1] CRAN (R 4.3.0)
 utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.0)
 vctrs         0.6.4   2023-10-12 [1] CRAN (R 4.3.0)
 xtable        1.8-4   2019-04-21 [1] CRAN (R 4.3.0)
 zip           2.3.1   2024-01-27 [1] CRAN (R 4.3.2)

 [1] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library

──────────────────────────────────────────────────────────────────────

Change dryad_getfile to only download, and not open

Errors in example.R

This works:

dryaddat <- download_url("10255/dryad.1759")
dat <- read.csv(dryaddat)

but then this fails:

dat <- read.csv(dryaddat, ";") # This file happens to be ; delimited instead.
Error in !header : invalid argument type
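The likely cause is that the second positional argument of read.csv is header, not sep, so ";" ends up being treated as the header flag. Naming the argument, or using read.csv2 (which defaults to sep = ";" and dec = ","), avoids this. A small self-contained sketch, using inline text in place of the downloaded file:

```r
# Stand-in for a semicolon-delimited file
csv_text <- "a;b\n1;2\n3;4"

# Name the separator explicitly instead of passing it positionally
dat <- read.csv(text = csv_text, sep = ";")

# read.csv2 uses ";" as its default separator
dat2 <- read.csv2(text = csv_text)
```

With the real download, the equivalent call would be read.csv(dryaddat, sep = ";").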

This fails:

# Get all OAIs
alldryadoais <- get_dryadoais()

Error: could not find function "get_dryadoais"

This fails:

metadat <- download_dryadmetadata("10255/dryad.1759", TRUE)

Error in OAI_PMH_issue_request(baseurl, request) : 
  Received condition 'idDoesNotExist' with diagnostic:
"10255/dryad.1759" is unknown or illegal in this repository

This causes the examples that use metadat to fail.

Archive Package

@mpadge could you please add yourself to DESCRIPTION as I was told you are the new maintainer? 😉

`dryad_fetch`: may need to rethink which URLs to use to fetch files

URLs like http://api.datadryad.org/mn/object/doi:xxx sometimes work and sometimes don't. For example,

http://api.datadryad.org/mn/object/doi:10.5061/dryad.1758/1/bitstream

used to work, but no longer does. You can still get the file from http://datadryad.org/bitstream/handle/10255/dryad.1759/dataset.csv?sequence=1, which I think can be found through the DC metadata on the landing page for the DOI, maybe http://datadryad.org/resource/doi:10.5061/dryad.1758

Solr!

Info at http://wiki.datadryad.org/Data_Access#SOLR_search_access

Basic query: http://datadryad.org/solr/search/select/?q=Galliard
Field-specific query: http://datadryad.org/solr/search/select/?q=dwc.ScientificName:drosophila
Search all text for a string, but limit results to two specified fields: http://datadryad.org/solr/search/select/?q=Galliard&fl=dc.title,dc.contributor.author
Dryad data based on an article DOI: http://datadryad.org/solr/search/select/?q=dc.relation.isreferencedby:10.1038/nature04863&fl=dc.identifier,dc.title_ac
All terms in the dc.subject facet, along with their frequencies:
http://datadryad.org/solr/search/select/?q=location:l2&facet=true&facet.field=dc.subject_filter&facet.minCount=1&facet.limit=5000&fl=nothing
Article DOIs associated with all data published in Dryad over the past 90 days:
http://datadryad.org/solr/search/select/?q=dc.date.available_dt:%5BNOW-90DAY/DAY%20TO%20NOW%5D&fl=dc.relation.isreferencedby&rows=1000000
Data DOIs published in Dryad during January 2011, with results returned in JSON format:
http://datadryad.org/solr/search/select/?q=location:l2+dc.date.available_dt:%5B2011-01-01T00:00:00Z%20TO%202011-01-31T23:59:59Z%5D&fl=dc.identifier&rows=1000000&wt=json
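The query URLs above all follow the same pattern: a base endpoint plus q, fl, and similar parameters. A small sketch of building them in base R follows. solr_url is a hypothetical helper (not one of the d_solr_* functions), and the endpoint is the one quoted above, which may no longer be available.

```r
# Hypothetical helper: assemble a Solr select URL from its parts.
solr_url <- function(q, fl = NULL, wt = NULL,
                     base = "http://datadryad.org/solr/search/select/") {
  params <- list(
    q  = q,
    fl = if (!is.null(fl)) paste(fl, collapse = ","),
    wt = wt
  )
  # Drop parameters that were not supplied
  params <- Filter(Negate(is.null), params)
  # Percent-encode each value, including reserved characters
  enc <- vapply(
    params,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste0(base, "?", paste(names(enc), enc, sep = "=", collapse = "&"))
}

basic <- solr_url("Galliard")
fields <- solr_url("Galliard", fl = c("dc.title", "dc.contributor.author"))
```

Unlike the hand-written URLs, this version percent-encodes characters such as ":" and ",", which a Solr endpoint should decode to the same query.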

There's still not a new API for getting datasets, but they say they're working on it