
bowerbird's Introduction

rOpenSci

Project Status: Abandoned

This repository has been archived. The former README is now in README-NOT.md.

bowerbird's People

Contributors

cboettig, jeroen, maelle, mdsumner, milesmcbain, raymondben


bowerbird's Issues

Resolution available for satellite data

Dear all,

I have a question regarding satellite data resolution.
I need SST, surface salinity, chlor_a, and sea surface height for a region around Madagascar.
The best resolutions I could find are:

  • SST: 1 km (OceanColor)
  • Salinity: 9 km (Copernicus)
  • Chlor_a: 4 km (OceanColor)
  • SSH: 9 km (Copernicus)

Do you know of any other websites with better resolutions for any of the above?

Thank you

Camille

Status vector returned from sync needs thought

bb_sync expands the data_sources with one row per source_url, and these are run as separate sync items. Hence the status vector returned by bb_sync doesn't have a 1:1 match to the rows of the config$data_sources object.
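
A minimal sketch of the mismatch (the source definition and URLs here are placeholders, not a real data source):

library(bowerbird)
## hypothetical source with two source_url entries
src <- bb_source(
  name = "Example dataset",
  id = "example-two-urls",
  description = "Illustration only",
  doc_url = "http://example.com/",
  citation = "See http://example.com/",
  license = "CC-BY",
  source_url = c("http://example.com/a/", "http://example.com/b/"),
  method = list("bb_handler_rget", level = 1))
cf <- bb_add(bb_config(local_file_root = tempdir()), src)

## the config holds one data source row ...
nrow(bb_data_sources(cf))  ## 1
## ... but the sync runs two items (one per source_url), so the returned
## status cannot be matched back to the config rows by position
status <- bb_sync(cf)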

CRAN

Super interesting package. Just wondering: is there any chance of publishing this on CRAN as well?
It can be difficult to get permission to use repositories other than CRAN in some organisations.

possible issue, found on RStudio cloud

I don't really have any leads for what might have happened, but I tried this on rstudio.cloud today, and while the sync job downloaded all the NRT files (270 MB), the session crashed for some reason.

From what I can gather the user session has access to 1.5 GB for files, so I don't think the quota was the problem. I can read and interact with the files now in the reloaded project. I'll try to explore this some more, with a filter to avoid downloading all the NRT files. Possibly it's to do with the console output, and something to do with RStudio rather than bowerbird.

library(blueant)
sources <- "NSIDC SMMR-SSM/I Nasateam near-real-time sea ice concentration"

## define a local file root; this code may be used to identify
## a predictable location for this package *for a given user*
local_file_root <- rappdirs::user_data_dir(appname = "seaice")

## create the local directory if it doesn't exist
if (!dir.exists(local_file_root)) {
  dir.create(local_file_root, recursive = TRUE)
}
## /home/rstudio-user/.local/share/seaice
config <- bb_config(local_file_root = local_file_root)
config <- config %>% bb_add(blueant_sources(sources))
bb_sync(config)

The session info:

R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2    rappdirs_0.3.1 yaml_2.1.16    knitr_1.19 

how to add a data source whose source_url has no filename in the URL

Hi,
I am trying to create a function modelled on bb_zenodo_source, to download data from the French data.gouv.fr portal.

Unfortunately the latest data are accessible via URLs of this form, with no filename component:
https://www.data.gouv.fr/fr/datasets/r/e655f973-fb12-4771-9723-b1e4a2b086b2

When I run the sync without the extra force_local_filename parameter, the data are downloaded but do not end up in the root folder; it works great with this parameter. I have tried to write a custom function wrapping bb_handler_rget so as to use the force_local_filename extra parameter, but with no success.

Any idea how to achieve this?

library(tidyverse)
library(fs)
library(bowerbird)

id_data_gouv <- "5e1f20058b4c414d3f94460d"


bb_data_gouv_source <- function(id) {
  ne_or <- function(z, or) tryCatch(if (!is.null(z) && nzchar(z)) z else or, error = function(e) or)
  jx <- jsonlite::fromJSON(paste0("https://www.data.gouv.fr/api/1/datasets/", id))
  ## collection size (GB)
  csize <- tryCatch(as.numeric(format(sum(jx$resources$filesize, na.rm = TRUE)/1024^3, digits = 1)),
                    error = function(e) NULL)
  doc_url <- jx$page
  files <- paste0(fs::path_ext_remove(jx$resources$title), ".", jx$resources$format)
  bb_source(name = ne_or(jx$title, ne_or(jx$acronym, "Dataset title")),
            id = id,
            description = ne_or(jx$description, "Dataset description"),
            ##keywords = ne_or(jx$metadata$keywords, NA_character_),
            doc_url = doc_url,
            citation = paste0("See ", doc_url, " for the correct citation"), ## seems odd that this isn't part of the record
            license = ne_or(jx$license, paste0("See ", doc_url, " for license information")),
            source_url = jx$resources$latest, ## list all urls. Does this cover datasets with multiple buckets? (Are there such things?)
            method = list("bb_data_gouv_handler_get", files = files),
            comment = "Source definition created by bb_data_gouv_source",
            postprocess = list(),
            collection_size = csize)
}

bb_data_gouv_handler_get <- function(config, verbose = TRUE, files, ...) {
  cfrow <- bowerbird:::bb_settings_to_cols(config)
  this_flags <- c(list(url = cfrow$source_url, force_local_filename = files),
                  list(...), list(verbose = verbose))
  do.call(bb_rget, this_flags)
}


src <- bb_data_gouv_source(id_data_gouv)
cf <- bb_config(local_file_root = "~")
cf <- bb_add(cf, src )
status <- bb_sync(cf, create_root = TRUE, verbose = TRUE)
#> 
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/8af06636-9d44-4cfd-92b0-32a7e2059c75
#> --------------------------------------------------------------------------------------------
#> 
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)

(The same error, French for "unused argument (local_dir_only = TRUE)", repeats for each of the six remaining resource URLs of the dataset.)

Created on 2021-10-12 by the reprex package (v2.0.1)
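
For what it's worth, the error message itself hints at the problem: bb_sync also calls the handler with local_dir_only = TRUE (apparently to ask where the files will be stored locally), and the custom handler doesn't accept that argument. A minimal sketch of a handler that does, assuming (this is an assumption, not confirmed behaviour) that the handler should return the local directory when local_dir_only is TRUE:

bb_data_gouv_handler_get <- function(config, verbose = TRUE, local_dir_only = FALSE, files, ...) {
  cfrow <- bowerbird:::bb_settings_to_cols(config)
  if (isTRUE(local_dir_only)) {
    ## return the directory the files will be saved under
    ## (here derived from the source URL, mirroring bb_rget's default layout)
    return(sub("^https?://", "", cfrow$source_url))
  }
  this_flags <- c(list(url = cfrow$source_url, force_local_filename = files),
                  list(...), list(verbose = verbose))
  do.call(bb_rget, this_flags)
}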

getting OISST

I'm having this bomb out:

 downloading file 2646 of 15503: https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198811/oisst-avhrr-v02r01.19881128.nc ...
 bb_rget exited with an error (OpenSSL SSL_connect: Connection reset by peer in connection to www.ncei.noaa.gov:443 )

I tried it again, to see if it was similar but it bombed earlier:

 file unchanged on server, not downloading.
 downloading file 319 of 15504: https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198207/oisst-avhrr-v02r01.19820716.nc ...
  |                                                                      |   0%
 file unchanged on server, not downloading.
 downloading file 320 of 15504: https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198207/oisst-avhrr-v02r01.19820717.nc ...
 bb_rget exited with an error (Timeout was reached: [www.ncei.noaa.gov] SSL connection timeout)

Tue Feb 13 04:18:50 2024 dataset synchronization complete: NOAA OI 1/4 Degree Daily SST AVHRR
>
>
>
> proc.time()
    user   system  elapsed
  71.713    2.780 1075.225

trying again with 'wait = 10'
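
For reference, one way to pass that through (a sketch only; assumptions: the blueant source name matches the dataset name in the log above, and wait is forwarded to bb_rget as an extra method flag):

src <- blueant::sources("NOAA OI 1/4 Degree Daily SST AVHRR")
## append wait = 10 to the handler call stored in the method list-column,
## so the downloader pauses between successive requests
src$method[[1]] <- c(src$method[[1]], list(wait = 10))
cf <- bb_add(bb_config(local_file_root = "~/data"), src)
status <- bb_sync(cf)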

Support for 'local' or unpublished data?

I am wondering how well a bowerbird setup could work to define a central 'shared data' pool for a lab group. In addition to sharing raw remote downloads, I think it would often be useful to be able to add processed data in the bowerbird catalogue (i.e. this could be either independently collected data or merely 'processed' forms of existing bowerbird downloads). Maybe this is beyond the scope of bowerbird.

Mike's post-review notes

Just a few notes, mostly minor.

vignette("bowerbird") is mostly present in the readme; should the readme be reduced? (How are they kept in sync?)

"Sea Ice Trends and Climatologies from SMMR and SSM/I-SSMIS, Version 2" is not in the help for bb_example_sources, but Nimbus is (and is not in the example object)

seealso

  • put bb_sync, bb_source, and bb_config in each other's see also

spelling consistency

  • synchronize/synchronisation (-ize is consistent elsewhere)
  • typo in bb_source "providedso"

value

  • bb_source value is "tibble", consider "data frame" and state that it will have columns as per this function's arguments (excluding warn_empty_auth)
  • bb_data_sources, bb_example_sources - value is "tibble", maybe link to bb_source value
  • other functions also only list "tibble"

examples

Consider a dontrun for bb_sync; I think it's a nice place to repeat the overall pattern with an example that can be copy/pasted in whole:

td <- tempdir()  ## strictly, the user should specify this explicitly for control over the file system
cf <- bb_config(local_file_root = td)

## Bowerbird must then be told which data sources to synchronize.
## Let's use data from the Australian 2016 federal election, which is provided
## as one of the example data sources:
my_source <- subset(bb_example_sources(), id == "aus-election-house-2016")

## add this data source to the configuration
cf <- bb_add(cf, my_source)

## Once the configuration has been defined and the data source added to it, run the sync process:
status <- bb_sync(cf)

## see the fruits of our labour (files are CSVs in this data source)
head(list.files(td, recursive = TRUE, pattern = "csv$"))

## we can run this at any later time and our repository will update if the source has changed
status2 <- bb_sync(cf)

data retriever and bowerbird

I just learned about this cool package! It looks very nice. I just wanted to drop a line to let you know about @weecology's Data Retriever project and the corresponding R package rdataretriever. It seems like there could be room for future complementary development of these data-acquisition tools. Anyway, just wanted to say hi and keep up the good work.

R depends version

Suggest adding a minimum R version, say 3.3.1. Sara just tried installing on 3.0.0, and this version check would help guide a new starter.
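
In DESCRIPTION this would be a one-line change:

Depends: R (>= 3.3.1)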

test Z decompression

Either something has changed with archive::file_read, or this decompression never worked. The connection coming back from file_read is not a binary one.
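
A quick way to check (the file name here is a placeholder; archive::file_read defaults to text mode "r", so binary mode may need to be requested explicitly):

library(archive)
con <- file_read("some_data.nc.Z", mode = "rb")  ## request a binary connection
summary(con)  ## 'mode' should report "rb"; a text-mode connection would explain the failure
readBin(con, "raw", n = 4)  ## first bytes of the decompressed data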

structure of list-col for method

I couldn't reconcile the documentation here while trying to set up a custom accept_regex.

https://github.com/AustralianAntarcticDivision/bowerbird/blame/master/vignettes/bowerbird.Rmd#L432-L434

Specifically, I don't think this code fragment could work:

mysrc <- mysrc %>%
  mutate(method=c(method,list(accept_regex="/2017/")))

This is an example of what worked for me (with a private user/password not included here).

library(bowerbird)
library(dplyr)
my_directory <- "~/my/data/directory"
cf <- bb_config(local_file_root = my_directory)
my_source <- blueant::sources("CMEMS global gridded SSH near-real-time")
my_source <- my_source %>%
  mutate(method = list(c("bb_handler_wget", level = 3,
                         accept_regex = "nrt_global_allsat_phy_l4_2018.*.nc")))

Consider an extensible metadata model?

I like the minimal metadata required by bb_source() that we can search from the bb_data_sources() table. For large collections though, I wonder if it would make sense to support some additional optional fields that users could specify to make it easier to search their collections later, e.g. a keyword field, or file type, etc?

Going further -- much ink has been spilled over metadata descriptions for scientific data, and I am curious whether it would be worth cribbing from some of those. E.g. bowerbird could adopt https://schema.org/Dataset or DCAT2 as the basis for its metadata representation. I imagine most fields would still be optional, but this would allow for a bit greater expressiveness. Perhaps more relevantly, these fields could be auto-populated when importing data from sources that already expose metadata in these formats (e.g. Zenodo, data.gov, and many others serve schema.org/Dataset metadata descriptions).
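
Purely illustrative: auto-populating bb_source() fields from a schema.org/Dataset record (the URL is a placeholder, and the field mapping is an assumption about how such an importer might look):

meta <- jsonlite::fromJSON("https://example.org/dataset.jsonld")
src <- bb_source(
  name        = meta$name,
  id          = meta$identifier,
  description = meta$description,
  doc_url     = meta$url,
  citation    = meta$citation,
  license     = meta$license,
  source_url  = meta$distribution$contentUrl,
  method      = list("bb_handler_rget", level = 1))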

Unexpected timeout error in syncing from password-protected ftp server?

Hi folks. Very excited about the recursive download techniques in bowerbird, but I couldn't get the example below to work and am not sure what I am missing. It should be reproducible.

library(bowerbird)
modis <- bb_source(
  name = "MODIS MOD14 C6 L3 Fire Product",
  id = "c6-mcd14ml",
  description = "MODIS Monthly Fire Product",
  doc_url = "https://lpdaac.usgs.gov/documents/88/MOD14_User_Guide_v6.pdf/",
  citation = "https://doi.org/10.1029/2005JG000142",
  source_url = c("ftp://fuoco.geog.umd.edu/modis/C6/mcd14ml/"),
  license = "CC-BY",
  user = "fire",
  password = "burnt",
  method = list("bb_handler_rget", level = 1, accept_download = "\\.gz$"),
  collection_size = NA)

my_directory <- rappdirs::user_data_dir("bowerbird")
dir.create(my_directory, showWarnings = FALSE, recursive = TRUE)
cf <- bb_config(local_file_root = my_directory)
cf <- bb_add(cf, modis)
status <- bb_sync(cf, verbose = TRUE)

On the last sync command, I merely get a connection timeout error.

The following works fine for me:

wget --user=fire --password=burnt -r -np -R "index.html*" ftp://fuoco.geog.umd.edu/modis/C6/mcd14ml/

Oceandata downloader broken

Downloads from NASA's Oceandata system now seem to require user authentication via an Earthdata login. The existing bb_handler_oceandata therefore won't work.

bb and GADM

I have these sources for the GADM data behind raster::getData:

b <- "http://biogeo.ucdavis.edu/data/gadm2.8/rds/%s_adm0.rds"
gadm0 <- sprintf(b, raster::getData("ISO3")$ISO3)

gadm.rds <- bb_source(
  name = "GADM maps and data in RDS format",
  id = "gadm-maps-rds",
  description = "GADM provides maps and spatial data for all countries and their sub-divisions.",
  doc_url = "http://www.gadm.org",
  citation = "http://gadm.org/about.html",
  source_url = gadm0,
  license = "http://gadm.org/license.html",
  method = list("bb_handler_wget", level = 1, robots_off = TRUE),
  collection_size = 0.1,
  access_function = "base::readRDS",
  data_group = "Administrative")

gadm <- bb_source(
  name = "GADM maps and data in ESRI Geodatabase",
  id = "gadm-maps-gdb",
  description = "GADM provides maps and spatial data for all countries and their sub-divisions.",
  doc_url = "http://www.gadm.org",
  citation = "http://gadm.org/about.html",
  source_url = c("http://biogeo.ucdavis.edu/data/gadm2.8/gadm28.gdb.zip"),
  license = "http://gadm.org/license.html",
  method = list("bb_handler_wget", recursive = TRUE, level = 1, robots_off = TRUE),
  postprocess = list("bb_unzip"),
  collection_size = 1,
  access_function = "sf::read_sf",
  data_group = "Administrative")

I can't get it to access any data from an actual directory URL, so the gdb.zip is hardcoded and I construct the full (!!) list of URLs available for level 0 for each ISO3 country from the raster package list.

Obviously this is not robust to version updates, and it is not adaptable to varying levels in the RDS files (apparently some are higher than 3). Are there wget tricks to make this work more generally?
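
If the server exposes a directory listing at the rds/ URL (which it may not, and that could be exactly why the directory URL fails), one approach would be to recurse from the directory and filter with accept_regex rather than enumerating URLs up front. A sketch only, untested against this server:

gadm.rds <- bb_source(
  name = "GADM maps and data in RDS format",
  id = "gadm-maps-rds",
  description = "GADM provides maps and spatial data for all countries and their sub-divisions.",
  doc_url = "http://www.gadm.org",
  citation = "http://gadm.org/about.html",
  source_url = "http://biogeo.ucdavis.edu/data/gadm2.8/rds/",
  license = "http://gadm.org/license.html",
  method = list("bb_handler_wget", recursive = TRUE, level = 1, robots_off = TRUE,
                accept_regex = "_adm0\\.rds$"),
  collection_size = 0.1,
  access_function = "base::readRDS",
  data_group = "Administrative")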
