ropensci / bowerbird
Keep a collection of sparkly data resources
Home Page: https://docs.ropensci.org/bowerbird
License: Other
I am wondering how well a bowerbird setup could work to define a central 'shared data' pool for a lab group. In addition to sharing raw remote downloads, I think it would often be useful to be able to add processed data in the bowerbird catalogue (i.e. this could be either independently collected data or merely 'processed' forms of existing bowerbird downloads). Maybe this is beyond the scope of bowerbird.
The download process gives a correct indication of the number of files ("Downloading file x of y") when using the rget spider-then-download procedure. But when downloading a list of individually specified files, the indicator repeatedly prints "Downloading file 1 of 1" (e.g. in https://github.com/AustralianAntarcticDivision/blueant/blob/master/R/polarview_handler.R). Can this be fixed?
I like the minimal metadata required by bb_source() that we can search from the bb_data_sources() table. For large collections, though, I wonder if it would make sense to support some additional optional fields that users could specify to make it easier to search their collections later, e.g. a keyword field, a file type, etc.?
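For what it's worth, even with the current fields a simple text filter over the sources table gets part of the way there. A minimal sketch (`bb_search` is a hypothetical helper name; it assumes only the `name` and `description` columns of the table returned by bb_data_sources()):

```r
## hypothetical helper: case-insensitive keyword search over the name and
## description columns of a sources table (as returned by bb_data_sources())
bb_search <- function(sources, keyword) {
  hits <- grepl(keyword, sources$name, ignore.case = TRUE) |
    grepl(keyword, sources$description, ignore.case = TRUE)
  sources[hits, ]
}

## usage, given a bowerbird config `cf`:
##   bb_search(bb_data_sources(cf), "sea ice")
```

A dedicated keyword field would obviously be more precise than free-text matching, but this works today.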
Going further -- much ink has been spilled over metadata descriptions for scientific data, and I am curious whether it would be worth cribbing from some of those. E.g. bowerbird could adopt https://schema.org/Dataset or DCAT2 as the basis for its metadata representation. I imagine most fields would still be optional, but this would allow for a bit more expressiveness. Perhaps more relevantly, these fields could be auto-populated when importing data from sources that already expose metadata in these formats (e.g. Zenodo, data.gov, and many others serve schema.org/Dataset metadata descriptions).
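To make the auto-population idea concrete, here is a sketch of mapping a schema.org/Dataset JSON-LD record onto bb_source()-style arguments. The field names (name, description, url, license, distribution/contentUrl) are standard schema.org vocabulary; the mapping itself and the `dataset_to_bb_args` name are just my suggestion, not anything bowerbird provides:

```r
library(jsonlite)

## sketch: turn a schema.org/Dataset JSON-LD record into a list of
## bb_source()-compatible arguments
dataset_to_bb_args <- function(jsonld_txt) {
  d <- jsonlite::fromJSON(jsonld_txt)
  list(
    name        = d$name,
    description = d$description,
    doc_url     = d$url,
    license     = d$license,
    ## distribution entries carry the actual file URLs
    source_url  = d$distribution$contentUrl
  )
}
```

Zenodo and data.gov embed exactly this kind of record in their landing pages, so a generic importer along these lines could cover several repositories at once.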
Hi,
I am trying to create a function like bb_zenodo_source:
Line 23 in 53d7bcb
Unfortunately, the latest data are accessible via a URL of this form:
https://www.data.gouv.fr/fr/datasets/r/e655f973-fb12-4771-9723-b1e4a2b086b2
When I run the sync without the extra parameter force_local_filename, the data is downloaded but does not end up in the root folder. It works great with this parameter.
I tried to create a custom function wrapping bb_handler_rget to use the force_local_filename extra parameter, but with no success.
Any idea how to achieve this?
library(tidyverse)
library(fs)
library(bowerbird)

id_data_gouv <- "5e1f20058b4c414d3f94460d"

bb_data_gouv_source <- function(id) {
  ne_or <- function(z, or) tryCatch(if (!is.null(z) && nzchar(z)) z else or, error = function(e) or)
  jx <- jsonlite::fromJSON(paste0("https://www.data.gouv.fr/api/1/datasets/", id))
  ## collection size in GB
  csize <- tryCatch(as.numeric(format(sum(jx$resources$filesize, na.rm = TRUE) / 1024^3, digits = 1)),
                    error = function(e) NULL)
  doc_url <- jx$page
  files <- paste0(fs::path_ext_remove(jx$resources$title), ".", jx$resources$format)
  bb_source(name = ne_or(jx$title, ne_or(jx$acronym, "Dataset title")),
            id = id,
            description = ne_or(jx$description, "Dataset description"),
            ## keywords = ne_or(jx$metadata$keywords, NA_character_),
            doc_url = doc_url,
            citation = paste0("See ", doc_url, " for the correct citation"), ## seems odd that this isn't part of the record
            license = ne_or(jx$license, paste0("See ", doc_url, " for license information")),
            source_url = jx$resources$latest, ## list all URLs. Does this cover datasets with multiple buckets? (Are there such things?)
            method = list("bb_data_gouv_handler_get", files = files),
            comment = "Source definition created by bb_data_gouv_source",
            postprocess = list(),
            collection_size = csize)
}

bb_data_gouv_handler_get <- function(config, verbose = TRUE, files, ...) {
  cfrow <- bowerbird:::bb_settings_to_cols(config)
  this_flags <- c(list(url = cfrow$source_url),
                  list(force_local_filename = files),
                  list(...),
                  list(verbose = verbose))
  do.call(bb_rget, this_flags)
}

src <- bb_data_gouv_source(id_data_gouv)
cf <- bb_config(local_file_root = "~")
cf <- bb_add(cf, src)
status <- bb_sync(cf, create_root = TRUE, verbose = TRUE)
#>
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/8af06636-9d44-4cfd-92b0-32a7e2059c75
#> --------------------------------------------------------------------------------------------
#>
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)
#>
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/08a78f58-81cc-43f3-b16a-f4d4a6392ab3
#> --------------------------------------------------------------------------------------------
#>
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)
#>
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/ed17ac9b-2025-482b-bda4-f9d667644b8b
#> --------------------------------------------------------------------------------------------
#>
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)
#>
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/e655f973-fb12-4771-9723-b1e4a2b086b2
#> --------------------------------------------------------------------------------------------
#>
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)
#>
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/5c878e99-96a9-48dc-b349-03878ff34392
#> --------------------------------------------------------------------------------------------
#>
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)
#>
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/85d11f2d-f7cd-469b-89e5-b210d2658e4f
#> --------------------------------------------------------------------------------------------
#>
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)
#>
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/d743f361-0376-4d84-bc5b-7e41a5cb86c6
#> --------------------------------------------------------------------------------------------
#>
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)
Created on 2021-10-12 by the reprex package (v2.0.1)
Suspect not, the slow speed is likely due to the workaround for the symlinks/wget issue.
See placeholder section in README/vignette.
Super interesting package. Just wondering if there is any chance of pushing this to CRAN also?
It can be difficult to get permissions to use repositories other than CRAN in some organisations.
I couldn't reconcile the documentation here while trying to set up a custom accept_regex.
Specifically, I don't think this code fragment could work:
mysrc <- mysrc %>%
mutate(method=c(method,list(accept_regex="/2017/")))
This is an example of what worked for me (with a private user/password not included here).
library(bowerbird)
my_directory <- "~/my/data/directory"
cf <- bb_config(local_file_root = my_directory)
my_source <- blueant::sources("CMEMS global gridded SSH near-real-time")
library(dplyr)
my_source <- my_source %>%
  mutate(method = list(c("bb_handler_wget", level = 3,
                         accept_regex = "nrt_global_allsat_phy_l4_2018.*.nc")))
Either something has changed with archive::file_read, or this decompression never worked. The connection coming back from file_read is not a binary one.
https://github.com/ropensci/bowerbird/blob/master/R/rget.R#L221
Reusing the handle here does not work: if the first file in a sequence of downloads already exists and does not need to be downloaded, then any subsequent downloads will act as if their associated files already exist as well. Using a new handle for the time being as a workaround.
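As a standalone illustration of that workaround (this is not the actual rget.R code; only the curl package's public API is assumed), a fresh handle per request avoids state carried over from a skipped transfer:

```r
library(curl)

## download a set of URLs, skipping files that already exist locally;
## constructing a new handle per request avoids the stale state that a
## reused handle can carry over from a skipped or completed transfer
download_all <- function(urls, destfiles) {
  for (i in seq_along(urls)) {
    if (file.exists(destfiles[i])) next
    h <- curl::new_handle(followlocation = TRUE)
    curl::curl_download(urls[i], destfiles[i], handle = h)
  }
  invisible(destfiles)
}
```

The cost is re-negotiating the connection for each file, which is presumably why the handle was shared in the first place.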
I just learned about this cool package! It looks very nice. I just wanted to drop a line to let you know about @weecology's project called the Data Retriever and the corresponding R package rdataretriever. It seems like there could be room for some complementary future development of these data acquisition tools. Anyway, just wanted to say hi and keep up the good work.
Direct FTP access via https://data.nodc.noaa.gov/woa/WOA13/DATAv2/
bb_sync expands the data_sources with one row per source_url, and these are run as separate sync items. Hence the status vector returned by bb_sync doesn't have a 1:1 match to the rows of the config$data_sources object.
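If it helps any downstream code, the per-URL rows can be collapsed back to one result per source with base R. A sketch only: it assumes the sync result has `id` and `status` columns (with `status` logical), which may not match the actual return value exactly.

```r
## collapse the per-source_url rows of a bb_sync() result back to a single
## logical per data source id (column names `id` and `status` assumed)
per_source_ok <- function(status_df) {
  tapply(status_df$status, status_df$id, FUN = all)
}
```

Applied to the bb_sync() return value, this gives one TRUE/FALSE per data source regardless of how many source_urls it expanded into.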
See https://earthdata.nasa.gov/earth-observation-data/near-real-time/rapid-response and earthdata.nasa.gov/about/science-system-description/eosdis-components/global-imagery-browse-services-gibs
Would libwget/wget2 allow us to avoid dependency on wget as an external utility? Unlikely to be trivial but may be worth the effort ...
See e.g. https://gitlab.com/gnuwget/wget2
I don't really have any leads for what might have happened, but I tried this on rstudio.cloud today, and while the sync job downloaded all the NRT files (270Mb) the session crashed for some reason.
From what I can gather the user session has access to 1.5 Gb for files so I don't think the quota was the problem. I can read and interact with the files now, in the reloaded project. I'll try to explore this some more, with a filter to avoid downloading all the NRT files. Possibly it's to do with the console output, and something to do with RStudio rather than bb.
library(blueant)

sources <- "NSIDC SMMR-SSM/I Nasateam near-real-time sea ice concentration"

## define a local file root; this code may be used to identify a
## predictable location for this package *for a given user*
local_file_root <- rappdirs::user_data_dir(appname = "seaice")

## create the local directory if it doesn't exist
if (!dir.exists(local_file_root)) {
  dir.create(local_file_root, recursive = TRUE)
}
## /home/rstudio-user/.local/share/seaice

config <- bb_config(local_file_root = local_file_root)
config <- config %>% bb_add(blueant_sources(sources))
bb_sync(config)
The session info:
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS
Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2 rappdirs_0.3.1 yaml_2.1.16 knitr_1.19
I haven't looked deeply enough to be sure this would work, but just an idea. The situation with archive (r-lib/archive#28) does seem rather tragic, but for bowerbird to be CRAN-compatible, perhaps you could use the extended compression utilities in https://github.com/HenrikBengtsson/R.utils instead?
Just R-devel as an example for now; will see how it goes with the unpack and update before trying any others.
https://gist.github.com/mdsumner/ca54033b55beb96ecb422db8dde02353
One motivation is to search for things to test code with, like specific "+proj=" strings in the mailing lists, to see how much projections get used, etc.
Downloads from NASA's Oceandata system now seem to require user authentication via an Earthdata login. The existing bb_handler_oceandata therefore won't work.
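If the Oceandata endpoints accept an Earthdata login via plain HTTP authentication (not certain; the full Earthdata redirect flow may be required, which this does not handle), the existing user/password arguments of bb_source() would be the natural place to thread the credentials through. A sketch with placeholder values:

```r
library(bowerbird)

## sketch only: assumes basic auth suffices, which may not hold for the
## Earthdata redirect flow; all field values here are placeholders
oceandata_src <- bb_source(
  name        = "Oceandata example",
  id          = "oceandata-example",
  description = "Example Oceandata source with Earthdata credentials",
  doc_url     = "https://oceandata.sci.gsfc.nasa.gov/",
  citation    = "See https://oceandata.sci.gsfc.nasa.gov/",
  license     = "Please cite",
  source_url  = "https://oceandata.sci.gsfc.nasa.gov/",
  user        = "your_earthdata_username",   ## placeholder
  password    = "your_earthdata_password",   ## placeholder
  method      = list("bb_handler_rget", level = 1))
```

If the redirect flow is required, the handler itself would need updating to negotiate the Earthdata token exchange.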
This looks like it will be an important benchmark data set, so a good one for example config:
Hi folks. Very excited about the recursive download techniques in bowerbird but couldn't get this example to work, not sure what I am missing. The example below should be reproducible.
library(bowerbird)

modis <- bb_source(
  name = "MODIS MOD14 C6 L3 Fire Product",
  id = "c6-mcd14ml",
  description = "MODIS Monthly Fire Product",
  doc_url = "https://lpdaac.usgs.gov/documents/88/MOD14_User_Guide_v6.pdf/",
  citation = "https://doi.org/10.1029/2005JG000142",
  source_url = c("ftp://fuoco.geog.umd.edu/modis/C6/mcd14ml/"),
  license = "CC-BY",
  user = "fire",
  password = "burnt",
  method = list("bb_handler_rget", level = 1, accept_download = "\\.gz$"),
  collection_size = NA)

my_directory <- rappdirs::user_data_dir("bowerbird")
dir.create(my_directory, showWarnings = FALSE, recursive = TRUE)
cf <- bb_config(local_file_root = my_directory)
cf <- bb_add(cf, modis)
status <- bb_sync(cf, verbose = TRUE)
On the last sync command, I merely get a connection timeout error.
The following works fine for me:
wget --user=fire --password=burnt -r -np -R "index.html*" ftp://fuoco.geog.umd.edu/modis/C6/mcd14ml/
I'm having this bomb out:
downloading file 2646 of 15503: https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198811/oisst-avhrr-v02r01.19881128.nc ...
bb_rget exited with an error (OpenSSL SSL_connect: Connection reset by peer in connection to www.ncei.noaa.gov:443 )
I tried it again, to see if it was similar, but it bombed out earlier:
file unchanged on server, not downloading.
downloading file 319 of 15504: https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198207/oisst-avhrr-v02r01.19820716.nc ...
| | 0%
file unchanged on server, not downloading.
downloading file 320 of 15504: https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198207/oisst-avhrr-v02r01.19820717.nc ...
bb_rget exited with an error (Timeout was reached: [www.ncei.noaa.gov] SSL connection timeout)
Tue Feb 13 04:18:50 2024 dataset synchronization complete: NOAA OI 1/4 Degree Daily SST AVHRR
>
>
>
> proc.time()
user system elapsed
71.713 2.780 1075.225
trying again with 'wait = 10'
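In the meantime, a crude retry wrapper works around these transient SSL drops. A sketch only: `sync_with_retries` is a made-up name, and it assumes bb_sync() returns a tibble with a logical `status` column indicating per-item success.

```r
## retry a sync a few times for a flaky server, pausing between attempts;
## already-downloaded files are skipped by bb_sync on each retry
sync_with_retries <- function(cf, max_tries = 3, wait = 10) {
  for (attempt in seq_len(max_tries)) {
    status <- bowerbird::bb_sync(cf, verbose = TRUE)
    if (all(status$status)) return(status)  ## everything succeeded, done
    message("sync incomplete, waiting ", wait, "s before attempt ", attempt + 1)
    Sys.sleep(wait)
  }
  status
}
```

Because the sync resumes where it left off, each retry only has to cover the files the previous attempt missed.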
Some links and notes to ensure we have a good overview of available resources for this:
https://github.com/NCEAS/recordr
https://github.com/o2r-project/containerit - discussed at useR! 2017 and will have video
https://github.com/nuest/docker-reproducible-research - a course on Docker reproducibility
I have these sources for the GADM data behind raster::getData:
b <- "http://biogeo.ucdavis.edu/data/gadm2.8/rds/%s_adm0.rds"
gadm0 <- sprintf(b, raster::getData("ISO3")$ISO3)
gadm.rds <- bb_source(
  name = "GADM maps and data in RDS format",
  id = "gadm-maps-rdb",
  description = "GADM provides maps and spatial data for all countries and their sub-divisions.",
  doc_url = "http://www.gadm.org",
  citation = "http://gadm.org/about.html",
  source_url = gadm0,
  license = "http://gadm.org/license.html",
  method = list("bb_handler_wget", level = 1, robots_off = TRUE),
  collection_size = 0.1,
  access_function = "base::readRDS",
  data_group = "Administrative")

gadm <- bb_source(
  name = "GADM maps and data in ESRI Geodatabase",
  id = "gadm-maps-gdb",
  description = "GADM provides maps and spatial data for all countries and their sub-divisions.",
  doc_url = "http://www.gadm.org",
  citation = "http://gadm.org/about.html",
  source_url = c("http://biogeo.ucdavis.edu/data/gadm2.8/gadm28.gdb.zip"),
  license = "http://gadm.org/license.html",
  method = list("bb_handler_wget", recursive = TRUE, level = 1, robots_off = TRUE),
  postprocess = list("bb_unzip"),
  collection_size = 1,
  access_function = "sf::read_sf",
  data_group = "Administrative")
I can't get it to access any data from an actual directory URL, and so the gdb.zip is hardcoded and I construct the full (!!) list of URLs available for the level 0 for each ISO3 country from the raster package list.
Obviously this is not robust to version updates, and is not adaptable to varying levels in the RDS (apparently some are higher than 3). Are there wget tricks to make this work more generally?
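One sledgehammer option, staying with the URL-construction approach: build the cross-product of countries and admin levels up front and let the handler skip the combinations that don't exist (they will 404). A sketch using the same gadm2.8 path pattern as above:

```r
## build the full URL list across several admin levels, not just level 0
b <- "http://biogeo.ucdavis.edu/data/gadm2.8/rds/%s_adm%d.rds"
iso3 <- c("AUS", "FRA", "MDG")  ## in practice: raster::getData("ISO3")$ISO3
levels <- 0:3                   ## not every country has every level
gadm_urls <- as.vector(outer(iso3, levels, function(i, l) sprintf(b, i, l)))
```

Whether bb_handler_wget tolerates the inevitable 404s gracefully (logging them without failing the whole sync) would need checking.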
Just a few notes, mostly minor.
- vignette("bowerbird") is mostly present in the readme; should the readme be reduced? (How are they kept in sync?)
- "Sea Ice Trends and Climatologies from SMMR and SSM/I-SSMIS, Version 2" is not in the help for bb_example_sources, but Nimbus is (and is not in the example object).
- bb_sync, bb_source, and bb_config in each other's "see also".
- bb_source: typo "providedso".
- bb_source: value is "tibble"; consider "data frame", and state that it will have columns as per this function's arguments (excluding warn_empty_auth).
- bb_data_sources, bb_example_sources: value is "tibble"; maybe link to the bb_source value.
- Consider a dontrun example for bb_sync; I think it's a nice place to repeat the overall pattern with an example that can be copy/pasted in whole:
td <- tempdir() ## strictly, the user should specify this explicitly for control over the file system
cf <- bb_config(local_file_root = td)

## Bowerbird must then be told which data sources to synchronize.
## Let's use data from the Australian 2016 federal election, which is
## provided as one of the example data sources:
my_source <- subset(bb_example_sources(), id == "aus-election-house-2016")

## add this data source to the configuration
cf <- bb_add(cf, my_source)

## Once the configuration has been defined and the data source added to it,
## run the sync process:
status <- bb_sync(cf)

## see the fruits of our labour (files are CSVs in this data source)
head(list.files(td, recursive = TRUE, pattern = "csv$"))

## we can run this at any later time and our repository will update if the
## source has changed
status2 <- bb_sync(cf)
Suggest adding a minimal R version, say 3.3.1. Sara just tried installing on 3.0.0, and this version check would help guide a new starter.
Dear all,
I have a question regarding satellite data resolution. I need SST, surface salinity, chlor_a, and sea surface height for a region around Madagascar. The best resolutions I could find are:
Is there any chance you know of another website with better resolutions for all of the above?
Thank you
Camille
httr v1.4.0 breaks rget with FTP (to be fair, httr was never meant to support FTP, so no blame there!)