
bowerbird's Introduction

rOpenSci

Project Status: Abandoned

This repository has been archived. The former README is now in README-NOT.md.

bowerbird's People

Contributors

cboettig, jeroen, maelle, mdsumner, milesmcbain, raymondben


bowerbird's Issues

Resolution available for satellite data

Dear all,

I have a question regarding satellite data resolution.
I need SST, surface salinity, chlor_a, and sea surface height for a region around Madagascar.
The best resolutions I could find are:

  • SST: 1 km (OceanColor)
  • Salinity: 9 km (Copernicus)
  • Chlor_a: 4 km (OceanColor)
  • SSH: 9 km (Copernicus)

Do you know of any other websites with better resolutions for any of the above?

Thank you

Camille

Status vector returned from sync needs thought

bb_sync expands the data_sources with one row per source_url, and these are run as separate sync items. Hence the status vector returned by bb_sync doesn't have a 1:1 match to the rows of the config$data_sources object.
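
A minimal sketch of the mismatch (the source definition and URLs here are placeholders, not a real data source):

library(bowerbird)
## hypothetical source with two source_url entries
src <- bb_source(
  name = "Example dataset",
  id = "example-two-urls",
  description = "Illustration only",
  doc_url = "http://example.com/",
  citation = "See http://example.com/",
  license = "CC-BY",
  source_url = c("http://example.com/a/", "http://example.com/b/"),
  method = list("bb_handler_rget", level = 1))
cf <- bb_add(bb_config(local_file_root = tempdir()), src)

## the config holds one data source row ...
nrow(bb_data_sources(cf))  ## 1
## ... but the sync runs two items (one per source_url), so the returned
## status cannot be matched back to the config rows by position
status <- bb_sync(cf)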

CRAN

Super interesting package. Just wondering: is there any chance of publishing this on CRAN as well?
It can be difficult to get permission to use repositories other than CRAN in some organisations.

possible issue, found on RStudio cloud

I don't really have any leads for what might have happened, but I tried this on rstudio.cloud today, and while the sync job downloaded all the NRT files (270 MB), the session crashed for some reason.

From what I can gather the user session has access to 1.5 GB for files, so I don't think the quota was the problem. I can read and interact with the files now in the reloaded project. I'll try to explore this some more, with a filter to avoid downloading all the NRT files. Possibly it's to do with the console output, and something to do with RStudio rather than bowerbird.

library(blueant)
sources <- "NSIDC SMMR-SSM/I Nasateam near-real-time sea ice concentration"

## define a local file root; this code may be used to identify
## a predictable location for this package *for a given user*
local_file_root <- rappdirs::user_data_dir(appname = "seaice")

## create the local directory if it doesn't exist
if (!dir.exists(local_file_root)) {
  dir.create(local_file_root, recursive = TRUE)
}
## /home/rstudio-user/.local/share/seaice
config <- bb_config(local_file_root = local_file_root)
config <- config %>% bb_add(blueant_sources(sources))
bb_sync(config)

The session info:

R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2    rappdirs_0.3.1 yaml_2.1.16    knitr_1.19 

how to add a data source whose source_url has no filename in the URL

Hi,
I am trying to create a function modelled on bb_zenodo_source, to download data from the French data.gouv.fr portal.

Unfortunately the latest data are accessible via URLs of this form, with no filename component:
https://www.data.gouv.fr/fr/datasets/r/e655f973-fb12-4771-9723-b1e4a2b086b2

When I run the sync without the extra force_local_filename parameter, the data are downloaded but do not end up in the root folder; it works great with this parameter. I have tried to write a custom function wrapping bb_handler_rget so as to use the force_local_filename extra parameter, but with no success.

Any idea how to achieve this?

library(tidyverse)
library(fs)
library(bowerbird)

id_data_gouv <- "5e1f20058b4c414d3f94460d"


bb_data_gouv_source <- function(id) {
  ne_or <- function(z, or) tryCatch(if (!is.null(z) && nzchar(z)) z else or, error = function(e) or)
  jx <- jsonlite::fromJSON(paste0("https://www.data.gouv.fr/api/1/datasets/", id))
  ## collection size (GB)
  csize <- tryCatch(as.numeric(format(sum(jx$resources$filesize, na.rm = TRUE)/1024^3, digits = 1)),
                    error = function(e) NULL)
  doc_url <- jx$page
  files <- paste0(fs::path_ext_remove(jx$resources$title), ".", jx$resources$format)
  bb_source(name = ne_or(jx$title, ne_or(jx$acronym, "Dataset title")),
            id = id,
            description = ne_or(jx$description, "Dataset description"),
            ##keywords = ne_or(jx$metadata$keywords, NA_character_),
            doc_url = doc_url,
            citation = paste0("See ", doc_url, " for the correct citation"), ## seems odd that this isn't part of the record
            license = ne_or(jx$license, paste0("See ", doc_url, " for license information")),
            source_url = jx$resources$latest, ## list all urls. Does this cover datasets with multiple buckets? (Are there such things?)
            method = list("bb_data_gouv_handler_get", files = files),
            comment = "Source definition created by bb_data_gouv_source",
            postprocess = list(),
            collection_size = csize)
}

bb_data_gouv_handler_get <- function(config, verbose = TRUE, files, ...) {
  cfrow <- bowerbird:::bb_settings_to_cols(config)
  this_flags <- c(list(url = cfrow$source_url, force_local_filename = files),
                  list(...), list(verbose = verbose))
  do.call(bb_rget, this_flags)
}


src <- bb_data_gouv_source(id_data_gouv)
cf <- bb_config(local_file_root = "~")
cf <- bb_add(cf, src )
status <- bb_sync(cf, create_root = TRUE, verbose = TRUE)
#> 
#> Tue Oct 12 16:14:28 2021
#> Synchronizing dataset: Base nationale sur les intercommunalités
#> Source URL https://www.data.gouv.fr/fr/datasets/r/8af06636-9d44-4cfd-92b0-32a7e2059c75
#> --------------------------------------------------------------------------------------------
#> 
#> There was a problem synchronizing the dataset: Base nationale sur les intercommunalités.
#> The error message was: argument inutilisé (local_dir_only = TRUE)

(The same error, French for "unused argument (local_dir_only = TRUE)", repeats for each of the six remaining resource URLs of the dataset.)

Created on 2021-10-12 by the reprex package (v2.0.1)
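
For what it's worth, the error message itself hints at the problem: bb_sync also calls the handler with local_dir_only = TRUE (apparently to ask where the files will be stored locally), and the custom handler doesn't accept that argument. A minimal sketch of a handler that does, assuming (this is an assumption, not confirmed behaviour) that the handler should return the local directory when local_dir_only is TRUE:

bb_data_gouv_handler_get <- function(config, verbose = TRUE, local_dir_only = FALSE, files, ...) {
  cfrow <- bowerbird:::bb_settings_to_cols(config)
  if (isTRUE(local_dir_only)) {
    ## return the directory the files will be saved under
    ## (here derived from the source URL, mirroring bb_rget's default layout)
    return(sub("^https?://", "", cfrow$source_url))
  }
  this_flags <- c(list(url = cfrow$source_url, force_local_filename = files),
                  list(...), list(verbose = verbose))
  do.call(bb_rget, this_flags)
}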

getting OISST

I'm having this bomb out:

 downloading file 2646 of 15503: https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198811/oisst-avhrr-v02r01.19881128.nc ...
 bb_rget exited with an error (OpenSSL SSL_connect: Connection reset by peer in connection to www.ncei.noaa.gov:443 )

I tried it again, to see if it was similar but it bombed earlier:

 file unchanged on server, not downloading.
 downloading file 319 of 15504: https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198207/oisst-avhrr-v02r01.19820716.nc ...
  |                                                                      |   0%
 file unchanged on server, not downloading.
 downloading file 320 of 15504: https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198207/oisst-avhrr-v02r01.19820717.nc ...
 bb_rget exited with an error (Timeout was reached: [www.ncei.noaa.gov] SSL connection timeout)

Tue Feb 13 04:18:50 2024 dataset synchronization complete: NOAA OI 1/4 Degree Daily SST AVHRR
>
>
>
> proc.time()
    user   system  elapsed
  71.713    2.780 1075.225

trying again with 'wait = 10'
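
For reference, one way to pass that through (a sketch only; assumptions: the blueant source name matches the dataset name in the log above, and wait is forwarded to bb_rget as an extra method flag):

src <- blueant::sources("NOAA OI 1/4 Degree Daily SST AVHRR")
## append wait = 10 to the handler call stored in the method list-column,
## so the downloader pauses between successive requests
src$method[[1]] <- c(src$method[[1]], list(wait = 10))
cf <- bb_add(bb_config(local_file_root = "~/data"), src)
status <- bb_sync(cf)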

Support for 'local' or unpublished data?

I am wondering how well a bowerbird setup could work to define a central 'shared data' pool for a lab group. In addition to sharing raw remote downloads, I think it would often be useful to be able to add processed data in the bowerbird catalogue (i.e. this could be either independently collected data or merely 'processed' forms of existing bowerbird downloads). Maybe this is beyond the scope of bowerbird.

Mike's post-review notes

Just a few notes, mostly minor.

vignette("bowerbird") is mostly present in the readme; should the readme be reduced? (How are they kept in sync?)

"Sea Ice Trends and Climatologies from SMMR and SSM/I-SSMIS, Version 2" is not in the help for bb_example_sources, but Nimbus is (and is not in the example object)

seealso

  • put bb_sync, bb_source, and bb_config in each other's see also

spelling consistency

  • synchronize/synchronisation (-ize is consistent elsewhere)
  • typo in bb_source "providedso"

value

  • bb_source value is "tibble", consider "data frame" and state that it will have columns as per this function's arguments (excluding warn_empty_auth)
  • bb_data_sources, bb_example_sources - value is "tibble", maybe link to bb_source value
  • other functions also only list "tibble"

examples

Consider a dontrun for bb_sync; I think it's a nice place to repeat the overall pattern with an example that can be copy/pasted in whole:

td <- tempdir()  ## strictly, the user should specify this explicitly for control over the file system
cf <- bb_config(local_file_root = td)

## Bowerbird must then be told which data sources to synchronize.
## Let's use data from the Australian 2016 federal election, which is provided
## as one of the example data sources:
my_source <- subset(bb_example_sources(), id == "aus-election-house-2016")

## add this data source to the configuration
cf <- bb_add(cf, my_source)

## Once the configuration has been defined and the data source added to it, run the sync process:
status <- bb_sync(cf)

## see the fruits of our labour (files are CSVs in this data source)
head(list.files(td, recursive = TRUE, pattern = "csv$"))

## we can run this at any later time and our repository will update if the source has changed
status2 <- bb_sync(cf)

data retriever and bowerbird

I just learned about this cool package! It looks very nice. I just wanted to drop a line to let you know about @weecology's Data Retriever project and the corresponding R package rdataretriever. It seems like there could be room for future complementary development of these data-acquisition tools. Anyway, just wanted to say hi and keep up the good work.

R depends version

Suggest adding a minimum R version, say 3.3.1. Sara just tried installing on 3.0.0, and this version check would help guide a new starter.
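
In DESCRIPTION this would be a one-line change:

Depends: R (>= 3.3.1)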

test Z decompression

Either something has changed with archive::file_read, or this decompression never worked. The connection coming back from file_read is not a binary one.
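
A quick way to check (the file name here is a placeholder; archive::file_read defaults to text mode "r", so binary mode may need to be requested explicitly):

library(archive)
con <- file_read("some_data.nc.Z", mode = "rb")  ## request a binary connection
summary(con)  ## 'mode' should report "rb"; a text-mode connection would explain the failure
readBin(con, "raw", n = 4)  ## first bytes of the decompressed data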

structure of list-col for method

I couldn't reconcile the documentation here while trying to set up a custom accept_regex.

https://github.com/AustralianAntarcticDivision/bowerbird/blame/master/vignettes/bowerbird.Rmd#L432-L434

Specifically, I don't think this code fragment could work:

mysrc <- mysrc %>%
  mutate(method=c(method,list(accept_regex="/2017/")))

This is an example of what worked for me (with a private user/password not included here).

library(bowerbird)
library(dplyr)
my_directory <- "~/my/data/directory"
cf <- bb_config(local_file_root = my_directory)
my_source <- blueant::sources("CMEMS global gridded SSH near-real-time")
my_source <- my_source %>%
  mutate(method = list(c("bb_handler_wget", level = 3,
                         accept_regex = "nrt_global_allsat_phy_l4_2018.*.nc")))

Consider an extensible metadata model?

I like the minimal metadata required by bb_source() that we can search from the bb_data_sources() table. For large collections though, I wonder if it would make sense to support some additional optional fields that users could specify to make it easier to search their collections later, e.g. a keyword field, or file type, etc?

Going further -- much ink has been spilled over metadata descriptions for scientific data, and I am curious whether it would be worth cribbing from some of those. E.g. bowerbird could adopt https://schema.org/Dataset or DCAT2 as the basis for its metadata representation. I imagine most fields would still be optional, but this would allow for a bit greater expressiveness. Perhaps more relevantly, these fields could be auto-populated when importing data from sources that already expose metadata in these formats (e.g. Zenodo, data.gov, and many others serve schema.org/Dataset metadata descriptions).
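
Purely illustrative: auto-populating bb_source() fields from a schema.org/Dataset record (the URL is a placeholder, and the field mapping is an assumption about how such an importer might look):

meta <- jsonlite::fromJSON("https://example.org/dataset.jsonld")
src <- bb_source(
  name        = meta$name,
  id          = meta$identifier,
  description = meta$description,
  doc_url     = meta$url,
  citation    = meta$citation,
  license     = meta$license,
  source_url  = meta$distribution$contentUrl,
  method      = list("bb_handler_rget", level = 1))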

Unexpected timeout error in syncing from password-protected ftp server?

Hi folks. Very excited about the recursive download techniques in bowerbird, but I couldn't get the example below to work and am not sure what I am missing. It should be reproducible.

library(bowerbird)
modis <- bb_source(
  name = "MODIS MOD14 C6 L3 Fire Product",
  id = "c6-mcd14ml",
  description = "MODIS Monthly Fire Product",
  doc_url = "https://lpdaac.usgs.gov/documents/88/MOD14_User_Guide_v6.pdf/",
  citation = "https://doi.org/10.1029/2005JG000142",
  source_url = c("ftp://fuoco.geog.umd.edu/modis/C6/mcd14ml/"),
  license = "CC-BY",
  user = "fire",
  password = "burnt",
  method = list("bb_handler_rget", level = 1, accept_download = "\\.gz$"),
  collection_size = NA)

my_directory <- rappdirs::user_data_dir("bowerbird")
dir.create(my_directory, showWarnings = FALSE, recursive = TRUE)
cf <- bb_config(local_file_root = my_directory)
cf <- bb_add(cf, modis)
status <- bb_sync(cf, verbose = TRUE)

On the last sync command, I merely get a connection timeout error.

The following works fine for me:

wget --user=fire --password=burnt -r -np -R "index.html*" ftp://fuoco.geog.umd.edu/modis/C6/mcd14ml/

Oceandata downloader broken

Downloads from NASA's Oceandata system now seem to require user authentication via an Earthdata login. The existing bb_handler_oceandata therefore won't work.

bb and GADM

I have these sources for the GADM data behind raster::getData:

b <- "http://biogeo.ucdavis.edu/data/gadm2.8/rds/%s_adm0.rds"
gadm0 <- sprintf(b, raster::getData("ISO3")$ISO3)

gadm.rds <- bb_source(
  name = "GADM maps and data in RDS format",
  id = "gadm-maps-rds",
  description = "GADM provides maps and spatial data for all countries and their sub-divisions.",
  doc_url = "http://www.gadm.org",
  citation = "http://gadm.org/about.html",
  source_url = gadm0,
  license = "http://gadm.org/license.html",
  method = list("bb_handler_wget", level = 1, robots_off = TRUE),
  collection_size = 0.1,
  access_function = "base::readRDS",
  data_group = "Administrative")

gadm <- bb_source(
  name = "GADM maps and data in ESRI Geodatabase",
  id = "gadm-maps-gdb",
  description = "GADM provides maps and spatial data for all countries and their sub-divisions.",
  doc_url = "http://www.gadm.org",
  citation = "http://gadm.org/about.html",
  source_url = c("http://biogeo.ucdavis.edu/data/gadm2.8/gadm28.gdb.zip"),
  license = "http://gadm.org/license.html",
  method = list("bb_handler_wget", recursive = TRUE, level = 1, robots_off = TRUE),
  postprocess = list("bb_unzip"),
  collection_size = 1,
  access_function = "sf::read_sf",
  data_group = "Administrative")

I can't get it to access any data from an actual directory URL, so the gdb.zip is hardcoded and I construct the full (!!) list of URLs available for level 0 for each ISO3 country from the raster package list.

Obviously this is not robust to version updates, and it is not adaptable to varying levels in the RDS files (apparently some are higher than 3). Are there wget tricks to make this work more generally?
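
If the server exposes a directory listing at the rds/ URL (which it may not, and that could be exactly why the directory URL fails), one approach would be to recurse from the directory and filter with accept_regex rather than enumerating URLs up front. A sketch only, untested against this server:

gadm.rds <- bb_source(
  name = "GADM maps and data in RDS format",
  id = "gadm-maps-rds",
  description = "GADM provides maps and spatial data for all countries and their sub-divisions.",
  doc_url = "http://www.gadm.org",
  citation = "http://gadm.org/about.html",
  source_url = "http://biogeo.ucdavis.edu/data/gadm2.8/rds/",
  license = "http://gadm.org/license.html",
  method = list("bb_handler_wget", recursive = TRUE, level = 1, robots_off = TRUE,
                accept_regex = "_adm0\\.rds$"),
  collection_size = 0.1,
  access_function = "base::readRDS",
  data_group = "Administrative")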
