biocfilecache's Introduction

BiocManager

Overview

The BiocManager package, as the modern successor package to BiocInstaller, allows users to install and manage packages from the Bioconductor project. Bioconductor focuses on the statistical analysis and comprehension of high-throughput genomic data.

Current Bioconductor packages are available in a ‘release’ version intended for everyday use, and a ‘devel’ version where new features are continually introduced. A new release version is created every six months. Using the BiocManager package helps users install packages from the appropriate release.

  • available() shows all packages associated with a search pattern
  • install() installs and/or updates packages from either CRAN or Bioconductor
  • repositories() shows all package repository URL endpoints
  • valid() checks and returns packages that are out-of-date or too new
  • version() returns the current Bioconductor version number

Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

Usage

Checking Bioconductor version currently installed

BiocManager::version()
#> [1] '3.15'

Installing Bioconductor packages

BiocManager::install(c("GenomicRanges", "SummarizedExperiment"))

Verifying a valid Bioconductor installation

BiocManager::valid()
#> [1] TRUE

More information

Please see the package vignette for more detailed information such as changing Bioconductor version, offline use, and other advanced usage.

Getting help

To report apparent bugs, create a minimal and reproducible example on GitHub.

biocfilecache's People

Contributors

federicomarini, hpages, jwokaty, lshep, ltla, mgirlich, mtmorgan, nturaga, vobencha

biocfilecache's Issues

Problem with remote files with whitespaces in the file name

Hi Lori,

I would like to cache files from a public repository of mzML (raw mass spec data files) using BiocFileCache but it doesn't work because many of these files contain white spaces in their file names. Example:

library(curl)
url <- "ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML"

Unfortunately, there is a white space in the file name. So, adding the file right away does not work:

library(BiocFileCache)
bfc <- BiocFileCache(tempdir())
path <- bfcrpath(bfc, url)
adding rname 'ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML'
Error in bfcrpath(bfc, url) : not all 'rnames' found or unique.
In addition: Warning messages:
1: download failed
  web resource path:ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzMLlocal file path:/tmp/Rtmp3Pr4NW/74d1e62e9_20160603151123624-1576262%20Batch5_SHP77_2a.mzMLreason: URL using bad/illegal format or missing URL 
2: bfcadd() failed; resource removed
  rid: BFC1
  fpath:ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzMLreason: download failed 
3: In value[[3L]](cond) : 
trying to add rname 'ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML' produced error:
  bfcadd() failed; see warnings()

Replacing the white space with a %20 as required for URLs allows me to add the file to the cache - but this is not ideal because I need to change the original file name (which is usually used to link samples to the data files).

> url <- sub(" ", "%20", url, fixed = TRUE)
> bfc <- BiocFileCache(tempdir())
> path <- bfcrpath(bfc, url)
adding rname 'ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262%20Batch5_SHP77_2a.mzML'
  |======================================================================| 100%

What also puzzled me is that BiocFileCache further modified the file name by replacing the %20 with %2520 (???).

> path
                                                                          BFC2 
"/tmp/Rtmp3Pr4NW/76a624503_20160603151123624-1576262%2520Batch5_SHP77_2a.mzML" 

Ideally, I could give BiocFileCache the original file names of remote sources (possibly containing white spaces), and the package would fix the URLs internally (e.g. replace white spaces with %20) but use the original file name for the local copy. In other words, I would provide the original path and file name as above (20160603151123624-1576262 Batch5_SHP77_2a.mzML), BiocFileCache would download that file (fixing the file name in the URL to 20160603151123624-1576262%20Batch5_SHP77_2a.mzML) and store the local copy under the original file name 20160603151123624-1576262 Batch5_SHP77_2a.mzML. Would that be possible?
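A minimal sketch of the requested behaviour, using only base R and no network access: keep the original URL (and its basename) for naming, and derive a percent-encoded copy for the download. utils::URLencode() leaves an already-encoded URL alone unless repeated = TRUE, which reproduces the %2520 seen above. Whether BiocFileCache encodes the local file name this way internally is an assumption, and the commented bfcadd() call is an untested suggestion.

```r
# Original URL from the report, with a literal space in the file name.
url <- "ftp://massive.ucsd.edu/MSV000087155/ccms_peak/New_mzMLFinal/20160603151123624-1576262 Batch5_SHP77_2a.mzML"

# Percent-encode unsafe characters only; ':' and '/' are kept (reserved = FALSE).
encoded <- utils::URLencode(url)

# Encoding again is a no-op by default, because the input already contains %20 ...
once <- utils::URLencode(encoded)
# ... but repeated = TRUE escapes the '%' itself, turning %20 into %2520.
twice <- utils::URLencode(encoded, repeated = TRUE)

# The original file name is still available for linking samples to files.
original_name <- basename(url)

# Untested idea: keep the original URL as rname, download from the encoded one.
# path <- bfcadd(bfc, rname = url, fpath = encoded)
```

With this split, the sample-to-file link can rely on original_name while only the download step sees the encoded URL.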

How to query and add atomically?

I could not find an API which would do the following:

  1. start transaction
  2. bfcquery for resource X
  3. if absent, bfcadd X
  4. end transaction

I am afraid that I will run into a race condition if two processes both query, both get a negative result, and then both add at once.

Is there a way to rule out such a race condition?
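The four steps above can be sketched outside BiocFileCache with a simple mutex. This is an illustration under stated assumptions, not an existing BiocFileCache API: with_dir_lock() and add_if_absent() are hypothetical helpers, the "cache index" is a plain text file, and the lock relies on directory creation succeeding for only one process at a time (which may not hold on every network file system).

```r
# Hypothetical helper: run `fun` while holding a directory-based lock.
with_dir_lock <- function(lock_dir, fun, timeout = 10, poll = 0.05) {
    deadline <- Sys.time() + timeout
    # dir.create() returns TRUE only for the caller that actually creates it.
    while (!dir.create(lock_dir, showWarnings = FALSE)) {
        if (Sys.time() > deadline)
            stop("could not acquire lock: ", lock_dir)
        Sys.sleep(poll)
    }
    on.exit(unlink(lock_dir, recursive = TRUE), add = TRUE)  # release the lock
    fun()
}

# Toy stand-in for a cache index: a text file with one resource name per line.
add_if_absent <- function(index_file, rname) {
    with_dir_lock(paste0(index_file, ".lockdir"), function() {
        known <- if (file.exists(index_file)) readLines(index_file) else character()
        if (rname %in% known)
            return(FALSE)      # query succeeded, nothing to add
        cat(rname, "\n", sep = "", file = index_file, append = TRUE)
        TRUE                   # this caller added the resource
    })
}

index <- tempfile()
first  <- add_if_absent(index, "resource-X")   # adds the entry
second <- add_if_absent(index, "resource-X")   # finds it, adds nothing
```

The query and the add happen inside the same locked section, so a second process cannot slip in between them.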

Cannot lock file

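This first appeared with biomaRt, when calling ensembl <- useEnsembl(biomart = "ensembl"), but it seems to be due to BiocFileCache?

biomartCacheInfo() gives the same error.

I deleted /home/villemin/.cache/biomaRt, without success. In the error below, the French system message "Aucun verrou disponible" means "No locks available", and "Appels" means "Calls".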

Error in lock(.sql_lock_path(dbfile), exclusive = FALSE) :
  Cannot lock file: '/home/villemin/.cache/biomaRt/BiocFileCache.sqlite.LOCK': Aucun verrou disponible
Appels : biomartCacheInfo ... tryCatch -> tryCatchList -> .sql_connect_RW -> lock
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'conn' in selecting a method for function 'dbDisconnect': object 'info' not found
Calls: <Anonymous> ... .sql_disconnect -> dbDisconnect -> .handleSimpleError -> h
Execution halted
R version 4.1.3 (2022-03-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS/LAPACK: /data/USERS/villemin/anaconda3/envs/r4.1.3/lib/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=fr_FR.UTF-8          LC_NUMERIC=C                 
 [3] LC_TIME=fr_FR.UTF-8           LC_COLLATE=fr_FR.UTF-8       
 [5] LC_MONETARY=fr_FR.UTF-8       LC_MESSAGES=fr_FR.UTF-8      
 [7] LC_PAPER=fr_FR.UTF-8          LC_NAME=fr_FR.UTF-8          
 [9] LC_ADDRESS=fr_FR.UTF-8        LC_TELEPHONE=fr_FR.UTF-8     
[11] LC_MEASUREMENT=fr_FR.UTF-8    LC_IDENTIFICATION=fr_FR.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] BulkSignalR_0.0.9   BiocFileCache_2.2.1 dbplyr_2.1.1       
 [4] stringi_1.7.6       rjson_0.2.21        biomaRt_2.50.3     
 [7] ggrepel_0.9.1       ggridges_0.5.3      viridis_0.6.2      
[10] viridisLite_0.4.0   stringr_1.4.0       doParallel_1.0.17  
[13] iterators_1.0.14    foreach_1.5.2       paxtoolsr_1.28.0   
[16] XML_3.99-0.9        rJava_1.0-6         tidyr_1.2.0        
[19] dplyr_1.0.8         ggplot2_3.3.5       igraph_1.3.0       
[22] devtools_2.4.3      usethis_2.1.5       data.table_1.14.2  
[25] glue_1.6.2         

loaded via a namespace (and not attached):
 [1] matrixStats_0.61.0     bitops_1.0-7           fs_1.5.2              
 [4] bit64_4.0.5            RColorBrewer_1.1-3     filelock_1.0.2        
 [7] progress_1.2.2         httr_1.4.2             rprojroot_2.0.3       
[10] GenomeInfoDb_1.30.1    tools_4.1.3            utf8_1.2.2            
[13] R6_2.5.1               DBI_1.1.2              BiocGenerics_0.40.0   
[16] colorspace_2.0-3       GetoptLong_1.0.5       withr_2.5.0           
[19] tidyselect_1.1.2       gridExtra_2.3          prettyunits_1.1.1     
[22] processx_3.5.3         curl_4.3.2             bit_4.0.4             
[25] compiler_4.1.3         cli_3.2.0              Biobase_2.54.0        
[28] xml2_1.3.3             desc_1.4.1             scales_1.1.1          
[31] readr_2.1.2            callr_3.7.0            rappdirs_0.3.3        
[34] digest_0.6.29          R.utils_2.11.0         XVector_0.34.0        
[37] pkgconfig_2.0.3        sessioninfo_1.2.2      fastmap_1.1.0         
[40] GlobalOptions_0.1.2    rlang_1.0.2            RSQLite_2.2.12        
[43] shape_1.4.6            generics_0.1.2         jsonlite_1.8.0        
[46] R.oo_1.24.0            RCurl_1.98-1.6         magrittr_2.0.3        
[49] GenomeInfoDbData_1.2.7 Rcpp_1.0.8.3           munsell_0.5.0         
[52] S4Vectors_0.32.4       fansi_1.0.3            lifecycle_1.0.1       
[55] R.methodsS3_1.8.1      zlibbioc_1.40.0        brio_1.1.3            
[58] pkgbuild_1.3.1         plyr_1.8.7             grid_4.1.3            
[61] blob_1.2.3             crayon_1.5.1           Biostrings_2.62.0     
[64] circlize_0.4.14        hms_1.1.1              KEGGREST_1.34.0       
[67] ComplexHeatmap_2.10.0  ps_1.6.0               pillar_1.7.0          
[70] codetools_0.2-18       stats4_4.1.3           pkgload_1.2.4         
[73] remotes_2.4.2          vctrs_0.4.0            png_0.1-7             
[76] tzdb_0.3.0             testthat_3.1.3         gtable_0.3.0          
[79] purrr_0.3.4            clue_0.3-60            assertthat_0.2.1      
[82] cachem_1.0.6           tibble_3.1.6           AnnotationDbi_1.56.2  
[85] memoise_2.0.1          IRanges_2.28.0         cluster_2.1.3         
[88] ellipsis_0.3.2        

Brainstorming: multiple cache locations handled like .libPaths()

Following a chat with @lshep here is an idea that crossed my mind, and that I offer for discussion.

First observation:

  • Similarly to .libPaths(), it can be appealing for groups to have a 'central' cache (e.g. so that each person in a group does not have to cache the TENxBrainData data set separately; instead, a central copy is accessible to everyone)

Second observation:

  • again, similarly to .libPaths(), rather than having a single location (potentially the 'central' one described above), it may be attractive to have an array of locations
    • in order of priority, that can be iteratively scanned ("if the resource is not available in the central cache, check my private cache, otherwise download it in [central|private|other] cache")
    • to stratify resources (e.g. 'central', 'personal', maybe even 'annotations', etc.)
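To make the idea concrete, here is a rough sketch of a prioritised lookup over several cache locations. It uses plain directories rather than BiocFileCache objects, and find_resource(), the directory names, and the file name are all hypothetical.

```r
# Hypothetical layout: two plain directories standing in for caches,
# listed in order of priority (first match wins), like .libPaths().
base <- tempfile("caches")
cache_paths <- c(central  = file.path(base, "central"),
                 personal = file.path(base, "personal"))
for (p in cache_paths) dir.create(p, recursive = TRUE)

# Only the personal cache holds this (empty, illustrative) resource.
file.create(file.path(cache_paths[["personal"]], "dataset.rds"))

# Return the first existing copy of `fname`, or NA if no cache has it.
find_resource <- function(paths, fname) {
    candidates <- file.path(paths, fname)
    found <- file.exists(candidates)
    if (!any(found))
        return(NA_character_)
    candidates[which(found)[1]]
}

hit  <- find_resource(cache_paths, "dataset.rds")     # found in 'personal'
miss <- find_resource(cache_paths, "not-cached.rds")  # NA: would trigger a download
```

A real implementation would also need a rule for which location receives a fresh download, which is the open question in the brainstorm.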

curl error with bfcrpath: HTTP/2 stream 1 was not closed cleanly: PROTOCOL_ERROR (err 1)

Hi BiocFileCache team,

Last week at BioC2023 I ran into some issues with spatialLIBD::fetch_data("spatialDLPFC_Visium_modeling_results") and related functions after I wiped my BiocFileCache clean on my laptop and forced it to re-download data. I made a reprex below abstracting away the spatialLIBD layer.

library("BiocFileCache")
#> Loading required package: dbplyr
bfc <- BiocFileCache::BiocFileCache()
BiocFileCache::bfcrpath(bfc, "http://www.dropbox.com/s/srkb2ife75px2yz/modeling_results_BayesSpace_k09.Rdata?dl=1")
#> adding rname 'http://www.dropbox.com/s/srkb2ife75px2yz/modeling_results_BayesSpace_k09.Rdata?dl=1'
#> Warning in value[[3L]](cond): 
#> trying to add rname 'http://www.dropbox.com/s/srkb2ife75px2yz/modeling_results_BayesSpace_k09.Rdata?dl=1' produced error:
#>   HTTP/2 stream 1 was not closed cleanly: PROTOCOL_ERROR (err 1)
#> Error in BiocFileCache::bfcrpath(bfc, "http://www.dropbox.com/s/srkb2ife75px2yz/modeling_results_BayesSpace_k09.Rdata?dl=1"): not all 'rnames' found or unique.
traceback()
#> No traceback available
options(width = 120)
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16)
#>  os       macOS Ventura 13.4
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2023-08-11
#>  pandoc   3.1.5 @ /opt/homebrew/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package       * version date (UTC) lib source
#>  BiocFileCache * 2.8.0   2023-04-25 [1] Bioconductor
#>  bit             4.0.5   2022-11-15 [1] CRAN (R 4.3.0)
#>  bit64           4.0.5   2020-08-30 [1] CRAN (R 4.3.0)
#>  blob            1.2.4   2023-03-17 [1] CRAN (R 4.3.0)
#>  cachem          1.0.8   2023-05-01 [1] CRAN (R 4.3.0)
#>  cli             3.6.1   2023-03-23 [1] CRAN (R 4.3.0)
#>  curl            5.0.1   2023-06-07 [1] CRAN (R 4.3.0)
#>  DBI             1.1.3   2022-06-18 [1] CRAN (R 4.3.0)
#>  dbplyr        * 2.3.3   2023-07-07 [1] CRAN (R 4.3.0)
#>  digest          0.6.33  2023-07-07 [1] CRAN (R 4.3.0)
#>  dplyr           1.1.2   2023-04-20 [1] CRAN (R 4.3.0)
#>  evaluate        0.21    2023-05-05 [1] CRAN (R 4.3.0)
#>  fansi           1.0.4   2023-01-22 [1] CRAN (R 4.3.0)
#>  fastmap         1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
#>  filelock        1.0.2   2018-10-05 [1] CRAN (R 4.3.0)
#>  fs              1.6.3   2023-07-20 [1] CRAN (R 4.3.0)
#>  generics        0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
#>  glue            1.6.2   2022-02-24 [1] CRAN (R 4.3.0)
#>  htmltools       0.5.5   2023-03-23 [1] CRAN (R 4.3.0)
#>  httr            1.4.6   2023-05-08 [1] CRAN (R 4.3.0)
#>  knitr           1.43    2023-05-25 [1] CRAN (R 4.3.0)
#>  lifecycle       1.0.3   2022-10-07 [1] CRAN (R 4.3.0)
#>  magrittr        2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
#>  memoise         2.0.1   2021-11-26 [1] CRAN (R 4.3.0)
#>  pillar          1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig       2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
#>  purrr           1.0.1   2023-01-10 [1] CRAN (R 4.3.0)
#>  R.cache         0.16.0  2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3     1.8.2   2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo            1.25.0  2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils         2.12.2  2022-11-11 [1] CRAN (R 4.3.0)
#>  R6              2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
#>  reprex          2.0.2   2022-08-17 [1] CRAN (R 4.3.0)
#>  rlang           1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
#>  rmarkdown       2.23    2023-07-01 [1] CRAN (R 4.3.0)
#>  RSQLite         2.3.1   2023-04-03 [1] CRAN (R 4.3.0)
#>  rstudioapi      0.15.0  2023-07-07 [1] CRAN (R 4.3.0)
#>  sessioninfo     1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>  styler          1.10.1  2023-06-05 [1] CRAN (R 4.3.0)
#>  tibble          3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyselect      1.2.0   2022-10-10 [1] CRAN (R 4.3.0)
#>  utf8            1.2.3   2023-01-31 [1] CRAN (R 4.3.0)
#>  vctrs           0.6.3   2023-06-14 [1] CRAN (R 4.3.0)
#>  withr           2.5.0   2022-03-03 [1] CRAN (R 4.3.0)
#>  xfun            0.39    2023-04-20 [1] CRAN (R 4.3.0)
#>  yaml            2.3.7   2023-01-23 [1] CRAN (R 4.3.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Created on 2023-08-11 with reprex v2.0.2

I was surprised by this as everything is working ok at GitHub actions using the biocthis workflow as well as in the Bioconductor BBS system.

In the 30 min prior to our spatialLIBD BioC2023 demo with @lahuuki, I tried resolving this. By googling the error message, I realized that curl (and httr), both BiocFileCache dependencies, now use HTTP/2 instead of HTTP/1.1. See jeroen/curl#232 for a related issue. There they link to https://cran.r-project.org/web/packages/curl/vignettes/intro.html#Disabling_HTTP2 for disabling HTTP/2, but I'm not sure whether I can do that as a BiocFileCache user. Maybe through the config argument (https://github.com/search?q=repo%3ABioconductor%2FBiocFileCache%20config&type=code), but that doesn't seem to work (maybe I'm not specifying config where I should; note that http_version = 2 is the enum value the curl vignette uses to force HTTP/1.1):

> BiocFileCache::bfcrpath(bfc, "http://www.dropbox.com/s/srkb2ife75px2yz/modeling_results_BayesSpace_k09.Rdata?dl=1", config = list(http_version = 2))
adding rname 'http://www.dropbox.com/s/srkb2ife75px2yz/modeling_results_BayesSpace_k09.Rdata?dl=1'
  |==================================================================================| 100%

Error in BiocFileCache::bfcrpath(bfc, "http://www.dropbox.com/s/srkb2ife75px2yz/modeling_results_BayesSpace_k09.Rdata?dl=1",  : 
  not all 'rnames' found or unique.
In addition: Warning message:
In value[[3L]](cond) : 
trying to add rname 'http://www.dropbox.com/s/srkb2ife75px2yz/modeling_results_BayesSpace_k09.Rdata?dl=1' produced error:
  HTTP/2 stream 1 was not closed cleanly: PROTOCOL_ERROR (err 1)

What I find weird is that the download does actually happen, but at the very end it errors out. So I'm not 100% sure that the error is related to the http version 1.1 vs 2 change.

Hm... maybe something else changed in BiocFileCache::bfcrpath() that I'm not aware of, but well, in that case, I would expect things to break on GitHub Actions and BBS as well.

1 to 3 students at https://github.com/lcolladotor/cshl_rstats_genome_scale_2023 ran into this error. I thought it was related to their laptop configurations, and for the purpose of that course we could get around it by having them download the data from Dropbox manually.

Thanks,
Leo

prompt wraps poorly on macOS

When prompted to create a new AnnotationHub cache, at https://github.com/Bioconductor/BiocFileCache/blob/master/R/utilities.R#L79 , I see the equivalent of

> txt = paste0(cache, "\n  does not exist, create directory?", " (yes/no): ")
> readline(txt)
/Users/ma38727/Library/Caches/BiocFileCache
(yes/no):  exist, create directory?

where '(yes/no):' has been aligned to the left; maybe readline() is trying to wrap line width?

> sessionInfo()
R version 3.6.0 alpha (2019-03-29 r76306)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BiocFileCache_1.7.3 dbplyr_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       digest_0.6.18    crayon_1.3.4     dplyr_0.8.0.1
 [5] assertthat_0.2.1 rappdirs_0.3.1   R6_2.4.0         DBI_1.0.0
 [9] magrittr_1.5     RSQLite_2.1.1    httr_1.4.0       pillar_1.3.1
[13] rlang_0.3.3      curl_3.3         blob_1.1.1       bit64_0.9-7
[17] glue_1.3.1       bit_1.1-14       purrr_0.3.2      compiler_3.6.0
[21] pkgconfig_2.0.2  memoise_1.1.0    tidyselect_0.2.5 tibble_2.1.1

A solution might remove the \n and instead provide a single line, like

txt = paste0("'", cache, "' does not exist, create directory?", " (yes/no): ")

likely other uses of .util_ask() would need similar modification.

... or use cat() (stdout) for the body of the text, and readline() for "yes/no: "
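The cat()-plus-readline() suggestion could look roughly like this. ask_yes_no() is a hypothetical stand-in, not the package's .util_ask(); the reader argument exists only so the sketch can run without an interactive session.

```r
# Hypothetical replacement for a yes/no prompt helper. `reader` is injectable
# so the function can be exercised without an interactive session.
ask_yes_no <- function(msg, reader = readline, max_tries = 3) {
    cat(msg, "\n", sep = "")   # long text (may contain newlines) goes to stdout
    for (i in seq_len(max_tries)) {
        answer <- tolower(trimws(reader("(yes/no): ")))  # short prompt only
        if (answer %in% c("yes", "y")) return(TRUE)
        if (answer %in% c("no", "n")) return(FALSE)
    }
    stop("no valid answer after ", max_tries, " attempts")
}

# Non-interactive demonstration with a stub that always answers "yes".
created <- ask_yes_no("/path/to/cache\n  does not exist, create directory?",
                      reader = function(prompt) "yes")
```

Because readline() only ever receives the short "(yes/no): " string, any line-width handling in the terminal no longer affects the body of the message.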

Is BiocFileCache's directory structure "scratch-safe"?

In many HPC environments, we have scratch storage where files are deleted after they have not been accessed for X days. This provides quota-free data usage without the cost of accumulating a bunch of crud. For one of my applications, I've thought about setting its dedicated BiocFileCache instance to live in the scratch space. This would mean that I could put large files in there without worrying about the space limits in the user's home directory.

However, it also means that files will just randomly disappear over time. I am fine with this, as I can write checks for whether the file path exists and redownload it if it vanished. I also assume that the SQLite file will not be deleted as it is touched upon any access; or specifically, if it does get deleted, it should be the last thing in the cache to go. At that point the folder does not exist as a meaningful cache anymore and I can just recreate it without much consequence.

The questions I have are:

  • Is this safe? Are there any other files expected by BFC that might be lost by the scratch policy?
  • Is this actually effective? Does BiocFileCache touch files beyond the SQLite file in routine operations?
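For reference, the "check and redownload" guard described above might be sketched as follows. ensure_file() and fake_fetch() are hypothetical, the fetch step writes a dummy file instead of downloading, and whether refreshing the timestamp satisfies a given scratch policy depends on how that system decides what to purge.

```r
# `fetch` is a hypothetical function(path) that (re)creates the file,
# e.g. by downloading it again.
ensure_file <- function(path, fetch) {
    if (file.exists(path)) {
        Sys.setFileTime(path, Sys.time())  # refresh the timestamp on use
    } else {
        fetch(path)                        # purged or never downloaded
    }
    path
}

calls <- 0
fake_fetch <- function(path) {
    calls <<- calls + 1                    # count how often we had to re-fetch
    writeLines("payload", path)
}

target <- tempfile(fileext = ".txt")
ensure_file(target, fake_fetch)   # file missing: fetched
ensure_file(target, fake_fetch)   # file present: only touched
unlink(target)                    # simulate the scratch purge
ensure_file(target, fake_fetch)   # fetched again
```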

rpath arguments should do exact matching

Having the rnames argument of bfcrpath() be a regex can introduce new errors into unchanged code as the cache grows. At the least, I'd suggest offering an exact matching option, which is what I initially thought this argument was doing until I got an error like the one below:

> library(BiocFileCache)
> bfc0 <- BiocFileCache(tempfile(), ask=FALSE)         # temporary cache for examples
> fl1 <- tempfile(); file.create(fl1)
[1] TRUE
> bfcadd(bfc0, "Test1", fl1)                 # copy
                                                                                                           BFC1 
"/var/folders/3l/s3hwdq9s1mb9sx2_7wmy1v800000gq/T//Rtmp4IFKTQ/file136884b92dc93/1368880dcc4b_file1368816cf5da8" 
> bfcrpath(bfc0, "Test1")
                                                                                                           BFC1 
"/var/folders/3l/s3hwdq9s1mb9sx2_7wmy1v800000gq/T//Rtmp4IFKTQ/file136884b92dc93/1368880dcc4b_file1368816cf5da8" 
> fl10 <- tempfile(); file.create(fl10)
[1] TRUE
> bfcadd(bfc0, "Test10", fl10)                 # copy
                                                                                                            BFC2 
"/var/folders/3l/s3hwdq9s1mb9sx2_7wmy1v800000gq/T//Rtmp4IFKTQ/file136884b92dc93/136886a8fad26_file1368848a77765" 
> bfcrpath(bfc0, "Test1")
Error in bfcrpath(bfc0, "Test1") : all 'rnames' not found or valid.
In addition: Warning message:
In FUN(X[[i]], ...) : rname: 'Test1' is not unique.
> bfcrpath(bfc0, "Test1$")
                                                                                                           BFC1 
"/var/folders/3l/s3hwdq9s1mb9sx2_7wmy1v800000gq/T//Rtmp4IFKTQ/file136884b92dc93/1368880dcc4b_file1368816cf5da8" 
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2.2      BiocFileCache_1.5.0 dbplyr_1.2.1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17     rstudioapi_0.7   bindr_0.1.1      magrittr_1.5    
 [5] rappdirs_0.3.1   tidyselect_0.2.4 bit_1.1-14       R6_2.2.2        
 [9] rlang_0.2.1      blob_1.1.1       httr_1.3.1       dplyr_0.7.5     
[13] tools_3.5.0      utf8_1.1.4       cli_1.0.0        DBI_1.0.0       
[17] yaml_2.1.19      bit64_0.9-7      assertthat_0.2.0 digest_0.6.15   
[21] tibble_1.4.2     crayon_1.3.4     purrr_0.2.5      memoise_1.1.0   
[25] glue_1.2.0       RSQLite_2.1.1    compiler_3.5.0   pillar_1.2.3    
[29] pkgconfig_2.0.1 
> 
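The difference between regex and exact matching can be shown with base R alone, using a named vector as a stand-in for the cache's rname column (the vector is illustrative, not a BiocFileCache object):

```r
# Stand-in for the cache: resource ids mapped to their rnames.
rnames <- c(BFC1 = "Test1", BFC2 = "Test10")

# Regex matching: "Test1" is a substring of both rnames, so it is ambiguous.
regex_hits <- names(rnames)[grepl("Test1", rnames)]

# Anchoring both ends, or comparing for equality, selects one entry.
anchored_hits <- names(rnames)[grepl("^Test1$", rnames)]
exact_hits    <- names(rnames)[rnames == "Test1"]
```

Anchoring only works as an exact match when the name contains no regex metacharacters, which is one reason a dedicated exact-matching option is attractive.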

bfcneedsupdate() fails when last_modified_time is NA

library(BiocFileCache)
#> Loading required package: dbplyr

bfc <- BiocFileCache()
url <- "https://github.com/statgen/libStatGen/raw/master/general/test/phiX.fa"

ans <- bfcadd(bfc, rname = "phix", fpath = url)
rid <- bfcquery(bfc, url)$rid
check <- bfcneedsupdate(bfc, rid)
#> Error in if (expired) {: missing value where TRUE/FALSE needed

Created on 2018-11-26 by the reprex package (v0.2.1)
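The error message is what R raises whenever if() receives NA. A minimal sketch of a guard, assuming the comparison involves two timestamps of which either may be missing; needs_update() is a hypothetical helper and not the package's implementation, and the dates are made up for illustration.

```r
# In R, `if` cannot branch on a missing value:
err <- tryCatch(if (NA) "expired", error = function(e) conditionMessage(e))

# Hypothetical helper: compare two timestamps, either of which may be NA.
needs_update <- function(cached_time, remote_time) {
    if (is.na(cached_time) || is.na(remote_time))
        return(NA)                 # cannot tell; let the caller decide
    remote_time > cached_time
}

unknown <- needs_update(NA, NA)
stale   <- needs_update(as.Date("2018-01-01"), as.Date("2018-11-26"))
```

Returning NA for the undecidable case lets the caller choose between re-downloading and keeping the cached copy, instead of failing.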

Session info
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.5.1 (2018-07-02)
#>  os       macOS High Sierra 10.13.6   
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_AU.UTF-8                 
#>  ctype    en_AU.UTF-8                 
#>  tz       Australia/Melbourne         
#>  date     2018-11-26                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package       * version     date       lib
#>  assertthat      0.2.0       2017-04-11 [2]
#>  backports       1.1.2       2017-12-13 [2]
#>  base64enc       0.1-3       2015-07-28 [2]
#>  BiocFileCache * 1.7.0       2018-11-26 [1]
#>  bit             1.1-14      2018-05-29 [2]
#>  bit64           0.9-7       2017-05-08 [2]
#>  blob            1.1.1       2018-03-25 [2]
#>  callr           3.0.0       2018-08-24 [2]
#>  cli             1.0.1       2018-09-25 [2]
#>  crayon          1.3.4       2017-09-16 [2]
#>  curl            3.2         2018-03-28 [2]
#>  DBI             1.0.0       2018-05-02 [2]
#>  dbplyr        * 1.2.2       2018-07-25 [2]
#>  desc            1.2.0       2018-05-01 [2]
#>  devtools        2.0.1       2018-10-26 [1]
#>  digest          0.6.18      2018-10-10 [1]
#>  dplyr           0.7.99.9000 2018-10-25 [1]
#>  evaluate        0.12        2018-10-09 [2]
#>  fs              1.2.6       2018-08-23 [1]
#>  glue            1.3.0       2018-07-17 [1]
#>  htmltools       0.3.6       2017-04-28 [2]
#>  httr            1.3.1       2017-08-20 [2]
#>  knitr           1.20        2018-02-20 [2]
#>  magrittr        1.5         2014-11-22 [2]
#>  memoise         1.1.0       2017-04-21 [1]
#>  pillar          1.3.0.9001  2018-10-25 [1]
#>  pkgbuild        1.0.2       2018-10-16 [2]
#>  pkgconfig       2.0.2       2018-10-14 [1]
#>  pkgload         1.0.2       2018-10-29 [2]
#>  prettyunits     1.0.2       2015-07-13 [2]
#>  processx        3.2.0       2018-08-16 [2]
#>  ps              1.2.1       2018-11-06 [2]
#>  purrr           0.2.5       2018-05-29 [1]
#>  R6              2.3.0       2018-10-04 [2]
#>  rappdirs        0.3.1       2016-03-28 [1]
#>  Rcpp            1.0.0       2018-11-07 [1]
#>  remotes         2.0.2       2018-10-30 [2]
#>  rlang           0.3.0.9000  2018-11-26 [1]
#>  rmarkdown       1.10        2018-06-11 [2]
#>  rprojroot       1.3-2       2018-01-03 [2]
#>  RSQLite         2.1.1       2018-05-06 [2]
#>  sessioninfo     1.1.1       2018-11-05 [2]
#>  stringi         1.2.4       2018-07-20 [1]
#>  stringr         1.3.1       2018-05-10 [2]
#>  testthat        2.0.1       2018-10-13 [2]
#>  tibble          1.4.99.9005 2018-11-14 [1]
#>  tidyselect      0.2.5       2018-10-11 [1]
#>  usethis         1.4.0       2018-08-14 [2]
#>  withr           2.1.2       2018-03-15 [2]
#>  yaml            2.2.0       2018-07-25 [2]
#>  source                                     
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  Github (Bioconductor/BiocFileCache@fbec077)
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.1)                             
#>  CRAN (R 3.5.0)                             
#>  Github (tidyverse/dplyr@2ea8925)           
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  Github (r-lib/pillar@c5bf622)              
#>  CRAN (R 3.5.0)                             
#>  Github (gaborcsardi/pkgconfig@e9d0190)     
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.1)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  Github (r-lib/rlang@cd272fd)               
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  Github (tidyverse/tibble@6d53680)          
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#>  CRAN (R 3.5.0)                             
#> 
#> [1] /Users/su.s/Library/R/3.5/library
#> [2] /Library/Frameworks/R.framework/Versions/3.5/Resources/library

Error in removebfc in an interactive session

I think this error is inconsequential for the outcome, but removebfc throws an error when the answer is 'no' in an interactive session.

library(BiocFileCache)
bfc <- BiocFileCache()
removebfc(bfc)
#> remove cache and 1 resource(s)? (yes/no): no
#> Error in removebfc(bfc) : object 'doit' not found
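The message suggests that a variable named doit is assigned on only some code paths. The following is a guess at that general pattern, not the package's actual source; remove_buggy() and remove_fixed() are hypothetical.

```r
# Buggy pattern (hypothetical): `doit` exists only when the answer is "yes".
remove_buggy <- function(answer) {
    if (answer == "yes")
        doit <- TRUE
    if (doit) "removed" else "kept"
}

# Fixed pattern: derive the flag from the answer on every path.
remove_fixed <- function(answer) {
    doit <- identical(answer, "yes")
    if (doit) "removed" else "kept"
}

buggy_msg <- tryCatch(remove_buggy("no"), error = function(e) conditionMessage(e))
fixed_no  <- remove_fixed("no")
fixed_yes <- remove_fixed("yes")
```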

bfcquery returns inconsistent column types for empty rows

The columns create_time and access_time are character vectors when the result is non-empty, and double vectors when it is empty.
I would expect them to have the same type consistently; perhaps always character vectors, although it is not clear why they are not date or datetime types instead.
The inconsistent types cause an error when row-binding multiple queries with purrr::map_df where some queries succeed and others return no rows:

> files_remote
[1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1480nnn/GSM1480327/suppl/GSM1480327_K562_PROseq_minus.bw"
[2] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1480nnn/GSM1480327/suppl/GSM1480327_K562_PROseq_plus.bw" 
> map_df(files_remote, bfcquery, x = bfc)
Error: Can't combine `create_time` <character> and `create_time` <double>.
Run `rlang::last_error()` to see where the error occurred.
> map_df(files_remote[1], bfcquery, x = bfc)
# A tibble: 1 x 10
  rid   rname create_time access_time rpath rtype fpath last_modified_t… etag 
  <chr> <chr> <chr>       <chr>       <chr> <chr> <chr>            <dbl> <chr>
1 BFC6  ftp:… 2020-06-29… 2020-06-29… /hom… web   ftp:…               NA NA   
# … with 1 more variable: expires <dbl>
> map_df(files_remote[2], bfcquery, x = bfc)
# A tibble: 0 x 10
# … with 10 variables: rid <chr>, rname <chr>, create_time <dbl>,
#   access_time <dbl>, rpath <chr>, rtype <chr>, fpath <chr>,
#   last_modified_time <dbl>, etag <chr>, expires <dbl>
>  
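Until the types are consistent, one possible workaround is to coerce the two time columns before binding. The sketch below uses hand-built data frames with made-up timestamps in place of real bfcquery() results; normalise_times() is a hypothetical helper.

```r
# Stand-ins for a non-empty and an empty query result (values are made up).
hit <- data.frame(rid = "BFC6",
                  create_time = "2020-01-01 00:00:00",
                  access_time = "2020-01-01 00:00:00",
                  stringsAsFactors = FALSE)
miss <- data.frame(rid = character(),
                   create_time = double(),
                   access_time = double(),
                   stringsAsFactors = FALSE)

# Force both time columns to character so every result has the same types.
normalise_times <- function(df) {
    for (col in c("create_time", "access_time"))
        df[[col]] <- as.character(df[[col]])
    df
}

results  <- lapply(list(hit, miss), normalise_times)
combined <- do.call(rbind, results)
```

The same normalisation could be applied inside the function passed to purrr::map_df, so that every element has matching column types before they are combined.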

example(bfcmeta) can throw error

eventually we get to

bfcmet> bfcsync(bfc0)
entries without corresponding files: 'BFC1'
delete 1 entries? (yes/no): yes
[1] TRUE

bfcmet> bfcremove(bfc0, "BFC1")
Error in bfcremove(bfc0, "BFC1") : all(rids %in% bfcrid(x)) is not TRUE

Enter a frame number, or 0 to exit   

1: example(bfcmeta)
2: source(tf, local, echo = echo, prompt.echo = paste0(prompt.prefix, getOptio
3: withVisible(eval(ei, envir))
4: eval(ei, envir)
5: eval(ei, envir)
6: Rex18361745a3823#91: bfcremove(bfc0, "BFC1")
7: bfcremove(bfc0, "BFC1")
8: stopifnot(all(rids %in% bfcrid(x)))

I am trying to explain the use of bfc to a developer, and ?bfcmeta leads to a thorough but nonspecific and overly long man page. The use cases in the vignette are good and need careful study by prospective users.

`BiocFileCache()` inducing R abort

In a clean R session, BiocFileCache seems to be causing a hard crash. I've tested on version 1.9.1 and 1.9.0 and get this behaviour. Not sure here what the underlying cause is. I discovered this in using the HCAData package.

> BiocFileCache::BiocFileCache()

 *** caught illegal operation ***
address 0x7f7ad1803fd3, cause 'illegal operand'

Traceback:
 1: select_impl(.data, vars)
 2: select.data.frame(.data, !!!dots)
 3: select(.data, !!!dots)
 4: select_.data.frame(., ~-id)
 5: select_(., ~-id)
 6: function_list[[k]](value)
 7: withVisible(function_list[[k]](value))
 8: freduce(value, `_function_list`)
 9: `_fseq`(`_lhs`)
10: eval(quote(`_fseq`(`_lhs`)), env, env)
11: eval(quote(`_fseq`(`_lhs`)), env, env)
12: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
13: tbl %>% select_(~-id)
14: .sql_get_resource_table(bfc)
15: eval(lhs, parent, parent)
16: eval(lhs, parent, parent)
17: .sql_get_resource_table(bfc) %>% select_("rid") %>% .formatID
18: .get_all_rids(x)
19: bfcrid(x)
20: bfcrid(x)
21: bfcinfo(x)
22: bfcinfo(x)
23: bfccount(bfcinfo(x))
24: bfccount(object)
25: bfccount(object)
26: cat("class: ", class(object), "\n", "bfccache: ", bfccache(object),     "\n", "bfccount: ", bfccount(object), "\n", "For more information see: bfcinfo() or bfcquery()\n",     sep = "")
27: (new("standardGeneric", .Data = function (object) standardGeneric("show"), generic = "show", package = "methods",     group = list(), valueClass = character(0), signature = "object",     default = new("derivedDefaultMethod", .Data = function (object)     showDefault(object, FALSE), target = new("signature", .Data = "ANY",         names = "object", package = "methods"), defined = new("signature",         .Data = "ANY", names = "object", package = "methods"),         generic = "show"), skeleton = (new("derivedDefaultMethod",         .Data = function (object)         showDefault(object, FALSE), target = new("signature",             .Data = "ANY", names = "object", package = "methods"),         defined = new("signature", .Data = "ANY", names = "object",             package = "methods"), generic = "show"))(object)))(new("BiocFileCache",     cache = "/home/ramezqui/.cache/BiocFileCache"))
28: (new("standardGeneric", .Data = function (object) standardGeneric("show"), generic = "show", package = "methods",     group = list(), valueClass = character(0), signature = "object",     default = new("derivedDefaultMethod", .Data = function (object)     showDefault(object, FALSE), target = new("signature", .Data = "ANY",         names = "object", package = "methods"), defined = new("signature",         .Data = "ANY", names = "object", package = "methods"),         generic = "show"), skeleton = (new("derivedDefaultMethod",         .Data = function (object)         showDefault(object, FALSE), target = new("signature",             .Data = "ANY", names = "object", package = "methods"),         defined = new("signature", .Data = "ANY", names = "object",             package = "methods"), generic = "show"))(object)))(new("BiocFileCache",     cache = "/home/ramezqui/.cache/BiocFileCache"))

Session Info:


> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS/LAPACK: /app/easybuild/software/OpenBLAS/0.2.18-GCC-5.4.0-2.26-LAPACK-3.6.1/lib/libopenblas_prescottp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] here_0.1       fs_1.3.1       devtools_2.1.0 usethis_1.5.1 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2        ps_1.3.0          prettyunits_1.0.2 rprojroot_1.3-2   withr_2.1.2      
 [6] digest_0.6.20     crayon_1.3.4      assertthat_0.2.1  R6_2.4.0          backports_1.1.4  
[11] magrittr_1.5      rlang_0.4.0       cli_1.1.0         remotes_2.1.0     testthat_2.2.1   
[16] callr_3.3.1       desc_1.2.0        tools_3.6.0       glue_1.3.1        compiler_3.6.0   
[21] pkgload_1.0.2     processx_3.4.1    pkgbuild_1.0.3    sessioninfo_1.1.1 memoise_1.1.0    

Also tried to remove the ~/.cache folder completely and re-create it, but this causes yet another R crash.

# after removing ~/.cache
> BiocFileCache::BiocFileCache()
/home/ramezqui/.cache/BiocFileCache
  does not exist, create directory? (yes/no): yes

 *** caught illegal operation ***
address 0x7f5f431c6fd3, cause 'illegal operand'

stale cached file not updated when calling `bfcupdate` with `rid`

Hi Lori, @lshep

Sorry, I don't have a great way to reproduce this error without the file.
I am using the following code to update my resource (attached below):

url <- "https://bioconductor.org/checkResults/3.16/bioc-LATEST/BUILD_STATUS_DB.txt"
bfc <- BiocFileCache()
bquery <- bfcquery(bfc, url, "rname", exact = TRUE)
bfcupdate(bfc, bquery[["rid"]], url)

It looks like the method

setMethod("bfcupdate", "BiocFileCache",
    function(x, rids, rname=NULL, rpath=NULL, fpath=NULL,
             proxy="", config=list(), ask=TRUE, ...)

has a different signature from the generic:

#' @export
setGeneric("bfcupdate",
    function(x, rids, value, ...) standardGeneric("bfcupdate"),
    signature = "x"
)

which means that the url input (with no argument name) does not map to the rname in the method as otherwise expected.
Perhaps it is sufficient to supply only the bfc and the rid to the method and have the function figure out the rpath and fpath from e.g., a bfcquery as above.

Otherwise, the code below will not be run because these are NULL by default in the method signature:

if (!is.null(rpath)) {

if (!is.null(fpath)) {

Attached file:

b1804f2a6090_BUILD_STATUS_DB.txt file

Martin @mtmorgan
Any recommendations on mismatched signatures between the generic and methods?
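
The mismatch can be reproduced with a toy S4 generic that has the same shape as the one quoted above. This is only a sketch: `toyupdate` is a hypothetical name, and the method simply returns its `rname` argument so the effect is visible.

```r
library(methods)

# Toy generic mirroring the shape of the bfcupdate generic quoted above.
setGeneric("toyupdate",
    function(x, rids, value, ...) standardGeneric("toyupdate"),
    signature = "x"
)

# Toy method whose third formal is 'rname' rather than 'value'.
setMethod("toyupdate", "numeric",
    function(x, rids, rname = NULL, ...) rname
)

# Third argument supplied by position: it is matched to the generic's 'value'.
positional <- toyupdate(1, "BFC1", "new-name")

# Third argument supplied by name: it travels through '...' to the method.
named <- toyupdate(1, "BFC1", rname = "new-name")
```

In this toy, the unnamed third argument should not reach `rname`, whereas the named one should, which suggests that naming the arguments in the bfcupdate call is a workaround worth trying.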

use `file.copy` instead of `file.rename`

Hi Lori, @lshep

I got this error message when copying a file from my /home/$USER/Downloads to
another partition on our server.

In file.rename(fpath, rpath) :
  cannot rename file '/home/mramos/Downloads/hnsc_tcga.tar.gz' to '/data/16tb/cbio/2e7912cc3e62_hnsc_tcga.tar.gz', reason 'Invalid cross-device link'

This has been seen before in vtest issue 14
and it looks like the solution is to use file.copy.

Could this be changed in BiocFileCache?
Thanks!

Regards,
Marcel
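
A minimal sketch of the suggested change, in base R only: try `file.rename()` first and fall back to `file.copy()` followed by removal of the source. `move_file()` is a hypothetical helper and not the package's implementation.

```r
# Hypothetical helper: try a rename first, then fall back to copy + delete,
# which also works when source and destination are on different devices.
move_file <- function(from, to) {
    renamed <- suppressWarnings(file.rename(from, to))
    if (!renamed) {
        copied <- file.copy(from, to, overwrite = TRUE)
        if (!copied) {
            stop("could not copy '", from, "' to '", to, "'")
        }
        unlink(from)
    }
    invisible(to)
}

# Usage with temporary files.
src <- tempfile(fileext = ".txt")
writeLines("example", src)
dest <- tempfile(fileext = ".txt")
move_file(src, dest)
```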

Allow "pluggable" download option and update checking for remote resources

For some applications, the URL is not all that is needed to do a download of a remote resource. For example, one might need to do some authentication first, supply a token as a parameter, or even have a tool that does the download outside of R. Again, my concrete use case is the need to supply a token with the download as an http header and to potentially use an external tool (gdc downloader) to do the downloads.

I can certainly do all the work outside of biocfilecache and then add resources as local resources, so this isn't a high priority, but I thought I would bring it up to see what you thought.
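
One way to picture the "pluggable" idea is a function argument that performs the download. The sketch below is base R only; `fetch_with()` and `token_downloader()` are hypothetical names, and nothing here is an existing BiocFileCache interface.

```r
# Hypothetical interface: 'downloader' is any function(url, destfile) that
# writes the resource to 'destfile'; it is not a BiocFileCache argument.
fetch_with <- function(url, destfile, downloader = utils::download.file) {
    downloader(url, destfile)
    if (!file.exists(destfile)) {
        stop("downloader did not create '", destfile, "'")
    }
    destfile
}

# Stand-in for a downloader that would add an auth token or call an
# external tool; here it only writes a placeholder so the sketch runs offline.
token_downloader <- function(url, destfile) {
    writeLines(paste("fetched", url), destfile)
    invisible(0L)
}

local_copy <- fetch_with(
    "https://example.org/resource",
    tempfile(),
    downloader = token_downloader
)
```

The resulting file could then be added to the cache as a local resource, as described above.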

set expiration date for web resource

Hi Lori, @lshep

Is it possible to set an expiration date for a URL resource within the bfcadd function or any other function?
I tried this and it doesn't work.

library(BiocFileCache)
#> Loading required package: dbplyr
bfc <- BiocFileCache()
tfile <- "https://bioconductor.org/checkResults/3.15/bioc-LATEST/report.tgz"
treport <- bfcadd(
    bfc, rname = tfile, fpath = tfile, expires = Sys.Date() + 2
)
res <- bfcquery(bfc, tfile, exact = TRUE)
res
#>   id  rid                                                             rname
#> 1  9 BFC9 https://bioconductor.org/checkResults/3.15/bioc-LATEST/report.tgz
#>           create_time         access_time                    rpath rtype
#> 1 2022-01-11 23:04:43 2022-01-11 23:04:48 179285b1b6faf_report.tgz   web
#>                                                               fpath
#> 1 https://bioconductor.org/checkResults/3.15/bioc-LATEST/report.tgz
#>    last_modified_time                  etag expires
#> 1 2022-01-10 16:08:46 77b841f-5d53c8d250fe7      NA

Created on 2022-01-11 by the reprex package (v2.0.1.9000)

Thanks!
-Marcel

cleanbfc uses wrong format strings

library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
stuff <- bfcrpath(bfc, "https://google.com")
cleanbfc(bfc)
## Error in sprintf("Remove id %d %d", sQuote(rids), ifelse(cached, txt0,  :
##   invalid format '%d'; use format %s for character objects
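
The error can be reproduced in base R without the package: resource ids are character strings, so `%d` is rejected and `%s` is needed. This is a small sketch of the format-string problem only, not the actual message built by cleanbfc.

```r
rids <- c("BFC1", "BFC2")   # resource ids are character strings

# '%d' expects an integer, so a character vector triggers an error.
bad <- tryCatch(
    sprintf("Remove id %d", rids),
    error = function(e) conditionMessage(e)
)

# '%s' accepts character input.
good <- sprintf("Remove id %s", rids)
```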
Session information
R Under development (unstable) (2020-11-09 r79409)
Platform: x86_64-apple-darwin19.5.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /Users/luna/Software/R/trunk/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/trunk/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] BiocFileCache_1.99.6 dbplyr_2.1.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6       magrittr_2.0.1   rappdirs_0.3.3   tidyselect_1.1.0
 [5] bit_4.0.4        R6_2.5.0         rlang_0.4.10     fastmap_1.1.0
 [9] fansi_0.4.2      blob_1.2.1       httr_1.4.2       dplyr_1.0.5
[13] tools_4.1.0      utf8_1.2.1       DBI_1.1.1        withr_2.4.2
[17] ellipsis_0.3.1   bit64_4.0.5      assertthat_0.2.1 tibble_3.1.1
[21] lifecycle_1.0.0  crayon_1.4.1     purrr_0.3.4      vctrs_0.3.7
[25] curl_4.3         memoise_2.0.0    glue_1.4.2       cachem_1.0.4
[29] RSQLite_2.2.7    compiler_4.1.0   pillar_1.6.0     filelock_1.0.2
[33] generics_0.1.0   pkgconfig_2.0.3

overriding rpath filename when downloading

Web resources are currently downloaded to rpath, which is constructed by combining a unique id (if requested) and the file name extracted from the URL. However, some URLs don't include a filename, e.g.

src = 'https://pubchem.ncbi.nlm.nih.gov/sdq/sdqagent.cgi?infmt=json&outfmt=csv&query={%22download%22:%22*%22,%22collection%22:%22pathway%22,%22order%22:[%22relevancescore,desc%22],%22start%22:1,%22limit%22:10000000,%22downloadfilename%22:%22PubChem_pathway_text_Reactome%22,%22where%22:{%22ands%22:[{%22*%22:%22Reactome%22},{%22source%22:%22Reactome%22}]}}'

In this case the URL contains JSON, so I think the download fails because the filename generated for rpath isn't valid. However, any URL that doesn't have a filename at the end but returns a file could end up with an unwieldy filename in the cache folder.

I tried to overcome this using bfcupdate to change rpath before downloading, but it fails because bfcupdate changes the rtype to "local".

One option would be to add an argument to bfcadd that lets the user override the default filename for rpath, e.g. rpath_filename = "new_filename.xyz", and construct rpath from that instead of trying to extract it from the URL.

Alternatively, the intended filename could be extracted from the httr::GET response, if there is one.

Is there a workaround for this that doesn't need an update to BiocFileCache?
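
As a rough illustration of constructing a tidier filename, the base R sketch below drops the query string and replaces unusual characters. `safe_cache_name()` is a hypothetical helper, not a BiocFileCache function, and the example URLs are placeholders.

```r
# Hypothetical helper: derive a short, file-system-friendly name from a URL.
safe_cache_name <- function(url, fallback = "download", max_chars = 64L) {
    path <- sub("[?#].*$", "", url)               # drop query string and fragment
    name <- basename(path)                        # last path component
    name <- gsub("[^A-Za-z0-9._-]", "_", name)    # replace unusual characters
    if (!nzchar(name)) {
        name <- fallback
    }
    substr(name, 1L, max_chars)
}

name_csv <- safe_cache_name("https://example.org/data/file.csv?format=tab")
name_cgi <- safe_cache_name("https://example.org/sdq/agent.cgi?query={%22a%22:1}")
```

A file downloaded manually under such a name could then be added to the cache as a local resource, at the cost of losing the automatic update checks that web resources get.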

Disable progress bar in bfcrpath

As requested by @lshep: currently, bfcrpath prints out a progress bar, which is nice in interactive sessions but annoying when compiling Rmarkdown documents. Ideally this could be turned off - or even better, diverted to the "message" stream so that knitr automatically knows that it is not the output of a function and ignores it.

file comparison only on Date will miss updates on edge case of same day updates

In the rare case that a resource is updated more than once in a day, the cache will not be updated:

Browse[3]> as.Date(web_time, optional=TRUE)
[1] "2019-04-15"
Browse[3]> as.Date(file_time, optional=TRUE)
[1] "2019-04-15"

But

Browse[3]> file_time
                 BFC1 
"2019-04-15 13:21:17" 
Browse[3]> web_time
[1] "2019-04-15 13:51:17"

Can comparison include hour/min/sec?
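
The difference between the two comparisons can be shown with the timestamps above in base R. The UTC time zone is an assumption made only to keep the sketch deterministic.

```r
file_time <- as.POSIXct("2019-04-15 13:21:17", tz = "UTC")
web_time  <- as.POSIXct("2019-04-15 13:51:17", tz = "UTC")

# Date-only comparison: both fall on the same day, so no update is detected.
update_by_date <- as.Date(web_time) > as.Date(file_time)

# Full timestamp comparison: the 30-minute difference is detected.
update_by_time <- web_time > file_time
```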

Allow user-supplied rid?

It would be useful to allow the user to supply the rid. In the case of files coming from public repositories, uuids and accessions are the most useful and meaningful identifiers. This information can be placed in rname, but that is mutable, allowing a "user" to mess with the primary key established by the "developer".

My concrete use case is to use the GDC file uuid as the rid.

tools::R_user_dir vs rappdirs::user_cache_dir

Hi Lori, I was wondering whether it would make sense to use tools::R_user_dir(which = "cache") (introduced in R-4.0.0) for BiocFileCache's default cache directory to strip off the rappdirs dependency? Just a thought.
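
A sketch of what that could look like, with a fallback for R versions before 4.0.0. `default_cache_dir()` is a hypothetical helper and the fallback path is an assumption, not the location rappdirs would choose.

```r
# Hypothetical helper: pick a default cache directory without rappdirs.
default_cache_dir <- function(pkg = "BiocFileCache") {
    if (getRversion() >= "4.0.0") {
        tools::R_user_dir(pkg, which = "cache")
    } else {
        # Assumed fallback for older R versions.
        file.path(path.expand("~"), ".cache", pkg)
    }
}

cache_dir <- default_cache_dir()
```

Note that switching the default location would leave existing caches in the old directory, so some migration step would presumably be needed.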

bfcrpath seems to dislike URLs with odd characters

I was trying to do:

library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
fname <- bfcrpath(bfc, "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE81682&format=file&file=GSE81682%5FHTSeq%5Fcounts%2Etxt%2Egz")

... which resulted in the error:

Error in vapply(rnames, function(bfc, rname) { : values must be length 1,
 but FUN(X[[1]]) result is length 0

... though oddly enough, the file did get added to the cache, after inspection of bfcinfo(bfc).

Session information:

R version 3.5.0 Patched (2018-04-30 r74681)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /home/cri.camres.org/lun01/Software/R/R-3-5-branch-release/lib/libRblas.so
LAPACK: /home/cri.camres.org/lun01/Software/R/R-3-5-branch-release/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2.2      BiocFileCache_1.4.0 dbplyr_1.2.1       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17     bindr_0.1.1      magrittr_1.5     rappdirs_0.3.1  
 [5] tidyselect_0.2.4 bit_1.1-14       R6_2.2.2         rlang_0.2.1     
 [9] blob_1.1.1       httr_1.3.1       dplyr_0.7.6      utf8_1.1.4      
[13] cli_1.0.0        DBI_1.0.0        bit64_0.9-7      assertthat_0.2.0
[17] digest_0.6.15    tibble_1.4.2     crayon_1.3.4     purrr_0.2.5     
[21] curl_3.2         memoise_1.1.0    glue_1.2.0       RSQLite_2.1.1   
[25] compiler_3.5.0   pillar_1.2.3     pkgconfig_2.0.1 

manual build of vignette fails

Using render("BiocFileCache.Rmd", BiocStyle::html_document())

label: unnamed-chunk-25
delete 2 entries? (yes/no): yes
delete 2 files? (yes/no): yes
Quitting from lines 480-497 (BiocFileCache.Rmd) 
Error in bfcremove(bfc, rmMe) : all(rids %in% bfcrid(x)) is not TRUE
Selection: 0

I think the interruption to ask for permission to delete should be avoided.

In the following BiocFileCache version is 2.3.4 because I bumped it for my fork/branch.
But there have been no code changes.

> sessionInfo()
R Under development (unstable) (2021-11-10 r81171)
Platform: aarch64-apple-darwin20.6.0 (64-bit)
Running under: macOS Monterey 12.1

Matrix products: default
BLAS:   /Users/vincentcarey/R-dev-dist/lib/R/lib/libRblas.dylib
LAPACK: /Users/vincentcarey/R-dev-dist/lib/R/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.7         BiocFileCache_2.3.4 dbplyr_2.1.1       
[4] BiocStyle_2.23.1    rmarkdown_2.11     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7          bslib_0.3.1         compiler_4.2.0     
 [4] pillar_1.6.4        BiocManager_1.30.16 jquerylib_0.1.4    
 [7] tools_4.2.0         bit_4.0.4           digest_0.6.29      
[10] memoise_2.0.1       RSQLite_2.2.9       jsonlite_1.7.2     
[13] evaluate_0.14       lifecycle_1.0.1     tibble_3.1.6       
[16] pkgconfig_2.0.3     rlang_0.4.12        cli_3.1.0          
[19] DBI_1.1.2           filelock_1.0.2      curl_4.3.2         
[22] yaml_2.2.1          xfun_0.29           fastmap_1.1.0      
[25] withr_2.4.3         httr_1.4.2          stringr_1.4.0      
[28] knitr_1.37          rappdirs_0.3.3      generics_0.1.1     
[31] vctrs_0.3.8         sass_0.4.0          bit64_4.0.5        
[34] tidyselect_1.1.1    glue_1.6.0          R6_2.5.1           
[37] fansi_0.5.0         bookdown_0.24       startup_0.16.0     
[40] blob_1.2.2          purrr_0.3.4         magrittr_2.0.1     
[43] htmltools_0.5.2     ellipsis_0.3.2      assertthat_0.2.1   
[46] utf8_1.2.2          stringi_1.7.6       cachem_1.0.6       
[49] crayon_1.4.2        Cairo_1.5-12.2  

Default path on Windows not usable

Using BiocFileCache() causes following error on Windows 10 64 bit 16299.192:

Error in rsqlite_connect(dbname, loadable.extensions, flags, vfs) : 
  Could not connect to database:
unable to open database file
In addition: Warning message:
In dir.create(cache) :
  cannot create dir 'C:\Users\username\AppData\Local\BiocFileCache\BiocFileCache\Cache', reason 'No such file or directory'

I think this is due to not having the correct privileges, but I didn't test it further.

BiocFileCache("~/.BiocFileCache") works as intended, creating and using the directory C:\Users\username\Documents\.BiocFileCache. As far as I understand, this behavior differs from the one described in section 1.2 of the vignette: the default location ~/.BiocFileCache is not used if no path is given.

I didn't test it under any Linux, but I think this might be a Windows only issue.
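
For what it is worth, `dir.create()` also fails with a warning when the parent directories of a nested path are missing, unless `recursive = TRUE` is given. Whether that is the cause here is not confirmed; the base R sketch below only demonstrates that behavior on a temporary path.

```r
nested <- file.path(tempfile(), "BiocFileCache", "Cache")

# Without recursive = TRUE, dir.create() fails (with a warning) because the
# parent directories do not exist yet.
created_flat <- suppressWarnings(dir.create(nested))

# With recursive = TRUE the missing parents are created as well.
created_nested <- dir.create(nested, recursive = TRUE)
```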

Wipe cache contents without destroying it

I'd like to wipe the cache contents without destroying it, so that bfc is still valid for further use.

library(BiocFileCache)
bfc <- BiocFileCache(tempfile(), ask=FALSE)
bfcrpath(bfc, "https://google.com")
bfccount(bfc)
## 1

cleanbfc(bfc, days=0, ask=FALSE)
bfccount(bfc)
## still 1

As you can see, this does nothing, even with days= set to zero, i.e., immediate expiry.

Session information
R Under development (unstable) (2020-03-11 r77927)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Users/luna/Software/R/trunk/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/trunk/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] SummarizedExperiment_1.17.5 DelayedArray_0.13.12
 [3] matrixStats_0.56.0          Biobase_2.47.3
 [5] GenomicRanges_1.39.3        GenomeInfoDb_1.23.17
 [7] IRanges_2.21.8              S4Vectors_0.25.15
 [9] BiocGenerics_0.33.3         BiocFileCache_1.11.5
[11] dbplyr_1.4.2                dbcommons_0.0.2
[13] testthat_2.3.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6           XVector_0.27.2         compiler_4.0.0
 [4] pillar_1.4.3           zlibbioc_1.33.1        bitops_1.0-6
 [7] tools_4.0.0            digest_0.6.25          bit_1.1-15.2
[10] lattice_0.20-41        RSQLite_2.2.0          memoise_1.1.0
[13] lifecycle_0.2.0        tibble_3.0.0           pkgconfig_2.0.3
[16] rlang_0.4.5            Matrix_1.2-18          DBI_1.1.0
[19] cli_2.0.2              curl_4.3               GenomeInfoDbData_1.2.3
[22] dplyr_0.8.5            httr_1.4.1             vctrs_0.2.4
[25] rappdirs_0.3.1         grid_4.0.0             bit64_0.9-7
[28] tidyselect_1.0.0       glue_1.4.0             R6_2.4.1
[31] fansi_0.4.1            purrr_0.3.3            blob_1.2.1
[34] magrittr_1.5           ellipsis_0.3.0         assertthat_0.2.1
[37] RCurl_1.98-1.1         crayon_1.3.4

bfcrpath is not thread-safe

library(BiocFileCache)
bfc <- BiocFileCache("test", ask=FALSE)
    
library(BiocParallel)
bplapply(1:10, function(x) bfcrpath(bfc, "https://google.com"))
## adding rname 'https://google.com'
## adding rname 'https://google.com'
## adding rname 'https://google.com'
## Error: BiocParallel errors
##   element index: 2, 5, 7, 10
##   first error: not all 'rnames' found or unique.

bfcrpath(bfc, "https://google.com")
## Error in bfcrpath(bfc, "https://google.com") :
##   not all 'rnames' found or unique.
## In addition: Warning message:
## In FUN(X[[i]], ...) : 'rnames' exact pattern
##     'https://google.com'
##   is not unique; use 'bfcquery()' to see matches.

Inspection of bfcrpath indicates that there is no locking between the initial check for the cached resource and its acquisition. I would suggest applying an exclusive lock at the start of bfcrpath, releasing it on.exit(), using the same lock file in .sql_connect_RW. Note that this may require setting and unsetting of a global option within each thread to indicate that the thread has already acquired a lock, as subsequent function calls in the same thread may attempt to re-acquire the lock.

lock.env <- new.env()
lock.env$status <- NA

.lock2 <- function(dbfile, exclusive) {
    if (is.na(lock.env$status)) {
        lock.env$status <- exclusive
        lock(.sql_lock_path(dbfile), exclusive = exclusive)
    } else if (lock.env$status || !exclusive) {
        # Exclusive lock held by a caller is compatible
        # with a subsequent request for a shared lock;
        # we're not escalating privileges here.
        NULL
    } else {
        stop("requested an exclusive lock when caller only holds a shared lock")
    }
}

.unlock2 <- function(loc) {
    if (!is.null(loc)) {
        lock.env$status <- NA
        unlock(loc)
    }
}

Replacing lock and unlock in sql.R with the equivalents above should allow you to slap:

locfile <- .lock2(.sql_dbfile(x), exclusive=TRUE)
on.exit(.unlock2(locfile))

somewhere at the top of bfcrpath(), which then allows me to do:

bfc <- BiocFileCache("test", ask=FALSE)

library(BiocParallel)
bplapply(1:10, function(x) bfcrpath(bfc, "https://google.com"))
## adding rname 'https://google.com'
## [[1]]
##                           BFC1
## "test/92485ee57de8_google.com"
## 
## [[2]]
##                           BFC1
## "test/92485ee57de8_google.com"
##  etc.

BiocFileCache:::.sql_db_execute(): Partial argument match of 'param' to 'params'

Issue

> options(warnPartialMatchArgs=TRUE, warn=2)
> eh <- ExperimentHub::ExperimentHub()
~/.cache/ExperimentHub
  does not exist, create directory? (yes/no): yes
Error in .local(conn, statement, ...) : 
  (converted from warning) partial argument match of 'param' to 'params'
> traceback()
18: doWithOneRestart(return(expr), restart)
17: withOneRestart(expr, restarts[[1L]])
16: withRestarts({
        .Internal(.signalCondition(simpleWarning(msg, call), msg, 
            call))
        .Internal(.dfltWarn(msg, call))
    }, muffleWarning = function() NULL)
15: .signalSimpleWarning("partial argument match of 'param' to 'params'", 
        base::quote(.local(conn, statement, ...)))
14: dbSendQuery(conn, statement, ...)
13: dbSendQuery(conn, statement, ...)
12: dbSendStatement(conn, statement, ...)
11: dbSendStatement(conn, statement, ...)
10: dbExecute(con, sql, param = param)
9: dbExecute(con, sql, param = param)
8: .sql_db_execute(bfc, sql[[2]], con = con)
7: tryCatchList(expr, classes, parentenv, handlers)
6: tryCatch({
       con <- .sql_connect_RW(.sql_dbfile(bfc))
       dbExecute(con, sql[[1]])
       .sql_db_execute(bfc, sql[[2]], con = con)
       package_version <- as.character(packageVersion("BiocFileCache"))
       .sql_db_execute(bfc, sql[[3]], key = c("schema_version", 
           "package_version"), value = c(.CURRENT_SCHEMA_VERSION, 
           package_version), con = con)
       .sql_db_execute(bfc, sql[[4]], con = con)
       dbExecute(con, sql[[5]])
   }, finally = {
       dbDisconnect(con)
   })
5: .sql_create_db(bfc)
4: BiocFileCache(cache = cache, ask = ask)
3: .create_cache(.class, url, cache, proxy, localHub, ask)
2: .Hub("ExperimentHub", hub, cache, proxy, localHub, ask, ...)
1: ExperimentHub::ExperimentHub()

Solution

In the .sql_db_execute() function, pass params = param:

BiocFileCache/R/sql.R

Lines 106 to 119 in 0685885

.sql_db_execute <-
    function(bfc, sql, ..., con)
{
    param <- data.frame(..., stringsAsFactors = FALSE)
    if (nrow(param) == 0L)
        param <- NULL
    if (missing(con)) {
        info <- .sql_connect_RW(.sql_dbfile(bfc))
        con <- info$con
        on.exit(.sql_disconnect(info))
    }
    dbExecute(con, sql, param = param)
}
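
The warning itself can be reproduced in base R with a toy function that has a `params` formal. `run_statement()` is a hypothetical stand-in for the DBI method and does not touch a database.

```r
old_opts <- options(warnPartialMatchArgs = TRUE)

# Toy function with a 'params' formal, standing in for the DBI method.
run_statement <- function(statement, params = NULL) params

# 'param' is completed to 'params', which emits a warning when
# warnPartialMatchArgs is set.
partial <- tryCatch(
    run_statement("SELECT 1", param = list(1)),
    warning = function(w) conditionMessage(w)
)

# Spelling the argument name out in full avoids the warning.
full <- run_statement("SELECT 1", params = list(1))

options(old_opts)
```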

Deprecated dplyr functions cause check warnings

I maintain the Depmap package which depends downstream on BiocFileCache. I have been working on updating the Depmap package to be in line with the new dplyr 1.0 release changes. Despite making all the necessary code changes, a check warning appeared (see code below) which I have traced back to the sql.R file in this package, which imports select_ and filter_ functions which will soon be deprecated with the dplyr 1.0 release, and thus responsible for the warning.

* checking whether package 'depmap' can be installed ... WARNING
Found the following significant warnings:
  Warning: `select_()` is deprecated as of dplyr 0.7.0.
  Warning: `filter_()` is deprecated as of dplyr 0.7.0.

Do you plan to update these functions with their non-deprecated dplyr 1.0 versions?

#' @importFrom dplyr %>% tbl select_ collect summarize filter_ n left_join

option for not having unique identifier

Example on support site
https://support.bioconductor.org/p/9144615/

In certain situations, adding a unique identifier is not desirable. The original intent was to allow multiple versions of the same file to be cached. Create a user option to disable it.

@mtmorgan, any thoughts on whether the default should be to add a unique identifier or not? To stay consistent with current behavior, the default would be to add one.

bfcquery[['rpath']] is not the same as bfcrpath()

Hi Lori, @lshep

Perhaps it was designed this way but the code in AnnotationForge was assuming that bfcquery(...)[["rpath"]] is identical to bfcrpath(...), i.e. the file path to the resource.
It currently returns the basename() of the resource; perhaps this was done to shorten the display in the return of bfcquery?

I believe rpath to the user should be consistent throughout whether one obtains it from bfcquery or bfcrpath.

Thanks!

suppressPackageStartupMessages({
    library(BiocFileCache)
})
fullUri <- "http://www.uniprot.org/uniprot/?query=P13368+or+Q6GZX4&format=tab"
bfc <- BiocFileCache()
## First time around works fine downloads file
uquery1 <- bfcquery(bfc, query = fullUri, exact = TRUE)[["rpath"]]
if (!length(uquery1))
    uquery1 <- BiocFileCache::bfcadd(bfc, fullUri, fullUri)
uquery1
#>                                                                                            BFC35 
#> "/home/user/.cache/R/BiocFileCache/5b2435c9426a_%3Fquery%3DP13368%2Bor%2BQ6GZX4%26format%3Dtab"

## Second time around the query returns the basename of uquery1
(uquery2 <- bfcquery(bfc, query = fullUri, exact = TRUE)[["rpath"]])
#> [1] "5b2435c9426a_%3Fquery%3DP13368%2Bor%2BQ6GZX4%26format%3Dtab"

identical(uquery1, uquery2)
#> [1] FALSE
identical(basename(uquery1), uquery2)
#> [1] TRUE

identical(bfcrpath(bfc, fullUri, exact = TRUE), uquery1)
#> [1] TRUE
bfcremove(bfc, bfcquery(bfc, fullUri, exact = TRUE)[["rid"]])

Created on 2022-03-29 by the reprex package (v2.0.1)

Update remote file only when changed

I am struggling a bit with using BiocFileCache in a package context to cache package-related data. What I want is: For the examples in my R code, I need to first download a small example object that should be cached so it can be retrieved quickly when building the package etc. I want to program it in a way that every modification of the file on the remote site automatically triggers a re-download when using BiocFileCache. I used the template here: https://bioconductor.org/packages/release/bioc/vignettes/BiocFileCache/inst/doc/BiocFileCache.html#cache-to-manage-package-data

Questions:

  1. bfcneedsupdate(bfc, rid) always returns TRUE and re-downloads the file when executed. Shouldn't it only return TRUE when an update is needed?
  2. How can I make it re-download automatically when the remote file has been modified? I thought this should be the default behavior, but I can't judge whether this works out of the box given my problem in (1).
  3. Do I have to adjust anything in the given template? What, for example, is geneFileV2 meant to represent? Is it just an arbitrary name, as used in the database, for the file that should be cached?

Thank you,
Christian

Reducing the number of package dependencies

Hi Bioconductor team,

Is it possible to rework BiocFileCache a bit to not depend on quite so many tidyverse packages?
This is looking pretty heavy at the moment:

Depends | R (>= 3.4.0), dbplyr (>= 1.0.0)
Imports | methods, stats, utils, dplyr, RSQLite, DBI, filelock, curl, httr
AcidDevTools::packageDependencies("BiocFileCache")
## [1] "dbplyr"     "methods"    "stats"      "utils"      "dplyr"
## [6] "RSQLite"    "DBI"        "filelock"   "curl"       "httr"
## [11] "blob"       "cli"        "glue"       "lifecycle"  "magrittr"
## [16] "pillar"     "purrr"      "R6"         "rlang"      "tibble"
## [21] "tidyr"      "tidyselect" "vctrs"      "withr"      "generics"
## [26] "jsonlite"   "mime"       "openssl"    "bit64"      "memoise"
## [31] "pkgconfig"  "plogr"      "cpp11"      "bit"        "cachem"
## [36] "tools"      "askpass"    "fansi"      "utf8"       "stringr"
## [41] "graphics"   "grDevices"  "sys"        "fastmap"    "stringi"

Happy to help work on this!

Best,
Mike

web file check

Users may not want to use all three pieces of cache information that we check (etag, last-modified, expires). It has been reported that objects can expire as soon as they are accessed. Should there be an option that controls which of these three are used?
