bioconductor / biocparallel

Bioconductor facilities for parallel evaluation

Home Page: https://bioconductor.org/packages/BiocParallel

R 96.70% Shell 0.44% C++ 2.63% M4 0.22%

core-package bioconductor-package

biocparallel's Introduction

BiocParallel

Bioconductor facilities for parallel evaluation (experimental)

Possible TODO

map/reduce-like function
bpforeach?
Abstract scheduler
lazy DoparParam
SnowParam support for setSeed, recursive, cleanup
subset SnowParam

DONE

encapsulate arguments as ParallelParam()
Standardize signatures
Make functions generics
parLapply-like function
Short vignette
elaborate SnowParam for SnowSocketParam, SnowForkParam, SnowMpiParam, ...
MulticoreParam on Windows

github notes

commit one-liners with names

git log --pretty=format:"- %h %an: %s"

TO FIX

DoparParam does not pass foreach args (specifically access to .options.nws for chunking)

biocparallel's Issues

Error using BiocParallel on Travis

I recently experienced the following error running parallel code with BiocParallel on Travis:

Quitting from lines 189-194 (benchmarking.Rmd) 
Error: processing vignette 'benchmarking.Rmd' failed with diagnostics:
setting worker timeout:
  error reading from connection
Execution halted

The error, reported here, occurred when running MSnbase::quantify, which uses BiocParallel. Registering a SerialParam instead of the default MulticoreParam fixes the issue, and the package now builds fine on Travis. I raised the issue with Travis directly, and it was suggested that there might be something wrong in BiocParallel.

Any idea?

Documentation: wrong default number of cores

The Introduction vignette states that the default number of cores is parallel::detectCores(), but it is actually parallel::detectCores() - 2. I tried to edit the vignette and make a PR, but was getting TeX errors.
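The default described above can be written out as a short sketch. It uses only the base parallel package; default_workers is a hypothetical helper name, not a BiocParallel function, and both the floor of one worker and the guard against NA are assumptions added for illustration.

```r
library(parallel)

# Hypothetical helper: leave two cores free, but never go below one worker.
default_workers <- function(cores = detectCores()) {
    if (is.na(cores)) return(1L)        # detectCores() may return NA
    max(1L, as.integer(cores) - 2L)
}

default_workers(8L)   # 6
default_workers(2L)   # 1
```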

More frequent progress bar updates

The progress bar updates once per task, but could be made to update once per element.

Migration to batchtools

Are there any plans to migrate away from BatchJobs to its successor, batchtools?

If not, would a contributed PR be accepted? For instance, one could write a BatchToolsParam class paralleling the BatchJobsParam one, and then eventually mark the BatchJobsParam as deprecated (e.g., throws a warning when using) once the BatchToolsParam has been verified to be stable across different platforms.

SnowParam with MPI

Could someone help with the following parallelisation issue? I would like to set up a SnowParam instance using the MPI back-end.

> SnowParam(4L, type = "MPI")
Error in mpi.comm.spawn(slave = mpitask, slavearg = args, nslaves = count,  : 
  Choose a positive number of slaves.

Enter a frame number, or 0 to exit   

1: SnowParam(4, type = "MPI")
2: .nullCluster(type)
3: makeCluster(0, type)
4: snow::makeMPIcluster(spec, ...)
5: mpi.comm.spawn(slave = mpitask, slavearg = args, nslaves = count, intercomm

Selection: 1
Called from: makeCluster(0L, type)
Browse[1]> ls()
[1] "args"         "catch.errors" "type"         "workers"     
Browse[1]> args
$spec
[1] 4

$type
[1] "MPI"

Browse[1]> 

Enter a frame number, or 0 to exit   

1: SnowParam(4, type = "MPI")
2: .nullCluster(type)
3: makeCluster(0, type)
4: snow::makeMPIcluster(spec, ...)
5: mpi.comm.spawn(slave = mpitask, slavearg = args, nslaves = count, intercomm

Selection: 0

The number of workers in SnowParam is parsed properly, but ends up being 0 in makeCluster. Running makeCluster(4L, "MPI") spawns the slaves successfully.

Thank you in advance.

Laurent

SnowParam: cannot create 126 workers; 125 connections available in this session

There seems to be a limit on the number of available workers using SnowParam. I am demonstrating my issue with MPI, but it holds for SOCKET too.

Using snow:

> library(snow)
> cl = makeMPIcluster(126L)
    126 slaves are spawned successfully. 0 failed.
> head(res <- parSapply(cl, 1:126, get("+"), 1))
[1] 2 3 4 5 6 7
> identical(res, 1:126+1)
[1] TRUE

Using BiocParallel:

> library(BiocParallel)
> p = SnowParam(126L, tpye = "MPI")
> head(res2 <- bpvec(1:126, get("+"), 1, BPPARAM=p))
Error in .local(x, ...) : 
  cannot create 126 workers; 125 connections available in this session
> p = SnowParam(12L, tpye = "MPI")
> head(res2 <- bpvec(1:126, get("+"), 1, BPPARAM=p))
starting worker localhost:11155
starting worker localhost:11155
starting worker localhost:11155
starting worker localhost:11155
starting worker localhost:11155
starting worker localhost:11155
starting worker localhost:11155
starting worker localhost:11155
starting worker localhost:11155
starting worker localhost:11155
starting worker localhost:11155
starting worker localhost:11155
[1] 2 3 4 5 6 7
> identical(res, 1:126+1)
[1] TRUE

I need to run an optimisation over at least 256 nodes using MPI. With previous versions of BiocParallel, this error did not happen. Any idea?

Laurent

cc @lmsimp

R version 3.3.1 RC (2016-06-14 r70782)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.1 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocParallel_1.6.2 snow_0.4-1        

loaded via a namespace (and not attached):
[1] tools_3.3.1    parallel_3.3.1 Rmpi_0.6-6
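
The figure of 125 in the error message above is consistent with R's per-session connection limit, since each socket worker occupies one connection. A minimal sketch of that arithmetic follows, assuming a limit of 128 connections (the value compiled into R 3.x); whether BiocParallel computes the number in exactly this way is an assumption.

```r
# Assumption: the session allows 128 connections in total, three of which
# are always taken by stdin, stdout and stderr.
max_connections <- 128L
in_use <- nrow(showConnections(all = TRUE))
available <- max_connections - in_use
available   # at most 125
```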

BatchJobsParam should load BatchJobs configuration

When the BatchJobs package is loaded, it automatically loads the user's custom configuration from the BatchJobs config files. But if you just load BiocParallel and use a BatchJobsParam, this config is never loaded, unless you also explicitly do library(BatchJobs) in your own code. We should fix this.

bpslaveLoop uses both parallel & snow for communication

send/recvData() via parallel, but snow::closeNode(). Should only use one package (parallel?)

Does bpvec() perform the appropriate partitioning

Correct behavior for MulticoreParam(), but not for SnowParam() or SerialParam(). See https://github.com/Bioconductor/BiocParallel/blob/master/R/bpvec-methods.R#L17-L36
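
For reference, the partitioning that a vectorized apply is expected to perform can be sketched with base R only. vec_apply is a hypothetical stand-in evaluated serially, not the BiocParallel implementation: it cuts X into one contiguous chunk per worker, calls FUN once per chunk, and concatenates the results.

```r
library(parallel)

# Hypothetical stand-in for a bpvec()-like function, evaluated serially.
vec_apply <- function(X, FUN, ..., workers = 2L) {
    # one contiguous block of indices per worker
    idx <- splitIndices(length(X), workers)
    # FUN sees a whole chunk at a time, not single elements
    chunks <- lapply(idx, function(i) FUN(X[i], ...))
    # concatenate the per-chunk results in their original order
    do.call(c, chunks)
}

res <- vec_apply(1:10, sqrt, workers = 3L)
identical(res, sqrt(1:10))   # TRUE
```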

TIMEOUT during builds

20 Dec zin2 (devel linux builder) --

after (?) last worker start for test_bpiterate() and before first worker start in bplapply()

22 Dec zin2 -- example(bplapply)

> ## ten tasks (1:10) so ten calls to FUN default registered parallel
> ## back-end. Compare with bpvec.
> fun <- function(v) {
+     message("working") ## 10 tasks
+     sqrt(v)
+ }
> bplapply(1:10, fun) 
Killed

Clean Dec 26, 27. Feb 1.

Should be able to create SnowParam from running cluster?

If someone has already started a cluster, there should be a way to get it into a SnowParam for use with BiocParallel (possibly a subclass with bpstart/stop disabled, since there wouldn't necessarily be a way to restart it if it was stopped).

BiocParallel did not register default BiocParallelParams

I am getting a warning when loading BiocParallel. My R version is 3.4.0 and the package version is BiocParallel_1.8.2.

> library("BiocParallel")
'BiocParallel' did not register default BiocParallelParams:
  invalid class “MulticoreParam” object: 1: ‘cluster’, ‘.clusterargs’, ‘.uid’, ‘RNGseed’ must be length 1
invalid class “MulticoreParam” object: 2: ‘.clusterargs’, ‘.controlled’, ‘logdir’, ‘resultdir’ must be length 1
Warning message:
In is.na(x[[i]]) :
  is.na() applied to non-(list or vector) of type 'environment'

deprecate LastError?

Hi Michel and others,

I've been re-working the error handling and logging in BiocParallel. I'd like to deprecate a few pieces of older error code and wanted to get some feedback.

(1) Remove LastError infrastructure:

Currently BatchJobsParam is the only param that still uses LastError. I'd like to deprecate LastError, bplasterror() and bpresume() unless others find them useful. It looks like pure BatchJobs does not have this functionality so I'm wondering about motivation and utility. Were they added as more of an experiment and haven't proven useful enough to be added to BatchJobs?

In the current BiocParallel, .try() returns errors as conditions. This tames the output and allows traceback from each worker to be accessed with attr():

res <- bplapply(list(1, "2", 3), sqrt)
res
[[1]]
[1] 1

[[2]]
[1] "non-numeric argument to mathematical function"
traceback() available as 'attr(x, "traceback")'

[[3]]
[1] 1.732051

tail(attr(res[[2]], "traceback"))
[1] call <- sapply(sys.calls(), deparse)
[2] e <- structure(e, class = c("remote-error", "condition"),
[3] traceback = capture.output(traceback(call)))
[4] invokeRestart("abort", e)
[5] }, "non-numeric argument to mathematical function", quote(FUN(...)))
[6] 1: h(simpleError(msg, call))

Given this behavior, my thoughts were that LastError is no longer necessary.

As for bpresume(), partial results are now returned with the error messages, so successful computations are not lost. I think BatchJobs offers a resume-type mechanism through resetJobs? Knowing which results were successful and resubmitting unfinished jobs is useful in the scheduled cluster setting, but I'm not sure bpresume() saw much use interactively.

Opinions?

(2) catch.errors and stop.on.error fields not mutually exclusive:

There is a note in the BatchJobsParam code that says these flags are mutually exclusive. I'm not sure why they need to be. 'stop.on.error' seems more like a special case of 'catch.errors'. Ideally we'd have only one field for errors with multiple options, maybe something like,

errors = c("all", "none", "stop.on.error")

Is anyone opposed to consolidating error fields into a single character vs the two logicals?
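
A single character field of this kind could be handled with match.arg(). The sketch below is base R only; run_with_errors is a hypothetical helper, and the reading of the three values ("all" catches every error, "none" catches nothing, "stop.on.error" catches the first error and skips the rest) is an assumption, not the agreed semantics.

```r
# Hypothetical helper illustrating one 'errors' field instead of two logicals.
run_with_errors <- function(X, FUN,
                            errors = c("all", "none", "stop.on.error")) {
    errors <- match.arg(errors)
    results <- vector("list", length(X))
    for (i in seq_along(X)) {
        value <- switch(errors,
            # "none": no handler, errors propagate as usual
            none = FUN(X[[i]]),
            # "all" and "stop.on.error": return the condition as the result
            tryCatch(FUN(X[[i]]), error = function(e) e)
        )
        results[i] <- list(value)
        # "stop.on.error": keep partial results, skip remaining elements
        if (errors == "stop.on.error" && inherits(value, "error"))
            break
    }
    results
}

res <- run_with_errors(list(1, "2", 3), sqrt, errors = "all")
vapply(res, inherits, logical(1), what = "error")   # FALSE TRUE FALSE
```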

(3) remove cache.warnings variable from .try():

Prior to Bioconductor 3.1, .try() did not catch warnings. Maybe that was the intention for 'cache.warnings' but it wasn't fully implemented? I've added a warning handler to .try() that does catch and return the warnings.

Is it ok if I remove cache.warnings?

Thanks.
Valerie

bpslots to query for availability of nested parallel cores

From Michael Lang:

Consider the task to align N sequence files (each containing millions of
reads) to a reference genome. We want to perform this in R using
parallelization on a parallel backend with C CPU cores.

If N is large compared to the number of CPU cores C, efficient
parallelization can be achieved by distributing the N sequence files to
the C workers, which will each perform single-threaded alignments.

The use-case arises when N is smaller than C. I further assume that
multiple cores are available on a single physical compute node, which I
think is typical these days. Because the alignment algorithm is able to
run multiple parallel threads, efficient parallelization could be
achieved by engaging N workers, each using P parallel threads.
Optimally, P is such that P*N == C, so P could be close to C/N. In
practice, P is the (minimal) number of cores available on a physical
node. The use case could be thought of as "nested parallelization"
(across workers and across threads).

It is a very typical case at our institute and for our collaborators at
the University of Basel: N is often below 20 (the number of samples in a
sequencing experiment, e.g. 3 genotypes times 3 replicates), and C is
well over 100 (and can be as high as 8000 on the University cluster,
with many 64-core nodes). We think that splitting one sequence file into
smaller chunks (thereby increasing N) would not be a feasible solution
(slow IO performance).

I was hoping that there would be an abstraction allowing me to write
parallel code that would be independent of the parallel backend. I am
very excited about BiocParallel and bpworkers(), which is a great way
(for QuasR) to learn about the "first level" of the nested
parallelization, in a manner that is independent of the parallel
backend. What I am missing is a standardized way to also query
BiocParallel for the second level of the nested parallelization (the
number of parallel threads that can run on one worker), e.g. using a
function such as bpslots().

When using a BatchJobs backend, I imagine bpslots() would return a value
that is comparable to the value in the NSLOTS environment variable on a
SGE cluster node. The value would be set by the user when creating the
instance of BatchJobsParam(), similarly as it is now done with "workers":

prm <- BatchJobsParam(workers=n, slots=s, ...)

On other backends, some convention would have to be agreed on, e.g.
that bpslots() returns:
1L for a SerialParam backend
a user-set value (default: NA) for SnowParam, DoparParam and MulticoreParam backends

Currently, QuasR tries to guess bpworkers() and bpslots() by querying
the parallel::SOCKcluster object provided by the user, similar to the
following example (xenon2 and xenon3 are local machine names):

> library(parallel)
> cl <- makeCluster(rep(c("xenon2","xenon3"),each=3))
> cl
socket cluster with 6 nodes on hosts 'xenon2', 'xenon3'
> tnn <- table(unlist(clusterEvalQ(cl, Sys.info()['nodename'])))
> tnn
xenon2.fmi.ch xenon3.fmi.ch
            3             3
> length(tnn) # bpworkers()
[1] 2
> min(tnn) # bpslots()
[1] 3

This only works for this particular type of parallel backend. If we want
to support multiple parallel backends, we will have to write
backend-specific code into QuasR. Alternatively, if this is abstracted
by BiocParallel with bpslots(), QuasR could parallelize across
bpworkers() nodes, using up to bpslots() parallel threads on each of them.
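
The arithmetic behind the proposal (P close to C/N, capped by the cores of one physical node) can be sketched in base R. threads_per_worker and its argument names are hypothetical; this is not an existing BiocParallel function.

```r
# Hypothetical helper: threads per worker for N tasks on C cores,
# capped by the cores available on a single node.
threads_per_worker <- function(C, N, cores_per_node = C) {
    stopifnot(C >= 1, N >= 1)
    workers <- min(N, C)                  # never more workers than cores
    P <- max(1L, as.integer(floor(C / workers)))
    min(P, as.integer(cores_per_node))    # threads cannot span nodes
}

threads_per_worker(C = 128, N = 9, cores_per_node = 64)   # 14
threads_per_worker(C = 8, N = 20)                          # 1
```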

How to suppress slave spawning message

> p = SnowParam(6L, type = "MPI")
> suppressMessages(bpvec(1:10, get("+"), 1, BPPARAM=p))
    6 slaves are spawned successfully. 0 failed.
starting MPI worker

starting MPI worker

starting MPI worker

starting MPI worker

starting MPI worker

starting MPI worker
 [1]  2  3  4  5  6  7  8  9 10 11

This becomes annoying when many nodes are requested.

How can I suppress the calls to message in RMPInode.R and RSOCKnode.R?

~/BiocParallel/inst/snow$ tail RMPInode.R 
    if (! (snowlib %in% .libPaths()))
        .libPaths(c(snowlib, .libPaths()))
    library(methods) ## because Rscript as of R 2.7.0 doesn't load methods
    loadNamespace("Rmpi")
    loadNamespace("snow")

    #sinkWorkerOutput(outfile)
    message("starting MPI worker\n")
    BiocParallel::bprunMPIslave()
})

NFS? synchronization problems

We have pretty much given up trying to get BatchJobsParam working on our LSF cluster. We frequently encounter seemingly random failures at various stages, including just after submission (the transition to waiting) and when attempting to collect results. These failures result in empty "" error messages being reported as the "first error" for the operation. Another common failure is that jobs will be started by the scheduler before their .R file exists.

The file system is mounted via NFS, and we suspect that file operations are happening out of order. For example, when switching to the wait state, the system seems to think that there is nothing to wait for, when in fact no jobs have started yet. And, when all jobs are finished, it checks for results, but they have not yet appeared, so it fails.

This is mostly a BatchJobs issue but is made worse by the automation provided by BatchJobsParam. Is there any way to synchronize these operations? Or must we find a more reliable filesystem?

This problem is also probably responsible for the bpresume() issue.

Successful build depends on file sorting order / missing collate

While trying to include support for BatchJobs, I stumbled over this one:

mv R/MulticlassParam-class.R R/AAAMulticlassParam-class.R
R CMD build .

* checking for file ‘./DESCRIPTION’ ... OK
* preparing ‘BiocParallel’:
* checking DESCRIPTION meta-information ... OK
* installing the package to process help pages
      -----------------------------------
User lib: ~/.R/library
* installing *source* package ‘BiocParallel’ ...
** R
** inst
** preparing package for lazy loading
Error in getClass(what, where = where) : 
  “BiocParallelParam” is not a defined class
Error : unable to load R code in package ‘BiocParallel’
ERROR: lazy loading failed for package ‘BiocParallel’
* removing ‘/tmp/RtmpYZDq6D/Rinst93f193634b2/BiocParallel’
      -----------------------------------
ERROR: package installation failed

This was originally triggered after creating a file BatchJobsParam-class.R, and it took quite some time to track down. Renaming the file to something like "zzzBatchJobsParam-class.R" temporarily solved it. But this could also happen in the current revision in combination with unfortunate settings for LC_ALL/LC_COLLATE.

Suggested solution: adding a Collate field to DESCRIPTION should solve this. Alternatively, with a switch to roxygen2 for documentation, this could be solved using the "@include" tag.

Add function to check for non-local variable use

The 80/20 solution is probably ambitious enough

Support job names?

For the BatchJobs backend in particular, it would be nice if there was a way to specify a basename for the jobs that get submitted to the cluster. As it is, we just get "bpmapply-1", "bpmapply-2", and so on, which means that when I have multiple R scripts using BiocParallel to submit jobs to the cluster, it's hard to tell the jobs apart in the qstat output. Perhaps this basename should be an optional parameter to the BatchJobsParam constructor?

Implement finalizers for BPParam classes?

I just noticed that setRefClass can define a special "finalize" method to be called when the object is garbage-collected. It seems like it would be a good idea to define such methods for BPParam classes that would automatically call bpstop on the param object. This would ensure that, for example, SnowParam objects stop their cluster processes (eventually) once there are no more references to them.

Symbols evaluated too eagerly

Compatibility issue with virtualArray.

bplapply(list(as.symbol("x")), as.character, 
  BPPARAM=SerialParam(catch.errors=TRUE))

throws an error if catch.errors==TRUE, which is the default.

Change default to FALSE (at least until error handling is really well tested to deal with language objects)
Test all backends to not resolve symbols too eagerly
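
One way such eager evaluation can arise (an assumption about the mechanism, not a diagnosis of the BiocParallel code) is when arguments are spliced into a call with do.call(): a symbol in the argument list is then evaluated as a variable. Base R shows the difference; not_defined_anywhere is an arbitrary name assumed to be unbound.

```r
sym <- as.symbol("not_defined_anywhere")

# lapply() passes the symbol through untouched
lapply(list(sym), as.character)        # list("not_defined_anywhere")

# do.call() builds as.character(not_defined_anywhere) and evaluates the symbol
eager <- tryCatch(do.call(as.character, list(sym)), error = function(e) e)
inherits(eager, "error")               # TRUE

# quote = TRUE keeps language objects unevaluated
do.call(as.character, list(sym), quote = TRUE)   # "not_defined_anywhere"
```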

example(DoparParam) gives R CMD check error on Windows (w/ PATCH)

PROBLEM:
On Windows, there is a bug (most likely in the doParallel package) causing 'R CMD check' to give:

checking examples ...Warning in file(con, "r") :
cannot open file 'BiocParallel-Ex.Rout': Permission denied
Error in file(con, "r") : cannot open the connection
Execution halted

PATCH:
This is because there are some left over socket connections open. The following patch fixes the problem:

diff --git a/man/DoparParam-class.Rd b/man/DoparParam-class.Rd
index 91ce98d..044b036 100644
--- a/man/DoparParam-class.Rd
+++ b/man/DoparParam-class.Rd
@@ -86,7 +86,7 @@ DoparParam()

First register a parallel backend with foreach

library(doParallel)
-registerDoParallel(cl=makeCluster(2))
+registerDoParallel(cl=(cl <- makeCluster(2)))

p <- DoparParam()
bplapply(1:10, sqrt, BPPARAM=p)
@@ -95,6 +95,9 @@ bpvec(1:10, sqrt, BPPARAM=p)
\dontrun{
register(DoparParam(), default=TRUE)
}
+
+# Workaround for doParallel bug on Windows
+if (.Platform$OS == "windows") stopCluster(cl)
}

\keyword{classes}

/Henrik

Add Suggests: RUnit

PROBLEM:
Right now 'R CMD check' throws an error that 'RUnit' is not available (even though it is installed).

checking tests ...
Running 'test.R'
ERROR
Running the tests in 'tests/test.R' failed.
Last 13 lines of output:
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

BiocGenerics:::testPackage("BiocParallel")
Error in BiocGenerics:::testPackage("BiocParallel") :
RUnit package not found
In addition: Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return =
TRUE, :
there is no package called 'RUnit'
Execution halted

This happens with both R v3.0.1 patched (r62850) and R devel (r62857) [at least on Windows].

PATCH:
Adding 'RUnit' to DESCRIPTION/Suggests: solves this:

diff --git a/DESCRIPTION b/DESCRIPTION
index f115c12..f5e6b88 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -10,4 +10,4 @@ Description: This package provides modified versions and novel
biocViews: HighThroughputSequencing, Infrastructure
License: GPL-2 | GPL-3
Imports: methods, parallel, foreach, tools
-Suggests: BiocGenerics, doParallel
+Suggests: BiocGenerics, doParallel, RUnit

/Henrik

bpmapply does not match MoreArgs argument names

> f = function(x, y) x
> mapply(f, 1:3, MoreArgs=list(x=1))
[1] 1 1 1
> library(BiocParallel)
> bpmapply(f, 1:3, MoreArgs=list(x=1))
[1] 1 2 3

Using on an LSF cluster - "Some jobs disappeared"

Hi, your package is potentially really useful to me, but I'm having an issue where, although toy examples work, scaling up to many jobs gives an error message:

Error in stop(e) :
Some jobs disappeared, i.e. were submitted but are now gone. Check your configuration and template file.

E.g., I can do

# toy function

FUN <- function(x) { round(sqrt(x), 4) }

# now with batch

funs <- makeClusterFunctionsLSF("/g/furlong/Harnett/Tagseq_myfolder/jobs/simple.tmpl")
param <- BatchJobsParam(4, resources=list(ncpus=1,nodes=1,queue = 'medium_priority' ,memory=1e4,walltime=3600),cluster.functions=funs)
register(param)
xx <- bplapply(1:20, FUN)

But if I do the following:
xx <- bplapply(1:200, FUN)

Then I generally get the error.
Maybe BiocParallel just needs to wait longer for the LSF scheduler to return the jobs? I don't generally get problems with the cluster dropping jobs otherwise, and the R code isn't at fault.

Thanks for your help!

Missing BPPARAM not handled correctly

f = function(x, BPPARAM) bplapply(x, identity, BPPARAM=BPPARAM)

# works
bplapply(1:2, identity)
bplapply(1:2, identity, BPPARAM=MulticoreParam(2))
f(1:2, MulticoreParam(2))

# error
f(1:2)


Error in bplapply(x, identity, BPPARAM = BPPARAM) : 
error in evaluating the argument 'BPPARAM' in selecting a method 
for function 'bplapply': Error: argument "BPPARAM" is missing, 
with no default
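
Until missing arguments are handled by the generic, a caller-side workaround is to give the wrapper a default. A minimal sketch, assuming BiocParallel is installed and that bpparam() returns the currently registered back-end:

```r
# Assumes BiocParallel is installed; the guard keeps the sketch runnable without it.
if (requireNamespace("BiocParallel", quietly = TRUE)) {
    f <- function(x, BPPARAM = BiocParallel::bpparam())
        BiocParallel::bplapply(x, identity, BPPARAM = BPPARAM)

    f(1:2, BiocParallel::SerialParam())   # explicit back-end
    f(1:2)                                # falls back to the registered default
}
```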

inst/NEWS.Rd outdated

The inst/NEWS.Rd file is outdated and shows the wrong version number. It is probably worth dropping if it remains in its current state.

In vignette, update reference to Statistical Science article.

Since it was published, in 2014, the included URL (http://www.imstat.org/sts/future papers.html) is no longer useful.

Avoid Rsamtools in unit tests

Replace https://github.com/Bioconductor/BiocParallel/blob/master/inst/unitTests/test_bpvalidate.R#L15-L34 with tests involving

fun = function(...) param; bpvalidate(fun)
fun = function(..., param) param; bpvalidate(fun)

Loading an (arbitrary) package slows down bplapply()

2.5x longer when mgcv is loaded.

> library(BiocParallel)
> system.time(bplapply(1:1e2 , function(...) {}, BPPARAM = MulticoreParam(workers = 2, tasks=1e2)))
   user  system elapsed 
  0.089   0.008   5.832 
> library(mgcv)
Loading required package: nlme
This is mgcv 1.8-10. For overview type 'help("mgcv-package")'.
> system.time(bplapply(1:1e2 , function(...) {}, BPPARAM = MulticoreParam(workers = 2, tasks=1e2)))
   user  system elapsed 
  0.095   0.012  12.048

biocParallel install error; makePSOCKcluster

R 3.2.5

I'm having this issue when installing from biocLite:

biocLite("BiocParallel")
BioC_mirror: http://bioconductor.org
Using Bioconductor version 2.12 (BiocInstaller 1.10.4), R version 3.2.5.
Temporarily using Bioconductor version 2.12
Installing package(s) 'BiocParallel'
--2017-02-04 15:22:35-- http://bioconductor.org/packages/2.12/bioc/src/contrib/BiocParallel_0.2.0.tar.gz
Resolving bioconductor.org (bioconductor.org)... 54.192.147.23, 54.192.147.170, 54.192.147.156, ...
Connecting to bioconductor.org (bioconductor.org)|54.192.147.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 126165 (123K) [application/x-gzip]
Saving to: `/tmp/RtmpBv2SbE/downloaded_packages/BiocParallel_0.2.0.tar.gz'
 0K .......... .......... .......... .......... .......... 40% 1.35M 0s
50K .......... .......... .......... .......... .......... 81% 2.36M 0s
100K .......... .......... ... 100% 7.25M=0.06s

2017-02-04 15:22:35 (2.01 MB/s) - `/tmp/RtmpBv2SbE/downloaded_packages/BiocParallel_0.2.0.tar.gz' saved [126165/126165]

installing source package ‘BiocParallel’ ...
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
Error : .onLoad failed in loadNamespace() for 'BiocParallel', details:
call: makePSOCKcluster(spec, ...)
error: numeric 'names' must be >= 1
Error: loading failed
Execution halted
ERROR: loading failed

removing ‘/home/billylau/R/x86_64-pc-linux-gnu-library/3.2/BiocParallel’

The downloaded source packages are in
‘/tmp/RtmpBv2SbE/downloaded_packages’

Any ideas?

DESCRIPTION's biocViews 'HighThroughputSequencing' non-relevant(?)

The biocViews 'HighThroughputSequencing' in the DESCRIPTION file seems out of place.

Add bpexport functionality

bpexport to make local variables available to remote computation. From the mailing list

BatchJobs config file

Hi Michel,

I'm planning to use BatchJobsParam in an example for the BioC2015 lab Martin and I are doing. When putting it together I came across a couple of issues with the 'workers' field.

It looks like the ncpus set in Multicore and SSH isn't propagating to the 'workers' field in BatchJobsParam().

For example if I have this in my config,

cluster.functions = makeClusterFunctionsMulticore(ncpus=3)

bpworkers() is not set to 3.

bpworkers(BatchJobsParam())
integer(0)

Also, the docs say that for Multicore and SSH the number of workers defaults to max available workers. This also does not propagate to 'workers' in BatchJobsParam.

Here I had cluster.functions = makeClusterFunctionsMulticore() in my config:

BatchJobsParam()
Sourcing configuration file: '/home/vobencha/R/R-dev/R-3-2-branch/library/BatchJobs/etc/BatchJobs_global_config.R'
Sourcing configuration file: '/home/vobencha/sandbox/.BatchJobs.R'
BatchJobs configuration:
cluster functions: Multicore
mail.from:
mail.to:
mail.start: none
mail.done: none
mail.error: none
default.resources:
debug: FALSE
raise.warnings: FALSE
staged.queries: TRUE
max.concurrent.jobs: Inf
fs.timeout: NA

class: BatchJobsParam
bpjobname:BPJOB; bpworkers:; bpisup:TRUE
bpstopOnError:FALSE; bpprogressbar:TRUE
cleanup:TRUE

Evidently it doesn't matter what bpworkers() is set to, BatchJobs is getting the information it needs elsewhere. Maybe there should be a validity check for consistency? Does the value in bpworkers() serve much purpose for BatchJobsParam?

Was it intended that the user would set most parameters through the config file and not the BatchJobsParam() constructor interface?

Thanks.
Valerie

bpiterate problem

Dear BiocParallel developers,

I am having trouble getting the bpiterate function to run in parallel on my PC. I'm running a Linux PC (Ubuntu 14.04), using RStudio and R version 3.2.2. I'm trying to process a BAM file sorted by Qnames in chunks and apply a function to each chunk.
While the multicore reading of the BAM file seems to work in parallel (based on memory consumption), the loaded chunks are not processed in parallel (based on CPU usage). I did not get any error while running the code; the only problem is that I'm still using only one CPU.
Can you please help me solve this problem?
Below is the code I used, which tries to reproduce the example code from the BiocParallel reference manual.

Thank you in advance,
David

bf <- BamFile(bamfile, yieldSize = 300000, obeyQname=TRUE)

bamIterator <- function(bf) {
  done <- FALSE
  if (!isOpen( bf))
    open(bf)
  function() {
    if (done)
      return(NULL)
    param <- Rsamtools::ScanBamParam(
        what=c('seq', 'qual','mapq','cigar'),
        flag=scanBamFlag(isDuplicate=FALSE))
    yld <- readGAlignments(bf, param=param)
    if (length(yld) == 0L) {
      close(bf)
      done <<- TRUE
      NULL
    } else yld
  }
}

ITER <- bamIterator(bf)
bpparam <- MulticoreParam(workers = 4)

counter <- function(reads, roi, ...) {
  countOverlaps(query = roi, subject = reads)
} 

#hap.gr is GRanges object of multiple ranges
bpiterate(ITER, counter, roi=hap.gr, BPPARAM = bpparam)

parallel:::isChild() only exists on Unix (not Windows)

FYI,

bplapply() for c("ANY", "ANY", "MulticoreParam") calls isChild(), which is only defined on Unix (cf. R/unix/zzz.R), where isChild <- parallel:::isChild(), which in turn is only defined on Unix (cf. parallel/R/unix/mcfork.R).

Thus,

p <- MulticoreParam(workers=8);
res <- bplapply(1:8, function(i) Sys.sleep(1), param=p);
Error in bplapply(1:8, function(i) Sys.sleep(1), param = p) :
could not find function "isChild"

on Windows with BiocParallel 0.0.5. This is also caught by 'R CMD check' (http://bioconductor.org/checkResults/devel/bioc-LATEST/BiocParallel/moscato2-checksrc.html).

Allow recovery from fatal worker error

library(BiocParallel)
fun <- function(i) {
    if (i == 2) tools::pskill(Sys.getpid())
    i
}
bplapply(1:3, fun)

causes the entire bplapply() to fail, but could instead return results 1 and 3.

Orphaned SnowParam() clusters need a finalizer

Completing execution without explicitly closing a snow cluster results in an error

$ R --vanilla -e "library('BiocParallel'); p <- SnowParam(1); bpstart(p)"
> library('BiocParallel'); p <- SnowParam(1); bpstart(p)
starting worker localhost:11325
> 
> 
Error in unserialize(node$con) : error reading from connection
Calls: local ... doTryCatch -> <Anonymous> -> recvData.SOCKnode -> unserialize
Execution halted

whereas closing the cluster does not.

$ R --vanilla -e "library('BiocParallel'); p <- SnowParam(1); bpstart(p); bpstop(p)"
> library('BiocParallel'); p <- SnowParam(1); bpstart(p); bpstop(p)
starting worker localhost:11840
> 
> 
$

Need a finalizer to close each started cluster

Automatically check that user code references exported, local, or attached package variables

Systematically integrate function to check on non-local use into bplapply and friends

Idea for "bpforeach"

So, I just thought a little about how to do a "bpforeach" implementation, and I think the simplest way to do it would be to implement a %dobp% operator that uses the registered BiocParallel param to parallelize the foreach iteration. Usage would be like the %dopar% operator:

foreach(x=1:10) %dobp% { sqrt(x); }

Default backends are not registered until first call to registered()

The following code produces an error:

library(BiocParallel)
register(DoparParam())
registered()
bpparam("SerialParam")

This is because if another param is registered before the first call to registered, then the default backends are never registered. I have written some code assuming that bpparam("SerialParam") will always succeed and return the registered SerialParam, and the documentation for registered seems to support my assumption: "At load time the registry is populated with default backends." (Never mind the fact that SerialParam is a singleton class, since that's an implementation detail.)

Perhaps the defaults should instead be registered in the .onLoad function?
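A toy sketch of what that could look like. All names here (`.registry`, `.register_default`, the placeholder strings) are hypothetical and do not reflect how BiocParallel actually stores its registry; the point is only that defaults are populated at load time and never overwrite a user's entry.

```r
# Hypothetical package-internal registry
.registry <- new.env(parent = emptyenv())

.register_default <- function(name, value) {
    # only fill in a default if nothing is registered under that name yet
    if (!exists(name, envir = .registry, inherits = FALSE))
        assign(name, value, envir = .registry)
}

# R calls .onLoad() when the package namespace is loaded
.onLoad <- function(libname, pkgname) {
    .register_default("SerialParam", "serial backend placeholder")
    .register_default("MulticoreParam", "multicore backend placeholder")
}

# Simulate: package load, then a user registration, then a lookup
.onLoad(NULL, NULL)
assign("DoparParam", "dopar backend placeholder", envir = .registry)
get("SerialParam", envir = .registry)
```

With this ordering, the lookup of "SerialParam" succeeds regardless of what the user registered first.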

Some Param objects should have finalizers

If you do:

p <- SnowParam(2)
p <- bpstart(p)
rm(p)

Then the two processes started for the cluster will remain running. Once params have been converted to reference classes, it should be possible to register finalizers on them using reg.finalizer (since I think each instance of a ref class carries around an environment with it). We can use this to auto-cleanup stale processes and such for param objects that go out of scope.
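A minimal sketch of the mechanism, using a plain environment as a stand-in for a started cluster so that no worker processes are needed. `start_resource`, `cleanup` and `stopped` are hypothetical names; a real param would call its stop routine inside the finalizer.

```r
# Counter used to observe that the finalizer ran
stopped <- new.env()
stopped$count <- 0L

# Finalizer: receives the environment that is being collected
cleanup <- function(h) {
    h$running <- FALSE                    # a real param would stop its cluster here
    stopped$count <- stopped$count + 1L
}

start_resource <- function() {
    handle <- new.env()
    handle$running <- TRUE
    # onexit = TRUE also runs the finalizer when the R session ends
    reg.finalizer(handle, cleanup, onexit = TRUE)
    handle
}

p <- start_resource()
rm(p)
invisible(gc())   # collecting the unreachable environment runs cleanup()
stopped$count
```

The `onexit = TRUE` argument matters for the orphaned cluster case, since it covers sessions that end without the object ever being garbage collected.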

Params should be reference classes

Looking at how bpstart and bpstop work, it looks like one must do:

param <- bpstart(param)

If one simply does bpstart(param), then param will still be set to an object representing a stopped cluster, and the reference to the started cluster will be lost (possibly leaking resources by leaving a bunch of new processes running but inaccessible).

I think we could make bpstart and bpstop work without assignment by storing the reference to the backend inside an environment in the object, since environments are pass-by-reference.
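To illustrate the idea, here is a toy param whose mutable state lives in an environment. The `toy_*` functions and `make_param` are hypothetical and only mimic the shape of bpstart and bpstop; they start no real backend.

```r
# Hypothetical constructor: the list is copied on modification,
# but the environment inside it is shared by all copies
make_param <- function(workers) {
    state <- new.env()
    state$up <- FALSE
    structure(list(workers = workers, state = state), class = "toy_param")
}

toy_start <- function(param) {
    param$state$up <- TRUE      # modifies the shared environment in place
    invisible(param)
}

toy_stop <- function(param) {
    param$state$up <- FALSE
    invisible(param)
}

toy_isup <- function(param) param$state$up

p <- make_param(2)
toy_start(p)    # no reassignment needed
toy_isup(p)
```

Because the caller's `p` and the function's local copy point at the same environment, the started state is visible without `p <- toy_start(p)`.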

What do you think?

bplapply,SnowParam returns NULL rather than list()

Expecting

> lapply(list(), c)
list()

but get

> bplapply(list(), c, BPPARAM=SnowParam(1))
Bioconductor version 2.14 (BiocInstaller 1.13.3), ?biocLite for help
NULL
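One way to get the expected value is a zero-length guard before dispatching to the backend. This is only a sketch: `safe_apply` and its `apply_fun` argument are hypothetical, with `apply_fun` standing in for whatever backend-specific code currently returns NULL.

```r
# Hypothetical wrapper around a backend-specific apply function
safe_apply <- function(X, FUN, ..., apply_fun = lapply) {
    if (length(X) == 0L)
        return(list())          # same value as lapply(list(), c)
    apply_fun(X, FUN, ...)
}

# A stand-in backend that misbehaves on empty input
broken_backend <- function(X, FUN, ...) NULL

safe_apply(list(), c, apply_fun = broken_backend)
```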

split/apply/combine paradigm

I'd like to get started on this one and use this tracker to collect and discuss ideas.

AFAIR @lawremi suggested back in September to use split/by (split), bp*apply (apply) and stack (combine).

I'm rather unsure what functionality is needed. Usually I'm fine with split, bplapply and l*ply/Reduce.
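For reference, the workflow described above looks roughly like this in base R. `lapply` is used so the sketch runs without a backend; the assumption is that bplapply could take its place in the apply step. `utils::stack` is used here as a stand-in for the suggested stack-based combine.

```r
df <- data.frame(g = c("a", "b", "a", "b"), v = c(1, 2, 3, 4))

pieces  <- split(df$v, df$g)        # split: one numeric vector per group
sums    <- lapply(pieces, sum)      # apply: the step a parallel backend would take over
total   <- Reduce(`+`, sums)        # combine into a single value
stacked <- utils::stack(sums)       # or combine into a long data.frame
stacked
```

The two combine steps show the design question: Reduce collapses to one value, while stack keeps the group labels next to the per-group results.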

bpresume failures

bpresume frequently fails, generating empty error messages, when used with BatchJobsParam(). It works fine with SerialParam(). Please see the examples below.

fun <- function(x) { if (x >= 0) x else y }

bplapply(-5:5, fun, BPPARAM = BatchJobsParam())
SubmitJobs |+++++++++++++++++++++++++++++++++++++++++++++++++| 100% (00:00:00)
Waiting [S:0 R:0 D:11 E:0] |++++++++++++++++++++++++++++++++++| 100% (00:00:00)

Error: Errors occurred; first error message:
Error in FUN(...): object 'y' not found

For more information, use bplasterror(). To resume calculation, re-call
the function and set the argument 'BPRESUME' to TRUE or wrap the
previous call in bpresume().
First traceback:
41: bplapply(-5:5, fun, BPPARAM = BatchJobsParam())
40: bplapply(-5:5, fun, BPPARAM = BatchJobsParam())
39: bpmapply(FUN, X, MoreArgs = list(...), SIMPLIFY = FALSE, USE.NAMES = FALSE,
BPRESUME = BPRESUME, BPPARAM = BPPARAM)
38: bpmapply(FUN, X, MoreArgs = list(...), SIMPLIFY = FALSE, USE.NAMES = FALSE,
BPRESUME = BPRESUME, BPPARAM = BPPARAM)
37: suppressMessages(do.call(submitJobs, pars))
36: withCallingHandlers(expr, message = function(c) invokeRestart("muffleMessage"))
35: do.call(submitJobs, pars)
34: (function (reg, ids, resources = list(), wait, max.retries = 10L,
job.delay = FALSE)
{
chunks.as.arrayjobs = FALSE
getDelays = function(cf, job.delay, n) {
if (is.logical(job.delay)) {
if (job.delay && n > 100L && cf$name %nin% c("Interactive",
"Multicore", "SSH")) {
return(runif(n, n * 0.1, n * 0.2))
}
return(delays = rep.int(0, n))
}
vapply(seq_along(ids), job.delay, numeric(1L), n = n)
}
checkArg(reg, cl = "Registry")
syncRegistry(reg)
if (missing(ids)) {
ids = dbFindSubmitted(reg, negate = TRUE)
if (length(ids) == 0L) {
message("All jobs submitted, nothing to do!")
return(invisible(NULL))
}
}
else {
if (is.list(ids)) {
ids = lapply(ids, checkIds, reg = reg, check.present = FALSE)
dbCheckJobIds(reg, unlist(ids))
}
else if (is.numeric(ids)) {
ids = checkIds(reg, ids)
}
else {
stop("Parameter 'ids' must be a integer vector of job ids or a list of chunked job ids (list of integer vectors)!")
}
}
conf = getBatchJobsConf()
cf = getClusterFunctions(conf)
limit.concurrent.jobs = is.finite(conf$max.concurrent.jobs)
n = length(ids)
checkArg(resources, "list")
resources = resrc(resources)
if (missing(wait))
wait = function(retries) 10 * 2^retries
else checkArg(wait, formals = "retries")
if (is.logical(job.delay)) {
checkArg(job.delay, "logical", len = 1L, na.ok = FALSE)
}
else {
checkArg(job.delay, formals = c("n", "i"))
}
if (is.finite(max.retries)) {
max.retries = convertInteger(max.retries)
checkArg(max.retries, "integer", len = 1L, na.ok = FALSE)
}
checkArg(chunks.as.arrayjobs, "logical", na.ok = FALSE)
if (chunks.as.arrayjobs && is.na(cf$getArrayEnvirName())) {
warningf("Cluster functions '%s' do not support array jobs, falling back on chunks",
cf$name)
chunks.as.arrayjobs = FALSE
}
if (!is.null(cf$listJobs)) {
ids.intersect = intersect(unlist(ids), dbFindOnSystem(reg,
unlist(ids)))
if (length(ids.intersect) > 0L) {
stopf("Some of the jobs you submitted are already present on the batch system! E.g. id=%i.",
ids.intersect[1L])
}
}
if (limit.concurrent.jobs && (cf$name %in% c("Interactive",
"Local", "Multicore", "SSH") || is.null(cf$listJobs))) {
warning("Option 'max.concurrent.jobs' is enabled, but your cluster functions implementation does not support the listing of system jobs.\n",
"Option disabled, sleeping 5 seconds for safety reasons.")
limit.concurrent.jobs = FALSE
Sys.sleep(5)
}
if (n > 5000L) {
warningf(collapse(c("You are about to submit '%i' jobs.",
"Consider chunking them to avoid heavy load on the scheduler.",
"Sleeping 5 seconds for safety reasons."), sep = "\n"),
n)
Sys.sleep(5)
}
saveConf(reg)
is.chunked = is.list(ids)
messagef("Submitting %i chunks / %i jobs.", n, if (is.chunked)
sum(vapply(ids, length, integer(1L)))
else n)
messagef("Cluster functions: %s.", cf$name)
messagef("Auto-mailer settings: start=%s, done=%s, error=%s.",
conf$mail.start, conf$mail.done, conf$mail.error)
interrupted = FALSE
submit.msgs = buffer("list", 1000L, dbSendMessages, reg = reg,
max.retries = 10000L, sleep = function(r) 5, staged = useStagedQueries())
logger = makeSimpleFileLogger(file.path(reg$file.dir, "submit.log"),
touch = FALSE, keep = 1L)
on.exit({
if (interrupted && exists("batch.result", inherits = FALSE)) {
submit.msgs$push(dbMakeMessageSubmitted(reg, id,
time = submit.time, batch.job.id = batch.result$batch.job.id,
first.job.in.chunk.id = if (is.chunked) id1 else NULL,
resources.timestamp = resources.timestamp))
}
messagef("Sending %i submit messages...\nMight take some time, do not interrupt this!",
submit.msgs$pos())
submit.msgs$clear()
if (logger$getSize()) messagef("%i temporary submit errors logged to file '%s'.\nFirst message: %s",
logger$getSize(), logger$getLogfile(), logger$getMessages(1L))
})
messagef("Writing %i R scripts...", n)
resources.timestamp = saveResources(reg, resources)
writeRscripts(reg, cf, ids, chunks.as.arrayjobs, resources.timestamp,
disable.mail = FALSE, delays = getDelays(cf, job.delay,
n), interactive.test = !is.null(conf$interactive))
dbSendMessage(reg, dbMakeMessageKilled(reg, unlist(ids)),
staged = FALSE)
bar = makeProgressBar(max = n, label = "SubmitJobs")
bar$set()
tryCatch({
for (id in ids) {
id1 = id[1L]
retries = 0L
repeat {
if (limit.concurrent.jobs && length(cf$listJobs(conf,
reg)) >= conf$max.concurrent.jobs) {
batch.result = makeSubmitJobResult(status = 10L,
batch.job.id = NA_character_, "Max concurrent jobs exhausted")
}
else {
interrupted = TRUE
submit.time = now()
batch.result = cf$submitJob(conf = conf, reg = reg,
job.name = sprintf("%s-%i", reg$id, id1),
rscript = getRScriptFilePath(reg, id1), log.file = getLogFilePath(reg,
id1), job.dir = getJobDirs(reg, id1), resources = resources,
arrayjobs = if (chunks.as.arrayjobs)
length(id)
else 1L)
}
if (batch.result$status == 0L) {
submit.msgs$push(dbMakeMessageSubmitted(reg,
id, time = submit.time, batch.job.id = batch.result$batch.job.id,
first.job.in.chunk.id = if (is.chunked)
id1
else NULL, resources.timestamp = resources.timestamp))
interrupted = FALSE
bar$inc(1L)
break
}
interrupted = FALSE
if (batch.result$statu

bplasterror()
6/11 partial results stored. First 5 error messages:
[1]: Error: Error in FUN(...): object 'y' not found

[2]: Error: Error in FUN(...): object 'y' not found

[3]: Error: Error in FUN(...): object 'y' not found

[4]: Error: Error in FUN(...): object 'y' not found

[5]: Error: Error in FUN(...): object 'y' not found

bpresume(bplapply(abs(-5:5), fun, BPPARAM = BatchJobsParam()))
Resuming previous calculation...
SubmitJobs |+++++++++++++++++++++++++++++++++++++++++++++++++| 100% (00:00:00)
Syncing registry ...
Waiting [S:0 R:0 D:0 E:0] |++++++++++++++++++++++++++++++++++| 100% (00:00:00)

Error in LastError$store(results = results, is.error = !ok, throw.error = TRUE) :
Errors occurred; first error message:

For more information, use bplasterror(). To resume calculation, re-call
the function and set the argument 'BPRESUME' to TRUE or wrap the
previous call in bpresume().
Error in LastError$store(results = replace(results, is.error, LastError$results), :
Errors occurred; first error message:
Error:

For more information, use bplasterror(). To resume calculation, re-call
the function and set the argument 'BPRESUME' to TRUE or wrap the
previous call in bpresume().

bplasterror()
6/11 partial results stored. First 5 error messages:
[1]: Error: Error:

[2]: Error: Error:

[3]: Error: Error:

[4]: Error: Error:

[5]: Error: Error:

bpresume(bplapply(abs(-5:5), fun, BPPARAM = BatchJobsParam()))
Resuming previous calculation...
SubmitJobs |+++++++++++++++++++++++++++++++++++++++++++++++++| 100% (00:00:00)
Syncing registry ...
Waiting [S:0 R:0 D:5 E:0] |++++++++++++++++++++++++++++++++++| 100% (00:00:00)

[[1]]
[1] 5

[[2]]
[1] 4

[[3]]
[1] 3

[[4]]
[1] 2

[[5]]
[1] 1

[[6]]
[1] 0

[[7]]
[1] 1

[[8]]
[1] 2

[[9]]
[1] 3

[[10]]
[1] 4

[[11]]
[1] 5

sessionInfo()
R Under development (unstable) (2013-12-03 r64376)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] BatchJobs_1.1-1135 BBmisc_1.4 BiocParallel_0.5.5

loaded via a namespace (and not attached):
[1] brew_1.0-6 codetools_0.2-8 DBI_0.2-7 digest_0.6.4
[5] fail_1.2 foreach_1.4.1 iterators_1.0.6 parallel_3.1.0
[9] plyr_1.8 RSQLite_0.11.4 sendmailR_1.1-2 tools_3.1.0

Tests involving Bioconductor objects

Currently all tests use primitive vectors and/or lists. Since the goal is for the parallel functions to work on anything implementing c, [, and [[, we should add tests that work on Bioconductor vector-like classes such as IRanges, SimpleList, GRangesList, and XStringSet.

Issues to be addressed:

How can we use a package in a test without making BiocParallel depend on that package?
We should benchmark to see what kind of speedup is achieved with various backends. For a big complex object like GRanges, there might be a lot of overhead from transferring data between workers, as well as the splitting and joining operations. This wouldn't be part of the normal tests though.
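On the first question, a common approach is to list the package under Suggests in DESCRIPTION and guard the test with `requireNamespace()`. The helper below is a hypothetical sketch of that guard, not an existing BiocParallel or test-framework function; "stats" is used in the call only so the example runs anywhere, where a real test would name IRanges or similar.

```r
# Hypothetical helper: run a test body only if an optional package is installed
with_optional_package <- function(pkg, test) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        message("skipping: package '", pkg, "' is not installed")
        return(invisible(FALSE))
    }
    test()
    invisible(TRUE)
}

ran <- with_optional_package("stats", function() {
    # a real test would build e.g. an IRanges object here and compare
    # the parallel result against lapply()
    stopifnot(identical(lapply(1:3, sqrt), lapply(1:3, sqrt)))
})
```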

Error setting bpstopOnError(parms) <- FALSE

With the current devel version I get an error when I try to set stop.on.error to FALSE:

> library(BiocParallel)
> parms <- bpparam()
> parms
class: MulticoreParam
  bpisup: FALSE; bpnworkers: 2; bptasks: 0; bpjobname: BPJOB
  bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
  bptimeout: 2592000; bpprogressbar: FALSE
  bpRNGseed: 
  bplogdir: NA
  bpresultdir: NA
  cluster type: FORK
> bpstopOnError(parms) <- FALSE
Error in validObject(x) : 
  invalid class “MulticoreParam” object: 1: ‘cluster’, ‘.clusterargs’, ‘.uid’, ‘RNGseed’ must be length 1
invalid class “MulticoreParam” object: 2: ‘.clusterargs’, ‘.controlled’, ‘logdir’, ‘resultdir’ must be length 1
In addition: Warning message:
In is.na(x[[i]]) :
  is.na() applied to non-(list or vector) of type 'environment'

My session info:

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin15.6.0/x86_64 (64-bit)
Running under: OS X 10.11.6 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocParallel_1.7.8

loaded via a namespace (and not attached):
[1] parallel_3.3.1

Rscript: trying to execute load actions without 'methods' package

Does the following need to be fixed in the Bioc release?

With BiocParallel 1.4.0 on R (<= 3.2.3), I get:

% Rscript --version
R scripting front-end version 3.2.3 Patched (2015-12-10 r69760)

% Rscript -e "loadNamespace('BiocParallel')"
<environment: namespace:BiocParallel>
Warning message:
In .doLoadActions(where, attach) :
  trying to execute load actions without 'methods' package

I don't understand why this happens, because the methods package is listed under Depends:.

I do not see this with BiocParallel 1.5.0 on R devel:

% Rscript --version
R scripting front-end version 3.3.0 Under development (unstable) (2015-12-10 r69760)
% Rscript -e "loadNamespace('BiocParallel')"
<environment: namespace:BiocParallel>
