batchjobs's People

Contributors

berndbischl, gaborcsardi, henrikbengtsson, jakob-r, mllg, readmecritic, surmann

batchjobs's Issues

Wrong chunk labels in info mails

Situation: I have 16 chunks, and by default an info mail is sent for the first and the last one.
For the first chunk it correctly says "Chunk 1 has started/finished".
But the label of the last one is not correct.
It says
"Chunk 751 has started/finished", although 751 is the ID of the first job in chunk 16.

Memory usage scales (somewhat) linearly with chunk.size

Note: this one is a bit vague; if you need an example of it happening, please let me know.

Consider the following: I use BatchJobs for a function call where more.args contains a large matrix, in my case about 20,000 x 1,000, while the return value is much smaller, say 1 x 1,000.

Running this with chunk.size = 1 requires about 300-400 MB of memory. If I run it with chunk.size = 5, it uses about 2 GB; chunk.size = 10 requires a bit under 4 GB. If I have 30,000 jobs that I'd like to put in chunks of 50, this becomes a problem.
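
A minimal sketch of the setup described above (the matrix size, function, and job count are placeholders, not the original code):

library(BatchJobs)
big = matrix(rnorm(20000 * 1000), nrow = 20000)   # large matrix passed to every job via more.args
f = function(i, m) colMeans(m)                    # small return value, roughly 1 x 1,000
reg = makeRegistry(id = "memtest", file.dir = tempfile())
batchMap(reg, f, i = 1:100, more.args = list(m = big))
# Memory usage on the node grows with the chunk size, e.g.:
submitJobs(reg, chunk(getJobIds(reg), chunk.size = 10))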

However, there is no reason for the function calls to use so much memory. If you:

  • Load more.args once as a shared reference copy (to avoid re-loading it and putting load on the file system),
  • Copy it once for the function call, and
  • Clean up everything afterwards,

then memory usage would stay more or less constant with increasing chunk size.

Maybe using a new.env() for the function call and completely deleting the environment afterwards (as suggested in #35) would also solve this issue.

showLog does not work

On SLURM I get this:

sh: 1: /usr/bin/less +45: not found

The problem seems to be the +45.

getOption("pager")
[1] "/opt/R/R-3.0.2/lib/R/bin/pager"
Sys.getenv("PAGER")
[1] "/usr/bin/less"

methods not loaded by Rscript

BatchJobs execution gets halted before it can queue the jobs.

options(BatchJobs.on.slave=TRUE, BatchJobs.resources.path='/import/scratch/user/dpuru/BatchJobs-scratch/bmq_7fbd23130dcd/resources/resources_1392927734.RData')
library(BatchJobs)
Loading required package: BBmisc
res = BatchJobs:::doJob(
  reg=loadRegistry('/import/scratch/user/dpuru/BatchJobs-scratch/bmq_7fbd23130dcd'),
  ids=c(1L),
  multiple.result.files=FALSE,
  disable.mail=FALSE,
  first=1L,
  last=2L,
  array.id=NA)

2014-02-20 14:22:16: Starting job on node stluhpcprd837.
Loading registry: /import/scratch/user/dpuru/BatchJobs-scratch/bmq_7fbd23130dcd/registry.RData
Loading conf: /import/scratch/user/dpuru/BatchJobs-scratch/bmq_7fbd23130dcd/conf.RData
Auto-mailer settings: start=none, done=none, error=none.
Setting work dir: /import/scratch/user/dpuru/BatchJobs-scratch
Error in sendMail(reg, job, result.str, "", disable.mail, condition = "start", :
  could not find function "is"
Calls: -> doSingleJob -> sendMail
Setting work back to: /import/scratch/user/dpuru/BatchJobs-scratch
Memory usage according to gc:
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 306867 16.4     467875   25   350000 18.7
Vcells 448981  3.5     905753    7   905753  7.0
Execution halted

Rscript does not load the package "methods", so simply putting library(methods) at the beginning of the script gets us past this error and queues the jobs. Eventually, however, all the jobs expire due to the exact same issue occurring when the slave jobs begin execution on the nodes.

Sys.sleep(0.000000)
options(BatchJobs.on.slave=TRUE,
  BatchJobs.resources.path='/import/scratch/user/dpuru/BatchJobs-scratch/bmq_1e8658cc07c6/resources/resources_1392937763.RData')
library(BatchJobs)
res = BatchJobs:::doJob(
  reg=loadRegistry('/import/scratch/user/dpuru/BatchJobs-scratch/bmq_1e8658cc07c6'),
  ids=c(1L),
  multiple.result.files=FALSE,
  disable.mail=FALSE,
  first=1L,
  last=2L,
  array.id=NA)
BatchJobs:::setOnSlave(FALSE)

Is there a way to include library(methods) in these files too?
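
One possible workaround until this is handled internally (a sketch, assuming it is acceptable to declare "methods" as a required package of the registry): load methods explicitly in the master session started via Rscript, and list it in packages so it is attached on the slaves as well.

library(methods)     # Rscript does not attach this by default
library(BatchJobs)
reg = makeRegistry(id = "example", file.dir = tempfile(), packages = "methods")
batchMap(reg, function(x) is(x, "numeric"), 1:3)
submitJobs(reg)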

Functions that fail on LSF do not fail in interactive mode

The following function runs fine in interactive mode but fails on LSF, because x is not defined in the new environment on the node.

> library(BatchJobs)
> x = 5
> f = function(y) x+y
> reg = batchMapQuick(f, c(1:3))
> reduceResultsList(reg, fun=function(job, res) res)
$`1`
[1] 6
$`2`
[1] 7
$`3`
[1] 8

I'm not arguing that this should work on LSF, but in my opinion it should also fail in interactive mode, especially since interactive mode is used for debugging and should catch errors that would occur in production.

I think a good solution would be to evaluate all function calls in a new.env(). This might also have the advantage that the environment can be cleaned up more easily after the function has returned.
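
A minimal sketch of that suggestion (a hypothetical helper, not BatchJobs internals): re-binding the job function to a fresh environment that cannot see the master's globals makes the interactive run fail the same way the LSF run does.

call.in.clean.env = function(fun, args) {
  env = new.env(parent = baseenv())   # isolate the call from the global environment
  environment(fun) = env              # free variables are no longer looked up in .GlobalEnv
  do.call(fun, args)
}
x = 5
f = function(y) x + y
f(1)                                  # works interactively: 6
call.in.clean.env(f, list(y = 1))     # Error: object 'x' not found, as on the cluster

Note that with parent = baseenv() functions from other attached packages would not be found either, so the real isolation level would need some thought.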

BatchJobs should allow sourcing scripts on the nodes

It is possible to define packages to be loaded on the nodes, but not scripts to source(). This matters when the function I call resides in another file and uses helper functions defined there.

Consider the following example (working in interactive mode, not working on LSF):

caller.r

library(BatchJobs)
source('callee.r')
reg = batchMapQuick(primary.func, c(1,2), temporary=F)

callee.r

myglobal <<- "123"
primary.func = function(val) {
    print(myglobal) # fails
    secondary.func(val) # fails
}
secondary.func = function(val) {
}

The workaround I'm currently using is to source("callee.r") inside primary.func of callee.r. This is not only ugly but also dangerous because of possible infinite recursion.

I think the nicest way to handle this would be to add an option to source on the node (analogous to loading packages).
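
For reference, later BatchJobs versions expose src.files/src.dirs arguments in makeRegistry() (they also appear in the src.dirs issues further down); a sketch of the requested behaviour using that interface, assuming callee.r lies in the work directory and is sourced on both the master and the nodes:

library(BatchJobs)
reg = makeRegistry(id = "srcexample", file.dir = tempfile(), src.files = "callee.r")
batchMap(reg, primary.func, c(1, 2))   # primary.func comes from callee.r
submitJobs(reg)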

Debian Med autotest problem

This problem was reported by the Debian Med packaging team from their autopkgtest:


LC_ALL=C R --no-save < run-all.R

and noticed that running this as a normal user results in

  1. Error: batchExpandGrid ------------------------------------------------------
    Could not create dir: unittests-files/unittestee2f8ba97c/registry
    1: makeTestRegistry() at test_batchExpandGrid.R:4
    2: makeRegistry(id = "unittests", seed = 1, packages = packages, file.dir = rd, work.dir = "unittests-files",
    ...) at /usr/lib/R/site-library/BatchJobs/tests/helpers.R:39
    3: makeRegistryInternal(id, file.dir, sharding, work.dir, multiple.result.files, seed,
    packages, src.dirs, src.files)
    4: checkDir(file.dir, create = TRUE, check.empty = TRUE, check.posix = TRUE, msg = TRUE)
    5: stop("Could not create dir: ", path)
  2. Error: batchMap -------------------------------------------------------------
    Could not create dir: unittests-files/unittestee26666541a/registry
    1: makeTestRegistry() at test_batchMap.R:4
    2: makeRegistry(id = "unittests", seed = 1, packages = packages, file.dir = rd, work.dir = "unittests-files",
    ...) at /usr/lib/R/site-library/BatchJobs/tests/helpers.R:39
    3: makeRegistryInternal(id, file.dir, sharding, work.dir, multiple.result.files, seed,
    packages, src.dirs, src.files)
    4: checkDir(file.dir, create = TRUE, check.empty = TRUE, check.posix = TRUE, msg = TRUE)

since normal users cannot write to /usr/lib/R/site-library/.

I tried as root, which seems to work without error, but I get

test_package("BatchJobs")
batchExpandGrid : Warning in sqliteInitDriver(max.con, fetch.default.rec, force.reload, shared.cache) :
RS-DBI driver warning: (SQLite mismatch between compiled version 3.8.5 and runtime version 3.8.4.3)
Warning in sqliteInitDriver(max.con, fetch.default.rec, force.reload, shared.cache) :
RS-DBI driver warning: (SQLite mismatch between compiled version 3.8.5 and runtime version 3.8.4.3)
Warning in sqliteInitDriver(max.con, fetch.default.rec, force.reload, shared.cache) :
RS-DBI driver warning: (SQLite mismatch between compiled version 3.8.5 and runtime version 3.8.4.3)
.Warning in sqliteInitDriver(max.con, fetch.default.rec, force.reload, shared.cache) :
RS-DBI driver warning: (SQLite mismatch between compiled version 3.8.5 and runtime version 3.8.4.3)

Graph of dependent jobs?

SRC: https://code.google.com/p/batchjobs/issues/detail?id=19

For some experiments it MIGHT be useful to be able to specify a graph of dependent jobs, similar to how targets are defined in a Makefile.

This means that for some jobs to start, the results of others have to be fully completed. The solution is probably a simple topological sort with respect to these preconditions (see the sketch below).

But I want to collect more use cases before we look into this again.
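
A minimal sketch of the topological-sort idea (hypothetical, not part of BatchJobs): given a named list mapping each job to its prerequisites, order the jobs so that every job comes after everything it depends on.

topo.sort = function(deps) {
  # deps: named list; deps[["b"]] = "a" means job "b" requires job "a" to be done
  sorted = character(0)
  pending = names(deps)
  while (length(pending) > 0L) {
    ready = pending[vapply(deps[pending], function(d) all(d %in% sorted), logical(1L))]
    if (length(ready) == 0L)
      stop("Cycle detected in the job dependency graph")
    sorted = c(sorted, ready)
    pending = setdiff(pending, ready)
  }
  sorted
}
topo.sort(list(a = character(0), b = "a", c = c("a", "b")))
# [1] "a" "b" "c"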

Problem when loading BatchJobs

Hi, I just tried to install BatchJobs and BatchExperiments on a new computer and encountered the following problem:

I've installed BatchJobs using the command line
devtools::install_github("BatchJobs", username="tudo-r")

Then I tried to load the package and got the following error-message:
library(BatchJobs)
Sourcing configuration file: '/Library/Frameworks/R.framework/Versions/3.0/Resources/library/BatchJobs/etc/BatchJobs_global_config.R'
Error: .onAttach failed in attachNamespace() for 'BatchJobs', details:
  call: sprintf(fmt, x$cluster.functions$name, x$mail.from, x$mail.to,
  error: could not find function "listToShortString"
Error: package or namespace load failed for 'BatchJobs'

odd behavior of registry with multicore approach

I noticed the FAQ on 'multicore does not work' ... this is not my issue; multicore works very nicely in certain respects. However, tracking submission in the registry seems not to work. Many of the job counting utilities do work, but findSubmitted does not.

showStatus(campWBsplreg7)
Status for 239 jobs at 2014-03-20 13:19:12
Submitted: 0 ( 0.00%)
Started: 21 ( 8.79%)
Running: 0 ( 0.00%)
Done: 11 ( 4.60%)
Errors: 0 ( 0.00%)
Expired: 0 ( 0.00%)
Time: min=2834.00s avg=3510.91s max=3741.00s

campWBsplreg7
Job registry: mar3
Number of jobs: 239
Files dir: /udd/stvjc/VM/CAMPWB_200K/mar3f
Work dir: /udd/stvjc/VM/CAMPWB_200K
Multiple result files: FALSE
Seed: 123
Required packages: BatchJobs
?makeRegistry
findRunning(campWBsplreg7)
integer(0)
findDone(campWBsplreg7)
[1] 1 2 3 4 5 6 7 8 9 10 11
findNotDone(campWBsplreg7)
[1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
[19] 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
[37] 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
[55] 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
[73] 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
[91] 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
[109] 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
[127] 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155
[145] 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173
[163] 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
[181] 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209
[199] 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227
[217] 228 229 230 231 232 233 234 235 236 237 238 239
findStarted(campWBsplreg7)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
findSubmitted(campWBsplreg7)
integer(0)

Display bug in testJob's approximate running time

Hi,

I'm using BatchExperiments version 1.0-968.
I tested a job using testJob() and I got

Approximate running time: 1.07 seconds

But it should be minutes, because the job actually took 1.07 * 60 = 64.2 seconds (64.162 according to system.time()).
Just a minor bug but I wanted to share it.

SZ

check.posix

Why was this actually made an option?
And why is it not part of the config?

Is it documented somewhere?

src.dir path in registry

Is there a specific reason why these have to be relative to the work.dir?

I already have a few cases where I would like to set absolute paths.

Why don't we allow this, as long as the user makes sure these paths are accessible on the shared file system?

Cluster function commands should be made slightly more configurable

Using the package here at the LMU HPC (SLURM) cluster, I found out two things:

a) The package basically works out of the box

b) Listing jobs does not work. The problem is trivial:
instead of what we currently run,
squeue -h -o %i -u $USER
I need to run
squeue --clusters=serial -h -o %i -u $USER

The thing is that this command is hard-coded in makeClusterFunctionsSLURM.

Now I could surely copy makeClusterFunctionsSLURM and adapt it, but maybe it would be simpler to make that command configurable?
Maybe there should be an option in BatchJobs.R that one could set?
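
For comparison, the SGE cluster functions quoted in a later issue below already accept a list.jobs.cmd argument; a hypothetical sketch of the same idea for SLURM (this argument does not exist in makeClusterFunctionsSLURM as shipped, and the template path is a placeholder):

cluster.functions = makeClusterFunctionsSLURM(
  "/path/to/slurm.tmpl",
  list.jobs.cmd = c("squeue", "--clusters=serial", "-h", "-o %i", "-u $USER"))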

batchMapQuick chunks not working (on LSF)

Consider the code below. Each time I have 4 jobs that should be put into one chunk, but LSF always submits 4 individual jobs, disregarding the chunks I specify.

Note: this works fine when specifying the chunks manually (last example).

square = function(x) { x*x }
library(BatchJobs)

reg = batchMapQuick(square, c(1:4), chunk.size=10)
#Saving conf: /some/dir/bmq_341e69aa7610/conf.RData
#Submitting 4 chunks / 4 jobs.
#Cluster functions: LSF.

reg = batchMapQuick(square, c(1:4), n.chunks=1)
#Saving conf: /some/dir/bmq_3c8c7b37b8cb/conf.RData
#Submitting 4 chunks / 4 jobs.
#Cluster functions: LSF.

reg = makeRegistry(id="BatchJobsExample", file.dir=tempfile(), seed=123)
batchMap(reg, square, c(1:4))
chunked = chunk(getJobIds(reg), n.chunks=1, shuffle=TRUE)
submitJobs(reg, chunked)
#Saving conf: /tmp/RtmpsWXKsa/file6bec917d96c/conf.RData
#Submitting 1 chunks / 4 jobs.
#Cluster functions: LSF.

The problem seems to be lines 55/56 in R/batchMapQuick.R:

if (!missing(chunk.size) && !missing(n.chunks))
    ids = chunk(ids, chunk.size=chunk.size, n.chunks=n.chunks, shuffle=TRUE)

Here, the condition is only true if both chunk.size and n.chunks are given. It should rather be something like:

if (!missing(chunk.size) && !missing(n.chunks))
    stop("Providing both chunk.size and n.chunks makes no sense")
if (!missing(chunk.size))
    ids = chunk(ids, chunk.size=chunk.size, shuffle=TRUE)
if (!missing(n.chunks))
    ids = chunk(ids, n.chunks=n.chunks, shuffle=TRUE)

SSH mode: bashrc

I noticed on the CIP cluster in Munich that unconditionally printing output in your .bashrc leads to errors during submission.

It might not be the most relevant issue, but let's see if we can make the SSH scripts even more robust.

src.dirs does not work on Lido

Here is the minimal example:

library(BatchJobs)
r = makeRegistry("test", file.dir=as.character(sample(10000, 1)), src.dirs = "Blubb")
f = function(x) x^2
batchMap(r, f, 1:20)
submitJobs(r)

where Blubb is a directory inside my working directory, containing an arbitrary R script.

We see two bugs here:
First, the error on Lido is:
Error in getRScripts(dirs) : Directories not found: Blubb
Second, this error is not transferred into the registry: there the jobs are only labeled as "submitted"; they are neither started nor flagged as errors, but they should be.

dbDoQuery() occasionally gives "RS_SQLite_fetch: failed first step: attempt to write a readonly database"

PROBLEM

Occasionally one of my ~20 jobs fails with the following error:

$: more BiocParallel_tmp_6fd624fb91629/jobs/14/14.out

Command: Rscript --verbose "/path/to/BiocParallel_tmp_6fd624fb91629/jobs/14/14.R"
running
  '/opt/R/R-3.1.1/lib64/R/bin/R --slave --no-restore --file=/path/to/BiocParallel_tmp_6fd624fb91629/jobs/14/14.R'

Loading required package: BBmisc
Loading required package: methods
Loading registry: /path/to/BiocParallel_tmp_6fd624fb91629/registry.RData
Loading conf:
2014-09-09 15:57:16: Starting job on node n0.
Auto-mailer settings: start=none, done=none, error=none.
Setting work dir: /path/to
Warning in sqliteCloseConnection(conn, ...) :
  RS-DBI driver warning: (closing pending result sets before closing this connection)
[1] "Error in sqliteFetch(rs, n = -1, ...) : \n  RSQLite driver: (RS_SQLite_fetch: failed first step: attempt to write a readonly database)\n"
[1] "SELECT job_id, fun_id, pars, jobname, seed FROM bpmapply_expanded_jobs WHERE job_id IN (14)"
Error in dbDoQuery(reg, query) :
  Error in dbDoQuery. Error in sqliteFetch(rs, n = -1, ...) :
  RSQLite driver: (RS_SQLite_fetch: failed first step: attempt to write a readonly database)
 (SELECT job_id, fun_id, pars, jobname, seed FROM bpmapply_expanded_jobs WHERE job_id IN (14))
Calls: <Anonymous> ... dbGetJobs.Registry -> dbSelectWithIds -> dbDoQuery -> stopf
Setting work back to: /home/henrik
Memory usage according to gc:
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 328855 17.6     467875 25.0   407500 21.8
Vcells 488477  3.8    1031040  7.9   786431  6.0
Execution halted
Command: Rscript --verbose "/path/to/BiocParallel_tmp_6fd624fb91629/jobs/14/14.R" ... DONE

FACTS

  • It only happens occasionally, i.e. hard to make a reproducible example.
  • Other very similar jobs run successfully on the same node.
  • There is nothing specific with the jobs failing this way.

TROUBLESHOOTING

Looking at BatchJobs:::dbDoQuery, I see that you check for "(lock|i/o)"-related errors and have dbDoQuery() retry several times before giving up. My best guess is that a similar issue occurs here: BatchJobs runs on a shared file system (NFS), and multiple jobs try to access/update the SQLite database, which is a file on this shared file system. Some job grabs this SQLite file and locks it. Your wait-and-try-again approach handles this lock case. However, here it seems as if the file can also end up in a read-only state, which is not handled. Maybe it has to do with the latency with which file information is propagated on a shared file system, so that the read-only state becomes visible before the lock does. Just guessing here...
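
A rough sketch of the suggested handling (hypothetical; the real retry loop lives inside BatchJobs:::dbDoQuery): treat the "readonly database" message like the lock/I/O messages and retry instead of failing immediately.

retry.patterns = "(lock|i/o|readonly)"       # extended version of the pattern mentioned above
run.with.retries = function(run.query, max.retries = 100L, sleep = 5) {
  for (i in seq_len(max.retries)) {
    res = try(run.query(), silent = TRUE)
    if (!inherits(res, "try-error"))
      return(res)
    if (!grepl(retry.patterns, as.character(res), ignore.case = TRUE))
      stop(as.character(res))                # a genuine error, do not retry
    Sys.sleep(sleep)                         # transient NFS/SQLite hiccup: wait and retry
  }
  stop("Query failed after ", max.retries, " retries")
}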

Always use R CMD BATCH in cluster functions

Reasons:

  1. Code becomes more homogeneous.

  2. We don't have trouble with packages that Rscript does not load by default.

  3. The slightly greater overhead does not matter: BatchJobs is not meant for jobs that only take 2 seconds; we have chunking for that.

Support for other DBMS

This was requested at the Bioconductor Developer Meeting.
Without staged.queries -> permanent database lock.
With staged.queries -> overburdened file system server (not sure if chunked nicely)

Started to look into it, but we might want to wait until rstats-db/dbi (picked up by Hadley) is more mature to avoid workarounds for dbPreparedQuery etc. See https://stat.ethz.ch/pipermail/r-sig-db/2013q4/001322.html

Add an option to respect delayed file systems

See Bioconductor/BiocParallel#33

  • Add an option to enable or disable additional checks
  • After writing R scripts in submitJobs(), sleep until all(file.exists(...)) == TRUE (see the sketch after this list)
  • How to check if the database file has been rewritten to disk in submitJobs?
  • staged.queries may even fail if we cannot rely on the order of files (first created appear first) -> check what reordering would break (besides temporary inconsistencies)
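
A minimal sketch of the second point above (a simplified stand-in; the submitJobs() source quoted in a later issue already calls a waitForFiles() helper with fs.timeout):

wait.for.files = function(paths, timeout = 60, sleep = 1) {
  start = Sys.time()
  while (!all(file.exists(paths))) {
    if (difftime(Sys.time(), start, units = "secs") > timeout)
      stop("Files still not visible after ", timeout, " seconds: ",
           paste(paths[!file.exists(paths)], collapse = ", "))
    Sys.sleep(sleep)   # give the delayed (NFS) file system time to catch up
  }
  invisible(TRUE)
}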

Cannot run simple SGE example

I am trying to get a simple example working on my SGE cluster; session info is below. I'd greatly appreciate any ideas for figuring this out. BiocParallel is not available for R 3.1.1.

Maybe my SGE template is not correct? I suspect not, because this code seems to fail already on a simple qstat.

I am using the following configuration:
cluster.functions = makeClusterFunctionsSGE("/home/poirierj/R_libs/BatchJobs/etc/simple.tmpl", list.jobs.cmd = c("qstat", "-u poirierj"))
mail.start = "none"
mail.done = "none"
mail.error = "none"
db.driver = "SQLite"
db.options = list()
debug = TRUE

library(BatchJobs)
Loading required package: BBmisc
Sourcing configuration file: '/home/poirierj/R_libs/BatchJobs/etc/BatchJobs_global_config.R'
BatchJobs configuration:
cluster functions: SGE
mail.from:
mail.to:
mail.start: none
mail.done: none
mail.error: none
default.resources:
debug: TRUE
raise.warnings: FALSE
staged.queries: FALSE
max.concurrent.jobs: Inf
fs.timeout: NA
library(BiocParallel)
param <- BatchJobsParam(2)
register(param)
x<-bplapply(1:10, identity)
OS cmd: qstat -u poirierj
OS result:
$exit.code
[1] 0

$output
character(0)

Error: $ operator is invalid for atomic vectors

traceback()
15: fun(getBatchJobsConf(), reg)
14: getBatchIds(reg, "Cannot find jobs on system")
13: dbFindOnSystem(reg, unlist(ids))
12: as.vector(y)
11: intersect(unlist(ids), dbFindOnSystem(reg, unlist(ids)))
10: (function (reg, ids, resources = list(), wait, max.retries = 10L,
chunks.as.arrayjobs = FALSE, job.delay = FALSE)
{
getDelays = function(cf, job.delay, n) {
if (is.logical(job.delay)) {
if (job.delay && n > 100L && cf$name %nin% c("Interactive",
"Multicore", "SSH")) {
return(runif(n, n * 0.1, n * 0.2))
}
return(delays = rep.int(0, n))
}
vnapply(seq_along(ids), job.delay, n = n)
}
checkRegistry(reg)
syncRegistry(reg)
if (missing(ids)) {
ids = dbFindSubmitted(reg, negate = TRUE)
if (length(ids) == 0L) {
info("All jobs submitted, nothing to do!")
return(invisible(integer(0L)))
}
}
else {
if (is.list(ids)) {
ids = lapply(ids, checkIds, reg = reg, check.present = FALSE)
dbCheckJobIds(reg, unlist(ids))
}
else if (is.numeric(ids)) {
ids = checkIds(reg, ids)
}
else {
stop("Parameter 'ids' must be a integer vector of job ids or a list of chunked job ids (list of integer vectors)!")
}
}
conf = getBatchJobsConf()
cf = getClusterFunctions(conf)
limit.concurrent.jobs = is.finite(conf$max.concurrent.jobs)
n = length(ids)
assertList(resources)
resources = resrc(resources)
if (missing(wait))
wait = function(retries) 10 * 2^retries
else assertFunction(wait, "retries")
if (is.logical(job.delay)) {
assertFlag(job.delay)
}
else {
checkFunction(job.delay, c("n", "i"))
}
if (is.finite(max.retries))
max.retries = asCount(max.retries)
assertFlag(chunks.as.arrayjobs)
if (chunks.as.arrayjobs && is.na(cf$getArrayEnvirName())) {
warningf("Cluster functions '%s' do not support array jobs, falling back on chunks",
cf$name)
chunks.as.arrayjobs = FALSE
}
if (!is.null(cf$listJobs)) {
ids.intersect = intersect(unlist(ids), dbFindOnSystem(reg,
unlist(ids)))
if (length(ids.intersect) > 0L) {
stopf("Some of the jobs you submitted are already present on the batch system! E.g. id=%i.",
ids.intersect[1L])
}
}
if (limit.concurrent.jobs && (cf$name %in% c("Interactive",
"Local", "Multicore", "SSH") || is.null(cf$listJobs))) {
warning("Option 'max.concurrent.jobs' is enabled, but your cluster functions implementation does not support the listing of system jobs.\n",
"Option disabled, sleeping 5 seconds for safety reasons.")
limit.concurrent.jobs = FALSE
Sys.sleep(5)
}
if (n > 5000L) {
warningf(collapse(c("You are about to submit '%i' jobs.",
"Consider chunking them to avoid heavy load on the scheduler.",
"Sleeping 5 seconds for safety reasons."), sep = "\n"),
n)
Sys.sleep(5)
}
saveConf(reg)
is.chunked = is.list(ids)
info("Submitting %i chunks / %i jobs.", n, if (is.chunked)
sum(viapply(ids, length))
else n)
info("Cluster functions: %s.", cf$name)
info("Auto-mailer settings: start=%s, done=%s, error=%s.",
conf$mail.start, conf$mail.done, conf$mail.error)
fs.timeout = conf$fs.timeout
staged = conf$staged.queries && !is.na(fs.timeout)
interrupted = FALSE
submit.msgs = buffer(type = "list", capacity = 1000L, value = dbSendMessages,
reg = reg, max.retries = 10000L, sleep = function(r) 5,
staged = staged, fs.timeout = fs.timeout)
logger = makeSimpleFileLogger(file.path(reg$file.dir, "submit.log"),
touch = FALSE, keep = 1L)
on.exit({
if (interrupted && exists("batch.result", inherits = FALSE)) {
submit.msgs$push(dbMakeMessageSubmitted(reg, id,
time = submit.time, batch.job.id = batch.result$batch.job.id,
first.job.in.chunk.id = if (is.chunked) id1 else NULL,
resources.timestamp = resources.timestamp))
}
info("Sending %i submit messages...\nMight take some time, do not interrupt this!",
submit.msgs$pos())
submit.msgs$clear()
if (logger$getSize()) messagef("%i temporary submit errors logged to file '%s'.\nFirst message: %s",
logger$getSize(), logger$getLogfile(), logger$getMessages(1L))
})
info("Writing %i R scripts...", n)
resources.timestamp = saveResources(reg, resources)
rscripts = writeRscripts(reg, cf, ids, chunks.as.arrayjobs,
resources.timestamp, disable.mail = FALSE, delays = getDelays(cf,
job.delay, n))
waitForFiles(rscripts, timeout = fs.timeout)
dbSendMessage(reg, dbMakeMessageKilled(reg, unlist(ids),
type = "first"), staged = staged, fs.timeout = fs.timeout)
bar = makeProgressBar(max = n, label = "SubmitJobs")
bar$set()
tryCatch({
for (i in seq_along(ids)) {
id = ids[[i]]
id1 = id[1L]
retries = 0L
repeat {
if (limit.concurrent.jobs && length(cf$listJobs(conf,
reg)) >= conf$max.concurrent.jobs) {
batch.result = makeSubmitJobResult(status = 10L,
batch.job.id = NA_character_, "Max concurrent jobs exhausted")
}
else {
interrupted = TRUE
submit.time = now()
batch.result = cf$submitJob(conf = conf, reg = reg,
job.name = sprintf("%s-%i", reg$id, id1),
rscript = rscripts[i], log.file = getLogFilePath(reg,
id1), job.dir = getJobDirs(reg, id1), resources = resources,
arrayjobs = if (chunks.as.arrayjobs)
length(id)
else 1L)
}
if (batch.result$status == 0L) {
submit.msgs$push(dbMakeMessageSubmitted(reg,
id, time = submit.time, batch.job.id = batch.result$batch.job.id,
first.job.in.chunk.id = if (is.chunked)
id1
else NULL, resources.timestamp = resources.timestamp))
interrupted = FALSE
bar$inc(1L)
break
}
interrupted = FALSE
if (batch.result$status > 0L && batch.result$status <=
100L) {
if (is.finite(max.retries) && retries > max.retries)
stopf("Retried already %i times to submit. Aborting.",
max.retries)
Sys.sleep(wait(retries))
logger$log(batch.result$msg)
retries = retries + 1L
}
else if (batch.result$status > 100L && batch.result$status <=
200L) {
stopf("Fatal error occured: %i. %s", batch.result$status,
batch.result$msg)
}
else {
stopf("Illegal status code %s returned from cluster functions!",
batch.result$status)
}
}
}
}, error = bar$error)
return(invisible(ids))
})(reg = list(id = "bpmapply", version = list(platform = "x86_64-unknown-linux-gnu",
arch = "x86_64", os = "linux-gnu", system = "x86_64, linux-gnu",
status = "", major = "3", minor = "0.2", year = "2013", month = "09",
day = "25", svn rev = "63987", language = "R", version.string = "R version 3.0.2 (2013-09-25)",
nickname = "Frisbee Sailing"), RNGkind = c("Mersenne-Twister",
"Inversion"), db.driver = "SQLite", db.options = list(), seed = 693613467L,
file.dir = "/home/poirierj//BiocParallel_tmp_55523b7bd6c2",
sharding = TRUE, work.dir = "/home/poirierj", src.dirs = character(0),
src.files = character(0), multiple.result.files = FALSE,
packages = list(BatchJobs = list(version = list(c(1L, 3L))))),
ids = list(c(2L, 4L, 5L, 7L, 9L), c(1L, 3L, 6L, 8L, 10L)))
9: do.call(submitJobs, pars)
8: withCallingHandlers(expr, message = function(c) invokeRestart("muffleMessage"))
7: suppressMessages(do.call(submitJobs, pars))
6: bpmapply(FUN, X, MoreArgs = list(...), SIMPLIFY = FALSE, USE.NAMES = FALSE,
resume = resume, BPPARAM = BPPARAM)
5: bpmapply(FUN, X, MoreArgs = list(...), SIMPLIFY = FALSE, USE.NAMES = FALSE,
resume = resume, BPPARAM = BPPARAM)
4: bplapply(X, FUN, ..., resume = resume, BPPARAM = x)
3: bplapply(X, FUN, ..., resume = resume, BPPARAM = x)
2: bplapply(1:10, identity)
1: bplapply(1:10, identity)
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] BiocParallel_0.4.1 BatchJobs_1.3 BBmisc_1.7

loaded via a namespace (and not attached):
[1] brew_1.0-6 checkmate_1.2 codetools_0.2-8 DBI_0.2-7
[5] digest_0.6.4 fail_1.2 foreach_1.4.2 iterators_1.0.7
[9] parallel_3.0.2 RSQLite_0.11.4 sendmailR_1.1-2 stringr_0.6.2
[13] tools_3.0.2

BatchJobs does not check if the arguments provided to a function are sufficient

If I pass insufficient arguments to a function, it would be nice if BatchJobs failed already at registry creation instead of silently dropping the results.

> library(BatchJobs)
> f = function(x,y) x+y
> reg = batchMapQuick(f, c(1:5))
> reduceResultsList(reg, fun=function(job, res) res)
Reducing 0 results...
list()

Currently, I use a wrapper to check the input like the one below. Please feel free to add these checks if you find them useful.

l. = list(...)
fun = match.fun(fun)
funargs = formals(fun)
required = names(funargs)[unlist(lapply(funargs, function(f) class(f) == 'name'))]
provided = names(c(l., more.args))

if (length(provided) > 0) {
    if (any(nchar(provided) == 0))
        stop("All arguments that will be provided to the function must be named")

    sdiff = setdiff(setdiff(required, provided), '...')
    if (length(sdiff) > 0)
        stop(paste("Argument required but not provided:", paste(sdiff, collapse=" ")))
}

sdiff = setdiff(provided, names(funargs))
if (length(sdiff) > 0 && !'...' %in% names(funargs))
    stop(paste("Argument provided but not accepted by function:", paste(sdiff, collapse=" ")))
dups = duplicated(provided)
if (any(dups))
    stop(paste("Argument duplicated:", paste(provided[dups], collapse=" ")))

question - stuck at batchMap

I am trying to set up a BatchJobs run with 8 scenarios. The function I use is a wrapper that puts the parameters into the right form for the functions that actually do the computing. All parameters in the call are vectors of length 8, defined either as columns of matrices or as vectors created with rep(). However, the call is not interpreted as expected; the message I get is:

Error in data.frame(fun_id = fun.id, pars = pars, jobname = jobname) :
  arguments imply differing number of rows: 1, 8, 0

FUNCTION DEFINITION BELOW
preFixRes <- function(nRep, la01, la02, la03, la11, la12, la13, ct1, ct2, ct3,
                      n, rho1, rho2, rho3, rho4, rho5,
                      gamma1, gamma2, gamma3, gamma4, gamma5,
                      optRho, optGamma, seeds, cLams) {
  n0 <- n/2
  n1 <- n0
  cttfs <- c(ct1, ct2, ct3)
  rho <- c(rho1, rho2, rho3, rho4, rho5)
  cLamb <- rep(cLams, 3)
  gamma <- c(gamma1, gamma2, gamma3, gamma4, gamma5)
  rez <- runSimul(nRep, Scennr, la01, la02, la03, la11, la12, la13, ctffs,
                  n0, n1, rho, gamma, optRho, optGamma, seeds, cLamb)
  resMat <- fixResTests1(rez, nRep, length(rho))
  evs <- cEvs(rez, nRep)
  return(list(resMat=resMat, evs=evs))
}
Question--what is being expected?
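
The batchMap()/batchMapQuick() call itself is not shown above, so this is only a hedged illustration with a simplified stand-in function: batchMap() expects every per-job argument to be a named vector with one element per job (here 8), and anything constant across jobs to go into more.args; "arguments imply differing number of rows" typically means these lengths disagree.

library(BatchJobs)
g = function(nRep, la01, ct1, seeds) nRep * la01 + ct1   # simplified stand-in for preFixRes
reg = makeRegistry(id = "scenarios", file.dir = tempfile())
batchMap(reg, g,
         nRep = rep(100, 8),              # length 8: one value per scenario
         la01 = runif(8),                 # length 8
         ct1  = seq(0.1, 0.8, by = 0.1),  # length 8
         more.args = list(seeds = 42))    # constant shared by all 8 jobs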

Learn walltime / resource requirements?

SRC: https://code.google.com/p/batchjobs/issues/detail?id=20

On some HPC systems (like ours) it is necessary to set the walltime at least roughly correct to be able to compute efficiently - and not overestimate it by a factor of 10 to be on the safe side.

For some experiments, especially if you mix very different algorithms with BatchExperiments, this can become very difficult for the user. He has to do extensive "pre-runs", note the execution times, and then submit jobs in groups.

One current way out is to use chunking, to simply mix long-running jobs with short ones.

But I am still wondering whether we cannot automate what the user does (as explained above) by learning the requirements automatically.

Maybe this is a weird idea and overkill; in any case this is of low priority and won't be implemented soon.
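
A rough sketch of what such learning could start from, for an existing registry reg (hypothetical; it assumes getJobInfo() and its time.running column are available in the installed BatchJobs version): estimate a walltime for the remaining jobs from the run times of the jobs that already finished, plus a safety margin.

done = findDone(reg)                                           # jobs with recorded run times
times = getJobInfo(reg, done)$time.running                     # run times in seconds
walltime = ceiling(1.5 * quantile(times, 0.95, na.rm = TRUE))  # hedged safety margin
submitJobs(reg, findNotDone(reg), resources = list(walltime = walltime))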
