batchjobs's People

Contributors

berndbischl, gaborcsardi, henrikbengtsson, jakob-r, mllg, readmecritic, surmann

batchjobs's Issues

Wrong chunk labels in info mails

Situation: I have 16 chunks, and by default an info mail is sent for the first and the last one.
For the first chunk it correctly says "Chunk 1 has started/finished".
But the label of the last one is not correct.
It says
"Chunk 751 has started/finished", although 751 is the ID of the first job in chunk 16.

Memory usage scales (somewhat) linearly with chunk.size

Note: this one is a bit vague; if you need an example of it happening, please let me know.

Consider the following: I use BatchJobs for a function call where more.args contains a large matrix, in my case about 20,000 x 1,000, while the return value is much smaller, say 1 x 1,000.

Running this with chunk.size = 1 requires about 300-400 MB of memory. If I run it with chunk.size = 5, it uses about 2 GB; chunk.size = 10 requires a bit under 4 GB. If I have 30,000 jobs that I'd like to put in chunks of 50, this becomes a problem.
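
A minimal sketch of the setup described above (the matrix size, function, and job count are placeholders, not the original code):

library(BatchJobs)
big = matrix(rnorm(20000 * 1000), nrow = 20000)   # large matrix passed to every job via more.args
f = function(i, m) colMeans(m)                    # small return value, roughly 1 x 1,000
reg = makeRegistry(id = "memtest", file.dir = tempfile())
batchMap(reg, f, i = 1:100, more.args = list(m = big))
# Memory usage on the node grows with the chunk size, e.g.:
submitJobs(reg, chunk(getJobIds(reg), chunk.size = 10))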

However, there is no reason for the function calls to use so much memory. If you:

  • Load more.args once as a shared reference copy (to avoid re-loading it and putting load on the file system),
  • Copy it once for the function call, and
  • Clean up everything afterwards,

then memory usage would stay more or less constant with increasing chunk size.

Maybe using a new.env() for the function call and completely deleting the environment afterwards (as suggested in #35) would also solve this issue.

showLog does not work

On SLURM I get this:

sh: 1: /usr/bin/less +45: not found

The problem seems to be the +45.

getOption("pager")
[1] "/opt/R/R-3.0.2/lib/R/bin/pager"
Sys.getenv("PAGER")
[1] "/usr/bin/less"

methods not loaded by Rscript

BatchJobs execution gets halted before it can queue the jobs.

options(BatchJobs.on.slave=TRUE, BatchJobs.resources.path='/import/scratch/user/dpuru/BatchJobs-scratch/bmq_7fbd23130dcd/resources/resources_1392927734.RData')
library(BatchJobs)
Loading required package: BBmisc
res = BatchJobs:::doJob(
  reg=loadRegistry('/import/scratch/user/dpuru/BatchJobs-scratch/bmq_7fbd23130dcd'),
  ids=c(1L),
  multiple.result.files=FALSE,
  disable.mail=FALSE,
  first=1L,
  last=2L,
  array.id=NA)

2014-02-20 14:22:16: Starting job on node stluhpcprd837.
Loading registry: /import/scratch/user/dpuru/BatchJobs-scratch/bmq_7fbd23130dcd/registry.RData
Loading conf: /import/scratch/user/dpuru/BatchJobs-scratch/bmq_7fbd23130dcd/conf.RData
Auto-mailer settings: start=none, done=none, error=none.
Setting work dir: /import/scratch/user/dpuru/BatchJobs-scratch
Error in sendMail(reg, job, result.str, "", disable.mail, condition = "start", :
  could not find function "is"
Calls: -> doSingleJob -> sendMail
Setting work back to: /import/scratch/user/dpuru/BatchJobs-scratch
Memory usage according to gc:
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 306867 16.4     467875   25   350000 18.7
Vcells 448981  3.5     905753    7   905753  7.0
Execution halted

Rscript does not load the package "methods", so simply putting library(methods) at the beginning of the script gets us past this error and queues the jobs. Eventually, however, all the jobs expire due to the exact same issue occurring when the slave jobs begin execution on the nodes.

Sys.sleep(0.000000)
options(BatchJobs.on.slave=TRUE,
  BatchJobs.resources.path='/import/scratch/user/dpuru/BatchJobs-scratch/bmq_1e8658cc07c6/resources/resources_1392937763.RData')
library(BatchJobs)
res = BatchJobs:::doJob(
  reg=loadRegistry('/import/scratch/user/dpuru/BatchJobs-scratch/bmq_1e8658cc07c6'),
  ids=c(1L),
  multiple.result.files=FALSE,
  disable.mail=FALSE,
  first=1L,
  last=2L,
  array.id=NA)
BatchJobs:::setOnSlave(FALSE)

Is there a way to include library(methods) in these files too?
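
One possible workaround until this is handled internally (a sketch, assuming it is acceptable to declare "methods" as a required package of the registry): load methods explicitly in the master session started via Rscript, and list it in packages so it is attached on the slaves as well.

library(methods)     # Rscript does not attach this by default
library(BatchJobs)
reg = makeRegistry(id = "example", file.dir = tempfile(), packages = "methods")
batchMap(reg, function(x) is(x, "numeric"), 1:3)
submitJobs(reg)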

Functions that fail on LSF do not fail in interactive mode

The following function runs fine in interactive mode but fails on LSF, because x is not defined in the new environment on the node.

> library(BatchJobs)
> x = 5
> f = function(y) x+y
> reg = batchMapQuick(f, c(1:3))
> reduceResultsList(reg, fun=function(job, res) res)
$`1`
[1] 6
$`2`
[1] 7
$`3`
[1] 8

I'm not arguing that this should work on LSF, but in my opinion it should also fail in interactive mode, especially since interactive mode is used for debugging and should catch errors that would occur in production.

I think a good solution would be to evaluate all function calls in a new.env(). This might also have the advantage that the environment can be cleaned up more easily after the function has returned.
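
A minimal sketch of that suggestion (a hypothetical helper, not BatchJobs internals): re-binding the job function to a fresh environment that cannot see the master's globals makes the interactive run fail the same way the LSF run does.

call.in.clean.env = function(fun, args) {
  env = new.env(parent = baseenv())   # isolate the call from the global environment
  environment(fun) = env              # free variables are no longer looked up in .GlobalEnv
  do.call(fun, args)
}
x = 5
f = function(y) x + y
f(1)                                  # works interactively: 6
call.in.clean.env(f, list(y = 1))     # Error: object 'x' not found, as on the cluster

Note that with parent = baseenv() functions from other attached packages would not be found either, so the real isolation level would need some thought.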

BatchJobs should allow sourcing scripts on the nodes

It is possible to define packages to be loaded on the nodes, but not scripts to source(). This matters when the function I call resides in another file and uses helper functions defined there.

Consider the following example (working in interactive mode, not working on LSF):

caller.r

library(BatchJobs)
source('callee.r')
reg = batchMapQuick(primary.func, c(1,2), temporary=F)

callee.r

myglobal <<- "123"
primary.func = function(val) {
    print(myglobal) # fails
    secondary.func(val) # fails
}
secondary.func = function(val) {
}

The workaround I'm currently using is to source("callee.r") inside primary.func of callee.r. This is not only ugly but also dangerous because of possible infinite recursion.

I think the nicest way to handle this would be to add an option to source on the node (analogous to loading packages).
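
For reference, later BatchJobs versions expose src.files/src.dirs arguments in makeRegistry() (they also appear in the src.dirs issues further down); a sketch of the requested behaviour using that interface, assuming callee.r lies in the work directory and is sourced on both the master and the nodes:

library(BatchJobs)
reg = makeRegistry(id = "srcexample", file.dir = tempfile(), src.files = "callee.r")
batchMap(reg, primary.func, c(1, 2))   # primary.func comes from callee.r
submitJobs(reg)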

Debian Med autotest problem

This problem was reported by the Debian Med packaging team from their autopkgtest:


LC_ALL=C R --no-save < run-all.R

and noticed that running this as a normal user results in

  1. Error: batchExpandGrid ------------------------------------------------------
    Could not create dir: unittests-files/unittestee2f8ba97c/registry
    1: makeTestRegistry() at test_batchExpandGrid.R:4
    2: makeRegistry(id = "unittests", seed = 1, packages = packages, file.dir = rd, work.dir = "unittests-files",
    ...) at /usr/lib/R/site-library/BatchJobs/tests/helpers.R:39
    3: makeRegistryInternal(id, file.dir, sharding, work.dir, multiple.result.files, seed,
    packages, src.dirs, src.files)
    4: checkDir(file.dir, create = TRUE, check.empty = TRUE, check.posix = TRUE, msg = TRUE)
    5: stop("Could not create dir: ", path)
  2. Error: batchMap -------------------------------------------------------------
    Could not create dir: unittests-files/unittestee26666541a/registry
    1: makeTestRegistry() at test_batchMap.R:4
    2: makeRegistry(id = "unittests", seed = 1, packages = packages, file.dir = rd, work.dir = "unittests-files",
    ...) at /usr/lib/R/site-library/BatchJobs/tests/helpers.R:39
    3: makeRegistryInternal(id, file.dir, sharding, work.dir, multiple.result.files, seed,
    packages, src.dirs, src.files)
    4: checkDir(file.dir, create = TRUE, check.empty = TRUE, check.posix = TRUE, msg = TRUE)

since normal users cannot write to /usr/lib/R/site-library/.

I tried as root, which seems to work without error, but I get

test_package("BatchJobs")
batchExpandGrid : Warning in sqliteInitDriver(max.con, fetch.default.rec, force.reload, shared.cache) :
RS-DBI driver warning: (SQLite mismatch between compiled version 3.8.5 and runtime version 3.8.4.3)
Warning in sqliteInitDriver(max.con, fetch.default.rec, force.reload, shared.cache) :
RS-DBI driver warning: (SQLite mismatch between compiled version 3.8.5 and runtime version 3.8.4.3)
Warning in sqliteInitDriver(max.con, fetch.default.rec, force.reload, shared.cache) :
RS-DBI driver warning: (SQLite mismatch between compiled version 3.8.5 and runtime version 3.8.4.3)
.Warning in sqliteInitDriver(max.con, fetch.default.rec, force.reload, shared.cache) :
RS-DBI driver warning: (SQLite mismatch between compiled version 3.8.5 and runtime version 3.8.4.3)

Graph of dependent jobs?

SRC: https://code.google.com/p/batchjobs/issues/detail?id=19

For some experiments it MIGHT be useful to be able to specify a graph of dependent jobs, similar to how targets are defined in a Makefile.

This means that for some jobs to start, the results of others have to be fully completed. The solution is probably a simple topological sort with respect to these preconditions (see the sketch below).

But I want to collect more use cases before we look into this again.
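
A minimal sketch of the topological-sort idea (hypothetical, not part of BatchJobs): given a named list mapping each job to its prerequisites, order the jobs so that every job comes after everything it depends on.

topo.sort = function(deps) {
  # deps: named list; deps[["b"]] = "a" means job "b" requires job "a" to be done
  sorted = character(0)
  pending = names(deps)
  while (length(pending) > 0L) {
    ready = pending[vapply(deps[pending], function(d) all(d %in% sorted), logical(1L))]
    if (length(ready) == 0L)
      stop("Cycle detected in the job dependency graph")
    sorted = c(sorted, ready)
    pending = setdiff(pending, ready)
  }
  sorted
}
topo.sort(list(a = character(0), b = "a", c = c("a", "b")))
# [1] "a" "b" "c"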

Problem when loading BatchJobs

Hi, I just tried to install BatchJobs and BatchExperiments on a new computer and encountered the following problem:

I've installed BatchJobs using the command line
devtools::install_github("BatchJobs", username="tudo-r")

Then I tried to load the package and got the following error-message:
library(BatchJobs)
Sourcing configuration file: '/Library/Frameworks/R.framework/Versions/3.0/Resources/library/BatchJobs/etc/BatchJobs_global_config.R'
Error: .onAttach failed in attachNamespace() for 'BatchJobs', details:
  call: sprintf(fmt, x$cluster.functions$name, x$mail.from, x$mail.to,
  error: could not find function "listToShortString"
Error: package or namespace load failed for 'BatchJobs'

odd behavior of registry with multicore approach

I noticed the FAQ on 'multicore does not work' ... this is not my issue; multicore works very nicely in certain respects. However, tracking submission in the registry seems not to work. Many of the job counting utilities do work, but findSubmitted does not.

showStatus(campWBsplreg7)
Status for 239 jobs at 2014-03-20 13:19:12
Submitted: 0 ( 0.00%)
Started: 21 ( 8.79%)
Running: 0 ( 0.00%)
Done: 11 ( 4.60%)
Errors: 0 ( 0.00%)
Expired: 0 ( 0.00%)
Time: min=2834.00s avg=3510.91s max=3741.00s

campWBsplreg7
Job registry: mar3
Number of jobs: 239
Files dir: /udd/stvjc/VM/CAMPWB_200K/mar3f
Work dir: /udd/stvjc/VM/CAMPWB_200K
Multiple result files: FALSE
Seed: 123
Required packages: BatchJobs
?makeRegistry
findRunning(campWBsplreg7)
integer(0)
findDone(campWBsplreg7)
[1] 1 2 3 4 5 6 7 8 9 10 11
findNotDone(campWBsplreg7)
[1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
[19] 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
[37] 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
[55] 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
[73] 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
[91] 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
[109] 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
[127] 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155
[145] 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173
[163] 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
[181] 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209
[199] 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227
[217] 228 229 230 231 232 233 234 235 236 237 238 239
findStarted(campWBsplreg7)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
findSubmitted(campWBsplreg7)
integer(0)

Display bug in testJob's approximate running time

Hi,

I'm using BatchExperiments version 1.0-968.
I tested a job using testJob() and I got

Approximate running time: 1.07 seconds

But it should be minutes, because the job actually took 1.07 * 60 = 64.2 seconds (64.162 according to system.time()).
Just a minor bug but I wanted to share it.

SZ

check.posix

Why was this actually made an option?
And why is it not part of the config?

Is it documented somewhere?

src.dir path in registry

Is there a specific reason why these have to be relative to the work.dir?

I already have a few cases where I would like to set absolute paths.

Why don't we allow this, as long as the user makes sure these paths are accessible on the shared file system?

Cluster function commands should be made slightly more configurable

Using the package here at the LMU HPC (SLURM) cluster, I found out two things:

a) The package basically works out of the box

b) Listing jobs does not work. The problem is trivial:
instead of what we currently run,
squeue -h -o %i -u $USER
I need to run
squeue --clusters=serial -h -o %i -u $USER

The thing is that this command is hard-coded in makeClusterFunctionsSLURM.

Now I could surely copy makeClusterFunctionsSLURM and adapt it, but maybe it would be simpler to make that command configurable?
Maybe there should be an option in BatchJobs.R that one could set?
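
For comparison, the SGE cluster functions quoted in a later issue below already accept a list.jobs.cmd argument; a hypothetical sketch of the same idea for SLURM (this argument does not exist in makeClusterFunctionsSLURM as shipped, and the template path is a placeholder):

cluster.functions = makeClusterFunctionsSLURM(
  "/path/to/slurm.tmpl",
  list.jobs.cmd = c("squeue", "--clusters=serial", "-h", "-o %i", "-u $USER"))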

batchMapQuick chunks not working (on LSF)

Consider the code below. Each time I have 4 jobs that should be put into one chunk, but LSF always submits 4 individual jobs, disregarding the chunks I specify.

Note: this works fine when specifying the chunks manually (last example).

square = function(x) { x*x }
library(BatchJobs)

reg = batchMapQuick(square, c(1:4), chunk.size=10)
#Saving conf: /some/dir/bmq_341e69aa7610/conf.RData
#Submitting 4 chunks / 4 jobs.
#Cluster functions: LSF.

reg = batchMapQuick(square, c(1:4), n.chunks=1)
#Saving conf: /some/dir/bmq_3c8c7b37b8cb/conf.RData
#Submitting 4 chunks / 4 jobs.
#Cluster functions: LSF.

reg = makeRegistry(id="BatchJobsExample", file.dir=tempfile(), seed=123)
batchMap(reg, square, c(1:4))
chunked = chunk(getJobIds(reg), n.chunks=1, shuffle=TRUE)
submitJobs(reg, chunked)
#Saving conf: /tmp/RtmpsWXKsa/file6bec917d96c/conf.RData
#Submitting 1 chunks / 4 jobs.
#Cluster functions: LSF.

The problem seems to be lines 55/56 in R/batchMapQuick.R:

if (!missing(chunk.size) && !missing(n.chunks))
    ids = chunk(ids, chunk.size=chunk.size, n.chunks=n.chunks, shuffle=TRUE)

Here, the condition is only true if both chunk.size and n.chunks are given. It should rather be something like:

if (!missing(chunk.size) && !missing(n.chunks))
    stop("Providing both chunk.size and n.chunks makes no sense")
if (!missing(chunk.size))
    ids = chunk(ids, chunk.size=chunk.size, shuffle=TRUE)
if (!missing(n.chunks))
    ids = chunk(ids, n.chunks=n.chunks, shuffle=TRUE)

SSH mode: bashrc

I noticed on the CIP cluster in Munich that unconditionally printing output in your .bashrc leads to errors during submission.

It might not be the most relevant issue, but let's see if we can make the SSH scripts even more robust.

src.dirs does not work on Lido

Here is the minimal example:

library(BatchJobs)
r = makeRegistry("test", file.dir=as.character(sample(10000, 1)), src.dirs = "Blubb")
f = function(x) x^2
batchMap(r, f, 1:20)
submitJobs(r)

where Blubb is a directory inside my working directory, containing an arbitrary R script.

We see two bugs here:
First, the error on Lido is:
Error in getRScripts(dirs) : Directories not found: Blubb
Second, this error is not transferred into the registry: there the jobs are only labeled as "submitted"; they are neither started nor flagged as errors, but they should be.

dbDoQuery() occasionally gives "RS_SQLite_fetch: failed first step: attempt to write a readonly database"

PROBLEM

Occasionally one of my ~20 jobs fails with the following error:

$: more BiocParallel_tmp_6fd624fb91629/jobs/14/14.out

Command: Rscript --verbose "/path/to/BiocParallel_tmp_6fd624fb91629/jobs/14/14.R"
running
  '/opt/R/R-3.1.1/lib64/R/bin/R --slave --no-restore --file=/path/to/BiocParallel_tmp_6fd624fb91629/jobs/14/14.R'

Loading required package: BBmisc
Loading required package: methods
Loading registry: /path/to/BiocParallel_tmp_6fd624fb91629/registry.RData
Loading conf:
2014-09-09 15:57:16: Starting job on node n0.
Auto-mailer settings: start=none, done=none, error=none.
Setting work dir: /path/to
Warning in sqliteCloseConnection(conn, ...) :
  RS-DBI driver warning: (closing pending result sets before closing this connection)
[1] "Error in sqliteFetch(rs, n = -1, ...) : \n  RSQLite driver: (RS_SQLite_fetch: failed first step: attempt to write a readonly database)\n"
[1] "SELECT job_id, fun_id, pars, jobname, seed FROM bpmapply_expanded_jobs WHERE job_id IN (14)"
Error in dbDoQuery(reg, query) :
  Error in dbDoQuery. Error in sqliteFetch(rs, n = -1, ...) :
  RSQLite driver: (RS_SQLite_fetch: failed first step: attempt to write a readonly database)
 (SELECT job_id, fun_id, pars, jobname, seed FROM bpmapply_expanded_jobs WHERE job_id IN (14))
Calls: <Anonymous> ... dbGetJobs.Registry -> dbSelectWithIds -> dbDoQuery -> stopf
Setting work back to: /home/henrik
Memory usage according to gc:
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 328855 17.6     467875 25.0   407500 21.8
Vcells 488477  3.8    1031040  7.9   786431  6.0
Execution halted
Command: Rscript --verbose "/path/to/BiocParallel_tmp_6fd624fb91629/jobs/14/14.R" ... DONE

FACTS

  • It only happens occasionally, i.e. hard to make a reproducible example.
  • Other very similar jobs run successfully on the same node.
  • There is nothing specific with the jobs failing this way.

TROUBLESHOOTING

Looking at BatchJobs:::dbDoQuery, I see that you check for "(lock|i/o)"-related errors and have dbDoQuery() retry several times before giving up. My best guess is that a similar issue occurs here: BatchJobs runs on a shared file system (NFS), and multiple jobs try to access/update the SQLite database, which is a file on this shared file system. Some job grabs this SQLite file and locks it. Your wait-and-try-again approach handles this lock case. However, here it seems as if the file can also end up in a read-only state, which is not handled. Maybe it has to do with the latency with which file information is propagated on a shared file system, so that the read-only state becomes visible before the lock does. Just guessing here...
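
A rough sketch of the suggested handling (hypothetical; the real retry loop lives inside BatchJobs:::dbDoQuery): treat the "readonly database" message like the lock/I/O messages and retry instead of failing immediately.

retry.patterns = "(lock|i/o|readonly)"       # extended version of the pattern mentioned above
run.with.retries = function(run.query, max.retries = 100L, sleep = 5) {
  for (i in seq_len(max.retries)) {
    res = try(run.query(), silent = TRUE)
    if (!inherits(res, "try-error"))
      return(res)
    if (!grepl(retry.patterns, as.character(res), ignore.case = TRUE))
      stop(as.character(res))                # a genuine error, do not retry
    Sys.sleep(sleep)                         # transient NFS/SQLite hiccup: wait and retry
  }
  stop("Query failed after ", max.retries, " retries")
}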

Always use R CMD BATCH in cluster functions

Reasons:

  1. Code becomes more homogeneous.

  2. We don't have trouble with packages that Rscript does not load by default.

  3. The slightly greater overhead does not matter: BatchJobs is not meant for jobs that only take 2 seconds; we have chunking for that.

Support for other DBMS

This was requested at the Bioconductor Developer Meeting.
Without staged.queries -> permanent database lock.
With staged.queries -> overburdened file system server (not sure if chunked nicely)

Started to look into it, but we might want to wait until rstats-db/dbi (picked up by Hadley) is more mature to avoid workarounds for dbPreparedQuery etc. See https://stat.ethz.ch/pipermail/r-sig-db/2013q4/001322.html

Add an option to respect delayed file systems

See Bioconductor/BiocParallel#33

  • Add an option to enable or disable additional checks
  • After writing R scripts in submitJobs(), sleep until all(file.exists(...)) == TRUE (see the sketch after this list)
  • How to check if the database file has been rewritten to disk in submitJobs?
  • staged.queries may even fail if we cannot rely on the order of files (first created appear first) -> check what reordering would break (besides temporary inconsistencies)
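
A minimal sketch of the second point above (a simplified stand-in; the submitJobs() source quoted in a later issue already calls a waitForFiles() helper with fs.timeout):

wait.for.files = function(paths, timeout = 60, sleep = 1) {
  start = Sys.time()
  while (!all(file.exists(paths))) {
    if (difftime(Sys.time(), start, units = "secs") > timeout)
      stop("Files still not visible after ", timeout, " seconds: ",
           paste(paths[!file.exists(paths)], collapse = ", "))
    Sys.sleep(sleep)   # give the delayed (NFS) file system time to catch up
  }
  invisible(TRUE)
}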

Cannot run simple SGE example

I am trying to get a simple example working on my SGE cluster; session info is below. I'd greatly appreciate any ideas for figuring this out. BiocParallel is not available for R 3.1.1.

Maybe my SGE template is not correct? I suspect not, because this code seems to fail already on a simple qstat.

I am using the following configuration:
cluster.functions = makeClusterFunctionsSGE("/home/poirierj/R_libs/BatchJobs/etc/simple.tmpl", list.jobs.cmd = c("qstat", "-u poirierj"))
mail.start = "none"
mail.done = "none"
mail.error = "none"
db.driver = "SQLite"
db.options = list()
debug = TRUE

library(BatchJobs)
Loading required package: BBmisc
Sourcing configuration file: '/home/poirierj/R_libs/BatchJobs/etc/BatchJobs_global_config.R'
BatchJobs configuration:
cluster functions: SGE
mail.from:
mail.to:
mail.start: none
mail.done: none
mail.error: none
default.resources:
debug: TRUE
raise.warnings: FALSE
staged.queries: FALSE
max.concurrent.jobs: Inf
fs.timeout: NA
library(BiocParallel)
param <- BatchJobsParam(2)
register(param)
x<-bplapply(1:10, identity)
OS cmd: qstat -u poirierj
OS result:
$exit.code
[1] 0

$output
character(0)

Error: $ operator is invalid for atomic vectors

traceback()
15: fun(getBatchJobsConf(), reg)
14: getBatchIds(reg, "Cannot find jobs on system")
13: dbFindOnSystem(reg, unlist(ids))
12: as.vector(y)
11: intersect(unlist(ids), dbFindOnSystem(reg, unlist(ids)))
10: (function (reg, ids, resources = list(), wait, max.retries = 10L,
chunks.as.arrayjobs = FALSE, job.delay = FALSE)
{
getDelays = function(cf, job.delay, n) {
if (is.logical(job.delay)) {
if (job.delay && n > 100L && cf$name %nin% c("Interactive",
"Multicore", "SSH")) {
return(runif(n, n * 0.1, n * 0.2))
}
return(delays = rep.int(0, n))
}
vnapply(seq_along(ids), job.delay, n = n)
}
checkRegistry(reg)
syncRegistry(reg)
if (missing(ids)) {
ids = dbFindSubmitted(reg, negate = TRUE)
if (length(ids) == 0L) {
info("All jobs submitted, nothing to do!")
return(invisible(integer(0L)))
}
}
else {
if (is.list(ids)) {
ids = lapply(ids, checkIds, reg = reg, check.present = FALSE)
dbCheckJobIds(reg, unlist(ids))
}
else if (is.numeric(ids)) {
ids = checkIds(reg, ids)
}
else {
stop("Parameter 'ids' must be a integer vector of job ids or a list of chunked job ids (list of integer vectors)!")
}
}
conf = getBatchJobsConf()
cf = getClusterFunctions(conf)
limit.concurrent.jobs = is.finite(conf$max.concurrent.jobs)
n = length(ids)
assertList(resources)
resources = resrc(resources)
if (missing(wait))
wait = function(retries) 10 * 2^retries
else assertFunction(wait, "retries")
if (is.logical(job.delay)) {
assertFlag(job.delay)
}
else {
checkFunction(job.delay, c("n", "i"))
}
if (is.finite(max.retries))
max.retries = asCount(max.retries)
assertFlag(chunks.as.arrayjobs)
if (chunks.as.arrayjobs && is.na(cf$getArrayEnvirName())) {
warningf("Cluster functions '%s' do not support array jobs, falling back on chunks",
cf$name)
chunks.as.arrayjobs = FALSE
}
if (!is.null(cf$listJobs)) {
ids.intersect = intersect(unlist(ids), dbFindOnSystem(reg,
unlist(ids)))
if (length(ids.intersect) > 0L) {
stopf("Some of the jobs you submitted are already present on the batch system! E.g. id=%i.",
ids.intersect[1L])
}
}
if (limit.concurrent.jobs && (cf$name %in% c("Interactive",
"Local", "Multicore", "SSH") || is.null(cf$listJobs))) {
warning("Option 'max.concurrent.jobs' is enabled, but your cluster functions implementation does not support the listing of system jobs.\n",
"Option disabled, sleeping 5 seconds for safety reasons.")
limit.concurrent.jobs = FALSE
Sys.sleep(5)
}
if (n > 5000L) {
warningf(collapse(c("You are about to submit '%i' jobs.",
"Consider chunking them to avoid heavy load on the scheduler.",
"Sleeping 5 seconds for safety reasons."), sep = "\n"),
n)
Sys.sleep(5)
}
saveConf(reg)
is.chunked = is.list(ids)
info("Submitting %i chunks / %i jobs.", n, if (is.chunked)
sum(viapply(ids, length))
else n)
info("Cluster functions: %s.", cf$name)
info("Auto-mailer settings: start=%s, done=%s, error=%s.",
conf$mail.start, conf$mail.done, conf$mail.error)
fs.timeout = conf$fs.timeout
staged = conf$staged.queries && !is.na(fs.timeout)
interrupted = FALSE
submit.msgs = buffer(type = "list", capacity = 1000L, value = dbSendMessages,
reg = reg, max.retries = 10000L, sleep = function(r) 5,
staged = staged, fs.timeout = fs.timeout)
logger = makeSimpleFileLogger(file.path(reg$file.dir, "submit.log"),
touch = FALSE, keep = 1L)
on.exit({
if (interrupted && exists("batch.result", inherits = FALSE)) {
submit.msgs$push(dbMakeMessageSubmitted(reg, id,
time = submit.time, batch.job.id = batch.result$batch.job.id,
first.job.in.chunk.id = if (is.chunked) id1 else NULL,
resources.timestamp = resources.timestamp))
}
info("Sending %i submit messages...\nMight take some time, do not interrupt this!",
submit.msgs$pos())
submit.msgs$clear()
if (logger$getSize()) messagef("%i temporary submit errors logged to file '%s'.\nFirst message: %s",
logger$getSize(), logger$getLogfile(), logger$getMessages(1L))
})
info("Writing %i R scripts...", n)
resources.timestamp = saveResources(reg, resources)
rscripts = writeRscripts(reg, cf, ids, chunks.as.arrayjobs,
resources.timestamp, disable.mail = FALSE, delays = getDelays(cf,
job.delay, n))
waitForFiles(rscripts, timeout = fs.timeout)
dbSendMessage(reg, dbMakeMessageKilled(reg, unlist(ids),
type = "first"), staged = staged, fs.timeout = fs.timeout)
bar = makeProgressBar(max = n, label = "SubmitJobs")
bar$set()
tryCatch({
for (i in seq_along(ids)) {
id = ids[[i]]
id1 = id[1L]
retries = 0L
repeat {
if (limit.concurrent.jobs && length(cf$listJobs(conf,
reg)) >= conf$max.concurrent.jobs) {
batch.result = makeSubmitJobResult(status = 10L,
batch.job.id = NA_character_, "Max concurrent jobs exhausted")
}
else {
interrupted = TRUE
submit.time = now()
batch.result = cf$submitJob(conf = conf, reg = reg,
job.name = sprintf("%s-%i", reg$id, id1),
rscript = rscripts[i], log.file = getLogFilePath(reg,
id1), job.dir = getJobDirs(reg, id1), resources = resources,
arrayjobs = if (chunks.as.arrayjobs)
length(id)
else 1L)
}
if (batch.result$status == 0L) {
submit.msgs$push(dbMakeMessageSubmitted(reg,
id, time = submit.time, batch.job.id = batch.result$batch.job.id,
first.job.in.chunk.id = if (is.chunked)
id1
else NULL, resources.timestamp = resources.timestamp))
interrupted = FALSE
bar$inc(1L)
break
}
interrupted = FALSE
if (batch.result$status > 0L && batch.result$status <=
100L) {
if (is.finite(max.retries) && retries > max.retries)
stopf("Retried already %i times to submit. Aborting.",
max.retries)
Sys.sleep(wait(retries))
logger$log(batch.result$msg)
retries = retries + 1L
}
else if (batch.result$status > 100L && batch.result$status <=
200L) {
stopf("Fatal error occured: %i. %s", batch.result$status,
batch.result$msg)
}
else {
stopf("Illegal status code %s returned from cluster functions!",
batch.result$status)
}
}
}
}, error = bar$error)
return(invisible(ids))
})(reg = list(id = "bpmapply", version = list(platform = "x86_64-unknown-linux-gnu",
arch = "x86_64", os = "linux-gnu", system = "x86_64, linux-gnu",
status = "", major = "3", minor = "0.2", year = "2013", month = "09",
day = "25", svn rev = "63987", language = "R", version.string = "R version 3.0.2 (2013-09-25)",
nickname = "Frisbee Sailing"), RNGkind = c("Mersenne-Twister",
"Inversion"), db.driver = "SQLite", db.options = list(), seed = 693613467L,
file.dir = "/home/poirierj//BiocParallel_tmp_55523b7bd6c2",
sharding = TRUE, work.dir = "/home/poirierj", src.dirs = character(0),
src.files = character(0), multiple.result.files = FALSE,
packages = list(BatchJobs = list(version = list(c(1L, 3L))))),
ids = list(c(2L, 4L, 5L, 7L, 9L), c(1L, 3L, 6L, 8L, 10L)))
9: do.call(submitJobs, pars)
8: withCallingHandlers(expr, message = function(c) invokeRestart("muffleMessage"))
7: suppressMessages(do.call(submitJobs, pars))
6: bpmapply(FUN, X, MoreArgs = list(...), SIMPLIFY = FALSE, USE.NAMES = FALSE,
resume = resume, BPPARAM = BPPARAM)
5: bpmapply(FUN, X, MoreArgs = list(...), SIMPLIFY = FALSE, USE.NAMES = FALSE,
resume = resume, BPPARAM = BPPARAM)
4: bplapply(X, FUN, ..., resume = resume, BPPARAM = x)
3: bplapply(X, FUN, ..., resume = resume, BPPARAM = x)
2: bplapply(1:10, identity)
1: bplapply(1:10, identity)
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] BiocParallel_0.4.1 BatchJobs_1.3 BBmisc_1.7

loaded via a namespace (and not attached):
[1] brew_1.0-6 checkmate_1.2 codetools_0.2-8 DBI_0.2-7
[5] digest_0.6.4 fail_1.2 foreach_1.4.2 iterators_1.0.7
[9] parallel_3.0.2 RSQLite_0.11.4 sendmailR_1.1-2 stringr_0.6.2
[13] tools_3.0.2

BatchJobs does not check if the arguments provided to a function are sufficient

If I pass insufficient arguments to a function, it would be nice if BatchJobs failed already at registry creation instead of silently dropping the results.

> library(BatchJobs)
> f = function(x,y) x+y
> reg = batchMapQuick(f, c(1:5))
> reduceResultsList(reg, fun=function(job, res) res)
Reducing 0 results...
list()

Currently, I use a wrapper to check the input like the one below. Please feel free to add these checks if you find them useful.

l. = list(...)
fun = match.fun(fun)
funargs = formals(fun)
required = names(funargs)[unlist(lapply(funargs, function(f) class(f) == 'name'))]
provided = names(c(l., more.args))

if (length(provided) > 0) {
    if (any(nchar(provided) == 0))
        stop("All arguments that will be provided to the function must be named")

    sdiff = setdiff(setdiff(required, provided), '...')
    if (length(sdiff) > 0)
        stop(paste("Argument required but not provided:", paste(sdiff, collapse=" ")))
}

sdiff = setdiff(provided, names(funargs))
if (length(sdiff) > 0 && !'...' %in% names(funargs))
    stop(paste("Argument provided but not accepted by function:", paste(sdiff, collapse=" ")))
dups = duplicated(provided)
if (any(dups))
    stop(paste("Argument duplicated:", paste(provided[dups], collapse=" ")))

question - stuck at batchMap

I am trying to set up a BatchJobs run with 8 scenarios. The function I use is a wrapper that puts the parameters into the right form for the functions that actually do the computing. All parameters in the call are vectors of length 8, defined either as columns of matrices or as vectors created with rep(). However, the call is not interpreted as expected; the message I get is:

Error in data.frame(fun_id = fun.id, pars = pars, jobname = jobname) :
  arguments imply differing number of rows: 1, 8, 0

FUNCTION DEFINITION BELOW
preFixRes <- function(nRep, la01, la02, la03, la11, la12, la13, ct1, ct2, ct3,
                      n, rho1, rho2, rho3, rho4, rho5,
                      gamma1, gamma2, gamma3, gamma4, gamma5,
                      optRho, optGamma, seeds, cLams) {
  n0 <- n/2
  n1 <- n0
  cttfs <- c(ct1, ct2, ct3)
  rho <- c(rho1, rho2, rho3, rho4, rho5)
  cLamb <- rep(cLams, 3)
  gamma <- c(gamma1, gamma2, gamma3, gamma4, gamma5)
  rez <- runSimul(nRep, Scennr, la01, la02, la03, la11, la12, la13, ctffs,
                  n0, n1, rho, gamma, optRho, optGamma, seeds, cLamb)
  resMat <- fixResTests1(rez, nRep, length(rho))
  evs <- cEvs(rez, nRep)
  return(list(resMat=resMat, evs=evs))
}
Question--what is being expected?
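
The batchMap()/batchMapQuick() call itself is not shown above, so this is only a hedged illustration with a simplified stand-in function: batchMap() expects every per-job argument to be a named vector with one element per job (here 8), and anything constant across jobs to go into more.args; "arguments imply differing number of rows" typically means these lengths disagree.

library(BatchJobs)
g = function(nRep, la01, ct1, seeds) nRep * la01 + ct1   # simplified stand-in for preFixRes
reg = makeRegistry(id = "scenarios", file.dir = tempfile())
batchMap(reg, g,
         nRep = rep(100, 8),              # length 8: one value per scenario
         la01 = runif(8),                 # length 8
         ct1  = seq(0.1, 0.8, by = 0.1),  # length 8
         more.args = list(seeds = 42))    # constant shared by all 8 jobs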

Learn walltime / resource requirements?

SRC: https://code.google.com/p/batchjobs/issues/detail?id=20

On some HPC systems (like ours) it is necessary to set the walltime at least roughly correct to be able to compute efficiently - and not overestimate it by a factor of 10 to be on the safe side.

For some experiments, especially if you mix very different algorithms with BatchExperiments, this can become very difficult for the user. He has to do extensive "pre-runs", note the execution times, and then submit jobs in groups.

One current way out is to use chunking, to simply mix long-running jobs with short ones.

But I am still wondering whether we cannot automate what the user does (as explained above) by learning the requirements automatically.

Maybe this is a weird idea and overkill; in any case this is of low priority and won't be implemented soon.
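
A rough sketch of what such learning could start from, for an existing registry reg (hypothetical; it assumes getJobInfo() and its time.running column are available in the installed BatchJobs version): estimate a walltime for the remaining jobs from the run times of the jobs that already finished, plus a safety margin.

done = findDone(reg)                                           # jobs with recorded run times
times = getJobInfo(reg, done)$time.running                     # run times in seconds
walltime = ceiling(1.5 * quantile(times, 0.95, na.rm = TRUE))  # hedged safety margin
submitJobs(reg, findNotDone(reg), resources = list(walltime = walltime))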
