BatchJobs: Batch computing with R
License: Other
It is annoying that we cannot use absolute paths!
Situation: I have 16 chunks, and for the first and the last one an info mail is sent by default.
For the first one it correctly says "Chunk 1 has started/finished".
But the label of the last one is not correct.
It says
"Chunk 751 has started/finished", but 751 is the ID of the first job in chunk 16.
Note: this one is a bit more vague; if you need an example of this happening, please let me know.
Consider the following: I use BatchJobs for a function call where more.args
contains a large matrix, in my case about 20,000 x 1,000, while the return value is much smaller, say 1 x 1,000.
Running this with chunk.size = 1 requires about 300-400 MB of memory.
If I run it with chunk.size = 5, it uses up about 2 GB, and chunk.size = 10 requires a bit under 4 GB.
If I have 30,000 jobs that I would like to put in chunks of 50, this becomes a problem.
However, there is no reason for the function calls to use so much memory. If you loaded more.args
once as a reference copy (so it is not re-loaded for every job, which also puts load on the file system), memory usage would stay more or less constant with increasing chunk size.
Maybe using a new.env() for the function call and deleting the environment completely afterwards (as suggested in #35) would also solve this issue.
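The idea can be sketched in plain R; the helper below is hypothetical (not the actual BatchJobs internals) and just shows the evaluate-in-a-throwaway-environment pattern:

```r
# Hypothetical sketch: evaluate a job in a fresh environment so that all
# references to the (large) inputs can be dropped and collected afterwards.
run.job.isolated = function(fun, pars, more.args) {
  ee = new.env(parent = globalenv())   # fresh environment per job
  ee$fun = fun
  ee$args = c(pars, more.args)
  res = eval(quote(do.call(fun, args)), envir = ee)
  rm(ee)                               # drop all references to the inputs
  gc()                                 # give memory back between chunked jobs
  res
}

square.plus = function(x, m) x^2 + nrow(m)
run.job.isolated(square.plus, list(x = 3), list(m = matrix(0, 2, 2)))  # -> 11
```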
Uses fail::put internally.
On SLURM I get this:
sh: 1: /usr/bin/less +45: not found
The problem seems to be the +45.
getOption("pager")
[1] "/opt/R/R-3.0.2/lib/R/bin/pager"
Sys.getenv("PAGER")
[1] "/usr/bin/less"
It is annoying, mea culpa, I suck, bla, bla.
It might be nice to be able to start workers on multiple machines in multicore mode, to be more memory-efficient in some scenarios.
Are there viable alternatives for this?
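One existing option that might cover part of this is BatchJobs' SSH mode, which can drive multicore workers on several machines. A hedged config sketch (hostnames and core counts are placeholders for your setup):

```r
# Sketch: distribute multicore work over two machines via SSH.
# "node1"/"node2" and ncpus are placeholders.
library(BatchJobs)
cluster.functions = makeClusterFunctionsSSH(
  makeSSHWorker(nodename = "node1", ncpus = 4),
  makeSSHWorker(nodename = "node2", ncpus = 4)
)
```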
BatchJobs execution gets halted before it can queue the jobs.
options(BatchJobs.on.slave=TRUE, BatchJobs.resources.path='/import/scratch/user/dpuru/BatchJobs-scratch/bmq_7fbd23130dcd/resources/resources_1392927734.RData')
library(BatchJobs)
Loading required package: BBmisc
res = BatchJobs:::doJob(
reg=loadRegistry('/import/scratch/user/dpuru/BatchJobs-scratch/bmq_7fbd23130dcd'),
ids=c(1L),
multiple.result.files=FALSE,
disable.mail=FALSE,
first=1L,
last=2L,
array.id=NA)
2014-02-20 14:22:16: Starting job on node stluhpcprd837.
Loading registry: /import/scratch/user/dpuru/BatchJobs-scratch/bmq_7fbd23130dcd/registry.RData
Loading conf: /import/scratch/user/dpuru/BatchJobs-scratch/bmq_7fbd23130dcd/conf.RData
Auto-mailer settings: start=none, done=none, error=none.
Setting work dir: /import/scratch/user/dpuru/BatchJobs-scratch
Error in sendMail(reg, job, result.str, "", disable.mail, condition = "start", :
could not find function "is"
Calls: -> doSingleJob -> sendMail
Setting work back to: /import/scratch/user/dpuru/BatchJobs-scratch
Memory usage according to gc:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 306867 16.4 467875 25 350000 18.7
Vcells 448981 3.5 905753 7 905753 7.0
Execution halted
Rscript does not load the package "methods", so simply putting library(methods) at the beginning of your script gets us past this error and queues the jobs. But eventually all the jobs expire, because the exact same issue occurs when the slave jobs begin execution on the nodes.
Sys.sleep(0.000000)
options(BatchJobs.on.slave=TRUE,
BatchJobs.resources.path='/import/scratch/user/dpuru/BatchJobs-scratch/bmq_1e8658cc07c6/resources/resources_1392937763.RData')
library(BatchJobs)
res = BatchJobs:::doJob(
reg=loadRegistry('/import/scratch/user/dpuru/BatchJobs-scratch/bmq_1e8658cc07c6'),
ids=c(1L),
multiple.result.files=FALSE,
disable.mail=FALSE,
first=1L,
last=2L,
array.id=NA)
BatchJobs:::setOnSlave(FALSE)
Is there a way to include library(methods) in these files too?
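One possible workaround, assuming the generated scripts are executed via Rscript: Rscript honours the R_DEFAULT_PACKAGES environment variable, so exporting it (e.g. in the shell profile or the job template) makes every script attach methods without editing the files:

```shell
# Tell Rscript to attach 'methods' by default.
export R_DEFAULT_PACKAGES="datasets,utils,grDevices,graphics,stats,methods"

# Sanity check (guarded so it is a no-op where R is not installed):
command -v Rscript >/dev/null &&
  Rscript -e 'cat("methods" %in% loadedNamespaces(), "\n")' || true
```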
The following function runs fine in interactive mode but fails on LSF, because x is not defined in the new environment on the node.
> library(BatchJobs)
> x = 5
> f = function(y) x+y
> reg = batchMapQuick(f, c(1:3))
> reduceResultsList(reg, fun=function(job, res) res)
$`1`
[1] 6
$`2`
[1] 7
$`3`
[1] 8
I'm not arguing that this should work on LSF, but in my opinion it should also fail in interactive mode, especially since interactive mode is used for debugging and should catch errors that would occur in production.
I think a good solution would be to evaluate all function calls in a new.env(). This might also have the advantage that it can be cleaned up more easily after the function returns.
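To illustrate why this would also catch the error interactively (plain R, independent of BatchJobs): re-parenting the function into a bare environment makes the interactive call fail the same way it does on a node.

```r
x = 5
f = function(y) x + y
f(1)  # works interactively: 'x' is found in the global environment

# Mimic a worker without the global workspace by giving the function an
# environment that only sees the base package:
g = f
environment(g) = new.env(parent = baseenv())
res = try(g(1), silent = TRUE)
inherits(res, "try-error")  # TRUE: object 'x' not found
```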
It is possible to define packages to be loaded on the nodes, but not scripts to source(). This matters when the function I call resides in another file and makes use of helper functions there.
Consider the following example (working on interactive, not working on LSF):
caller.r
library(BatchJobs)
source('callee.r')
reg = batchMapQuick(primary.func, c(1,2), temporary=F)
callee.r
myglobal <<- "123"
primary.func = function(val) {
print(myglobal) # fails
secondary.func(val) # fails
}
secondary.func = function(val) {
}
The workaround I'm currently using is to call source("callee.r") inside primary.func of callee.r. This is not only ugly but also dangerous, because of possible infinite recursion.
I think the nicest way to handle this would be to add an option to source on the node (analogous to loading packages).
SRC: https://code.google.com/p/batchjobs/issues/detail?id=22
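For reference, makeRegistry already accepts a src.dirs argument, which is close to what is asked for here; a hedged usage sketch (the directory name is a placeholder, and support for single files depends on the BatchJobs version):

```r
library(BatchJobs)

# Every *.R file in ./helpers is sourced on the master and on the slaves,
# analogous to the 'packages' argument:
reg = makeRegistry(id = "srcExample", file.dir = tempfile(),
                   src.dirs = "helpers")
batchMap(reg, primary.func, c(1, 2))
submitJobs(reg)
```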
I suggest that you add a help("BatchJobs") overview page. It's also a neat way to quickly access the HTML help of the package without lots of point'n'clicks, e.g. ?BatchJobs.
This can be achieved using Rd markup \alias{BatchJobs-package}. See Section '2.1.4 Documenting packages' of WRE for more details.
This problem was reported by the Debian Med packaging team from their autopkgtest:
LC_ALL=C R --no-save < run-all.R
Running this as a normal user fails, since normal users cannot write to /usr/lib/R/site-library/.
I tried as root, which seems to work without error, but I get:
test_package("BatchJobs")
batchExpandGrid : Warning in sqliteInitDriver(max.con, fetch.default.rec, force.reload, shared.cache) :
RS-DBI driver warning: (SQLite mismatch between compiled version 3.8.5 and runtime version 3.8.4.3)
Warning in sqliteInitDriver(max.con, fetch.default.rec, force.reload, shared.cache) :
RS-DBI driver warning: (SQLite mismatch between compiled version 3.8.5 and runtime version 3.8.4.3)
Warning in sqliteInitDriver(max.con, fetch.default.rec, force.reload, shared.cache) :
RS-DBI driver warning: (SQLite mismatch between compiled version 3.8.5 and runtime version 3.8.4.3)
.Warning in sqliteInitDriver(max.con, fetch.default.rec, force.reload, shared.cache) :
RS-DBI driver warning: (SQLite mismatch between compiled version 3.8.5 and runtime version 3.8.4.3)
To optimise the memory usage for each job, it would be useful to get an idea of the actual memory consumption. Is it possible to report the memory usage in getJobInfo() and a small statistic in showStatus(), similar to the time values?
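Until such a feature exists, one rough workaround is to wrap the job function so that every result also carries the gc() statistics observed during the call (a sketch; the 'max used' columns are only an approximation of true peak usage):

```r
# Wrap a job function so each result reports the peak memory gc() observed.
with.mem.usage = function(fun) {
  function(...) {
    gc(reset = TRUE)                    # reset the 'max used' counters
    res = fun(...)
    g = gc()
    # rough 64-bit sizes: 56 bytes per cons cell, 8 bytes per vector cell
    max.mb = sum(g[, "max used"] * c(56, 8)) / 2^20
    list(result = res, max.mem.mb = max.mb)
  }
}

f = with.mem.usage(function(n) sum(rnorm(n)))
out = f(1e5)
out$max.mem.mb  # approximate peak memory of this call, in MB
```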
Should be moved to BBmisc and renamed. See FIXME in code
SRC: https://code.google.com/p/batchjobs/issues/detail?id=19
For some experiments it MIGHT be useful to be able to specify a graph of dependent jobs, similar to how targets are defined in a Makefile.
This means that for some jobs to start, the results of others have to be fully completed. The solution is probably simple topological sorting with respect to these preconditions.
But I want to collect more use cases, before we look into this again.
Hi, I just tried to install BatchJobs and BatchExperiments on a new computer and encountered the following problem:
I've installed BatchJobs using the command line
devtools::install_github("BatchJobs", username="tudo-r")
Then I tried to load the package and got the following error-message:
library(BatchJobs)
Sourcing configuration file: '/Library/Frameworks/R.framework/Versions/3.0/Resources/library/BatchJobs/etc/BatchJobs_global_config.R'
Error : .onAttach in attachNamespace() for 'BatchJobs' failed, details:
Call: sprintf(fmt, x$cluster.functions$name, x$mail.from, x$mail.to,
Error: could not find function "listToShortString"
Error: package or namespace load failed for 'BatchJobs'
I noticed the FAQ on 'multicore does not work'; this is not my issue, multicore works very nicely in certain respects. However, tracking submission in the registry seems not to work. Many of the job counting utilities do work, but findSubmitted does not.
showStatus(campWBsplreg7)
Status for 239 jobs at 2014-03-20 13:19:12
Submitted: 0 ( 0.00%)
Started: 21 ( 8.79%)
Running: 0 ( 0.00%)
Done: 11 ( 4.60%)
Errors: 0 ( 0.00%)
Expired: 0 ( 0.00%)
Time: min=2834.00s avg=3510.91s max=3741.00s
campWBsplreg7
Job registry: mar3
Number of jobs: 239
Files dir: /udd/stvjc/VM/CAMPWB_200K/mar3f
Work dir: /udd/stvjc/VM/CAMPWB_200K
Multiple result files: FALSE
Seed: 123
Required packages: BatchJobs
?makeRegistry
findRunning(campWBsplreg7)
integer(0)
findDone(campWBsplreg7)
[1] 1 2 3 4 5 6 7 8 9 10 11
findNotDone(campWBsplreg7)
[1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
[19] 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
[37] 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
[55] 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
[73] 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
[91] 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
[109] 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
[127] 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155
[145] 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173
[163] 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
[181] 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209
[199] 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227
[217] 228 229 230 231 232 233 234 235 236 237 238 239
findStarted(campWBsplreg7)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
findSubmitted(campWBsplreg7)
integer(0)
Actually, what happens is well defined, but probably really useless if one considers that the vector of ids gets shuffled.
Is there any use case where a user would like to set inds with chunking?
In the doc file for submitJobs. Maybe somewhere else too, wherever the progress bar is used?
Didn't we discuss implementing this?
Michel, I thought we already had this, but currently I don't see it anywhere in the code.
If it has not been done, this is a feature request to both of us, also for reduceResultsExperiments.
Hi,
I'm using BatchExperiments version 1.0-968.
I tested a job using testJob() and I got
But it should be in minutes, because the job took 1.07 * 60 = 64.2 seconds (64.162 according to system.time()).
Just a minor bug but I wanted to share it.
SZ
Check and see issue
Why was this actually made an option, and not part of the config?
Is it documented somewhere?
This is too hard to figure out from the R docs.
Is there a specific reason why these have to be relative to the work.dir?
I already have a few cases where I would like to set absolute paths.
Why don't we allow this, as long as the user makes sure these paths are accessible on the shared FS?
SRC: https://code.google.com/p/batchjobs/issues/detail?id=17
Add arguments to makeClusterFunctionsTorque, etc., so users can define exit codes and partial error messages that indicate only temporary errors and should result in resubmits.
The currently implemented ones are probably reasonable defaults for these arguments.
It is also really uncool that the whole mechanism is not explained anywhere!
Using the package here at the LMU HPC (SLURM) cluster, I found out two things:
a) The package basically works out of the box
b) Listing jobs does not work. The problem is trivial.
Instead of what we do
squeue -h -o %i -u $USER
I need to run
squeue --clusters=serial -h -o %i -u $USER
The thing is that this command is hardcoded into makeClusterFunctionsSLURM.
Now I could surely copy makeClusterFunctionsSLURM and adapt it, but maybe it would be simpler to make that command configurable?
Maybe there should be an option in BatchJobs.R that one could set?
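An untested sketch of what such a workaround could look like in the configuration, assuming the cluster functions object is a plain list whose listJobs(conf, reg) element returns the batch job ids (assumptions based on the tracebacks, not on a documented API):

```r
library(BatchJobs)

# "slurm.tmpl" is a placeholder for your template file.
cluster.functions = makeClusterFunctionsSLURM("slurm.tmpl")
cluster.functions$listJobs = function(conf, reg) {
  # same squeue call as the stock implementation, plus the required flag
  system("squeue --clusters=serial -h -o %i -u $USER", intern = TRUE)
}
```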
We want to be able to source single files.
Maybe allow a mixture of files and directories so all parties are happy?
Consider the function below. Each time I have 4 jobs that should be put into one chunk, but LSF always submits 4 individual jobs, disregarding the chunks I specify.
Note: this works fine when specifying the chunks manually (last example).
square = function(x) { x*x }
library(BatchJobs)
reg = batchMapQuick(square, c(1:4), chunk.size=10)
#Saving conf: /some/dir/bmq_341e69aa7610/conf.RData
#Submitting 4 chunks / 4 jobs.
#Cluster functions: LSF.
reg = batchMapQuick(square, c(1:4), n.chunks=1)
#Saving conf: /some/dir/bmq_3c8c7b37b8cb/conf.RData
#Submitting 4 chunks / 4 jobs.
#Cluster functions: LSF.
reg = makeRegistry(id="BatchJobsExample", file.dir=tempfile(), seed=123)
batchMap(reg, square, c(1:4))
chunked = chunk(getJobIds(reg), n.chunks=1, shuffle=TRUE)
submitJobs(reg, chunked)
#Saving conf: /tmp/RtmpsWXKsa/file6bec917d96c/conf.RData
#Submitting 1 chunks / 4 jobs.
#Cluster functions: LSF.
The problem seems to be lines 55/56 in R/batchMapQuick.R:
if (!missing(chunk.size) && !missing(n.chunks))
ids = chunk(ids, chunk.size=chunk.size, n.chunks=n.chunks, shuffle=TRUE)
Here, the condition is only true if both chunk.size and n.chunks are given. It should rather be something like:
if (!missing(chunk.size) && !missing(n.chunks))
stop("Providing both chunk.size and n.chunks makes no sense")
if (!missing(chunk.size))
ids = chunk(ids, chunk.size=chunk.size, shuffle=TRUE)
if (!missing(n.chunks))
ids = chunk(ids, n.chunks=n.chunks, shuffle=TRUE)
I noticed on the CIP cluster in Munich that forcefully printing stuff in your .bashrc leads to errors during submission.
This might not be the most relevant issue, but let's see if we can robustify the SSH scripts even more.
Here is the minimal example:
library(BatchJobs)
r = makeRegistry("test", file.dir=as.character(sample(10000, 1)), src.dirs = "Blubb")
f = function(x) x^2
batchMap(r, f, 1:20)
submitJobs(r)
where Blubb is a directory in my working directory, containing a random R script.
We see 2 bugs here:
First, the error on lido is:
Error in getRScripts(dirs) : Directories not found: Blubb
Second: this error is not transferred into the registry. In the registry these jobs are only labeled as "submitted"; they are neither started nor marked as errors, but they should be.
Occasionally one of my ~20 jobs fails with the following error:
$: more BiocParallel_tmp_6fd624fb91629/jobs/14/14.out
Command: Rscript --verbose "/path/to/BiocParallel_tmp_6fd624fb91629/jobs/14/14.R"
running
'/opt/R/R-3.1.1/lib64/R/bin/R --slave --no-restore --file=/path/to/BiocParallel_tmp_6fd624fb91629/jobs/14/14.R'
Loading required package: BBmisc
Loading required package: methods
Loading registry: /path/to/BiocParallel_tmp_6fd624fb91629/registry.RData
Loading conf:
2014-09-09 15:57:16: Starting job on node n0.
Auto-mailer settings: start=none, done=none, error=none.
Setting work dir: /path/to
Warning in sqliteCloseConnection(conn, ...) :
RS-DBI driver warning: (closing pending result sets before closing this connection)
[1] "Error in sqliteFetch(rs, n = -1, ...) : \n RSQLite driver: (RS_SQLite_fetch: failed first step: attempt to write a readonly database)\n"
[1] "SELECT job_id, fun_id, pars, jobname, seed FROM bpmapply_expanded_jobs WHERE job_id IN (14)"
Error in dbDoQuery(reg, query) :
Error in dbDoQuery. Error in sqliteFetch(rs, n = -1, ...) :
RSQLite driver: (RS_SQLite_fetch: failed first step: attempt to write a readonly database)
(SELECT job_id, fun_id, pars, jobname, seed FROM bpmapply_expanded_jobs WHERE job_id IN (14))
Calls: <Anonymous> ... dbGetJobs.Registry -> dbSelectWithIds -> dbDoQuery -> stopf
Setting work back to: /home/henrik
Memory usage according to gc:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 328855 17.6 467875 25.0 407500 21.8
Vcells 488477 3.8 1031040 7.9 786431 6.0
Execution halted
Command: Rscript --verbose "/path/to/BiocParallel_tmp_6fd624fb91629/jobs/14/14.R" ... DONE
Looking at BatchJobs:::dbDoQuery, I see that you check for "(lock|i/o)"-related errors and have dbDoQuery() retry several times before giving up. My best guess is that a similar issue occurs here: BatchJobs runs on a shared file system (NFS), and multiple jobs try to access/update the SQLite database, which is a file on this shared file system. Some job grabs this SQLite file and locks it. Your wait-and-try-again approach handles this lock case. However, here it seems the file can also be in a read-only state, which is not handled. Maybe it has to do with the latency with which file information is propagated on a shared file system, and the read-only state may become visible before the lock of the file. Just guessing here...
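The wait-and-retry idea, extended to also treat the read-only error as transient, can be sketched generically (a hypothetical wrapper, not the actual dbDoQuery code):

```r
# Retry a query while the error message looks transient: locks, i/o problems,
# or the spurious 'readonly database' state seen on NFS.
retry.query = function(do.query, max.retries = 100L, sleep = function(r) 1.1^r) {
  for (r in seq_len(max.retries)) {
    res = try(do.query(), silent = TRUE)
    if (!inherits(res, "try-error"))
      return(res)
    if (!grepl("(lock|i/o|readonly)", tolower(res)))
      stop(res)                        # permanent error: re-throw immediately
    Sys.sleep(sleep(r))
  }
  stop("Query failed after all retries")
}

# Simulated flaky query: fails twice with a readonly error, then succeeds.
n.calls = 0
flaky = function() {
  n.calls <<- n.calls + 1
  if (n.calls < 3) stop("attempt to write a readonly database")
  42
}
retry.query(flaky, sleep = function(r) 0)  # -> 42
```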
http://tudo-r.github.io/BatchJobs/man/
Anything with find...
won't work, and others too 😕
Reasons:
Code becomes more homogeneous.
We don't have trouble with packages that Rscript does not load by default.
The slightly greater overhead does not matter: BatchJobs is not meant for jobs that only take 2 seconds anyway; we have chunking for that.
This was requested at the Bioconductor Developer Meeting.
Without staged.queries
-> permanent database lock.
With staged.queries
-> overburdened file system server (not sure if chunked nicely)
Started to look into it, but we might want to wait until rstats-db/dbi (picked up by Hadley) is more mature to avoid workarounds for dbPreparedQuery etc. See https://stat.ethz.ch/pipermail/r-sig-db/2013q4/001322.html
See Bioconductor/BiocParallel#33
submitJobs()
, sleep until all(file.exists(...)) == TRUE
submitJobs
?staged.queries
may even fail if we cannot rely on the order of files (first created appear first) -> check what reordering would break (besides temporary inconsistencies)
SRC: https://code.google.com/p/batchjobs/issues/detail?id=18
Is it possible to use the package to compute there?
I never looked into it properly; let's try it.
http://aws.amazon.com/ec2/
https://cloud.google.com/products/compute-engine
https://github.com/armstrtw/AWS.tools
I am trying to get a simple example working on my SGE cluster; session info below. I'd greatly appreciate any ideas for figuring this out. BiocParallel is not available for R 3.1.1.
Maybe my SGE template is not correct? I suspect not, because this code seems to fail on a simple qstat.
I am using the following configuration:
cluster.functions = makeClusterFunctionsSGE("/home/poirierj/R_libs/BatchJobs/etc/simple.tmpl", list.jobs.cmd = c("qstat", "-u poirierj"))
mail.start = "none"
mail.done = "none"
mail.error = "none"
db.driver = "SQLite"
db.options = list()
debug = TRUE
library(BatchJobs)
Loading required package: BBmisc
Sourcing configuration file: '/home/poirierj/R_libs/BatchJobs/etc/BatchJobs_global_config.R'
BatchJobs configuration:
cluster functions: SGE
mail.from:
mail.to:
mail.start: none
mail.done: none
mail.error: none
default.resources:
debug: TRUE
raise.warnings: FALSE
staged.queries: FALSE
max.concurrent.jobs: Inf
fs.timeout: NA
library(BiocParallel)
param <- BatchJobsParam(2)
register(param)
x<-bplapply(1:10, identity)
OS cmd: qstat -u poirierj
OS result:
$exit.code
[1] 0
$output
character(0)
Error: $ operator is invalid for atomic vectors
traceback()
15: fun(getBatchJobsConf(), reg)
14: getBatchIds(reg, "Cannot find jobs on system")
13: dbFindOnSystem(reg, unlist(ids))
12: as.vector(y)
11: intersect(unlist(ids), dbFindOnSystem(reg, unlist(ids)))
10: (function (reg, ids, resources = list(), wait, max.retries = 10L,
chunks.as.arrayjobs = FALSE, job.delay = FALSE)
{
getDelays = function(cf, job.delay, n) {
if (is.logical(job.delay)) {
if (job.delay && n > 100L && cf$name %nin% c("Interactive",
"Multicore", "SSH")) {
return(runif(n, n * 0.1, n * 0.2))
}
return(delays = rep.int(0, n))
}
vnapply(seq_along(ids), job.delay, n = n)
}
checkRegistry(reg)
syncRegistry(reg)
if (missing(ids)) {
ids = dbFindSubmitted(reg, negate = TRUE)
if (length(ids) == 0L) {
info("All jobs submitted, nothing to do!")
return(invisible(integer(0L)))
}
}
else {
if (is.list(ids)) {
ids = lapply(ids, checkIds, reg = reg, check.present = FALSE)
dbCheckJobIds(reg, unlist(ids))
}
else if (is.numeric(ids)) {
ids = checkIds(reg, ids)
}
else {
stop("Parameter 'ids' must be a integer vector of job ids or a list of chunked job ids (list of integer vectors)!")
}
}
conf = getBatchJobsConf()
cf = getClusterFunctions(conf)
limit.concurrent.jobs = is.finite(conf$max.concurrent.jobs)
n = length(ids)
assertList(resources)
resources = resrc(resources)
if (missing(wait))
wait = function(retries) 10 * 2^retries
else assertFunction(wait, "retries")
if (is.logical(job.delay)) {
assertFlag(job.delay)
}
else {
checkFunction(job.delay, c("n", "i"))
}
if (is.finite(max.retries))
max.retries = asCount(max.retries)
assertFlag(chunks.as.arrayjobs)
if (chunks.as.arrayjobs && is.na(cf$getArrayEnvirName())) {
warningf("Cluster functions '%s' do not support array jobs, falling back on chunks",
cf$name)
chunks.as.arrayjobs = FALSE
}
if (!is.null(cf$listJobs)) {
ids.intersect = intersect(unlist(ids), dbFindOnSystem(reg,
unlist(ids)))
if (length(ids.intersect) > 0L) {
stopf("Some of the jobs you submitted are already present on the batch system! E.g. id=%i.",
ids.intersect[1L])
}
}
if (limit.concurrent.jobs && (cf$name %in% c("Interactive",
"Local", "Multicore", "SSH") || is.null(cf$listJobs))) {
warning("Option 'max.concurrent.jobs' is enabled, but your cluster functions implementation does not support the listing of system jobs.\n",
"Option disabled, sleeping 5 seconds for safety reasons.")
limit.concurrent.jobs = FALSE
Sys.sleep(5)
}
if (n > 5000L) {
warningf(collapse(c("You are about to submit '%i' jobs.",
"Consider chunking them to avoid heavy load on the scheduler.",
"Sleeping 5 seconds for safety reasons."), sep = "\n"),
n)
Sys.sleep(5)
}
saveConf(reg)
is.chunked = is.list(ids)
info("Submitting %i chunks / %i jobs.", n, if (is.chunked)
sum(viapply(ids, length))
else n)
info("Cluster functions: %s.", cf$name)
info("Auto-mailer settings: start=%s, done=%s, error=%s.",
conf$mail.start, conf$mail.done, conf$mail.error)
fs.timeout = conf$fs.timeout
staged = conf$staged.queries && !is.na(fs.timeout)
interrupted = FALSE
submit.msgs = buffer(type = "list", capacity = 1000L, value = dbSendMessages,
reg = reg, max.retries = 10000L, sleep = function(r) 5,
staged = staged, fs.timeout = fs.timeout)
logger = makeSimpleFileLogger(file.path(reg$file.dir, "submit.log"),
touch = FALSE, keep = 1L)
on.exit({
if (interrupted && exists("batch.result", inherits = FALSE)) {
submit.msgs$push(dbMakeMessageSubmitted(reg, id,
time = submit.time, batch.job.id = batch.result$batch.job.id,
first.job.in.chunk.id = if (is.chunked) id1 else NULL,
resources.timestamp = resources.timestamp))
}
info("Sending %i submit messages...\nMight take some time, do not interrupt this!",
submit.msgs$pos())
submit.msgs$clear()
if (logger$getSize()) messagef("%i temporary submit errors logged to file '%s'.\nFirst message: %s",
logger$getSize(), logger$getLogfile(), logger$getMessages(1L))
})
info("Writing %i R scripts...", n)
resources.timestamp = saveResources(reg, resources)
rscripts = writeRscripts(reg, cf, ids, chunks.as.arrayjobs,
resources.timestamp, disable.mail = FALSE, delays = getDelays(cf,
job.delay, n))
waitForFiles(rscripts, timeout = fs.timeout)
dbSendMessage(reg, dbMakeMessageKilled(reg, unlist(ids),
type = "first"), staged = staged, fs.timeout = fs.timeout)
bar = makeProgressBar(max = n, label = "SubmitJobs")
bar$set()
tryCatch({
for (i in seq_along(ids)) {
id = ids[[i]]
id1 = id[1L]
retries = 0L
repeat {
if (limit.concurrent.jobs && length(cf$listJobs(conf,
reg)) >= conf$max.concurrent.jobs) {
batch.result = makeSubmitJobResult(status = 10L,
batch.job.id = NA_character_, "Max concurrent jobs exhausted")
}
else {
interrupted = TRUE
submit.time = now()
batch.result = cf$submitJob(conf = conf, reg = reg,
job.name = sprintf("%s-%i", reg$id, id1),
rscript = rscripts[i], log.file = getLogFilePath(reg,
id1), job.dir = getJobDirs(reg, id1), resources = resources,
arrayjobs = if (chunks.as.arrayjobs)
length(id)
else 1L)
}
if (batch.result$status == 0L) {
submit.msgs$push(dbMakeMessageSubmitted(reg,
id, time = submit.time, batch.job.id = batch.result$batch.job.id,
first.job.in.chunk.id = if (is.chunked)
id1
else NULL, resources.timestamp = resources.timestamp))
interrupted = FALSE
bar$inc(1L)
break
}
interrupted = FALSE
if (batch.result$status > 0L && batch.result$status <=
100L) {
if (is.finite(max.retries) && retries > max.retries)
stopf("Retried already %i times to submit. Aborting.",
max.retries)
Sys.sleep(wait(retries))
logger$log(batch.result$msg)
retries = retries + 1L
}
else if (batch.result$status > 100L && batch.result$status <=
200L) {
stopf("Fatal error occured: %i. %s", batch.result$status,
batch.result$msg)
}
else {
stopf("Illegal status code %s returned from cluster functions!",
batch.result$status)
}
}
}
}, error = bar$error)
return(invisible(ids))
})(reg = list(id = "bpmapply", version = list(platform = "x86_64-unknown-linux-gnu",
arch = "x86_64", os = "linux-gnu", system = "x86_64, linux-gnu",
status = "", major = "3", minor = "0.2", year = "2013", month = "09",
day = "25", `svn rev` = "63987", language = "R", version.string = "R version 3.0.2 (2013-09-25)",
nickname = "Frisbee Sailing"), RNGkind = c("Mersenne-Twister",
"Inversion"), db.driver = "SQLite", db.options = list(), seed = 693613467L,
file.dir = "/home/poirierj//BiocParallel_tmp_55523b7bd6c2",
sharding = TRUE, work.dir = "/home/poirierj", src.dirs = character(0),
src.files = character(0), multiple.result.files = FALSE,
packages = list(BatchJobs = list(version = list(c(1L, 3L))))),
ids = list(c(2L, 4L, 5L, 7L, 9L), c(1L, 3L, 6L, 8L, 10L)))
9: do.call(submitJobs, pars)
8: withCallingHandlers(expr, message = function(c) invokeRestart("muffleMessage"))
7: suppressMessages(do.call(submitJobs, pars))
6: bpmapply(FUN, X, MoreArgs = list(...), SIMPLIFY = FALSE, USE.NAMES = FALSE,
resume = resume, BPPARAM = BPPARAM)
5: bpmapply(FUN, X, MoreArgs = list(...), SIMPLIFY = FALSE, USE.NAMES = FALSE,
resume = resume, BPPARAM = BPPARAM)
4: bplapply(X, FUN, ..., resume = resume, BPPARAM = x)
3: bplapply(X, FUN, ..., resume = resume, BPPARAM = x)
2: bplapply(1:10, identity)
1: bplapply(1:10, identity)
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BiocParallel_0.4.1 BatchJobs_1.3 BBmisc_1.7
loaded via a namespace (and not attached):
[1] brew_1.0-6 checkmate_1.2 codetools_0.2-8 DBI_0.2-7
[5] digest_0.6.4 fail_1.2 foreach_1.4.2 iterators_1.0.7
[9] parallel_3.0.2 RSQLite_0.11.4 sendmailR_1.1-2 stringr_0.6.2
[13] tools_3.0.2
If I pass insufficient arguments to a function, it would be nice if BatchJobs failed already at registry creation, instead of silently dropping the results.
> library(BatchJobs)
> f = function(x,y) x+y
> reg = batchMapQuick(f, c(1:5))
> reduceResultsList(reg, fun=function(job, res) res)
Reducing 0 results...
list()
Currently, I use a wrapper to check the input like the one below. Please feel free to add these checks if you find them useful.
check.args = function(fun, ..., more.args = list()) {
  l. = list(...)
  fun = match.fun(fun)
  funargs = formals(fun)
  # arguments without a default value are required
  required = names(funargs)[unlist(lapply(funargs, function(f) class(f) == 'name'))]
  provided = names(c(l., more.args))
  if (length(provided) > 0) {
    if (any(nchar(provided) == 0))
      stop("All arguments that will be provided to the function must be named")
    sdiff = setdiff(required, provided)
    if (length(sdiff) > 0 && !identical(sdiff, '...'))
      stop(paste("Argument required but not provided:", paste(sdiff, collapse = " ")))
  }
  sdiff = setdiff(provided, names(funargs))
  if (length(sdiff) > 0 && !'...' %in% names(funargs))
    stop(paste("Argument provided but not accepted by the function:", paste(sdiff, collapse = " ")))
  dups = duplicated(provided)
  if (any(dups))
    stop(paste("Argument duplicated:", paste(provided[dups], collapse = " ")))
}
The gc value might not be perfect, but it is better than nothing.
I am trying to set up a BatchJobs job with 8 scenarios. The function I use is a wrapper that puts the parameters into the right form to call the functions that do the actual computing work. All parameters in the call are vectors of length 8, defined either as columns of matrices or as vectors (created using rep). However, the function call is not interpreted as I expect; the message I get is:
Error in data.frame(fun_id = fun.id, pars = pars, jobname = jobname) :
arguments imply differing number of rows: 1, 8, 0
FUNCTION DEFINITION BELOW
preFixRes <- function(nRep, la01,la02,la03,la11,la12,la13,ct1,ct2,ct3,
n,rho1,rho2,rho3,rho4,rho5,
gamma1,gamma2,gamma3,gamma4,gamma5,
optRho,optGamma,seeds, cLams){
n0 <- n/2
n1 <- n0
cttfs <- c(ct1,ct2,ct3)
rho <- c(rho1,rho2,rho3,rho4,rho5)
cLamb <- rep(cLams,3)
gamma <- c(gamma1,gamma2,gamma3,gamma4,gamma5)
rez <- runSimul(nRep, Scennr, la01, la02, la03, la11, la12, la13, ctffs,  # note: 'Scennr' and 'ctffs' are undefined here ('ctffs' is likely a typo for 'cttfs')
n0, n1, rho, gamma, optRho, optGamma, seeds, cLamb)
resMat <- fixResTests1( rez,nRep,length(rho))
evs <- cEvs(rez,nRep)
return(list(resMat=resMat,evs=evs))
}
Question: what is expected here?
SRC: https://code.google.com/p/batchjobs/issues/detail?id=20
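The error ("differing number of rows: 1, 8, 0") suggests the vectorized arguments passed to batchMap do not all line up. batchMap pairs its vector arguments element-wise, like mapply, so every vectorized argument must have the same length (here 8, one per scenario), and scalars shared by all jobs belong in more.args. A plain-R analogue of the intended call (function and argument names are illustrative):

```r
# mapply mirrors how batchMap pairs up its arguments: one job per element.
pre.fix.res = function(la01, ct1, seed, n) {
  # stand-in for the real wrapper; just return the inputs
  list(la01 = la01, ct1 = ct1, seed = seed, n = n)
}
scenarios = 8
res = mapply(pre.fix.res,
             la01 = rnorm(scenarios),       # one value per scenario
             ct1  = rep(0.5, scenarios),
             seed = 1:scenarios,
             MoreArgs = list(n = 100),      # shared by all scenarios
             SIMPLIFY = FALSE)
length(res)  # 8 -- one result per scenario
```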
On some HPC systems (like ours) it is necessary to set the walltime at least roughly correctly to compute efficiently, rather than overestimating it by a factor of 10 to be on the safe side.
For some experiments, especially if you mix very different algorithms with BatchExperiments, this can become very difficult for the user, who has to do extensive "pre-runs", note execution times, and then submit jobs in groups.
One current way out is chunking, i.e. simply mixing long-running jobs with short ones.
But I am still wondering whether we cannot automate what the user does (explained above) by learning it automatically.
Maybe this is a weird idea and overkill; in any case this is of low priority and won't be implemented soon.