crbs / chmutil
Utilities to run Cascaded Hierarchical Model (CHM) jobs on a cluster of computers
License: Other
When using large images (> 10k x 10k), Pillow outputs a warning message about a possible DOS attack. This page says there is a way to silence this warning:
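For reference, Pillow's decompression-bomb check can be relaxed or silenced at the module level; a minimal sketch (whether chmutil should disable the check entirely for trusted inputs is a judgment call):

```python
import warnings

from PIL import Image

# Raise the decompression-bomb limit (the default is roughly 178 megapixels),
# or set it to None to disable the size check entirely for trusted inputs.
Image.MAX_IMAGE_PIXELS = None

# Alternatively, keep the check but silence just the warning:
warnings.simplefilter("ignore", Image.DecompressionBombWarning)
```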
It would be nice to be able to pass a mask image file (or a stack of them) to createchmjob.py via, say, a --mask flag whose value would be a directory or a single image file. The mask would indicate which parts of the image should be segmented.
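A rough sketch of how the flag could be wired up; the flag name, default, and helper function are assumptions, not the actual chmutil API:

```python
import argparse
import os

# Hypothetical addition to createchmjob.py's argument parser.
parser = argparse.ArgumentParser()
parser.add_argument('--mask', default=None,
                    help='Single mask image or directory of mask images; '
                         'nonzero pixels mark regions to segment')
args = parser.parse_args(['--mask', '/tmp/masks'])


def get_mask_files(mask_path):
    """Return a list of mask image paths from a file or directory."""
    if mask_path is None:
        return []
    if os.path.isdir(mask_path):
        return sorted(os.path.join(mask_path, f)
                      for f in os.listdir(mask_path))
    return [mask_path]
```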
Modify createchmjob.py to also run pychm jobs.
It appears Gordon uses a custom job filter for commands passed to qsub, and this filter does NOT allow the -t #-# flag to be set as a command-line argument. The -t #-# flag denotes an array job, with #-# giving the starting and ending task ids. To work around this, the -t directive needs to be put into the runjobs.gordon script file by checkchmjob.py --submit and updated by that command on rerun.
Example line:
#PBS -t 1-3432
This means that the user just needs to run qsub runjobs.gordon to submit jobs.
Add a new argument --rawthreshold that lets the caller specify a pixel intensity to use as a threshold cutoff. This is needed to be able to overlay training data (which has an intensity of 1).
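A sketch of what the threshold could do, assuming Pillow is used as elsewhere in chmutil; the function name is an assumption, not the actual chmutil API:

```python
from PIL import Image


def apply_raw_threshold(img, rawthreshold):
    """Binarize an image: pixels at or above the raw intensity cutoff
    become 255, the rest become 0."""
    gray = img.convert('L')
    return gray.point(lambda p: 255 if p >= rawthreshold else 0)


# Training-label data with intensity 1 survives a cutoff of 1:
labels = Image.new('L', (4, 4), 1)
mask = apply_raw_threshold(labels, 1)
```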
Modify SGEScheduler in image.py to add a virtual_free=XXG entry to the requirements. This is needed because h_vmem alone does not seem to be putting jobs on machines with enough memory.
Output information about input dataset:
Number input images: 1,234 (145 Gigabytes)
Dimensions of images: 24,000 x 32,000
Add a --detailed option which provides this additional information in output:
CHM tasks: 4% complete (960 of 23,456 completed)
CHM task runtime: 3.2 hours per task (6gb ram per job)
CHM task CPU consumption so far: 3,069 CPU hours (~0.35 CPU years)
CHM task estimated remaining compute: 75,059 CPU hours (~8.69 CPU years)
Merge tasks: 0% complete (0 of 1,234 completed)
Merge task runtime: NA
Merge task CPU consumption so far: NA
Merge task estimated remaining compute: NA
This tool should be named createprobmapoverlayjob.py and it should be called like so:
createprobmapoverlayjob.py (options of probability map overlay creation)
The above script should take a completed CHM job and use information in that job to generate a new set of tasks to run on a cluster. These tasks (one per probability map) should take the probability map, filter it based on options set by the caller, and then overlay it onto the original raw image with color and opacity defined by the user.
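One way a single overlay task could work with Pillow; the function name, parameters, and defaults below are assumptions for illustration, not the actual chmutil API:

```python
from PIL import Image


def overlay_probmap(raw, probmap, color=(255, 0, 0), opacity=0.3, cutoff=128):
    """Threshold the probability map at the cutoff, then composite a
    colored, semi-transparent layer onto the raw image where the map
    is set."""
    base = raw.convert('RGBA')
    # Binary mask of pixels at or above the probability cutoff.
    mask = probmap.convert('L').point(lambda p: 255 if p >= cutoff else 0)
    overlay = Image.new('RGBA', base.size, (0, 0, 0, 0))
    # Paste the user's color (with the requested opacity) through the mask.
    overlay.paste(color + (int(255 * opacity),), (0, 0), mask)
    return Image.alpha_composite(base, overlay)


raw = Image.new('L', (8, 8), 100)
prob = Image.new('L', (8, 8), 200)   # everywhere above the default cutoff
out = overlay_probmap(raw, prob)
```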
On the Gordon cluster there are a few nodes that do not have the singularity module. This causes the job to fail very quickly, which is fine. The problem is that the node becomes available again and proceeds to eat other jobs. To remedy this, it would be nice if chmutil could catch this failure and add the node to the job's exclude list so subsequent jobs do NOT get assigned to this node:
Command run in runjobs.gordon:
module load singularity/2.1.2
Error message (not sure if module load returns an exit code or not):
ModuleCmd_Load.c(204):ERROR:105: Unable to locate a modulefile for
'singularity/2.1.2'
List of nodes that had problems on a run over 1-14-2017:
gcn-13-77
gcn-4-28
gcn-7-75
gcn-8-22
gcn-8-74
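A sketch of detecting the failure by scanning a task's stdout file for the modulefile error and recovering the node name from its HOST: line; the helper below is an assumption, not actual chmutil code:

```python
import re

# Matches the module-load failure seen on the bad Gordon nodes.
MODULE_ERR = re.compile(r"ModuleCmd_Load\.c\(\d+\):ERROR:105: "
                        r"Unable to locate a modulefile")


def node_to_exclude(stdout_text):
    """Return the node name from a 'HOST: <node>' line if the task hit
    the missing-modulefile error, else None."""
    if not MODULE_ERR.search(stdout_text):
        return None
    m = re.search(r"^HOST:\s*(\S+)", stdout_text, re.MULTILINE)
    return m.group(1) if m else None


sample = ("HOST: gcn-13-77\n"
          "ModuleCmd_Load.c(204):ERROR:105: Unable to locate a modulefile for\n"
          "'singularity/2.1.2'\n")
```

Nodes collected this way could then be appended to the scheduler's exclude list before the next submission.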
Perhaps this class should generate the following script named runjobs.local:
#!/bin/sh
if [ $# -ne 2 ] ; then
echo "$0 <start task #> <end task #>"
echo ""
echo "Runs sequence of CHM tasks"
echo ""
echo "Ex: $0 1 50"
exit 1
fi
start=$1
end=$2
for Y in $(seq $start $end) ; do
outfile="/fakechmjob/gordon2/chmrun/stdout/${Y}.out"
echo "HOST: $HOSTNAME" > $outfile
echo "DATE: $(date)" >> $outfile
echo "TASKID: $Y" >> $outfile
/usr/bin/time -p /usr/bin/chmrunner.py $Y /fakechmjob/gordon2 --scratchdir /fakechmjob/gordon2/chmrun/tmp --log DEBUG >> $outfile 2>&1
exitcode=$?
echo "chmrunner.py exited with code: $exitcode" >> $outfile
done
Should look into automatically setting the project for comet jobs.
To make it easier to create tiles for the probability map viewer, add --gentiles to createchmimage.py, which will tile the image in the format needed by the probability map viewer.
createchmjob.py should create the first couple of tasks with only 1 tile in the argument list. runchmjob.py should then tell the caller to run these tasks first to verify correct operation. This will also give runtime estimates for processing the entire dataset.
checkjobstatus.py needs to be replaced with checkchmjob.py
Add an option --addprobmap that allows the caller to overlay additional probability maps, each with its own color and settings.
This readme file should contain the following information:
-- Descriptions of all files and directories pertaining to the job.
-- The arguments passed to createchmjob.py to create this directory
-- Commands to submit jobs and check status
-- Links to get help
The SGE scheduler will send a USR2 signal a few seconds before killing the job. chmrunner.py should (depending on whether a flag is set on the command line) catch this signal, output any stderr/stdout, and remove any temp files.
For the Comet and rocce clusters, the runjobs.CLUSTER and runmerge.CLUSTER files are generated once by createchmjob.py, which sets the account value from the --account flag.
For the Gordon cluster this value is lost, since runjobs.gordon and runmerge.gordon are re-written when checkchmjob.py --submit is invoked. To remedy this, the account value needs to be stored in a configuration file so it can be loaded into CHMConfig.
A job on rocce failed because it ran out of memory in mergetiles.py:
2017-07-23 17:52:22,722 ERROR (11321) chmutil.mergetiles Caught exception
Traceback (most recent call last):
File "/home/rdrigo/miniconda2/bin/mergetiles.py", line 95, in main
theargs.suffix)
File "/home/rdrigo/miniconda2/bin/mergetiles.py", line 53, in _merge_image_tiles
merged = sim.merge_images(im_list)
File "/home/rdrigo/miniconda2/lib/python2.7/site-packages/chmutil/image.py", line 42, in merge_images
merged = self._merge_two_images(merged, entry)
File "/home/rdrigo/miniconda2/lib/python2.7/site-packages/chmutil/image.py", line 58, in _merge_two_images
b=image2)
File "/home/rdrigo/miniconda2/lib/python2.7/site-packages/PIL/ImageMath.py", line 265, in eval
out = builtins.eval(expression, args)
File "<string>", line 1, in <module>
File "/home/rdrigo/miniconda2/lib/python2.7/site-packages/PIL/ImageMath.py", line 232, in imagemath_max
return self.apply("max", self, other)
File "/home/rdrigo/miniconda2/lib/python2.7/site-packages/PIL/ImageMath.py", line 88, in apply
out = Image.new(mode or im1.mode, im1.size, None)
File "/home/rdrigo/miniconda2/lib/python2.7/site-packages/PIL/Image.py", line 2154, in new
return Image()._new(core.new(mode, size))
MemoryError
2017-07-23 17:52:23,456 INFO (11318) chmutil.core Process 11320 exited with code: 2
The job was merging 71 tiles of size 31237x29138 pixels. To remedy this, the memory needed by the merge should be estimated and the following added to the -l line in the runmerge.rocce configuration file:
h_vmem=XXG,virtual_free=XXG
Where XX is the number of gigabytes of RAM needed. This is only needed on rocce, since that queue places multiple users' jobs on a node. Comet should be okay, since a job gets an entire node with 128 gigabytes of RAM.
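A rough way to derive XX; the heuristic and safety factor below are assumptions (the merge holds the accumulated image plus the tile being merged, so one full-size image is scaled by a safety factor), not actual chmutil code:

```python
import math


def estimate_merge_vmem_gb(width, height, bytes_per_pixel=1, safety=3.0):
    """Estimate the h_vmem/virtual_free value (in whole gigabytes) to
    request for merging tiles into an image of the given dimensions."""
    image_bytes = width * height * bytes_per_pixel
    return int(math.ceil(image_bytes * safety / 1e9))


# The failed rocce job was merging tiles into a 31237x29138 image:
gb = estimate_merge_vmem_gb(31237, 29138)
requirement = 'h_vmem={0}G,virtual_free={0}G'.format(gb)
```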
Add an --account option that lets the user specify an account, which is needed for the Gordon and Comet clusters. This value should be put into CHMConfig and be obtainable via a get method.
In version 0.8.0, createtrainingmrcstack.py fails because it tries to call get_image_path_list from the core module, but that function was moved to the image module.
Need to change singularity module loaded to
singularity/2.3.2
Update the qstat call example in readme.txt to include the -t flag needed for array jobs:
qstat -t -u '$USER'
Add new option --dontdeletescratch to createtrainingmrcstack.py to skip deletion of scratchdir.
This script should generate a CHM train job similar in design to createchmjob.py.
Usage:
createchmtrainjob.py ./images ./labels ./run --stage 2 --level 2 --account foo --walltime 24:00:00
Under ./run there should be output similar to what createchmjob.py generates.
$ createchmjob.py chmimages model mychm --disablechmhisteq --cluster rocce --chmbin /data/churas/chm_s22.img
2017-07-24 16:46:15,539 ERROR chmutil.image Skipping file unable to open /data/scratch/churastest/chmimages/.DS_Store
Traceback (most recent call last):
File "/home/churastest/miniconda2/lib/python2.7/site-packages/chmutil/image.py", line 382, in get_input_image_stats
im = Image.open(fp)
File "/home/churastest/miniconda2/lib/python2.7/site-packages/PIL/Image.py", line 2519, in open
% (filename if filename else fp))
IOError: cannot identify image file '/data/scratch/churastest/chmimages/.DS_Store'
2017-07-24 16:46:15,560 ERROR chmutil.image Caught exception attempting to close image
Traceback (most recent call last):
File "/home/churastest/miniconda2/lib/python2.7/site-packages/chmutil/image.py", line 391, in get_input_image_stats
im.close()
AttributeError: 'NoneType' object has no attribute 'close'
Run this to submit job
/home/churastest/miniconda2/bin/checkchmjob.py "/data/scratch/churastest/mychm" --submit
[churastest@login-0-0 churastest]$ ls -la chmimages/.DS_Store
-rw-r--r-- 1 churastest churastest 6148 Jul 24 16:40 chmimages/.DS_Store
[churastest@login-0-0 churastest]$ file chmimages/.DS_Store
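The traceback shows two problems: non-image files such as .DS_Store are fed to Image.open, and the cleanup path calls close() on a variable that is still None. A sketch of a fix (the helper name is an assumption, not the actual get_input_image_stats code):

```python
import os

from PIL import Image


def open_if_image(path):
    """Open an image file, returning None for hidden files (e.g.
    .DS_Store) and files Pillow cannot identify.  Only calls close()
    on an image that actually opened, avoiding the AttributeError
    on None seen in the traceback."""
    if os.path.basename(path).startswith('.'):
        return None
    im = None
    try:
        im = Image.open(path)
        im.load()
        return im
    except (IOError, OSError):
        if im is not None:
            im.close()
        return None
```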
Setting this --cluster flag should update --jobspernode to 1 for rocce, 11 or 12 for gordon, and the correct value for comet.
When checking a job with checkchmjob.py --detailed, memory usage was output as:
CHM tasks: 100% complete (2 of 2 completed)
CHM runtime: 0.5 hours per task (12,846.76GB ram)
Looking at the standard out files, here is the memory usage output in kilobytes:
Maximum resident set size (kbytes): 12846688
Maximum resident set size (kbytes): 12846832
The above output should be 12.85GB of RAM. It looks like checkchmjob.py is computing megabytes of RAM but labeling them as gigabytes.
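The fix is a single extra division; a sketch (the function name is an assumption, and decimal units are used to match the report's style):

```python
def kbytes_to_gb(kbytes):
    """Convert a /usr/bin/time 'Maximum resident set size (kbytes)'
    value to gigabytes (decimal units)."""
    return kbytes / 1.0e6


# The two tasks' peak RSS values from the stdout files above:
peak_gb = max(kbytes_to_gb(12846688), kbytes_to_gb(12846832))
```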
Since IMOD needs tif files to create an MRC stack, it would be nice to have createchmjob.py generate probability map images as tiffs instead of png files. The easiest implementation is to offer a new flag --gentifs that tells createchmjob.py to generate the merged probability map images as tif files.