
chmutil's People

Contributors

coleslaw481, pyup-bot

Forkers

shuail

chmutil's Issues

Modify checkchmjob.py to update contents of runjobs.gordon with task array range

It appears Gordon uses a custom job filter for commands passed to qsub, and this filter does NOT allow the -t #-# flag to be set as a command line argument. The -t #-# flag marks the job as an array job, where #-# gives the starting and ending task ids. To work around this, the -t directive needs to be written into the runjobs.gordon script file by checkchmjob.py --submit, and updated by that command if it is rerun.

Example line:
#PBS -t 1-3432

This means that the user just needs to run qsub runjobs.gordon to submit jobs.
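A minimal sketch of how the script text could be updated, assuming runjobs.gordon is plain text and the directive sits on its own line. The function name set_task_array_range is hypothetical and not part of chmutil:

```python
import re

# Hypothetical helper, not chmutil's actual API.
_TASK_ARRAY_LINE = re.compile(r"^#PBS -t \d+-\d+[ \t]*$", re.MULTILINE)


def set_task_array_range(script_text, start, end):
    """Return script_text with its '#PBS -t' line set to start-end."""
    directive = "#PBS -t {0}-{1}".format(start, end)
    if _TASK_ARRAY_LINE.search(script_text):
        # Rerun case: replace the existing range in place.
        return _TASK_ARRAY_LINE.sub(directive, script_text, count=1)
    lines = script_text.split("\n")
    # First submit: add the directive right after the shebang, if any.
    insert_at = 1 if lines and lines[0].startswith("#!") else 0
    lines.insert(insert_at, directive)
    return "\n".join(lines)
```

Calling it again with a new range rewrites the same line, so the script never accumulates more than one -t directive.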

Add --detailed option to checkchmjob.py

Add a --detailed option which provides this additional information in output:

          CHM tasks: 4% complete (960 of 23,456 completed)
          CHM task runtime: 3.2 hours per task (6gb ram per job)
          CHM task CPU consumption so far: 3,069 CPU hours (~0.35 CPU years)
          CHM task estimated remaining compute: 75,059 CPU hours (~8.69 CPU years)
          

          Merge tasks: 0% complete (0 of 1,234 completed)
          Merge task runtime: NA
          Merge task CPU consumption so far: NA
          Merge task estimated remaining compute: NA
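The arithmetic behind these lines could look roughly like the sketch below. It assumes one CPU core per task, a 365-day year, and that the average runtime per task is already known. The function name summarize_tasks is hypothetical, and the ram figure is left out. The sample numbers in the issue are illustrative, so this simple extrapolation will not reproduce them exactly:

```python
HOURS_PER_YEAR = 24 * 365  # assumption: 365-day year


def summarize_tasks(label, completed, total, avg_hours_per_task=None):
    """Return the four summary lines for one task type (e.g. 'CHM')."""
    percent = int(100.0 * completed / total) if total else 0
    lines = ["{0} tasks: {1}% complete ({2:,} of {3:,} completed)".format(
        label, percent, completed, total)]
    prefix = "{0} task ".format(label)
    if completed == 0 or avg_hours_per_task is None:
        # No finished tasks means there is nothing to extrapolate from.
        lines.append(prefix + "runtime: NA")
        lines.append(prefix + "CPU consumption so far: NA")
        lines.append(prefix + "estimated remaining compute: NA")
        return lines
    used = completed * avg_hours_per_task
    remaining = (total - completed) * avg_hours_per_task
    cpu_fmt = "{0:,.0f} CPU hours (~{1:.2f} CPU years)"
    lines.append(prefix + "runtime: {0:.1f} hours per task".format(
        avg_hours_per_task))
    lines.append(prefix + "CPU consumption so far: " +
                 cpu_fmt.format(used, used / HOURS_PER_YEAR))
    lines.append(prefix + "estimated remaining compute: " +
                 cpu_fmt.format(remaining, remaining / HOURS_PER_YEAR))
    return lines
```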

create tool to generate jobs to create probability map overlay images

This tool should be named createprobmapoverlayjob.py and it should be called like so:

createprobmapoverlayjob.py (options of probability map overlay creation)

The above script should take a completed CHM job and use information in that job to generate a new set of tasks to run on a cluster. These tasks (1 per probability map) should take the probability map, filter based on options set by caller and then overlay that probability map onto the original raw image with color and opacity defined by the user.

If a job fails look into altering job submission to exclude that node

On the Gordon cluster a few nodes do not have the singularity module. This causes the job to fail very quickly, which is fine. The problem is that the node then becomes available again and proceeds to eat other jobs. To remedy this, it would be nice if chmutil could catch this failure and add the node to the exclude list of the job so subsequent jobs do NOT get assigned to it:

Command run in runjobs.gordon:
module load singularity/2.1.2

Error message (not sure whether module load returns a non-zero exit code):
ModuleCmd_Load.c(204):ERROR:105: Unable to locate a modulefile for
'singularity/2.1.2'

List of nodes that had problems on a run over 1-14-2017:

gcn-13-77
gcn-4-28
gcn-7-75
gcn-8-22
gcn-8-74
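One way the failure could be detected is by scanning task output for the module error. The sketch below assumes each task's stdout contains a "HOST: <name>" line (as in the runjobs.local script proposed in another issue) along with the error text. The function name find_bad_nodes is hypothetical, and how the resulting list is handed to the scheduler is left open:

```python
MODULE_ERROR = "Unable to locate a modulefile"
HOST_PREFIX = "HOST: "  # assumption: written by the run script


def find_bad_nodes(stdout_texts):
    """Return sorted host names whose task output shows a module failure."""
    bad = set()
    for text in stdout_texts:
        if MODULE_ERROR not in text:
            continue
        for line in text.splitlines():
            if line.startswith(HOST_PREFIX):
                bad.add(line[len(HOST_PREFIX):].strip())
                break
    return sorted(bad)
```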

Look into adding LocalCluster to run jobs on local computer

Perhaps this class should generate the following script named runjobs.local:

#!/bin/sh

if [ $# -ne 2 ] ; then
echo "$0 <start> <end>"
echo ""
echo "Runs sequence of CHM tasks"
echo ""
echo "Ex: $0 1 50"
exit 1
fi

start=$1
end=$2

for Y in `seq $start $end` ; do
outfile="/fakechmjob/gordon2/chmrun/stdout/${Y}.out"
echo "HOST: $HOSTNAME" > $outfile
echo "DATE: `date`" >> $outfile
echo "TASKID: $Y" >> $outfile
/usr/bin/time -p /usr/bin/chmrunner.py $Y /fakechmjob/gordon2 --scratchdir /fakechmjob/gordon2/chmrun/tmp --log DEBUG >> $outfile 2>&1

exitcode=$?
echo "chmrunner.py exited with code: $exitcode" >> $outfile
done
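A rough sketch of what such a class might look like follows. The class layout, method names, and the @@...@@ placeholders are all assumptions, as is the "<start> <end>" usage text. The script body follows the one above, with the seq and date calls wrapped in backticks so they run as commands:

```python
import os

# Placeholders @@JOBDIR@@ and @@RUNNER@@ are filled in by LocalCluster.
_TEMPLATE = """#!/bin/sh

if [ $# -ne 2 ] ; then
  echo "$0 <start> <end>"
  echo ""
  echo "Runs sequence of CHM tasks"
  echo ""
  echo "Ex: $0 1 50"
  exit 1
fi

start=$1
end=$2

for Y in `seq $start $end` ; do
  outfile="@@JOBDIR@@/chmrun/stdout/${Y}.out"
  echo "HOST: $HOSTNAME" > $outfile
  echo "DATE: `date`" >> $outfile
  echo "TASKID: $Y" >> $outfile
  /usr/bin/time -p @@RUNNER@@ $Y @@JOBDIR@@ --scratchdir @@JOBDIR@@/chmrun/tmp --log DEBUG >> $outfile 2>&1

  exitcode=$?
  echo "chmrunner.py exited with code: $exitcode" >> $outfile
done
"""


class LocalCluster(object):
    """Hypothetical cluster class that writes a runjobs.local script."""

    def __init__(self, job_dir, runner="/usr/bin/chmrunner.py"):
        self._job_dir = job_dir
        self._runner = runner

    def generate_script_text(self):
        """Return the script text with the job paths filled in."""
        text = _TEMPLATE.replace("@@JOBDIR@@", self._job_dir)
        return text.replace("@@RUNNER@@", self._runner)

    def write_script(self):
        """Write an executable runjobs.local into the job directory."""
        path = os.path.join(self._job_dir, "runjobs.local")
        with open(path, "w") as handle:
            handle.write(self.generate_script_text())
        os.chmod(path, 0o755)
        return path
```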

Add --gentiles to createchmimage.py

To make it easier to create tiles for the probability map viewer, add a --gentiles option to createchmimage.py that tiles the image in the format the probability map viewer needs.

createchmjob.py should create readme.txt file for job

This readme file should contain the following information:

-- Descriptions of all files and directories pertaining to the job.

-- The arguments passed to createchmjob.py to create this directory

-- Commands to submit jobs and check status

-- Links to get help

Modify chmrunner.py to catch USR2 signals

The SGE scheduler sends a USR2 signal a few seconds before killing a job. chmrunner.py should (depending on whether a command line flag is set) catch this signal, output any pending stderr/stdout, and remove any temp files.
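A minimal sketch of catching the signal with Python's standard signal module, assuming a Unix host and that the temp files live under a single directory. The class name Usr2Cleanup is hypothetical, and the handler only flushes and cleans up rather than exiting:

```python
import shutil
import signal
import sys


class Usr2Cleanup(object):
    """Flushes output and removes a temp directory when SIGUSR2 arrives."""

    def __init__(self, temp_dir):
        self.temp_dir = temp_dir
        self.received = False

    def install(self):
        # Must be called from the main thread, per the signal module rules.
        signal.signal(signal.SIGUSR2, self._handle)

    def _handle(self, signum, frame):
        self.received = True
        sys.stdout.flush()
        sys.stderr.flush()
        shutil.rmtree(self.temp_dir, ignore_errors=True)
```

The command line flag mentioned above would simply decide whether install() is called at startup.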

Account value not being saved in configuration files

For the Comet and rocce clusters, the runjobs.CLUSTER and runmerge.CLUSTER files are generated once by createchmjob.py, which sets the account value from the --account flag.

For the Gordon cluster this value is lost, since runjobs.gordon and runmerge.gordon are rewritten whenever checkchmjob.py --submit is invoked. To remedy this, the account value needs to be stored in a configuration file so it can be loaded into CHMConfig.
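Persisting the value could be as simple as the sketch below, which uses the standard configparser module (Python 3). The section name, key name, and function names are assumptions and do not reflect how chmutil actually lays out its configuration:

```python
import configparser

SECTION = "chmjob"       # hypothetical section name
ACCOUNT_KEY = "account"  # hypothetical key name


def save_account(config_path, account):
    """Store the account in config_path, keeping any existing values."""
    parser = configparser.ConfigParser()
    parser.read(config_path)  # a missing file is silently ignored
    if not parser.has_section(SECTION):
        parser.add_section(SECTION)
    parser.set(SECTION, ACCOUNT_KEY, account)
    with open(config_path, "w") as handle:
        parser.write(handle)


def load_account(config_path, default=""):
    """Return the stored account, or default if none was saved."""
    parser = configparser.ConfigParser()
    parser.read(config_path)
    return parser.get(SECTION, ACCOUNT_KEY, fallback=default)
```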

mergetiles job failed on rocce because it ran out of memory

A job on rocce failed because it ran out of memory in mergetiles.py:

2017-07-23 17:52:22,722 ERROR (11321) chmutil.mergetiles Caught exception
Traceback (most recent call last):
  File "/home/rdrigo/miniconda2/bin/mergetiles.py", line 95, in main
    theargs.suffix)
  File "/home/rdrigo/miniconda2/bin/mergetiles.py", line 53, in _merge_image_tiles
    merged = sim.merge_images(im_list)
  File "/home/rdrigo/miniconda2/lib/python2.7/site-packages/chmutil/image.py", line 42, in merge_images
    merged = self._merge_two_images(merged, entry)
  File "/home/rdrigo/miniconda2/lib/python2.7/site-packages/chmutil/image.py", line 58, in _merge_two_images
    b=image2)
  File "/home/rdrigo/miniconda2/lib/python2.7/site-packages/PIL/ImageMath.py", line 265, in eval
    out = builtins.eval(expression, args)
  File "<string>", line 1, in <module>
  File "/home/rdrigo/miniconda2/lib/python2.7/site-packages/PIL/ImageMath.py", line 232, in imagemath_max
    return self.apply("max", self, other)
  File "/home/rdrigo/miniconda2/lib/python2.7/site-packages/PIL/ImageMath.py", line 88, in apply
    out = Image.new(mode or im1.mode, im1.size, None)
  File "/home/rdrigo/miniconda2/lib/python2.7/site-packages/PIL/Image.py", line 2154, in new
    return Image()._new(core.new(mode, size))
MemoryError
2017-07-23 17:52:23,456 INFO (11318) chmutil.core Process 11320 exited with code: 2

The job was merging 71 tiles of size 31237x29138 pixels. To remedy this, we need to estimate the memory needed by the merge and add the following to the -l line in the runmerge.rocce configuration file:

h_vmem=XXG,virtual_free=XXG

Where XX is the number of gigabytes of ram needed. This is only needed on rocce, since its queue places multiple user jobs on a node. Comet should be okay, since a job gets an entire node with 128 gigabytes of ram.
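A back-of-the-envelope estimate might look like the sketch below. It assumes 4 bytes per pixel during the merge and three full-size images held in memory at once (two inputs plus the result). Both figures are guesses that would need to be checked against real runs, and the function names are hypothetical:

```python
import math

BYTES_PER_GB = 1024 ** 3


def estimate_merge_memory_gb(width, height, bytes_per_pixel=4,
                             images_in_memory=3):
    """Return a rounded-up estimate, in whole gigabytes, for one merge."""
    total = width * height * bytes_per_pixel * images_in_memory
    return int(math.ceil(total / float(BYTES_PER_GB)))


def sge_memory_resources(gigabytes):
    """Format the resource string to append to the -l line."""
    return "h_vmem={0}G,virtual_free={0}G".format(gigabytes)
```

For the failed job, sge_memory_resources(estimate_merge_memory_gb(31237, 29138)) would give the string to append.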

Add --account option to createchmjob.py

Add an --account option that lets the user specify the account, which is needed for the Gordon and Comet clusters. This value should be put into CHMConfig and be obtainable via a get method.

Create new tool createchmtrainjob.py

This script should generate a CHM train job, with a design similar to createchmjob.py.

Usage:

createchmtrainjob.py ./images ./labels ./run --stage 2 --level 2 --account foo --walltime 24:00:00

The contents of ./run should be similar to what createchmjob.py generates.
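The command line in the usage example could be parsed with argparse roughly as follows. The argument names, help text, and defaults are assumptions taken from that one example and are not a settled interface:

```python
import argparse


def build_parser():
    """Build a parser matching the usage example (defaults are guesses)."""
    parser = argparse.ArgumentParser(
        prog="createchmtrainjob.py",
        description="Creates a CHM train job directory")
    parser.add_argument("images", help="Directory of training images")
    parser.add_argument("labels", help="Directory of training labels")
    parser.add_argument("outdir", help="Directory to write the job to")
    parser.add_argument("--stage", type=int, default=2)
    parser.add_argument("--level", type=int, default=2)
    parser.add_argument("--account", default="",
                        help="Account to charge jobs to")
    parser.add_argument("--walltime", default="24:00:00",
                        help="Walltime in HH:MM:SS format")
    return parser
```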

.DS_Store in input images directory causes IOError in createchmjob.py

$ createchmjob.py chmimages model mychm --disablechmhisteq --cluster rocce --chmbin /data/churas/chm_s22.img
2017-07-24 16:46:15,539 ERROR chmutil.image Skipping file unable to open /data/scratch/churastest/chmimages/.DS_Store
Traceback (most recent call last):
File "/home/churastest/miniconda2/lib/python2.7/site-packages/chmutil/image.py", line 382, in get_input_image_stats
im = Image.open(fp)
File "/home/churastest/miniconda2/lib/python2.7/site-packages/PIL/Image.py", line 2519, in open
% (filename if filename else fp))
IOError: cannot identify image file '/data/scratch/churastest/chmimages/.DS_Store'
2017-07-24 16:46:15,560 ERROR chmutil.image Caught exception attempting to close image
Traceback (most recent call last):
File "/home/churastest/miniconda2/lib/python2.7/site-packages/chmutil/image.py", line 391, in get_input_image_stats
im.close()
AttributeError: 'NoneType' object has no attribute 'close'
Run this to submit job
/home/churastest/miniconda2/bin/checkchmjob.py "/data/scratch/churastest/mychm" --submit
[churastest@login-0-0 churastest]$ ls -la chmimages/.DS_Store
-rw-r--r-- 1 churastest churastest 6148 Jul 24 16:40 chmimages/.DS_Store
[churastest@login-0-0 churastest]$ file chmimages/.DS_Store
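One possible remedy is to filter the directory listing before any file is opened, so hidden files such as .DS_Store never reach Image.open. The sketch below assumes images can be recognized by extension. The extension list and the function name list_input_images are hypothetical:

```python
import os

# Assumption: the set of extensions CHM input images may use.
IMAGE_EXTENSIONS = (".png", ".tif", ".tiff", ".jpg", ".jpeg")


def list_input_images(image_dir):
    """Return sorted paths of image files, skipping hidden files."""
    result = []
    for name in sorted(os.listdir(image_dir)):
        if name.startswith("."):
            continue  # skips .DS_Store and other dot files
        if not name.lower().endswith(IMAGE_EXTENSIONS):
            continue
        path = os.path.join(image_dir, name)
        if os.path.isfile(path):
            result.append(path)
    return result
```

The second traceback is a separate problem: im is still None when Image.open fails, so the close call would need a guard such as checking that im is not None first.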

checkchmjob.py --detailed outputting incorrect value for ram

When checking a job with checkchmjob.py --detailed, the following memory usage was output:

CHM tasks: 100% complete (2 of 2 completed)
CHM runtime: 0.5 hours per task (12,846.76GB ram)

Looking at the standard out files, here is the output for memory usage in kilobytes:

Maximum resident set size (kbytes): 12846688
Maximum resident set size (kbytes): 12846832

The above output should be 12.85GB ram. Looks like checkchmjob.py is outputting megabytes of ram instead of gigabytes.
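The conversion can be checked with a few lines. This sketch assumes the reported figure is the mean of the per-task values and uses decimal units (1 GB = 1,000,000 kB), which is what yields the 12.85GB stated above. The function name is hypothetical:

```python
KB_PER_GB = 1000 * 1000  # decimal units; dividing by 1000 only gives MB


def format_mean_ram_gb(max_rss_kbytes):
    """Format the mean of per-task max resident set sizes as gigabytes."""
    mean_kb = sum(max_rss_kbytes) / float(len(max_rss_kbytes))
    return "{0:,.2f}GB ram".format(mean_kb / KB_PER_GB)
```

Dividing the same mean by 1,000 instead gives 12,846.76, the value shown in the incorrect output.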
