ratatosk's People

Contributors

percyfal

ratatosk's Issues

Syntax error in utils

Error

from ratatosk.utils import update, config_to_dict
SyntaxError: invalid syntax (utils.py, line 175)

Sort out dependencies between IndelRealigner and RealignerTargetCreator

Right now IndelRealigner depends on RealignerTargetCreator, and a
source bam file name is calculated dynamically. Shouldn't the
dependency really be on the bam file used as input for
RealignerTargetCreator?

Try: modify repr(Task) for better visualization in graphs via a command line option

Currently the graphs include all options, making them difficult to
read. UPDATE: breaking the representation of Task (equivalent to
Task.task_id) breaks the dependency resolution. This needs a fix in
the code that generates the graphs.

Implement dry run

The previous issue is related to the wish for a dry run: basically,
we want to generate a picture of the workflow.

Class validation of parent task

Implement class validation of parent_task. Currently, any code can be used, but it would be nice if the class were validated against the parent class, for instance by using interfaces.
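
One possible shape for such a check, as a minimal sketch: it assumes parent_task is given as a dotted path string and that there is some expected base class to validate against. The function name load_validated_class is hypothetical and not part of ratatosk; the usage line relies on standard-library classes only.

```python
import importlib


def load_validated_class(dotted_path, expected_base):
    """Import the class named by dotted_path and check it subclasses expected_base.

    Hypothetical helper for illustration; not part of ratatosk.
    """
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError("expected 'module.ClassName', got %r" % dotted_path)
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    # Reject anything that is not a class derived from the expected base.
    if not (isinstance(cls, type) and issubclass(cls, expected_base)):
        raise TypeError("%s is not a subclass of %s"
                        % (dotted_path, expected_base.__name__))
    return cls


# Usage with standard-library classes: OrderedDict derives from dict.
ordered = load_validated_class("collections.OrderedDict", dict)
```

Validating at load time means a misconfigured parent class fails early with a clear message instead of failing somewhere inside the dependency resolution.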

Task event timing

I have added start_time and end_time to BaseJobTask, but currently the times don't get submitted to the graph/table interface. This would allow monitoring execution times and identifying pipeline bottlenecks.
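
To illustrate how recorded times could be used to find bottlenecks, here is a minimal sketch. TimedTask is a hypothetical stand-in and does not reproduce BaseJobTask; only the attribute names start_time and end_time are taken from the issue.

```python
import time


class TimedTask:
    """Hypothetical stand-in for a task that records start_time and end_time."""

    def __init__(self, name):
        self.name = name
        self.start_time = None
        self.end_time = None

    def run(self, work):
        # Record wall-clock timestamps around the actual work.
        self.start_time = time.time()
        try:
            return work()
        finally:
            self.end_time = time.time()

    def elapsed(self):
        """Return the run time in seconds, or None if the task has not finished."""
        if self.start_time is None or self.end_time is None:
            return None
        return self.end_time - self.start_time


def slowest(tasks):
    """Return finished tasks sorted by elapsed time, longest first."""
    finished = [t for t in tasks if t.elapsed() is not None]
    return sorted(finished, key=lambda t: t.elapsed(), reverse=True)
```

A table or graph interface would then only need the elapsed() value per task to highlight the slowest steps.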

Check for program versions

Check for program versions and command inconsistencies: for instance, BaseRecalibrator was introduced in GATK 2.0. EDIT: I'm thinking it's best not to elaborate too much on this part, but leave the responsibility of combining correct program versions to the end user. Otherwise, this can become overly complex.

Control number of threads / worker

How can the number of workers/threads in use be controlled? An example best explains the issue: alignment with bwa aln can be done with multiple threads, whereas bwa sampe is single-threaded and uses ~5.4GB RAM for the human genome. Our current compute cluster has 8-core 24GB RAM nodes. One solution would be to run 8 samples per node: run bwa aln -t 8 sequentially, wrap the runs with a WrappedTask, and then proceed with bwa sampe, which should then use only 4 workers simultaneously. For small samples this is ok. For large samples, one might imagine partitioning the pipeline into an alignment step, in which one sample is run per node, and then grouping the remaining tasks and samples in reasonably sized groups. This latter approach would probably benefit from SLURM/drmaa integration (see following item).
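
The figure of 4 simultaneous bwa sampe workers follows from the memory limit rather than the core count. A small sketch of that arithmetic, using the node size and memory footprint quoted above (the function name max_concurrent_jobs is made up for illustration):

```python
import math


def max_concurrent_jobs(cores, node_ram_gb, threads_per_job, ram_per_job_gb):
    """Return how many jobs fit on one node, limited by both cores and RAM."""
    by_cores = cores // threads_per_job
    by_ram = math.floor(node_ram_gb / ram_per_job_gb)
    return min(by_cores, by_ram)


# bwa sampe: single-threaded, ~5.4GB RAM per job, on an 8-core 24GB node.
# Cores would allow 8 jobs, but 24 / 5.4 is about 4.4, so RAM allows only 4.
sampe_workers = max_concurrent_jobs(cores=8, node_ram_gb=24,
                                    threads_per_job=1, ram_per_job_gb=5.4)
```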

File suffixes and string substitutions in file names are hard-coded

Turning substitutions into options would maybe solve the issue of
input parameter generation for tasks that are run several times on
files with different suffixes.
DONE (sort of - the current solution works, but would need cleaning
up)

Configuration improvements

Configuration issues:

Read the configuration file once on startup, and not for every task as is currently the case
Variable expansion would be nice (e.g. $GATK_HOME)
A global section for globals, such as num_threads, dbsnp, etc?
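
Variable expansion of the kind suggested above could be sketched with the standard library's os.path.expandvars, applied to an already-parsed configuration. This assumes the configuration is a nested structure of dicts and lists; the function name expand_config, the section layout and the jar file name are illustrative only.

```python
import os


def expand_config(config):
    """Recursively expand $VAR and ${VAR} references in string values."""
    if isinstance(config, dict):
        return {key: expand_config(value) for key, value in config.items()}
    if isinstance(config, list):
        return [expand_config(value) for value in config]
    if isinstance(config, str):
        # Unknown variables are left unchanged by expandvars.
        return os.path.expandvars(config)
    return config


# Illustrative usage; the path and section names are assumptions.
os.environ["GATK_HOME"] = "/opt/gatk"
expanded = expand_config({"gatk": {"path": "$GATK_HOME/GenomeAnalysisTK.jar",
                                   "num_threads": 4}})
```

Running the expansion once, right after the file is read on startup, would also fit the first point in the list.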

Better SLURM/drmaa integration

(Long-term goal?): Integrate with SLURM/drmaa along the lines of luigi.hadoop and luigi.hadoop_jar. Currently, using the local scheduler on nodes works well enough.

Fix unit tests

Tests are not real unit tests since they depend on one another.

Central planner

Have tasks talk to a central planner so task lists can be easily monitored via a web page. Start out with gandalf.

Resolve option update order

If an option is set in config_file and in custom_config (e.g. target_generator_handler) the latter does not override the former. The order of preference should be:

  1. command line
  2. custom config
  3. config file
  4. default value
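
The order of preference above can be sketched with collections.ChainMap, which looks a key up in its first mapping first. This is only an illustration of the intended precedence, not how ratatosk currently merges options; resolve_options and the example values are hypothetical.

```python
from collections import ChainMap


def resolve_options(command_line, custom_config, config_file, defaults):
    """Merge option sources; earlier arguments take precedence over later ones."""
    # Drop unset (None) values so they do not mask lower-priority sources.
    layers = [
        {key: value for key, value in layer.items() if value is not None}
        for layer in (command_line, custom_config, config_file, defaults)
    ]
    return dict(ChainMap(*layers))


# Illustrative values: the custom config overrides the config file,
# and an unset command line option falls through to the config file.
options = resolve_options(
    command_line={"num_threads": None},
    custom_config={"target_generator_handler": "custom.handler"},
    config_file={"target_generator_handler": "file.handler", "num_threads": 4},
    defaults={"num_threads": 1, "dbsnp": "default.vcf"},
)
```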

a.move(b) doesn't seem to work

UPSTREAM in ratatosk.job.DefaultShellJobRunner._fix_paths, a.move(b) doesn't work (I modelled this after luigi hadoop_jar)

Make opts() and args() more consistent

Some options in opts() are not really options - see
e.g. ratatosk.lib.tools.gatk.VariantEval, which requires a
reference. Move to args() for consistency?

Dynamic calculation of target names

Currently, the user needs to know the target name in advance. Add a function that prints the order in which labels are added, to facilitate target name construction. Basically, we need to gather all labels between two vertices in the dependency graph. See the ratatosk.job.name_prefix function.
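
Gathering the labels between two vertices could look roughly like the sketch below. It assumes the dependency graph is available as a plain mapping from each task to its downstream tasks, and that each task contributes one label. The task names, labels and the function labels_between are hypothetical and do not reflect ratatosk's internal representation.

```python
def labels_between(graph, labels, start, end):
    """Return the labels added on a path from start to end, or None if no path.

    graph maps a task name to the tasks that consume its output;
    labels maps a task name to the label it adds to the file name.
    The label of start itself is not included.
    """
    def walk(node, seen):
        if node == end:
            return []
        seen.add(node)
        for child in graph.get(node, []):
            if child in seen:
                continue
            rest = walk(child, seen)
            if rest is not None:
                return [labels.get(child, "")] + rest
        return None

    return walk(start, set())


# Hypothetical three-step chain: align -> sort -> dedup.
graph = {"align": ["sort"], "sort": ["dedup"]}
labels = {"sort": ".sort", "dedup": ".dup"}
added = labels_between(graph, labels, "align", "dedup")
target = "sample" + "".join(added) + ".bam"
```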

Implement restarting functions

Implement options --restart and --restart-from that restart from scratch or from a given task. Would require calculation of target names between any two vertices in the dependency graph. The idea would be to add a condition in the complete function that returns False until the provided task name is reached.

Clean up intermediate output

Add task for cleaning up intermediate output (related to issue on pipes). Tmp files could be removed if is_tmp=True?

Implement pipes

Use pipes wherever possible. See luigi.format.InputPipeProcessWrapper etc. and luigi.file. Incidentally, how would using pipes over several nodes work? I guess tasks connected by pipes should be gathered and run by a single worker.

Hadoop integration

Integrate with hadoop. This may be extremely easy: set the job runner for the JobTasks via the config file; by default, they use DefaultShellJobRunner, but could also use a (customized and subclassed?) version of hadoop_jar.HadoopJarJobRunner
