percyfal / ratatosk Goto Github PK

License: Apache License 2.0

Python 91.85% Graphviz (DOT) 3.64% CSS 1.99% JavaScript 2.53%

ratatosk's Issues

Make exe() use self.executable and remove exe() from all subclasses.
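A minimal sketch of what this could look like, assuming `executable` is a plain class attribute. The class names `JobTaskSketch` and `BwaSketch` are hypothetical and are not taken from the ratatosk code base:

```python
class JobTaskSketch:
    """Hypothetical stand-in for the common task base class."""

    # Subclasses set this attribute instead of overriding exe().
    executable = None

    def exe(self):
        # Single implementation shared by every subclass.
        if self.executable is None:
            raise NotImplementedError(
                "executable is not set on %s" % type(self).__name__
            )
        return self.executable


class BwaSketch(JobTaskSketch):
    # No exe() override needed; only the attribute differs.
    executable = "bwa"
```

With this layout a subclass that forgets to set `executable` fails loudly on the first call to `exe()` instead of silently running the wrong program.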

Syntax error in utils

Error

from ratatosk.utils import update, config_to_dict
SyntaxError: invalid syntax (utils.py, line 175)

Sort out dependencies between IndelRealigner and RealignerTargetCreator

Right now IndelRealigner depends on RealignerTargetCreator, and a source bam file name is calculated dynamically. Shouldn't the dependency really be on the bam file used as input for RealignerTargetCreator?

Try: modify repr(Task) for better visualization in graphs via a command line option

Currently the graphs include all options, making them difficult to read. UPDATE: breaking the representation of Task (equivalent to Task.task_id) breaks the dependency resolution. This needs a fix in the code that generates the graphs.

Implement dry run

The previous issue is related to the wish for a dry run: basically, we want to generate a picture of the workflow.

Class validation of parent task

Implement class validation of parent_task. Currently, any code can be used, but it would be nice if the class were validated against the parent class, for instance by using interfaces.

Add general task config option `--target` to mimic Make behaviour more closely

These options correspond to file names.
CANCELLED?: In addition, have a fallback --source for a task.

Use new luigi visualizer

Task event timing

I have added start_time and end_time to BaseJobTask, but currently the times don't get submitted to the graph/table interface. This would allow monitoring execution times and identifying pipeline bottlenecks.
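As a rough illustration of how the two timestamps could be turned into a duration for reporting, here is a sketch. `TimedTaskSketch`, `run_timed` and `elapsed` are hypothetical names; only `start_time` and `end_time` come from the issue text:

```python
import time


class TimedTaskSketch:
    """Hypothetical task wrapper that records start and end times."""

    def __init__(self):
        self.start_time = None
        self.end_time = None

    def run_timed(self, func):
        # Record wall-clock time around the actual work, even if it raises.
        self.start_time = time.time()
        try:
            return func()
        finally:
            self.end_time = time.time()

    def elapsed(self):
        # Return None until the task has both started and finished.
        if self.start_time is None or self.end_time is None:
            return None
        return self.end_time - self.start_time
```

A graph or table interface could then display `elapsed()` per task, which is the value needed to spot bottlenecks.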

There is still a dependency on cement that should be removed

YAML configuration parser (currently in ratatosk.yamlconfigparser.py)
Shell commands are wrapped with shell.exec_cmd from the cement package

Check for program versions

Check for program versions and command inconsistencies: for instance, BaseRecalibrator was introduced in GATK 2.0. EDIT: I'm thinking it's best not to elaborate too much on this part, but leave the responsibility of combining correct program versions to the end user. Otherwise, this can become overly complex.

Have had weird problems with cutadapt. Using the '-o' flag should generate gzipped output. However, the job runner uses tmp files, which by default don't use 'gz'-suffix. A workaround for now is to add the suffix in CutadaptJobRunner. This works on Linux,

Reimplement server.py, importing and subclassing only the necessary modules

server.py, the subdirectory static, and the daemon are more or less copies from luigi.

Control number of threads / worker

How can the number of workers/threads in use be controlled? An example best explains the issue: alignment with bwa aln can be done with multiple threads. bwa sampe is single-threaded, and uses ~5.4GB RAM for the human genome. Our current compute cluster has 8-core 24GB RAM nodes. One solution would be to run 8 samples per node, running bwa aln -t 8 sequentially, and wrap them with a WrappedTask before proceeding with bwa sampe, which then should only use 4 workers simultaneously. For small samples this is ok. For large samples, one might imagine partitioning the pipeline into an alignment step, in which one sample is run per node, and then grouping the remaining tasks and samples in reasonably sized groups. This latter approach would probably benefit from SLURM/drmaa integration (see following item).
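The figure of 4 simultaneous bwa sampe workers follows from the memory limit rather than the core count. A small sketch of that arithmetic, where the function name `max_workers` and its parameters are assumptions made for illustration:

```python
def max_workers(cores, node_ram_gb, ram_per_job_gb, threads_per_job=1):
    """Upper bound on concurrent jobs per node, limited by CPU and RAM."""
    by_cpu = cores // threads_per_job
    by_ram = int(node_ram_gb // ram_per_job_gb)
    return max(0, min(by_cpu, by_ram))


# bwa sampe on an 8-core, 24GB node at ~5.4GB per job:
# CPU allows 8 jobs, RAM allows floor(24 / 5.4) = 4, so the bound is 4.
sampe_workers = max_workers(cores=8, node_ram_gb=24, ram_per_job_gb=5.4)
```

For `bwa aln -t 8` the CPU bound is the limiting one: passing `threads_per_job=8` gives one job per node.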

global config not updated correctly

backend.global_config should be updated once all parameters have been updated correctly. Currently not working/implemented.

File suffixes and string substitutions in file names are hard-coded

Turning substitutions into options would maybe solve the issue of input parameter generation for tasks that are run several times on files with different suffixes.
DONE (sort of - the current solution works, but would need cleaning up)

Configuration improvements

Configuration issues:

Read configuration file on startup, and not for every task as is currently the case
Variable expansion would be nice (e.g. $GATK_HOME)
Global section for globals, such as num_threads, dbsnp, etc?
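Variable expansion could be handled once, right after the configuration is loaded. The sketch below assumes the parsed configuration is a nested structure of dicts, lists and strings. `expand_config` is a hypothetical helper; `os.path.expandvars` is from the standard library and leaves unset variables unchanged:

```python
import os


def expand_config(value):
    """Recursively expand $VAR and ${VAR} references in a parsed config."""
    if isinstance(value, dict):
        return {key: expand_config(item) for key, item in value.items()}
    if isinstance(value, list):
        return [expand_config(item) for item in value]
    if isinstance(value, str):
        return os.path.expandvars(value)
    # Numbers, booleans and None pass through untouched.
    return value
```

Calling this once at startup would also fit the first point in the list, since the expanded result can be cached and shared by all tasks.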

Use cloudbiolinux references

Instead of bwaref etc for setting alignment references, utilise cloudbiolinux/tool-data

Merge sam files target list not unique

For some target generators, bam file target list is not unique
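One possible fix is to drop duplicates while keeping the original order, so the merge input stays deterministic. A sketch, with `unique_targets` as a hypothetical helper name:

```python
def unique_targets(targets):
    """Remove duplicate target names, keeping first-seen order."""
    # dict preserves insertion order (Python 3.7+), so this is order-stable.
    return list(dict.fromkeys(targets))
```

This only works for hashable entries such as file name strings, which is the assumption made here.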

Better SLURM/drmaa integration

(Long-term goal?): Integrate with SLURM/drmaa along the lines of luigi.hadoop and luigi.hadoop_jar. Currently using the local scheduler on nodes works well enough

Fix unit tests

Tests are not real unit tests since they depend on one another

Central planner

Have tasks talk to a central planner so task lists can be easily monitored via a web page. Start out with gandalf.

Modify graph representation so only task name is shown

Resolve option update order

If an option is set in config_file and in custom_config (e.g. target_generator_handler) the latter does not override the former. The order of preference should be:

command line
custom config
config file
default value
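The order of preference above can be sketched with `collections.ChainMap`, which looks keys up in the first mapping that contains them. The function name `resolve_options` and the convention that `None` means "not set" are assumptions for illustration:

```python
from collections import ChainMap


def resolve_options(cli, custom_config, config_file, defaults):
    """Merge option sources; earlier arguments take precedence."""
    layers = [
        # Treat None as "not set" so it cannot mask a lower-priority value.
        {key: val for key, val in layer.items() if val is not None}
        for layer in (cli, custom_config, config_file, defaults)
    ]
    return dict(ChainMap(*layers))
```

For example, an option present in both the config file and the custom config resolves to the custom config value, and a command-line value overrides both.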

Add support for merging vcf files in HaloPlex using CombineVariants

a.move(b) doesn't seem to work

UPSTREAM in ratatosk.job.DefaultShellJobRunner._fix_paths, a.move(b) doesn't work (I modelled this after luigi hadoop_jar)

Currently they live in ratatosk.scilife, but the principle is clear. Add ratatosk.pipeline in which pipeline wrappers are put. In the main script then import different pre-defined pipelines so one could change them via the command line following luigi rul

Function opts() is inconsistent at present. Best is probably to init empty list and do a join on return.
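The list-and-join pattern suggested here might look like the following. The function `build_opts`, its parameters and the flag strings are hypothetical, chosen only to show the shape:

```python
def build_opts(threads=None, read_group=None, extra=()):
    """Collect option fragments in a list and join them once on return."""
    parts = []
    if threads is not None:
        parts.append("-t {}".format(threads))
    if read_group:
        parts.append("-r {}".format(read_group))
    # Caller-supplied fragments are appended unchanged.
    parts.extend(extra)
    return " ".join(parts)
```

Because the return type is always a string, callers no longer need to handle the mix of `None`, lists and strings that makes the current behaviour inconsistent.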

Speaking of file suffixes, currently assume all fastq files are gzipped. EDIT: actually, this is a good thing. If a program can't take gzipped input, use pipes.

Command-line options should override settings in config file

Not sure if that currently is the case. EDIT: no, this is a bug

Make opts() and args() more consistent

Some options in opts() are not really options - see
e.g. ratatosk.lib.tools.gatk.VariantEval, which requires a
reference. Move to args() for consistency?

opt_to_dict in wrapper script squashes list options

Some options can be repeated (--sample s1 --sample s2). Transforming the argument list to a dictionary therefore squashes the list into one element.
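A sketch of a conversion that keeps repeated options as a list is shown below. It reuses the name `opt_to_dict` from the issue title, but the body is an illustrative rewrite and not the wrapper script's actual code. It assumes long options of the form `--name value` or bare `--flag`:

```python
def opt_to_dict(argv):
    """Convert ['--key', 'value', ...] to a dict, keeping repeats as lists."""
    result = {}
    i = 0
    while i < len(argv):
        token = argv[i]
        if not token.startswith("--"):
            # Skip stray positional arguments in this sketch.
            i += 1
            continue
        key = token[2:]
        has_value = i + 1 < len(argv) and not argv[i + 1].startswith("--")
        value = argv[i + 1] if has_value else True
        if key in result:
            # Second and later occurrences promote the entry to a list.
            if not isinstance(result[key], list):
                result[key] = [result[key]]
            result[key].append(value)
        else:
            result[key] = value
        i += 2 if has_value else 1
    return result
```

With this version `--sample s1 --sample s2` yields a two-element list under `sample` instead of a single squashed value.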

Pickling states doesn't currently seem to work?

Dynamic calculation of target names

Currently need to know target name. Add function that prints in which order labels are added to facilitate target name construction. Basically we need to gather all labels between two vertices in the dependency graph. See ratatosk.job.name_prefix function.

Target grouping in table output

Add 'target_grouping' or something similar to BaseJobTask to use for grouping in table output.

Implement restarting functions

Implement options --restart and --restart-from that restart from scratch or from a given task. Would require calculation of target names between any two vertices in the dependency graph. The idea would be to add a condition in the complete function that returns False until the provided task name is reached.
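One way to read the idea, sketched for a linear chain of tasks: every task from the restart point onwards reports itself as incomplete, regardless of whether its output exists. The helper names `tasks_to_rerun` and `is_complete` are hypothetical, and a real dependency graph would need a topological ordering in place of the plain list assumed here:

```python
def tasks_to_rerun(execution_order, restart_from=None, restart=False):
    """Return the set of task names that must report incomplete."""
    if restart:
        # --restart: rerun everything from scratch.
        return set(execution_order)
    if restart_from is None:
        return set()
    if restart_from not in execution_order:
        raise ValueError("unknown task: %s" % restart_from)
    # --restart-from: rerun the named task and everything after it.
    return set(execution_order[execution_order.index(restart_from):])


def is_complete(task_name, output_exists, rerun):
    """complete() condition: forced reruns win over existing output."""
    return task_name not in rerun and output_exists
```

Tasks upstream of the restart point keep their normal completeness check, so existing outputs there are reused.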
