Right now IndelRealigner depends on RealignerTargetCreator, and a source bam file name is calculated dynamically. Shouldn't the dependency really be on the bam file used as input for RealignerTargetCreator?
command line option. Currently the graphs include all options,
making it difficult to read. UPDATE: breaking the representation of
Task (equivalent to Task.task_id) breaks the dependency resolution.
This needs a fix in the code that generates the graphs.
The previous issue is related to the wish for a dry run: basically, we want to generate a picture of the workflow.
Implement class validation of parent_task. Currently, any class can be plugged in, but it would be nice if it were validated against the expected parent class, for instance by using interfaces.
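A minimal sketch of what such validation could look like, assuming parent_task is given as a dotted path. The names BaseJobTask, load_class, and validate_parent_task are illustrative stand-ins, not ratatosk's actual API:

```python
import importlib


class BaseJobTask:
    """Stand-in for the framework's task base class."""


def load_class(dotted_path):
    """Resolve 'pkg.module.ClassName' to the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_path), class_name)


def validate_parent_task(dotted_path, base=BaseJobTask):
    """Reject parent_task settings that don't subclass the expected base."""
    cls = load_class(dotted_path)
    if not (isinstance(cls, type) and issubclass(cls, base)):
        raise TypeError(
            "parent_task %r must subclass %s" % (dotted_path, base.__name__))
    return cls
```

The check runs once when the configuration is read, so a mistyped or incompatible parent_task fails fast instead of surfacing mid-pipeline.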
more closely. These options correspond to file names.
CANCELLED?: In addition, have a fallback --source for a task.
I have added start_time and end_time to BaseJobTask, but currently the times don't get submitted to the graph/table interface. This would allow monitoring execution times and identifying pipeline bottlenecks.
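A minimal sketch of the timing hooks, assuming nothing about BaseJobTask beyond the start_time/end_time attributes mentioned above; everything else here is illustrative:

```python
import time


class TimedTaskMixin:
    """Record wall-clock start/end times around a task's run() so they
    could later be submitted to the graph/table monitoring interface."""

    start_time = None
    end_time = None

    def run_timed(self, run_fn):
        self.start_time = time.time()
        try:
            run_fn()
        finally:
            # record end_time even when the task fails
            self.end_time = time.time()

    @property
    def duration(self):
        """Elapsed seconds, or None if the task has not run to completion."""
        if self.start_time is None or self.end_time is None:
            return None
        return self.end_time - self.start_time
```

Reporting duration per task class in the table output would make pipeline bottlenecks visible at a glance.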
YAML configuration parser (currently in ratatosk.yamlconfigparser.py).
Shell commands are wrapped with shell.exec_cmd from the cement package.
Check for program versions and command inconsistencies: for instance, BaseRecalibrator was introduced in GATK 2.0. EDIT: I'm thinking it's best not to elaborate too much on this part, but leave the responsibility of combining correct program versions to the end user. Otherwise, this can become overly complex.
Modules server.py, the subdirectory static, and the daemon are more or less copies from luigi.
How to control the number of workers/threads in use? An example best explains the issue: alignment with bwa aln can be done with multiple threads, but bwa sampe is single-threaded and uses ~5.4GB RAM for the human genome. Our current compute cluster has 8-core, 24GB-RAM nodes. One solution would be to run 8 samples per node, running bwa aln -t 8 sequentially, wrap them with a WrappedTask before proceeding with bwa sampe, which then should only use 4 workers simultaneously. For small samples this is ok. For large samples, one might imagine partitioning the pipeline into an alignment step, in which one sample is run per node, and then grouping the remaining tasks and samples into reasonably sized groups. This latter approach would probably benefit from SLURM/drmaa integration (see the following item).
backend.global_config should be updated once all parameters have been updated correctly. Currently not working/implemented.
hard-coded. Turning substitutions into options would maybe solve the
issue of input parameter generation for tasks that are run several
times on files with different suffixes.
DONE (sort of - the current solution works, but would need cleaning up)
Configuration issues:
Read configuration file on startup, and not for every task as is currently the case
Variable expansion would be nice (e.g. $GATK_HOME)
Global section for globals, such as num_threads, dbsnp, etc?
Instead of bwaref etc for setting alignment references, utilise cloudbiolinux/tool-data
For some target generators, bam file target list is not unique
(Long-term goal?): Integrate with SLURM/drmaa along the lines of luigi.hadoop and luigi.hadoop_jar. Currently using the local scheduler on nodes works well enough
Tests are not real unit tests since they depend on one another.
Have tasks talk to a central planner so task lists can be easily monitored via a web page. Start out with gandalf.
If an option is set in config_file and in custom_config (e.g. target_generator_handler) the latter does not override the former. The order of preference should be:
DONE (sort of - the current solution works, but would need cleaning up)
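A hedged sketch of layered option resolution, so that custom_config values override config_file values (the missing behaviour described above). Later layers win, and nested sections are merged recursively; function and variable names are illustrative:

```python
def update(d, u):
    """Recursively merge u into d, with values from u taking precedence."""
    for k, v in u.items():
        if isinstance(v, dict) and isinstance(d.get(k), dict):
            update(d[k], v)
        else:
            # shallow-copy dicts so later merges don't mutate the sources
            d[k] = dict(v) if isinstance(v, dict) else v
    return d


def resolve_options(*layers):
    """Merge option dicts left to right: defaults first, overrides last."""
    merged = {}
    for layer in layers:
        update(merged, layer or {})
    return merged
```

Called as resolve_options(config_file_opts, custom_config_opts), the custom layer wins wherever both set the same key.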
UPSTREAM in ratatosk.job.DefaultShellJobRunner._fix_paths, a.move(b) doesn't work (I modelled this after luigi hadoop_jar)
Not sure if that currently is the case. EDIT: no, this is a bug.
Some options in opts() are not really options - see
e.g. ratatosk.lib.tools.gatk.VariantEval, which requires a
reference. Move to args() for consistency?
Some options can be repeated (--sample s1 --sample s2). Transforming the argument list to a dictionary therefore squashes the list into one element.
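One way to avoid the squashing is to accumulate repeated flags into a list during the conversion. A sketch, assuming the argument list alternates flag/value (function name is illustrative):

```python
def opts_to_dict(argv):
    """Convert a flat argument list to a dict without losing repeats;
    a flag seen more than once maps to a list of its values."""
    out = {}
    it = iter(argv)
    for flag in it:
        value = next(it)  # assumes every flag is followed by a value
        if flag in out:
            prev = out[flag]
            if isinstance(prev, list):
                prev.append(value)
            else:
                out[flag] = [prev, value]
        else:
            out[flag] = value
    return out
```

The asymmetry (scalar for single occurrences, list for repeats) mirrors the command line; always returning lists would be the more uniform alternative.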
Currently need to know target name. Add function that prints in which order labels are added to facilitate target name construction. Basically we need to gather all labels between two vertices in the dependency graph. See ratatosk.job.name_prefix function.
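The label-gathering idea can be sketched as follows, in the spirit of ratatosk.job.name_prefix. Tasks are reduced here to (name, label) pairs; in reality they would come from the dependency graph:

```python
import os


def target_name(source, chain):
    """Build a target name by appending each task's label to the source
    prefix, e.g. sample.bam -> sample.sort.dup.realign.bam."""
    prefix, ext = os.path.splitext(source)
    labels = [label for _name, label in chain if label]
    return prefix + "".join(labels) + ext
```

Printing the chain of labels between two vertices would then be enough for a user to predict the final target name without inspecting every task class.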
Add 'target_grouping' or something similar to BaseJobTask to use for grouping in table output.
Implement options --restart and --restart-from that restart from scratch or from a given task. Would require calculation of target names between any two vertices in the dependency graph. The idea would be to add a condition in the complete function that returns False until the provided task name is reached.
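A sketch of the complete() condition described above. It assumes complete() is consulted in dependency-traversal order, from the final target back through its requirements (as luigi does); the class and method signatures are illustrative, not ratatosk's API:

```python
class RestartFrom:
    """Force re-runs from a given task: complete() reports False for
    every task consulted until the named task has been reached."""

    def __init__(self, restart_task):
        self.restart_task = restart_task
        self.reached = False

    def complete(self, task_name, output_exists):
        if not self.reached:
            if task_name == self.restart_task:
                self.reached = True
            # this task (and everything downstream of the restart
            # point) is treated as incomplete and will re-run
            return False
        # upstream of the restart point: fall back to the normal check
        return output_exists
```

With --restart (from scratch), the condition degenerates to returning False unconditionally.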
Add task for cleaning up intermediate output (related to issue on pipes). Tmp files could be removed if is_tmp=True?
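A minimal sketch of such a cleanup step. Targets are reduced here to objects with a path and an is_tmp flag; the function name and interface are assumptions:

```python
import os


def cleanup(targets):
    """Remove intermediate outputs flagged is_tmp; return removed paths."""
    removed = []
    for t in targets:
        if getattr(t, "is_tmp", False) and os.path.exists(t.path):
            os.remove(t.path)
            removed.append(t.path)
    return removed
```

Run as a final task that depends on the whole workflow, it would only fire once everything downstream of the tmp files has completed.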
Use pipes wherever possible. See luigi.format.InputPipeProcessWrapper etc. and luigi.file. Incidentally, how would using pipes over several nodes work? I guess tasks connected by pipes should be gathered and run by a single worker.
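At the process level, connecting two commands without an intermediate file looks roughly like this (the commands are placeholders; real usage would pipe e.g. an aligner into samtools):

```python
import subprocess


def pipe(cmd1, cmd2):
    """Run cmd1 | cmd2 and return cmd2's stdout as bytes."""
    p1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(cmd2, stdin=p1.stdout, stdout=subprocess.PIPE)
    p1.stdout.close()  # let cmd1 receive SIGPIPE if cmd2 exits early
    out, _ = p2.communicate()
    return out
```

Since the pipe is an in-memory file descriptor on one machine, this also answers the multi-node question: tasks joined by a pipe must be scheduled together on a single worker.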
Integrate with hadoop. This may be extremely easy: set the job runner for the JobTasks via the config file; by default, they use DefaultShellJobRunner, but could also use a (customized and subclassed?) version of hadoop_jar.HadoopJarJobRunner