bjpop / rubra
Infrastructure code to support DNA pipeline
License: MIT License
Rubra: a bioinformatics pipeline.
---------------------------------

https://github.com/bjpop/rubra

License:
--------

Rubra is licensed under the MIT license. See LICENSE.txt.

Description:
------------

Rubra is a pipeline system for bioinformatics workflows. It is built on top of the Ruffus (http://www.ruffus.org.uk/) Python library, and adds support for running pipeline stages on a distributed compute cluster.

Authors:
--------

Bernie Pope, Clare Sloggett, Gayle Philip, Matthew Wakefield

Installation:
-------------

To install, clone this repository and run `setup.py`:

    git clone https://github.com/bjpop/rubra
    cd rubra
    python setup.py install

If you are on a system where you do not have administrative privileges, we suggest using virtualenv ( http://www.virtualenv.org/ ). On HPC systems you may find virtualenv is already installed.

Usage:
------

    usage: rubra [-h] PIPELINE_FILE --config CONFIG_FILE [CONFIG_FILE ...]
                 [--verbose {0,1,2}]
                 [--style {print,run,touchfiles,flowchart}]
                 [--force TASKNAME] [--end TASKNAME]
                 [--rebuild {fromstart,fromend}]

    A bioinformatics pipeline system.

    optional arguments:
      -h, --help            show this help message and exit
      PIPELINE_FILE         Your Ruffus pipeline stages (a Python module)
      --config CONFIG_FILE [CONFIG_FILE ...]
                            One or more configuration files (Python modules)
      --verbose {0,1,2}     Output verbosity level: 0 = quiet; 1 = normal; 2 =
                            chatty (default is 1)
      --style {print,run,touchfiles,flowchart}
                            Pipeline behaviour: print; run; touchfiles; flowchart
                            (default is print)
      --force TASKNAME      tasks which are forced to be out of date regardless of
                            timestamps
      --end TASKNAME        end points (tasks) for the pipeline
      --rebuild {fromstart,fromend}
                            rebuild outputs by working back from end tasks or
                            forwards from start tasks (default is fromstart)

Example:
--------

Below is a little example pipeline which you can find in the Rubra source tree. It counts the number of lines in two files (test/data1.txt and test/data2.txt), and then sums the results together.

    rubra example_pipeline.py --config example_config.py --style run

There are 2 lines in the first file and 1 line in the second file. So the result is 3, which is written to the output file test/total.txt.

The PIPELINE_FILE argument is a Python script which contains the actual code for each pipeline stage (using Ruffus notation). The --config argument is a Python script which contains configuration options for the whole pipeline, plus options for each stage (including the shell command to run in the stage). The --style argument says what to do with the pipeline: "run" means "perform the out-of-date steps in the pipeline". The default style is "print" which just displays what the pipeline would do if it were run. You can get a diagram of the pipeline using the "flowchart" style. You can touch all files in order using the "touchfiles" style, which is mostly useful for forcing Ruffus to acknowledge that a set of steps is up to date.

Configuration:
--------------

Configuration options are written into one or more Python scripts, which are passed to Rubra via the --config command line argument.

Some options are required, and some are, well, optional.

Options for the whole pipeline:
-------------------------------

    pipeline = {
        "logDir": "log",
        "logFile": "pipeline.log",
        "procs": 2,
        "end": ["total"],
    }

Options for each stage of the pipeline:
---------------------------------------

    stageDefaults = {
        "distributed": False,
        "walltime": "00:10:00",
        "memInGB": 1,
        "queue": "batch",
        "modules": ["python-gcc"]
    }
    stages = {
        "countLines": {
            "command": "wc -l %file > %out",
        },
        "total": {
            "command": "./test/total.py %files > %out",
        },
    }

sys.path includes the rubra script's directory, but does not include the loaded pipeline script's directory. This means that we can't import other python files that are in the same directory as the pipeline, which we would naively expect to be able to do.
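One possible fix is sketched below: before importing the pipeline, put the pipeline script's own directory on sys.path. The helper name `add_script_dir_to_path` is hypothetical and is not part of Rubra; it only illustrates the idea.

```python
import os
import sys


def add_script_dir_to_path(script_path):
    """Put the directory containing script_path at the front of sys.path.

    Returns the module name (the file name without its .py suffix) so the
    caller can import the script afterwards.
    """
    script_dir = os.path.dirname(os.path.abspath(script_path))
    if script_dir not in sys.path:
        # Front of the list, so modules next to the pipeline win over
        # same-named modules elsewhere on the path.
        sys.path.insert(0, script_dir)
    module_name, _ext = os.path.splitext(os.path.basename(script_path))
    return module_name
```

With this in place, an `import` statement inside the pipeline script can find sibling Python files, because their directory is searched like any other sys.path entry.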
Hello,
I tried installing directly from pip and got the following errors:
pip install git+git://github.com/bjpop/rubra
Collecting git+git://github.com/bjpop/rubra
Cloning git://github.com/bjpop/rubra to /private/var/folders/d7/8vn6rd1d6f37gtgy_13h3b95mx3fmv/T/pip-fvxy0kg6-build
Collecting ruffus==2.2 (from Rubra==0.1.5)
Downloading ruffus-2.2.zip (5.9MB)
100% |████████████████████████████████| 5.9MB 115kB/s
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/var/folders/d7/8vn6rd1d6f37gtgy_13h3b95mx3fmv/T/pip-build-wo2ofvka/ruffus/setup.py", line 2, in <module>
import ez_setup
File "/private/var/folders/d7/8vn6rd1d6f37gtgy_13h3b95mx3fmv/T/pip-build-wo2ofvka/ruffus/ez_setup.py", line 98
except pkg_resources.VersionConflict, e:
^
SyntaxError: invalid syntax
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/d7/8vn6rd1d6f37gtgy_13h3b95mx3fmv/T/pip-build-wo2ofvka/ruffus/
We should make it possible to install Rubra in the standard Python way.
Currently we have a Torque/PBS specific way of submitting jobs to clusters.
We could make it more portable by using DRMAA instead.
At the moment the srun command line is passed to Popen as a string, but it ought to be a list of strings.
The recommendation is to use shlex.split to parse the line into strings.
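A minimal sketch of that recommendation follows. The function name `run_command_line` is hypothetical, and the srun command line in the usage note is only an illustration, not the one Rubra builds.

```python
import shlex
import subprocess


def run_command_line(command_line):
    """Split a command line into arguments and run it without a shell."""
    # shlex.split honours shell-style quoting, so a quoted argument that
    # contains spaces stays a single list element.
    args = shlex.split(command_line)
    process = subprocess.Popen(args, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    return process.returncode, stdout, stderr
```

For example, `shlex.split('srun --job-name "my job" --mem=1024 job.sh')` gives `['srun', '--job-name', 'my job', '--mem=1024', 'job.sh']`, which Popen can execute directly without `shell=True`.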
It's not obvious to the user how to make use of the options in the config files, i.e. `from rubra.utils import pipeline_options`. We should include an example of this and perhaps import pipeline_options into the main rubra script so that it's `from rubra import pipeline_options`.
Or, should the user import user-defined options files themselves? If yes, do they still need to access the pipeline config options for any reason? We should give an example of what we think is the right way to do both these things.
The current slurmified rubra generates a shell script for each job, but it does not save it with a suitably unique name.
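One way to get a unique name, sketched under the assumption that the script text is already available as a string, is to let `tempfile.mkstemp` pick the file name. The function `write_job_script` and its parameters are hypothetical and do not appear in Rubra.

```python
import os
import tempfile


def write_job_script(stage_name, script_text, script_dir):
    """Write script_text to a new, uniquely named file in script_dir.

    mkstemp creates the file itself and fails rather than reusing an
    existing name, so two jobs for the same stage never share a script.
    """
    fd, path = tempfile.mkstemp(prefix=stage_name + ".", suffix=".sh",
                                dir=script_dir, text=True)
    with os.fdopen(fd, "w") as handle:
        handle.write(script_text)
    return path
```

Keeping the stage name as a prefix means the generated scripts can still be matched to their stage when debugging.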
We can use the usual Ruffus functionality where we put a step out of order and use the task name as a string, and this is ok:
    from ruffus import follows

    @follows('first_task')
    def second_task():
        ...

    def first_task():
        ...
However if we put two tasks before the same dependency task, we get an error:
    from ruffus import follows

    @follows('first_task')
    def second_task():
        ...

    @follows('first_task')
    def other_second_task():
        ...

    def first_task():
        ...
The second @follows('first_task') will throw an error like
ruffus.graph.error_duplicate_node_name: [pipeline.first_task] has already been added
This does not seem to happen when using straight ruffus scripts, without rubra.
Add support for SLURM job scheduler.
We need to start version numbering Rubra, and tagging versions in the git repository.
At the moment the config file and pipeline file need to be in the same directory due to the magic we do with import paths.
It would sometimes be convenient to have them in different directories.
Hello, I am running RedDog and I am getting the error that is detailed below. I reported this problem in RedDog GitHub and they suggested that it could be a problem with Rubra and server permissions. I am running the pipeline on a torque/qsub system. Could you help me? Thanks in advance!
Starting pipeline...
155 jobs to be executed in total
Traceback (most recent call last):
File "/usr/local/bin/rubra", line 11, in <module>
load_entry_point('Rubra==0.1.5', 'console_scripts', 'rubra')()
File "build/bdist.linux-x86_64/egg/rubra/rubra.py", line 66, in main
File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 2680, in pipeline_run
ruffus.ruffus_exceptions.RethrownJobError:
Exceptions running jobs for
'def RedDog.makeDir(...):'
Original exception:
Exception #1
exceptions.Exception(qsub command failed with exit status: 172):
for RedDog.makeDir.Job = [False -> dir.makeDir.Success]
Traceback (most recent call last):
File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 517, in run_pooled_job_without_exceptions
return_value = job_wrapper(param, user_defined_work_func, register_cleanup, touch_files_only)
File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 447, in job_wrapper_io_files
ret_val = user_defined_work_func(*param)
File "RedDog.py", line 955, in makeDir
runStageCheck('makeDir', flagFile, outPrefix, full_sequence_list_string)
File "build/bdist.linux-x86_64/egg/rubra/utils.py", line 128, in runStageCheck
status = runStage(stage, *args)
File "build/bdist.linux-x86_64/egg/rubra/utils.py", line 144, in runStage
exitStatus = distributedCommand(stage, commandStr, pipeline_options)
File "build/bdist.linux-x86_64/egg/rubra/utils.py", line 122, in distributedCommand
return script.runJobAndWait(stage, logDir, verbosity)
File "build/bdist.linux-x86_64/egg/rubra/cluster_job.py", line 65, in runJobAndWait
jobID = self.launch()
File "build/bdist.linux-x86_64/egg/rubra/cluster_job.py", line 138, in launch
str(returnCode)))
Exception: qsub command failed with exit status: 172
We should check that the required configuration options are present and somewhat meaningful.
We should also make sure they have appropriate and documented defaults.
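A rough sketch of such a check for the whole-pipeline options is shown below. The option names come from the README example; treating all four as required, and the specific rules for `procs` and `end`, are assumptions made for illustration only.

```python
# Assumption: these four options (taken from the README example) are required.
REQUIRED_PIPELINE_OPTIONS = ("logDir", "logFile", "procs", "end")


def check_pipeline_options(pipeline):
    """Return a list of human-readable problems found in the pipeline dict."""
    errors = []
    for name in REQUIRED_PIPELINE_OPTIONS:
        if name not in pipeline:
            errors.append("missing required option: " + name)

    procs = pipeline.get("procs")
    if procs is not None:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(procs, bool) or not isinstance(procs, int) or procs < 1:
            errors.append("procs must be a positive integer")

    end = pipeline.get("end")
    if end is not None:
        if (not isinstance(end, list) or not end
                or not all(isinstance(task, str) for task in end)):
            errors.append("end must be a non-empty list of task names")
    return errors
```

Returning every problem at once, rather than raising on the first, lets the user fix a configuration file in a single pass.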
Currently we only generate SVGs, but it would be nice to be able to output PNG or JPG.
Particularly useful for running these on our CloudBioLinux instances.
rubra example_pipeline.py --config example_config.py --style run
[achalk@merri examples]$ ../rubra/rubra.py example_pipeline.py --config example_config.py --style run
Traceback (most recent call last):
File "../rubra/rubra.py", line 86, in <module>
main()
File "../rubra/rubra.py", line 35, in main
__import__(drop_py_suffix(args.pipeline))
File "example_pipeline.py", line 15, in <module>
from rubra.utils import (runStageCheck)
ImportError: No module named utils
[achalk@merri rubra]$ rubra/rubra.py example_pipeline.py --config example_config.py --style run
Traceback (most recent call last):
File "rubra/rubra.py", line 86, in <module>
main()
File "rubra/rubra.py", line 66, in main
gnu_make_maximal_rebuild_mode=rebuildMode)
File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 2680, in pipeline_run
ruffus.ruffus_exceptions.RethrownJobError:
Exceptions running jobs for
'def example_pipeline.countLines(...):'
Original exceptions:
Exception #1
exceptions.AttributeError('NoneType' object has no attribute 'stages'):
for example_pipeline.countLines.Job = [test/data2.txt -> [test/data2.count, test/data2.count.Success]]
Traceback (most recent call last):
File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 517, in run_pooled_job_without_exceptions
return_value = job_wrapper(param, user_defined_work_func, register_cleanup, touch_files_only)
File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 447, in job_wrapper_io_files
ret_val = user_defined_work_func(*param)
File "example_pipeline.py", line 25, in countLines
runStageCheck('countLines', flagFile, file, output)
File "rubra/utils.py", line 132, in runStageCheck
status = runStage(stage, *args)
File "rubra/utils.py", line 143, in runStage
command = getCommand(stage, pipeline_options)
File "rubra/utils.py", line 175, in getCommand
commandStr = getStageOptions(options, name, 'command')
File "rubra/utils.py", line 104, in getStageOptions
return options.stages[stage][optionName]
AttributeError: 'NoneType' object has no attribute 'stages'
Exception #2
exceptions.AttributeError('NoneType' object has no attribute 'stages'):
for example_pipeline.countLines.Job = [test/data1.txt -> [test/data1.count, test/data1.count.Success]]
Traceback (most recent call last):
File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 517, in run_pooled_job_without_exceptions
return_value = job_wrapper(param, user_defined_work_func, register_cleanup, touch_files_only)
File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 447, in job_wrapper_io_files
ret_val = user_defined_work_func(*param)
File "example_pipeline.py", line 25, in countLines
runStageCheck('countLines', flagFile, file, output)
File "rubra/utils.py", line 132, in runStageCheck
status = runStage(stage, *args)
File "rubra/utils.py", line 143, in runStage
command = getCommand(stage, pipeline_options)
File "rubra/utils.py", line 175, in getCommand
commandStr = getStageOptions(options, name, 'command')
File "rubra/utils.py", line 104, in getStageOptions
return options.stages[stage][optionName]
AttributeError: 'NoneType' object has no attribute 'stages'
Currently you must specify the pipeline as a flag on the command line with --pipeline.
This is unnatural, and the pipeline is required anyway, so we should allow the user to specify it without the flag.
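With argparse this amounts to declaring the pipeline as a positional argument. The sketch below covers only three of the options and is an illustration of the idea, not Rubra's actual parser; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser in which the pipeline file is a positional argument."""
    parser = argparse.ArgumentParser(
        prog="rubra", description="A bioinformatics pipeline system.")
    # No leading dashes: argparse treats this as a required positional.
    parser.add_argument(
        "pipeline", metavar="PIPELINE_FILE",
        help="Your Ruffus pipeline stages (a Python module)")
    parser.add_argument(
        "--config", metavar="CONFIG_FILE", nargs="+", required=True,
        help="One or more configuration files (Python modules)")
    parser.add_argument(
        "--style", choices=["print", "run", "touchfiles", "flowchart"],
        default="print",
        help="Pipeline behaviour (default is print)")
    return parser
```

One caveat with this design: because --config accepts one or more values, the pipeline file has to come before --config on the command line, otherwise it is consumed as another configuration file.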
When a stage crashes and ruffus throws an exception, we end up with the situation where rubra dies but jobs might be launched on the cluster.
These jobs probably keep running, but they will not produce success files.
It would be nice if rubra was more robust in the face of such errors.
Referencing files by their full path results in a "file not found" error even though the file exists:
rubra/rubra.py /vlsci/VR0244/shared/git/rubra/examples/example_pipeline.py --config /vlsci/VR0244/shared/git/rubra/examples/example_config.py --style run
Could not find configuration file: /vlsci/VR0244/shared/git/rubra/examples/example_config.py
ls -al /vlsci/VR0244/shared/git/rubra/examples/example_pipeline.py /vlsci/VR0244/shared/git/rubra/examples/example_config.py
-rw-r----- 1 achalk VR0244 452 Feb 19 10:34 /vlsci/VR0244/shared/git/rubra/examples/example_config.py
-rw-r----- 1 achalk VR0244 847 Feb 19 10:34 /vlsci/VR0244/shared/git/rubra/examples/example_pipeline.py
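A possible way to avoid depending on the current directory is to load each file from its path rather than by module name. The sketch below assumes Python 3.5 or later, which may not match the interpreter Rubra targets, and `load_module_from_path` is a hypothetical helper.

```python
import importlib.util
import os


def load_module_from_path(path):
    """Load a Python source file as a module, given a relative or full path."""
    path = os.path.abspath(path)
    if not os.path.isfile(path):
        raise IOError("Could not find file: " + path)
    module_name = os.path.splitext(os.path.basename(path))[0]
    # Build the module from the file location directly, so sys.path and
    # the current working directory play no part in finding it.
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

The same approach would let the pipeline file and the configuration files live in different directories.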