Infrastructure code to support DNA pipeline

License: MIT License

Rubra: a bioinformatics pipeline.
---------------------------------

https://github.com/bjpop/rubra

License:
--------

Rubra is licensed under the MIT license. See LICENSE.txt.

Description:
------------

Rubra is a pipeline system for bioinformatics workflows. It is built on top
of the Ruffus (http://www.ruffus.org.uk/) Python library, and adds support
for running pipeline stages on a distributed compute cluster.

Authors:
--------

Bernie Pope, Clare Sloggett, Gayle Philip, Matthew Wakefield

Installation:
-------------

To install, clone this repository and run `setup.py`:

    git clone https://github.com/bjpop/rubra
    cd rubra
    python setup.py install

If you are on a system where you do not have administrative privileges, we
suggest using virtualenv ( http://www.virtualenv.org/ ). On HPC systems you 
may find virtualenv is already installed.

Usage:
------

usage: rubra [-h] PIPELINE_FILE --config CONFIG_FILE
                [CONFIG_FILE ...] [--verbose {0,1,2}]
                [--style {print,run,touchfiles,flowchart}] [--force TASKNAME]
                [--end TASKNAME] [--rebuild {fromstart,fromend}]

A bioinformatics pipeline system.

optional arguments:
  -h, --help            show this help message and exit
  PIPELINE_FILE         Your Ruffus pipeline stages (a Python module)
  --config CONFIG_FILE [CONFIG_FILE ...]
                        One or more configuration files (Python modules)
  --verbose {0,1,2}     Output verbosity level: 0 = quiet; 1 = normal; 2 =
                        chatty (default is 1)
  --style {print,run,touchfiles,flowchart}
                        Pipeline behaviour: print; run; touchfiles; flowchart (default is
                        print)
  --force TASKNAME      tasks which are forced to be out of date regardless of
                        timestamps
  --end TASKNAME        end points (tasks) for the pipeline
  --rebuild {fromstart,fromend}
                        rebuild outputs by working back from end tasks or
                        forwards from start tasks (default is fromstart)

Example:
--------

Below is a little example pipeline which you can find in the Rubra source
tree. It counts the number of lines in two files (test/data1.txt and
test/data2.txt), and then sums the results together.

    rubra example_pipeline.py --config example_config.py --style run

There are 2 lines in the first file and 1 line in the second file. So the
result is 3, which is written to the output file test/total.txt.

The PIPELINE_FILE argument is a Python script which contains the actual
code for each pipeline stage (using Ruffus notation). The --config
argument is a Python script which contains configuration options for the
whole pipeline, plus options for each stage (including the shell command
to run in the stage). The --style argument says what to do with the pipeline:
"run" means "perform the out-of-date steps in the pipeline". The default
style is "print" which just displays what the pipeline would do if it were
run. You can get a diagram of the pipeline using the "flowchart" style. You 
can touch all files in order using the "touchfiles" style, which is mostly 
useful for forcing Ruffus to acknowledge that a set of steps is up to date.
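
To make the two stages concrete, here is a minimal plain-Python sketch of the
computation the example performs. It does not use Rubra or Ruffus, and the
function names `count_lines`, `total` and `run_example` are hypothetical, chosen
only for illustration. It is written for Python 3.

```python
def count_lines(path):
    """Mimic `wc -l`: count the newline characters in a file."""
    with open(path, "rb") as handle:
        return handle.read().count(b"\n")


def total(counts):
    """Sum the per-file counts, as the "total" stage does."""
    return sum(counts)


def run_example(paths, out_path):
    """Count lines in each input file and write the sum to out_path."""
    counts = [count_lines(path) for path in paths]
    result = total(counts)
    with open(out_path, "w") as out:
        out.write(f"{result}\n")
    return result
```

In the real pipeline each stage runs as a shell command taken from the
configuration file, and Ruffus decides which stages are out of date.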

Configuration:
--------------

Configuration options are written into one or more Python scripts, which
are passed to Rubra via the --config command line argument.

Some options are required, and some are, well, optional.

Options for the whole pipeline:
-------------------------------

    pipeline = {
        "logDir": "log",
        "logFile": "pipeline.log",
        "procs": 2,
        "end": ["total"],
    }


Options for each stage of the pipeline:
---------------------------------------

    stageDefaults = {
        "distributed": False,
        "walltime": "00:10:00",
        "memInGB": 1,
        "queue": "batch",
        "modules": ["python-gcc"]
    }

    stages = {
        "countLines": {
            "command": "wc -l %file > %out",
        },
        "total": {
            "command": "./test/total.py %files > %out",
        },
    }
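
The sketch below shows one way these two dictionaries can combine, assuming
that a value set for a particular stage overrides the value in stageDefaults.
The helper name `get_stage_option` and the `"memInGB": 4` override on the
"total" stage are hypothetical, added only to illustrate the lookup. This is
not a copy of Rubra's own code.

```python
stage_defaults = {
    "distributed": False,
    "walltime": "00:10:00",
    "memInGB": 1,
    "queue": "batch",
    "modules": ["python-gcc"],
}

stages = {
    "countLines": {"command": "wc -l %file > %out"},
    # The memInGB override here is hypothetical, for illustration only.
    "total": {"command": "./test/total.py %files > %out", "memInGB": 4},
}


def get_stage_option(stage, option):
    """Return the stage-specific value if present, else the default."""
    stage_options = stages[stage]
    if option in stage_options:
        return stage_options[option]
    return stage_defaults[option]
```

With this lookup, "countLines" would use 1 GB of memory (the default) while
"total" would use its own setting.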

rubra's Issues

Modules in the pipeline script's directory can't be imported when it is called via rubra

sys.path includes the rubra script's directory, but does not include the loaded pipeline script's directory. This means that we can't import other python files that are in the same directory as the pipeline, which we would naively expect to be able to do.
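
One possible workaround, sketched here under the assumption that the pipeline
script's path is known before it is imported, is to put that script's
directory on sys.path first. The helper name `add_script_dir_to_path` is
hypothetical and is not part of Rubra.

```python
import os
import sys


def add_script_dir_to_path(script_path):
    """Put the directory containing script_path at the front of sys.path."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    if script_dir not in sys.path:
        # Inserting at position 0 gives sibling modules of the pipeline
        # script priority over same-named modules elsewhere on the path.
        sys.path.insert(0, script_dir)
    return script_dir
```

After calling this with the pipeline file's path, a plain `import` of a module
that sits next to the pipeline script should resolve.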

Trouble installing directly from pip

Hello,
I tried installing directly from pip and got the following errors:

pip install git+git://github.com/bjpop/rubra
Collecting git+git://github.com/bjpop/rubra
Cloning git://github.com/bjpop/rubra to /private/var/folders/d7/8vn6rd1d6f37gtgy_13h3b95mx3fmv/T/pip-fvxy0kg6-build
Collecting ruffus==2.2 (from Rubra==0.1.5)
Downloading ruffus-2.2.zip (5.9MB)
100% |████████████████████████████████| 5.9MB 115kB/s
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/var/folders/d7/8vn6rd1d6f37gtgy_13h3b95mx3fmv/T/pip-build-wo2ofvka/ruffus/setup.py", line 2, in <module>
import ez_setup
File "/private/var/folders/d7/8vn6rd1d6f37gtgy_13h3b95mx3fmv/T/pip-build-wo2ofvka/ruffus/ez_setup.py", line 98
except pkg_resources.VersionConflict, e:
^
SyntaxError: invalid syntax

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/d7/8vn6rd1d6f37gtgy_13h3b95mx3fmv/T/pip-build-wo2ofvka/ruffus/

Make Rubra an installable package in the normal Python way.

We should make it possible to install Rubra in the standard Python way.

Investigate the use of DRMAA to submit jobs to clusters.

Currently we have a Torque/PBS specific way of submitting jobs to clusters.

We could make it more portable by using DRMAA instead.

Use shlex.split to parse srun command line

At the moment the srun command line is passed to Popen as a string, but it ought to be a list of strings.

The recommendation is to use shlex.split to parse the line into strings.
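
A minimal sketch of that recommendation, using only the standard library. The
helper name `run_command_line` is hypothetical, and the sketch runs the
command directly rather than reproducing how Rubra launches srun. It assumes
Python 3.7 or later and a POSIX-style shell syntax.

```python
import shlex
import subprocess


def run_command_line(command_line):
    """Split a shell-style command line into a list and run it without a shell."""
    # shlex.split honours quoting, so 'my file.txt' stays one argument.
    args = shlex.split(command_line)
    completed = subprocess.run(args, capture_output=True, text=True)
    return args, completed.returncode, completed.stdout
```

For example, `shlex.split("srun --mem=1G wc -l 'my file.txt'")` yields
`["srun", "--mem=1G", "wc", "-l", "my file.txt"]`, which can be handed to
Popen as a list.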

Add an example pipeline that makes use of the config files

It's not obvious to the user how to make use of the options in the config files, i.e. via `from rubra.utils import pipeline_options`. We should include an example of this, and perhaps import pipeline_options into the main rubra script so that the import becomes `from rubra import pipeline_options`.

Or, should the user import user-defined options files themselves? If yes, do they still need to access the pipeline config options for any reason? We should give an example of what we think is the right way to do both these things.

save a separate shell script for each submitted job

The current slurmified rubra generates a shell script for each job, but it does not save it with a suitably unique name.
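
One way to get unique names, sketched with the standard library's
`tempfile.mkstemp`. The helper name `write_job_script` and the
`<stage>.<random>.sh` naming scheme are assumptions for illustration, not
Rubra's current behaviour.

```python
import os
import tempfile


def write_job_script(script_text, stage, script_dir):
    """Write script_text to a uniquely named file in script_dir; return its path."""
    os.makedirs(script_dir, exist_ok=True)
    # mkstemp creates a fresh file atomically, so two jobs for the same
    # stage never overwrite each other's scripts.
    fd, path = tempfile.mkstemp(prefix=f"{stage}.", suffix=".sh", dir=script_dir)
    with os.fdopen(fd, "w") as handle:
        handle.write(script_text)
    return path
```

Unlike an ordinary temporary file, the script is left in place after the job
is submitted, so it can be inspected later.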

Putting too many steps out of order results in an error that is not usually seen in Ruffus

We can use the usual Ruffus functionality where we put a step out of order and use the task name as a string, and this is ok:

@follows('first_task')
def second_task():
    ....

def first_task():
    ....

However if we put two tasks before the same dependency task, we get an error:

@follows('first_task')
def second_task():
    ....

@follows('first_task')
def other_second_task():
    ....

def first_task():
    ....

The second @follows('first_task') will throw an error like
ruffus.graph.error_duplicate_node_name: [pipeline.first_task] has already been added

This does not seem to happen when using straight ruffus scripts, without rubra.

Support SLURM job scheduler

Add support for SLURM job scheduler.

Give Rubra a version number

We need to start version numbering Rubra, and tagging versions in the git repository.

Allow config file to be in different directory to pipeline file

At the moment the config file and pipeline file need to be in the same directory due to the magic we do with import paths.

It would sometimes be convenient to have them in different directories.
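
A sketch of one approach: load each file by its path with `importlib`, so that
neither file has to be on the import path. The helper name
`load_module_from_path` is hypothetical, and the sketch assumes Python 3.5 or
later.

```python
import importlib.util
import os


def load_module_from_path(path, module_name=None):
    """Import a Python source file by path, whatever directory it lives in."""
    path = os.path.abspath(path)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Could not find configuration file: {path}")
    if module_name is None:
        # Default to the file name without its .py suffix.
        module_name = os.path.splitext(os.path.basename(path))[0]
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

Because the path is made absolute before loading, relative and full paths are
treated the same way.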

Rubra RedDog error

Hello, I am running RedDog and I am getting the error that is detailed below. I reported this problem in RedDog GitHub and they suggested that it could be a problem with Rubra and server permissions. I am running the pipeline on a torque/qsub system. Could you help me? Thanks in advance!

Starting pipeline...
155 jobs to be executed in total
Traceback (most recent call last):
File "/usr/local/bin/rubra", line 11, in <module>
load_entry_point('Rubra==0.1.5', 'console_scripts', 'rubra')()
File "build/bdist.linux-x86_64/egg/rubra/rubra.py", line 66, in main
File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 2680, in pipeline_run
ruffus.ruffus_exceptions.RethrownJobError:

Exceptions running jobs for

'def RedDog.makeDir(...):'

Original exception:

Exception #1
exceptions.Exception(qsub command failed with exit status: 172):
for RedDog.makeDir.Job = [False -> dir.makeDir.Success]

Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 517, in run_pooled_job_without_exceptions
    return_value =  job_wrapper(param, user_defined_work_func, register_cleanup, touch_files_only)
  File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 447, in job_wrapper_io_files
    ret_val = user_defined_work_func(*param)
  File "RedDog.py", line 955, in makeDir
    runStageCheck('makeDir', flagFile, outPrefix, full_sequence_list_string)
  File "build/bdist.linux-x86_64/egg/rubra/utils.py", line 128, in runStageCheck
    status = runStage(stage, *args)
  File "build/bdist.linux-x86_64/egg/rubra/utils.py", line 144, in runStage
    exitStatus = distributedCommand(stage, commandStr, pipeline_options)
  File "build/bdist.linux-x86_64/egg/rubra/utils.py", line 122, in distributedCommand
    return script.runJobAndWait(stage, logDir, verbosity)
  File "build/bdist.linux-x86_64/egg/rubra/cluster_job.py", line 65, in runJobAndWait
    jobID = self.launch()
  File "build/bdist.linux-x86_64/egg/rubra/cluster_job.py", line 138, in launch
    str(returnCode)))
Exception: qsub command failed with exit status: 172

Perform checking of configuration options

We should check that the required configuration options are present and somewhat meaningful.

We should also make sure they have appropriate and documented defaults.
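
A sketch of what such a check could look like for the whole-pipeline options.
Which options count as required, their types, and the default for "end" are
all assumptions made for this illustration, and `check_pipeline_options` is a
hypothetical name.

```python
# Assumed required options and their expected types (illustrative only).
REQUIRED_PIPELINE_OPTIONS = {"logDir": str, "logFile": str, "procs": int}
# Assumed default for an option that may be omitted (illustrative only).
PIPELINE_DEFAULTS = {"end": []}


def check_pipeline_options(options):
    """Return a copy of options with defaults filled in, or raise ValueError."""
    problems = []
    for name, expected_type in REQUIRED_PIPELINE_OPTIONS.items():
        if name not in options:
            problems.append(f"missing required option: {name}")
        elif not isinstance(options[name], expected_type):
            problems.append(f"option {name} should be of type {expected_type.__name__}")
    if isinstance(options.get("procs"), int) and options["procs"] < 1:
        problems.append("option procs must be at least 1")
    if problems:
        raise ValueError("; ".join(problems))
    checked = dict(PIPELINE_DEFAULTS)
    checked.update(options)
    return checked
```

Collecting every problem before raising lets the user fix the whole
configuration file in one pass.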

Generate other image format types for flowchart

Currently we only generate SVGs, but it would be nice to be able to output PNG or JPG.

Allow module loading for non-distributed jobs

Particularly useful for running these on our CloudBioLinux instances.

example doesn't run out of the box

Running in examples directory gives missing utils module error

rubra example_pipeline.py --config example_config.py --style run

[achalk@merri examples]$ ../rubra/rubra.py example_pipeline.py --config example_config.py --style run
Traceback (most recent call last):
File "../rubra/rubra.py", line 86, in <module>
main()
File "../rubra/rubra.py", line 35, in main
__import__(drop_py_suffix(args.pipeline))
File "example_pipeline.py", line 15, in <module>
from rubra.utils import (runStageCheck)
ImportError: No module named utils

Copying example_pipeline.py, example_config.py and examples/test to the rubra directory fixes this.

But there is an error in the pipeline.

[achalk@merri rubra]$ rubra/rubra.py example_pipeline.py --config example_config.py --style run
Traceback (most recent call last):
File "rubra/rubra.py", line 86, in <module>
main()
File "rubra/rubra.py", line 66, in main
gnu_make_maximal_rebuild_mode=rebuildMode)
File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 2680, in pipeline_run
ruffus.ruffus_exceptions.RethrownJobError:

Exceptions running jobs for

'def example_pipeline.countLines(...):'

Original exceptions:

Exception #1
exceptions.AttributeError('NoneType' object has no attribute 'stages'):
for example_pipeline.countLines.Job = [test/data2.txt -> [test/data2.count, test/data2.count.Success]]

Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 517, in run_pooled_job_without_exceptions
    return_value =  job_wrapper(param, user_defined_work_func, register_cleanup, touch_files_only)
  File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 447, in job_wrapper_io_files
    ret_val = user_defined_work_func(*param)
  File "example_pipeline.py", line 25, in countLines
    runStageCheck('countLines', flagFile, file, output)
  File "rubra/utils.py", line 132, in runStageCheck
    status = runStage(stage, *args)
  File "rubra/utils.py", line 143, in runStage
    command = getCommand(stage, pipeline_options)
  File "rubra/utils.py", line 175, in getCommand
    commandStr = getStageOptions(options, name, 'command')
  File "rubra/utils.py", line 104, in getStageOptions
    return options.stages[stage][optionName]
AttributeError: 'NoneType' object has no attribute 'stages'


Exception #2
exceptions.AttributeError('NoneType' object has no attribute 'stages'):
for example_pipeline.countLines.Job = [test/data1.txt -> [test/data1.count, test/data1.count.Success]]

Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 517, in run_pooled_job_without_exceptions
    return_value =  job_wrapper(param, user_defined_work_func, register_cleanup, touch_files_only)
  File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 447, in job_wrapper_io_files
    ret_val = user_defined_work_func(*param)
  File "example_pipeline.py", line 25, in countLines
    runStageCheck('countLines', flagFile, file, output)
  File "rubra/utils.py", line 132, in runStageCheck
    status = runStage(stage, *args)
  File "rubra/utils.py", line 143, in runStage
    command = getCommand(stage, pipeline_options)
  File "rubra/utils.py", line 175, in getCommand
    commandStr = getStageOptions(options, name, 'command')
  File "rubra/utils.py", line 104, in getStageOptions
    return options.stages[stage][optionName]
AttributeError: 'NoneType' object has no attribute 'stages'

Make the pipeline a required argument and remove the --pipeline flag

Currently you must specify the pipeline as a flag on the command line with --pipeline.

This is unnatural, and the pipeline is required anyway, so we should allow the user to specify it without the flag.

Make rubra more robust in the face of job errors

When a stage crashes and ruffus throws an exception, we end up with the situation where rubra dies but jobs might be launched on the cluster.

These jobs probably keep running, but they will not produce success files.

It would be nice if rubra was more robust in the face of such errors.

The default for --rebuild appears to actually be fromstart (which I would expect), but the docs say it is fromend.

Make the installed rubra package executable as a script from the user's path

handling of full paths in command line

Referencing files by their full path results in a "file not found" error even though the file exists:

rubra/rubra.py /vlsci/VR0244/shared/git/rubra/examples/example_pipeline.py --config /vlsci/VR0244/shared/git/rubra/examples/example_config.py --style run
Could not find configuration file: /vlsci/VR0244/shared/git/rubra/examples/example_config.py

ls -al /vlsci/VR0244/shared/git/rubra/examples/example_pipeline.py /vlsci/VR0244/shared/git/rubra/examples/example_config.py
-rw-r----- 1 achalk VR0244 452 Feb 19 10:34 /vlsci/VR0244/shared/git/rubra/examples/example_config.py
-rw-r----- 1 achalk VR0244 847 Feb 19 10:34 /vlsci/VR0244/shared/git/rubra/examples/example_pipeline.py