Rubra: a bioinformatics pipeline.
---------------------------------

https://github.com/bjpop/rubra

License:
--------

Rubra is licensed under the MIT license. See LICENSE.txt.

Description:
------------

Rubra is a pipeline system for bioinformatics workflows. It is built on top
of the Ruffus (http://www.ruffus.org.uk/) Python library, and adds support
for running pipeline stages on a distributed compute cluster.

Authors:
--------

Bernie Pope, Clare Sloggett, Gayle Philip, Matthew Wakefield

Installation:
-------------

To install, clone this repository and run `setup.py`:

    git clone https://github.com/bjpop/rubra
    cd rubra
    python setup.py install

If you are on a system where you do not have administrative privileges, we
suggest using virtualenv ( http://www.virtualenv.org/ ). On HPC systems you 
may find virtualenv is already installed.

Usage:
------

usage: rubra [-h] PIPELINE_FILE --config CONFIG_FILE
                [CONFIG_FILE ...] [--verbose {0,1,2}]
                [--style {print,run,touchfiles,flowchart}] [--force TASKNAME]
                [--end TASKNAME] [--rebuild {fromstart,fromend}]

A bioinformatics pipeline system.

optional arguments:
  -h, --help            show this help message and exit
  PIPELINE_FILE         Your Ruffus pipeline stages (a Python module)
  --config CONFIG_FILE [CONFIG_FILE ...]
                        One or more configuration files (Python modules)
  --verbose {0,1,2}     Output verbosity level: 0 = quiet; 1 = normal; 2 =
                        chatty (default is 1)
  --style {print,run,touchfiles,flowchart}
                        Pipeline behaviour: print; run; touchfiles; flowchart (default is
                        print)
  --force TASKNAME      tasks which are forced to be out of date regardless of
                        timestamps
  --end TASKNAME        end points (tasks) for the pipeline
  --rebuild {fromstart,fromend}
                        rebuild outputs by working back from end tasks or
                        forwards from start tasks (default is fromstart)

Example:
--------

Below is a little example pipeline which you can find in the Rubra source
tree. It counts the number of lines in two files (test/data1.txt and
test/data2.txt), and then sums the results together.

   rubra example_pipeline.py --config example_config.py --style run

There are 2 lines in the first file and 1 line in the second file. So the
result is 3, which is written to the output file test/total.txt.
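
As a rough illustration of what the two stages compute, here is a plain-Python sketch that does not use Rubra or Ruffus. The temporary files stand in for test/data1.txt and test/data2.txt, and the function names `count_lines`, `total` and `run_example` are made up for this example.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines in a file, similar to `wc -l` in the countLines stage."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def total(counts):
    """Sum the per-file counts, similar to the total stage."""
    return sum(counts)


def run_example():
    # Temporary stand-ins for test/data1.txt (2 lines) and test/data2.txt (1 line).
    workdir = tempfile.mkdtemp()
    contents = {"data1.txt": "a\nb\n", "data2.txt": "c\n"}
    paths = []
    for name in sorted(contents):
        path = os.path.join(workdir, name)
        with open(path, "w") as handle:
            handle.write(contents[name])
        paths.append(path)
    return total(count_lines(path) for path in paths)


if __name__ == "__main__":
    print(run_example())  # prints 3
```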

The PIPELINE_FILE argument is a Python script which contains the actual
code for each pipeline stage (using Ruffus notation). The --config
argument is a Python script which contains configuration options for the
whole pipeline, plus options for each stage (including the shell command
to run in the stage). The --style argument says what to do with the pipeline:
"run" means "perform the out-of-date steps in the pipeline". The default
style is "print" which just displays what the pipeline would do if it were
run. You can get a diagram of the pipeline using the "flowchart" style. You 
can touch all files in order using the "touchfiles" style, which is mostly 
useful for forcing Ruffus to acknowledge that a set of steps is up to date.

Configuration:
--------------

Configuration options are written into one or more Python scripts, which
are passed to Rubra via the --config command line argument.

Some options are required, and some are, well, optional.

Options for the whole pipeline:
-------------------------------

    pipeline = {
        "logDir": "log",
        "logFile": "pipeline.log",
        "procs": 2,
        "end": ["total"],
    }


Options for each stage of the pipeline:
---------------------------------------

    stageDefaults = {
        "distributed": False,
        "walltime": "00:10:00",
        "memInGB": 1,
        "queue": "batch",
        "modules": ["python-gcc"]
    }

    stages = {
        "countLines": {
            "command": "wc -l %file > %out",
        },
        "total": {
            "command": "./test/total.py %files > %out",
        },
    }
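
One way to read the two dictionaries above is that each stage inherits every option from stageDefaults unless it sets its own value. The sketch below illustrates that lookup rule only. The helper name `stage_option` is hypothetical, the `"memInGB": 4` override on the total stage was added purely for illustration, and the fallback behaviour is an assumption rather than a copy of Rubra's code.

```python
stageDefaults = {
    "distributed": False,
    "walltime": "00:10:00",
    "memInGB": 1,
    "queue": "batch",
    "modules": ["python-gcc"],
}

stages = {
    "countLines": {"command": "wc -l %file > %out"},
    # The memInGB override below is not in the example config; it is
    # here only to show a stage-specific value winning over the default.
    "total": {"command": "./test/total.py %files > %out", "memInGB": 4},
}


def stage_option(stage_name, option_name):
    """Return the stage's own value if set, else the value from stageDefaults.

    Hypothetical helper: raises KeyError if the option is set in neither place.
    """
    stage = stages[stage_name]
    if option_name in stage:
        return stage[option_name]
    return stageDefaults[option_name]
```

Under these assumptions, `stage_option("countLines", "memInGB")` falls back to the default of 1, while `stage_option("total", "memInGB")` returns the stage's own value of 4.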

rubra's Issues

Trouble installing directly from pip

Hello,
I tried installing directly from pip and got the following errors:

pip install git+git://github.com/bjpop/rubra
Collecting git+git://github.com/bjpop/rubra
Cloning git://github.com/bjpop/rubra to /private/var/folders/d7/8vn6rd1d6f37gtgy_13h3b95mx3fmv/T/pip-fvxy0kg6-build
Collecting ruffus==2.2 (from Rubra==0.1.5)
Downloading ruffus-2.2.zip (5.9MB)
100% |████████████████████████████████| 5.9MB 115kB/s
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/private/var/folders/d7/8vn6rd1d6f37gtgy_13h3b95mx3fmv/T/pip-build-wo2ofvka/ruffus/setup.py", line 2, in <module>
    import ez_setup
  File "/private/var/folders/d7/8vn6rd1d6f37gtgy_13h3b95mx3fmv/T/pip-build-wo2ofvka/ruffus/ez_setup.py", line 98
    except pkg_resources.VersionConflict, e:
                                        ^
SyntaxError: invalid syntax

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/d7/8vn6rd1d6f37gtgy_13h3b95mx3fmv/T/pip-build-wo2ofvka/ruffus/

Use shlex.split to parse srun command line

At the moment the srun command line is passed to Popen as a string, but it ought to be a list of strings.

The recommendation is to use shlex.split to parse the line into strings.
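
A minimal sketch of the suggestion, using only the standard library. The srun command line shown is a made-up example, and `split_command` is a hypothetical helper name, not part of Rubra.

```python
import shlex
import subprocess
import sys


def split_command(command_line):
    """Split a shell-style command line into an argument list for Popen."""
    return shlex.split(command_line)


# Made-up command line; the quotes keep the job name together as one argument.
argv = split_command('srun --time=00:10:00 --job-name="count lines" wc -l data.txt')
# argv == ['srun', '--time=00:10:00', '--job-name=count lines', 'wc', '-l', 'data.txt']

# The same idea with a command that exists wherever Python does.
# Popen receives a list of strings, so no shell is involved.
demo_argv = [sys.executable] + split_command('-c "print(1 + 2)"')
process = subprocess.Popen(demo_argv, stdout=subprocess.PIPE)
output, _ = process.communicate()
```

Passing a list avoids relying on a shell to re-parse the string, which is where quoting mistakes tend to creep in.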

Add an example pipeline that makes use of the config files

It's not obvious to the user how to make use of the options in the config files, i.e. `from rubra.utils import pipeline_options`. We should include an example of this, and perhaps import pipeline_options into the main rubra script so that it becomes `from rubra import pipeline_options`.

Or, should the user import user-defined options files themselves? If yes, do they still need to access the pipeline config options for any reason? We should give an example of what we think is the right way to do both these things.

Putting too many steps out of order results in an error that is not usually seen in Ruffus

We can use the usual Ruffus functionality where we put a step out of order and use the task name as a string, and this is ok:

    from ruffus import follows

    @follows('first_task')
    def second_task():
        pass  # task body omitted

    def first_task():
        pass  # task body omitted

However if we put two tasks before the same dependency task, we get an error:

    from ruffus import follows

    @follows('first_task')
    def second_task():
        pass  # task body omitted

    @follows('first_task')
    def other_second_task():
        pass  # task body omitted

    def first_task():
        pass  # task body omitted

The second @follows('first_task') will throw an error like
ruffus.graph.error_duplicate_node_name: [pipeline.first_task] has already been added

This does not seem to happen when using straight ruffus scripts, without rubra.

Rubra RedDog error

Hello, I am running RedDog and I am getting the error that is detailed below. I reported this problem in RedDog GitHub and they suggested that it could be a problem with Rubra and server permissions. I am running the pipeline on a torque/qsub system. Could you help me? Thanks in advance!

Starting pipeline...
155 jobs to be executed in total
Traceback (most recent call last):
  File "/usr/local/bin/rubra", line 11, in <module>
    load_entry_point('Rubra==0.1.5', 'console_scripts', 'rubra')()
  File "build/bdist.linux-x86_64/egg/rubra/rubra.py", line 66, in main
  File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 2680, in pipeline_run
ruffus.ruffus_exceptions.RethrownJobError:

Exceptions running jobs for

'def RedDog.makeDir(...):'

Original exception:

Exception #1
exceptions.Exception(qsub command failed with exit status: 172):
for RedDog.makeDir.Job = [False -> dir.makeDir.Success]

Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 517, in run_pooled_job_without_exceptions
    return_value =  job_wrapper(param, user_defined_work_func, register_cleanup, touch_files_only)
  File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 447, in job_wrapper_io_files
    ret_val = user_defined_work_func(*param)
  File "RedDog.py", line 955, in makeDir
    runStageCheck('makeDir', flagFile, outPrefix, full_sequence_list_string)
  File "build/bdist.linux-x86_64/egg/rubra/utils.py", line 128, in runStageCheck
    status = runStage(stage, *args)
  File "build/bdist.linux-x86_64/egg/rubra/utils.py", line 144, in runStage
    exitStatus = distributedCommand(stage, commandStr, pipeline_options)
  File "build/bdist.linux-x86_64/egg/rubra/utils.py", line 122, in distributedCommand
    return script.runJobAndWait(stage, logDir, verbosity)
  File "build/bdist.linux-x86_64/egg/rubra/cluster_job.py", line 65, in runJobAndWait
    jobID = self.launch()
  File "build/bdist.linux-x86_64/egg/rubra/cluster_job.py", line 138, in launch
    str(returnCode)))
Exception: qsub command failed with exit status: 172

Perform checking of configuration options

We should check that the required configuration options are present and somewhat meaningful.

We should also make sure they have appropriate and documented defaults.
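
A sketch of what such a check could look like. Which options count as required, and the default values used here, are assumptions made for illustration; only the key names come from the example configuration in the README. `check_pipeline_options` is a hypothetical function.

```python
# Assumed for this sketch: logDir and logFile are required,
# procs and end are optional with the defaults set below.
REQUIRED_PIPELINE_OPTIONS = ("logDir", "logFile")


def check_pipeline_options(pipeline):
    """Return a copy of `pipeline` with assumed defaults filled in.

    Raises ValueError listing every missing or obviously wrong option.
    """
    problems = []
    for name in REQUIRED_PIPELINE_OPTIONS:
        if name not in pipeline:
            problems.append("missing required option: %s" % name)

    checked = {"procs": 1, "end": []}  # assumed defaults
    checked.update(pipeline)

    procs = checked["procs"]
    if isinstance(procs, bool) or not isinstance(procs, int) or procs < 1:
        problems.append("procs must be a positive integer, got %r" % (procs,))

    if problems:
        raise ValueError("; ".join(problems))
    return checked
```

Collecting all problems before raising means a user sees every mistake in the configuration in one run, rather than fixing them one at a time.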

example doesn't run out of the box

Running in the examples directory gives a missing utils module error:

rubra example_pipeline.py --config example_config.py --style run

[achalk@merri examples]$ ../rubra/rubra.py example_pipeline.py --config example_config.py --style run
Traceback (most recent call last):
  File "../rubra/rubra.py", line 86, in <module>
    main()
  File "../rubra/rubra.py", line 35, in main
    __import__(drop_py_suffix(args.pipeline))
  File "example_pipeline.py", line 15, in <module>
    from rubra.utils import (runStageCheck)
ImportError: No module named utils

Copying example_pipeline.py, example_config.py and examples/test to the rubra directory fixes this.

But then the pipeline itself fails with an error:

[achalk@merri rubra]$ rubra/rubra.py example_pipeline.py --config example_config.py --style run
Traceback (most recent call last):
  File "rubra/rubra.py", line 86, in <module>
    main()
  File "rubra/rubra.py", line 66, in main
    gnu_make_maximal_rebuild_mode=rebuildMode)
  File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 2680, in pipeline_run
ruffus.ruffus_exceptions.RethrownJobError:

Exceptions running jobs for

'def example_pipeline.countLines(...):'

Original exceptions:

Exception #1
exceptions.AttributeError('NoneType' object has no attribute 'stages'):
for example_pipeline.countLines.Job = [test/data2.txt -> [test/data2.count, test/data2.count.Success]]

Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 517, in run_pooled_job_without_exceptions
    return_value =  job_wrapper(param, user_defined_work_func, register_cleanup, touch_files_only)
  File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 447, in job_wrapper_io_files
    ret_val = user_defined_work_func(*param)
  File "example_pipeline.py", line 25, in countLines
    runStageCheck('countLines', flagFile, file, output)
  File "rubra/utils.py", line 132, in runStageCheck
    status = runStage(stage, *args)
  File "rubra/utils.py", line 143, in runStage
    command = getCommand(stage, pipeline_options)
  File "rubra/utils.py", line 175, in getCommand
    commandStr = getStageOptions(options, name, 'command')
  File "rubra/utils.py", line 104, in getStageOptions
    return options.stages[stage][optionName]
AttributeError: 'NoneType' object has no attribute 'stages'


Exception #2
exceptions.AttributeError('NoneType' object has no attribute 'stages'):
for example_pipeline.countLines.Job = [test/data1.txt -> [test/data1.count, test/data1.count.Success]]

Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 517, in run_pooled_job_without_exceptions
    return_value =  job_wrapper(param, user_defined_work_func, register_cleanup, touch_files_only)
  File "build/bdist.linux-x86_64/egg/ruffus/task.py", line 447, in job_wrapper_io_files
    ret_val = user_defined_work_func(*param)
  File "example_pipeline.py", line 25, in countLines
    runStageCheck('countLines', flagFile, file, output)
  File "rubra/utils.py", line 132, in runStageCheck
    status = runStage(stage, *args)
  File "rubra/utils.py", line 143, in runStage
    command = getCommand(stage, pipeline_options)
  File "rubra/utils.py", line 175, in getCommand
    commandStr = getStageOptions(options, name, 'command')
  File "rubra/utils.py", line 104, in getStageOptions
    return options.stages[stage][optionName]
AttributeError: 'NoneType' object has no attribute 'stages'

Make rubra more robust in the face of job errors

When a stage crashes and ruffus throws an exception, we end up with the situation where rubra dies but jobs might be launched on the cluster.

These jobs probably keep running, but they will not produce success files.

It would be nice if rubra was more robust in the face of such errors.
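
One possible direction, sketched under assumptions: keep a record of the job IDs that have been launched and cancel them if an exception escapes. Nothing here is Rubra's API. `cancel_jobs_on_error` and the `cancel_job` callback are hypothetical, and a real callback would have to invoke the scheduler's own cancel command.

```python
from contextlib import contextmanager


@contextmanager
def cancel_jobs_on_error(cancel_job):
    """Yield a list for recording launched job IDs.

    If the body raises, call cancel_job(job_id) for each recorded ID,
    then re-raise the original exception.
    """
    launched = []
    try:
        yield launched
    except Exception:
        for job_id in launched:
            cancel_job(job_id)
        raise


def demo():
    """Simulate a crash after two jobs were launched; return the cancelled IDs."""
    cancelled = []
    try:
        with cancel_jobs_on_error(cancelled.append) as launched:
            launched.append("job-1")
            launched.append("job-2")
            raise RuntimeError("stage crashed")
    except RuntimeError:
        pass
    return cancelled
```

In this simulation `demo()` returns both job IDs, showing that every recorded job is passed to the cancel callback before the exception propagates.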

handling of full paths in command line

Referencing files by their full path results in a "file not found" error even though the files exist:

rubra/rubra.py /vlsci/VR0244/shared/git/rubra/examples/example_pipeline.py --config /vlsci/VR0244/shared/git/rubra/examples/example_config.py --style run
Could not find configuration file: /vlsci/VR0244/shared/git/rubra/examples/example_config.py

ls -al /vlsci/VR0244/shared/git/rubra/examples/example_pipeline.py /vlsci/VR0244/shared/git/rubra/examples/example_config.py
-rw-r----- 1 achalk VR0244 452 Feb 19 10:34 /vlsci/VR0244/shared/git/rubra/examples/example_config.py
-rw-r----- 1 achalk VR0244 847 Feb 19 10:34 /vlsci/VR0244/shared/git/rubra/examples/example_pipeline.py
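
For illustration only, one way to load a Python configuration file from an arbitrary path (absolute or relative) with the standard library is sketched below. This is not how Rubra currently loads its files. `load_config` is a hypothetical helper, it needs Python 3.5 or later, and the temporary file stands in for example_config.py.

```python
import importlib.util
import os
import tempfile


def load_config(path):
    """Import the Python file at `path` and return it as a module object."""
    path = os.path.abspath(path)
    if not os.path.isfile(path):
        raise IOError("Could not find configuration file: %s" % path)
    name = os.path.splitext(os.path.basename(path))[0]
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def demo():
    """Write a small stand-in config file and load it by absolute path."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "example_config.py")
    with open(path, "w") as handle:
        handle.write('pipeline = {"logDir": "log", "procs": 2}\n')
    return load_config(path).pipeline["procs"]
```

Loading by file location sidesteps the module search path, so the directory containing the file does not matter.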
