License: Apache License 2.0

Fonda

Fonda is a framework that offers a scalable and automatic analysis of multiple NGS sequencing data types.

Fonda Prebuilt binaries
Required environment setup
Build Fonda
Fonda installation
Available workflows in Fonda
Before running Fonda…
Run Fonda
Contributors
Publications

Fonda Prebuilt binaries

All binaries built by the CI process (described in CONTRIBUTING.md) are available via the Download page and the GitHub Release page.

Required environment setup

Unix
Java 8

Build Fonda

To launch all unit and integration tests run the command:

./gradlew test

To launch all unit and integration tests, run static source code analysis (via PMD), check adherence to the coding standard (via Checkstyle), and measure code coverage (via JaCoCo), run the command:

./gradlew check

To build Fonda run the command:

./gradlew clean build zip

clean - deletes the Fonda build directory for a fresh compile
build - creates Fonda .jar file and src folder in build/libs
zip - packs Fonda .jar and src folder into a zip file located in build/distributions

Note: before building a specific Fonda version, check that the version set in the build.gradle file is the correct one.

Fonda installation

The Fonda package contains two components:

Fonda .jar file
src folder

If the src_scripts option in the global config is not set, make sure the src folder and the .jar file are placed in the same parent directory. This is necessary because some pipelines call external scripts from the src folder (python and R subfolders).
Before executing a specific Fonda pipeline, make sure the software it requires is properly installed. The required software and databases are listed in the global_config files.

Available workflows in Fonda

Workflow	Description
DnaCaptureVar_Fastq	DNA Captured sequencing data for genomic variant detection using fastq data
DnaCaptureVar_Bam	DNA Captured sequencing data for genomic variant detection using bam data
DnaAmpliconVar_Fastq	DNA Amplicon sequencing data for genomic variant detection using fastq data
DnaAmpliconVar_Bam	DNA Amplicon sequencing data for genomic variant detection using bam data
DnaWgsVar_Fastq	DNA whole genome sequencing data for genomic variant detection using fastq data
DnaWgsVar_Bam	DNA whole genome sequencing data for genomic variant detection using bam data
RnaCaptureVar_Fastq	RNA Captured sequencing data for genomic variant detection using fastq data
HlaTyping_Fastq	DNA sequencing data for genomic HLA type prediction using fastq data
Bam2Fastq	Convert bam file to fastq files
RnaExpression_Fastq	RNA sequencing data for gene expression analysis using fastq data
RnaExpression_Bam	RNA sequencing data for gene expression analysis using bam data
scRnaExpression_Fastq	single cell RNA sequencing data for gene expression analysis using fastq data
scRnaExpression_CellRanger_Fastq	10X single cell RNA/TCR/BCR sequencing data for gene expression and immune profiling analysis using fastq data
scRnaExpression_Bam	single cell RNA sequencing data for gene expression analysis using bam data
RnaFusion_Fastq	RNA sequencing data for gene fusion detection using fastq data
TcrRepertoire_Fastq	DNA or RNA sequencing data for TCR or BCR repertoire detection using fastq data

Before running Fonda…

Show help message

java -jar fonda-<VERSION>.jar -help

Possible options:

Option	Description
Required
`-global_config` <arg>	Configuration file for the particular workflow
`-study_config` <arg>	Configuration file for the specific study
Non-required
`-detail`	Show the details of the Fonda framework
`-local`	Default: no. Run the job on the local machine
`-test`	Default: no. Test the commands without actually running the job
`-sync`	Default: no. Run Fonda in synchronous mode, waiting for all tasks to complete
`-master`	Default: no. Run the main master script to manage all scripts created by Fonda
`-help`	Show help utility message

Elaboration of required config arguments

-global_config file - sets the configuration file for a particular pipeline version (such as RnaExpression_Fastq 1095.1). The config file has 4 sections:

[all_tools] - contains paths to used tools
[Databases] - contains input data/paths to input datasets
[Pipeline_Info] - contains workflow and toolset settings
[Queue_Parameters] - contains SGE (Sun Grid Engine) settings
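
As a rough illustration of this four-section layout, the sketch below parses a made-up global_config with Python's standard configparser module and reports any missing section. The section names come from the list above; every key and value (star, genome, workflow, queue) is a hypothetical placeholder, and the sketch assumes the file follows plain INI syntax. Refer to the example global_config files for the real options.

```python
import configparser

# Hypothetical global_config content: only the section names are taken from
# the README; all keys and values are illustrative placeholders.
SAMPLE_GLOBAL_CONFIG = """
[all_tools]
star = /path/to/star

[Databases]
genome = /path/to/genome.fa

[Pipeline_Info]
workflow = RnaExpression_Fastq

[Queue_Parameters]
queue = all.q
"""

EXPECTED_SECTIONS = ["all_tools", "Databases", "Pipeline_Info", "Queue_Parameters"]


def missing_sections(config_text):
    """Return the expected section names that are absent from the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Section names are case-sensitive in configparser.
    return [name for name in EXPECTED_SECTIONS if not parser.has_section(name)]


print(missing_sections(SAMPLE_GLOBAL_CONFIG))
```

A check like this can catch a misspelled section header before a pipeline is launched.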

If a parameter needs to be changed, a new version of the config should be generated and recorded. However, different studies can share an identical pipeline.

The available parameter options for the global_config files are listed here.
Examples of the global_config files are available here.

Keep in mind that each global_config file includes only the tools and databases required for executing that specific pipeline version.
For example, global_config_RnaExpression_Fastq_v1.1.txt may list the databases, tools and parameters for one version of the RnaExpression_Fastq pipeline, while global_config_RnaExpression_Fastq_v1.2.txt may later be prepared for another version. The databases, tools and parameters required in the second config might differ considerably from those in the first.
Therefore, all potential databases, tools and parameter options for each available workflow should be listed so that users can take full advantage of Fonda in different projects.

The line_ending option in the [Pipeline_Info] section controls the line endings. It can be set to LF (Unix-style end-of-line marker) or CRLF (Windows-style end-of-line marker). If the option is not specified, LF is used by default.
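
The mapping from the option value to the actual separator can be sketched as follows. This is a minimal illustration, not Fonda's implementation: the function names are hypothetical, and rejecting unknown values with an error is an assumption, since the text does not say how invalid values are handled.

```python
# Map the documented line_ending values to the characters they stand for.
LINE_SEPARATORS = {"LF": "\n", "CRLF": "\r\n"}


def resolve_line_separator(line_ending=None):
    """Return the separator for a line_ending value, defaulting to LF."""
    if line_ending is None:
        # Option not specified: fall back to the documented default.
        return LINE_SEPARATORS["LF"]
    if line_ending not in LINE_SEPARATORS:
        # Assumption: unknown values are rejected rather than silently ignored.
        raise ValueError(f"Unsupported line_ending value: {line_ending!r}")
    return LINE_SEPARATORS[line_ending]


def join_script_lines(lines, line_ending=None):
    """Join generated script lines with the chosen separator."""
    return resolve_line_separator(line_ending).join(lines)
```

For example, `join_script_lines(["#!/bin/bash", "echo done"], "CRLF")` yields a script body with Windows-style line endings.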

-study_config file - sets the configuration file for a particular study, that is, for the case when a specific study is selected for NGS data analysis. This config file has 1 section: [Series_Info].
Required parameters for each workflow:

Parameter	Description
job_name	Sets the job ID
dir_out	Sets the output directory for the analysis
fastq_list / bam_list	Sets the path to the input manifest file
LibraryType	Sets the sequencing library type - DNAWholeExomeSeq_Paired, DNAWholeExomeSeq_Single, DNATargetSeq_Paired, DNATargetSeq_Single, DNAAmpliconSeq_Paired, RNASeq_Paired, RNASeq_Single, etc.
DataGenerationSource	Sets the data generation source - Internal, IGR, Broad, etc.
Date	Sets the sequencing run date
Project	Sets the project ID
Run	Sets the run ID

The format of the input manifest files is described here.
Examples of the study_config files are available here.
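
To make the required parameters concrete, here is a hedged sketch that checks a study_config for them before a run. The parameter names come from the table above; the function name is hypothetical, the sketch assumes plain INI syntax, and it assumes that a study provides either fastq_list or bam_list rather than both.

```python
import configparser

# Parameters listed as required in the table above. fastq_list / bam_list are
# checked separately, assuming a study supplies one or the other.
REQUIRED_PARAMETERS = [
    "job_name", "dir_out", "LibraryType", "DataGenerationSource",
    "Date", "Project", "Run",
]
INPUT_MANIFEST_PARAMETERS = ["fastq_list", "bam_list"]


def missing_study_parameters(config_text):
    """Return the required [Series_Info] parameters missing from the config text."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep parameter names case-sensitive
    parser.read_string(config_text)
    if not parser.has_section("Series_Info"):
        return ["[Series_Info]"]
    section = parser["Series_Info"]
    missing = [name for name in REQUIRED_PARAMETERS if name not in section]
    if not any(name in section for name in INPUT_MANIFEST_PARAMETERS):
        missing.append("fastq_list or bam_list")
    return missing
```

An empty result means every required parameter is present; otherwise the returned list names what still has to be added.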

Elaboration of additional arguments

-help - to show the help message
-detail - to show the workflow details available in the current Fonda framework
-local - to run the job on the local machine without being submitted to the cluster
-test - to have a pilot run in the command line interface without actually submitting jobs to the cluster

Run Fonda: actual example for RnaExpression_Fastq workflow

Test mode

java -jar /path_to_data/fonda/<VERSION>/fonda-<VERSION>.jar -global_config /path_to_data/fonda/global_config/global_config_RnaExpression_Fastq_v1.1.txt -study_config /path_to_data/config_RnaExpression_Fastq_test.txt -test

In test mode, no jobs are submitted to the cluster. Instead, you can check whether the contents of each generated shell script are properly organized, which is useful for debugging.

Submit jobs to cluster

java -jar /path_to_data/fonda/<VERSION>/fonda-<VERSION>.jar -global_config /path_to_data/fonda/global_config/global_config_RnaExpression_Fastq_v1.1.txt -study_config /path_to_data/config_RnaExpression_Fastq_test.txt

Local machine mode

java -jar /path_to_data/fonda/<VERSION>/fonda-<VERSION>.jar -global_config /path_to_data/fonda/global_config/global_config_RnaExpression_Fastq_v1.1.txt -study_config /path_to_data/config_RnaExpression_Fastq_test.txt -local

In local machine mode, the individual jobs run on the local machine instead of being submitted to the cluster.
The scripts are the same as in cluster mode; the only difference is that the jobs are not submitted to the cluster. This is useful for debugging.
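
The three invocations above differ only in a trailing flag, which the following sketch makes explicit. The helper name and the mode labels ("cluster", "test", "local") are hypothetical conveniences; only the java -jar arguments and the -test / -local flags come from the commands above.

```python
import shlex

# Extra flag appended for each run mode; cluster submission needs none.
MODE_FLAGS = {"cluster": [], "test": ["-test"], "local": ["-local"]}


def build_fonda_command(jar_path, global_config, study_config, mode="cluster"):
    """Return the argument list for launching Fonda in the given run mode."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"Unknown mode: {mode!r}")
    return [
        "java", "-jar", jar_path,
        "-global_config", global_config,
        "-study_config", study_config,
    ] + MODE_FLAGS[mode]


command = build_fonda_command(
    "fonda.jar",  # placeholder path to the Fonda .jar file
    "global_config_RnaExpression_Fastq_v1.1.txt",
    "config_RnaExpression_Fastq_test.txt",
    mode="test",
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to launch Fonda from a wrapper script.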

Contributors

Shu Yan ¹
Tenghui Chen ¹
Joon Sang Lee ¹
Chandra Sekhar Pedamallu ¹
Mark Magid ¹
Quan Wan ¹
Ei-Wen Yang ¹
Donald Jackson ¹
Jack Pollard ¹
Aleksandr Sidoruk ²
Mariia Zueva ²
Mikhail Alperovich ²
Yulia Kamyshova ²

¹ Sanofi, 270 Albany Street, Cambridge, MA, USA

² EPAM Systems, Inc.

Publications

Links to publications that contain Fonda references

A Comprehensive Sample Tracking and Data Processing Workflow for Next Generation Sequencing

fonda's Issues

Implement FastqListAnalysis post process tool for Bam2Fastq workflow

Background

Create the FastqListAnalysis tool class, used by the Bam2Fastq workflow, that contains the working logic of this tool in Fonda. The tool class should implement the PostProcessTool interface.

Approach

process to thymeleaf template
implement generate() method

Create Mixcr tool

Background

Create Mixcr tool class that includes working logic of this tool in Fonda. Tool class should implement Tool interface.

Approach

find all tool usage in Fonda workflows
combine tool behavior in created class
process to thymeleaf template
implement generate() method

Implement Bam2Fastq workflow

Background

Implement Bam2FastqWorkflow class that implements BamWorkflow interface

Approach

Implement run() and postProcess() methods
Add the required workflow tools to appropriate stages
Create Bam2FastqIntegrationTest that covers all working logic in Bam2FastqWorkflow
Provide a javadoc

xenome single ended addition

The current usage of xenome does not handle single-ended reads. Proposed changes are in #134.

Keep in mind that the package requires a slightly modified naming convention for single-ended reads.

Instead of

ambiguous_1.fastq  
ambiguous_2.fastq  
both_1.fastq 
both_2.fastq 
graft_1.fastq 
graft_2.fastq 
host_1.fastq 
host_2.fastq 
neither_1.fastq 
neither_2.fastq

the output files will now be named:

ambiguous.fastq 
both.fastq 
human.fastq 
mouse.fastq 
neither.fastq

Extend and improve integration tests

Background

The existing integration tests don't completely cover all the workflow logic and have code style shortcomings.

Approach

The existing integration tests can be extended and improved with Thymeleaf Template Engine.

rewrite the existing test cases with thymeleaf template
add new test cases to extend the covered logic
add a folder tree validation
use JUnit 5 instead of JUnit 4

Extend a SCImmuneProfileCellRangerFastqTest integration test

This issue is related to issue #33.

Extend and improve SCImmuneProfileCellRangerFastqTest with Thymeleaf Template Engine.

Test and document the DnaWgsVar_Fastq workflow

Implement DnaCaptureVar_Bam workflow

Background

Implement DnaCaptureVarBamWorkflow class that implements BamWorkflow interface

Approach

Implement run() and postProcess() methods
Add the required workflow tools to appropriate stages
Create DnaCaptureVarBamIntegrationTest that covers all working logic in DnaCaptureVarBamWorkflow
Provide a javadoc

"Cloud Pipeline" execution backend

Background

At the moment, FONDA is used to process data within a Cloud environment managed by the Cloud Pipeline.
In this case, FONDA is packed into a Docker image and is executed using the SGE executor.

However, Cloud Pipeline users have a number of cases where it is better to run FONDA itself locally but offload the data processing to the Cloud.

Extend a RnaExpressionFastq integration test

This issue is related to issue #33.

Extend and improve RnaExpressionFastqIntegrationTest with Thymeleaf Template Engine.

Extend a DnaAnalysis integration test

This issue is related to issue #33.

Extend and improve DnaAnalysisIntegrationTest with Thymeleaf Template Engine.

Extend a RnaExpressionBam integration test

This issue is related to issue #33.

Extend and improve RnaExpressionBamIntegrationTest with Thymeleaf Template Engine.

Extend a DnaCaptureVarFastq integration test

This issue is related to issue #33.

Extend and improve DnaCaptureVarFastqIntegrationTest with Thymeleaf Template Engine.

Implement DnaWgsVar_Bam workflow

Background

Implement DnaWgsVarBamWorkflow class that implements BamWorkflow interface

Approach

Implement run() and postProcess() methods
Add the required workflow tools to appropriate stages
Create DnaWgsVarBamIntegrationTest that covers all working logic in DnaWgsVarBamWorkflow
Provide a javadoc

Test and document the RnaExpression_Bam workflow

Extend a Bam2Fastq integration test

This issue is related to issue #33.

Extend and improve Bam2FastqIntegrationTest with Thymeleaf Template Engine.

Extend a DnaCaptureVarBam integration test

This issue is related to issue #33.

Extend and improve DnaCaptureVarBamIntegrationTest with Thymeleaf Template Engine.

Test and document the TcrRepertoire_Fastq workflow

Implement DnaCaptureVar_Fastq workflow

Implement DnaCaptureVarFastqWorkflow class that implements FastqWorkflow

implement run() and postProcess() methods
add tools to stages if it is needed

Don't forget to add integration tests

Implement DnaWgsVar_Fastq workflow

Background

Implement DnaWgsVarFastqWorkflow class that implements FastqWorkflow interface

Approach

Implement run() and postProcess() methods
Add the required workflow tools to appropriate stages
Create DnaWgsVarFastqIntegrationTest that covers all working logic in DnaWgsVarFastqWorkflow
Provide a javadoc

Extend a RnaFusionFastq integration test

This issue is related to issue #33.

Extend and improve RnaFusionFastqIntegrationTest with Thymeleaf Template Engine.

Extend a DnaWgsVarBam integration test

This issue is related to issue #33.

Extend and improve DnaWgsVarBamIntegrationTest with Thymeleaf Template Engine.

Test and document the HlaTyping_Fastq workflow

Implement SamToFastq tool

Background

Create SamToFastq tool class that includes working logic of this tool in Fonda. Tool class should implement Tool interface.

Approach

find all tool usage in Fonda workflows
combine tool behavior in created class
process to thymeleaf template
implement generate() method

Test and document the DnaAmpliconVar_Fastq workflow

Setup Travis CI

Build/Test any PR and develop
Publish artifacts from the successful develop builds to the S3

Implement RnaCaptureVar_Fastq workflow

Background

Implement RnaCaptureVarFastqWorkflow class that implements FastqWorkflow interface

Approach

Implement run() and postProcess() methods
Add the required workflow tools to appropriate stages
Create RnaCaptureVarFastqIntegrationTest that covers all working logic in RnaCaptureVarFastqWorkflow
Provide a javadoc

Test and document the DnaCaptureVar_Fastq workflow

Test and document the DnaWgsVar_Bam workflow

Add overall information on the issue/pr management approach

@kamyshova please add the following information on the overall contribution to the repo and how the issues are tracked:

Entrypoint:
- Issue: if there is only an idea for some feature and fix and it shall be discussed first
- PR: if there is already an implementation for some feature or a bug-fix
What information shall be provided to the Issue/PR description (we need to add Issue/PR templates further), which branch shall be used as a merge destination, how to link PRs and issues
If the issue is created first - how it gets implemented (discussion -> decision)
If a PR is created first (or a PR is created as a result of work on an issue) - how to request approval, how it gets discussed, approved, merged
Any other valuable information of your choice

Implement TcrRepertoire_Fastq workflow

Background

Implement TcrRepertoireFastqWorkflow class that implements FastqWorkflow interface

Approach

Implement run() and postProcess() methods
Add the required workflow tools to appropriate stages
Create TcrRepertoireFastqIntegrationTest that covers all working logic in TcrRepertoireFastqWorkflow
Provide a javadoc

Extend a DnaAmpliconVarFastq integration test

This issue is related to issue #33.

Extend and improve DnaAmpliconVarFastqIntegrationTest with Thymeleaf Template Engine.

Update GatkHaplotypeCaller tool

Background

Update the existing GatkHaplotypeCaller tool class in accordance with GatkHaplotypeCaller for RNA.

Approach

compare gatkHaplotypeCaller_rna_unpaired and gatkHaplotypeCaller_unpaired methods
combine tools behavior in the existing class by introducing a new template gatk_haplotype_rna_tool_template.txt
update the existing GatkHaplotypeCallerTest

Test and document the RnaFusion_Fastq workflow

Implement DnaAmpliconVar_Bam workflow

Background

Implement DnaAmpliconVarBamWorkflow class that implements BamWorkflow interface

Approach

Implement run() and postProcess() methods
Add the required workflow tools to appropriate stages
Create DnaAmpliconVarBamIntegrationTest that covers all working logic in DnaAmpliconVarBamWorkflow
Provide a javadoc

Test and document the Bam2Fastq workflow

Test and document the scRnaExpression_Bam workflow

Test and document the scRnaExpression_Fastq workflow

Extend a DnaAmpliconVarBam integration test

This issue is related to issue #33.

Extend and improve DnaAmpliconVarBamIntegrationTest with Thymeleaf Template Engine.

Test and document the RnaCaptureVar_Fastq workflow

Test and document the DnaCaptureVar_Bam workflow

Implement HlaTyping_Fastq workflow

Background

Implement HlaTypingFastqWorkflow class that implements FastqWorkflow interface

Approach

Implement run() and postProcess() methods
Add the required workflow tools to appropriate stages
Create HlaTypingFastqIntegrationTest that covers all working logic in HlaTypingFastqWorkflow
Provide a javadoc

Implement scRnaExpression_Fastq workflow

Background

Implement SCRnaExpressionFastqWorkflow class that implements FastqWorkflow interface

Approach

Implement run() and postProcess() methods
Add the required workflow tools to appropriate stages
Create SCRnaExpressionFastqIntegrationTest that covers all working logic in SCRnaExpressionFastqWorkflow
Provide a javadoc

Test and document the DnaAmpliconVar_Bam workflow

Create OptiType tool

Background

Create OptiType tool class that includes working logic of this tool in Fonda. Tool class should implement Tool interface.

Approach

find all tool usage in Fonda workflows
combine tool behavior in created class
process to thymeleaf template
implement generate() method

Extend a SCRnaExpressionCellRangerFastq integration test

This issue is related to issue #33.

Extend and improve SCRnaExpressionCellRangerFastqIntegrationTest with Thymeleaf Template Engine.

Implement scRnaExpression_Bam workflow

Background

Implement SCRnaExpressionBamWorkflow class that implements BamWorkflow interface

Approach

Implement run() and postProcess() methods
Add the required workflow tools to appropriate stages
Create SCRnaExpressionBamIntegrationTest that covers all working logic in SCRnaExpressionBamWorkflow
Provide a javadoc

Create GatkSplitReads tool

Background

Create GatkSplitReads tool class that includes working logic of this tool in Fonda. Tool class should implement Tool interface.

Approach

find all tool usage in Fonda workflows
combine tool behavior in created class
process to thymeleaf template
implement generate() method

Create GatkHaplotypeCallerRnaFilter tool

Background

Create GatkHaplotypeCallerRnaFilter tool class that includes working logic of this tool in Fonda. Tool class should implement Tool interface.

Approach

find all tool usage in Fonda workflows
combine tool behavior in created class
process to thymeleaf template
implement generate() method

Implement SortBamByReadName tool

Background

Create SortBamByReadName tool class that includes working logic of this tool in Fonda. Tool class should implement Tool interface.

Approach

find all tool usage in Fonda workflows
combine tool behavior in created class
process to thymeleaf template
implement generate() method
