fonda's Introduction

Fonda

Fonda is a framework for scalable, automated analysis of multiple types of next-generation sequencing (NGS) data.

Fonda Prebuilt binaries

All binaries built by the CI process (described in CONTRIBUTING.md) are available via the Download page and the GitHub Release page.

Required environment setup

  • Unix
  • Java 8

Build Fonda

To run all unit and integration tests, run the command:

./gradlew test

To run all unit and integration tests, analyze the source code (via PMD), check adherence to the coding standard (via Checkstyle), and measure code coverage (via JaCoCo), run the command:

./gradlew check

To build Fonda run the command:

./gradlew clean build zip
  • clean - deletes the Fonda build directory for a fresh compile
  • build - creates the Fonda .jar file and the src folder in build/libs
  • zip - packs the Fonda .jar file and the src folder into a zip file located in build/distributions

Note: before building a specific Fonda version, check that the Fonda version in the build.gradle file is the correct one.

Fonda installation

The Fonda package contains two components:

  1. Fonda .jar file
  2. src folder

If the src_scripts option in the global config is not set, make sure the src folder and the .jar file are placed in the same parent directory. Fonda needs this layout because some pipelines call external scripts from the src folder (the python and R subfolders).
Before executing a specific Fonda pipeline, make sure the software prerequisites for that pipeline are installed. The required software and databases are listed in the global_config files.
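
The layout requirement above can be checked before a run. The following is a minimal sketch, not part of Fonda: the function name check_fonda_layout is hypothetical, and it assumes the .jar follows the fonda-<VERSION>.jar naming used in this README.

```shell
# Sketch: verify that a Fonda .jar and the src folder share one parent directory.
# check_fonda_layout is an illustrative helper, not a Fonda command.
check_fonda_layout() {
    dir=$1
    jar_found=no
    # Look for at least one fonda-*.jar file directly inside the directory.
    for jar in "$dir"/fonda-*.jar; do
        if [ -f "$jar" ]; then
            jar_found=yes
        fi
    done
    # Both the .jar and the src folder must be present.
    if [ "$jar_found" = yes ] && [ -d "$dir/src" ]; then
        echo "ok"
    else
        echo "missing"
        return 1
    fi
}
```

For example, check_fonda_layout /path_to_data/fonda/<VERSION> prints "ok" when both components are in place. The check is only relevant when src_scripts is not set in the global config.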

Available workflows in Fonda

Workflow | Description
DnaCaptureVar_Fastq | DNA Captured sequencing data for genomic variant detection using fastq data
DnaCaptureVar_Bam | DNA Captured sequencing data for genomic variant detection using bam data
DnaAmpliconVar_Fastq | DNA Amplicon sequencing data for genomic variant detection using fastq data
DnaAmpliconVar_Bam | DNA Amplicon sequencing data for genomic variant detection using bam data
DnaWgsVar_Fastq | DNA whole genome sequencing data for genomic variant detection using fastq data
DnaWgsVar_Bam | DNA whole genome sequencing data for genomic variant detection using bam data
RnaCaptureVar_Fastq | RNA Captured sequencing data for genomic variant detection using fastq data
HlaTyping_Fastq | DNA sequencing data for genomic HLA type prediction using fastq data
Bam2Fastq | Convert bam file to fastq files
RnaExpression_Fastq | RNA sequencing data for gene expression analysis using fastq data
RnaExpression_Bam | RNA sequencing data for gene expression analysis using bam data
scRnaExpression_Fastq | single cell RNA sequencing data for gene expression analysis using fastq data
scRnaExpression_CellRanger_Fastq | 10X single cell RNA/TCR/BCR sequencing data for gene expression and immune profiling analysis using fastq data
scRnaExpression_Bam | single cell RNA sequencing data for gene expression analysis using bam data
RnaFusion_Fastq | RNA sequencing data for gene fusion detection using fastq data
TcrRepertoire_Fastq | DNA or RNA sequencing data for TCR or BCR repertoire detection using fastq data

Before running Fonda…

Show help message

java -jar fonda-<VERSION>.jar -help

Possible options:

Option | Description
Required:
-global_config <arg> | Configuration file for the particular workflow
-study_config <arg> | Configuration file for the specific study
Optional:
-detail | Show the details of the Fonda framework
-local | Default: no. Run the job on the local machine
-test | Default: no. Test the commands without actually running the job
-sync | Default: no. Run Fonda in synchronous mode, waiting for all tasks to complete
-master | Default: no. Run the main master script to manage all Fonda-created scripts
-help | Show the help message

Elaboration of required config arguments

-global_config file - sets a configuration file for a particular pipeline version (such as RnaExpression_Fastq 1095.1). In the config file, there are 4 sections:

  • [all_tools] - contains paths to used tools
  • [Databases] - contains input data/paths to input datasets
  • [Pipeline_Info] - contains workflow and toolset settings
  • [Queue_Parameters] - contains SGE (Sun Grid Engine) settings
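
As a quick sanity check on the four sections listed above, a global_config file can be scanned for its section headers. This is a minimal sketch under one assumption: each header, such as [all_tools], sits alone on its own line with no surrounding whitespace. The function name missing_sections is hypothetical and not part of Fonda.

```shell
# Sketch: print the names of expected sections that are absent from a global_config file.
# Assumption: each section header occupies a whole line, e.g. "[all_tools]".
missing_sections() {
    config=$1
    for section in all_tools Databases Pipeline_Info Queue_Parameters; do
        # -F matches the header literally; -x requires it to be the whole line.
        if ! grep -qxF "[$section]" "$config"; then
            echo "$section"
        fi
    done
}
```

The function prints nothing when all four sections are present, so an empty result means the file has the expected top-level structure.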

If a user wants to change a parameter, a new pipeline version should be created and recorded. However, different studies can share an identical pipeline.

The available parameter options for the global_config files are listed here.
Examples of the global_config files can be found here.

Please keep in mind that each global_config file includes only the tools and databases required for executing that specific pipeline version.
For example, global_config_RnaExpression_Fastq_v1.1.txt may list the databases, tools and parameters for RnaExpression_Fastq pipeline version 1.1. Later on, global_config_RnaExpression_Fastq_v1.2.txt may be prepared for RnaExpression_Fastq pipeline version 1.2. In the second config, the required databases, tools and parameters might be quite different from the first one.
Therefore, all potential databases, tools and parameter options for each available workflow should be documented so that users can take full advantage of Fonda in different projects.

To control line-ending behavior, use the line_ending option in the [Pipeline_Info] section. The option can be set to LF (Unix-style end-of-line marker) or CRLF (Windows-style end-of-line marker). If the option is not specified, LF is used as the default line separator.
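
To confirm which line ending a generated script actually uses, the file can be inspected for carriage returns. This is a minimal sketch with a hypothetical helper name, detect_line_ending; it reports CRLF if the file contains any carriage return at all, which is a simplification.

```shell
# Sketch: report CRLF if the file contains any carriage return, otherwise LF.
# detect_line_ending is an illustrative helper, not part of Fonda.
detect_line_ending() {
    file=$1
    total=$(wc -c < "$file")
    without_cr=$(tr -d '\r' < "$file" | wc -c)
    # Arithmetic expansion tolerates the padding some wc implementations emit.
    if [ $((total)) -ne $((without_cr)) ]; then
        echo "CRLF"
    else
        echo "LF"
    fi
}
```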

-study_config file - sets a configuration file for a particular study, used when a specific study is selected for NGS data analysis. This config file has one section: [Series_Info].
Required parameters for each workflow:

Parameter | Description
job_name | Sets the job ID
dir_out | Sets the output directory for the analysis
fastq_list / bam_list | Sets the path to the input manifest file
LibraryType | Sets the sequencing library type - DNAWholeExomeSeq_Paired, DNAWholeExomeSeq_Single, DNATargetSeq_Paired, DNATargetSeq_Single, DNAAmpliconSeq_Paired, RNASeq_Paired, RNASeq_Single, etc.
DataGenerationSource | Sets the data generation source - Internal, IGR, Broad, etc.
Date | Sets the sequencing run date
Project | Sets the project ID
Run | Sets the run ID
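
The required parameters above could be laid out in a study_config skeleton like the one written by this sketch. The "name = value" syntax is an assumption (check the linked study_config examples for the actual format), the angle-bracket values are placeholders, and the function name write_study_config_skeleton is hypothetical.

```shell
# Sketch: write a study_config skeleton with the required [Series_Info] parameters.
# Assumption: parameters are written as "name = value"; angle-bracket values are placeholders.
write_study_config_skeleton() {
    cat > "$1" <<'EOF'
[Series_Info]
job_name = <job ID>
dir_out = <output directory>
fastq_list = <path to the input manifest file>
LibraryType = RNASeq_Paired
DataGenerationSource = Internal
Date = <sequencing run date>
Project = <project ID>
Run = <run ID>
EOF
}
```

For a bam-based workflow, the fastq_list line would be replaced by bam_list.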

For the format of the input manifest files, see here.
Examples of the study_config files can be found here.

Elaboration of additional arguments

-help - to show the help message
-detail - to show the workflow details available in the current Fonda framework
-local - to run the job on the local machine without being submitted to the cluster
-test - to have a pilot run in the command line interface without actually submitting jobs to the cluster

Run Fonda: actual example for RnaExpression_Fastq workflow

Test mode

java -jar /path_to_data/fonda/<VERSION>/fonda-<VERSION>.jar -global_config /path_to_data/fonda/global_config/global_config_RnaExpression_Fastq_v1.1.txt -study_config /path_to_data/config_RnaExpression_Fastq_test.txt -test

In test mode, no job is submitted to the cluster for an actual run. Instead, you can check whether the contents of each shell script are properly organized. This is useful for debugging.

Submit jobs to cluster

java -jar /path_to_data/fonda/<VERSION>/fonda-<VERSION>.jar -global_config /path_to_data/fonda/global_config/global_config_RnaExpression_Fastq_v1.1.txt -study_config /path_to_data/config_RnaExpression_Fastq_test.txt

Local machine mode

java -jar /path_to_data/fonda/<VERSION>/fonda-<VERSION>.jar -global_config /path_to_data/fonda/global_config/global_config_RnaExpression_Fastq_v1.1.txt -study_config /path_to_data/config_RnaExpression_Fastq_test.txt -local

In local machine mode, the individual jobs run on the local machine without being submitted to the cluster.
The generated scripts are the same as in cluster mode; the only difference is that the jobs are not submitted to the cluster. This is useful for debugging.
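
The three invocations above differ only in the trailing flag, which the following sketch makes explicit. It only prints the command line and does not run Fonda; the function name fonda_command and the mode names test, local and cluster are hypothetical labels, while the flags come from the options table in this README.

```shell
# Sketch: print the Fonda command line for a given run mode without executing it.
# fonda_command and its mode names are illustrative, not part of Fonda.
fonda_command() {
    jar=$1
    global_config=$2
    study_config=$3
    mode=$4
    cmd="java -jar $jar -global_config $global_config -study_config $study_config"
    case $mode in
        test)    echo "$cmd -test" ;;   # dry run: scripts are generated, nothing is submitted
        local)   echo "$cmd -local" ;;  # jobs run on the local machine
        cluster) echo "$cmd" ;;         # default: jobs are submitted to the cluster
        *)       echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}
```

A reasonable habit is to inspect the output of the test mode first, then rerun with local or cluster once the generated scripts look right.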

Contributors

  • Shu Yan 1
  • Tenghui Chen 1
  • Joon Sang Lee 1
  • Chandra Sekhar Pedamallu 1
  • Mark Magid 1
  • Quan Wan 1
  • Ei-Wen Yang 1
  • Donald Jackson 1
  • Jack Pollard 1
  • Aleksandr Sidoruk 2
  • Mariia Zueva 2
  • Mikhail Alperovich 2
  • Yulia Kamyshova 2

1 Sanofi, 270 Albany Street, Cambridge, MA, USA

2 EPAM Systems, Inc.

Publications

Links to publications that contain Fonda references

fonda's Issues

Create Mixcr tool

Background

Create Mixcr tool class that includes working logic of this tool in Fonda. Tool class should implement Tool interface.

Approach

  • find all tool usage in Fonda workflows
  • combine tool behavior in created class
  • process to thymeleaf template
  • implement generate() method

Implement Bam2Fastq workflow

Background

Implement Bam2FastqWorkflow class that implements BamWorkflow interface

Approach

  1. Implement run() and postProcess() methods
  2. Add the required workflow tools to appropriate stages
  3. Create Bam2FastqIntegrationTest that covers all working logic in Bam2FastqWorkflow
  4. Provide a javadoc

xenome single ended addition

The current usage of xenome does not handle single-ended reads. Proposed changes are in #134.

Keep in mind that the package requires a slightly modified naming convention for single-ended reads.

Instead of

ambiguous_1.fastq  
ambiguous_2.fastq  
both_1.fastq 
both_2.fastq 
graft_1.fastq 
graft_2.fastq 
host_1.fastq 
host_2.fastq 
neither_1.fastq 
neither_2.fastq

the output files for single-ended reads become

ambiguous.fastq 
both.fastq 
human.fastq 
mouse.fastq 
neither.fastq
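
The two naming conventions can be summarized in a small helper. This is a sketch only: the function name xenome_outputs and the layout labels paired and single are hypothetical, and the file names are copied from the two lists in this issue.

```shell
# Sketch: list the expected xenome output file names for a read layout.
# xenome_outputs is an illustrative helper; names follow the lists in this issue.
xenome_outputs() {
    case $1 in
        paired)
            for class in ambiguous both graft host neither; do
                echo "${class}_1.fastq"
                echo "${class}_2.fastq"
            done
            ;;
        single)
            for class in ambiguous both human mouse neither; do
                echo "${class}.fastq"
            done
            ;;
        *)
            echo "unknown layout: $1" >&2
            return 1
            ;;
    esac
}
```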

Extend and improve integration tests

Background

The existing integration tests don't completely cover all the workflow logic and have code style shortcomings.

Approach

The existing integration tests can be extended and improved with Thymeleaf Template Engine.

  • rewrite the existing test cases with Thymeleaf templates
  • add new test cases to extend the covered logic
  • add a folder tree validation
  • use JUnit 5 instead of JUnit 4

Implement DnaCaptureVar_Bam workflow

Background

Implement DnaCaptureVarBamWorkflow class that implements BamWorkflow interface

Approach

  • Implement run() and postProcess() methods
  • Add the required workflow tools to appropriate stages
  • Create DnaCaptureVarBamIntegrationTest that covers all working logic in DnaCaptureVarBamWorkflow
  • Provide a javadoc

"Cloud Pipeline" execution backend

Background

At the moment, FONDA is used to process data within a Cloud environment managed by the Cloud Pipeline.
In this case, FONDA is packed into a Docker image and executed using the sge executor.

However, Cloud Pipeline users have a number of cases where it is better to run FONDA itself locally but offload the data processing to the Cloud.

Implement DnaWgsVar_Bam workflow

Background

Implement DnaWgsVarBamWorkflow class that implements BamWorkflow interface

Approach

  • Implement run() and postProcess() methods
  • Add the required workflow tools to appropriate stages
  • Create DnaWgsVarBamIntegrationTest that covers all working logic in DnaWgsVarBamWorkflow
  • Provide a javadoc

Implement DnaCaptureVar_Fastq workflow

Implement DnaCaptureVarFastqWorkflow class that implements FastqWorkflow

  • implement run() and postProcess() methods
  • add tools to stages if needed

Don't forget to add integration tests.

Implement DnaWgsVar_Fastq workflow

Background

Implement DnaWgsVarFastqWorkflow class that implements FastqWorkflow interface

Approach

  • Implement run() and postProcess() methods
  • Add the required workflow tools to appropriate stages
  • Create DnaWgsVarFastqIntegrationTest that covers all working logic in DnaWgsVarFastqWorkflow
  • Provide a javadoc

Implement SamToFastq tool

Background

Create SamToFastq tool class that includes working logic of this tool in Fonda. Tool class should implement Tool interface.

Approach

  • find all tool usage in Fonda workflows
  • combine tool behavior in created class
  • process to thymeleaf template
  • implement generate() method

Setup Travis CI

  • Build/Test any PR and develop
  • Publish artifacts from the successful develop builds to the S3

Implement RnaCaptureVar_Fastq workflow

Background

Implement RnaCaptureVarFastqWorkflow class that implements FastqWorkflow interface

Approach

  • Implement run() and postProcess() methods
  • Add the required workflow tools to appropriate stages
  • Create RnaCaptureVarFastqIntegrationTest that covers all working logic in RnaCaptureVarFastqWorkflow
  • Provide a javadoc

Add overall information on the issue/pr management approach

@kamyshova please add the following information on the overall contribution to the repo and how the issues are tracked:

  • Entrypoint:
    • Issue: if there is only an idea for some feature and fix and it shall be discussed first
    • PR: if there is already an implementation for some feature or a bug-fix
  • What information shall be provided to the Issue/PR description (we need to add Issue/PR templates further), which branch shall be used as a merge destination, how to link PRs and issues
  • If the issue is created first - how it gets implemented (discussion -> decision)
  • If a PR is created first (or a PR is created as a result of work on an issue) - how to request approval, and how it gets discussed, approved, and merged
  • Any other valuable information of your choice

Implement TcrRepertoire_Fastq workflow

Background

Implement TcrRepertoireFastqWorkflow class that implements FastqWorkflow interface

Approach

  • Implement run() and postProcess() methods
  • Add the required workflow tools to appropriate stages
  • Create TcrRepertoireFastqIntegrationTest that covers all working logic in TcrRepertoireFastqWorkflow
  • Provide a javadoc

Update GatkHaplotypeCaller tool

Background

Update the existing GatkHaplotypeCaller tool class in accordance with GatkHaplotypeCaller for RNA.

Approach

  • compare gatkHaplotypeCaller_rna_unpaired and gatkHaplotypeCaller_unpaired methods
  • combine tools behavior in the existing class by introducing a new template gatk_haplotype_rna_tool_template.txt
  • update the existing GatkHaplotypeCallerTest

Implement DnaAmpliconVar_Bam workflow

Background

Implement DnaAmpliconVarBamWorkflow class that implements BamWorkflow interface

Approach

  • Implement run() and postProcess() methods
  • Add the required workflow tools to appropriate stages
  • Create DnaAmpliconVarBamIntegrationTest that covers all working logic in DnaAmpliconVarBamWorkflow
  • Provide a javadoc

Implement HlaTyping_Fastq workflow

Background

Implement HlaTypingFastqWorkflow class that implements FastqWorkflow interface

Approach

  • Implement run() and postProcess() methods
  • Add the required workflow tools to appropriate stages
  • Create HlaTypingFastqIntegrationTest that covers all working logic in HlaTypingFastqWorkflow
  • Provide a javadoc

Implement scRnaExpression_Fastq workflow

Background

Implement SCRnaExpressionFastqWorkflow class that implements FastqWorkflow interface

Approach

  • Implement run() and postProcess() methods
  • Add the required workflow tools to appropriate stages
  • Create SCRnaExpressionFastqIntegrationTest that covers all working logic in SCRnaExpressionFastqWorkflow
  • Provide a javadoc

Create OptiType tool

Background

Create OptiType tool class that includes working logic of this tool in Fonda. Tool class should implement Tool interface.

Approach

  • find all tool usage in Fonda workflows
  • combine tool behavior in created class
  • process to thymeleaf template
  • implement generate() method

Implement scRnaExpression_Bam workflow

Background

Implement SCRnaExpressionBamWorkflow class that implements BamWorkflow interface

Approach

  • Implement run() and postProcess() methods
  • Add the required workflow tools to appropriate stages
  • Create SCRnaExpressionBamIntegrationTest that covers all working logic in SCRnaExpressionBamWorkflow
  • Provide a javadoc

Create GatkSplitReads tool

Background

Create GatkSplitReads tool class that includes working logic of this tool in Fonda. Tool class should implement Tool interface.

Approach

  • find all tool usage in Fonda workflows
  • combine tool behavior in created class
  • process to thymeleaf template
  • implement generate() method

Create GatkHaplotypeCallerRnaFilter tool

Background

Create GatkHaplotypeCallerRnaFilter tool class that includes working logic of this tool in Fonda. Tool class should implement Tool interface.

Approach

  • find all tool usage in Fonda workflows
  • combine tool behavior in created class
  • process to thymeleaf template
  • implement generate() method

Implement SortBamByReadName tool

Background

Create SortBamByReadName tool class that includes working logic of this tool in Fonda. Tool class should implement Tool interface.

Approach

  • find all tool usage in Fonda workflows
  • combine tool behavior in created class
  • process to thymeleaf template
  • implement generate() method
