
nf-core/raredisease

GitHub Actions CI Status | GitHub Actions Linting Status | AWS CI | Cite with Zenodo | nf-test

Nextflow | run with conda | run with docker | run with singularity | Launch on Seqera Platform

Get help on Slack | Follow on Twitter | Follow on Mastodon | Watch on YouTube


Introduction

nf-core/raredisease is a best-practice bioinformatic pipeline for calling and scoring variants from WGS/WES data from rare disease patients. This pipeline is heavily inspired by MIP.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

On release, automated continuous integration tests run the pipeline on a full-sized dataset on the AWS cloud infrastructure. This ensures that the pipeline runs on AWS, has sensible resource allocation defaults set to run on real-world datasets, and permits the persistent storage of results to benchmark between pipeline releases and other analysis sources. The results obtained from the full-sized test can be viewed on the nf-core website.

Pipeline summary

nf-core/raredisease workflow

1. Metrics:

2. Alignment:

3. Variant calling - SNV:

4. Variant calling - SV:

5. Annotation - SNV:

6. Annotation - SV:

7. Mitochondrial analysis:

8. Variant calling - repeat expansions:

9. Variant calling - mobile elements:

10. Rank variants - SV and SNV:

11. Variant evaluation:

Note that it is possible to include/exclude certain tools or steps.

Usage

Note

If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

sample,lane,fastq_1,fastq_2,sex,phenotype,paternal_id,maternal_id,case_id
hugelymodelbat,1,reads_1.fastq.gz,reads_2.fastq.gz,1,2,,,justhusky

Each row represents a fastq file (single-end) or a pair of fastq files (paired-end).

Second, ensure that you have defined the path to reference files and parameters required for the type of analysis that you want to perform. More information about this can be found here.

Now, you can run the pipeline using:

nextflow run nf-core/raredisease \
   -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> \
   --input samplesheet.csv \
   --outdir <OUTDIR>

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
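
As an illustration, parameters can be collected in a YAML file and passed with -params-file; this is only a sketch, and the file name and parameter values below are placeholders:

# params.yaml is a hypothetical file name; input and outdir values are placeholders
cat > params.yaml <<'EOF'
input: "samplesheet.csv"
outdir: "results"
EOF

nextflow run nf-core/raredisease \
   -profile docker \
   -params-file params.yaml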

For more details and further functionality, please refer to the usage documentation and the parameter documentation.

Pipeline output

For more details about the output files and reports, please refer to the output documentation.

Credits

nf-core/raredisease was written in a collaboration between the Clinical Genomics nodes in Sweden, with major contributions from Ramprasad Neethiraj, Anders Jemt, Lucia Pena Perez, and Mei Wu at Clinical Genomics Stockholm.

Additional contributors were Sima Rahimi, Gwenna Breton and Emma Västerviga (Clinical Genomics Gothenburg); Halfdan Rydbeck and Lauri Mesilaakso (Clinical Genomics Linköping); Subazini Thankaswamy Kosalai (Clinical Genomics Örebro); Annick Renevey and Peter Pruisscher (Clinical Genomics Stockholm); Ryan Kennedy (Clinical Genomics Lund); Anders Sune Pedersen (Danish National Genome Center) and Lucas Taniguti.

We thank the nf-core community for their extensive assistance in the development of this pipeline.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to get in touch on the Slack #raredisease channel (you can join with this invite).

Citations

If you use nf-core/raredisease for your analysis, please cite it using the following doi: 10.5281/zenodo.7995798

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

You can read more about MIP's use in healthcare in:

Stranneheim H, Lagerstedt-Robinson K, Magnusson M, et al. Integration of whole genome sequencing into a healthcare setting: high diagnostic rates across multiple clinical entities in 3219 rare disease patients. Genome Med. 2021;13(1):40. doi:10.1186/s13073-021-00855-5


raredisease's Issues

Bcftools norm

Is your feature request related to a problem? Please describe

Normalize and split multi-allelic variants using bcftools norm prior to annotation.

Describe the solution you'd like

Incorporate the bcftools norm module from nf-core modules.
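
For reference, the command wrapped by that module boils down to something like this (file names are placeholders):

# split multi-allelic records and left-align/normalize against the reference
bcftools norm -m -any -f reference.fasta -Oz -o normalized.vcf.gz input.vcf.gz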

Describe alternatives you've considered

We could use vt decompose and normalize

Additional context

SamToFastq

Convert a SAM or BAM file to FastQ.
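
The underlying Picard call is roughly the following (file names are placeholders; paired-end output shown):

picard SamToFastq \
    I=input.bam \
    FASTQ=sample_R1.fastq.gz \
    SECOND_END_FASTQ=sample_R2.fastq.gz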

add tiddit/sv

Description of feature

In MIP, we combine callsets from manta, tiddit/sv, and cnvnator using svdb. We should add tiddit/sv.
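
A TIDDIT SV call amounts to roughly the following (flags follow the TIDDIT 3.x style, file names are placeholders; check the TIDDIT docs for the exact interface of the pinned version):

# call structural variants from an aligned BAM
tiddit --sv --bam sample.bam --ref reference.fasta -o sample_tiddit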

Add TIDDIT/cov module

Our in-house pipeline (MIP) uses this tool and we want to add this to the nextflow pipeline. It's not part of nf-core modules yet: nf-core/modules#792.

Once the module is added to nf-core/modules, it'll be added to the qc_bam.nf subworkflow.
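
For orientation, the coverage call itself looks roughly like this (bin size and file names are placeholders; verify the flags against the TIDDIT version actually used):

# compute per-bin coverage from an aligned BAM
tiddit --cov --bam sample.bam -z 500 -o sample_cov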

VEP

Is your feature request related to a problem? Please describe

Add VEP from nf-core modules

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Refactor alignment modules

Description of feature

Currently, the code snippet below lives in the raredisease.nf script, but once more mappers/aligners are in the picture we should hide the logic away in a larger subworkflow with switches for selecting which tool to use. This way we can declutter the raredisease.nf script.

if (params.aligner == 'bwamem2') {
        ALIGN_BWAMEM2 (
            INPUT_CHECK.out.reads,
            PREPARE_GENOME.out.bwamem2_index
        )
...

turns into...

if (aligner == 'bwamem2') {
        ALIGN_BWAMEM2 (
            reads,
            bwamem2_index
        )
...

stowed in the bigger subworkflow 👍 - where aligner is an input defined in the take: block.

RevertSam (GATK/Picard)

Produce an unmapped BAM (uBAM) from an aligned BAM.
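
The corresponding Picard call is roughly the following (file names are placeholders):

# strip alignment information to produce a uBAM
picard RevertSam \
    I=aligned.bam \
    O=unmapped.bam \
    SANITIZE=true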

New modules required: Sentieon

Here is a list of Sentieon tools that are relevant for the pipeline and for which issues have been opened in https://github.com/nf-core/modules.

Another tool that might be relevant but for which there is no open issue at the moment:

  • WgsMetricsAlgo

Add Picard CollectHsMetrics to the BamQC subworkflow

Our in-house pipeline uses this tool and we want to add this to the nextflow pipeline. It's not part of nf-core modules yet: nf-core/modules#793.

Once the module is added to nf-core/modules, it'll be added to the qc_bam.nf subworkflow.

EDIT: The module is part of nf-core/modules now. Please go ahead and add it to the qc_bam.nf subworkflow.
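
The wrapped command is roughly the following (interval lists and file names are placeholders):

# hybrid-selection metrics for a capture experiment
picard CollectHsMetrics \
    I=sample.bam \
    O=sample_hs_metrics.txt \
    R=reference.fasta \
    BAIT_INTERVALS=baits.interval_list \
    TARGET_INTERVALS=targets.interval_list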

Create subworkflow to prepare indices

Is your feature request related to a problem? Please describe

The pipeline currently rebuilds the bwamem2 index on every run. To save resources, there should be a check for existing indices that can be used instead.

Describe the solution you'd like

This subworkflow should 1) check for existing reference index files, and 2) allow indices to be re-used by different downstream processes.
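
From the command line this could look something like the sketch below; note that --bwamem2_index is a hypothetical parameter name used for illustration only, not an agreed interface:

# --bwamem2_index is a hypothetical flag; the real parameter name is to be decided
nextflow run nf-core/raredisease \
   -profile docker \
   --input samplesheet.csv \
   --outdir results \
   --fasta reference.fasta \
   --bwamem2_index /path/to/existing/bwamem2_index_dir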

Describe alternatives you've considered

Additional context

Add vcfanno to the annotation workflow

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Java memory issue on SLURM

Check Documentation

I have checked the following places for your error:

Description of the bug

Steps to reproduce

Steps to reproduce the behaviour:

  1. nextflow run nf-core/raredisease -profile test,singularity,hasta,dev_prio -r dev (-c customconf.conf )
  2. See error:

Without customconf.conf

[dd/f36687] NOTE: Process `NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)` terminated with an error exit status (134) -- Execution is retried (1)
WARN: Input tuple does not match input set cardinality declared by process `NFCORE_RAREDISEASE:RAREDISEASE:DEEPVARIANT_CALLER:GLNEXUS` -- offending value: [id:caseydonkey]
Error executing process > 'NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)'

Caused by:
  Process `NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)` terminated with an error exit status (134)

Command executed:

  picard \
      -Xmx6g \
      MarkDuplicates \
      --CREATE_INDEX \
      -I 1234N.bam \
      -O 1234N_sorted.bam \
      -M 1234N_sorted.MarkDuplicates.metrics.txt

  cat <<-END_VERSIONS > versions.yml
  MARKDUPLICATES:
      markduplicates: $(echo $(picard MarkDuplicates --version 2>&1) | grep -o 'Version:.*' | cut -f2- -d:)
  END_VERSIONS

Command exit status:
  134

Command output:
  #
  # A fatal error has been detected by the Java Runtime Environment:
  #
  #  Internal Error (g1PageBasedVirtualSpace.cpp:43), pid=211157, tid=211219
  #  guarantee(rs.is_reserved()) failed: Given reserved space must have been reserved already.
  #
  # JRE version:  (11.0.9.1) (build )
  # Java VM: OpenJDK 64-Bit Server VM (11.0.9.1-internal+0-adhoc..src, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
  # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
  #
  # An error report file with more information is saved as:
  # hs_err_pid211157.log
  #
  #

Command error:
  /usr/local/bin/picard: line 5: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory
  /usr/local/bin/picard: line 66: 211157 Aborted                 /usr/local/bin/java -Xmx6g -jar /usr/local/share/picard-2.25.7-0/picard.jar MarkDuplicates "--CREATE_INDEX" "-I" "1234N.bam" "-O" "1234N_sorted.bam" "-M" "1234N_sorted.MarkDuplicates.metrics.txt"

With customconf.conf:

process {
    withName: PICARD_MARKDUPLICATES {
        memory = 5.GB
    }
}
Error executing process > 'NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)'

Caused by:
  Process `NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)` terminated with an error exit status (1)

Command executed:

  picard \
      -Xmx5g \
      MarkDuplicates \
      --CREATE_INDEX \
      -I 1234N.bam \
      -O 1234N_sorted.bam \
      -M 1234N_sorted.MarkDuplicates.metrics.txt

  cat <<-END_VERSIONS > versions.yml
  MARKDUPLICATES:
      markduplicates: $(echo $(picard MarkDuplicates --version 2>&1) | grep -o 'Version:.*' | cut -f2- -d:)
  END_VERSIONS

Command exit status:
  1

Command output:
  Error occurred during initialization of VM
  Could not reserve enough space for 5242880KB object heap

Command error:
  /usr/local/bin/picard: line 5: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory

Expected behaviour

Successful completion of the analysis

Log files

Have you provided the following extra information/files:

  • The command used to run the pipeline
  • The .nextflow.log file

System

  • Hardware: HPC, hasta
  • Executor: slurm
  • OS: CentOS
  • Version: 7

Nextflow Installation

  • Version: 21.04.3.5560

Container engine

  • Engine: singularity
  • version: 3.1.1-1.el7

Quick fix that solves the problem until a more elegant solution is found

modules/nf-core/modules/picard/markduplicates/main.nf:
avail_mem = task.memory.giga - 2   // leave ~2 GB of headroom for JVM overhead outside the heap

Next related issue

Similar error for bamqc.

Additional context

For the first error, markduplicates:

nextflow-customconf.log
nextflow-no-customconf.log

Update the way module versions are emitted

Is your feature request related to a problem? Please describe

nf-core/modules updated the way versions are emitted, from <software>.version.txt to versions.yml. This allows multiple versions to be emitted in cases where a module or subworkflow uses multiple tools. Updated documentation here.

This pipeline is not updated accordingly yet.

Describe the solution you'd like

Update the subworkflows and main workflow 😄 accordingly
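
In a module's script block the new-style emission looks roughly like the snippet below (modeled on the MarkDuplicates example further up; the module name and tools shown are only illustrative):

cat <<-END_VERSIONS > versions.yml
BCFTOOLS_NORM:
    bcftools: $(bcftools --version 2>&1 | head -n1 | sed 's/^bcftools //')
    tabix: $(tabix --version 2>&1 | head -n1 | sed 's/^.*(htslib) //')
END_VERSIONS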

Describe alternatives you've considered

Additional context

Add Vcfanno to nf-core modules

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Add feature/MarkDuplicates

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Add this finishing touch to the mapping subworkflow so that the preprocessing of BAM files is complete before branching out into other tools, e.g. variant callers.

Describe alternatives you've considered

Additional context

Draft overview of future pipeline

nf-raredisease

This overview is based on the WGS/WES rare disease pipeline (MIP) that is currently in use at Clinical Genomics Stockholm. This outlines the basic functionality and modules that we would like to have from a pipeline specialised for calling, annotating and scoring variants relevant for rare disease patients.

Overview

FASTQ files are prepared for variant calling by alignment with bwa-mem/bwa-mem2 followed by MarkDuplicates. From this point the workflow is split into an SNV/indel part and an SV part.

SNV/Indels

SNV/indels are primarily called with Deepvariant and Glnexus, with the possibility of turning on the GATK Haplotypecaller workflow. These two callsets can be combined into one for maximum sensitivity. Vcfanno annotates the callset with population allele frequencies (gnomAD) and predicted pathogenicity (CADD). Common variation is removed from the callset and CADD scores are calculated for indels. VEP is used for transcript annotation, including annotation with ClinVar, SpliceAI and pLI scores. The SNV/indels are split into a clinical callset and a research callset based on a BED file with genes of interest. Finally, the variants are ranked for predicted pathogenicity based on their annotations as well as their modes of inheritance.

SV

We use Cnvnator, Manta, (Delly) and Tiddit to call structural variants. Using SVDB we combine the variants into one callset, and using a local frequency database we remove common variants and sequencing/calling artefacts. The callset is annotated with vcfanno and VEP, followed by a split into a clinical callset and a research callset. The SVs are then ranked in the same manner as the SNVs.

But wait there's more

Aside from SNVs and SVs, the pipeline identifies and visualizes runs of homozygosity/autozygosity as well as uniparental disomies (UPDs). Also included are the identification and annotation of pathogenic STRs with ExpansionHunter and Stranger. SMNCopyNumberCaller is used to diagnose patients with spinal muscular atrophy.

The tools mentioned here are not set in stone and we are certainly open to adding and changing tools as we continue development. Below is a list of tools used in the workflow.

Bcftools
BedTools
BWA
CADD
Chanjo
Chromograph
Cnvnator
Cyrius
Delly
Deepvariant
Expansionhunter
FastQC
GATK
GENMOD
Gffcompare
Glnexus
Manta
MultiQC
Peddy
PicardTools
PLINK
Rhocall
Sambamba
Samtools
SMNCopyNumberCaller
Stranger
Svdb
Telomerecat
Tiddit
Upd
Vcf2cytosure
Vcfanno
VEP

Create an interactive chart to use during pipeline development

To complement the project board, we would like to have an interactive chart that reflects the progress of the development work and that can be easily modified, e.g. when we want to include more tools.
nf-core recommends LucidChart or Google Drawings for such a task. For the moment we are going with Google Drawings.

Add svdb/merge to pipeline

Description of feature

Add this to call_structural_variants.nf to combine VCFs from manta, cnvpytor, and tiddit.
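
The merge step itself is roughly the following (file names are placeholders):

# combine per-caller SV callsets into one VCF
svdb --merge --vcf manta.vcf cnvpytor.vcf tiddit.vcf > combined_svs.vcf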

Gens preprocessing

Description of feature

Add preprocessing for Gens to the pipeline (a command-level sketch of the GATK steps follows the checklist below).

  • GATK CollectReadCounts added to nfcore/modules
  • GATK DenoiseReadCounts added to nfcore/modules
  • Gens perl-scripts added as a local module
  • Local subworkflow added
  • Subworkflow added to main workflow
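
The first two checklist items correspond roughly to the following GATK calls (the interval list, panel of normals and file names are placeholders):

# collect per-interval read counts
gatk CollectReadCounts \
    -I sample.bam \
    -L preprocessed_intervals.interval_list \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O sample.counts.hdf5

# denoise the counts against a panel of normals
gatk DenoiseReadCounts \
    -I sample.counts.hdf5 \
    --count-panel-of-normals panel_of_normals.hdf5 \
    --standardized-copy-ratios sample.standardizedCR.tsv \
    --denoised-copy-ratios sample.denoisedCR.tsv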

Parse input vcf to check for normalization

Is your feature request related to a problem? Please describe

We need to know that the input VCFs used in, for example, the annotation process have been decomposed.

Describe the solution you'd like

Write a small script that parses the header and checks for the bcftools norm command line.
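
A minimal sketch of such a check, relying on the ##bcftools_normCommand header line that bcftools norm leaves behind (the file name is a placeholder):

# fail if the VCF header carries no record of bcftools norm having been run
if ! bcftools view -h input.vcf.gz | grep -q '^##bcftools_normCommand='; then
    echo "ERROR: input.vcf.gz does not appear to have been normalized with bcftools norm" >&2
    exit 1
fi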

Describe alternatives you've considered

Additional context

Include a default variant catalog file

Maybe a default file for variant_catalog (in case the user doesn't provide one) should still be added? What do you think? In case this should be included in this merge, I can try to look into how this could be done in prepare_genome.nf.

It's not a bad idea. However, I think we can go ahead and merge this one and add that option in a small PR later. We could bundle it with the pipeline or have it as a URL: https://raw.githubusercontent.com/Illumina/ExpansionHunter/master/variant_catalog/hg19/variant_catalog.json
There has also been a discussion about adding a download workflow which would automatically download all the references.

Originally posted by @jemten in #51 (comment)

Adding read groups to meta

It would be good to add read_group to meta so that bwa_mem2 can use it, and other future programs too (e.g. peddy needs it).

I have tested a little bit the addition of the line:
meta.read_group = "'@rg\tID:" + row.sample + "_" + row.fastq_1.split('/')[-1].split('R1*.fastq')[0] + "_" + row.lane + "\tPL:ILLUMINA\tSM:" + row.sample.split('_')[0] + "'"
in subworkflows/local/input_check.nf
But it creates issues when GLnexus needs to combine the different channels again (see nextflow log).
This problem does, however, not arise with:
meta.read_group = "'@rg\tID:myid\tPL:ILLUMINA\tSM:" + row.sample.split('_')[0] + "'"

The problem arises with both a unique sample or multiple samples in the samplesheet.

Mitochondria workflow

We have agreed to use the mitochondrial workflow currently implemented in the GATK best practices.

The following steps are included. Modules already exist for some of them; all of the modules need to be included in a subworkflow. We plan to have the mitochondrial subworkflow run by default, but with the possibility to turn it off, and also to turn off the calling of variants for the autosomes.

  • Samtools subsampling [nf-core/raredisease] #49
  • RevertSam [nf-core/raredisease]#106
  • SamtoFastq [nf-core/raredisease]#107
  • BWA
  • GATK MergeBamAlignment
  • Picard MarkDuplicates
  • Haplocheck [nf-core/raredisease]#111
  • call variants with GATK Mutect2
  • Picard LiftoverVCF
  • GATK Mergevcfs [nf-core/raredisease]#113
  • GATK4 [FilterMutectCalls] [nf-core/raredisease]#115
  • GATK Filterblacklist
  • annotation with HmtNote
  • annotation with vep. Because it requires a database, special care has to be taken to run it offline. Cf nf-core/sarek too.
  • bcftools query and
  • bcftools view to prepare input for haplogrep2
  • call mitochondrial haplogroup with haplogrep2 (To be included in workflow)
  • detect mitochondrial deletions with eKLIPse (not in bioconda) (To be included in workflow) (to be checked)

This list can be modified as new issues are created and new modules are added.

Test dataset including mtDNA: https://github.com/nf-core/test-datasets/tree/raredisease
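
As an illustration, the Mutect2 step in the checklist above corresponds roughly to a call like the following (file names are placeholders, and the mitochondrial contig name depends on the reference):

# call mitochondrial variants in mitochondria mode
gatk Mutect2 \
    -R reference.fasta \
    -L chrM \
    --mitochondria-mode \
    -I sample_mt_realigned.bam \
    -O sample_mt_raw.vcf.gz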

update current module versions

Is your feature request related to a problem? Please describe

Samtools and MultiQC are outdated.

Describe the solution you'd like

Update their versions 😃

Describe alternatives you've considered

Additional context

Add bcftools/annotate

Description of feature

Hello 👋, we use this to add additional annotations after vcfanno and to include header lines describing software and case info.
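
In command terms this amounts to something like the following (the annotation file, header file, and column names are placeholders; the tab-delimited annotation file needs to be bgzipped and tabix-indexed):

# add extra annotations and custom header lines on top of the vcfanno output
bcftools annotate \
    --annotations extra_annotations.tsv.gz \
    --header-lines extra_header_lines.txt \
    --columns CHROM,POS,REF,ALT,INFO/CUSTOM_TAG \
    --output-type z \
    --output annotated.vcf.gz \
    vcfanno_output.vcf.gz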
