vip's Introduction

Variant Interpretation Pipeline

VIP is a flexible human variant interpretation pipeline for rare disease using state-of-the-art pathogenicity prediction (CAPICE) and template-based interactive reporting to facilitate decision support.

Example Report

Documentation

VIP documentation is available at https://molgenis.github.io/vip/.

Tip

Visit https://vip.molgeniscloud.org/ to analyse your own variants

Tip

Preprint now available at medRxiv

Quick Reference

Requirements

Installation

git clone https://github.com/molgenis/vip
bash vip/install.sh

Usage

usage: vip -w <arg> -i <arg> -o <arg>
  -w, --workflow <arg>  workflow to execute. allowed values: cram, fastq, gvcf, vcf
  -i, --input    <arg>  path to sample sheet .tsv
  -o, --output   <arg>  output folder
  -c, --config   <arg>  path to additional nextflow .cfg (optional)
  -p, --profile  <arg>  nextflow configuration profile (optional)
  -r, --resume          resume execution using cached results (default: false)
  -s, --stub            quickly prototype workflow logic using process script stubs
  -h, --help            print this message and exit
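
For example, a hypothetical run of the vcf workflow (the sample sheet name and output folder are placeholders):

vip -w vcf -i samples.tsv -o results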

Developers

To create the documentation pages:

pip install mkdocs mkdocs-mermaid2-plugin
mkdocs serve

License

VIP is an aggregate work of many works, each covered by their own licence(s). For the purposes of determining what you can do with specific works in VIP, this policy should be read together with the licence(s) of the relevant tools. For the avoidance of doubt, where any other licence grants rights, this policy does not modify or reduce those rights under those licences.

vip's People

Contributors

bartcharbon, dennishendriksen, marikaris, mswertz, sietsmarj, svandenhoek


vip's Issues

Running with VEP --no_escape leads to invalid VCF and problems with e.g. GenMod

We use VEP with --no_escape because this leads to more readable output for the report:
NP_000383.1:p.Phe39=
instead of
NP_000383.1:p.Phe39%3D

However, the = character is illegal in an INFO field, so we now produce invalid VCF, which can cause problems when the output is used in other tooling. We already ran into this when using GenMod with the --vep option.

When fixing this, we need to unescape the relevant string in vip-report.
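
A minimal sketch of such an unescape step in plain bash (the helper name is hypothetical, not an existing vip-report function):

# percent-decode VEP-escaped characters, e.g. %3D -> =
urldecode() {
  local encoded="${1//+/ }"
  printf '%b' "${encoded//%/\\x}"
}

urldecode 'NP_000383.1:p.Phe39%3D'   # prints NP_000383.1:p.Phe39=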

Check if all modules are available before running pipeline.

Is your feature request related to a problem? Please describe.
If a required module is missing on the system running the pipeline, an error is only raised once that step is reached. While someone could install the module and rerun the individual steps from the point of failure to still generate the results, it would be much cleaner to check for missing requirements beforehand and fail with a clear error.

Describe the solution you'd like
A check that validates all required modules are present before running any step from the pipeline.
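
A minimal sketch of such a check, assuming an Environment Modules/Lmod setup; the module names are placeholders:

# fail fast if any required module cannot be found
REQUIRED_MODULES=("CAPICE/v1.2-foss-2018b" "VEP/100-foss-2018b")
for mod in "${REQUIRED_MODULES[@]}"; do
  if ! module avail "${mod}" 2>&1 | grep -q "${mod}"; then
    echo "ERROR: required module ${mod} is not available" >&2
    exit 1
  fi
done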

Describe alternatives you've considered

Additional context

Clean up test data

Test data has no INFO data but the metadata headers from the prototype are still there

Command-line option -fa --fasta for reference sequence

Introduce an optional command-line option for a fasta reference sequence file
Replace hardcoded VEP reference with the provided fasta reference sequence file
Adjust the VEP arguments in case no fasta reference sequence file is supplied (e.g. remove hgvs)
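
A minimal sketch of the points above; the argument parsing is illustrative and VEP_ARGS is a placeholder for the existing VEP argument string:

VEP_ARGS="--offline --cache --hgvs"   # placeholder for the real argument string

FASTA=""
while [ $# -gt 0 ]; do
  case "$1" in
    -fa|--fasta)
      FASTA="$2"
      shift
      ;;
  esac
  shift
done

if [ -n "${FASTA}" ]; then
  VEP_ARGS+=" --fasta ${FASTA}"                 # use the provided reference sequence
else
  VEP_ARGS="${VEP_ARGS/ --hgvs/}"               # drop --hgvs when no reference is supplied
fi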

--start does not work very conveniently

This option either empties the previous output (when running with -f) or complains that the output already exists (without -f).

This is inconvenient because the goal of this option is to partially rerun the pipeline.

Bash wrapper script to improve usage in Easybuild env

suggestion from @pneerincx

Why is the "binary" called pipeline.sh and not simply vip without the .sh extension? Then you would not need:
Show usage: sh ${EBROOTVIP}/pipeline.sh
We have several other tools that are started via a bash wrapper script named after the tool and without an extension; it makes things easier for users. If you make such a vip wrapper executable and add its location to ${PATH}, it can be invoked as a command without prefixing sh and/or ${EBROOTVIP}. EasyBuild sets the executable bits and adds the location to ${PATH} automatically if your wrapper/binary lives in a bin subdirectory. If you do not want to put it in a bin subdir, you can add a few extra lines to your easyconfig: postinstallcmds to set the permissions and modextrapaths = {'PATH': '.'} to add the root of your install dir to ${PATH}.
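
A minimal sketch of such a wrapper (the path resolution is an assumption about the install layout):

#!/usr/bin/env bash
# bin/vip: resolve the install directory and delegate to pipeline.sh
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
exec bash "${SCRIPT_DIR}/../pipeline.sh" "$@"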

Inheritance module genmod crash for missing VEP annotations

Likely cause: a variant without CSQ annotations (removed with the default decision tree) in combination with genmod running in --vep mode:

step 4/5 inheritance matching ...
[2021-02-09 11:48:16,052] WARNING : genmod.commands.annotate_models: 'CSQ'
Aborted!
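
A quick diagnostic sketch to confirm this, counting records without a CSQ annotation after filtering (the file name is a placeholder):

bcftools view -H filtered.vcf.gz | grep -vc 'CSQ='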

Use bgzip -l and -@

-l

define compression level when we use bgzip:

bgzip -c -l 0 test.vcf > test_fast_compression.vcf.gz
bgzip -c -l 1 ...
...
bgzip -c -l 9 test.vcf > test_best_compression.vcf.gz

use max compression level for outputs, use faster compression level for intermediate resources.

decide whether we want to pipe bcftools output to bgzip so we can specify the compression level.

-@

define number of threads when we use bgzip:

bgzip -c -@ 8 test.vcf > test_parallel.vcf.gz
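
Combining the two, a sketch of piping bcftools output through bgzip so both the compression level and the thread count can be set (file names are placeholders):

bcftools view -Ov input.vcf.gz | bgzip -l 1 -@ 8 > intermediate.vcf.gz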

VCFs without samples fail

When the input VCF has no samples, the pipeline adds a "FORMAT" header somewhere along the way, leading to this error:

Unable to parse header with error: Your input file has a malformed header: The FORMAT field was provided but there is no genotype/sample data, for input source
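
A minimal sketch of a guard for this, assuming bcftools is available (the file name is a placeholder):

# only inject FORMAT header lines when the VCF actually contains samples
SAMPLE_COUNT=$(bcftools query -l input.vcf | wc -l)
if [ "${SAMPLE_COUNT}" -eq 0 ]; then
  echo "input has no samples: skipping FORMAT header injection" >&2
fi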

Script fails when trying to run from a different directory.

Describe the bug

The pipeline.sh script fails when running it from a different directory.

To Reproduce

v1.0.0

Directory structure used when encountering issue:

vip
|- vip-1.0.0
  |- pipeline.sh
  |- pipeline_0_preprocess.sh
  \- etc.
\- out

Steps to reproduce the behavior:

  1. Make sure you are in the vip folder.
  2. sbatch --constraint=tmp01 vip-1.0.0/pipeline.sh -i vip-1.0.0/test/data/test.vcf -o out/test1.vcf

This results in the following error:

$ cat vip.err
/var/spool/slurmd/job307828/slurm_script: line 195: ./pipeline_0_preprocess.sh: No such file or directory

master (5845048)

Using the current master (commit: 5845048) results in a slightly different vip.err (as it does not stop after the first error), but has the same problem:

$ cat vip.err
/var/spool/slurmd/job308894/slurm_script: line 12: utils/header.sh: No such file or directory
sh: ./pipeline_preprocess.sh: No such file or directory
sh: ./pipeline_annotate.sh: No such file or directory
sh: ./pipeline_filter.sh: No such file or directory
mv: cannot stat ‘/groups/umcg-gcc/tmp01/umcg-svandenhoek/vip/vip-master-2020-09-08/out3/out_pipeline_out/step2_filter//out.vcf’: No such file or directory
sh: ./pipeline_report.sh: No such file or directory
cp: cannot stat ‘/groups/umcg-gcc/tmp01/umcg-svandenhoek/vip/vip-master-2020-09-08/out3/out_pipeline_out/step4_report//out.html’: No such file or directory

Expected behavior

Script runs independent of the directory from which it was executed.

Additional context

A solution such as the one seen here might fix this. However, using cd to move to the script directory might break user-supplied relative paths. Prepending ${BASE_PATH} to the invoked scripts might therefore be a better solution.
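
A minimal sketch of the ${BASE_PATH} approach (the script names match the repository, the rest is illustrative):

# resolve the directory containing pipeline.sh, independent of the caller's working directory
BASE_PATH="$(dirname "$(readlink -f "$0")")"
bash "${BASE_PATH}/pipeline_preprocess.sh" "$@"
bash "${BASE_PATH}/pipeline_annotate.sh" "$@"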

Add functions to bash scripts

Is your feature request related to a problem? Please describe.
Currently, reading the bash scripts means reading one long list of everything that is done, which reduces their readability. While the echo statements indicate where each new step starts in pipeline.sh, things like command-line parsing do not have their own "section" within the code.

Describe the solution you'd like
Dividing the separate steps of a single script (such as every step in pipeline.sh and things like command-line parsing) into their own functions would improve the readability of the scripts.
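
A minimal sketch of what this could look like for command-line parsing (the option letters are illustrative):

# wrap command-line parsing in its own function
parse_args() {
  while getopts "i:o:" opt; do
    case "${opt}" in
      i) INPUT="${OPTARG}" ;;      # values are used later by the calling script
      o) OUTPUT="${OPTARG}" ;;
      *) echo "unknown option" >&2; exit 1 ;;
    esac
  done
}

parse_args "$@"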

Describe alternatives you've considered
Adding additional comments. However, separate functions will look cleaner.

Inheritance module genmod crash if proband is not affected

Run the pipeline with a .ped file with an unaffected trio:

HG      HG002   HG003   HG004   1       2
HG      HG003   0       0       1       1
HG      HG004   0       0       2       1

Result:

step 4/5 inheritance matching ...
[2021-03-24 07:10:14,107] WARNING : genmod.commands.annotate_models: No affected individuals found for family HG. Skipping family.
[2021-03-24 07:10:14,107] WARNING : genmod.commands.annotate_models: Please provide at least one family with affected individuals
Aborted!
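
A minimal sketch of a pre-check that would avoid the hard abort (PED column 6 == 2 marks affected individuals; the file name is a placeholder):

if ! awk '$6 == 2 { found = 1 } END { exit !found }' family.ped; then
  echo "skipping inheritance matching: no affected individuals in family.ped" >&2
fi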

Use module specific config files for module versions/default settings

Is your feature request related to a problem? Please describe.
Related to:

  • #14 Fix module versions to increase reproducibility
  • #19 Replace hardcoded paths

Using a separate .config file for each module should make it easier to see the current configuration and adjust it (compared to looking it up in the script itself). An added benefit is that by viewing the .config file for a bash script, one knows exactly which modules that script depends on (this is not documented right now and has to be found in the source code).

Describe the solution you'd like
For each module, an identically named .config file. For example, a pipeline_annotate.config for pipeline_annotate.sh.

Example config file:

# Set module versions to be used
CAPICEVERSION=v1.2-foss-2018b

# Set paths to needed files/dirs
VEP_DATA=/apps/data/Ensembl/VEP/100

# Configure default values for input parameters (overridable by command line arguments)
CPU_CORES=4
ASSEMBLY=GRCh37

# Set default arguments of used tools
VEP_ARGS=" --stats_text \
--offline --cache --dir_cache ${VEP_DATA} \
--species homo_sapiens --assembly ${ASSEMBLY} \
--flag_pick_allele \
--coding_only \
--no_intergenic \
--af_gnomad --pubmed --gene_phenotype \
--shift_3prime 1 \
--no_escape \
--numbers \
--dont_skip \
--allow_non_variant \
--fork ${CPU_CORES}"

Example bash script:

source pipeline_annotate.config
module load CAPICE/${CAPICEVERSION}

Describe alternatives you've considered
Use a single global config file:

  • Config file could become cumbersome in the long-term if many modules exist.
  • If different bash scripts need to load a different version of a specific module, could cause issues unless module is always named explicitly as well (in which case the added benefit of a single config file might be reduced).

Setting module-versions/default values in the bash script and needed files/dirs in a global config file:

  • All bash scripts would have to load the single config file and therefore load settings they do not need.
  • Easier to configure a basic setup to run the pipeline (as only a single file needs to be adjusted).
  • Does require adjusting the bash script itself when, for example, a different default for CPU_CORES is wanted. When updating to a new version, it is less clear which custom changes need to be re-applied.

Remove low-quality reads by default

  • Change pre-processing module (a sketch of these filters follows this list)
    • Remove VCF records with failed filters (keep passed/missing)
    • Remove VCF records with low read depth for all samples
    • Modify VCF records: set samples with low read depth to missing
    • Add option '--filter_low_qual' to enable filtering
    • Add option '--filter_read_depth' to set read depth threshold
  • Change pipeline
    • Call preprocessing module with option '--filter_low_qual'
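
A minimal sketch of these filters using bcftools; the tool choice, the file names and the read depth threshold of 10 are assumptions:

# keep passed/missing filters, set low-depth genotypes to missing, drop records with no remaining genotypes
bcftools view -f PASS,. -Ou input.vcf.gz \
  | bcftools filter --set-GTs . -e 'FMT/DP < 10' -Ou - \
  | bcftools view -i 'F_MISSING < 1' -Oz -o preprocessed.vcf.gz -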
