vip's Introduction

Variant Interpretation Pipeline

VIP is a flexible human variant interpretation pipeline for rare disease using state-of-the-art pathogenicity prediction (CAPICE) and template-based interactive reporting to facilitate decision support.

Example Report

Documentation

VIP documentation is available at https://molgenis.github.io/vip/.

Tip

Visit https://vip.molgeniscloud.org/ to analyse your own variants

Tip

Preprint now available at medRxiv

Quick Reference

Requirements

Installation

git clone https://github.com/molgenis/vip
bash vip/install.sh

Usage

usage: vip -w <arg> -i <arg> -o <arg>
  -w, --workflow <arg>  workflow to execute. allowed values: cram, fastq, gvcf, vcf
  -i, --input    <arg>  path to sample sheet .tsv
  -o, --output   <arg>  output folder
  -c, --config   <arg>  path to additional nextflow .cfg (optional)
  -p, --profile  <arg>  nextflow configuration profile (optional)
  -r, --resume          resume execution using cached results (default: false)
  -s, --stub            quickly prototype workflow logic using process script stubs
  -h, --help            print this message and exit
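
For example, a hypothetical run of the vcf workflow (the sample sheet name and output folder are placeholders):

vip -w vcf -i samples.tsv -o results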

Developers

To create the documentation pages:

pip install mkdocs mkdocs-mermaid2-plugin
mkdocs serve

License

VIP is an aggregate work of many works, each covered by their own licence(s). For the purposes of determining what you can do with specific works in VIP, this policy should be read together with the licence(s) of the relevant tools. For the avoidance of doubt, where any other licence grants rights, this policy does not modify or reduce those rights under those licences.

vip's People

Contributors

bartcharbon, dennishendriksen, marikaris, mswertz, sietsmarj, svandenhoek


vip's Issues

Running with VEP --no_escape leads to invalid VCF and problems with e.g. GenMod

We use VEP with --no_escape because this leads to more readable output for the report:
NP_000383.1:p.Phe39=
instead of
NP_000383.1:p.Phe39%3D

However, the = character is illegal in an INFO field, so we now produce invalid VCF, which can cause problems when the output is used in other tooling. We already ran into this when using GenMod with the --vep option.

When fixing this, we need to unescape the relevant string in vip-report.
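
A minimal sketch of such an unescape step in plain bash (the helper name is hypothetical, not an existing vip-report function):

# percent-decode VEP-escaped characters, e.g. %3D -> =
urldecode() {
  local encoded="${1//+/ }"
  printf '%b' "${encoded//%/\\x}"
}

urldecode 'NP_000383.1:p.Phe39%3D'   # prints NP_000383.1:p.Phe39=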

Check if all modules are available before running pipeline.

Is your feature request related to a problem? Please describe.
If a required module is missing on the system running the pipeline, an error is only raised once that step is reached. While someone could install the module and rerun the individual steps from the point of failure to still generate the results, it would be much cleaner to check for missing requirements beforehand and fail with a clear error.

Describe the solution you'd like
A check that validates all required modules are present before running any step from the pipeline.
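
A minimal sketch of such a check, assuming an Environment Modules/Lmod setup; the module names are placeholders:

# fail fast if any required module cannot be found
REQUIRED_MODULES=("CAPICE/v1.2-foss-2018b" "VEP/100-foss-2018b")
for mod in "${REQUIRED_MODULES[@]}"; do
  if ! module avail "${mod}" 2>&1 | grep -q "${mod}"; then
    echo "ERROR: required module ${mod} is not available" >&2
    exit 1
  fi
done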

Describe alternatives you've considered

Additional context

Clean up test data

Test data has no INFO data but the metadata headers from the prototype are still there

Command-line option -fa --fasta for reference sequence

Introduce an optional command-line option for a fasta reference sequence file
Replace hardcoded VEP reference with the provided fasta reference sequence file
Adjust the VEP arguments in case no fasta reference sequence file is supplied (e.g. remove hgvs)
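
A minimal sketch of the points above; the argument parsing is illustrative and VEP_ARGS is a placeholder for the existing VEP argument string:

VEP_ARGS="--offline --cache --hgvs"   # placeholder for the real argument string

FASTA=""
while [ $# -gt 0 ]; do
  case "$1" in
    -fa|--fasta)
      FASTA="$2"
      shift
      ;;
  esac
  shift
done

if [ -n "${FASTA}" ]; then
  VEP_ARGS+=" --fasta ${FASTA}"                 # use the provided reference sequence
else
  VEP_ARGS="${VEP_ARGS/ --hgvs/}"               # drop --hgvs when no reference is supplied
fi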

--start does not work very conveniently

This option either empties the previous output (when running with -f) or complains that the output already exists (without -f).

This is inconvenient because the goal of this option is to partially rerun the pipeline.

Bash wrapper script to improve usage in Easybuild env

suggestion from @pneerincx

Why is the "binary" called pipeline.sh and not simply vip without the .sh extension? Then you would not need:
Show usage: sh ${EBROOTVIP}/pipeline.sh
We have several other tools that are started via a bash wrapper script named after the tool and without an extension; it makes things easier for users. If you make such a vip wrapper executable and add its location to ${PATH}, it can be invoked as a command without prefixing sh and/or ${EBROOTVIP}. EasyBuild sets the executable bits and adds the location to ${PATH} automatically if your wrapper/binary lives in a bin subdirectory. If you do not want to put it in a bin subdir, you can add a few extra lines to your easyconfig: postinstallcmds to set the permissions and modextrapaths = {'PATH': '.'} to add the root of your install dir to ${PATH}.
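
A minimal sketch of such a wrapper (the path resolution is an assumption about the install layout):

#!/usr/bin/env bash
# bin/vip: resolve the install directory and delegate to pipeline.sh
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
exec bash "${SCRIPT_DIR}/../pipeline.sh" "$@"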

Inheritance module genmod crash for missing VEP annotations

Likely cause: a variant without CSQ annotations (removed with the default decision tree) in combination with genmod running in --vep mode:

step 4/5 inheritance matching ...
[2021-02-09 11:48:16,052] WARNING : genmod.commands.annotate_models: 'CSQ'
Aborted!
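
A quick diagnostic sketch to confirm this, counting records without a CSQ annotation after filtering (the file name is a placeholder):

bcftools view -H filtered.vcf.gz | grep -vc 'CSQ='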

Use bgzip -l and -@

-l

define compression level when we use bgzip:

bgzip -c -l 0 test.vcf > test_fast_compression.vcf.gz
bgzip -c -l 1 ...
...
bgzip -c -l 9 test.vcf > test_best_compression.vcf.gz

use max compression level for outputs, use faster compression level for intermediate resources.

decide whether we want to pipe bcftools output to bgzip so we can specify the compression level.

-@

define number of threads when we use bgzip:

bgzip -c -@ 8 test.vcf > test_parallel.vcf.gz
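
Combining the two, a sketch of piping bcftools output through bgzip so both the compression level and the thread count can be set (file names are placeholders):

bcftools view -Ov input.vcf.gz | bgzip -l 1 -@ 8 > intermediate.vcf.gz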

VCFs without samples fail

When the input VCF has no samples, the pipeline adds a "FORMAT" header somewhere along the way, leading to this error:

Unable to parse header with error: Your input file has a malformed header: The FORMAT field was provided but there is no genotype/sample data, for input source
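
A minimal sketch of a guard for this, assuming bcftools is available (the file name is a placeholder):

# only inject FORMAT header lines when the VCF actually contains samples
SAMPLE_COUNT=$(bcftools query -l input.vcf | wc -l)
if [ "${SAMPLE_COUNT}" -eq 0 ]; then
  echo "input has no samples: skipping FORMAT header injection" >&2
fi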

Script fails when trying to run from a different directory.

Describe the bug

The pipeline.sh script fails when running it from a different directory.

To Reproduce

v1.0.0

Directory structure used when encountering issue:

vip
|- vip-1.0.0
  |- pipeline.sh
  |- pipeline_0_preprocess.sh
  \- etc.
\- out

Steps to reproduce the behavior:

  1. Make sure you are in the vip folder.
  2. sbatch --constraint=tmp01 vip-1.0.0/pipeline.sh -i vip-1.0.0/test/data/test.vcf -o out/test1.vcf

This results in the following error:

$ cat vip.err
/var/spool/slurmd/job307828/slurm_script: line 195: ./pipeline_0_preprocess.sh: No such file or directory

master (5845048)

Using the current master (commit: 5845048) results in a slightly different vip.err (as it does not stop after the first error), but has the same problem:

$ cat vip.err
/var/spool/slurmd/job308894/slurm_script: line 12: utils/header.sh: No such file or directory
sh: ./pipeline_preprocess.sh: No such file or directory
sh: ./pipeline_annotate.sh: No such file or directory
sh: ./pipeline_filter.sh: No such file or directory
mv: cannot stat ‘/groups/umcg-gcc/tmp01/umcg-svandenhoek/vip/vip-master-2020-09-08/out3/out_pipeline_out/step2_filter//out.vcf’: No such file or directory
sh: ./pipeline_report.sh: No such file or directory
cp: cannot stat ‘/groups/umcg-gcc/tmp01/umcg-svandenhoek/vip/vip-master-2020-09-08/out3/out_pipeline_out/step4_report//out.html’: No such file or directory

Expected behavior

Script runs independent of the directory from which it was executed.

Additional context

A solution such as the one seen here might fix this. However, using cd to move to the script directory might break user-supplied relative paths. Prepending ${BASE_PATH} to the invoked scripts might therefore be a better solution.
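
A minimal sketch of the ${BASE_PATH} approach (the script names match the repository, the rest is illustrative):

# resolve the directory containing pipeline.sh, independent of the caller's working directory
BASE_PATH="$(dirname "$(readlink -f "$0")")"
bash "${BASE_PATH}/pipeline_preprocess.sh" "$@"
bash "${BASE_PATH}/pipeline_annotate.sh" "$@"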

Add functions to bash scripts

Is your feature request related to a problem? Please describe.
Currently, reading the bash scripts means reading one long list of everything that is done, which reduces their readability. While the echo statements indicate where each new step starts in pipeline.sh, things like command-line parsing do not have their own "section" within the code.

Describe the solution you'd like
Dividing the separate steps of a single script (such as every step in pipeline.sh and things like command-line parsing) into their own functions would improve the readability of the scripts.
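
A minimal sketch of what this could look like for command-line parsing (the option letters are illustrative):

# wrap command-line parsing in its own function
parse_args() {
  while getopts "i:o:" opt; do
    case "${opt}" in
      i) INPUT="${OPTARG}" ;;      # values are used later by the calling script
      o) OUTPUT="${OPTARG}" ;;
      *) echo "unknown option" >&2; exit 1 ;;
    esac
  done
}

parse_args "$@"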

Describe alternatives you've considered
Adding additional comments. However, separate functions will look cleaner.

Inheritance module genmod crash if proband is not affected

Run the pipeline with a .ped file with an unaffected trio:

HG      HG002   HG003   HG004   1       2
HG      HG003   0       0       1       1
HG      HG004   0       0       2       1

Result:

step 4/5 inheritance matching ...
[2021-03-24 07:10:14,107] WARNING : genmod.commands.annotate_models: No affected individuals found for family HG. Skipping family.
[2021-03-24 07:10:14,107] WARNING : genmod.commands.annotate_models: Please provide at least one family with affected individuals
Aborted!
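
A minimal sketch of a pre-check that would avoid the hard abort (PED column 6 == 2 marks affected individuals; the file name is a placeholder):

if ! awk '$6 == 2 { found = 1 } END { exit !found }' family.ped; then
  echo "skipping inheritance matching: no affected individuals in family.ped" >&2
fi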

Use module specific config files for module versions/default settings

Is your feature request related to a problem? Please describe.
Related to:

  • #14 Fix module versions to increase reproducibility
  • #19 Replace hardcoded paths

Using a separate .config file for each module should make it easier to see the current configuration and adjust it (compared to looking it up in the script itself). An added benefit is that by viewing the .config file for a bash script, one knows exactly which modules that script depends on (this is not documented right now and has to be found in the source code).

Describe the solution you'd like
For each module, an identically named .config file. For example, a pipeline_annotate.config for pipeline_annotate.sh.

Example config file:

# Set module versions to be used
CAPICEVERSION=v1.2-foss-2018b

# Set paths to needed files/dirs
VEP_DATA=/apps/data/Ensembl/VEP/100

# Configure default values for input parameters (overridable by command line arguments)
CPU_CORES=4
ASSEMBLY=GRCh37

# Set default arguments of used tools
VEP_ARGS=" --stats_text \
--offline --cache --dir_cache ${VEP_DATA} \
--species homo_sapiens --assembly ${ASSEMBLY} \
--flag_pick_allele \
--coding_only \
--no_intergenic \
--af_gnomad --pubmed --gene_phenotype \
--shift_3prime 1 \
--no_escape \
--numbers \
--dont_skip \
--allow_non_variant \
--fork ${CPU_CORES}"

Example bash script:

source pipeline_annotate.config
module load CAPICE/${CAPICEVERSION}

Describe alternatives you've considered
Use a single global config file:

  • Config file could become cumbersome in the long-term if many modules exist.
  • If different bash scripts need to load a different version of a specific module, could cause issues unless module is always named explicitly as well (in which case the added benefit of a single config file might be reduced).

Setting module-versions/default values in the bash script and needed files/dirs in a global config file:

  • All bash scripts would have to load the single config file and therefore load settings they do not need.
  • Easier to configure a basic setup to run the pipeline (as only a single file needs to be adjusted).
  • Does require adjusting the bash script itself when, for example, a different default for CPU_CORES is wanted. When updating to a new version, it is less clear which custom changes need to be re-applied.

Remove low-quality reads by default

  • Change pre-processing module (a sketch of these filters follows this list)
    • Remove VCF records with failed filters (keep passed/missing)
    • Remove VCF records with low read depth for all samples
    • Modify VCF records: set samples with low read depth to missing
    • Add option '--filter_low_qual' to enable filtering
    • Add option '--filter_read_depth' to set read depth threshold
  • Change pipeline
    • Call preprocessing module with option '--filter_low_qual'
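
A minimal sketch of these filters using bcftools; the tool choice, the file names and the read depth threshold of 10 are assumptions:

# keep passed/missing filters, set low-depth genotypes to missing, drop records with no remaining genotypes
bcftools view -f PASS,. -Ou input.vcf.gz \
  | bcftools filter --set-GTs . -e 'FMT/DP < 10' -Ou - \
  | bcftools view -i 'F_MISSING < 1' -Oz -o preprocessed.vcf.gz -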
