
CIEVaD

Continuous Integration and Evaluation for Variant Detection. This repository provides a tool suite for the simple, streamlined and rapid creation and evaluation of genomic variant callsets. It is primarily designed for continuous integration of variant detection software and for plain containment checks between sets of variants. The tool suite utilizes the conda package management system and the Nextflow workflow language.

Contents:

  1. System requirements
  2. Installation
  3. Usage
  4. Help
  5. Citation

System requirements:

This tool suite was developed for Linux, which is the only officially supported operating system. Having any derivative of the conda package management system installed is the only strict system requirement. A recent version (≥20.04.0) of Nextflow is required to execute the workflows, but it can easily be installed via conda. For installation instructions for Nextflow via conda, see Installation.

🖥️ See list of tested setups:

  Requirement                      Tested with
  64 bit Linux operating system    Ubuntu 20.04.5 LTS
  Conda                            vers. 23.5.0, 24.1.2
  Nextflow                         vers. 20.04.0, 23.10.1

Installation:

  1. Download the repository:
git clone https://github.com/rki-mf1/cievad.git
  2. [Optional] Install Nextflow if it is not yet on your system. As good practice, you should use a new conda environment:
conda deactivate
conda create -n cievad -c bioconda nextflow
conda activate cievad

Usage:

This tool suite provides multiple functional features to generate synthetic sequencing data, generate sets of ground truth variants (truthsets) and evaluate sets of predicted variants (callsets). There are two main workflows, hap.nf and eval.nf. Both workflows are executed via the nextflow command line interface (CLI).

⚠️ Run commands from the root directory: Please run all commands from a terminal at the top-level folder (root directory) of this repository. Otherwise relative paths within the workflows might be invalid.

Generating haplotype data

The minimal command to generate haplotype data is

nextflow run hap.nf -profile local,conda

This generates the following data within the <project_root>/results/ directory:

  • a haplotype (FASTA), which is a copy of the provided reference sequence but deviates by a set of synthetic genomic variants
  • the variant set (VCF) of synthetic genomic variants in the haplotype
  • a set of reads (FASTQ) representing a sequencing experiment from the haplotype

Evaluating variant calls

The minimal command to evaluate the accordance between a truthset (generated data) and a callset is

nextflow run eval.nf -profile local,conda --callsets_dir <path/to/callsets>

where --callsets_dir is the parameter to specify a folder containing the callset VCF files. Currently, a callset within this folder has to follow the naming convention callset_<X>.vcf[.gz] where <X> is the integer of the corresponding truthset. Alternatively, one can provide a sample sheet of comma separated values (CSV file) with the columns "index", "truthset" and "callset", where "index" is an integer from 1 to n (the number of samples) and "callset"/"truthset" are paths to the pairwise matching VCF files. Callsets can optionally be gzip compressed. The command for the sample sheet input is

nextflow run eval.nf -profile local,conda --sample_sheet <path/to/sample_sheet>
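For illustration, a sample sheet following the columns described above might look like this (the file paths below are hypothetical):

```csv
index,truthset,callset
1,results/simulated_hap1.vcf,callsets/callset_1.vcf
2,results/simulated_hap2.vcf,callsets/callset_2.vcf.gz
```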

This generates the following data within the <project_root>/results/ directory:

  • a report (CSV, JSON) about accordance between the synthetic variant set and a given corresponding callset
  • a report (CSV) with statistics across all tested individuals

Tuning the workflow parameters

CIEVaD exposes the vast majority of parameters of the internal software tools for fine-tuning. The parameters to adjust the workflows are listed on their respective help pages. To inspect a help page, type --help after the script name, e.g. nextflow run hap.nf --help for the hap.nf workflow. Parameters can be adjusted via the CLI or directly within the nextflow.config file. Note that parameters provided via the CLI overwrite parameters set in the config. More information about tuning crucial parameters, e.g. read quality and genome coverage, can be found in the Wiki.
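As a sketch, a parameter can be set persistently in nextflow.config; the snippet below is illustrative (--read_type appears elsewhere in this project, but consult the --help pages for the authoritative parameter names):

```groovy
// nextflow.config -- parameter names are illustrative; see `nextflow run hap.nf --help`
params {
    read_type = 'ont'   // e.g. switch the simulated sequencing technology
}
```

A flag passed on the command line, e.g. nextflow run hap.nf -profile local,conda --read_type ont, takes precedence over the value in the config file.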

Help:

Visit the project wiki for more detailed information on parameters, help and FAQs.
Please file issues, bug reports and questions in the issues section.

Citation:

We have a manuscript available for CIEVaD. If you use CIEVaD please cite

@article{krannich2024cievad,
  title={CIEVaD: A Lightweight Workflow Collection for the Rapid and On-Demand Deployment of End-to-End Testing for Genomic Variant Detection},
  author={Krannich, Thomas and Ternovoj, Dmitrii and Paraskevopoulou, Sofia and Fuchs, Stephan},
  journal={Viruses},
  volume={16},
  number={9},
  pages={1444},
  year={2024},
  doi={10.3390/v16091444}
}

cievad's People

Contributors

dimitriternovoj, krannich479


cievad's Issues

ONT avg read quality is too low

  • Describe the bug
    With the current model and settings the average read quality is about 8. This is (1) out of date with the average read quality that current real ONT sequencers produce, and (2) so low that some pipelines simply filter out those reads entirely.

  • To Reproduce

  1. Run cievad hap --read_type ont
  2. Run poreCov
  3. Check Nanoplot quality report in results/1.Read_quality/
  • Expected behavior
    Something like an average of 15-20 should be the default for simulation.

  • Screenshots
    simu-hap1-read-qual

  • Possible solution
    Add new model or check settings of nanosim to improve avg read quality.

Test callset_dir vs sample_sheet with GH Actions / CI

Situation:
With the new feature #42 the sample_sheet parameter got introduced as an alternative to the callset_dir. These input formats should be mutually exclusive.

Improvement:
Test different input combinations of callset_dir and sample_sheet with GH Actions to ensure intended behavior. This is a good opportunity to also test invalid inputs (empty sample_sheet, missing files, etc).

Improve thread/cpu resource management

  • Is your feature request related to a problem? Please describe.
    At the moment executor.cpus and task.cpus are hardcoded. Reserving a bigger machine (e.g. on a cluster) doesn't improve the runtime.

  • Describe the solution you'd like
    Add a max_cores parameter in the config that is passed to the executors. Also, change relevant modules to scale accordingly via task.cpus.
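A sketch of how such a parameter could be wired into nextflow.config (a proposal, not the current implementation; the process name below is taken from elsewhere in this project):

```groovy
// nextflow.config -- proposed, not current behavior
params.max_cores = 4                  // user-adjustable upper bound
executor.cpus = params.max_cores      // cap total CPUs used by the local executor
process {
    withName: 'mason_simulator' {
        cpus = params.max_cores       // let CPU-heavy tasks scale up to the cap
    }
}
```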

Help pages

Problem:
Neither hap.nf nor eval.nf have a help page. The user has to dive into the nextflow.config file in order to find parameters that one can tune.

Solution:
Provide help pages in a UNIX CLI style, like nextflow run hap.nf --help

Update documentation

  • Update README with explanation that read depth and qual can be tuned. Hint to FAQ
  • Example case for documentation how to run with non-default parameter
  • Update wiki
  • Update minor release

Add CI

The project could be tested with an end-to-end test for good practice. Particularly the CLI.
Better wait until a decision is made on Issue #17 though.

Retry processes at least once

  • Is your feature request related to a problem? Please describe.
    Sometimes processes do not succeed due to hardware failures. Then a fraction of the synthetic samples is not generated.

  • Describe the solution you'd like
    There are errorStrategy and maxRetries directives for Nextflow processes s.t. this type of failure can be caught. This should be added to relevant (all?) processes.
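A minimal sketch of these directives in a process definition (the process name and retry count are illustrative):

```groovy
process example_task {
    errorStrategy 'retry'   // re-run a failed task instead of aborting the workflow
    maxRetries 1            // retry at least once, as requested

    // input/output/script sections unchanged
}
```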

All-in-one conda env?

I think the tool suite would benefit from having one joint conda environment instead of multiple individual environments, one for each module or snakemake job. It would reduce the runtime for the first-ever module activation, which is particularly useful for CI purposes. The question is whether all requirements and dependencies can be satisfied in a joint environment. Attempts to realize this should be trivial to test once a CI workflow is in place, i.e. once issue #18 is resolved.
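A sketch of what such a joint environment file could look like (the tool list is illustrative, assembled from tools mentioned in this project, and not exhaustive):

```yaml
# environment.yml -- hypothetical joint environment, tool list is illustrative
name: cievad
channels:
  - conda-forge
  - bioconda
dependencies:
  - nextflow
  - mason
  - bcftools
```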

Nextflow run via organization+tool name

  • Is your feature request related to a problem? Please describe.
    No problem, just a feature for convenience.

  • Describe the solution you'd like
    Nextflow provides a git-checkout-free solution for Nextflow pipelines via nextflow run <organization>/<repository>. CIEVaD provides multiple pipelines, so maybe something is possible in the style of nextflow run rki-mf1/cievad-hap, and the same with -eval, respectively.

Re-design pipeline activation

The current architecture of the program was initially designed for a much larger project with a lot more code re-use. Therefore, the folder structure and the split between creating configs and running a workflow are somewhat over-engineered. For long-term maintenance I highly advise reducing and restructuring the code base.

[eval.nf] Add sample sheet support

Currently, the callsets have to reside in, or be linked inside, the samples' data folders, like data/sample_hap001/callset.vcf.gz, with a particularly strict naming convention for the filename. That requirement should be less constrained, i.e. a minimal improvement would be

  • flexible filename
  • uncompressed input possible

[hap.nf] Add Mason read quality parameters to interface

  • Is your feature request related to a problem? Please describe.
    Illumina read error distribution cannot be trivially modified yet.

  • Describe the solution you'd like
    Include the corresponding mason simulator parameters in the CLI and nextflow.config.

Missing tools in snakemake7 env

In the conda environment "snakemake7" in dev branch the tools mason and bcftools are missing. Once I installed them manually in the conda env the script was able to run successfully.

[Documentation] Nextflow minimum version requirement is out-of-date

Issue:
Currently, the documentation says the user only requires Nextflow v20.04+. This statement was based on the Nextflow documentation and the Nextflow features used. However, in 2021 some things changed at the execution level of Nextflow, e.g.

  • the -dsl2 parameter usage for DSL2 language features
  • singularity env variables (not yet relevant but will be in case we support containerized execution)

Solution:
I think something like v22.04+ is appropriate. This can be precisely verified with a version matrix in GH Actions.

[CI] Test invalid input with Github Actions

Test invalid inputs to the modules, e.g.

  • missing ref (HAP)
  • invalid read type (HAP)
  • #64
  • missing callsets (EVAL)
  • missing truthset (EVAL)
  • malformatted sample sheet (EVAL)
  • wrong relative path to callset dir (EVAL)
  • deterministic results

[eval.nf] No input run doesn't trigger exit 1

Describe the bug

If neither --callsets_dir nor --sample_sheet is specified, eval.nf is supposed to run into the else-case and abort with exit 1. However, I accidentally tested this and nothing happened. The workflow ran successfully without anything being done.

To Reproduce

Steps to reproduce the behavior:

$ nextflow run eval.nf -profile local,conda --callset_dir data
N E X T F L O W  ~  version 23.10.1
Launching `eval.nf` [prickly_kilby] DSL2 - revision: e4728a89a7
$

(mind that --callset_dir was the mistake here. It must be --callsets_dir)

Expected behavior

As none of the valid input formats are specified, this should end up in an exit 1 with an error message.
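A sketch of such a guard in Nextflow (exact parameter handling may differ from the actual eval.nf):

```groovy
// proposed input validation, not the current eval.nf code
if (!params.callsets_dir && !params.sample_sheet) {
    exit 1, "ERROR: provide either --callsets_dir or --sample_sheet"
}
```

Since Nextflow turns any unknown --flag into a new params entry, a misspelled --callset_dir would leave params.callsets_dir unset and this guard would still fire, catching exactly the mistake described above.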

Setup (please complete the following information):

  • cievad v0.4.0
  • nextflow 23.10.1

[eval.nf] Summary of stats across samples

  • Is your feature request related to a problem? Please describe.
    Currently, the output of eval.nf is precision/recall statistics per sample (<sample>.sompy.stats.csv). It'd be helpful to have a summary statistic across all simulated samples.

  • Describe the solution you'd like
    Ideally, there would be a file containing an average and/or median for each feature in the individual samples' statistics.
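A lightweight way to sketch the averaging part (the file glob and column index are assumptions; som.py's actual column layout may differ):

```shell
# Average one feature column (here assumed to be column 5, e.g. tp) across all
# per-sample som.py stats files, skipping each file's header line.
awk -F',' 'FNR > 1 { sum += $5; n++ } END { if (n) printf "%.2f\n", sum / n }' sample*.sompy.stats.csv
```

The same one-liner could be repeated per column, or extended in awk to emit one average per feature.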

[eval.nf] som.py stats miss first index

PROBLEM:

The final stats report by som.py is missing the first column in the header. This is misleading when converting and viewing the CSV table, e.g. with

sed 's/,/\t/g' sample01.sompy.stats.csv | column -t | less -S 

because all the lines below the header are shifted with respect to the header.

FIX:

The first column of the header needs content, e.g. "idx" in the example below:

idx,type,total.truth,total.query,tp,fp,fn,...
0,indels,140,113,113,0,27,0,0,...
1,SNVs,320,305,305,0,15,0,0,...
5,records,460,422,418,4,42,0,0,...

IMPLEMENTATION:

Add a process that modifies the header after the som.py process.
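A minimal shell sketch of such a header fix (assuming the broken header line starts with a bare comma; file names are illustrative):

```shell
# Prepend "idx" to the empty first header field; data rows are left untouched.
sed '1s/^,/idx,/' sample01.sompy.stats.csv > sample01.sompy.stats.fixed.csv
```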

[hap.nf] sort + index bam by default

  • Is your feature request related to a problem? Please describe.
    The BAM file from the simulation workflow is vital for visually inspecting the simulation, e.g. with IGV. Currently, the BAM file returned from the NGS data generation is unsorted and comes without a BAI index.

  • Describe the solution you'd like
    After the mason_simulator process, sort and index the BAM file by default.
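A sketch of such a post-processing step (the samtools commands are standard; the process name and channel wiring are illustrative, not current code):

```groovy
// proposed follow-up process after mason_simulator, not current code
process sort_index_bam {
    input:
    path unsorted_bam

    output:
    tuple path("${unsorted_bam.baseName}.sorted.bam"), path("${unsorted_bam.baseName}.sorted.bam.bai")

    script:
    """
    samtools sort -o ${unsorted_bam.baseName}.sorted.bam ${unsorted_bam}
    samtools index ${unsorted_bam.baseName}.sorted.bam
    """
}
```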
