Giter VIP home page Giter VIP logo

vircov's Introduction

vircov

build codecov

Minimal virus genome coverage assessment for metagenomic diagnostics

Overview

v0.6.0

Purpose

Viral metagenomic diagnostics from low-abundance clinical samples can be challenging in the absence of sufficient genome coverage. Vircov extracts distinct non-overlapping regions from a reference alignment and generates some helpful statistics. It can be used to flag potential hits without inspection of coverage plots in automated pipelines and reports.

Implementation

Vircov is written in Rust and works with alignments in the standard PAF or SAM/BAM/CRAM (next release) formats. It is extremely fast and can process alignments against thousands of viral reference genomes in seconds. Basic input filters can be selected to remove spurious alignments and text-style coverage plots can be printed to the terminal for visual confirmation.

Vircov is written for implementation in metagenomics pipelines for human patients enroled in the META-GP network (Australia). As such, it attempts to be production-grade code with high test coverage, continuous integration, and versioned releases with precompiled binaries for Linux and MacOS.

Version 0.6.0 will be distributed on BioConda and Cargo and will have precompiled binaries available.

Installation

git clone https://github.com/esteinig/vircov 
cd vircov && cargo build --release

Tests

git clone https://github.com/esteinig/vircov 
cd vircov && cargo test && cargo tarpaulin 

Concept

Definitive viral diagnosis from metagenomic clinical samples can be extremely challenging due to low sequencing depth, large amounts of host reads and low infectious titres, especially in blood or CSF. One way to distinguish a positive viral diagnosis is to look at alignment coverage against one or multiple reference sequences. When only few reads map to the reference, and genome coverage is low, positive infections often display multiple distinct alignment regions, as opposed to reads mapping to a single or few regions on the reference. This has been implemented for example in SURPI+ used by Miller et al. (2019) for DNA and RNA viruses in CSF samples. De Vries et al. (2021) summarize this concept succinctly in this figure (adapted):

devries

Positive calls in these cases can be made from coverage plots showing the distinct alignment regions and a threshold on the number of regions is chosen by the authors (> 3). However, coverage plots require visual assessment and may not be suitable for flagging potential hits in automated pipelines or summary reports.

Vircov attempts to make visual inspection and automated flagging easier by counting the distinct (non-overlapping) coverage regions in an alignment and reports some helpful statistics to make an educated call based on coverage information. We have specifically implemented grouping options by for example taxid in the reference fasta headers and account for segmented viruses and those split into genes in some databases like Virosaurus.

In addition the most recent version implements single reference selection based on first grouping the reference sequences that have significant alignments by taxid or species name, and then selecting the reference equence with the highest number of unique mapped reads (for example). This allows for using Vircov with permissive (no filter) settings to select single reference genoems for re-mapping and and provides sensitive coverage information.

Usage examples

# Output coverage statistics from a SAM|BAM|PAF alignment
vircov --alignment test.paf > stats.tsv

# Output coverage statistics with headers of reference sequences
# used in the alignment, including non aligned references
vircov --alignment test.paf --fasta ref.fa --zero -v > stats.zero.tsv

# Output coverage statistics with threshold filters activated
vircov --alignment test.bam \
   --fasta ref.fa \
   --min-len 50 \                  # minimum query alignment length
   --min-cov 0 \                   # minimum query alignment coverage
   --min-mapq 50 \                 # minimum mapping quality
   --reads 3 \                     # minimum reads aligned against a reference
   --coverage 0.05 \               # minimum reference coverage fraction
   --regions 4 \                   # minimum distinct alignment regions
   --regions-coverage 0.3 \        # distinct regions filter applies < coverage of 30%
   --read-ids ids.txt \            # read identifiers of mapped reads of surviving alignments
   -v > stats.tsv

# Group alignments by field in reference description and select best coverage reference
# for each group - header field example: "name=virus; taxid=12345; segment=N/A". Use the
# "segment=" field to select a single best segment if aligned against and output selected
# segments in one sequence file (e.g. for Influenza genomes). Each selected reference
# or selected segments are output as sequence files into "outref".

vircov --alignment test.paf  \
   --fasta ref.fa \
   --min-len 50 \                      # minimum query alignment length
   --min-cov 0 \                       # minimum query alignment coverage
   --min-mapq 50 \                     # minimum mapping quality
   --reads 3 \                         # minimum reads aligned against a reference
   --coverage 0.05 \                   # minimum reference coverage fraction
   --regions 4 \                       # minimum distinct alignment regions
   --regions-coverage 0.3 \            # distinct regions filter applies < coverage of 30%
   --group-by "taxid=" \               # field to group alignments by
   --group-sep ";" \                   # field separator in reference description
   --group-select-by "cov" \           # select best coverage reference sequences
   --group-select-split outref \       # output selected references into this folder
   --group-select-order \              # order by highest coverage, include as index in filename 
   --segment-field "segment=" \        # group and select segmented genomes with field
   --segment-field-nan "N/A" \         # no segment value in segment field
   -v > stats.selected.tsv             # output selected references alignment stats

Performance

Alignments conducted with minimap2 -c -x sr (PAF) and minimap2 --sam-hits-only -ax sr (SAM/BAM, only aligned sequences). Peak memory is mainly determined by aligned interval records that are stored for the overlap computations. It may vary depending on how many alignments remain after filtering, the number of aligned reads, output formats and size of the reference database.

  • Sample 1: ~ 70 million Illumina PE reads against ~ 70k reference genomes, ~ 1.7 million alignments
  • Sample 2: ~ 80 million Illumina PE reads against ~ 70k reference genomes, ~ 63 million alignments

.paf

  • Sample 1 (212 MB): 0.56 seconds, peak memory: 12 MB
  • Sample 2 (8.9 GB): 55.4 seconds, peak memory: 2.9 GB

Etymology

Not a very creative abbreviation of "virus coverage" but the little spectacles in the logo are a reference to Rudolf Virchow who described such trivial concepts as cells, cancer and pathology. His surname is pronounced like vircov if you mumble the terminal v.

Contributors

  • Prof. Deborah Williamson and Prof. Lachlan Coin (principal investigators for META-GP)
  • Dr. Leon Caly (samples and sequencing for testing on clinical data)

vircov's People

Contributors

esteinig avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

vircov's Issues

Enhanced output

  • interval coordinates with number of alignments
  • number of unique reads in the alignments
  • implement verbosity command line option

Implemented in last column (otherwise empty) format: start:stop:count

Intervals now contain the read identifier as val - these are then reduced to unique read identities. For paired end data (mapping to + for forward sequence and - for reverse sequence) this means that the final read count is for unique read pairs (and single mapped reads of a pair) if mate pairs have the same identifier.

  • check when read identifiers contain /1 or /2 paired end designations

In case of different read identifier names in forward and reverse ({id}/1 or {id}/2, for example) the number of reads output represents the total count of forward and reverse reads.

Coverage conditional regions filter

High coverage genomes may only have a single assembled region, implement a conditional regional filter when e.g. coverage below a threshold

Unit tests core components

Implement appropriate unit tests for paf.rs and covplot.rs

Excluded for now:

  • CovPlot::to_console() - not sure how to implement tests for crossterm output to stdout
  • PafAlignment::to_console() - not sure how to test output tostdout
  • PafAlignment::coverage_plots() - same issue with crossterm unit testing

Segmented genomes

Add option to parse header fields for segment qualifiers with custom formats to include and select best segments from alignments. At the moment hard-coded to include all reference sequences that have the segment=N/A field in the header (as in Virosaurus) and are ignored during grouped single selection, outputting all segments into the selected genome sequence files.

Also add option to select the best segment if multiple segments (same identifier) are grouped.

0-value alignment output

Sometimes we want to know if no reads aligned against some reference sequences (e.g. when checking limited number of sequences as in synthetic reference controls). Add option to output 0-values in the alignment summary.

Coverage terminal plot

An approximate coverage distribution plot to print in the terminal.

Possibly something like:

5'------------------------------------[---]----------3'

or with colors?

Strobealign BAM --min-cov

For some arcane reason, when using BAM files output by Strobealign v0.7.1 the --min-cov filter removes all alignments (noted in the MetaGP remap loop process). Need to investigate the SAM/BAM query coverage calculation and compare with PAF in Strobealign.

Reporting threshold

Implement a coverage region threshold to filter outputs for reporting (as in the de Vries paper). Particularly useful for large databases - tested this on Virosaurus (70k genomes) and larger alignments / deeper sequencing runs.

Performance optimisation [PAF]

Need to implement an iterator based PAF parser and link it to interval extraction directly in the PafRecord structs - large PAF files could probably be parsed faster this way.

Output column names

Looks very interesting. Did I miss the output column names somewhere ?

gi_1070105496_ref_NC_031108.1__Propionibacterium_phage_PFR2,_complete_genome           2   2    2    132   0   0.0000
gi_1070125494_ref_NC_031129.1__Salmonella_phage_SJ46,_complete_genome                  2   2    2    128   0   0.0000
gi_110645916_ref_NC_001401.2__Adeno-associated_virus_-_2,_complete_genome              1   2    2    67    0   0.0000

Thanks

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.