
fgsv's Introduction

fgsv


Tools to gather evidence for structural variation via breakpoint detection.

Documentation

More detailed documentation can be found in the docs folder.

Introduction to the fgsv Toolkit

The fgsv toolkit contains tools for effective structural variant investigation. These tools are not meant to be used as a structural variant calling toolchain in-and-of-itself; instead, it is better to think of fgsv as a breakpoint detection and structural variant exploration toolkit.

Note

When describing structural variation, we use the term breakpoint to mean a junction between two loci and the term breakend to refer to one of the loci on one side of a breakpoint.

Important

All point intervals (1-length) reported by this toolkit are 1-based inclusive from the perspective of the reference sequence unless otherwise documented.

SvPileup

Collates pileups of reads over breakpoint events.

fgsv SvPileup \
    --input sample.bam \
    --output sample.svpileup

The tool fgsv SvPileup takes a queryname-grouped BAM file as input and scans each query group (template) of alignments for structural variant evidence. For a simple example: a paired-end read may have one alignment per read (one alignment for read 1 and another alignment for read 2) mapped to different reference sequences supporting a putative translocation.

Primary and supplementary alignments for a template are used to construct a “chain” of aligned sub-segments in a way that honors each sub-segment's mapping location and strandedness relative to the reference sequence. The aligned sub-segments in a chain relate to each other through typical alignment mechanisms like insertions and deletions, but the chain also captures the relative orientation of each sub-segment to the reference sequence and, importantly, jumps between reference sequences, which could indicate translocations. See the SAM Format Specification v1 for more information on how reads relate to alignments.
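The chain structure can be sketched as a minimal data model (the names here are illustrative, not fgsv's internal types):

```python
from dataclasses import dataclass

@dataclass
class AlignedSegment:
    """One aligned sub-segment of a template (hypothetical model)."""
    refname: str           # reference sequence name
    start: int             # 1-based inclusive start on the reference
    end: int               # 1-based inclusive end on the reference
    positive_strand: bool  # orientation relative to the reference

# A chain is the ordered list of sub-segments for one template, in query
# order; a jump to a different contig between adjacent segments is
# evidence for a putative translocation.
chain = [
    AlignedSegment("chr1", 100, 199, True),
    AlignedSegment("chr7", 5000, 5099, True),
]
```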

For each chain of aligned sub-segments per template, outlier jumps are collected: an inter-segment jump within a read (split-read evidence) must be at least 100bp (by default), and an inter-read jump (e.g. between the reads of a paired-end template) must be at least 1000bp (by default). At locations where these jumps occur, breakpoints are marked and given a unique ID based on the loci of the breakends and the directionality of the left and right strands leading into each breakend. Where there is evidence for both a split-read jump and an inter-read jump, the split-read evidence is favored since it yields a precise breakpoint. This process creates a collection of candidate breakpoint locations.
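A minimal sketch of the jump test using the default thresholds above (the function name and signature are illustrative, not fgsv's API):

```python
def is_outlier_jump(distance: int, within_read: bool,
                    min_within_read: int = 100,
                    min_between_reads: int = 1000) -> bool:
    """Return True if an inter-segment jump is large enough to mark a
    breakpoint. Jumps within a single read (split-read evidence) use the
    smaller threshold; jumps between the reads of a pair use the larger.
    """
    threshold = min_within_read if within_read else min_between_reads
    return abs(distance) >= threshold

# A 150bp jump between split-read segments qualifies; the same jump
# between read 1 and read 2 of a pair does not.
assert is_outlier_jump(150, within_read=True)
assert not is_outlier_jump(150, within_read=False)
```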

The tool outputs a table of candidate breakpoints and a BAM file with each alignment tagged with the ID of the breakpoint it supports (if any).

AggregateSvPileup

Aggregates and merges pileups that are likely to support the same breakpoint.

fgsv AggregateSvPileup \
    --bam sample.bam \
    --input sample.svpileup.txt \
    --output sample.svpileup.aggregate.txt

The tool fgsv AggregateSvPileup is used to aggregate nearby breakpoints into one event if they appear to support one true breakpoint. This polishing step preserves true positive breakpoint events and is intended to reduce the number of false positive breakpoint events.

Aggregating breakpoints is often necessary because of variability in typical short-read alignments caused by somatic mutation, sequencing error, alignment artifacts, or breakend sequence similarity/homology to the reference sequence. This variability means that it is not always possible to locate the exact nucleotide coordinate where either breakend of a breakpoint occurs. Instead, either breakend of a true breakpoint may map to a plausible region (instead of a point coordinate); when this happens, the cluster of breakends can be aggregated to build up support for one true breakpoint.

Clustered breakpoints are only merged if their left breakends map to the same strand of the same reference sequence, their right breakends map to the same strand of the same reference sequence, and their left and right genomic breakend positions are both within a given length threshold of 10bp (by default).
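The merge criteria above can be restated as a small predicate (a sketch; the field and function names are hypothetical, not fgsv's internals):

```python
from typing import NamedTuple

class Breakpoint(NamedTuple):
    left_ref: str
    left_pos: int
    left_strand: str   # "+" or "-"
    right_ref: str
    right_pos: int
    right_strand: str

def can_merge(a: Breakpoint, b: Breakpoint, slop: int = 10) -> bool:
    """True if two breakpoints may be aggregated: same contig and strand
    for both breakends, and both positions within `slop` bp."""
    return (a.left_ref == b.left_ref and a.left_strand == b.left_strand
            and a.right_ref == b.right_ref and a.right_strand == b.right_strand
            and abs(a.left_pos - b.left_pos) <= slop
            and abs(a.right_pos - b.right_pos) <= slop)

a = Breakpoint("chr1", 100, "+", "chr2", 200, "+")
b = Breakpoint("chr1", 103, "+", "chr2", 205, "+")  # within 10bp on both sides
assert can_merge(a, b)
```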

One shortcoming of the existing behavior, which should be corrected at some point, is that intra-read breakpoint evidence is considered similarly to inter-pair breakpoint evidence even though intra-read breakpoint evidence often has nucleotide-level alignment resolution and inter-pair breakpoint evidence does not.

The tool outputs a table of aggregated breakpoints and a modified copy of the input BAM file where each alignment is tagged with the ID of the aggregate breakpoint it supports (if any).

AggregateSvPileupToBedPE

Converts the output of AggregateSvPileup to the BEDPE format.

fgsv AggregateSvPileupToBedPE \
    --input sample.svpileup.aggregate.txt \
    --output sample.svpileup.aggregate.bedpe

The tool fgsv AggregateSvPileupToBedPE is used to convert the output of AggregateSvPileup to BEDPE so that it can be viewed in IGV and other BEDPE-supporting genome browsers. For example:

BEDPE in IGV

fgsv's People

Contributors

clintval, jdidion, msto, nh13, pamelarussell, tfenne


fgsv's Issues

SvPileup should handle circular contigs better

When running SvPileup on a BAM file aligned carefully to a circular genome (plasmid, mito) it generates:

  1. A really well supported breakpoint using split reads that span the origin
  2. A huge number of "abnormal pair" supported breakpoints

For intra-contig evidence, if the contig is labeled as circular (TP:circular in the relevant SQ header), SvPileup should treat the transition around the origin as contiguous and re-calculate the "inner distance" between segments for split reads and between reads for pairs. Doing this should remove the vast majority of origin-related breakpoints being emitted.
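The proposed origin-aware distance could look like the following sketch, assuming simple 1-based positions on a single circular contig:

```python
def circular_inner_distance(pos_a: int, pos_b: int, contig_length: int) -> int:
    """Distance between two positions on a circular contig: the shorter
    of the direct path and the path through the origin."""
    direct = abs(pos_a - pos_b)
    return min(direct, contig_length - direct)

# On the 16,569bp human mitochondrial genome, positions 100 and 16,500
# are only 169bp apart through the origin, not 16,400bp.
assert circular_inner_distance(100, 16_500, 16_569) == 169
```

Re-computing the inner distance this way would keep origin-spanning pairs and split reads below the jump thresholds, so they would no longer be emitted as breakpoints.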

Document running the tools and any bwa mem options

Requirements:

  • run bwa mem:
    • without the -M option (-M marks shorter split hits as secondary).
    • use the -Y option (use soft clipping for supplementary alignments).

Run:

  • fgsv SvPileup -i <input.bam> -o <out-pre>.raw
  • fgsv AggregateSvPileup -i <out-pre>.raw.txt -o <out-pre>.final.txt

`AggregateSvPileup` should account for inaccurate split-read breakpoint positions

Currently AggregateSvPileup merges breakpoints whose left and right breakends are within a distance threshold of each other, regardless of the type of read evidence for the breakpoints: split-read (the breakpoint occurs inside a sequenced read) or read-pair (the breakpoint occurs in the unsequenced insert between mates).

However, these two types of evidence have different precision of the breakpoint position and should use different distance thresholds. While split-read evidence is likely to point to a very precise position, the position for a read-pair event can be off by as much as the inner distance (insert size minus read lengths). Something similar to the following procedure should be used instead:

  1. "Seed" clusters by clustering only breakpoints that have split-read evidence
  2. "Seed" additional clusters with breakpoints that have read-pair evidence
  3. Use read-pair events to aggregate clusters when the distance is within the inner distance (computed empirically by sampling)
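The evidence-aware threshold in step 3 might be sketched like this (the 300bp inner distance is a made-up stand-in for the empirically sampled value, and the function name is hypothetical):

```python
def merge_threshold(evidence_a: str, evidence_b: str,
                    split_read_slop: int = 10,
                    inner_distance: int = 300) -> int:
    """Pick a merge distance based on the evidence types of the two
    breakpoints: tight for split-read vs split-read, loose otherwise."""
    if evidence_a == "split_read" and evidence_b == "split_read":
        return split_read_slop
    return inner_distance
```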

SvPileup should allow some minimal filtering

I'm envisioning we could add parameters like:

--min-split-reads (default: 0)
--min-read-pairs (default: 0)
--min-total-support (default: 1)

The only problem with doing this is that we would no longer be able to write the evidence BAM on the fly; we'd have to write a temporary BAM with all reads supporting all breakpoints, and then re-process it to remove reads/templates for breakpoints with insufficient support. That's not a big overhead: these BAMs are usually pretty small. It would also give us a logical place to re-sort the BAM into coordinate order for use in IGV.
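The filter itself would be trivial; a sketch using the hypothetical parameter names proposed above:

```python
def passes_filters(split_reads: int, read_pairs: int,
                   min_split_reads: int = 0, min_read_pairs: int = 0,
                   min_total_support: int = 1) -> bool:
    """Keep a breakpoint only if each evidence count meets its minimum
    and the combined support meets the total minimum."""
    return (split_reads >= min_split_reads
            and read_pairs >= min_read_pairs
            and split_reads + read_pairs >= min_total_support)

assert passes_filters(1, 0)           # one split read meets the defaults
assert not passes_filters(0, 0)       # no support at all is dropped
```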

fgsv should enable easier review in IGV

fgsv generates an output BAM from the pileup phase that has tags like:

be:Z:0;left;from;read_pair
be:Z:0;left;into;read_pair
be:Z:0;left;from;split_read
be:Z:0;left;into;split_read
be:Z:0;right;from;split_read
be:Z:0;right;into;split_read

where records from the same template (or same read in the case of split reads) get different values. This makes it really hard to group/color/etc. in IGV and see all read alignments that support a given breakpoint.

I would suggest that some of this information be extracted into individual tags: at least the breakpoint ID, and possibly also the left/right value into a second tag. That way users could group by breakpoint ID and get a view of all the reads supporting a breakpoint together.
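Splitting a `be` tag value into its components is straightforward; a sketch assuming a single semicolon-delimited value per tag, as in the examples above:

```python
def parse_be_tag(value: str) -> dict:
    """Split an fgsv `be` tag value (e.g. "0;left;from;read_pair") into
    its parts, so the breakpoint ID could be re-emitted as its own tag
    for grouping in IGV."""
    bp_id, side, direction, evidence = value.split(";")
    return {"id": int(bp_id), "side": side,
            "direction": direction, "evidence": evidence}

parsed = parse_be_tag("0;left;from;read_pair")
assert parsed["id"] == 0 and parsed["side"] == "left"
```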

Are some "Possible Deletion"s actually intra-contig rearrangements?

Looking at what falls through as a possible deletion, I think fgsv categorizes the following as a "possible deletion" when it is more likely an "intra-contig rearrangement":

  1. The left and right breakends have the same contig and strand
  2a. When both are on the + strand, the position of the left breakend is greater than that of the right breakend
  2b. When both are on the - strand, the position of the left breakend is lower than that of the right breakend

A deletion implies the same contig and strand with a jump "forward" (5'->3').
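The distinction can be sketched as a small classifier (names and signature are illustrative only):

```python
def classify(left_pos: int, right_pos: int,
             same_contig: bool, same_strand: bool,
             positive_strand: bool) -> str:
    """A deletion jumps "forward" (5'->3') on the same contig and
    strand; a backward jump there is an intra-contig rearrangement."""
    if not (same_contig and same_strand):
        return "other"
    forward = (left_pos < right_pos) if positive_strand else (left_pos > right_pos)
    return "possible deletion" if forward else "intra-contig rearrangement"

# On the + strand, a left breakend downstream of the right breakend is
# a backward jump, not a deletion.
assert classify(900, 100, True, True, True) == "intra-contig rearrangement"
```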
