svtools's Introduction

svtools - Comprehensive utilities to explore structural variations in genomes

Summary

svtools is a suite of utilities designed to help bioinformaticians construct and explore cohort-level structural variation calls. It is designed to efficiently merge and genotype calls from speedseq sv across thousands to tens of thousands of genomes.

Table of Contents

  1. Requirements
  2. Installation
  3. Obtaining help
  4. Usage
  5. Citing svtools
  6. Troubleshooting

Requirements

Installation

We recommend you install using pip. For more detailed instructions, see our Installation guide.

Installing via pip

pip install svtools

Obtaining help

Please see the documentation on, or linked to, this page. For additional help or to report a bug, please open an issue in the svtools repository: https://github.com/hall-lab/svtools/issues

Usage

svtools consists of subcommands for processing VCF or BEDPE files of structural variants and one accessory script (create_coordinates).

usage: svtools [-h] [--version] [--support] subcommand ...

Comprehensive utilities to explore structural variation in genomes

optional arguments:
  -h, --help     show this help message and exit
  --version      show program's version number and exit
  --support      information on obtaining support

  subcommand     description
    lsort        sort N LUMPY VCF files into a single file
    lmerge       merge LUMPY calls inside a single file from svtools lsort
    vcfpaste     paste VCFs from multiple samples
    copynumber   add copynumber information using cnvnator-multi
    genotype     compute genotype of structural variants based on breakpoint depth
    afreq        add allele frequency information to a VCF file
    bedpetobed12 convert a BEDPE file to BED12 format for viewing in IGV or the
                 UCSC browser
    bedpetovcf   convert a BEDPE file to VCF
    vcftobedpe   convert a VCF file to a BEDPE file
    vcfsort      sort a VCF file
    bedpesort    sort a BEDPE file
    prune        cluster and prune a BEDPE file by position based on allele
                 frequency
    varlookup    look for variants common between two BEDPE files
    classify     reclassify DEL and DUP based on read depth information

Citing svtools

Until svtools is published, please cite using its DOI. Note that this link corresponds to the latest version. If you used an earlier version then your DOI may be different and you can find it on Zenodo.

Troubleshooting

As issues arise and common problems are identified, we will list them here.

Note: For additional information and usage refer to the Tutorial.md file.

svtools's People

Contributors

abelhj, abhijitbadve, apregier, brentp, cc2qe, dantaki, davidlmorton, ernfrid, jeldred

svtools's Issues

vcfpaste header

vcfpaste is not printing the string "FORMAT" from the header line beginning "#CHROM"

##reference= lines

These are injected into our headers if no reference is present in the original file.

svtools vcfpaste, passing in VCF names

Previously I could use a wildcard "*vcf", but it no longer likes that:

$ svtools vcfpaste -f *ss

usage: svtools [-h] subcommand ...
svtools: error: unrecognized arguments:
svtools: error: unrecognized arguments: 1.vcf 2.vcf 3.vcf

So then I try to list them explicitly, same error:

$ svtools vcfpaste -f 1.vcf 2.vcf 3.vcf

usage: svtools [-h] subcommand ...
svtools: error: unrecognized arguments:
svtools: error: unrecognized arguments: 1.vcf 2.vcf 3.vcf

I noticed this bit in the usage page Line-delimited list of VCF files to paste

How do you do line-delimitation?
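
The usage text quoted above describes -f as a "Line-delimited list of VCF files to paste", which suggests a plain text file holding one VCF path per line. A minimal sketch for building such a file, assuming that reading is correct; the helper name write_vcf_list and the glob pattern are hypothetical and not part of svtools:

```python
import glob


def write_vcf_list(pattern, list_path):
    """Write every path matching `pattern` to `list_path`, one path per line.

    Returns the sorted list of paths that were written.
    """
    paths = sorted(glob.glob(pattern))
    with open(list_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths
```

Calling write_vcf_list("*.vcf", "vcf.list") would produce a file that could then be passed with -f, in the same way that another report on this page passes -f cn.list.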

Converting to BEDPE is lossy

We lose the reference allele information and we also lose information in and around symbolic alleles. For example, <DUP:TANDEM> gets converted to simply <DUP>. If this is the correct behavior then symbolic allele lines need to be added/scrubbed from the headers appropriately.

Does not build, new locations look funny

$ git pull
Already up-to-date.
$ python setup.py build
Traceback (most recent call last):
  File "setup.py", line 19, in <module>
    long_description=open('README.txt').read()
IOError: [Errno 2] No such file or directory: 'README.txt'
$ touch README.txt
$ python setup.py build
running build
running build_scripts
error: file '**************************/svtools/bin/vcftobedpe' does not exist
$ find . -name vcftobedpe
./tests/test_data/vcftobedpe
$ grep vcftobedpe setup.py
scripts=["bin/vcftobedpe","bin/varlookup","bin/svtools","bin/vcfsort",
$ find . -name varlookup
./tests/test_data/varlookup
$ find . -name svtools
./svtools
$ find . -name vcfsort
./svtools/bin/vcfsort
./tests/test_data/vcfsort

Off-by-one error in bedpetovcf

bedpetovcf has an off-by-one error. Calling vcftobedpe followed by bedpetovcf reduces the POS by 1. It seems like this issue is limited to BNDs.
This arose because of commit c163881, which fixed an off-by-one error in vcftobedpe.
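
Off-by-one errors of this kind usually come from mixing 1-based VCF positions with 0-based, half-open BED-style intervals. The sketch below shows a round trip that preserves POS, assuming the generic convention that a single base at 1-based position POS maps to the interval [POS - 1, POS). This is not necessarily the convention svtools uses for breakpoints, and both function names are hypothetical:

```python
def vcf_pos_to_bed_interval(pos):
    """Map a 1-based VCF position to a 0-based, half-open (start, end) pair."""
    if pos < 1:
        raise ValueError("VCF positions are 1-based; got %d" % pos)
    return pos - 1, pos


def bed_interval_to_vcf_pos(start, end):
    """Invert vcf_pos_to_bed_interval for a single-base interval."""
    if end != start + 1:
        raise ValueError(
            "expected a single-base interval, got [%d, %d)" % (start, end)
        )
    # The 1-based position of the base equals the exclusive 0-based end.
    return end
```

A regression test that converts a position in one direction and back, and asserts that the value is unchanged, would catch a shift such as the one reported here.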

##assembly lines are lost

Lines in the VCF header like ##assembly are getting scrubbed. Possibly additional lines would be scrubbed as well. Is this the proper behavior or does it need to be fixed? Seems improper to me.

Update documentation on lsort

lsort drops SECONDARY BND lines and converts some BNDs to INV. This is a little counterintuitive, and we should update the documentation to say so.

afreq is still too slow

We're doing a lot of unnecessary parsing in the current version. Run time, even with pure Python, should be much better.

Add a license

Need to review dependencies and make sure we have a valid license and license file.

lsort input file

Should we make lsort take an input file containing a list of VCFs to merge (like vcfpaste does now)? Currently, at least in my workflow, I'm using a bash loop to write a sort command that I then submit to the queue, which feels like kind of a kludge.

harmonize single character arguments for vcfpaste.py (other tools as well?)

The single-character arguments to some tools have not been obvious to me, which has led me to dig into the code to determine their intent. Examples include -m and -f to vcfpaste.py.
I find this most confusing when the argument takes a file path or similar value; I have specifically wondered whether I am providing an input path or an output path.
This suggests we may want to take a pass at the help text and names for these arguments.
I am not proposing a specific harmonization strategy, since changing this (or deciding not to change it) might take some thought. I am opening this issue so that we can consider it.

Store the command line in output VCF headers

We should embed the command line and program version in the VCF header.

GATK does this similar to ##GATKCommandLine.HaplotypeCaller=<ID

bcftools adds two lines: one with its version (and that of htslib) and another with the command line. Both are prefixed in a way similar to GATK (##bcftools_viewVersion and ##bcftools_viewCommand).
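
A minimal sketch of the bcftools-style approach described above. The header keys it generates (for example ##svtools_vcfpasteVersion) are hypothetical names modelled on the bcftools pattern, not keys svtools is known to emit, and both helper functions are illustrative only:

```python
def command_header_lines(tool, version, argv):
    """Return two VCF meta-information lines: program version and command line."""
    return [
        "##%sVersion=%s" % (tool, version),
        "##%sCommand=%s" % (tool, " ".join(argv)),
    ]


def add_to_header(header, extra):
    """Insert `extra` lines immediately before the #CHROM column line.

    If no #CHROM line is present, the extra lines are appended at the end.
    """
    out = []
    inserted = False
    for line in header:
        if line.startswith("#CHROM") and not inserted:
            out.extend(extra)
            inserted = True
        out.append(line)
    if not inserted:
        out.extend(extra)
    return out
```

In a real subcommand, argv would typically come from sys.argv and the version from the package metadata.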

Deprecation Warning encountered when running lsort.py (possibly lmerge.py?)

Running lsort.py produces a DeprecationWarning:

python /gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/lsort.py
/gscuser/jeldred/.local/lib/python2.6/site-packages/svtools-0.1.2-py2.6.egg/svtools/l_bp.py:2: DeprecationWarning: the sets module is deprecated
from sets import Set

According to Dave, the fix will be the same for all submodules: use the built-in set class instead of Set from the sets module.
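
The change suggested above can be sketched as follows. The built-in set type has been available since Python 2.4 and supports the same operations, so the import can simply be dropped; the variable names are illustrative and do not come from l_bp.py:

```python
# Before (triggers the DeprecationWarning on Python 2.6 and later):
#     from sets import Set
#     chroms = Set(["1", "2"])

# After: the built-in type needs no import.
chroms = set(["1", "2"])
chroms.add("X")

# Set algebra works the same way with the built-in type.
shared = chroms & set(["2", "X", "Y"])
```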

vcfToBedpe strandness

Hi, thanks for making this useful tool. I am using vcfToBedpe to convert VCF output from lumpy.
Columns 9 and 10 of the resulting BEDPE files contain strand information ("+" or "-").
What exactly do these strands tell us?
For translocations (BND), I found 4 different combinations of strands.
For deletions, it is always + -
For inversions, it is always + +
For duplications, it is always - +

Thanks very much!
Ming
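
The orientation pairs listed in the question can be written down as a small lookup table. This sketch encodes only the combinations reported above for DEL, INV and DUP; it is not a specification of vcfToBedpe output, and the names OBSERVED_TYPES and observed_type are hypothetical:

```python
# Strand pairs (BEDPE columns 9 and 10) as reported in the question above.
OBSERVED_TYPES = {
    ("+", "-"): "DEL",
    ("+", "+"): "INV",
    ("-", "+"): "DUP",
}


def observed_type(bedpe_line):
    """Return the SV type reported for this line's strand pair, or None."""
    fields = bedpe_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError(
            "expected at least 10 BEDPE columns, found %d" % len(fields)
        )
    # Columns 9 and 10 are at 0-based indices 8 and 9.
    return OBSERVED_TYPES.get((fields[8], fields[9]))
```

A pair that is absent from the table, such as "-" "-", returns None rather than a guess.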

Demo is horribly out of date

The install process as well as workflow and command lines have evolved significantly. A newer demo and/or tutorial needs to be developed.

Eliminate prune step

Fix lmerge to do what we actually want it to, and we may be able to eliminate prune step, which is a bit clunky.

Consider version management strategy

A few things to address:

  • How are version numbers managed?
    • Do we want a version.txt file or similar?
    • Do we want to try to manage via annotated tags in the git repository?

See also #50 for a discussion on release strategies.

VCF output of speedseq(lumpy) SVtype BND are translocations?

Hi there,

Lumpy outputs VCF files with 4 different SV types: BND, DEL, INV and DUP, something I have asked about here.

For BNDs:
I understand that if the two break ends are on different chromosomes, the event is more likely to be a translocation, but what about break ends on the same chromosome? I want to know whether BNDs are always translocations, or whether they are sometimes SVs whose type could not be determined and are therefore represented in break-end format.

I read here and saw DEL can be represented in BND format, but the example is showing break points on two different chromosomes, so it is a bit confusing.

Thanks for answering me.

Best,
Ming

Come up with packaging/release strategy

Let's discuss.

The goal is to make it as simple as possible for users to install (without root). The following two user stories seem realistic:

  1. I want to use all the tools in the svtools suite in a pipeline.
  2. I want to use just some of the tools (e.g. vcf-to-bedpe and bedpe-to-vcf).

Proposal:

  • Split repository up into multiple repositories, one for each tool (or logically related set of tools).
    • Benefits:
      • Allows for releasing tools separately (satisfying user story 2).
      • Isolates tools from one another so cross-dependencies cannot creep in.
      • Allows for more fine-grained testing and issue tracking.
      • Each tool (or tool set) could be implemented in language of choice.
    • Drawbacks:
      • Some developers feel more comfortable with a single repository
      • Automation for building conda and/or pypi packages would need to be created.
      • Some amount of boilerplate will exist in every repository.

Bedpe parser error checking

The BEDPE parser in bedpetovcf will die with an IndexError instead of reporting something intelligible if a line doesn't have an SVTYPE or AF field.
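
One way to fail more gracefully is to validate the required fields before using them. The sketch below assumes the fields live in a VCF-style INFO string (KEY=VALUE pairs separated by semicolons); the function names are hypothetical and this is not the existing bedpetovcf parser:

```python
def parse_info(info):
    """Split a VCF-style INFO string (KEY=VALUE;FLAG;...) into a dict."""
    result = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Flags without a value are stored as True.
        result[key] = value if sep else True
    return result


def require_info_keys(info, required, line_number):
    """Parse `info` and raise a descriptive error if a required key is absent."""
    parsed = parse_info(info)
    missing = [key for key in required if key not in parsed]
    if missing:
        raise ValueError(
            "line %d: missing required INFO field(s): %s"
            % (line_number, ", ".join(missing))
        )
    return parsed
```

The error message names the line and the missing field, which gives the user something to act on.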

Better error checking/reporting around missing paths needed

@jeldred observed that when an invalid path is provided for cnvnator-multi in svtools copynumber, the command fails with an error like the following:

/gscmnt/gc2801/analytics/jeldred/2016_03_31_test_callset/cn/logs/MISC $ cat NA12887.cn.9358519.log 
Traceback (most recent call last):
 File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/copynumber.py", line 103, in <module>
   sys.exit(args.entry_point(args))
 File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/copynumber.py", line 97, in run_from_args
   sv_readdepth(stream, args.sample, args.root, args.window, args.output_vcf, args.cnvnator, args.coordinates)
 File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/copynumber.py", line 24, in sv_readdepth
   cn = run_cnvnator(cnvnator_path, root, window, coord_list)
 File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/copynumber.py", line 11, in run_cnvnator
   p2 = Popen(cmd, stdin=p1.stdout, stdout=PIPE)
 File "/gscuser/jeldred/.pyenv/versions/2.7.9/lib/python2.7/subprocess.py", line 710, in __init__
   errread, errwrite)
 File "/gscuser/jeldred/.pyenv/versions/2.7.9/lib/python2.7/subprocess.py", line 1335, in _execute_child
   raise child_exception
OSError: [Errno 2] No such file or directory
cat: write error: Broken pipe

This is not an informative error that a normal person can act on. Error checking needs to be better for this subcommand (and arguably ALL subcommands should be reviewed).
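
The OSError in the traceback above is raised by Popen when the program it is asked to run cannot be found. A minimal sketch of an up-front check that would report the problem in plain terms before any subprocess is started; check_executable is a hypothetical helper, not part of svtools:

```python
import os


def check_executable(path, label):
    """Return `path` if it is an existing executable file, else raise IOError."""
    if not os.path.isfile(path):
        raise IOError("%s not found at '%s'" % (label, path))
    if not os.access(path, os.X_OK):
        raise IOError("%s at '%s' is not executable" % (label, path))
    return path
```

Calling check_executable(args.cnvnator, "cnvnator-multi") at the start of the subcommand would turn the bare "No such file or directory" into a message that names the offending argument. The attribute name args.cnvnator is taken from the traceback above.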

Release a version of hall-lab-svtools in GitHub

Hello hall-lab!

I find your vcftobedpe script super useful for my purposes. I'm about to package svtools for bioconda:

https://github.com/bioconda/bioconda-recipes/tree/master/recipes

But in order to properly package it, I would need you to release a version of your package, preferably here on GitHub (ie: hall-lab-svtools-0.1.1.tar.gz that the package manager could download from instead of just the master branch):

https://help.github.com/articles/creating-releases/

Thanks in advance! Let me know if you need assistance releasing it.

BND Variants on GL contigs move during file conversion

For example,

$ cat roundtrip_diff.out | cut -f1-4 | less
4a5
> ##INFO=<ID=POS,Number=1,Type=Integer,Description="Position of the variant described in this record">
81118,81119c81119,81120
< GL000193.1    16      13312_2 N
< GL000193.1    25      13311_2 N

---
> GL000193.1    17      13312_2 N
> GL000193.1    26      13311_2 N

Left file here is before conversion to BEDPE. The right file is after conversion to BEDPE and then back to VCF.

vcfpaste fails with clean exit code when given malformed -f(vcf-list)

Running vcfpaste with a vcf-list that includes the master VCF caused the error below, yet the job was still reported as successfully completed.

(jim-2.7.9) /gscmnt/gc2801/analytics/jeldred/2016_03_31_test_callset $ cat paste.cn.9241244.log
Traceback (most recent call last):
  File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/vcfpaste.py", line 126, in <module>
    sys.exit(args.entry_point(args))
  File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/vcfpaste.py", line 120, in run_from_args
    paster.execute()
  File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/vcfpaste.py", line 17, in execute
    self.write_variants(output_handle)
  File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/vcfpaste.py", line 87, in write_variants
    format = line_v[8]
IndexError: list index out of range


Sender: LSF System
Subject: Job 9241244: <paste.cn> in cluster Done

Job <paste.cn> was submitted from host <linus43.gsc.wustl.edu> by user in cluster .
Job was executed on host(s) <blade8-3-16.gsc.wustl.edu>, in queue , as user in cluster .
</gscuser/jeldred> was used as the home directory.
</gscmnt/gc2801/analytics/jeldred/2016_03_31_test_callset> was used as the working directory.
Started at Fri Apr 1 14:42:10 2016
Results reported on Fri Apr 1 14:42:12 2016

Your job looked like:


LSBATCH: User input

python /gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/vcfpaste.py -m merged.no_EBV.vcf -f cn.list -q | bgzip -c > merged.sv.gt.cn.redo.1.vcf.gz

Successfully completed.

Resource usage summary:

CPU time :               0.13 sec.
Total Requested Memory : 8000.00 MB
Delta Memory :           -
(Delta: the difference between Total Requested Memory and Max Memory.)

The output (if any) is above this job summary.
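
The IndexError in the traceback above comes from reading column index 8 (FORMAT) of a line that has fewer than nine columns. A minimal sketch of a check that reports which file and line are malformed; format_column is a hypothetical helper, not the current vcfpaste code:

```python
def format_column(line, source, line_number):
    """Return the FORMAT column of a VCF record or raise a descriptive error."""
    line_v = line.rstrip("\n").split("\t")
    if len(line_v) < 9:
        raise ValueError(
            "%s, line %d: expected at least 9 tab-separated columns "
            "(through FORMAT) but found %d" % (source, line_number, len(line_v))
        )
    return line_v[8]
```

The clean exit status in the job summary is consistent with the shell reporting the status of the last command in the pipeline (bgzip) rather than that of the Python process, so a clearer error message alone may not change what LSF reports.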

A convention for BEDPE breakpoint positions

Opening a formal issue for discussions of the proper spec for representing exact breakpoints in BEDPE.

Potential Reporting Conventions

  • Affected Bases (AFF)
    • pros
      • simple
    • cons
      • doesn't work for balanced translocations or BND variants
  • Left of the breakpoint (LOB)
    • pros
      • symmetrical representation of ++/-- inversions and balanced rearrangements
    • cons
      • telomeric variants may have negative position
  • Right of the breakpoint (ROB)
    • same as LOB except avoids negative positions (but may overhang chrom on right)
  • Exact breakpoint (BPT)
    • pros
      • accurately represents breakpoints as space between two bases
      • symmetrical representation of ++/-- inversions and balanced rearrangements
    • cons
      • zero-length events may not be compatible with BEDTools intersections and clustering
  • Last-aligned Base (LAB)
    • pros
      • simple description
    • cons
      • asymmetrical representation of ++/-- inversions and balanced rearrangements
      • size of the variant is END - START + 1

Classify needs to work for both rare and common variants

For common variants (>10 samples with non-reference allele), a linear regression across all samples for each variant works well to reclassify the variant. For rare variants, we may need to use an outside training set to train a classifier. Even for large cohorts, both approaches may be needed.
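
To make the regression idea concrete, the sketch below assumes that each sample's read-depth copy-number estimate is regressed on its count of non-reference alleles (0, 1 or 2): a clearly negative slope suggests a deletion and a clearly positive slope a duplication. The choice of variables, the threshold and the function names are all assumptions for illustration, not the svtools classify implementation:

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all samples have the same allele count")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def classify_by_slope(allele_counts, copy_numbers, min_abs_slope=0.5):
    """Label a variant DEL, DUP or unresolved from the regression slope."""
    s = slope(allele_counts, copy_numbers)
    if s <= -min_abs_slope:
        return "DEL"
    if s >= min_abs_slope:
        return "DUP"
    return "unresolved"
```

With only a handful of non-reference samples the slope estimate becomes noisy, which is the situation where the issue proposes a trained classifier instead.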

Memory improvement in lsort

Current sorting approach won't scale - need to limit the amount stored in memory, perhaps by chromosome and variant type.
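
One standard way to bound memory is an external merge sort: sort fixed-size chunks in memory, spill each to a temporary file, then merge the sorted chunks lazily. The sketch below is a generic illustration of that technique rather than a proposal for the lsort internals. It orders records by chromosome name (lexicographically) and numeric position, which is an assumption, and it relies on the key argument of heapq.merge, which requires Python 3.5 or later:

```python
import heapq
import tempfile


def sort_key(line):
    """Order VCF records by chromosome name, then numeric position."""
    chrom, pos = line.split("\t", 2)[:2]
    return chrom, int(pos)


def _spill(sorted_lines):
    """Write already-sorted lines to a temporary file and rewind it."""
    handle = tempfile.TemporaryFile(mode="w+")
    handle.writelines(sorted_lines)
    handle.seek(0)
    return handle


def external_sort(lines, chunk_size=100000):
    """Yield lines in sorted order, buffering at most chunk_size at a time."""
    chunks = []
    buffer = []
    for line in lines:
        # Each record must end in a newline so chunk files read back intact.
        buffer.append(line if line.endswith("\n") else line + "\n")
        if len(buffer) >= chunk_size:
            chunks.append(_spill(sorted(buffer, key=sort_key)))
            buffer = []
    if buffer:
        chunks.append(_spill(sorted(buffer, key=sort_key)))
    try:
        # heapq.merge holds only one pending line per chunk in memory.
        for line in heapq.merge(*chunks, key=sort_key):
            yield line
    finally:
        for handle in chunks:
            handle.close()
```

Splitting by chromosome and variant type, as suggested above, would be an alternative way to choose the chunks.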
