hall-lab / svtools
Tools for processing and analyzing structural variants.
License: MIT License
bedpetovcf has an off-by-one error. Calling vcftobedpe followed by bedpetovcf reduces the POS by 1. It seems like this issue is limited to BNDs.
This arose because of commit c163881, which fixed an off-by-one error in vcftobedpe.
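For reference, a sketch of the coordinate conventions the two tools must agree on (illustrative, not svtools' actual code): VCF POS is 1-based, while BEDPE starts are 0-based and half-open, so a lossless round trip has to apply the shift exactly once in each direction.

```python
# Sketch of the VCF <-> BEDPE coordinate shift; not svtools' actual code.
# VCF POS is 1-based; BEDPE start is 0-based, half-open.

def vcf_pos_to_bedpe(pos):
    """1-based VCF POS -> 0-based BEDPE (start, end) for a single base."""
    return pos - 1, pos

def bedpe_start_to_vcf(start):
    """0-based BEDPE start -> 1-based VCF POS. Must invert the above exactly."""
    return start + 1

pos = 13312
start, end = vcf_pos_to_bedpe(pos)
assert bedpe_start_to_vcf(start) == pos  # a correct round trip is lossless
```

If either direction applies the shift twice (or not at all), POS drifts by 1 on every vcftobedpe/bedpetovcf cycle, which matches the symptom described above.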
The Bedpe parser in bedpetovcf dies with an IndexError, rather than an intelligible message, if a line lacks an SVTYPE or AF field.
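A hypothetical sketch of friendlier validation (the helper names and the error message are illustrative, not the parser's actual code):

```python
# Hypothetical sketch of friendlier INFO-field validation; names and
# messages are illustrative, not svtools' actual code.

def parse_info(info_str):
    """Parse a semicolon-delimited INFO string into a dict."""
    info = {}
    for field in info_str.split(';'):
        if '=' in field:
            key, value = field.split('=', 1)
            info[key] = value
        else:
            info[field] = True  # flag fields have no value
    return info

def require_svtype(info_str, line_number):
    """Return SVTYPE, or raise a readable error instead of an IndexError."""
    info = parse_info(info_str)
    if 'SVTYPE' not in info:
        raise ValueError(
            'Line {0}: missing required SVTYPE in INFO field'.format(line_number))
    return info['SVTYPE']

assert require_svtype('SVTYPE=BND;AF=0.5', 1) == 'BND'
```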
Hello hall-lab!
I find your vcftobedpe script super useful for my purposes. I'm about to package svtools for bioconda:
https://github.com/bioconda/bioconda-recipes/tree/master/recipes
But in order to properly package it, I would need you to release a version of your package, preferably here on GitHub (e.g., a hall-lab-svtools-0.1.1.tar.gz that the package manager could download instead of just the master branch):
https://help.github.com/articles/creating-releases/
Thanks in advance! Let me know if you need assistance releasing it.
I believe name reconstruction is now incompatible with Manta VCFs.
It drops SECONDARY BND lines and converts some BNDs to INV. This is a little counterintuitive and we should update it.
Embedding the CLI and program version in the VCF header.
GATK does this with header lines similar to ##GATKCommandLine.HaplotypeCaller=<ID
bcftools adds two lines: one with its version (and htslib's), and another with the command line. Both are prefixed similarly to GATK (##bcftools_viewVersion and ##bcftools_viewCommand).
Googling it every time seems unwise.
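A minimal sketch of what svtools could emit, modeled on the bcftools convention above; the header keys shown are assumptions, not an agreed format:

```python
# Sketch of version/command header lines, modeled on the bcftools style.
# The '##svtoolsVersion'/'##svtoolsCommand' keys are assumptions, not an
# agreed svtools format.
import sys

def command_header_lines(version):
    """Return VCF meta-lines recording the tool version and invocation."""
    return [
        '##svtoolsVersion={0}'.format(version),
        '##svtoolsCommand={0}'.format(' '.join(sys.argv)),
    ]
```

These lines would be appended to the header just before the #CHROM line, one pair per subcommand invocation, so a VCF's provenance can be read back without guesswork.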
Need to review dependencies and make sure we have a valid license and license file.
@jeldred observed that when an invalid path is provided for cnvnator-multi in svtools copynumber, the resulting errors look like:
/gscmnt/gc2801/analytics/jeldred/2016_03_31_test_callset/cn/logs/MISC $ cat NA12887.cn.9358519.log
Traceback (most recent call last):
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/copynumber.py", line 103, in <module>
sys.exit(args.entry_point(args))
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/copynumber.py", line 97, in run_from_args
sv_readdepth(stream, args.sample, args.root, args.window, args.output_vcf, args.cnvnator, args.coordinates)
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/copynumber.py", line 24, in sv_readdepth
cn = run_cnvnator(cnvnator_path, root, window, coord_list)
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/copynumber.py", line 11, in run_cnvnator
p2 = Popen(cmd, stdin=p1.stdout, stdout=PIPE)
File "/gscuser/jeldred/.pyenv/versions/2.7.9/lib/python2.7/subprocess.py", line 710, in __init__
errread, errwrite)
File "/gscuser/jeldred/.pyenv/versions/2.7.9/lib/python2.7/subprocess.py", line 1335, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
cat: write error: Broken pipe
This is not an informative error that a normal person can act on. Error checking needs to be better for this subcommand (and arguably ALL subcommands should be reviewed).
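One possible pre-flight check before the Popen call, sketched here; the message wording is illustrative:

```python
# Sketch of a pre-flight check for the cnvnator-multi path; the message
# wording is illustrative, not svtools' actual code.
import os
import sys

def check_executable(path, name):
    """Exit with a readable message if path is missing or not executable."""
    if not os.path.isfile(path):
        sys.exit('{0} not found at {1}; check the path you provided'.format(
            name, path))
    if not os.access(path, os.X_OK):
        sys.exit('{0} at {1} is not executable'.format(name, path))
```

Called once at subcommand startup, this turns the OSError/Broken pipe cascade above into a single actionable line on stderr.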
Currently this is only used by prune.
Previously I could use a wildcard "*vcf", but it no longer likes that:
$ svtools vcfpaste -f *ss
usage: svtools [-h] subcommand ...
svtools: error: unrecognized arguments: 1.vcf 2.vcf 3.vcf
So then I tried listing them explicitly and got the same error:
$ svtools vcfpaste -f 1.vcf 2.vcf 3.vcf
usage: svtools [-h] subcommand ...
svtools: error: unrecognized arguments: 1.vcf 2.vcf 3.vcf
I noticed this bit in the usage page: "Line-delimited list of VCF files to paste".
How do you do line-delimitation?
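Assuming "line-delimited" means a file containing one VCF path per line, passed to -f (an assumption based on the help text quoted above), something like this should work:

```shell
# Build a list file with one VCF path per line (the paths are illustrative).
printf '%s\n' 1.vcf 2.vcf 3.vcf > vcf_list.txt
```

Then run `svtools vcfpaste -f vcf_list.txt`, i.e. -f takes the list file itself rather than the VCFs, which would explain why a shell glob expanding to multiple arguments is rejected.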
Jim found that we crash if lines exist where AF=. in the INFO field.
Should we make lsort take an input file with a list of VCFs to merge (like vcfpaste does now)? Currently, at least in my workflow, I'm using a bash loop to write a sort command that I then submit to the queue, which feels like kind of a kludge.
This line is problematic: if the Description contains a "<" or ">", parsing will break.
Example:
##FILTER=<ID=MSQ_20,Description="Variant without read-depth support with MSQ < 20">
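One way to parse these lines that tolerates angle brackets inside the quoted Description, sketched with illustrative helper names:

```python
# Sketch of header parsing that tolerates '<' or '>' inside a quoted
# Description; the helper name is illustrative, not svtools' actual code.
import re

def parse_header_line(line):
    """Parse '##FILTER=<ID=...,Description="...">' into a dict."""
    # Take everything between the FIRST '<' and the LAST '>' so angle
    # brackets inside the quoted Description are preserved.
    body = line[line.index('<') + 1:line.rindex('>')]
    # Match key=value pairs, treating a double-quoted value as one token.
    fields = re.findall(r'([A-Za-z0-9_]+)=("[^"]*"|[^,]+)', body)
    return {key: value.strip('"') for key, value in fields}

rec = parse_header_line(
    '##FILTER=<ID=MSQ_20,Description='
    '"Variant without read-depth support with MSQ < 20">')
assert rec['Description'] == 'Variant without read-depth support with MSQ < 20'
```

The key design choice is anchoring on the outermost brackets and only splitting on commas outside quotes, instead of a lazy `<...>` match that stops at the first `>`.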
Is there a reason this 'GT' needs to be added when len(var_list) < 9? It makes things weird with 8-column VCFs.
https://github.com/hall-lab/svtools/blob/master/svtools/vcf/variant.py#L27-L34
Running vcfpaste with a vcf-list that includes the master VCF caused the error below.
(jim-2.7.9) /gscmnt/gc2801/analytics/jeldred/2016_03_31_test_callset $ cat paste.cn.9241244.log
Traceback (most recent call last):
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/vcfpaste.py", line 126, in
sys.exit(args.entry_point(args))
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/vcfpaste.py", line 120, in run_from_args
paster.execute()
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/vcfpaste.py", line 17, in execute
self.write_variants(output_handle)
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/vcfpaste.py", line 87, in write_variants
format = line_v[8]
IndexError: list index out of range
Sender: LSF System [email protected]
Subject: Job 9241244: <paste.cn> in cluster Done
Job <paste.cn> was submitted from host <linus43.gsc.wustl.edu> by user in cluster .
Job was executed on host(s) <blade8-3-16.gsc.wustl.edu>, in queue , as user in cluster .
</gscuser/jeldred> was used as the home directory.
</gscmnt/gc2801/analytics/jeldred/2016_03_31_test_callset> was used as the working directory.
Started at Fri Apr 1 14:42:10 2016
Results reported on Fri Apr 1 14:42:12 2016
Your job looked like:
Successfully completed.
Resource usage summary:
CPU time : 0.13 sec.
Total Requested Memory : 8000.00 MB
Delta Memory : -
(Delta: the difference between Total Requested Memory and Max Memory.)
The output (if any) is above this job summary.
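The IndexError above comes from indexing column 9 (FORMAT) on a line that doesn't have one. A defensive sketch, with an illustrative error message:

```python
# Defensive sketch for the failure above: a line with fewer than 9 columns
# (e.g. the master VCF accidentally included in the paste list) should
# produce a clear error, not an IndexError. Names are illustrative.

def format_column(line):
    """Return the FORMAT column (column 9) of a tab-delimited VCF line."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) < 9:
        raise ValueError(
            'expected at least 9 columns (FORMAT is column 9), got {0}; '
            'is the master VCF included in the paste list?'.format(len(fields)))
    return fields[8]

line = '\t'.join(
    ['chr1', '100', '.', 'N', '<DEL>', '0', '.', 'SVTYPE=DEL', 'GT', '0/1'])
assert format_column(line) == 'GT'
```

Note also that LSF reported "Successfully completed" despite the traceback, so the exit status should be checked as well.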
Currently, all VCFs need to be unzipped and this shouldn't be necessary.
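A common gzip-transparent open, sketched as one possible fix (Python 3 idiom; not the project's current code):

```python
# Sketch of a gzip-transparent open so compressed VCFs don't need to be
# unzipped first; not the project's current code.
import gzip

def open_maybe_gzipped(path):
    """Open plain or gzip/bgzip-compressed files uniformly for text reading."""
    with open(path, 'rb') as handle:
        magic = handle.read(2)
    if magic == b'\x1f\x8b':  # gzip magic bytes (bgzip files share them)
        return gzip.open(path, 'rt')
    return open(path, 'r')
```

Sniffing the magic bytes rather than trusting a .gz extension also handles files that were compressed without being renamed.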
A few things to address:
See also #50 for a discussion on release strategies.
Let's discuss.
The goal is to make it as simple as possible for users to install (without root). The following two user stories seem realistic:
Proposal:
This is a bottleneck step and currently runs in ~45 minutes per genome
Hi there,
Lumpy outputs VCF files with four different SV types: BND, DEL, INV, and DUP. Something I have asked here:
For BNDs:
I understand that if the two breakends are on different chromosomes, this is more likely to be a translocation, but what about breakends on the same chromosome? I want to know whether BNDs are always translocations, or whether sometimes the SV type simply cannot be determined and is therefore represented in breakend format.
I read here that a DEL can be represented in BND format, but the example shows breakpoints on two different chromosomes, so it is a bit confusing.
Thanks for answering me.
Best,
Ming
I believe this is the same as, or similar to, #2.
Opening a formal issue for discussion of the proper spec for representing exact breakpoints in BEDPE.
END - START + 1
Make vcfToBedpe lenient when fields are absent from the header.
See changes in first commit of #6.
We lose the reference allele information, and we also lose information in and around symbolic alleles. For example, <DUP:TANDEM> gets converted to simply <DUP>. If this is the correct behavior, then symbolic allele lines need to be added to/scrubbed from the headers appropriately.
Currently these are being converted to a QUAL of 0. This was the behavior in some of the codebase (vcfpaste for example) before, but not in all.
Should try to address issues raised in http://www.software.ac.uk/online-sustainability-evaluation as much as possible. Should provide current description, links to install and demo information, developer info, citation, mechanisms for user support.
When running lsort.py, a DeprecationWarning is encountered:
python /gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/lsort.py
/gscuser/jeldred/.local/lib/python2.6/site-packages/svtools-0.1.2-py2.6.egg/svtools/l_bp.py:2: DeprecationWarning: the sets module is deprecated
from sets import Set
According to Dave, the fix will be the same for all submodules: move to the built-in set class instead of sets.Set.
$ git pull
Already up-to-date.
$ python setup.py build
Traceback (most recent call last):
File "setup.py", line 19, in
long_description=open('README.txt').read()
IOError: [Errno 2] No such file or directory: 'README.txt'
$ touch README.txt
$ python setup.py build
running build
running build_scripts
error: file '**************************/svtools/bin/vcftobedpe' does not exist
$ find . -name vcftobedpe
./tests/test_data/vcftobedpe
$ grep vcftobedpe setup.py
scripts=["bin/vcftobedpe","bin/varlookup","bin/svtools","bin/vcfsort",
$ find . -name varlookup
./tests/test_data/varlookup
$ find . -name svtools
./svtools
$ find . -name vcfsort
./svtools/bin/vcfsort
./tests/test_data/vcfsort
The install process as well as workflow and command lines have evolved significantly. A newer demo and/or tutorial needs to be developed.
Single-character arguments to some tools have not been obvious to me, leading me to dig into the code to determine their intent.
Examples include
-m and -f to vcfpaste.py
I have found this most confusing when the argument takes a file path; I have specifically wondered, for instance, whether I am providing an input or an output path.
This suggests we may want to take a pass at the help text and names for these arguments.
Rather than proposing a specific harmonization strategy, I recognize that changing this (or deciding not to change it) might take some thought.
I am creating this issue so we can consider it.
We're doing a lot of unnecessary parsing in the current version, and run time, even with pure Python, should be much better.
For common variants (>10 samples with non-reference allele), a linear regression across all samples for each variant works well to reclassify the variant. For rare variants, we may need to use an outside training set to train a classifier. Even for large cohorts, both approaches may be needed.
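A toy sketch of the common-variant idea, using ordinary least squares on illustrative features (a real reclassifier would use the cohort's actual copy-number and genotype data; the feature choice here is an assumption):

```python
# Toy sketch of the common-variant regression idea: across samples, regress
# copy number against non-reference allele balance and inspect the slope.
# Features and interpretation are illustrative assumptions, not the method.
import numpy as np

def depth_vs_genotype_slope(copy_number, allele_balance):
    """Least-squares slope of copy number against non-reference allele balance."""
    x = np.asarray(allele_balance, dtype=float)
    y = np.asarray(copy_number, dtype=float)
    design = np.column_stack([x, np.ones_like(x)])  # slope + intercept
    slope, _intercept = np.linalg.lstsq(design, y, rcond=None)[0]
    return slope

# For a true deletion, copy number should fall as allele balance rises
# (slope near -2 per unit allele balance under a simple diploid model).
slope = depth_vs_genotype_slope([2.0, 1.5, 1.0, 0.0], [0.0, 0.25, 0.5, 1.0])
```

For rare variants this per-variant regression has too few informative samples, which is why an externally trained classifier would be needed there.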
This option wasn't functional, but we do want it to add in the confidence intervals as they aren't output by certain tools.
Lines in the VCF header like ##assembly are getting scrubbed. Possibly additional lines would be scrubbed as well. Is this the proper behavior or does it need to be fixed? Seems improper to me.
These are injected into our headers if no reference is present in the original file.
I'm wrapping your tool for the common workflow language:
http://common-workflow-language.github.io/draft-3/
Unfortunately I cannot redistribute it as part of a bigger pipeline unless it has a license... could you choose one that fits (y)our needs?
Thanks!
Hi, Thanks for making this useful tool. I am using vcfToBedpe to convert vcf from lumpy.
Columns 9 and 10 of the resulting BEDPE files carry strand information ("+"/"-").
What exactly do those strands tell us?
For translocations (BND), I found there are four different combinations of strands.
For deletions, it is always + -.
For inversions, it is always + +.
For duplications, it is always - +.
Thanks very much!
Ming
Current sorting approach won't scale - need to limit the amount stored in memory, perhaps by chromosome and variant type.
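One scalable pattern is an external merge sort: sort bounded chunks (spilled to temporary files in real code), then k-way merge with heapq.merge so memory stays bounded. A minimal in-memory sketch, with illustrative names:

```python
# Sketch of external merge sort for bounded-memory sorting; illustrative
# only. In real code each sorted chunk would be spilled to a temporary
# file and re-read as an iterator during the merge.
import heapq
import itertools

def sorted_chunks(records, chunk_size):
    """Yield sorted lists of at most chunk_size records."""
    it = iter(records)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield sorted(chunk)

def external_sort(records, chunk_size=100000):
    """Lazily merge the sorted chunks; memory use is bounded by chunk_size."""
    return heapq.merge(*sorted_chunks(records, chunk_size))

assert list(external_sort([3, 1, 2, 5, 4], chunk_size=2)) == [1, 2, 3, 4, 5]
```

Partitioning by chromosome and variant type first, as suggested above, would shrink each chunk further and let partitions be processed independently.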
For example,
$ cat roundtrip_diff.out | cut -f1-4 | less
4a5
> ##INFO=<ID=POS,Number=1,Type=Integer,Description="Position of the variant described in this record">
81118,81119c81119,81120
< GL000193.1 16 13312_2 N
< GL000193.1 25 13311_2 N
---
> GL000193.1 17 13312_2 N
> GL000193.1 26 13311_2 N
Left file here is before conversion to BEDPE. The right file is after conversion to BEDPE and then back to VCF.
vcfpaste is not printing the string "FORMAT" from the header line beginning "#CHROM".
Fix lmerge to do what we actually want it to, and we may be able to eliminate the prune step, which is a bit clunky.