hall-lab / svtools
Tools for processing and analyzing structural variants.
License: MIT License
bedpetovcf has an off-by-one error. Calling vcftobedpe followed by bedpetovcf reduces the POS by 1. It seems like this issue is limited to BNDs.
This arose because of commit c163881, which fixed an off-by-one error in vcftobedpe.
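For reference, a sketch of the coordinate conventions the two tools must agree on (illustrative, not svtools' actual code): VCF POS is 1-based, while BEDPE starts are 0-based and half-open, so a lossless round trip has to apply the shift exactly once in each direction.

```python
# Sketch of the VCF <-> BEDPE coordinate shift; not svtools' actual code.
# VCF POS is 1-based; BEDPE start is 0-based, half-open.

def vcf_pos_to_bedpe(pos):
    """1-based VCF POS -> 0-based BEDPE (start, end) for a single base."""
    return pos - 1, pos

def bedpe_start_to_vcf(start):
    """0-based BEDPE start -> 1-based VCF POS. Must invert the above exactly."""
    return start + 1

pos = 13312
start, end = vcf_pos_to_bedpe(pos)
assert bedpe_start_to_vcf(start) == pos  # a correct round trip is lossless
```

If either direction applies the shift twice (or not at all), POS drifts by 1 on every vcftobedpe/bedpetovcf cycle, which matches the symptom described above.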
The Bedpe parser in bedpetovcf dies with an IndexError, rather than an intelligible message, if a line lacks an SVTYPE or AF field.
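A hypothetical sketch of friendlier validation (the helper names and the error message are illustrative, not the parser's actual code):

```python
# Hypothetical sketch of friendlier INFO-field validation; names and
# messages are illustrative, not svtools' actual code.

def parse_info(info_str):
    """Parse a semicolon-delimited INFO string into a dict."""
    info = {}
    for field in info_str.split(';'):
        if '=' in field:
            key, value = field.split('=', 1)
            info[key] = value
        else:
            info[field] = True  # flag fields have no value
    return info

def require_svtype(info_str, line_number):
    """Return SVTYPE, or raise a readable error instead of an IndexError."""
    info = parse_info(info_str)
    if 'SVTYPE' not in info:
        raise ValueError(
            'Line {0}: missing required SVTYPE in INFO field'.format(line_number))
    return info['SVTYPE']

assert require_svtype('SVTYPE=BND;AF=0.5', 1) == 'BND'
```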
Hello hall-lab!
I find your vcftobedpe script super useful for my purposes. I'm about to package svtools for bioconda:
https://github.com/bioconda/bioconda-recipes/tree/master/recipes
But in order to properly package it, I would need you to release a version of your package, preferably here on GitHub (e.g., a hall-lab-svtools-0.1.1.tar.gz that the package manager could download instead of just the master branch):
https://help.github.com/articles/creating-releases/
Thanks in advance! Let me know if you need assistance releasing it.
I believe name reconstruction is now incompatible with Manta VCFs.
It drops SECONDARY BND lines and converts some BNDs to INV. This is a little counterintuitive and we should update it.
Embedding the CLI and program version in the VCF header.
GATK does this with header lines similar to ##GATKCommandLine.HaplotypeCaller=<ID
bcftools adds two lines: one with its version (and htslib's), and another with the command line. Both are prefixed similarly to GATK (##bcftools_viewVersion and ##bcftools_viewCommand).
Googling it every time seems unwise.
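A minimal sketch of what svtools could emit, modeled on the bcftools convention above; the header keys shown are assumptions, not an agreed format:

```python
# Sketch of version/command header lines, modeled on the bcftools style.
# The '##svtoolsVersion'/'##svtoolsCommand' keys are assumptions, not an
# agreed svtools format.
import sys

def command_header_lines(version):
    """Return VCF meta-lines recording the tool version and invocation."""
    return [
        '##svtoolsVersion={0}'.format(version),
        '##svtoolsCommand={0}'.format(' '.join(sys.argv)),
    ]
```

These lines would be appended to the header just before the #CHROM line, one pair per subcommand invocation, so a VCF's provenance can be read back without guesswork.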
Need to review dependencies and make sure we have a valid license and license file.
@jeldred observed that when an invalid path is provided for cnvnator-multi in svtools copynumber, the resulting errors look like:
/gscmnt/gc2801/analytics/jeldred/2016_03_31_test_callset/cn/logs/MISC $ cat NA12887.cn.9358519.log
Traceback (most recent call last):
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/copynumber.py", line 103, in <module>
sys.exit(args.entry_point(args))
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/copynumber.py", line 97, in run_from_args
sv_readdepth(stream, args.sample, args.root, args.window, args.output_vcf, args.cnvnator, args.coordinates)
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/copynumber.py", line 24, in sv_readdepth
cn = run_cnvnator(cnvnator_path, root, window, coord_list)
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/copynumber.py", line 11, in run_cnvnator
p2 = Popen(cmd, stdin=p1.stdout, stdout=PIPE)
File "/gscuser/jeldred/.pyenv/versions/2.7.9/lib/python2.7/subprocess.py", line 710, in __init__
errread, errwrite)
File "/gscuser/jeldred/.pyenv/versions/2.7.9/lib/python2.7/subprocess.py", line 1335, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
cat: write error: Broken pipe
This is not an informative error that a normal person can act on. Error checking needs to be better for this subcommand (and arguably ALL subcommands should be reviewed).
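One possible pre-flight check before the Popen call, sketched here; the message wording is illustrative:

```python
# Sketch of a pre-flight check for the cnvnator-multi path; the message
# wording is illustrative, not svtools' actual code.
import os
import sys

def check_executable(path, name):
    """Exit with a readable message if path is missing or not executable."""
    if not os.path.isfile(path):
        sys.exit('{0} not found at {1}; check the path you provided'.format(
            name, path))
    if not os.access(path, os.X_OK):
        sys.exit('{0} at {1} is not executable'.format(name, path))
```

Called once at subcommand startup, this turns the OSError/Broken pipe cascade above into a single actionable line on stderr.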
Currently this is only used by prune.
Previously I could use a wildcard "*vcf", but it no longer likes that:
$ svtools vcfpaste -f *ss
usage: svtools [-h] subcommand ...
svtools: error: unrecognized arguments: 1.vcf 2.vcf 3.vcf
So then I tried listing them explicitly and got the same error:
$ svtools vcfpaste -f 1.vcf 2.vcf 3.vcf
usage: svtools [-h] subcommand ...
svtools: error: unrecognized arguments: 1.vcf 2.vcf 3.vcf
I noticed this bit in the usage page: "Line-delimited list of VCF files to paste".
How do you do line-delimitation?
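Assuming "line-delimited" means a file containing one VCF path per line, passed to -f (an assumption based on the help text quoted above), something like this should work:

```shell
# Build a list file with one VCF path per line (the paths are illustrative).
printf '%s\n' 1.vcf 2.vcf 3.vcf > vcf_list.txt
```

Then run `svtools vcfpaste -f vcf_list.txt`, i.e. -f takes the list file itself rather than the VCFs, which would explain why a shell glob expanding to multiple arguments is rejected.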
Jim found that we crash if lines exist where AF=. in the INFO field.
Should we make lsort take an input file with a list of VCFs to merge (like vcfpaste does now)? Currently, at least in my workflow, I'm using a bash loop to write a sort command that I then submit to the queue, which feels like kind of a kludge.
This line is problematic: if the Description contains a "<" or ">", parsing will break.
Example:
##FILTER=<ID=MSQ_20,Description="Variant without read-depth support with MSQ < 20">
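One way to parse these lines that tolerates angle brackets inside the quoted Description, sketched with illustrative helper names:

```python
# Sketch of header parsing that tolerates '<' or '>' inside a quoted
# Description; the helper name is illustrative, not svtools' actual code.
import re

def parse_header_line(line):
    """Parse '##FILTER=<ID=...,Description="...">' into a dict."""
    # Take everything between the FIRST '<' and the LAST '>' so angle
    # brackets inside the quoted Description are preserved.
    body = line[line.index('<') + 1:line.rindex('>')]
    # Match key=value pairs, treating a double-quoted value as one token.
    fields = re.findall(r'([A-Za-z0-9_]+)=("[^"]*"|[^,]+)', body)
    return {key: value.strip('"') for key, value in fields}

rec = parse_header_line(
    '##FILTER=<ID=MSQ_20,Description='
    '"Variant without read-depth support with MSQ < 20">')
assert rec['Description'] == 'Variant without read-depth support with MSQ < 20'
```

The key design choice is anchoring on the outermost brackets and only splitting on commas outside quotes, instead of a lazy `<...>` match that stops at the first `>`.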
Is there a reason this 'GT' needs to be added when len(var_list) < 9? It makes things weird with 8-column VCFs.
https://github.com/hall-lab/svtools/blob/master/svtools/vcf/variant.py#L27-L34
Running vcfpaste with a vcf-list that includes the master VCF caused the error below.
(jim-2.7.9) /gscmnt/gc2801/analytics/jeldred/2016_03_31_test_callset $ cat paste.cn.9241244.log
Traceback (most recent call last):
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/vcfpaste.py", line 126, in
sys.exit(args.entry_point(args))
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/vcfpaste.py", line 120, in run_from_args
paster.execute()
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/vcfpaste.py", line 17, in execute
self.write_variants(output_handle)
File "/gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/vcfpaste.py", line 87, in write_variants
format = line_v[8]
IndexError: list index out of range
Sender: LSF System [email protected]
Subject: Job 9241244: <paste.cn> in cluster Done
Job <paste.cn> was submitted from host <linus43.gsc.wustl.edu> by user in cluster .
Job was executed on host(s) <blade8-3-16.gsc.wustl.edu>, in queue , as user in cluster .
</gscuser/jeldred> was used as the home directory.
</gscmnt/gc2801/analytics/jeldred/2016_03_31_test_callset> was used as the working directory.
Started at Fri Apr 1 14:42:10 2016
Results reported on Fri Apr 1 14:42:12 2016
Your job looked like:
Successfully completed.
Resource usage summary:
CPU time : 0.13 sec.
Total Requested Memory : 8000.00 MB
Delta Memory : -
(Delta: the difference between Total Requested Memory and Max Memory.)
The output (if any) is above this job summary.
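The IndexError above comes from indexing column 9 (FORMAT) on a line that doesn't have one. A defensive sketch, with an illustrative error message:

```python
# Defensive sketch for the failure above: a line with fewer than 9 columns
# (e.g. the master VCF accidentally included in the paste list) should
# produce a clear error, not an IndexError. Names are illustrative.

def format_column(line):
    """Return the FORMAT column (column 9) of a tab-delimited VCF line."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) < 9:
        raise ValueError(
            'expected at least 9 columns (FORMAT is column 9), got {0}; '
            'is the master VCF included in the paste list?'.format(len(fields)))
    return fields[8]

line = '\t'.join(
    ['chr1', '100', '.', 'N', '<DEL>', '0', '.', 'SVTYPE=DEL', 'GT', '0/1'])
assert format_column(line) == 'GT'
```

Note also that LSF reported "Successfully completed" despite the traceback, so the exit status should be checked as well.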
Currently, all VCFs need to be unzipped and this shouldn't be necessary.
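A common gzip-transparent open, sketched as one possible fix (Python 3 idiom; not the project's current code):

```python
# Sketch of a gzip-transparent open so compressed VCFs don't need to be
# unzipped first; not the project's current code.
import gzip

def open_maybe_gzipped(path):
    """Open plain or gzip/bgzip-compressed files uniformly for text reading."""
    with open(path, 'rb') as handle:
        magic = handle.read(2)
    if magic == b'\x1f\x8b':  # gzip magic bytes (bgzip files share them)
        return gzip.open(path, 'rt')
    return open(path, 'r')
```

Sniffing the magic bytes rather than trusting a .gz extension also handles files that were compressed without being renamed.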
A few things to address:
See also #50 for a discussion on release strategies.
Let's discuss.
The goal is to make it as simple as possible for users to install (without root). The following two user stories seem realistic:
Proposal:
This is a bottleneck step and currently runs in ~45 minutes per genome
Hi there,
Lumpy outputs VCF files with four different SV types: BND, DEL, INV, and DUP. Something I have asked here:
For BNDs:
I understand that if the two breakends are on different chromosomes, this is more likely to be a translocation, but what about breakends on the same chromosome? I want to know whether BNDs are always translocations, or whether sometimes the SV type simply cannot be determined and is therefore represented in breakend format.
I read here that a DEL can be represented in BND format, but the example shows breakpoints on two different chromosomes, so it is a bit confusing.
Thanks for answering me.
Best,
Ming
I believe this is the same as, or similar to, #2.
Opening a formal issue for discussion of the proper spec for representing exact breakpoints in BEDPE.
END - START + 1
Make vcfToBedpe lenient when fields are absent from the header.
See changes in first commit of #6.
We lose the reference allele information, and we also lose information in and around symbolic alleles. For example, <DUP:TANDEM> gets converted to simply <DUP>. If this is the correct behavior, then symbolic allele lines need to be added to/scrubbed from the headers appropriately.
Currently these are being converted to a QUAL of 0. This was the behavior in some of the codebase (vcfpaste for example) before, but not in all.
Should try to address issues raised in http://www.software.ac.uk/online-sustainability-evaluation as much as possible. Should provide current description, links to install and demo information, developer info, citation, mechanisms for user support.
When running lsort.py, a DeprecationWarning is encountered:
python /gscmnt/gc2802/halllab/sv_aggregate/repo/svtools/svtools/lsort.py
/gscuser/jeldred/.local/lib/python2.6/site-packages/svtools-0.1.2-py2.6.egg/svtools/l_bp.py:2: DeprecationWarning: the sets module is deprecated
from sets import Set
According to Dave, the fix will be the same for all submodules: move to the built-in set class instead of sets.Set.
$ git pull
Already up-to-date.
$ python setup.py build
Traceback (most recent call last):
File "setup.py", line 19, in
long_description=open('README.txt').read()
IOError: [Errno 2] No such file or directory: 'README.txt'
$ touch README.txt
$ python setup.py build
running build
running build_scripts
error: file '**************************/svtools/bin/vcftobedpe' does not exist
$ find . -name vcftobedpe
./tests/test_data/vcftobedpe
$ grep vcftobedpe setup.py
scripts=["bin/vcftobedpe","bin/varlookup","bin/svtools","bin/vcfsort",
$ find . -name varlookup
./tests/test_data/varlookup
$ find . -name svtools
./svtools
$ find . -name vcfsort
./svtools/bin/vcfsort
./tests/test_data/vcfsort
The install process as well as workflow and command lines have evolved significantly. A newer demo and/or tutorial needs to be developed.
Single-character arguments to some tools have not been obvious to me, leading me to dig into the code to determine their intent.
Examples include
-m and -f to vcfpaste.py
I have found this most confusing when the argument takes a file path; I have specifically wondered, for instance, whether I am providing an input or an output path.
This suggests we may want to take a pass at the help text and names for these arguments.
Rather than proposing a specific harmonization strategy, I recognize that changing this (or deciding not to change it) might take some thought.
I am creating this issue so we can consider it.
We're doing a lot of unnecessary parsing in the current version, and run time, even with pure Python, should be much better.
For common variants (>10 samples with non-reference allele), a linear regression across all samples for each variant works well to reclassify the variant. For rare variants, we may need to use an outside training set to train a classifier. Even for large cohorts, both approaches may be needed.
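A toy sketch of the common-variant idea, using ordinary least squares on illustrative features (a real reclassifier would use the cohort's actual copy-number and genotype data; the feature choice here is an assumption):

```python
# Toy sketch of the common-variant regression idea: across samples, regress
# copy number against non-reference allele balance and inspect the slope.
# Features and interpretation are illustrative assumptions, not the method.
import numpy as np

def depth_vs_genotype_slope(copy_number, allele_balance):
    """Least-squares slope of copy number against non-reference allele balance."""
    x = np.asarray(allele_balance, dtype=float)
    y = np.asarray(copy_number, dtype=float)
    design = np.column_stack([x, np.ones_like(x)])  # slope + intercept
    slope, _intercept = np.linalg.lstsq(design, y, rcond=None)[0]
    return slope

# For a true deletion, copy number should fall as allele balance rises
# (slope near -2 per unit allele balance under a simple diploid model).
slope = depth_vs_genotype_slope([2.0, 1.5, 1.0, 0.0], [0.0, 0.25, 0.5, 1.0])
```

For rare variants this per-variant regression has too few informative samples, which is why an externally trained classifier would be needed there.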
This option wasn't functional, but we do want it to add in the confidence intervals as they aren't output by certain tools.
Lines in the VCF header like ##assembly are getting scrubbed. Possibly additional lines would be scrubbed as well. Is this the proper behavior or does it need to be fixed? Seems improper to me.
These are injected into our headers if no reference is present in the original file.
I'm wrapping your tool for the common workflow language:
http://common-workflow-language.github.io/draft-3/
Unfortunately I cannot redistribute it as part of a bigger pipeline unless it has a license... could you choose one that fits (y)our needs?
Thanks!
Hi, Thanks for making this useful tool. I am using vcfToBedpe to convert vcf from lumpy.
Columns 9 and 10 of the resulting BEDPE files carry strand information ("+"/"-").
What exactly do those strands tell us?
For translocations (BND), I found there are four different combinations of strands.
For deletions, it is always + -.
For inversions, it is always + +.
For duplications, it is always - +.
Thanks very much!
Ming
Current sorting approach won't scale - need to limit the amount stored in memory, perhaps by chromosome and variant type.
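One scalable pattern is an external merge sort: sort bounded chunks (spilled to temporary files in real code), then k-way merge with heapq.merge so memory stays bounded. A minimal in-memory sketch, with illustrative names:

```python
# Sketch of external merge sort for bounded-memory sorting; illustrative
# only. In real code each sorted chunk would be spilled to a temporary
# file and re-read as an iterator during the merge.
import heapq
import itertools

def sorted_chunks(records, chunk_size):
    """Yield sorted lists of at most chunk_size records."""
    it = iter(records)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield sorted(chunk)

def external_sort(records, chunk_size=100000):
    """Lazily merge the sorted chunks; memory use is bounded by chunk_size."""
    return heapq.merge(*sorted_chunks(records, chunk_size))

assert list(external_sort([3, 1, 2, 5, 4], chunk_size=2)) == [1, 2, 3, 4, 5]
```

Partitioning by chromosome and variant type first, as suggested above, would shrink each chunk further and let partitions be processed independently.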
For example,
$ cat roundtrip_diff.out | cut -f1-4 | less
4a5
> ##INFO=<ID=POS,Number=1,Type=Integer,Description="Position of the variant described in this record">
81118,81119c81119,81120
< GL000193.1 16 13312_2 N
< GL000193.1 25 13311_2 N
---
> GL000193.1 17 13312_2 N
> GL000193.1 26 13311_2 N
Left file here is before conversion to BEDPE. The right file is after conversion to BEDPE and then back to VCF.
vcfpaste is not printing the string "FORMAT" from the header line beginning "#CHROM".
Fix lmerge to do what we actually want it to, and we may be able to eliminate the prune step, which is a bit clunky.