nhansen / svanalyzer Goto Github PK

Tools for the analysis of structural variation in genomes

Home Page: http://svanalyzer.readthedocs.io/

License: Other

Perl 99.86% Shell 0.14%

svanalyzer's Introduction

I'm a staff scientist at the National Institutes of Health working to learn more about variation in the human genome. Pronouns: She/her

svanalyzer's People

Contributors

Stargazers

Watchers

Forkers

inambioinfo edawson nate-d-olson pythseq wangpanqiao cjain7 zzygyx9119 jiaozexin chen318liang kubranarci pxchen110 wook2014 genostack lichengxx

svanalyzer's Issues

NA vs NaN

Not sure if this is an issue with your code or with pyvcf. I'm reading your output files in with pyvcf, and it's throwing an error because you have INFO entries set to be "NA" in fields that are numerically typed. Evidently, pyvcf does not accept that as a placeholder for non-values in either int or float fields (it states that it cannot convert "NA" to a float), but it does accept "NaN" when I sed replace all the NA to NaN. The VCF specification does describe how to implement NaN values in BCF format, but it is, to my surprise, not actually mentioned in regards to VCF. I've looked around a few other packages for handling VCF files and see a wild mixture of NaN and NAN but I only ever see NA specifically in fields that are string type.

How are the SVs merged?

This tool looks very useful. I was wondering how the SVs were merged once they are clustered. Or how the representative/unique SV was picked. I tried to read the code but I'm not very familiar with Perl.

It seems like the representative SV for a cluster is randomly selected from the largest subcluster of exactly matched variants. And if there are multiple such subclusters of the same maximal size, one is randomly selected?

That would make sense as it favors breakpoints that are supported by multiple calls. I just wanted some confirmation.

If that's the case, it could be worth adding this to the documentation. I couldn't find it in the recent GIAB preprint either and a few people might be interested.

Don't hardcode path to `SVcomp.pl` in `SVfindmatches.pl`

https://github.com/nhansen/SVanalyzer/blob/master/scripts/SVfindmatches.pl#L81

<DEL> in SVmerge

Hi nhansen,

I recently tried SVmerge command to merge my SV vcfs. I found that in deletion, SVmerge can't merge variants with DEL tags correctly (even if they all have END INFO ), for example:

variants called from Delly :
14 65360386 DEL00021614 A <DEL> 1800 PASS PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.8.7;END=65363541;PE=18;MAPQ=60;CT=3to5;CIPOS=-12,12;CIEND=-12,12;SRMAPQ=60;INSLEN=0;HOMLEN=11;SR=12;SRQ=0.989247;CONSENSUS=CACTGTTATAAGAGATTTTATTCCTCTTAAAATAAATTCTTATCTAAAACTGTAGGATTGGCCAGAAGTGTTGGCATGAGAAAGAAGGGTTCATTAAAGAATTTTCATAGTGGGAATCTGGATAAAAATCTCCTGTGAAAATAGCAGAGGCCTCCAAAAGAAGGCGTGAAGAGGAAGATGTATGTG;CE=1.90955 GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV 0/1:-51.7762,0,-12.3817:99:PASS:202:105:207:1:16:20:7:20

variants called from manta:
14 65360386 MantaDEL:0:25633:25636:0:0:0 A <DEL> 825 PASS END=65363540;SVTYPE=DEL;SVLEN=-3154;CIPOS=0,10;CIEND=0,10;HOMLEN=10;HOMSEQ=AAGAATTTTC GT:FT:GQ:PL:PR:SR 0/1:PASS:99:875,0,330:19,20:18,14

variants called from dragen:
14 65360386 MantaDEL:3966:7:8:0:0:0 A <DEL> 953 PASS END=65363540;SVTYPE=DEL;SVLEN=-3154;CIPOS=0,10;CIEND=0,10;HOMLEN=10;HOMSEQ=AAGAATTTTC GT:FT:GQ:PL:PR:SR 0/1:PASS:99:999,0,359:22,18:17,14

These three variants should be merged by SVmerge, however, they are not merged and output as:

14 65360386 DEL00021614 A <DEL> 1800 PASS ClusterIDs=DEL00021614;NumClusterSVs=1;ExactMatchIDs=DEL00021614;NumExactMatchSVs=1;ClusterMaxShiftDist=NA;ClusterMaxSizeDiff=NA;ClusterMaxEditDist=NA . .

14 65360386 dMantaDEL_3966_7_8_0_0_0 A <DEL> 953 PASS ClusterIDs=dMantaDEL_3966_7_8_0_0_0;NumClusterSVs=1;ExactMatchIDs=dMantaDEL_3966_7_8_0_0_0;NumExactMatchSVs=1;ClusterMaxShiftDist=NA;ClusterMaxSizeDiff=NA;ClusterMaxEditDist=NA . .

14 65360386 mMantaDEL_0_25633_25636_0_0_0 A <DEL> 825 PASS ClusterIDs=mMantaDEL_0_25633_25636_0_0_0;NumClusterSVs=1;ExactMatchIDs=mMantaDEL_0_25633_25636_0_0_0;NumExactMatchSVs=1;ClusterMaxShiftDist=NA;ClusterMaxSizeDiff=NA;ClusterMaxEditDist=NA . .

I would like to ask: How can I merge this kind of variants(without real sequence) correctly by SVmerge? Is there any recommended parameter for this situation ?

Thank you,
Pinxuan Chen

Short variants + SVs

Hi there,

I'm looking for a good method to combine calls produced by short variant callers (<50bp) with calls produced by SV callers (>50bp). Most structural variant merging tools are designed to merge SVs from different SV callers, and SNPs/short-mid indels are not supported.

Does SVanalyzer process SNPs and indels alongside SVs when merging variants from the same sample called by different variant callers? Or are short variants not supported by SVanalyzer?

-Joe

A question about SVwiden

I want to use SVwiden to expand my VCF file, and I have ~9000 SVs, but the result file widened.vcf only got ~1000 SVs. The log file reported 2020/11/17 17:09:12 2020/11/17 17:09:12 Ref derived from widened ref has different allele from ref !
Are these Allele ignored, so I only got ~1000 SVs ?

Why Some REPTYPE=CONTRAC Records Have Longer ALT Length than REF Length

Hi @nhansen,

I am going through a SVWIDEN output of HG002 VCF. I have noticed that there are some variants that were annotated REPTYPE=CONTRAC has longer ALT than REF. I expect CONTRAC records should have shorter ALT length than REF length because CONTRAC is a type of DEL. I have attached an instance below.

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	syndip
chr1	1108885	.	G	GTCCACCACAGCCACCATGTCTCGGCAGCACCGTCCACCACAGCCACCATGTCTCGGCAGCACCGTTCACCACAGCCACCATGTCTCAGCACCA	30	.	REPTYPE=CONTRAC;BREAKSIMLENGTH=7699;REFWIDENED=chr1:1108048-1115663	GT:AD	0|1:1,1

Do you have any explanations for this?

./Build test failure

Thanks nhansen..
For creating such a useful tool.

I am getting this error during installation in the building step.
when i enter this command..
./Build test

i am receiving this message.. please check if this is an error. if it can be skipped or fixed.

t/1_load.t .. ok
t/2_call.t .. 1/16

Failed test 'blib/script/SVwiden vartype'

at t/2_call.t line 47.

''

doesn't match '(?^:REPTYPE=DUP)'

Failed test 'blib/script/SVwiden widensmall'

at t/2_call.t line 49.

''

doesn't match '(?^:REFWIDENED=1:842056-842090)'

t/2_call.t .. 14/16 # Looks like you failed 2 tests of 16.
t/2_call.t .. Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/16 subtests

Test Summary Report

t/2_call.t (Wstat: 512 Tests: 16 Failed: 2)
Failed tests: 5-6
Non-zero exit status: 2
Files=2, Tests=38, 6 wallclock secs ( 0.04 usr 0.01 sys + 4.84 cusr 0.70 csys = 5.59 CPU)
Result: FAIL
Failed 1/2 test programs. 2/38 subtests failed.

Thanks in advance.

delta2sam.pl

Hi Nancy,

My name is Yutang and I am a PhD student at ETH, Zurich. I saw your comment here mummer4/mummer#130, and I have tried your delta2sam.pl script. It worked like a charm but when I try to convert sam to bam, an error came up saying [W::sam_read1_sam] Parse error at line 4. I guess there might be something wrong in the sam file, what do you think?

Attached is part of my sam file and could you please have a look?

Thank you very much for your help and I am looking forward to your reply.

Best wishes,
Yutang
test.txt

Explanation of SVbenchmark's --normshift option

Hi @nhansen,

I can't tell from the documentation exactly what the --normshift option does. It seems that --maxdist ensures two variants are near one another, --normsizediff ensures they're approximately the same length, and --normdist ensures that they contain approximately the same edits to the reference sequence. Is --normshift similar to --maxdist, and ensures that the location of one variant is not too far from the other, but this time relative to the SV's total length?

Thanks in advance for the clarification,
Tim

how to ilter SV with 1000bp？

Hi there,
I have read this paper (A robust benchmark for detection of germline large deletions and insertions), but some questions confused me as follows. When I use SVmerge to merge different VCF files, how should I filter out another SV within 1000bp? Maybe Bcftools could do this ? or you haven't provided a script, but you've just provided an idea?
Thank you in advance for helping me with my confusion.
best wishes
Flooooooooooooower

Conda installation command in README.md not working

I tried to install SVanalyzer using the conda install svanalyzer command as specified in the README.md file. However, I received the following error message:

PackagesNotFoundError: The following packages are not available from current channels:
- svanalyzer

After some research, I found that the correct command to install SVanalyzer using conda is conda install -c bioconda svanalyzer.

I suggest updating the README.md file to reflect this change in the installation command, as it can save others from experiencing the same issue.

Thank you for maintaining this svanalyzer.

issue in vcf format for SVrefine output

Hello,
I use this tool to call SV from assembly-hap and the commands are listed as below:
`

mummer-4.0.0rc1/nucmer -p NA24385_mummer.h1 -t 8 --maxmatch -l 100 -c 500 GRCh38_chromosomes.fa 20200628_HHU_assembly-results_CCS_v12/v12_NA24385_hpg_pbsq2-ccs_1000-pereg.h1-un.racon-p2.fasta
gzip NA24385_mummer.h1.delta
SVrefine --delta NA24385_mummer.h1.delta.gz –ref_fasta GRCh38_chromosomes.fa –query_fasta 20200628_HHU_assembly-results_CCS_v12/v12_NA24385_hpg_pbsq2-ccs_1000-pereg.h1-un.racon-p2.fasta –outvcf SVanalyzerhp1/ –refname hg38 –samplename NA24385.h1 –maxsize 100000
`

All was well done, but there are some formatting issues in the formed vcf file which I can't figure out : why are all SVLEN=4 or 0，and GT =1 or 0/1 , what does this meaning? Under what circumstances would this be the case? Could you give me some advice about this, so I can make it for subsequent evaluation?

Thanks!

Additional SV types

While I'm thoroughly enjoying the robustness of your tool's clustering (thank you!), I thought I might make a feature request. Sniffles (https://github.com/fritzsedlazeck/Sniffles) calls SVs from long reads, and outputs a number of complex SV types that are not currently handled by svanalyzer such as BND, DEL/INV, DUP, INV, INV/INVDUP, INVDUP. While some of them may be harder to handle, would it be simple to at least include INV svtype handling, since the event is defined by a (START,END) point similar to DELs?

A simple question

Hi there,

For testing, I made two vcf files, one file has record like this

chr1 108190704 DEL00000334 T <DEL> . PASS IMPRECISE;SVTYPE=DEL;SVLEN=-3926;SVMETHOD=EMBL.DELLYv0.8.1;CHR2=chr1;END=108194630 ...

and another file has only one record like this:

chr1 108190708 MantaDEL:180899:0:1:0:0:0 A <DEL> 999 PASS END=108194629;SVTYPE=DEL;SVLEN=-3921;

Supposedly, they should be merged, since breakpoints are very close to each other (left side 4bp, right side 1bp). But the output file still has two records. Is this expected?

I ran the command as this:

SVmerge --fof ${filename} --ref $HG38

Thanks!

George

SVmerge batches

Hi,

Thanks for the useful tool.

I am trying to merge ~500 VCFs of SVs identified by SVrefine using SVmerge, but calculating the distance file is taking some time (over a week so far). Is it possible to split the process into smaller batches (e.g. run SVmerge on 50 samples at a time) and then run SVmerge again on the 10 previously merged VCFs to save time?

Thanks.

Can't locate GTB/File.pm

In Ubuntu22.04,when using svanalyzer,it reported:
Can't locate GTB/File.pm in @INC (you may need to install the GTB::File module)
I tried cpan GTB::File and cpanm GTB::File.But ! Couldn't find module or a distribution GTB::File
how can i get it??
thank u

SV in telomere

According to the VCF specification v4.2 it is possible to record an SV in telomere, and the POS value for such an event is specified by either setting it equal to 0 or to N + 1 where N is the length of the reference sequence for that contig. Currently, your code skips entries where the SV is in the telomere.

SVmerge option --skip_clusters

Using --skip_clusters causes SVmerge to fail with "unrecognized option" message.

Problems with the output file of the delta2sam.pl script

Hi Nancy,
I am Li cheng and a M.A. students from HZAU, Wuhan.
I used your delta2sam.pl script, but the fifth column of my sam file, i.e. the MAPQ value, is all '0'.
After I converted sam to bam file, the result file for extracting unmatched upper sequences through the resulting bam file was empty.
But I used minimap2 to compare the two genomes and the resulting sam files were fine.
I would like to ask if this situation is a problem with my delta file from Nucmer or a problem with the way the delta2sam.pl script is used? And I have used the delta2sam.pl script without any errors.
Parts of my delta file, sam file and minimap2 sam file are in the attachment.
I look forward to and appreciate your reply.

Best Wishes,
Li cheng
Part_of_delta_file.txt
Part_of_delta2sam,pl_outfile.txt
Part_of_minimap2_out_sam.txt