
svdb's Introduction

SVDB

SVDB is a toolkit for constructing and querying structural variant databases. The databases are constructed using the output vcf files from structural variant callers such as TIDDIT, Manta, Fermikit or Delly. SVDB may also be used to merge SV vcf files from multiple callers or individuals.

Supported public databases

SVDB query supports public databases such as the thousand genomes SV map and gnomAD SV, as well as most multisample SV vcf files.

The thousand genomes SV database:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/

The swegen SVDB:

https://swefreq.nbis.se/

The GNOMAD SV database:

https://storage.googleapis.com/gnomad-public/papers/2019-sv/gnomad_v2_sv.sites.vcf.gz

External databases are queried like this:

svdb --query \
     --query_vcf /home/jesper/vcf/6_pair_limit/P2109_120.clean.dedup.recal_FindSV.vcf \
     --out_occ GNOMAD_AC \
     --out_frq GNOMAD_AF \
     --in_occ AN \
     --in_frq AF \
     --db /home/jesper/Downloads/gnomad_sv/gnomad_v2_sv.sites.vcf

Here AN and AF are the allele count and allele frequency tags of the database; AF is a float and AN is an integer. The corresponding values will be added to the annotated output vcf under the names GNOMAD_AC and GNOMAD_AF.
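The same pattern applies to the other public databases listed above. Below is a minimal sketch for annotating against the SweGen SVDB; the file name and the in-database tags (OCC and FRQ) are assumptions and should be checked against the header of the database vcf you actually downloaded:

svdb --query \
     --query_vcf patient1.vcf \
     --db swegen_svdb.vcf \
     --in_occ OCC \
     --in_frq FRQ \
     --out_occ SWEGEN_OCC \
     --out_frq SWEGEN_FRQ > patient1_swegen_annotated.vcf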

Install:

Dependencies: SVDB has been tested on Python 2.7.11 and Python 3.6, and requires numpy. SVDB is installed using the following commands:

git clone https://github.com/J35P312/SVDB.git
cd SVDB
pip install -e .

SVDB is also available as a Singularity image:

singularity pull shub://J35P312/SVDB

Modules:

SVDB consists of modules that are used to build, query, export, and analyse structural variant databases. These are the modules:

Build

This module is used to construct structural variant databases from vcf files. The database may then be queried to compute the frequency of structural variants, or exported into a vcf file. These are the commands used to construct a structural variation database:

print a help message
    svdb  --build --help  
Construct a database, from a set of vcf files:
    svdb --build --vcf sample1.vcf sample2.vcf sample3.vcf
construct a database from vcf files stored in a folder
    svdb --build --folder SV_analysis_folder/
    
optional arguments:
    -h, --help                      show this help message and exit

    --files [FILES [FILES ...]]      create a db using the specified vcf files(cannot be
                                     used with --folder)
                    
    --folder FOLDER                  create a db using all the vcf files in the folders
    
    --prefix PREFIX                  the prefix of the output file, default = SVDB
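As a minimal sketch (sample file names are hypothetical), a database can also be built from an explicit list of files with a custom output prefix; judging from the query examples further down, the resulting sqlite database is named after the prefix:

    # expected to produce cohort_svdb.db
    svdb --build --files sample1.vcf sample2.vcf sample3.vcf --prefix cohort_svdb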

Export

This module is used to export the variants of the SVDB sqlite database. The variants of the sqlite SVDB database are clustered using one of two algorithms: overlap-based clustering or DBSCAN.

print a help message
    svdb  --export --help  
Export the variants of the database database.db:
    svdb --export --db database.db

optional arguments:
    --no_merge                    skip the merging of variants, print all variants in the db to a vcf file

    --bnd_distance BND_DISTANCE   the maximum distance between two similar precise breakpoints (default = 2500)

    --overlap OVERLAP             the overlap required to merge two events (0 means anything that touches will be merged, 1 means that two events must be identical to be merged), default = 0.8

    --DBSCAN                      use DBSCAN to cluster the variants, overrides the overlap-based clustering algorithm

    --epsilon EPSILON             used together with --DBSCAN; sets the epsilon parameter (default = 500bp)

    --min_pts MIN_PTS             the min_pts parameter (default = 2)

    --prefix PREFIX               the prefix of the output file, default = same as input

    --memory                      load the database into memory: increases the memory requirements, but lowers the time consumption
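For example, a minimal sketch of exporting a database with DBSCAN clustering instead of the default overlap-based clustering; the database name is hypothetical and the parameter values simply restate the defaults listed above:

    svdb --export --db database.db --DBSCAN --epsilon 500 --min_pts 2 --prefix clustered_export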

Query

The query module is used to query one or more structural variant databases. Typically a database is constructed using the build module; however, since this module utilizes the genotype field of the structural variant database vcf to compute the frequency of structural variants, a wide range of files can be used as databases. The query module requires a query vcf, as well as a database file (either a multisample vcf or an SVDB sqlite database):

print a help message
   svdb --query --help
Query a structural variant database, using a vcf file as query:

    svdb --query --query_vcf patient1.vcf --db control_db.vcf

Query multiple databases, using a vcf file as query:

    svdb --query --query_vcf patient1.vcf --db control_db1.vcf,control_db2.vcf --prefix test --in_occ default,Obs --in_frq FRQ,default --out_frq db1_AF,db2_Frq --out_occ db1_AC,db2_Obs

optional arguments:

-h, --help            show this help message and exit

--db DB                        path to a db vcf, or a comma separated list of vcfs
--sqdb SQDB                    path to a SVDB sqlite db, or a comma separated list of dbs
--bedpedb BEDPEDB              path to a SV database of the following format: chrA-posA-chrB-posB-type-count-frequency, or a comma separated list of files
--in_occ IN_OCC                the allele count tag; if used, this tag must be present in the INFO column of the input DB (usually AN or OCC). Required if multiple databases are queried.
--in_frq IN_FRQ                the allele frequency tag; if used, this tag must be present in the INFO column of the input DB (usually AF or FRQ). Required if multiple databases are queried.
--out_occ OUT_OCC              the allele count tag written to the annotated output vcf (default = OCC). Required if multiple databases are queried.
--out_frq OUT_FRQ              the allele frequency tag written to the annotated output vcf (default = FRQ). Required if multiple databases are queried.
--prefix PREFIX                the prefix of the output file; by default the output is printed to stdout. Required if multiple databases are queried.
--bnd_distance BND_DISTANCE    the maximum distance between two similar breakpoints (default = 10000)
--overlap OVERLAP              the overlap required to merge two events (0 means anything that touches will be merged, 1 means that two events must be identical to be merged), default = 0.6
--memory                       load the database into memory: increases the memory requirements, but lowers the time consumption (may only be used with --sqdb)
--no_var                       count overlapping variants of a different type as hits in the db
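As a sketch of a typical run against a database built with the build module (file names are hypothetical); --memory may only be combined with --sqdb, and without --prefix the annotated vcf is written to stdout:

    svdb --query --query_vcf patient1.vcf --sqdb SVDB.db --memory > patient1_annotated.vcf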

Merge

The merge module merges variants within one or more vcf files. This could be used to either merge the output of multiple callers, or to merge variants that are called multiple times due to noise or some other error:

print a help message:
   svdb --merge --help
merge vcf files:
    svdb --merge --vcf patient1_lumpy.vcf patient1_cnvnator.vcf patient1_TIDDIT.vcf > patient1_merged_callers.vcf

Similar variants will be merged and presented according to the order of the input vcf files, i.e. if Lumpy and CNVnator call the same variant in the example above, the variant will be printed as the Lumpy call. In most cases, the order should be set according to the accuracy or the level of detail in the INFO field of the different callers.
The order can also be set using the --priority flag:
    svdb --merge --vcf patient1_lumpy.vcf:one patient1_cnvnator.vcf:2 patient1_TIDDIT.vcf:tiddit --priority tiddit,2,one > patient1_merged_callers.vcf

In this example, tiddit has the highest priority, cnvnator second, and lumpy last.


optional arguments:
    -h, --help                      show this help message and exit
    
    --bnd_distance BND_DISTANCE     the maximum distance between two similar precise breakpoints
                                    (default = 10000)
                    
    --overlap OVERLAP               the overlap required to merge two events(0 means
                                    anything that touches will be merged, 1 means that two
                                    events must be identical to be merged), default = 0.6

    --priority                      prioritise the input vcf files
                                                                  
    --no_intra                      no merging of variants within the same vcf
    
    --no_var                        variants of different type will be merged
    
    --pass_only                     merge only variants labeled PASS

    --same_order                    assume that the samples are ordered the same way (skip reordering and merging of the sample columns)
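A hedged example combining several of the flags above (file names and caller labels are hypothetical): merge PASS variants from two callers, prefer the manta representation, and skip merging of variants within a single vcf:

    svdb --merge --vcf patient1_manta.vcf:manta patient1_delly.vcf:delly --priority manta,delly --pass_only --no_intra > patient1_merged.vcf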

svdb's People

Contributors

dnil, j35p312, jemten, khurrammaqbool, kristinebilgrav, marcelm, ramprasadn


svdb's Issues

Syntax error with singularity image

(screenshot of error messages attached)

I used singularity shell to take a look inside the wrapped image and found these error messages during initialization.

May I know the location of the binary file within the image, so that I can run it with singularity exec?

Thanks!

Refactoring

The code could use some refactoring. This is quite important, since it will hopefully encourage others to add new features.
Refactoring here includes:

  • Cleaning the code (removing unused code and following PEP8)
  • Removing duplicated code by creating functions/classes (preferably in new files)

Keep sample (genotype) information

When merging VCFs, would it be possible to keep the sample/genotype information? If nothing else, just the genotype (if they agree) or missing genotype (./.) if they don't.

thanks,

Merging BNDs in different chromosomes

I am trying to merge some VCFs with SVDB v2.6.4, but I am having some trouble with BNDs.

If I have a VCF with two BNDs:

chr3    195691426       MantaBND:34332:0:1:0:0:0:0      G       G]chr5:1636168] 626     PASS    BND_DEPTH=90;CIPOS=0,12;HOMLEN=12;HOMSEQ=ATTCAGTCCTGG;MATEID=MantaBND:34332:0:1:0:0:0:1;MATE_BND_DEPTH=63;SVTYPE=BND;POS=195691426    GT:FT:GQ:PL:PR:SR       0/1:PASS:626:676,0,999:81,1:59,31
chr5    1636156 MantaBND:34332:0:1:0:0:0:1      T       T]chr3:195691438]       626     PASS    BND_DEPTH=63;CIPOS=0,12;HOMLEN=12;HOMSEQ=CCAGGACTGAAT;MATEID=MantaBND:34332:0:1:0:0:0:0;MATE_BND_DEPTH=90;SVTYPE=BND;POS=1636156      GT:FT:GQ:PL:PR:SR       0/1:PASS:626:676,0,999:81,1:59,31

and merge it with itself:

svdb --merge --vcf test_manta.vcf.gz:truth test_manta.vcf.gz:query --priority truth,query --pass_only --no_intra --overlap 0.1 --bnd_distance 1000 --ins_distance 50

I get:

chr3    195691426       MantaBND:34332:0:1:0:0:0:0:truth|MantaBND:34332:0:1:0:0:0:0:query|MantaBND:34332:0:1:0:0:0:1:query      G       G]chr5:1636168] 626     PASS    BND_DEPTH=90;CIPOS=0,12;HOMLEN=12;HOMSEQ=ATTCAGTCCTGG;MATEID=MantaBND:34332:0:1:0:0:0:1;MATE_BND_DEPTH=63;SVTYPE=BND;POS=195691426;VARID=MantaBND:34332:0:1:0:0:0:0:query|MantaBND:34332:0:1:0:0:0:1:query;set=Intersection;FOUNDBY=2;truth_CHROM=MantaBND_34332_0_1_0_0_0_0|chr3;query_CHROM=MantaBND_34332_0_1_0_0_0_0|chr3,MantaBND_34332_0_1_0_0_0_1|chr5;truth_POS=MantaBND_34332_0_1_0_0_0_0|195691426;query_POS=MantaBND_34332_0_1_0_0_0_0|195691426,MantaBND_34332_0_1_0_0_0_1|1636156;truth_QUAL=MantaBND_34332_0_1_0_0_0_0|626;query_QUAL=MantaBND_34332_0_1_0_0_0_0|626,MantaBND_34332_0_1_0_0_0_1|626;truth_FILTERS=MantaBND_34332_0_1_0_0_0_0|PASS;query_FILTERS=MantaBND_34332_0_1_0_0_0_0|PASS,MantaBND_34332_0_1_0_0_0_1|PASS;truth_SAMPLE=MantaBND_34332_0_1_0_0_0_0|NA24385-NA24385_fda_truthv2-Normal_Blood_noinfo-WGS_v1_noinfo_noinfo-210628_GIABprecisionFDA_GIABFDA99-precisionFDA_noinfo_WGSvalidation-S1_001|GT:0/1|FT:PASS|GQ:626|PL:676:0:999|PR:81:1|SR:59:31;query_SAMPLE=MantaBND_34332_0_1_0_0_0_0|NA24385-NA24385_fda_truthv2-Normal_Blood_noinfo-WGS_v1_noinfo_noinfo-210628_GIABprecisionFDA_GIABFDA99-precisionFDA_noinfo_WGSvalidation-S1_001|GT:0/1|FT:PASS|GQ:626|PL:676:0:999|PR:81:1|SR:59:31,MantaBND_34332_0_1_0_0_0_1|NA24385-NA24385_fda_truthv2-Normal_Blood_noinfo-WGS_v1_noinfo_noinfo-210628_GIABprecisionFDA_GIABFDA99-precisionFDA_noinfo_WGSvalidation-S1_001|GT:0/1|FT:PASS|GQ:626|PL:676:0:999|PR:81:1|SR:59:31;truth_INFO=MantaBND_34332_0_1_0_0_0_0|BND_DEPTH:90|CIPOS:0:12|HOMLEN:12|HOMSEQ:ATTCAGTCCTGG|MATE_BND_DEPTH:63|SVTYPE:BND|POS:195691426;query_INFO=MantaBND_34332_0_1_0_0_0_0|BND_DEPTH:90|CIPOS:0:12|HOMLEN:12|HOMSEQ:ATTCAGTCCTGG|MATE_BND_DEPTH:63|SVTYPE:BND|POS:195691426,MantaBND_34332_0_1_0_0_0_1|BND_DEPTH:63|CIPOS:0:12|HOMLEN:12|HOMSEQ:CCAGGACTGAAT|MATE_BND_DEPTH:90|SVTYPE:BND|POS:1636156;svdb_origin=truth|query  GT:FT:GQ:PL:PR:SR       0/1:PASS:626:676,0,999:81,1:59,31
chr5    1636156 MantaBND:34332:0:1:0:0:0:1      T       T]chr3:195691438]       626     PASS    BND_DEPTH=63;CIPOS=0,12;HOMLEN=12;HOMSEQ=CCAGGACTGAAT;MATEID=MantaBND:34332:0:1:0:0:0:0;MATE_BND_DEPTH=90;SVTYPE=BND;POS=1636156;set=truth;FOUNDBY=1;truth_CHROM=MantaBND_34332_0_1_0_0_0_1|chr5;truth_POS=MantaBND_34332_0_1_0_0_0_1|1636156;truth_QUAL=MantaBND_34332_0_1_0_0_0_1|626;truth_FILTERS=MantaBND_34332_0_1_0_0_0_1|PASS;truth_SAMPLE=MantaBND_34332_0_1_0_0_0_1|NA24385-NA24385_fda_truthv2-Normal_Blood_noinfo-WGS_v1_noinfo_noinfo-210628_GIABprecisionFDA_GIABFDA99-precisionFDA_noinfo_WGSvalidation-S1_001|GT:0/1|FT:PASS|GQ:626|PL:676:0:999|PR:81:1|SR:59:31;truth_INFO=MantaBND_34332_0_1_0_0_0_1|BND_DEPTH:63|CIPOS:0:12|HOMLEN:12|HOMSEQ:CCAGGACTGAAT|MATE_BND_DEPTH:90|SVTYPE:BND|POS:1636156;svdb_origin=truth  GT:FT:GQ:PL:PR:SR     0/1:PASS:626:676,0,999:81,1:59,31

It seems it is merging three BNDs, one of them on another chr(!):

  • MantaBND:34332:0:1:0:0:0:0:truth (chr3)
  • MantaBND:34332:0:1:0:0:0:0:query (chr3)
  • MantaBND:34332:0:1:0:0:0:1:query (chr5)

And leaving MantaBND:34332:0:1:0:0:0:1:truth unmerged.
Why is it not merging MantaBND:34332:0:1:0:0:0:0:truth with MantaBND:34332:0:1:0:0:0:0:query, and MantaBND:34332:0:1:0:0:0:1:truth with MantaBND:34332:0:1:0:0:0:1:query?

How to combine delly,lumpy,manta and tiddit result?

Dear developer:
We have now used four methods to detect structural variation in individuals, but we found that they cannot be combined well. We used the following command, but soon got an error. How can we solve this problem?
svdb --merge --vcf ERS177302.delly.vcf ERS177302.lumpy.vcf ERS177302.manta.vcf ERS177302.tiddit.vcf > ERS177302.combine.vcf

Traceback (most recent call last):
  File "/home/ld/anaconda3/bin/svdb", line 33, in <module>
    sys.exit(load_entry_point('svdb', 'console_scripts', 'svdb')())
  File "/home/ld/Software/SVDB-2.6.4/svdb/__main__.py", line 187, in main
    merge_vcf_module.main(args)
  File "/home/ld/Software/SVDB-2.6.4/svdb/merge_vcf_module.py", line 171, in main
    chrA, posA, chrB, posB, event_type, INFO, FORMAT = readVCF.readVCFLine(line)
  File "svdb/readVCF.py", line 12, in svdb.readVCF.readVCFLine
IndexError: list index out of range

What does the frequency mean?

Hi Jesper,

I have a theoretical question about the frequency which comes out of SVDB.
Imagine I have 10 VCF files covering chr1 and chr2, with a similar variant in 3 of them (30% frequency) in chr1 at pos X, and another 90 VCF files only covering chr2.
All 100 VCF files are used to create a SVDB. What is the frequency from that SVDB in chr1 at pos X? Is it 30%, because only 3 out of possible 10 could catch this variant, or is it 3%, since 3 out of 100 have this variant?

Add new variants to existing database

Since the frequencies are calculated on the fly (export or query mode), I think it would be a nice feature to add new variants to an existing database, rather than creating a new from scratch.

UnicodeDecodeError

I tried to query AF and AN from an in-house built SVDB.

The run returned a UnicodeError listed below:
Traceback (most recent call last):
  File "/venv/bin/svdb", line 33, in <module>
    sys.exit(load_entry_point('svdb', 'console_scripts', 'svdb')())
  File "/home/worker/app/svdb/__main__.py", line 99, in main
    make_query_calls(args, queries, "db")
  File "/home/worker/app/svdb/__main__.py", line 46, in make_query_calls
    query_module.main(args)
  File "svdb/query_module.py", line 73, in svdb.query_module.main
  File "svdb/query_module.py", line 74, in svdb.query_module.main
  File "/usr/local/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 31: invalid start byte

I tried different query vcfs and all returned this error. May I know whether this invalid start byte exists in the DB file or in the query VCF file? If it is in the DB file, what could have gone wrong when building it?

Much appreciated if you could take a look at this issue.

Add information for all callers in `INFO` for overlapping events

svdb --merge retains information from only one of the callers after an overlap is found. It would be nice to see how the event is resolved by the other callers, and in some cases there is a possibility to gain additional information. It may be convenient to add this information to the INFO field of the VCF, similar to VEP annotation, e.g.
manta=CHROM|POS|REF|ALT|INFO|FORMAT|TUMOR;ascat=CHROM|POS|REF|ALT|INFO|FORMAT|TUMOR .....
manta=CHROM|POS|REF|ALT|INFO|FORMAT|TUMOR|NORMAL;delly=CHROM|POS|REF|ALT|INFO|FORMAT|TUMOR|NORMAL .....

Interpretation of genotype counts

When importing the 1000 genomes data into a SQLite database and querying the DB, it appears that any match found is registered with a frequency of "1.0". I find this weird as the genotyping information is available and the frequency of the variant in the G1K cohort could be given.

Am I missing something?

Error when exporting database

I am trying to export a DB into a VCF but I get an error:

$ svdb --export --db manta.db --overlap 0 --bnd_distance 1000 --prefix manta
Traceback (most recent call last):
  File "/services/tools/anaconda2/4.4.0/bin/svdb", line 10, in <module>
    sys.exit(main())
  File "/services/tools/anaconda2/4.4.0/lib/python2.7/site-packages/svdb/__main__.py", line 77, in main
    export_module.main(args)
  File "svdb/export_module.py", line 326, in svdb.export_module.main (svdb/export_module.c:10964)
  File "svdb/export_module.py", line 304, in svdb.export_module.export (svdb/export_module.c:10226)
  File "svdb/export_module.py", line 270, in svdb.export_module.svdb_cluster_main (svdb/export_module.c:9187)
  File "svdb/export_module.py", line 185, in svdb.export_module.overlap_cluster (svdb/export_module.c:6815)
  File "svdb/export_module.py", line 140, in svdb.export_module.expand_chain (svdb/export_module.c:5021)
UnboundLocalError: local variable 'match' referenced before assignment

Any idea of what it might be?

List index out of range

Hi there!

I'm using SVDB merge to combine some small cohorts of manta VCFs together, however I'm getting the following error:

svdb --merge --vcf b16/diploidSV.vcf.gz b15/diploidSV.vcf.gz b14/diploidSV.vcf.gz --bnd_distance 500 --overlap 0.7 > svdb_test_merge.vcf
Traceback (most recent call last):
  File "/data/conda/miniconda3/envs/svdb/bin/svdb", line 11, in <module>
    load_entry_point('svdb', 'console_scripts', 'svdb')()
  File "/data/scout/SVDB/svdb/__main__.py", line 195, in main
    merge_vcf_module.main(args)
  File "/data/scout/SVDB/svdb/merge_vcf_module.py", line 185, in main
    to_be_printed = merge_vcf_module_cython.merge(variants, samples, sample_order, sample_print_order, priority_order, args)
  File "/data/scout/SVDB/svdb/merge_vcf_module_cython.py", line 302, in merge
    samples_tag[match_id].append(collect_sample(vcf_line_B ,samples,sample_order,match_id))
  File "/data/scout/SVDB/svdb/merge_vcf_module_cython.py", line 53, in collect_sample
    sample_entries = vcf_line[9 + sample_position].split(":")
IndexError: list index out of range

I'm using SVDB version 2.8.2 and have tried multiple versions of python.
Any help would be greatly appreciated!
Gavin

Query: Only write variants with frequency lower than a given value

When querying an already existing DB with a new VCF, it would be nice to add an option to only include new variants and variants with a frequency below a given value, e.g. 5%. The default behaviour should of course be to include everything (new variants + variants of any frequency).
I have forked, implemented (not pushed yet) and tested this. It is quite simple to implement, but is it something you are interested in adopting if I make a pull request for it?

Bug: missing headers

Running svdb v2.6.1 as:

svdb --merge --vcf cnvnat delly lumpy manta --pass_only --no_intra --no_var --overlap 0.1 --bnd_distance 1000

and passing the output through bcftools, I am getting some missing headers:

[W::bcf_hrec_check] Invalid tag name: "cnvnat_INFO"
[W::bcf_hrec_check] Invalid tag name: "cnvnat_SAMPLE"
[W::bcf_hrec_check] Invalid tag name: "cnvnat_CHROM"
[W::bcf_hrec_check] Invalid tag name: "cnvnat_POS"
[W::bcf_hrec_check] Invalid tag name: "cnvnat_QUAL"
[W::bcf_hrec_check] Invalid tag name: "cnvnat_FILTERS"
[W::bcf_hrec_check] Invalid tag name: "delly_INFO"
[W::bcf_hrec_check] Invalid tag name: "delly_SAMPLE"
[W::bcf_hrec_check] Invalid tag name: "delly_CHROM"
[W::bcf_hrec_check] Invalid tag name: "delly_POS"
[W::bcf_hrec_check] Invalid tag name: "delly_QUAL"
[W::bcf_hrec_check] Invalid tag name: "delly_FILTERS"
[W::bcf_hrec_check] Invalid tag name: "lumpy_INFO"
[W::bcf_hrec_check] Invalid tag name: "lumpy_SAMPLE"
[W::bcf_hrec_check] Invalid tag name: "lumpy_CHROM"
[W::bcf_hrec_check] Invalid tag name: "lumpy_POS"
[W::bcf_hrec_check] Invalid tag name: "lumpy_QUAL"
[W::bcf_hrec_check] Invalid tag name: "lumpy_FILTERS"
[W::bcf_hrec_check] Invalid tag name: "manta_INFO"
[W::bcf_hrec_check] Invalid tag name: "manta_SAMPLE"
[W::bcf_hrec_check] Invalid tag name: "manta_CHROM"
[W::bcf_hrec_check] Invalid tag name: "manta_POS"
[W::bcf_hrec_check] Invalid tag name: "manta_QUAL"
[W::bcf_hrec_check] Invalid tag name: "manta_FILTERS"

If I also include the option --notag, svdb_origin is not included in the header, even though it is in the VCF file.

version

Is there a --version flag? I cannot find it in the CLI

main() takes exactly 2 positional arguments (1 given)

Hi,

I am trying to run the svdb module 2.7.1 with:

❯ svdb --query --query_vcf work/stage/e4/676eb20eb46b22b46b233c2a62a6d5/sv_query.vcf.gz  --db work/stage/e6/e425bbcf983679e5aaf6dc774c7400/gnomAD.r2.1.1-sv.vcf.gz --prefix test --in_occ AC --in_frq AF --out_occ gnomad_svAC --out_frq gnomad_svAF

and getting below error:

Traceback (most recent call last):
  File "/Users/monarchy/miniconda3/envs/svdb/bin/svdb", line 11, in <module>
    sys.exit(main())
  File "/Users/monarchy/miniconda3/envs/svdb/lib/python3.10/site-packages/svdb/__main__.py", line 93, in main
    make_query_calls(args, queries, "db")
  File "/Users/monarchy/miniconda3/envs/svdb/lib/python3.10/site-packages/svdb/__main__.py", line 40, in make_query_calls
    query_module.main(args)
  File "svdb/query_module.py", line 11, in svdb.query_module.main
TypeError: main() takes exactly 2 positional arguments (1 given)

I suppose I am missing a parameter somewhere, but for the life of me I can't find which one. I have copied the test command multiple times and filled in my own files etc. If anyone has sharper eyes and could help me spot the error, I would be very grateful.

Error when quering custom database

I am trying to build a custom database of SVs, and then query it to get the frequency of the SVs on a given sample. So far I have built the database:

svdb --build --files *.vcf.gz --prefix SVDB

But when I try to query it:

svdb --query --query_vcf sample1.vcf.gz --sqdb SVDB.db --in_occ OCC --in_frq AF --out_occ SV_AN --out_frq SV_AF

I get:

Traceback (most recent call last):
  File "/home/projects/cu_10047/people/filipev/python2/bin/svdb", line 10, in <module>
    sys.exit(main())
  File "/home/projects/cu_10047/people/filipev/python2/lib/python2.7/site-packages/svdb/__main__.py", line 37, in main
    query_module.main(args)
  File "/home/projects/cu_10047/people/filipev/python2/lib/python2.7/site-packages/svdb/query_module.py", line 200, in main
    hits = SQDB(query,args,c)
  File "/home/projects/cu_10047/people/filipev/python2/lib/python2.7/site-packages/svdb/query_module.py", line 288, in SQDB
    ci = args.ci
AttributeError: 'Namespace' object has no attribute 'ci'

Any idea of what this might be?

svdb merge changes VCF header

Hi there,

I merged four SV callers using svdb merge (delly, smoove, manta, tiddit). Then I tried running the merged VCF through vcfanno to annotate with genomic features and ran into an error. I checked on the vcfanno issues page and people had similar errors when attempting to annotate with malformed VCFs.

After looking at the header I noticed this line was different between the single caller and merged VCFs:

##INFO=<ID=END,Number=1,Type=String,Description="End of an intra-chromosomal variant">

My quick fix was to run this command on the VCF:

sed 's/ID=END,Number=1,Type=String/ID=END,Number=1,Type=Integer/g'

This resolved the issue with vcfanno.

I thought I'd let you know!

The install document should be updated.

Hi,
Thanks for your useful tool. I think the installation part (pip install -e .) in the README should be clearer. Should I download the source code first?

Best,
xiucz

User defined VCF header

When querying a database, would it be possible to add an option so the user can define the VCF header? Right now it always says:

##INFO=<ID=XXX_AN,Number=1,Type=Integer,Description="The number of occurances of the event in the database">
##INFO=<ID=XXX_AF,Number=1,Type=Float,Description="The frequency of the event in the database">

But it would be nice to be able to add custom labels to include more information, like which database (and which version) was used.

Query is failing with UnicodeDecodeError

I have multiple Manta SV files and have already built a database. But now the query fails with the following error:

Traceback (most recent call last):
  File "xxxxx/bin/svdb", line 11, in <module>
    load_entry_point('svdb', 'console_scripts', 'svdb')()
  File "xxxx/SVDB/svdb/__main__.py", line 37, in main
    query_module.main(args)
  File "xxxxxx/SVDB/svdb/query_module.py", line 96, in main
    for line in f:
  File "xxxxxxxx/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 99: invalid continuation byte

Commands that I've used are:
svdb --build --files *.vcf --prefix my_svdb

Then followed by:
svdb --query --db issue_49_svdb.db --query_vcf data_1.vcf

I'm trying to run it with minimal options to test and make it work, but I couldn't figure out from the documentation in what order or how it should be done. I tried --export as well, but I'm not sure how to use it properly.

Option to disable "Intersection" in set field

When running SVDB --merge and the variant exists in all VCFs the INFO field "set" is set to "Intersection".

This may be nice under some circumstances, but it would be great if this could be disabled with an option and just print all the names instead (like it does when fewer than all files intersect).

For instance if you are merging files from 3 different variant callers, there is no way of knowing which callers were used when parsing the VCF, since set only says "Intersection".

Hope that makes sense.

Using SVDB for merging VCF files of different samples?

Hi there. Thanks for developing such a nice program.

I have run different SV callers (Manta, DELLY, xTea, CNVnator, and LUMPY) on my samples.
I used SVDB to merge all 5 of the above outputs for each sample, and got one merged output per sample.

  1. Now, I was wondering if it is possible to use SVDB to merge the output of each sample (after previous merge) to produce one output for all samples (in other words, the union of SV sites in all samples).
  2. If so, another question I have is: are there any programs you recommend for genotyping each sample against the final list of SV sites after merging?

Thank you in advance,
Best,

Include all SV IDs on the merged ID

When merging SVs, would it be possible to add an option to set the SV's ID to the concatenation (or unique concatenation) of all the IDs collapsed (rather than the first ID)?

For example, if I am merging the SVs:

1	176478235	476	[...]
1	176478217	MantaDEL:13605:0:1:0:0:0	[...]

I'd get:

1	176478217	MantaDEL:13605:0:1:0:0:0;476	[...]

thanks

Enhance the FORMAT column of merge subcommand.

Hi,
I think people are often looking for SV merge tools, and after testing many of them, such as SURVIVOR, svimmer, SVanalyzer, svtools, truvari, bcftools, and many others, I find SVDB is the best tool for merging SV vcf files from different SV callers! It uses a set and priority strategy to combine SV events, which is similar to the tool CombineVariants (suitable for small variants).

When I merge three vcfs from manta, lumpy and svaba, I find the resulting vcf only contains two fields:

#svdb merged file
chr13	32913545	MantaDEL:1:279:280:0:0:0:manta|5:lumpy	T	<DEL>	.	MaxDepth	END=32918007;SVTYPE=DEL;SVLEN=-4462;CIPOS=0,2;CIEND=0,2;HOMLEN=2;HOMSEQ=CA;STRANDS=+-:436;CIPOS95=0,0;CIEND95=0,0;SU=436;PE=0;SR=436;VARID=5:lumpy;set=filterInmanta-lumpy;FOUNDBY=2;manta_CHROM=MantaDEL_1_279_280_0_0_0|chr13;lumpy_CHROM=5|chr13;manta_POS=MantaDEL_1_279_280_0_0_0|32913545;lumpy_POS=5|32913547;manta_QUAL=MantaDEL_1_279_280_0_0_0|.;lumpy_QUAL=5|0.00;manta_FILTERS=MantaDEL_1_279_280_0_0_0|MaxDepth;lumpy_FILTERS=5|.;manta_SAMPLE=MantaDEL_1_279_280_0_0_0|QYQ_zuzhi|PR:0:0|SR:0:0;lumpy_SAMPLE=5|QYQ_zuzhi|GT:./.|SU:436|PE:0|SR:436|GQ:.|SQ:.|GL:.|DP:0|RO:0|AO:0|QR:0|QA:0|RS:0|AS:0|ASC:0|RP:0|AP:0|AB:.;manta_INFO=MantaDEL_1_279_280_0_0_0|END:32918007|SVTYPE:DEL|SVLEN:-4462|CIPOS:0:2|CIEND:0:2|HOMLEN:2|HOMSEQ:CA;lumpy_INFO=5|SVTYPE:DEL|SVLEN:-4463|END:32918010|CIPOS:0:0|CIEND:0:0|CIPOS95:0:0|CIEND95:0:0|SU:436|PE:0|SR:436;svdb_origin=manta|lumpy	PR:SR	.,.:.,.	0,0:0,0

#lumpy raw file
chr13	32913547	5	N	<DEL>	0.00	.	SVTYPE=DEL;SVLEN=-4463;END=32918010;STRANDS=+-:436;CIPOS=0,0;CIEND=0,0;CIPOS95=0,0;CIEND95=0,0;SU=436;PE=0;SR=436	GT:SU:PE:SR:GQ:SQ:GL:DP:RO:AO:QR:QA:RS:AS:ASC:RP:AP:AB	./.:436:0:436:.:.:.:0:0:0:0:0:0:0:0:0:0:.

#manta raw file
chr13	32913545	MantaDEL:1:279:280:0:0:0	T	<DEL>	.	MaxDepth	END=32918007;SVTYPE=DEL;SVLEN=-4462;CIPOS=0,2;CIEND=0,2;HOMLEN=2;HOMSEQ=CA	PR:SR	0,0:0,0

SVDB merged the FORMAT field, but it is trimmed for some reason:

PR:SR	.,.:.,.	0,0:0,0

I think:

  1. the keys of the FORMAT field should be filled with the tags from all callers;
  2. the values of the FORMAT field should have the same length as the number of callers (here, 3 callers, 3 sample columns).
    If one SV event is called by 3 callers, the FORMAT should contain all the information from the 3 callers, as the INFO field does.
    An example:
GT:SU:PE:SR:GQ:SQ:GL:DP:RO:AO:QR:QA:RS:AS:ASC:RP:AP:AB:PR:SR ./.:436:0:436:.:.:.:0:0:0:0:0:0:0:0:0:0:.:.:. .:.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:0,0:0,0 .:.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:.:.,.:.,.

Here is the command line:

#version SVDB-2.6.4
~/bin/svdb --merge --vcf tumorSV.vcf:manta lumpy.gt.vcf:lumpy svaba.unfiltered.sv.vcf:svaba  --priority svaba,manta,lumpy > svdb.3.vcf

Best,
xiucz

No_intra flag

If we run the merge module without specifying the --no_intra flag, do the variants that are merged within the same VCF have to follow the same rules as the ones across VCFs? Meaning the variants have to be of the same type and within a window?

Thanks in advance.

No match in db, do not add to vcf file

For instance, if a variant is not in the pathogenic database, do not print that info to the vcf.

##INFO=<ID=clingen_cgh_pathogenic,Number=1,Type=Integer,Description="The number of occurances of the event in the database">

yields clingen_cgh_pathogenic=0 in the INFO field. clingen_cgh_pathogenic=1 is informative; 0 is not.

Add option to only include PASS SVs when building DB

Hi,

would it be possible to have an option to only include SVs with e.g. FILTER=PASS when building the DB? Right now svdb --build uses all SVs, but sometimes it would be nice to have the option to only include the ones that PASS.
thanks,

More generic information extraction in svdb query

For instance:

svdb query -name decipher_ -info OCCURENCES,FREQUENCY

would yield decipher_OCCURENCES=X & decipher_FREQUENCY=Y in the INFO field.

These are too specific:

--hit_tag HIT_TAG               the tag used to describe the number of hits within the
                                info field of the output vcf (default=OCC)
--frequency_tag FREQUENCY_TAG   the tag used to describe the frequency of the
                                variant (default=FRQ)

and they do not match, for instance, CLASS=Benign from the benign db.

It would also make it possible to extract any number of INFO fields from the same database and make them unique in the output vcf.

And you can also merge the clingen_cgh_benign and clingen_cgh_pathogenic db into one and record the status in CLASS

Double entries from merge command

Hi there,

I just tested this tool for combining the outputs of Delly, Manta, and TIDDIT. It seems to work great!

I'm not necessarily sure this is an issue and you probably already know -

Here is a command I ran to merge three VCFs for one sample :

svdb --merge --pass_only --no_var --vcf ${dellysv}:delly ${mantasv}:manta ${tidditsv}:tiddit --priority delly,manta,tiddit

Note that I added --pass_only and --no_var after I noticed that when two programs identify different variants at the same position, two lines are output. I'm not sure that I noticed any difference when these flags were there vs when they weren't. From what I can tell, duplicate position lines break downstream processing with bcftools: it errors on the first occurrence of a duplicate record. I haven't tested what would happen if I ran bcftools norm -m +any on the svdb --merge output prior to additional processing, because that is typically used for multi-allelic sites from what I understand.

I got around the issue with a "take first line if the next line has the same CHROM POS" awk command, but wanted to let you know. Maybe, the --priority flag can have that built in if the user requests it?

Another minor thing I noticed is that my chromosome order got rearranged after the --merge command, from I,II,III,IV,V,X,MtDNA to I,II,III,IV,MtDNA,V,X. I am assuming this happens because the output sorts the chromosomes alphanumerically?

Apart from that, this tool seems great! I'll let you know if I notice anything else.

Best,
S

Only keep one INFO field

Hello,

I'd like to ask if there is a way of only keeping one INFO field when merging VCFs, meaning if there is a variant that is found by 2 different callers, only keep the INFO field of the 1st caller?

Also, have there been any changes to runtime speed? I'm merging VCFs with the new version, without any changes to my VCFs, and the runtime has gone from a few minutes to a few hours.

Best regards,

Jonatan

svdb query error using decipher db

It does not seem like svdb query produces a frequency tag. Here is the command:

$svdb --query --bnd_distance 25000 --overlap 0.8 --hit_tag decipher --frequency_tag decipherAF --db /mnt/hds/proj/cust003/develop/mip_references/GRCh37_svdb_query_decipher_-v1.0.0-.vcf --query_vcf /scratch/$SLURM_JOB_ID/118_SV_vt_svdbq.vcf.0 > /scratch/$SLURM_JOB_ID/118_SV_vt_svdbq.vcf.1

Here is a line from the db:

1       23946   del_4   N       <DEL>   .       PASS    SVTYPE=DEL;END=88271;OCCURENCES=27;FREQUENCY=0.031952663        GT      0/1

Only tags that are generated are:
decipher=0;decipherAF=0
decipher=1;decipherAF=1

Python 3 compatibility

Hi.

Would you accept a PR that makes the code work both with Python 2 and 3?

I'd also add a basic integration with Travis if you agree.

Clarify license

Hi, could you clarify the license (e.g., MIT/BSD) by adding a LICENSE file?

set=Intersection is not fully informative; add FOUND_IN tag?

This is a little bit of a classic, since it will not be obvious between runs which callers were combined for set=Intersection merges (found by all callers). The FOUND_IN tag used by some other tools is rather nice, and very similar to svdb_origin. I think we'll add the latter as a synonym in Scout, but it would not hurt renaming it FOUND_IN here, or adding it if you find they differ in edge cases.
