Giter VIP home page Giter VIP logo

vcfbub's Introduction

vcfbub

install with bioconda

popping bubbles in vg deconstruct VCFs

overview

The VCF output produced by a command like vg deconstruct -e -a -H '#' ... includes information about the nesting of variants. With -a, --all-snarls, we obtain not just the top level bubbles, but all nested ones. This exposed snarl tree information can be used to filter the VCF to obtain a set of non-overlapping sites (n.b. "snarl" is a generic model of graph bubbles including tips and loops).

vcfbub lets us do two common operations on these VCFs:

  1. We can filter sites by maximum level in the snarl tree. For instance, --max-level 0 would keep only sites with LV=0. In practice, vg's snarl finder ensures that these are sites rooted on the main linear axis of the pangenome graph. Those at higher levels occur within larger variants.
  2. We can filter sites by maximum allele size, either for the reference allele or any allele. In this case, --max-ref-length 10000 would keep only sites where the reference allele is less than 10kb long. Setting --max-ref-length or --max-allele-length additionally ensures that the output contains the bubbles nested inside of any popped bubble, even if they are at greater than --max-level.

vcfbub accomplishes a simple task: we keep sites that are the children of those which we "pop" due to their size. These occur around complex large SVs, such as multi-Mbp inversions and segmental duplications. We often need to remove these, as they provide little information for many downstream applications, such as haplotype panels or other imputation references.

usage

This removes all non-top-level variant sites (-l 0) unless they are inside of variants with reference length > 10kb (-r 10000):

vcfbub -l 0 -r 10000 var.vcf >filt.vcf

vcfbub's People

Contributors

andreaguarracino avatar ekg avatar glennhickey avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

Forkers

baozg

vcfbub's Issues

vcfbub installation/usage

Hello to all,

I am struggling with vcfbub installation. I've looked for documentation but could not find something that helped me through it (I should point out that I am a biologist and may as well be missing something quite straightforward).

I first thought that vcfbub may be integrated in another pangenome tool (I looked in pggb, vg, vcflib), but did not find it.

I then tried to copy the git repository, but this led me to the conclusion that I require rust with cargo to run the main.rs file. For some reason, the rust installation is not working (neither locally nor on the hpc cluster I am using).

How should I proceed to be able to use vcfbub ? Should I keep trying with rust installation or is there another way ?

Thank you for your time on this issue.

Best,
Luca Soldini

The vcfhub remove about 300Mb data in a 2.5Gb genome

Dear developer:
We used Minigraph-Cactus to build a pan-genome and used Pangenie for individual typing. We found a large fragment of variation in some chromosomes in the genome group, which was lost after quality control. What caused this? Will it affect the subsequent analysis? Because we were trying to do a genome-wide association analysis of SV, we were puzzled by the lack of information in some chromosome fragments all the types: SNP, Indel, and SV.
Best day!

Code:
vcfbub -l 0 -r 100000 --input chr2.vcf.gz > chr2.ready.vcf

Example:
2 10510824 >18339279>18339282 GT AG 60.0 . GT 0 0 0 0 0 0 1 0 0 0 >
2 10510828 >18339282>18339285 CA TG 60.0 . GT 0 0 0 0 0 0 1 0 0 0 >
2 10510833 >18339285>18339288 CA TT 60.0 . GT 0 0 0 0 0 0 1 0 0 0 >
2 10510838 >18339288>18339291 CT TC 60.0 . GT 0 0 0 0 0 0 1 0 0 0 >
2 10510842 >18339291>18339294 GC TT 60.0 . GT 0 0 0 0 0 0 1 0 0 0 >
2 10510848 >18339294>18339297 C G 60.0 . GT 0 0 0 0 0 0 1 0 0 0 >
2 10510856 >18339297>18339300 GG CT 60.0 . GT 0 0 0 0 0 0 1 0 0 0 >
2 10510865 >18339300>18339303 A G 60.0 . GT 0 0 0 0 0 0 1 0 0 0 >
2 10510867 >18339303>18339305 TAAC T 60.0 . GT 0 0 0 0 0 0 1 0 0 0 0 >
2 10510886 >18339305>18339308 A G 60.0 . GT 0 0 0 0 0 0 1 0 0 0 >
2 10510889 >18339308>18339311 TACC ATTG 60.0 . GT 0 0 0 0 0 0 1 0 0 0 >
2 79046966 >21226444>21226446 A AACGAATCCGACTAGGAACCATGAGGTTGCAGGTTCGGTCCCTGCCCTTGCTCAGTGGGTTAACGATCCGGCGTTGCCGTGAGCTGTGGTGTAGATCACAGATGCAGCTTAGATCCTGAGTTGCTGTGGCTGTGGCATATGGTGGCAGCTGCTATCTGATTCGACCCCTAGACTGGGAACCTCCATATACCACGAGTGCAGTCCTA>
2 79047027 >21226446>21226449 A G 60.0 . GT 0 0 0 0 0 0 0 0 0 0 >
2 79047042 >21226449>21226452 CC CT,CCT 60.0 . GT 0 0 0 0 >
2 79047076 >21226452>21226455 A G 60.0 . GT 1 0 1 1 0 1 0 0 1 1 >
2 79047081 >21226455>21226457 CC C 60.0 . GT 0 0 0 0 0 0 0 0 0 0 0 >
2 79047086 >21226457>21226460 T C 60.0 . GT 0 0 0 0 0 0 0 0 0 0 >
2 79047149 >21226460>21226463 CC CAT 60.0 . GT 0 0 0 0 0 0 0 0 0 0 >
2 79047156 >21226463>21226466 CCG CA 60.0 . GT 0 0 0 0 0 0 0 0 0 0 >
2 79047160 >21226466>21226469 G A 60.0 . GT 0 0 0 0 0 0 0 0 0 0 >
2 79047166 >21226469>21226471 GG G 60.0 . GT 0 0 0 0 0 0 0 0 0 0 0 >

InvalidInput, error: "invalid gzip header"

Dear developers,

I wanted to try vcfbub using a pangenome of A. thaliana built using pggb.

This is the vg deconstruct (vg version v1.40.0 "Suardi") command stored in my .log file:

vg deconstruct -P Arabidopsis_thaliana -H # -e -a -t 16 c0_output/Arabidopsis_input.fasta.gz.a7c61d0.community.0.fa.7659a9d.417fcdf.171f02d.smooth.final.gfa

This is how the .decomposed.vcf looks like:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=CONFLICT,Number=.,Type=String,Description="Sample names for which there are multiple paths in the graph with conflicting alleles">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=LV,Number=1,Type=Integer,Description="Level in the snarl tree (0=top level)">##INFO=<ID=PS,Number=1,Type=String,Description="ID of variant corresponding to parent snarl">
##INFO=<ID=AT,Number=R,Type=String,Description="Allele Traversal as path in graph">##contig=<ID=Arabidopsis_thaliana#1#1,length=30427671>
##INFO=<ID=LEN,Number=A,Type=Integer,Description="allele length">##INFO=<ID=ORIGIN,Number=1,Type=String,Description="Decomposed from a complex record using vcflib vcfwave and alignment with WFA2-lib.">
##INFO=<ID=INV,Number=A,Type=String,Description="Count of haplotypes which are aligned in the inverted orientation using vcflib vcfwave.">
##bcftools_annotateVersion=1.16+htslib-1.16
##bcftools_annotateCommand=annotate -x INFO/TYPE c0_output/Arabidopsis_input.fasta.gz.a7c61d0.community.0.fa.7659a9d.417fcdf.171f02d.smooth.final.Arabidopsis_thaliana.decomposed.vcf.tmp.vcf; Date=Tue Dec 13 00:21:36 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  GCA_020911765   GCA_023115395   GCA_900660825   GCA_902460265   GCA_902460275   GCA_902460285   GCA_902460295   GCA_902460305   GCA_902460315   GCA_902825305   GCA_903064285   GCA_903064295   GCA_903064325   GCA_904420315   GCA_933208065   MyArabidopsis
Arabidopsis_thaliana#1#1        171     >123>125        T       TA      60      .       AC=1;AF=0.166667;AN=6;AT=>123>125,>123>124>125;LV=0;NS=6        GT      0       0       .       .       .       .       .       .       0       0       0       .       .       1       .       .
Arabidopsis_thaliana#1#1        177     >125>128        C       T       60      .       AC=1;AF=0.166667;AN=6;AT=>125>127>128,>125>126>128;LV=0;NS=6    GT      0       0       .       .       .       .       .       .       1       0       0       .       .       0       .       .
Arabidopsis_thaliana#1#1        219     >133>136        C       T       60      .       AC=1;AF=0.142857;AN=7;AT=>133>135>136,>133>134>136;LV=0;NS=7    GT      0       0       .       .       .       .       .       .       1       0       0       .       .       0       .       0
Arabidopsis_thaliana#1#1        253     >136>139        T       C       60      .       AC=2;AF=0.285714;AN=7;AT=>136>138>139,>136>137>139;LV=0;NS=7    GT      0       0       .       .       .       .       .       .       1       0       0       .       .       1       .       0
Arabidopsis_thaliana#1#1        270     >139>142        G       A       60      .       AC=1;AF=0.142857;AN=7;AT=>139>141>142,>139>140>142;LV=0;NS=7    GT      0       0       .       .       .       .       .       .       1       0       0       .       .       0       .       0
Arabidopsis_thaliana#1#1        276     >142>145        T       G       60      .       AC=1;AF=0.142857;AN=7;AT=>142>144>145,>142>143>145;LV=0;NS=7    GT      0       0       .       .       .       .       .       .       1       0       0       .       .       0       .       0
Arabidopsis_thaliana#1#1        291     >145>148        T       A       60      .       AC=1;AF=0.142857;AN=7;AT=>145>147>148,>145>146>148;LV=0;NS=7    GT      0       0       .       .       .       .       .       .       0       0       0       .       .       1       .       0

I tried to run vcfbub with the following command:

vcfbub -l 0 -r 10000 -i Arabidopsis_input.fasta.gz.a7c61d0.community.0.fa.7659a9d.417fcdf.171f02d.smooth.final.Arabidopsis_thaliana.decomposed.vcf > Arabidopsis_vcfbub_l0_r10000.vcf

And I am getting the following error:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: IoError(Custom { kind: InvalidInput, error: "invalid gzip header" })', src/main.rs:14:6
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Rust full backtrace:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: IoError(Custom { kind: InvalidInput, error: "invalid gzip header" })', src/main.rs:14:6
stack backtrace:
   0:     0x556630e01100 - std::backtrace_rs::backtrace::libunwind::trace::h32eb3e08e874dd27
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x556630e01100 - std::backtrace_rs::backtrace::trace_unsynchronized::haa3f451d27bc11a5
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x556630e01100 - std::sys_common::backtrace::_print_fmt::h5b94a01bb4289bb5
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/sys_common/backtrace.rs:66:5
   3:     0x556630e01100 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hb070b7fa7e3175df
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/sys_common/backtrace.rs:45:22
   4:     0x556630e20e8e - core::fmt::write::hd5207aebbb9a86e9
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/fmt/mod.rs:1202:17
   5:     0x556630dfe935 - std::io::Write::write_fmt::h3bd699bbd129ab8a
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/io/mod.rs:1679:15
   6:     0x556630e02783 - std::sys_common::backtrace::_print::h7a21be552fdf58da
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/sys_common/backtrace.rs:48:5
   7:     0x556630e02783 - std::sys_common::backtrace::print::ha85c41fe4dd80b13
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/sys_common/backtrace.rs:35:9
   8:     0x556630e02783 - std::panicking::default_hook::{{closure}}::h04cca40023d0eeca
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:295:22
   9:     0x556630e0246f - std::panicking::default_hook::haa3ca8c310ed5402
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:314:9
  10:     0x556630e02e2a - std::panicking::rust_panic_with_hook::h7b190ce1a948faac
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:698:17
  11:     0x556630e02d27 - std::panicking::begin_panic_handler::{{closure}}::hbafbfdc3e1b97f68
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:588:13
  12:     0x556630e015ac - std::sys_common::backtrace::__rust_end_short_backtrace::hda93e5fef243b4c0
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/sys_common/backtrace.rs:138:18
  13:     0x556630e02a42 - rust_begin_unwind
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:584:5
  14:     0x556630d7ed33 - core::panicking::panic_fmt::h8d17ca1073d9a733
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:142:14
  15:     0x556630d7ee23 - core::result::unwrap_failed::hfaddf24b248137d3
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/result.rs:1785:5
  16:     0x556630d8728c - vcfbub::main::h5c2271e241dd4124
  17:     0x556630d87b23 - std::sys_common::backtrace::__rust_begin_short_backtrace::h73a2075f4446c9c0
  18:     0x556630d8f419 - std::rt::lang_start::{{closure}}::h5399f682a4a2dd89
  19:     0x556630dfb44f - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::hb69be6e0857c6cfb
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/ops/function.rs:283:13
  20:     0x556630dfb44f - std::panicking::try::do_call::h396dfc441ee9c786
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:492:40
  21:     0x556630dfb44f - std::panicking::try::h6cdda972d28b3a4f
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:456:19
  22:     0x556630dfb44f - std::panic::catch_unwind::h376039ec264e8ef9
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panic.rs:137:14
  23:     0x556630dfb44f - std::rt::lang_start_internal::{{closure}}::hc94720ca3d4cb727
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/rt.rs:148:48
  24:     0x556630dfb44f - std::panicking::try::do_call::h2422fb95933fa2d5
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:492:40
  25:     0x556630dfb44f - std::panicking::try::h488286b5ec8333ff
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:456:19
  26:     0x556630dfb44f - std::panic::catch_unwind::h81636549836d2a25
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panic.rs:137:14
  27:     0x556630dfb44f - std::rt::lang_start_internal::h6ba1bb743c1e9df9
                               at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/rt.rs:148:20
  28:     0x556630d87a28 - main
  29:     0x7fa6c21c7083 - __libc_start_main
                               at /build/glibc-SzIz7B/glibc-2.31/csu/../csu/libc-start.c:308:16
  30:     0x556630d7efb5 - <unknown>

Is there anything wrong with my decomposed.vcf file or is there any conflict between libraries/packages? I'm not familiar with Rust so I apologise if it's a basic question.

Thank you in advance.

Best regards,

Lia

Rust Error occurs:{ kind: InvalidInput, error: "invalid gzip header" }

Hi there,
I met the Rust error. The command is
vcfbub -l 0 -r 100000 -i input.vcf > output.filter.vcf

I got the error message like:
'thread 'main' panicked at 'called Result::unwrap() on an Err value: IoError(Custom { kind: InvalidInput, error: "invalid gzip header" })', src/main.rs:14:6
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace'

The graph was built using MC, and the input.vcf file was generated using the command vg deconstruct.

I am not familiar with this error message. Is there anything with my vcf file? I wonder how to check the vcf file in advance or is there any methods to deal with the problem?

Any help will be highly appreciated.

Thank You!
-Zoey

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.