Giter VIP home page Giter VIP logo

variantannotation's People

Contributors

d-cameron avatar dtenenba avatar hpages avatar jiefei-wang avatar jmarshall avatar jwokaty avatar kayla-morrell avatar lawremi avatar lcougnaud avatar liubuntu avatar lshep avatar mtmorgan avatar nhayden avatar nturaga avatar pshannon-bioc avatar sonali-bioc avatar villafup avatar vjcitn avatar vobencha avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

variantannotation's Issues

Add a covariate to VRange object

Hi, I want to add an additional column to my VRange object.

I read in a multi-vcf file first and then make it into a VRange object. I have sampleNames for all of my samples but want to further annotate them into different subsets. Is there any way that I could achieve this goal?

Unable to Export VCF After rbind

A minimal example using one file, although I noticed the problem when using two distinct files.

vcf <- readVcf(system.file("extdata", "ex2.vcf", package="VariantAnnotation") , "hg19",
               param = ScanVcfParam(geno = NA))
vcf <- rbind(vcf, vcf)
writeVcf(vcf, file = "test.vcf")

Error is "all 'geno()' fields must be named".

Incorrect added ##FILTER line when none are present.

When no ##FILTER line is present in the header of a vcf, then a DataFrame with 1 row and 1 column is added to the header of the vcf. This results in the following line being written when writing out the vcf:
##FILTER=All filters passed
This incorrectly written line causes igv to crash when loading in a vcf.
When other ##FILTER lines are present the line that gets written out is:
##FILTER=<ID=PASS,Description="All filters passed">

This issue arises because of line 186 of the methods-writeVcf.R file. Changing the line to this would probably fix this issue:
} else if(ncol(df) == 1L && nrow(df) == 1L && nms != "FILTER") {

I've added a minimal example vcf file here:
example_vcf.txt

Session info:

R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] MutationalPatterns_1.12.0         NMF_0.22.0                        cluster_2.1.0                    
 [4] rngtools_1.5                      pkgmaker_0.31.1                   registry_0.5-1                   
 [7] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.54.0                   rtracklayer_1.46.0               
[10] ggbeeswarm_0.6.0                  VariantAnnotation_1.32.0          Rsamtools_2.2.3                  
[13] Biostrings_2.54.0                 XVector_0.26.0                    SummarizedExperiment_1.16.1      
[16] DelayedArray_0.12.3               BiocParallel_1.20.1               matrixStats_0.56.0               
[19] Biobase_2.46.0                    GenomicRanges_1.38.0              GenomeInfoDb_1.22.1              
[22] IRanges_2.20.2                    S4Vectors_0.24.4                  BiocGenerics_0.32.0              
[25] forcats_0.5.0                     stringr_1.4.0                     dplyr_1.0.0                      
[28] purrr_0.3.4                       readr_1.3.1                       tidyr_1.1.0                      
[31] tibble_3.0.1                      ggplot2_3.3.1                     tidyverse_1.3.0                  

add a GENO field

Hi,

I would like to add a GENO field in my VCF (CollapsedVCF object), so here is what I did:

geno(header(vcf_chunk))["addCov",]=list("1","Integer","Added coverage from external table")
covs = matrix(...) #here I have the correct format with same rownames and colnames as other GENO fields
geno(vcf_chunk[1,])[["addCov"]] = covs

and I get this:

Warning message:
In .Method(..., FUN = FUN, MoreArgs = MoreArgs, SIMPLIFY = SIMPLIFY,  :
  longer argument not a multiple of length of shorter

It seems that this create an "NA" field in the GENO, and I can not change this into my true name:

names(geno(vcf_chunk[i,]))[14]="addCov"
names(geno(vcf_chunk[i,]))[14]
[1] NA

predictCoding() and changes in start/stop codons?

Hi there,

I have a SNP in the first codon of a gene (ATG->ATA). This gets annotated by predictCoding() as a nonsynonymous change. As a biologist, I see this start codon loss as a much more interesting/extreme change than an ATG->ATA change that would occur in the middle of the ORF.

It's been a while since I used other tools that annotated SNP consequences (Ensembl VEP) but I think they call it something like 'start_lost'.

What would you think about having predictCoding do something similar? For the moment I have a hack to do it myself:

mcols(y)$CONSEQUENCE <- as.character(mcols(y)$CONSEQUENCE)
testLostStart <- mcols(y)$CONSEQUENCE=="nonsynonymous" & 
    sapply(mcols(y)$PROTEINLOC, function(x) { 1 %in% x } )
if(sum(testLostStart) > 0) {
    mcols(y)[which(testLostStart),"CONSEQUENCE"] <- "start_lost"
}

The same question applies to stop codons - we biologists are very interested in SNPs that cause loss of a stop codon, but I am guessing those won't be detected by predictCoding() (my guess is just based on the possible values of CONSEQUENCE).

What would you think about also checking for stop_lost mutations?

thanks for thinking about it! all the best,

Janet

Can I read in a VCF "file" as a character string?

First off, thanks for an excellent tool.

This is a bit of a niche issue, but I was wondering is it possible to pass readVCF a character string instead of a filename?

I have a database in which VCF files have been inserted as a blob variable. My current implementation:

  1. Reads the VCF character string from the database
  2. Writes it to a tempfile
  3. Passes the tempfile to readVCF to create a CollapsedVCF object
  4. Deletes the temp file

For the most part, this isn't too much of a performance hit, but I have some 100+ mb VCF files in which the file i/o takes longer than I would like. If I could go directly from character string to CollapsedVCF it would be much more efficient.

REFCODON not consistent with REF

Hello VariantAnnotation team,
I am using the predictCoding function to annotate the coding consequence of SNPs in my VCF file using a custom GTF file. Here is what I ran:

txdb = makeTxDbFromGFF(file="my_genepred.gtf", format="gtf")
fa = open(FaFile("GRCh37.primary_assembly.genome.fa"))
vcf_file = readVcf(file = "my_vcf.vcf")
effects = predictCoding(vcf_file, txdb, fa)

In the result "effects" object, most of the annotation was correct, but some REFCODON or VARCODON were not consistent with REF or ALT. For example, here is the SNP annotation of one gene (I only showed the 'GENEID', "POS", "REFCODON", "VARCODON", "REF", "ALT" columns).

GENEID     POS REFCODON VARCODON REF ALT
ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG 9727738      TGA      TCA   A   G
ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG 9728777      AAA      AAA   C   T
ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG 9728786      AGT      ACT   T   G

You can see that the consequence of A>G mutation if codon TGA>TCA, which doesn't make sense? I went back to my VCF and at chr19:9727738 there is indeed A>G mutation. So it seems that there is a bug of the codon annotation.

I am also attaching the GTF annotation of this gene in case there is problem of my GTF file.
chr19 master_genepred.txt transcript 9720432 9731906 . - . gene_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; transcript_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG";
chr19 master_genepred.txt exon 9720432 9722012 . - . gene_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; transcript_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; exon_number "1"; exon_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG.1";
chr19 master_genepred.txt CDS 9721984 9722012 . - 2 gene_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; transcript_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; exon_number "1"; exon_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG.1";
chr19 master_genepred.txt exon 9727721 9727847 . - . gene_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; transcript_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; exon_number "2"; exon_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG.2";
chr19 master_genepred.txt CDS 9727721 9727847 . - 0 gene_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; transcript_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; exon_number "2"; exon_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG.2";
chr19 master_genepred.txt exon 9728767 9728855 . - . gene_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; transcript_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; exon_number "3"; exon_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG.3";
chr19 master_genepred.txt CDS 9728767 9728799 . - 0 gene_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; transcript_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; exon_number "3"; exon_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG.3";
chr19 master_genepred.txt exon 9730108 9730258 . - . gene_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; transcript_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; exon_number "4"; exon_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG.4";
chr19 master_genepred.txt exon 9731838 9731906 . - . gene_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; transcript_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; exon_number "5"; exon_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG.5";
chr19 master_genepred.txt start_codon 9728797 9728799 . - 0 gene_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; transcript_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; exon_number "3"; exon_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG.3";
chr19 master_genepred.txt stop_codon 9721981 9721983 . - 0 gene_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; transcript_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG"; exon_number "1"; exon_id "ENST00000326044.9_2:chr19:-|22|2017:277:469|truncation|ATG.1";

Thank you!

Files do not exist error

The VCF file I'm using is valid as far as other tools such as bcftools, gatk VariantsToTable, are concerned. However, I'm getting this error reading it in:

> list.files(path="large_files") %>% grep("*.vcf.gz", ., value = T)
[1] "161229_1634201813.vcf.gz" "tmp.vcf.gz"              
> fl <- system.file("extdata", "large_files/161229_1634201813.vcf.gz", package="VariantAnnotation")
> vcf <- readVcf(fl, "hg19")
Error in .io_check_exists(path(con)) : file(s) do not exist:
  ‘’

Suggestions?

indexVcf Fails if VCF File Not Compressed

I looked at the documentation of indexVcf and it says nothing about requiring compressed files. However,

indexVcf("primaryTissue.merged.annotated.vcf")
Error in value[[3L]](cond) : file does not appear to be bgzip'd

suppressing 'select()' returned 1:1 mapping

Hey!

Thanks for a great package! I'm currently replacing my VEP dependency with this and it seems like it's a fair bit faster, and I'm very happy to keep it in R.

A minor annoyance is that locateVariants() keep throwing informative little messages about how many mappings select() finds which is cluttering up the output. Like this:

'select()' returned many:1 mapping between keys and columns
.'select()' returned many:1 mapping between keys and columns
'select()' returned many:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned many:1 mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
'select()' returned many:1 mapping between keys and columns

I been trying to chase down where the message come from and if there is a way to switch them off. There is a (, ...) slot in locateVariants, but the description is only that Additional arguments passed to methods and it's not clear to if I can put anything there to make it run quietly. I tried wrapping it in capture.output(locateVariants(...)) but the messages bypassed that as well.

These messages may originate from a dependency of VariantAnnotation, but maybe someone knows how to get rid of them anyway?

Cheers,
/Christoffer

Converting VCF --> data.table

Hello,

We (myself and @Al-Murphy) use VariantAnnotation in our package MungeSumstats and are looking for a way to efficiently convert the (collapsed) VCF format to data.table (since the rest of our munging pipeline relies on everything being in table format).

We've adapted solutions from various forums but it's still rather slow.
https://github.com/neurogenomics/MungeSumstats/blob/master/R/vcf2df.R

I was wondering if you know of a more efficient way to convert VCF to data.table (or a regular data.frame), or could see ways that our existing function could be improved?

Many thanks in advance,
Brian

Filtering VCF by ranges

Would be useful to filter a VCF incrementally by range. Range-based iteration makes sense when e.g. removing duplicates. Right now filterVcf() only works by yield count. The GenomicFiles package makes reduction easier but not necessarily filtering at the file level. At least, side effects are undesirable in that functional framework.

merging VCF objects

It would be nice to have an efficient utility to merge two VCF objects that have differing (but often overlapping) sets of variants and samples. I guess the assay matrices would be filled with NAs. Maybe assume that the rowData() and colData() have the same columns, for now. Also assume that only the ranges and alt need to match (like VRanges matching). Currently we are using bcftools for this, but it would be nice to move to a more integrated workflow.

I guess a simple implementation would be to coerce the two (or more) VCF objects to VRanges objects, c() them together and then coerce back to VCF, preserving the info() and geno() distinction.

API to change output suffix to .gz

The majority of VCF handling bioinformatics libraries use a .vcf.gz suffix, even for block gziped output. writeVcf() with index=TRUE does not support this and forceably sets the suffix to .bgz.

The following commands do exactly the same thing:

writeVcf(vcf, "example.vcf", index=TRUE)
writeVcf(vcf, "example.vcf.bgz", index=TRUE)
writeVcf(vcf, "example.vcf.gz", index=TRUE)

Desired behaviour: specifying a .vcf.gz as the output file, actually writes to the output file instead of silently changing the suffix of the output file to .vcf.bgz.

FILTER Replacement Causes Error

Wouldn't this be a reasonable action?

Browse[1]> filt(variants) <- "PASS"
Error in .checkLength(x, length(value)) : 
  length(value) must equal length(rowRanges(x)

readVcfStack(param=ScanVcfParam()) doesn't work if "which=" is not specified.

As the title indicates, I have to add "which=rowRanges()" to make the readVcfStack work when using param. readVcf works as expected. Issue only exist in Bioc devel. Coding in bioc devel docker.

library(VariantAnnotation)

extdata <- system.file(package="GenomicFiles", "extdata")
files <- dir(extdata, pattern="^CEUtrio.*bgz$", full=TRUE)
names(files) <- sub(".*_([0-9XY]+).*", "\\1", basename(files))
seqinfo <- as(readRDS(file.path(extdata, "seqinfo.rds")), "Seqinfo")

stack <- VcfStack(files, seqinfo)
rstack <- RangedVcfStack(stack)

readVcfStack(rstack)  ## works                                                                                                                                                                                 
readVcfStack(rstack, param = ScanVcfParam())  ## doesn't work                                                                                                                                                  
readVcfStack(rstack, param = ScanVcfParam(which = rowRanges(rstack))) ## workaround                                                                                            

?ScanVcfParam
## which: ... If ‘which’ is not specified all ranges are returned.                                                                                                                                             

## readVcf
vf <- VcfFile(files[1])
readVcf(vf, "hg19")
readVcf(vf, "hg19", param = ScanVcfParam())  ## works 

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] GenomicFiles_1.29.0 rtracklayer_1.53.1
[3] BiocParallel_1.27.12 VariantAnnotation_1.39.0
[5] Rsamtools_2.9.1 Biostrings_2.61.2
[7] XVector_0.33.0 SummarizedExperiment_1.23.4
[9] Biobase_2.53.0 GenomicRanges_1.45.0
[11] GenomeInfoDb_1.29.8 IRanges_2.27.2
[13] S4Vectors_0.31.5 MatrixGenerics_1.5.4
[15] matrixStats_0.61.0 BiocGenerics_0.39.2

loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 lattice_0.20-45 prettyunits_1.1.1
[4] png_0.1-7 assertthat_0.2.1 digest_0.6.28
[7] utf8_1.2.2 BiocFileCache_2.1.1 R6_2.5.1
[10] RSQLite_2.2.8 httr_1.4.2 pillar_1.6.3
[13] zlibbioc_1.39.0 rlang_0.4.11 GenomicFeatures_1.45.2
[16] progress_1.2.2 curl_4.3.2 rstudioapi_0.13
[19] blob_1.2.2 Matrix_1.3-4 stringr_1.4.0
[22] RCurl_1.98-1.5 bit_4.0.4 biomaRt_2.49.4
[25] DelayedArray_0.19.4 compiler_4.1.1 pkgconfig_2.0.3
[28] tidyselect_1.1.1 KEGGREST_1.33.0 tibble_3.1.4
[31] GenomeInfoDbData_1.2.7 XML_3.99-0.8 fansi_0.5.0
[34] crayon_1.4.1 dplyr_1.0.7 dbplyr_2.1.1
[37] GenomicAlignments_1.29.0 bitops_1.0-7 rappdirs_0.3.3
[40] grid_4.1.1 lifecycle_1.0.1 DBI_1.1.1
[43] magrittr_2.0.1 stringi_1.7.4 cachem_1.0.6
[46] xml2_1.3.2 ellipsis_0.3.2 filelock_1.0.2
[49] vctrs_0.3.8 generics_0.1.0 rjson_0.2.20
[52] restfulr_0.0.13 tools_4.1.1 bit64_4.0.5
[55] BSgenome_1.61.0 glue_1.4.2 purrr_0.3.4
[58] hms_1.1.1 yaml_2.2.1 parallel_4.1.1
[61] fastmap_1.1.0 AnnotationDbi_1.55.1 memoise_2.0.0
[64] BiocIO_1.3.0

installation fail with error undefined symbol: hts_get_bgzfp

Hi
I tried to install VariantAnnotation by BiocManager from bioconductor at R 3.6.0 (I have to use R 3.6 to compatible for another VariantAnnotation dependent package). I kept getting error

Error: package or namespace load failed for ‘VariantAnnotation’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/suc1/R/x86_64-pc-linux-gnu-library/3.6/00LOCK-VariantAnnotation/00new/VariantAnnotation/libs/VariantAnnotation.so':
  /home/suc1/R/x86_64-pc-linux-gnu-library/3.6/00LOCK-VariantAnnotation/00new/VariantAnnotation/libs/VariantAnnotation.so: undefined symbol: hts_get_bgzfp

Anyone had similar problem and solved it before?

hts_get_bgzfp should be a part of htslib. it is loaded in my system. Still not working.

Speed up `readVcf`

Splitting the Issue originally raised here about improving the efficiency of readVcf.

@lawremi suggested using chunking as is currently implemented in writeVcf here. I imagine this could be further improved through parallelization of that chunking procedure.

This should be quite analogous to implement in readVcf. Following up on discussion here, would this be something that @vjcitn @mtmorgan @vobencha @lawremi or one of the other VariantAnnotation co-authors would be able to implement?

Many thanks in advance,
Brian

extractROWS() on a CollapsedVCF instance returns an invalid object.

@hpages VariantAnnotation fails in devel https://master.bioconductor.org/checkResults/3.11/bioc-LATEST/VariantAnnotation/malbec2-buildsrc.html because extractROWS() returns an invalid CollapsedVCF object. A simple example is

fl <- system.file("extdata", "ex2.vcf",  package="VariantAnnotation")
vcf <- readVcf(fl, genome="hg19")
validObject(extractROWS(vcf, 1:2))

leading to

Error in validObject(extractROWS(vcf, 1:2)) :
  invalid class "CollapsedVCF" object: 1: 'fixed(object)' and 'rowRanges(object) must have the same number of rows
invalid class "CollapsedVCF" object: 2: 'info' must have the same number of rows as 'rowRanges'

It seems like the solution is to add

diff --git a/R/methods-VCF-class.R b/R/methods-VCF-class.R
index b1305c2..ae8e4ce 100644
--- a/R/methods-VCF-class.R
+++ b/R/methods-VCF-class.R
@@ -296,6 +296,10 @@ setMethod("vcfFields", "VCF",
 ### Subsetting and combining
 ###
 
+setMethod(vertical_slot_names, "VCF", function(x) {
+    c("fixed", "info", callNextMethod())
+})
+
 setMethod("[", c("VCF", "ANY", "ANY"),
     function(x, i, j, ..., drop=TRUE)
 {

Does that mean the [ and [<- methods immediately following this can be removed?

predictcoding - multiple mutations at a codon are never evaluated additively

I have HIV sequence data, where for three consecutive positions I have SNPS. These three all code for SNPS in the same codon.
predictcoding() returns the independently evaluated REFAA and VARAA, fine its expected behaviour. But a warning that "hey these variant amino acid mutations aren't true" would be nice!

image

scanBcfHeader _hts_rewind() error in 1.29.19

One of my unit tests in SeqArray is now failing to read an example file:

vcffile <- SeqArray::seqExampleFileName("vcf")
vcf <- readVcf(vcffile, genome="hg19")
Error in scanBcfHeader(bf) : [internal] _hts_rewind() failed

Problems in parsing VCF fields containing multiple entries separated by commas (example with Mutect and Freebayes)

Hi there,

I really love this package and I wrote one function to parse several type of VCF output (from several caller) into a standardised format with standardised columns (https://annaquaglieri16.github.io/varikondo/articles/vignette.html)

However, there is a problem when parsing VCF fields that contain more than one entry separated by a comma. For example, the QSS field in MuTect2 contains base_quality_ref,base_quality_alt'. VariantAnnotation` only reads in the first column. Below is a reproducible example.

  • VariantAnnotation
library(varikondo)

mutect_vcf <- system.file("extdata", "germline_mutect.vcf", package = "varikondo")
vcf <- VariantAnnotation::readVcf(mutect_vcf)

head(VariantAnnotation::geno(vcf)$QSS)
                   SRX381851
chr1:36933772_C/T  15571    
chr1:36935370_T/C  13932    
chr1:36937059_A/G  7351     
chr1:36939108_T/C  10997    
chr1:36941395_G/C  6697     
chr1:36941539_G/GT 5449     
  • read.table
vcf_read_table <- read.table(mutect_vcf)
head(vcf_read_table[,c("V9","V10")])
                                                      V9                                               V10
1 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1       0/1:523,9:0.017:7:2:0.222:15571,247:261:262
2 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1       0/1:482,7:0.014:5:2:0.714:13932,203:257:225
3 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1 0/1:249,239:0.504:109:130:0.544:7351,6930:123:126
4 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1       0/1:377,6:0.020:5:1:0.833:10997,186:176:201
5 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1         0/1:223,3:0.016:1:2:0.333:6697,93:122:101
6 GT:AD:AF:ALT_F1R2:ALT_F2R1:FOXOG:QSS:REF_F1R2:REF_F2R1         0/1:182,156:0.469:89:67:.:5449,4636:86:96

Which shows how the QSS field (second to last) is reported :/.
I also try the SeqArray package and it also throw an error trying to read this field :/

At the moment I will tray to fix it by using read.table but is there a chance that this can be updated?

I actually also noted a similar problem when trying to read in Freebayes output. Freebayes reports several alternative alleles (if they are present) separated by commas. Only the first one is reported with VariantAnnotation.

Thanks!!

Anna

Subset Fails if Info and Geno Excluded

Minimal example:

library(VariantAnnotation)
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19", param = ScanVcfParam(info = NA, geno = NA))
subset(vcf, FILTER == "PASS") # Works well without param setting above.
# Error: subscript is a logical vector with out-of-bounds TRUE values
vcf[rowRanges(vcf)$FILTER == "PASS", ] # Workaround

The environment:

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

other attached packages:
 [1] VariantAnnotation_1.34.0    Rsamtools_2.4.0             Biostrings_2.56.0           XVector_0.28.0             
 [5] SummarizedExperiment_1.18.1 DelayedArray_0.14.0         matrixStats_0.56.0          Biobase_2.48.0             
 [9] GenomicRanges_1.40.0        GenomeInfoDb_1.24.0         IRanges_2.22.1              S4Vectors_0.26.0           
[13] BiocGenerics_0.34.0

REF ALT QUAL FILTER fields can no longer be modified via rowRanges()

Assigning values to one of these field should updated the value in the vcf object. It does not.

Example code:

vcf <- VCF(rowRanges=GRanges(seqnames="chr1", ranges=IRanges(start=1, width=1)), fixed = DataFrame(QUAL=1))
rowRanges(vcf)$QUAL <- 7
all(rowRanges(vcf)$QUAL == 7)

Version: VariantAnnotation_1.24.5
Expected behaviour: QUAL field is updated to 7 for all rows.
Actual behaviour: nothing happens.

Previous versions of VariantAnnotation behaved as expected. The change log indicates:

'rowRanges<-' and 'mcols<-' on VCF class behave as they do on RangedSummarizedExperiment

I suspect this bug is a regression associated with that change. Note that it is only the fields in the fixed slot that have become immutable via the rowRanges() accessor. Other fields behave as expected.

Import Error for bcftools Compressed Files

VCF files that are filtered using bcftools and have their data compressed by specifying -O z as part of the command can't be imported into R because of the error "scanBcfHeader(bf) : [internal] _hts_rewind() failed". Decompressing the file and re-compressing it using the R function bgzip enables it to be successfully imported, but it seems an inefficient workaround and it's unclear why it's required. This issue has been reported for an older version of VariantAnnotation #22

writeVcf save vcf with invalid contig information

If I read a valid vcf with the last 3.10 Bioconductor version 1.32.0

  • when the vcf file contains a single contig, the ouput contig information is truncated and not valid
vcf <- readVcf("input_singlecontig.vcf")
writeVcf(vcf, "output_singlecontig.vcf")

head of input_singlecontig.vcf

##fileformat=VCFv4.2
##contig=<ID=10,length=135534747>
##ALT=<ID=INV,Description="Inversion">

head of output_singlecontig.vcf

##fileformat=VCFv4.2
##fileDate=20200109
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=LOW_QUAL,Description="Low quality call as specified by 'variantcalling.lowQuality'">
##contig=135534747
  • but works correctly when at least two contigs are present, e.g.
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>

I am attaching the two files so that the error can be reproduced easily (extensions changed to .txt):
input_singlecontig.txt
output_singlecontig.txt

Thanks for you help!

Create protein sequences including variants from a VCF file

Dear BiocTeam,

I am investigating the proteome of human cancer samples and want to insert their genetic variations into the reference proteome fasta sequences to increase the sensitivity of my peptide/protein quantification.

Can you implement this "proteomeVariantInsertion()" in the VariantAnnotation package?

The VariantAnnotation::predictCoding() function already translates codons at variant positions from a reference BSgenome object to assess the consequences of a variant. I would like to take all coding variants (or just non-synonymous SNVs for a start) and insert them into the reference proteome, then save the modified fasta file.

See also my post in the Bioconductor forum.

Thanks,
Daniel

Proposal: String representation/RleMatrix

I'd like to propose a couple of potential optimisations, particularly useful for cases where many entries are the same (e.g. AD: NA NA for GT: "./."). I'm working with a file which compresses nicely to .rds (~40MB) but is very large when in memory (17GB). This size becomes difficult to work with on a local machine or interactively.

Looping over the individual assays in a CompressedVcf I see that the largest ones are those which are matrices of lists (e.g. AD, AF, MBQ, ...). In theory (I think) they contain the same scale of information as GT, but their representation makes them much larger.

I appreciate that this list format makes working with the data cleaner than storing the entries as delimited strings (as I believe they natively are in the VCF) which likely requires parsing the string and splitting into a list-like structure anyway, but this is a tradeoff between size and usability, and applies to the entire dataset even if a user is only interested in a subset.

Would there be interest in a representation of CollapsedVcf list-matrices as either delimited strings (in which case they could be stored as RleMatrix for an additional compression boost; or RleListMatrix (which doesn't exist, but here's an open issue: Bioconductor/DelayedArray#62)? I'm not well-versed enough in the Rle side to know if Rle provides a benefit when stored in a list, but it's an option.

Below is a comparison of object sizes for a toy example 'assay' constructed as a matrix of list elements and the size savings are potentially very large (in this case 1/62 the size).

## create a matrix of list elements
x <- matrix(
  replicate(
    2e4, 
    c(
      sample(c(NA, 1, 2), 1, prob = c(100, 1, 1)), 
      sample(c(NA, 1, 2), 1, prob = c(100, 1, 1))
    ), 
    simplify = FALSE), 
  ncol = 10
)
head(x)
#>      [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]     
#> [1,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#> [2,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#> [3,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#> [4,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#> [5,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#> [6,] Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2 Numeric,2
#>      [,8]      [,9]      [,10]    
#> [1,] Numeric,2 Numeric,2 Numeric,2
#> [2,] Numeric,2 Numeric,2 Numeric,2
#> [3,] Numeric,2 Numeric,2 Numeric,2
#> [4,] Numeric,2 Numeric,2 Numeric,2
#> [5,] Numeric,2 Numeric,2 Numeric,2
#> [6,] Numeric,2 Numeric,2 Numeric,2

## collapse to singleton strings (credit: @lawremi)
v <- unlist(x)
v[is.na(v)] <- "."
x_str <- matrix(
  S4Vectors::unstrsplit(
    c(relist(v, x)), ","), 
  nrow(x), ncol(x))
head(x_str)
#>      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10]
#> [1,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."
#> [2,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."
#> [3,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."
#> [4,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."
#> [5,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."
#> [6,] ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,." ".,."

suppressMessages(library(DelayedArray))

## Rle representation of matrix of list doesn't work
as(x, "RleArray")
#> Error in .Call2("Rle_constructor", values, lengths, PACKAGE = "S4Vectors"): Rle of type 'list' is not supported

## Rle representation of string matrix
x_rle <- as(x_str, "RleArray")
x_rle
#> <2000 x 10> matrix of class RleMatrix and type "character":
#>         [,1]  [,2]  [,3]  ... [,9]  [,10]
#> [1,]    ".,." ".,." ".,." .   ".,." ".,."
#> [2,]    ".,." ".,." ".,." .   ".,." ".,."
#> [3,]    ".,." ".,." ".,." .   ".,." ".,."
#> [4,]    ".,." ".,." ".,." .   ".,." ".,."
#> [5,]    ".,." ".,." ".,." .   ".,." ".,."
#> ...     .     .     .     .   .     .    
#> [1996,] ".,." ".,." ".,1" .   ".,." ".,."
#> [1997,] ".,." ".,." ".,." .   "2,." ".,."
#> [1998,] ".,." ".,." ".,." .   ".,." ".,."
#> [1999,] "1,." ".,." "1,." .   ".,." ".,."
#> [2000,] "2,." ".,." ".,." .   ".,." ".,."

## compare sizes
pryr::object_size(x)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
#> 1.44 MB
pryr::object_size(x_str)
#> 161 kB
pryr::object_size(x_rle)
#> 23 kB

Created on 2020-02-28 by the reprex package (v0.3.0)

I'm not across this package enough to have any insights into implementation issues or how this might affect other aspects of the inner workings (or user-side workings) but this approach reduces a 1.71GB CollapsedVcf into a 20MB object of the same class without destroying any information. Credit to @lawremi for guidance towards optimising the conversion and on structural advice.

Data Representation Efficiency of VCF

I imported a 14 GB VCF (uncompressed) and after a while I noticed it finally took 228 GB RAM when stored in memory (server has 512 GB RAM, so didn't access swap space). Could the package provide a more efficient representation of lots of variants in R?

> system.time(variants <- readVcf("test.vcf"))
    user   system  elapsed 
3314.009  140.141 3455.743

readVcfAsVRanges (Number = A)

Hi,

I have a VCF file with some INFO fields encoded with Number=A. VariantAnnotation::readVcf works fine, but VariantAnnotation::readVcfAsVRanges does not work (Error: is(values, "vector_OR_factor") is not TRUE).

When I edit the VCF, setting the Number fields of the INFO tags to Number=1, readVcfAsVRanges works again. Is there a way to make readVcfAsVRanges work on VCF files with INFO tags encoded with such allele-specific numbering (Number=A)?

sessionInfo:
`
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] VariantAnnotation_1.24.2 Rsamtools_1.30.0
[3] SummarizedExperiment_1.8.1 DelayedArray_0.4.1
[5] matrixStats_0.52.2 Biobase_2.38.0
[7] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.46.0
[9] rtracklayer_1.38.2 Biostrings_2.46.0
[11] XVector_0.18.0 GenomicRanges_1.30.1
[13] GenomeInfoDb_1.14.0 IRanges_2.12.0
[15] S4Vectors_0.16.0 BiocGenerics_0.24.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 compiler_3.4.3 prettyunits_1.0.2
[4] progress_1.1.2 GenomicFeatures_1.30.0 bitops_1.0-6
[7] tools_3.4.3 zlibbioc_1.24.0 biomaRt_2.34.1
[10] digest_0.6.13 bit_1.1-12 RSQLite_2.0
[13] memoise_1.1.0 tibble_1.3.4 lattice_0.20-35
[16] rlang_0.1.6 Matrix_1.2-11 DBI_0.7
[19] GenomeInfoDbData_1.0.0 httr_1.3.1 stringr_1.2.0
[22] bit64_0.9-7 grid_3.4.3 R6_2.2.2
[25] AnnotationDbi_1.40.0 XML_3.98-1.9 RMySQL_0.10.13
[28] BiocParallel_1.12.0 magrittr_1.5 blob_1.1.0
[31] GenomicAlignments_1.14.1 assertthat_0.2.0 stringi_1.1.6
[34] RCurl_1.95-4.8
`

writeVcf error with latest version 1.28.2

Hi,

the latest VariantAnnotation update in stable and devel somehow broke my package PureCN.

library(PureCN)
data(purecn.example.output)
purecnVcf <- predictSomatic(purecn.example.output, return.vcf=TRUE)
writeVcf(purecnVcf, file="Sample1_PureCN.vcf")
Error in if (idx != 1) { : argument is of length zero
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.1

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] PureCN_1.13.5               DNAcopy_1.56.0             
 [3] VariantAnnotation_1.28.2    Rsamtools_1.34.0           
 [5] Biostrings_2.50.1           XVector_0.22.0             
 [7] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
 [9] BiocParallel_1.16.0         matrixStats_0.54.0         
[11] Biobase_2.42.0              GenomicRanges_1.34.0       
[13] GenomeInfoDb_1.18.1         IRanges_2.16.0             
[15] S4Vectors_0.20.1            BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0               lattice_0.20-38          prettyunits_1.0.2       
 [4] assertthat_0.2.0         digest_0.6.18            R6_2.3.0                
 [7] plyr_1.8.4               futile.options_1.0.1     RSQLite_2.1.1           
[10] httr_1.3.1               ggplot2_3.1.0            pillar_1.3.0            
[13] zlibbioc_1.28.0          rlang_0.3.0.1            GenomicFeatures_1.34.1  
[16] progress_1.2.0           lazyeval_0.2.1           data.table_1.11.8       
[19] blob_1.1.1               Matrix_1.2-15            splines_3.5.1           
[22] stringr_1.3.1            RCurl_1.95-4.11          bit_1.1-14              
[25] biomaRt_2.38.0           munsell_0.5.0            compiler_3.5.1          
[28] rtracklayer_1.42.0       pkgconfig_2.0.2          gridExtra_2.3           
[31] tibble_1.4.2             GenomeInfoDbData_1.2.0   XML_3.98-1.16           
[34] crayon_1.3.4             GenomicAlignments_1.18.0 bitops_1.0-6            
[37] grid_3.5.1               gtable_0.2.0             DBI_1.0.0               
[40] magrittr_1.5             formatR_1.5              scales_1.0.0            
[43] stringi_1.2.4            futile.logger_1.4.3      Rhdf5lib_1.4.0          
[46] lambda.r_1.2.3           RColorBrewer_1.1-2       tools_3.5.1             
[49] bit64_0.9-7              BSgenome_1.50.0          hms_0.4.2               
[52] rhdf5_2.26.0             AnnotationDbi_1.44.0     colorspace_1.3-2        
[55] BiocManager_1.30.4       memoise_1.1.0            VGAM_1.0-6  

ScanVcfParam which Parameter Description Inconsistency

It is defined as

A GRanges describing the sequences and ranges to be queried.

But, in the constructor of ScanVcfParam

The ‘which’ argument to the constructor can be one of several types, as documented above.

I am hoping that it's possible to import the first few records using a setting such as which = 1:100 before importing the entire 250 GB gnomAD VCF file to check I've got what I need for an analysis.

Left-pointing breakend ALT variants get truncated because the allele starts with '.'

Breakend parsing works if the ALT is something like ACT. but fails when it points the other way and the ALT is something like .ACT. I've tracked the issue down to vcftype.c:224:

vcftype->u.character[idx] =
            ('.' == *field) ? vcftype->charDotAs : field;

This comparison only check that the first character matches . - it doesn't also check that there are no other characters in the string. I'd put in a PR to replace the above code with a strcmp() check with "." or even a simple && '\0' != *(field + 1) but since I'm unfamiliar with the code base I'm concerned that doing so might have unintentional side effects.

Could you have a look into making the change when you have some time?

Thanks!

Subset No Longer Works For Fixed Columns

I ran this code at the beginning of the year and it worked.

library(VariantAnnotation)
example(readVcf)
subset(vcf, FILTER == "PASS")  # Error in eval(expr, as.env(envir, enclos)) : object 'FILTER' not found

I use Bioconductor 3.12 and all packages are up-to-date. It might be nice to add it to Examples section so it's always checked.

Ambiguous Description of Ellipsis for readVcf

For the ... parameter of readVCF, the description is "Additional arguments, passed to methods. For import, the arguments are passed to readVcf". But there are four other functions documented in the same Usage section. Which methods will ... be passed to in those cases and what parameter names are valid?

How to output all records in INFO fild ?

Hello,
I have 64539 records in a vcf file. The annotation information is in [INFO] fild, and [INFO] isn't the same in each record.
For example,
wechatimg4
Not all the records have the "ExAC_Het_AMR" annotation. How to output @info@listData$ExAC_Het_AMR@unlistData the same length as number of vcd records? And the missing data represented by "NA" or "." ?

Thank you very much !

`writeVcf` doesn't capture ALT Descriptions properly

fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, genome="hg19")
tmp <- tempfile()
writeVcf(vcf, filename=tmp)
lines[grepl("ALT=", lines)]
##  [1] "##ALT=<ID=DEL,Description=Deletion>"                       
##  [2] "##ALT=<ID=DEL:ME:ALU,Description=Deletion of ALU element>" 
##  [3] "##ALT=<ID=DEL:ME:L1,Description=Deletion of L1 element>"   
##  [4] "##ALT=<ID=DUP,Description=Duplication>"                    
##  [5] "##ALT=<ID=DUP:TANDEM,Description=Tandem Duplication>"      
##  [6] "##ALT=<ID=INS,Description=Insertion of novel sequence>"    
##  [7] "##ALT=<ID=INS:ME:ALU,Description=Insertion of ALU element>"
##  [8] "##ALT=<ID=INS:ME:L1,Description=Insertion of L1 element>"  
##  [9] "##ALT=<ID=INV,Description=Inversion>"                      
## [10] "##ALT=<ID=CNV,Description=Copy number variable region>"    

Note the lack of quotes in the Description value. This means that a subsequent readVcf does not recognize this field as a string and discards the ALT information altogether. The cause is here:

} else if(nms == "PEDIGREE" || nms == "ALT") {
if (!is.null(rownames(df)))
df <- DataFrame(ID = rownames(df), df)
.pasteMultiFieldDF(df, nms)

where there is a lack of special-casing for the Description column, which is present in the general case.

Error in VCF parsing with VariantAnnotation

Hi,

I have been annotating VCF files with VEP.

utils::download.file("https://i12g-gagneurweb.in.tum.de/public/bugreports/bioc_variantAnnotation/example_no_anno.vcf.gz", "example_no_anno.vcf.gz")
utils::download.file("https://i12g-gagneurweb.in.tum.de/public/bugreports/bioc_variantAnnotation/example_vep_anno.vcf.gz", "example_vep.vcf.gz")

VEP command on the command line

vep -i example_no_anno.vcf.gz --vcf TRUE --output_file example_vep.vcf.gz --compress_output bgzip --minimal TRUE --allele_number TRUE --everything TRUE --assembly GRCh37 --db_version 94 --merged TRUE --user anonymous --port 3337 --host ensembldb.ensembl.org --cache TRUE --dir dir_cache/ensembl-vep/94/cachedir --sift s --polyphen s --total_length TRUE --numbers TRUE --symbol TRUE --hgvs TRUE --ccds TRUE --uniprot TRUE --xref_refseq TRUE --af TRUE --max_af TRUE --af_exac TRUE --af_gnomad TRUE --pubmed TRUE --canonical TRUE --biotype TRUE

However after reading the annotated VCF file, some lines seem to be randomly split and parsed as a new line. In a minimal example with 1 variant, I end up with 2 entries in R, where the second one has half of the info column as chromosome names. Could this be a bug?

library(VariantAnnotation)

# plain vcf file
vcf <- readVcf("example_no_anno.vcf.gz")
colData(vcf)
dim(vcf)
str(seqlevels(vcf))

# annotated with VEP 
# contains very long line but no errors in the format
vcf <- readVcf("example_vep.vcf.gz")
colData(vcf)
dim(vcf)
str(seqlevels(vcf)[1])
str(seqlevels(vcf)[2])
str(seqlevels(vcf)[3])

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.6 (Nitrogen)

Matrix products: default
BLAS: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRblas.so
LAPACK: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] plotly_4.8.0                ggpubr_0.2                  scales_1.0.0                cowplot_0.9.4               tidyr_0.8.2                 ggplot2_3.1.0              
 [7] magrittr_1.5                ensemblVEP_1.24.0           AnnotationHub_2.14.2        dplyr_0.7.8                 data.table_1.12.0           VariantAnnotation_1.28.7   
[13] Rsamtools_1.34.0            Biostrings_2.50.2           XVector_0.22.0              SummarizedExperiment_1.12.0 DelayedArray_0.8.0          BiocParallel_1.16.5        
[19] matrixStats_0.54.0          Biobase_2.42.0              GenomicRanges_1.34.0        GenomeInfoDb_1.18.1         IRanges_2.16.0              S4Vectors_0.20.1           
[25] BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] httr_1.4.0                    viridisLite_0.3.0             jsonlite_1.6                  bit64_0.9-7                   shiny_1.2.0                   assertthat_0.2.0             
 [7] interactiveDisplayBase_1.20.0 BiocManager_1.30.4            blob_1.1.1                    BSgenome_1.50.0               GenomeInfoDbData_1.2.0        yaml_2.2.0                   
[13] progress_1.2.0                pillar_1.3.1                  RSQLite_2.1.1                 lattice_0.20-38               glue_1.3.0                    digest_0.6.18                
[19] promises_1.0.1                colorspace_1.3-2              htmltools_0.3.6               httpuv_1.4.5.1                Matrix_1.2-15                 plyr_1.8.4                   
[25] XML_3.98-1.16                 pkgconfig_2.0.2               biomaRt_2.38.0                zlibbioc_1.28.0               purrr_0.2.5                   xtable_1.8-3                 
[31] later_0.7.5                   tibble_2.0.0                  DT_0.5                        withr_2.1.2                   GenomicFeatures_1.34.1        lazyeval_0.2.1               
[37] crayon_1.3.4                  mime_0.6                      memoise_1.1.0                 tools_3.5.1                   prettyunits_1.0.2             hms_0.4.2                    
[43] stringr_1.3.1                 munsell_0.5.0                 AnnotationDbi_1.44.0          bindrcpp_0.2.2                compiler_3.5.1                rlang_0.3.1                  
[49] grid_3.5.1                    RCurl_1.95-4.11               rstudioapi_0.9.0              htmlwidgets_1.3               labeling_0.3                  bitops_1.0-6                 
[55] gtable_0.2.0                  DBI_1.0.0                     R6_2.3.0                      GenomicAlignments_1.18.1      rtracklayer_1.42.1            bit_1.1-14                   
[61] bindr_0.1.1                   stringi_1.2.4                 Rcpp_1.0.0                    tidyselect_0.2.5

Please let me know, if you need more input to replicate this error.

Best,
Michaela Müller

CollapsedVRanges?

@lawremi @vobencha @hpages
Hi,
I am writing a package and want to use a data class very similar to VRanges but needs some modification. Is it possible to add a CollapsedVRanges/ExpandedVRanges, with slots of REF/ALT in the DNAStringSet/DNAStringSetList class?
Such a class could be a simple and useful data structure for variants from other formats, such as maf/bed, instead of VCF.
Thanks!
Qiang

writeVcf: Error in recycleSingleBracketReplacementValue

Hi,

my package started to fail in devel because of an error in writeVcf. There are a couple of warnings where the genome version is ""hg19"" instead of "hg19". Not sure that's related and an error on my side.

Thanks,
Markus

@vobencha

>library(PureCN)
>data(purecn.example.output)
>purecnVcf <- predictSomatic(purecn.example.output, return.vcf=TRUE)
>writeVcf(purecnVcf, file="Sample1_PureCN.vcf")
Error in recycleSingleBracketReplacementValue(value, x) : 
  replacement has length zero
>sessionInfo()
R Under development (unstable) (2019-03-18 r76245)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] PureCN_1.13.21              VariantAnnotation_1.29.23  
 [3] Rsamtools_1.99.3            Biostrings_2.51.5          
 [5] XVector_0.23.2              SummarizedExperiment_1.13.0
 [7] DelayedArray_0.9.8          BiocParallel_1.17.18       
 [9] matrixStats_0.54.0          Biobase_2.43.1             
[11] GenomicRanges_1.35.1        GenomeInfoDb_1.19.2        
[13] IRanges_2.17.4              S4Vectors_0.21.16          
[15] BiocGenerics_0.29.2         DNAcopy_1.57.0             

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0               lattice_0.20-38          prettyunits_1.0.2       
 [4] assertthat_0.2.1         digest_0.6.18            R6_2.4.0                
 [7] plyr_1.8.4               futile.options_1.0.1     RSQLite_2.1.1           
[10] httr_1.4.0               ggplot2_3.1.0            pillar_1.3.1            
[13] zlibbioc_1.29.0          rlang_0.3.1              GenomicFeatures_1.35.8  
[16] progress_1.2.0           lazyeval_0.2.1           data.table_1.12.0       
[19] blob_1.1.1               Matrix_1.2-17            splines_3.6.0           
[22] stringr_1.4.0            RCurl_1.95-4.12          bit_1.1-14              
[25] biomaRt_2.39.2           munsell_0.5.0            compiler_3.6.0          
[28] rtracklayer_1.43.3       pkgconfig_2.0.2          tibble_2.1.1            
[31] gridExtra_2.3            GenomeInfoDbData_1.2.0   XML_3.98-1.19           
[34] crayon_1.3.4             GenomicAlignments_1.19.1 bitops_1.0-6            
[37] grid_3.6.0               gtable_0.2.0             DBI_1.0.0               
[40] magrittr_1.5             formatR_1.6              scales_1.0.0            
[43] stringi_1.4.3            futile.logger_1.4.3      Rhdf5lib_1.5.1          
[46] lambda.r_1.2.3           RColorBrewer_1.1-2       tools_3.6.0             
[49] bit64_0.9-7              BSgenome_1.51.0          hms_0.4.2               
[52] rhdf5_2.27.15            AnnotationDbi_1.45.1     colorspace_1.4-0        
[55] memoise_1.1.0            VGAM_1.1-1     

Conversion to VRanges Error

I think that this worked in the past. I recently began working with Bioconductor 3.12.

fl <- system.file("extdata", "ex2.vcf", package = "VariantAnnotation")
vcf <- readVcf(fl, genome = "hg19")
as(vcf, "VRanges") # Error in ans[] <- x : replacement has length zero

scanVcf not working with a connection and streaming

Hi,

I was just trying to stream through a very big VCF because it is too big to store in memory and saw that scanVcf has that capability theoretically. However I get an arrow instead.

Here the reproducible example

fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf_file <- file(fl)
vcf_line <- scanVcf(file=vcf_file,n=1)

which just says
"Error: $ operator is invalid for atomic vectors"

but I CAN use readLine(vcf_file,n=1)

also the session info

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/openblas/0.3.6/lib/libopenblasp-r0.3.6.dylib

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.30.0                R6_2.4.0                                org.Hs.eg.db_3.8.2                      TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
 [5] GenomicFeatures_1.36.0                  AnnotationDbi_1.46.0                    rtracklayer_1.44.0                      stringdist_0.9.5.1                     
 [9] ape_5.3                                 grImport_0.9-2                          XML_3.98-1.19                           Rsamtools_2.0.0                        
[13] Biostrings_2.52.0                       XVector_0.24.0                          SummarizedExperiment_1.14.0             DelayedArray_0.10.0                    
[17] BiocParallel_1.18.0                     matrixStats_0.54.0                      Biobase_2.44.0                          GenomicRanges_1.36.0                   
[21] GenomeInfoDb_1.20.0                     IRanges_2.18.0                          S4Vectors_0.22.0                        BiocGenerics_0.30.0                    
[25] dplyr_0.8.1                             data.table_1.12.2                       RColorBrewer_1.1-2                      kableExtra_1.1.0                       
[29] knitr_1.22                             

loaded via a namespace (and not attached):
 [1] colorspace_1.4-1                  htmlTable_1.13.1                  base64enc_0.1-3                   rstudioapi_0.10                   bit64_0.9-7                      
 [6] xml2_1.2.0                        splines_3.6.0                     geneplotter_1.62.0                Formula_1.2-3                     annotate_1.62.0                  
[11] cluster_2.0.9                     BiocManager_1.30.4                readr_1.3.1                       compiler_3.6.0                    httr_1.4.0                       
[16] backports_1.1.4                   assertthat_0.2.1                  Matrix_1.2-17                     lazyeval_0.2.2                    acepack_1.4.1                    
[21] htmltools_0.3.6                   prettyunits_1.0.2                 tools_3.6.0                       igraph_1.2.4.1                    gtable_0.3.0                     
[26] glue_1.3.1                        GenomeInfoDbData_1.2.1            reshape2_1.4.3                    fastmatch_1.1-0                   Rcpp_1.0.1                       
[31] limSolve_1.5.5.3                  nlme_3.1-140                      xfun_0.7                          stringr_1.4.0                     rvest_0.3.3                      
[36] lpSolve_5.6.13                    MASS_7.3-51.4                     zlibbioc_1.30.0                   scales_1.0.0                      BSgenome_1.52.0                  
[41] hms_0.4.2                         yaml_2.2.0                        memoise_1.1.0                     gridExtra_2.3                     ggplot2_3.1.1                    
[46] biomaRt_2.40.0                    rpart_4.1-15                      latticeExtra_0.6-28               stringi_1.4.3                     RSQLite_2.1.1                    
[51] highr_0.8                         genefilter_1.66.0                 checkmate_1.9.3                   rlang_0.3.4                       pkgconfig_2.0.2                  
[56] bitops_1.0-6                      evaluate_0.13                     lattice_0.20-38                   purrr_0.3.2                       labeling_0.3                     
[61] GenomicAlignments_1.20.0          htmlwidgets_1.3                   cowplot_0.9.4                     bit_1.1-14                        tidyselect_0.2.5                 
[66] BSgenome.Hsapiens.UCSC.hg19_1.4.0 plyr_1.8.4                        magrittr_1.5                      DESeq2_1.24.0                     Hmisc_4.2-0                      
[71] DBI_1.0.0                         withr_2.1.2                       pillar_1.4.0                      foreign_0.8-71                    survival_2.44-1.1                
[76] RCurl_1.95-4.12                   nnet_7.3-12                       tibble_2.1.1                      crayon_1.3.4                      rmarkdown_1.12                   
[81] progress_1.2.1                    locfit_1.5-9.1                    blob_1.1.1                        digest_0.6.18                     webshot_0.5.1                    
[86] xtable_1.8-4                      munsell_0.5.0                     viridisLite_0.3.0                 quadprog_1.5-7                   

Different output on same command and same input file, different library version

I am running a variant calling pipeline to detect solid tumours and haematological alterations. The variant calling is done with Mutect2, whereas I am using an R script to perform some filtering on the VCF produced by Mutect

I noticed, however, that the output of the header command (VariantAnnotation library, R programming language) is not the same considering two versions (1.24.5 vs. 1.32.0).

When using v 1.32.0 I get

> header(input)
class: VCFHeader 
samples(1): Panel-STHT-T210986-PJR
meta(6): MutectVersion fileformat ... GATKCommandLine contig
fixed(1): FILTER
info(16): CONTQ DP ... STR TLOD
geno(13): GT AD ... SAAF SAPP

Whereas when using v 1.24.5 I get fixed(0): instead of fixed(1): FILTER, as shown above:

> header(input)
class: VCFHeader 
samples(1): Panel-STHT-T210986-PJR
meta(3): META GATKCommandLine contig
fixed(0):
info(16): CONTQ DP ... STR TLOD
geno(13): GT AD ... SAAF SAPP

I also noticed that if I continue with the filtering that the R script does (eventually writing a vcf with writeVcf), with VariantAnnotation v 1.32.0 I get an extra line in the vcf header: ##FILTER=All filters passed, which is upsetting another downstream program (the error message that I get is .vcf: Invalid FILTER line). And it appears to me that this ##FILTER=All filters passed line is actually coming from the extra line of the header command.

The easiest solution for me would be to downgrade from 1.32.0 to 1.24.5: However, I would like to understand where this extra line comes from

Session information (when using v1.32.0 of VariantAnnotation is shown below:

R version 3.6.1 (2019-07-05)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 8 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8          LC_NUMERIC=C                 
 [3] LC_TIME=de_BE.UTF-8           LC_COLLATE=en_GB.UTF-8       
 [5] LC_MONETARY=de_BE.UTF-8       LC_MESSAGES=en_GB.UTF-8      
 [7] LC_PAPER=de_BE.UTF-8          LC_NAME=de_BE.UTF-8          
 [9] LC_ADDRESS=de_BE.UTF-8        LC_TELEPHONE=de_BE.UTF-8     
[11] LC_MEASUREMENT=de_BE.UTF-8    LC_IDENTIFICATION=de_BE.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] httr_1.4.1                  stringr_1.4.0              
 [3] readr_1.3.1                 xlsx_0.6.3                 
 [5] VariantAnnotation_1.32.0    Rsamtools_2.2.3            
 [7] Biostrings_2.54.0           XVector_0.26.0             
 [9] SummarizedExperiment_1.16.1 DelayedArray_0.12.2        
[11] BiocParallel_1.20.1         matrixStats_0.55.0         
[13] Biobase_2.46.0              GenomicRanges_1.38.0       
[15] GenomeInfoDb_1.22.0         IRanges_2.20.2             
[17] S4Vectors_0.24.3            BiocGenerics_0.32.0        
[19] optparse_1.6.4             

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2               lattice_0.20-38          prettyunits_1.0.2       
 [4] xlsxjars_0.6.1           assertthat_0.2.1         zeallot_0.1.0           
 [7] digest_0.6.15            BiocFileCache_1.10.2     R6_2.2.2                
[10] backports_1.1.4          RSQLite_2.1.1            pillar_1.4.2            
[13] zlibbioc_1.32.0          rlang_0.4.0              GenomicFeatures_1.38.2  
[16] progress_1.2.2           curl_4.2                 blob_1.1.1              
[19] Matrix_1.2-17            RCurl_1.98-1.1           bit_1.1-14              
[22] biomaRt_2.42.0           compiler_3.6.1           rtracklayer_1.46.0      
[25] pkgconfig_2.0.3          askpass_1.1              openssl_1.4.1           
[28] tidyselect_0.2.5         tibble_2.1.3             GenomeInfoDbData_1.2.2  
[31] XML_3.98-1.11            crayon_1.3.4             dplyr_0.8.3             
[34] dbplyr_1.4.2             GenomicAlignments_1.22.1 bitops_1.0-6            
[37] rappdirs_0.3.1           grid_3.6.1               DBI_1.0.0               
[40] magrittr_1.5             stringi_1.4.3            getopt_1.20.3           
[43] vctrs_0.2.0              tools_3.6.1              bit64_0.9-7             
[46] BSgenome_1.54.0          glue_1.3.1               purrr_0.3.2             
[49] hms_0.5.1                AnnotationDbi_1.48.0     memoise_1.1.0           
[52] rJava_0.9-11

writeVcf save invalid vcf with metadata out of order

If I read a valid vcf with the last 3.8 Bioconductor version 1.28.1

param_T <- ScanVcfParam(sample = samples_names[1],  geno = c("AD", "MQ", "DP", "AF"), info = c("DP","AF", "CALLERS"))
  Tumor <- readVcf(file = vcf_file[i], genome = "GRCH37", param = param_T)
  

head of  vcf_file[i
##fileformat=VCFv4.2
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>

and then I save it with

writeVcf(Tumor, paste0("./vcf/",samples_names[1], ".vcf"), index = F)

the header is out of order ##fileformat=VCFv4.2 is not on the first line and there is a Strange SAMPLE field

Thanks for you support!

##SAMPLE=Tumor
##commandline="/home/pastore/data/bcbio/anaconda/bin/freebayes -f /lila/data/abdelwao-lab/pastorea/bcbio/genomes/Hsapiens/GRCh37/seq/GRCh37.fa --genotype-qualities --strict-vcf --ploidy 2 --targets /lila/data/abdelwao-lab/pastorea/Project/Ben/histiocytosis/hempact/work/freebayes/1/B02_816_T1-1_0_249250621-regions-nolcr.bed --min-repeat-entropy 1 --no-partial-observations --min-alternate-fraction 0.05 --pooled-discrete --pooled-continuous --report-genotype-likelihood-max --allele-balance-priors-off /lila/data/abdelwao-lab/pastorea/Project/Ben/histiocytosis/hempact/work/align/C_K0Y3K8_T001_d/C_K0Y3K8_T001_d-sort.bam /lila/data/abdelwao-lab/pastorea/Project/Ben/histiocytosis/hempact/work/align/PON/PON-sort.bam"
##fileDate=20181115
##fileformat=VCFv4.2
##phasing=none

Unlisted Requirement fails on CentOS 7.4 w MariaDB

I'm trying to install VariantAnnotation in R 3.4.2 on CentOS 7.4

One of the errors is

2: In install.packages(pkgs = doing, lib = lib, ...) :
  installation of package ‘RMySQL’ had non-zero exit status

I attempted to install RMySQL yesterday and that failed too - although there is a fork work around, which is less than ideal.

Two things of note:

  • RMySQL isn't listed as a dependency on a new install of VariantAnnotation?
  • The RMySQL page notes that it is being phased out for RMariaDB

locateVariants returns genes from both Forward and Reverse strands in PRECEDEID and FOLLOWID

Hello,

I'm looking at an intergenic variant (in the bovine genome) and want to figure the genes relative to which it is downstream or upstream, i.e. relative to the gene's position and its strand. I used locateVariants() with region=IntergenicVariants(upstream=1000000, downstream=1000000) in order to do this. However, some results puzzle me.

Here's what my variant looks like in the gene annotation results:

> all_var_df[rownames(all_var_df) == "AX-106756303", ]

             seqnames    start      end width strand   LOCATION LOCSTART LOCEND QUERYID TXID CDSID GENEID    PRECEDEID     FOLLOWID
AX-106756303        1 34617002 34617002     1      * intergenic       NA     NA       1 <NA>         <NA> ENSBTAG0.... ENSBTAG0....

Here's what PRECEDEID looks like:

lapply(all_var_df[rownames(all_var_df) == "AX-106756303", ]$PRECEDEID, function(X) {mapIds( org.Bt.eg.db, keys=X, column="SYMBOL", keytype="ENSEMBL", multiVals="first") } )

ENSBTAG00000019313 ENSBTAG00000016711 ENSBTAG00000003877 ENSBTAG00000044714 ENSBTAG00000020940 ENSBTAG00000020939 ENSBTAG00000001656 ENSBTAG00000045788 ENSBTAG00000006536 
           "ZMIZ1"             "PPIF"          "ZCCHC24"                 NA           "ANXA11"            "PLAC9"          "TMEM254"                 NA             "CL46" 

When looking closer at these genes in Ensembl, I notice that they are all located "to the right" of the SNP location on the forward strand and that some of them are on the Forward strand (e.g. ZMIZ1 and PPIF), while others are on the Reverse strand (e.g. ZCCHC24 and PLAC9):

https://oct2018.archive.ensembl.org/Bos_taurus/Location/View?db=core;g=ENSBTAG00000013264;r=28:34948822-35454825;t=ENSBTAT00000046662

This seems a bit confusing. Shouldn't the variant be considered upstream relative to the two genes on the Forward strand, and downstream relative to the genes in the Reverse strand?

How should the PRECEDEID gene list be interpreted, more precisely?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.