trichelab / biscuiteer Goto Github PK

tools for analysing Biscuit (https://huishenlab.github.io/biscuit/) output

R 97.77% Python 2.23%

biscuiteer's Introduction

biscuiteer

The original luxury biscuit boutique

Wait, no, that's these guys. biscuiteer, on the other hand, is a package to process biscuit output quickly into bsseq objects. A number of features such as VCF header parsing, shrunken M-value calculations (for compartment inference), and tumor/normal copy number segmentation are also included, but the task of locus- and region-level differential methylation inference is delegated to other packages (such as dmrseq).

Installing

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("biscuiteer")

A development version is available on GitHub and can be installed via:

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("trichelab/biscuiteerData")
BiocManager::install("trichelab/biscuiteer")

Usage

Loading data

biscuiteer can load headered or header-free BED-like files as produced from biscuit vcf2bed or biscuit mergecg, but we encourage users to keep their VCF headers (or just the entire VCF, which you will want to do anyways, as biscuit calls SNVs and allows for structural variant detection in a manner similar to typical whole-genome sequencing tools). Since biscuit records the version of the software and the calling arguments used to process a set of files in the output VCF, this allows for much better reproducibility:

# Load the package
#
library(biscuiteer)

# To read in some data, you need:
#   A BED file (with or without a header)
#   A VCF file (only really needs the header)
#   Whether the file is merged CG data or not
#   Any other additional function inputs (genome, region to load, etc)
#
orig_bed <- system.file("extdata", "MCF7_Cunha_chr11p15.bed.gz",
                        package="biscuiteer")
orig_vcf <- system.file("extdata", "MCF7_Cunha_header_only.vcf.gz",
                        package="biscuiteer")
bisc <- readBiscuit(BEDfile = orig_bed, VCFfile = orig_vcf,
                    merged = FALSE)

# To print metadata information from the loaded file:
#
biscuitMetadata(bisc)
#
# CharacterList of length 3
# [["Reference genome"]] hg19.fa
# [["Biscuit version"]] 0.1.3.20160324
# [["Invocation"]] biscuit pileup -r /primary/vari/genomicdata/genomes/hg19/hg1...

# This is all drawn from the VCF header:
#
metadata(bisc)$vcfHeader
#
# class: VCFHeader 
# samples(1): MCF7_Cunha
# meta(5): fileformat reference source contig program
# fixed(1): FILTER
# info(3): NS CX N5
# geno(7): GT DP ... GL GQ

Downstream bits

A/B compartment inference, age estimation from WGBS, and so forth (examples to appear).
To wit, a shorthand summary of hypo- and hyper-methylation at some regions commonly associated with each:

bisc.CpGindex <- CpGindex(bisc)
#
# Computing hypermethylation indices...
# Loading HMM_CpG_islands.hg19...
# Loading H9state23unmeth.hg19...
# Computing hypomethylation indices...
# Loading PMDs.hg19.rda from biscuiteerData...
# Loading Zhou_solo_WCGW_inCommonPMDs.hg19.rda from biscuiteerData...
# Computing indices...

show(bisc.CpGindex)
#
# CpGindex with 1 row and 3 columns
#     hyper.MCF7_Cunha   hypo.MCF7_Cunha ratio.MCF7_Cunha
#            <numeric>         <numeric>        <numeric>
# 1 0.0690734126984127 0.199261516805161 0.34664702851757
#   -------
# This object is just a DataFrame that has an idea of where it came from:
# Hypermethylation was tallied across 120 region (see 'object@hyperMethRegions'). 
# Hypomethylation was tallied across 13127 region (see 'object@hypoMethRegions').

bisc.CpGindex@hyperMethRegions
#
# GRanges object with 120 ranges and 1 metadata column:
#       seqnames            ranges strand |              score
#          <Rle>         <IRanges>  <Rle> |          <numeric>
#     1     chr1 32230201-32230224      * | 0.0399999991059303
#     2     chr1 43638401-43638449      * | 0.0399999991059303
#     3     chr1 44884001-44884005      * | 0.0399999991059303
#     4     chr1 46860401-46860406      * | 0.0599999986588955
#     5     chr1 51435801-51436075      * | 0.0499999998137355
#   ...      ...               ...    ... .                ...
#   116    chr20   8112392-8112400      * | 0.0299999993294477
#   117    chr20 17207801-17208191      * | 0.0599999986588955
#   118    chr22 20004801-20004802      * | 0.0350000001490116
#   119    chr22 37252601-37252731      * | 0.0599999986588955
#   120    chr22 43781850-43781952      * | 0.0449999999254942
#   -------
#   seqinfo: 21 sequences from hg19 genome

Updating documentation

$ make doc. Requires the roxygen2 package.

biscuiteer's People

Contributors

Stargazers

Watchers

Forkers

ttriche heoly32 zwdzwd jordanveldboom biobenkj jamespeapen svad98 ianbed dnbuckley kew24

biscuiteer's Issues

Add BiocStyle to Suggests

Hi. I'm running reverse dependency checks on parallelly, I noticed that your package lacks Suggests: BiocStyle, cf. https://github.com/HenrikBengtsson/parallelly/blob/develop/revdep/problems.md#biscuiteer. Without, it you get:

*** checking re-building of vignette outputs ... WARNING

Error(s) in re-building vignettes:
  ...
--- re-building ‘biscuiteer.Rmd’ using rmarkdown
Error: processing vignette 'biscuiteer.Rmd' failed with diagnostics:
there is no package called ‘BiocStyle’
--- failed re-building ‘biscuiteer.Rmd’

SUMMARY: processing the following file failed:
  ‘biscuiteer.Rmd’

Error: Vignette re-building failed.
Execution halted

Install biscuiteer on R version < 3.6

Hi,

Trying to install biscuiteer with
BiocManager::install("trichelab/biscuiteer",update = F)

(update is False becuase I'm running on cluster, and I don't have root permission to update)

And I'm getting the error:
"this R is version 3.5.1, package 'biscuiteer' requires R >= 3.6.0"

Is there something I can do, giving that I don't have root permission to update R?

CpGindex breaks when hypoMeth and hyperMeth have different sizes

Using chr11p15 (for reducing computation time), the hypoMeth and hyperMeth arrays have different sizes. This becomes an issue when trying to calculate the ratio between them, as R can't conform the arrays to one another.

Should the ratio be included? Or, can it be removed and left up to the user to calculate the ratio on their own if they decide they want to do that?

Issue merging many large bsseq objects with biscuiteer::unionize()

Working on my first analysis w/ BISCUIT/biscuiteer, but I’ve encountered some issues handling the data. I have 20 gzip/tabix’d VCFs (15-20Gb each) with accompanying bed.gz files. Biscuiteer seems to be working just fine with small/toy datasets. However, I’ve been having issues merging all these samples into a single bsseq object. I think part of the issue is simply due to the large sample number and the amount of data for each sample. I have attempted to solve this issue with two approaches that have failed thus far:

biscuiteer::readBiscuit() for each sample individually and then use biscuiteer::unionize() to get a single object.
Merge vcf.gz and bed.gz files on the command line and then import together using biscuiteer::readBiscuit()

Do you have any advice for a better/ideal approach in this situation?

thanks in advance!

atRegions fails when given dropNA argument

When calling atRegions(bsseq, regions, mappings, nm, ...), you can pass it either dropNA or impute, which will then be passed on to summarizeBsSeqOver. When trying to pass it dropNA, atRegions will fail because of an incorrect number of dimensions (see error message below). The issue is arising because summarizeBsSeqOver will return a numeric, rather than a matrix, when only one sample is given. Should summarizeBsSeqOver always return a matrix, and then let atRegions reduce this to a numeric if there is only one sample?

Code:
tcga_bed <- system.file("extdata", "TCGA_BLCA_A13J_chr11p15_merged.bed.gz",
package = "biscuiteer")
tcga_vcf <- system.file("extdata", "TCGA_BLCA_A13J_header_only.vcf.gz",
package = "biscuiteer")
bisc <- read.biscuit(BEDfile = tcga_bed, VCFfile = tcga_vcf,
merged = TRUE, genome = "hg38", verbose = TRUE)

reg <- Ranges(seqnames = rep("chr11",5),
strand = rep("*",5),
ranges = IRanges(start = c(0,2.8e6,1.17e7,1.38e7,1.69e7),
end= c(2.8e6,1.17e7,1.38e7,1.69e7,2.2e7))
)
regions <- atRegions(bsseq = bisc, regions = reg, dropNA = TRUE)

Error message:
Error in regional[as.character(regions), ] :
incorrect number of dimensions

Size of subset bsseq objects made with biscuiteer::readBiscuit & then unionize larger than expected

dim(myBS)
25555777 7
saveRDS(myBS, file = "biscuiteer.RDS") # 7.6 Gb on disk

test <- myBS[1,]
dim(test)
1 7
saveRDS(test, file = "test.RDS") # 7.4 Gb on disk

single sample import does not return DelayedArray objects

Importing single samples that are proper bed files without a VCF:

#grab file names
my_samp_names <- list.files("path/to/bed/files") %>% gsub("*.bed", "", .)

#import using lapply
my_bsseq_list <- lapply(my_samp_names, function(x) read.biscuit(BEDFile = "path/to/my/bed", sampleNames = x, genome = "hg19")

returns objects M and Cov matrices instead of DelayedMatrix. Ultimately, this also returns straight beta values from biscuit-derived bed files (vcf2bed) instead of methylated read counts.

Will check to see which version of biscuiteer this was introduced.

Flogit and fexpit are not invertable

The flogit and fexpit functions are not invertable as written similar to compartmap usage of these functions. Just noting here as well.

Old flogit:

#' Calculate squeezed logit function
#'
#' Helper function for calculating squeezed logit
#'
#' @param x    A vector of values between 0 and 1 inclusive
#' @param sqz  The amount by which to 'squeeze' (DEFAULT: 0.000001)
#'
#' @return     A vector of values between -Inf and +Inf
#'
#' @importFrom gtools logit
#'
flogit <- function(x,
                   sqz = 0.000001) {
  x[ which(x < sqz) ] <- sqz 
  x[ which(x > (1 - sqz)) ] <- (1 - sqz)
  gtools::logit(x)
}

New flogit as written by @ttriche in compartmap

#' Helper function: squeezed logit
#'
#' @param p       a vector of values between 0 and 1 inclusive
#' @param sqz     the amount by which to 'squeeze', default is .000001
#'
#' @return        a vector of values between -Inf and +Inf
#'
#' @examples
#'
#'   set.seed(1234)
#'   p <- runif(n=1000)
#'   summary(p) 
#' 
#'   sqz <- 1 / (10**6)
#'   x <- flogit(p, sqz=sqz)
#'   summary(x) 
#'
#'   all( abs(p - fexpit(x, sqz=sqz)) < sqz )
#'   all( abs(p - fexpit(flogit(p, sqz=sqz), sqz=sqz)) < sqz ) 
#'
#' @export 
flogit <- function(p, sqz=0.000001) { 

  midpt <- 0.5
  deflate <- 1 - (sqz * midpt)
  if (any(p > 1 | p < 0)) stop("Values of p outside (0,1) detected.")
  squoze <- ((p - midpt) * deflate) + midpt
  return( log( squoze / (1 - squoze)) )
  
}

Will file PR to change.

trans slot size very large when combining single sample bsseq objects

When combining single sample bsseq objects using either biscuiteer::unionize() or bsseq::combineList() with a list of individual bsseq objects, the trans slot becomes very large (e.g. 120 gb). Further, this is propagated to the SummarizedExperiment object when saving as an HDF5-backed SE.

Looking at hansenlab/bsseq#53 (comment), it appears that this is just a function to do a smooth. Within the bsseq::combineList(), the trans argument is given as follows at construction of the new, combined bsseq object:

ans_trans <- function() NULL

The worry here is that for some reason, we are in fact copying data around, leading loss of all gains made by making it an HDF5-backed object.

One workaround is to reset the trans slot to be the same as whatever you're combining it with. There is an explicit check in bsseq::combineList() but not in biscuiteer::unionize() for this. Alternatively, we can set both to the above and combine.