Giter VIP home page Giter VIP logo

methylkit's Introduction

methylKit Logo

methylKit

Build Status

Github Build Status
Bioc Release Bioc release status
Bioc Devel Bioc devel status

GitHub R package version codecov

Introduction

methylKit is an R package for DNA methylation analysis and annotation from high-throughput bisulfite sequencing. The package is designed to deal with sequencing data from RRBS and its variants, but also target-capture methods such as Agilent SureSelect methyl-seq. In addition, methylKit can deal with base-pair resolution data for 5hmC obtained from Tab-seq or oxBS-seq. It can also handle whole-genome bisulfite sequencing data if proper input format is provided.

Current Features

  • Coverage statistics
  • Methylation statistics
  • Sample correlation and clustering
  • Differential methylation analysis
  • Feature annotation and accessor/coercion functions
  • Multiple visualization options
  • Regional and tiling windows analysis
  • (Almost) proper documentation
  • Reading methylation calls directly from Bismark(Bowtie/Bowtie2 alignment files
  • Batch effect control
  • Multithreading support (for faster differential methylation calculations)
  • Coercion to objects from Bioconductor package GenomicRanges
  • Reading methylation percentage data from generic text files

Staying up-to-date

You can subscribe to our googlegroups page to get the latest information about new releases and features (low-frequency, only updates are posted)

To ask questions please use methylKit_discussion forum

You can also check out the blogposts we make on using methylKit


Installation

in R console,

library(devtools)
install_github("al2na/methylKit", build_vignettes=FALSE, 
  repos=BiocManager::repositories(),
  dependencies=TRUE)

if this doesn't work, you might need to add type="source" argument.

Install the development version

library(devtools)
install_github("al2na/methylKit", build_vignettes=FALSE, 
  repos=BiocManager::repositories(),ref="development",
  dependencies=TRUE)

if this doesn't work, you might need to add type="source" argument.


How to Use

Typically, bisulfite converted reads are aligned to the genome and % methylation value per base is calculated by processing alignments. methylKit takes that % methylation value per base information as input. Such input file may be obtained from AMP pipeline for aligning RRBS reads. A typical input file looks like this:

chrBase	chr	base	strand	coverage	freqC	freqT
chr21.9764539	chr21	9764539	R	12	25.00	75.00
chr21.9764513	chr21	9764513	R	12	0.00	100.00
chr21.9820622	chr21	9820622	F	13	0.00	100.00
chr21.9837545	chr21	9837545	F	11	0.00	100.00
chr21.9849022	chr21	9849022	F	124	72.58	27.42
chr21.9853326	chr21	9853326	F	17	70.59	29.41

methylKit reads in those files and performs basic statistical analysis and annotation for differentially methylated regions/bases. Also a tab separated text file with a generic format can be read in, such as methylation ratio files from BSMAP, see here for an example. Alternatively, read.bismark function can read SAM file(s) output by Bismark(using bowtie/bowtie2) aligner (the SAM file must be sorted based on chromosome and read start). The sorting must be done by unix sort or samtools, sorting using other tools may change the column order of the SAM file and that will cause an error.

Below, there are several options showing how to do basic analysis with methylKit.

Documentation

  • You can look at the vignette here. This is the primary source of documentation. It includes detailed examples.
  • You can check out the slides for a tutorial at EpiWorkshop 2013. This works with older versions of methylKit, you may need to update the function names.
  • You can check out the tutorial prepared for EpiWorkshop 2012. This works with older versions of methylKit, you may need to update the function names.
  • You can check out the slides prepared for EuroBioc 2018. This also includes more recent features of methylKit and is meant to give you a quick overview about what you can do with the package.

Downloading Annotation Files

Annotation files in BED format are needed for annotating your differentially methylated regions. You can download annotation files from UCSC table browser for your genome of interest. Go to [http://genome.ucsc.edu/cgi-bin/hgGateway]. On the top menu click on "tools" then "table browser". Select your "genome" of interest and "assembly" of interest from the drop down menus. Make sure you select the correct genome and assembly. Selecting wrong genome and/or assembly will return unintelligible results in downstream analysis.

From here on you can either download gene annotation or CpG island annotation.

  1. For gene annotation, select "Genes and Gene prediction tracks" from the "group" drop-down menu. Following that, select "Refseq Genes" from the "track" drop-down menu. Select "BED- browser extensible data" for the "output format". Click "get output" and on the following page click "get BED" without changing any options. save the output as a text file.
  2. For CpG island annotation, select "Regulation" from the "group" drop-down menu. Following that, select "CpG islands" from the "track" drop-down menu. Select "BED- browser extensible data" for the "output format". Click "get output" and on the following page click "get BED" without changing any options. save the output as a text file.

In addition, you can check this tutorial to learn how to download any track from UCSC in BED format (http://www.openhelix.com/cgi/tutorialInfo.cgi?id=28)


R script for Genome Biology publication

The most recent version of the R script in the Genome Biology manuscript is here.


Citing methylKit

If you used methylKit please cite:

If you used flat-file objects or over-dispersion corrected tests please consider citing:

and also consider citing the following publication as a use-case with specific cutoffs:


Contact & Questions

e-mail to [email protected] or post a question using the web interface.

if you are going to submit bug reports or ask questions, please send sessionInfo() output from R console as well.

Questions are very welcome, although we suggest you read the paper, documentation(function help pages and the vignette) and blog entries first. The answer to your question might be there already.


Contribute to the development

See the trello board for methylKit development. You can contribute to the methylKit development via github ([http://github.com/al2na/methylKit/]) by opening an issue and discussing what you want to contribute, we will guide you from there. In addition, you should:

  • Bump up the version in the DESCRIPTION file on the 3rd number. For example, the master branch has the version numbering as in "X.Y.1". If you make a change to master branch you should bump up the version in the DESCRIPTION file to "X.Y.2".

  • Add your changes to the NEWS file as well under the correct version and appropriate section. Attribute the changes to yourself, such as "Contributed by X"

License

Artistic License/GPL

methylkit's People

Contributors

abierling avatar al2na avatar alexg9010 avatar hpages avatar jondeen avatar jwokaty avatar karl616 avatar katwre avatar lshep avatar marcinkosinski avatar nturaga avatar shengli avatar vobencha avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

methylkit's Issues

file name too long

@al2na this error mentioned in the discussion forum here https://groups.google.com/forum/#!topic/methylkit_discussion/FHi_1K5cTZs is really due to a limit in the length of a file name (max 255 ).

touch $(printf 'a%.0s' {1..256}) touch: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa: File name too long

We have to think of an alternative naming scheme to allow the simultaneous work with many samples.

What do you think of the idea to name the tabix file with an unique id and create a text file with the same id that stores the the creation command or the list of samples used in the unite command ?

Error for with unite function for methylRawListDB object

Hi,
I get this error when I try to run unite function for more then 3 samples per group for methylRawListDB object. Since I haver more ~200 samples of the human genome, I really need to use tabix option and would appreciate your help.

Here is the error message:

Error in merge.data.table(res, tmp, by = c("V1", "V2", "V3", "V4"), all = all) :
x has some duplicated column name(s): V5.x,V6.x,V7.x,V5.y,V6.y,V7.y. Please remove or rename the duplicate(s) and try again.

This is a minimal example using methyKit dataset that reproduces the error I am getting.

files <- list(system.file("extdata", "test1.myCpG.txt", package = "methylKit"),
system.file("extdata", "test2.myCpG.txt", package = "methylKit"),
system.file("extdata", "test1.myCpG.txt", package = "methylKit"),
system.file("extdata", "control1.myCpG.txt", package = "methylKit"),
system.file("extdata", "control2.myCpG.txt", package = "methylKit"),
system.file("extdata", "control1.myCpG.txt", package = "methylKit")
)

myobj <- methRead(files,
sample.id = as.list(paste0("S", c(1:6))),
assembly = "hg19", context = "CpG",
treatment = c(rep(0, 3), rep(1, 3)), dbtype = "tabix”)

meth <- unite(myobj)

Thank you,
Martina

feature request: subtract.and.adjust.mkBS.vs.oxBS

This is a feature request to extend/clarify the method adjust.methylC found in the development branch of methylKit. See below:

I would like to know if the 5hmC levels listed below are from an (a) oxBS experiment or (b) 5hmC levels from the subtraction of an oxBS and a BS (or mock BS, mkBS) experiment:

I presume it's (b), in which case it would take multiple steps from the bam/bismark calls to reach that point in the analysis and then call methylKit.

It would be great if there was an alternative function that was able to load the values straight from parallel mkBS and oxBS experiments on the same sampleset. For example, I imagine something like:

myoxBS = read(oxBS.files...)
mymkBS = read(mkBS.files...)
adjusted = subtract.and.adjust.mkBS.vs.oxBS(mymkBS, myoxBS)

Let me know if this makes sense.

     # read 5hmC and 5mC files
     hmc.file=system.file("extdata", "test1.myCpG.txt", package = "methylKit")
     mc.file =system.file("extdata", "test2.myCpG.txt", package = "methylKit")

     my5hmC=read( hmc.file,sample.id="hmc",assembly="hg18")
     my5mC =read( mc.file,sample.id="mc",assembly="hg18")

     # adjusting the 5mC levels using 5hmC levels
     adjusted.5mC=adjust.methylC(my5mC,my5hmC)

Adjust measured 5mC levels using 5hmC levels

Description:

Measured 5mC levels via bisulfite sequencing might be a
combination of 5hmC and 5mC levels since bisulfite sequencing can
not distinguish between the two. This function can adjust 5mC
levels of a bisulfite sequencing experiment if the user supplies
corresponding 5hmC levels from the same sample.


Usage:

adjust.methylC(mc,hmc,save.db,...,chunk.size)

S4 method for signature 'methylRaw,methylRaw'

adjust.methylC(mc, hmc, save.db = FALSE, ...,
chunk.size = 1e+06)

S4 method for signature 'methylRawList,methylRawList'

adjust.methylC(mc, hmc,
save.db = FALSE, ..., chunk.size = 1e+06)

S4 method for signature 'methylRawDB,methylRawDB'

adjust.methylC(mc, hmc, save.db = TRUE,
..., chunk.size = 1e+06)

S4 method for signature 'methylRawListDB,methylRawListDB'

adjust.methylC(mc, hmc,
save.db = TRUE, ..., chunk.size = 1e+06)


possible typo in `methCall.pl`

When performing a paired_sam analysis, perform_sam subroutine is called but the 2 last parameters (nolap and paired) are inverted compared to their positions in perform_sam definition: the value of nolap is used to set parameter paired.

feature request: tmpdir option for readBismarkCytosineReport function

Would it be possible to explicitly set up the tempdir that is used for reading in the input files for readBismarkCytosineReport? In platforms where the instance has a small /tmp disk space, reading in a bunch of samples may make it run out of space unless tmpdir is set to another mounted point with more space...

Thx

Fisher-test option for adjust.methylC

This is a feature request against the development version, which contains an adjust.methylC feature that does subtraction of an oxBS and mkBS (mock-BS or BS sample). See the code below.

With this optional Fisher-test filtering, those entries where the 4 counts don't pass the t test with a given threshold are excluded, and the rest of entries are included. One could leave the default of not using fisher-test filtering as it is now.

fisher.test(matrix(c(mkBS_methylated, mkBS_unmethylated, oxBS_methylated, oxBS_unmethylated),nrow=2,dimnames=list(mkBS=c("methylated","unmethylated"),oxBS=c("methylated","unmethylated"))))
setMethod("adjust.methylC", c("methylRaw","methylRaw"),function(mc,hmc){

  lst=new("methylRawList",list(mc,hmc),treatment=c(1,0))
  data=getData(unite(lst))

  diff=(data$numCs1)-round(data$coverage1*(data$numCs2/data$coverage2))
  diff[diff<0]=0
  data$numCs1=diff
  data$numTs1=data$coverage1-data$numCs1
  colnames(data)[5:7]=c("coverage","numCs","numTs")
  new("methylRaw",data[,1:7],[email protected],  assembly=mc@assembly, 
                             context =mc@context,     resolution=mc@resolution)

})

Error in getMethylDiff when a methylDiff object contains NA

When a methylDiff object contains NA in a qvalue/meth.diff column then the getMethylDiff function returns an error:

Error in if (max(i) > nrow(x)) stop("subscript contains out-of-bounds indices") : 
  missing value where TRUE/FALSE needed

Here is a toy example:

sim.methylBase = dataSim(replicates=6,
                          sites=5000,
                          treatment=c(1,1,1,0,0,0),
                          percentage=50,
                          effect=25,
                          add.info=TRUE)
limma.meth.log<-function(methylBase.obj){
  
  require(limma)
  require(qvalue)
  group<-methylBase.obj@treatment
  design<-model.matrix(~group)
  
  # do the test in limma
  p.meth=percMethylation(methylBase.obj) # get percent methylation values
  fit <- lmFit(log2(p.meth/(100-p.meth)), design = design)
  tstat.ord <- fit$coef/ fit$stdev.unscaled/ fit$sigma
  
  fit2 <- ebayes(fit)
  
  # make the data for methylKit object
  df=cbind(getData(methylBase.obj)[,1:4],pvalue=fit2$p.value[,2],
           qvalue=qvalue(fit2$p.value[,2])$qvalues,
           meth.diff=rowMeans(p.meth[,1:2])-rowMeans(p.meth[,3:4])  )

  # create a new methylDiff object
  obj=new("methylDiff",df,[email protected],
          assembly=methylBase.obj@assembly,
          context=methylBase.obj@context,treatment=methylBase.obj@treatment,
          destranded=methylBase.obj@destranded,resolution=methylBase.obj@resolution)
}

limma.qvalue.log=limma.meth.log(sim.methylBase[[1]])
getMethylDiff(limma.qvalue.log,
                              difference=difference,qvalue=qvalue, type="all")

Sth like this works for me:
limma.qvalue.log[which(limma.qvalue.log$qvalue<0.05 & abs(limma.qvalue.log$meth.diff)>5)],
but it doesnt work with getMethylDiff because you dont use which function to subset rows (

.Object[.Object$qvalue<qvalue & abs(.Object$meth.diff) > difference,],
).

I am not sure if it's a real bug or there shouldn't be any NAs in pvalue or meth.diff column, but maybe at least a more informative message could be printed to a user :)

Header in tabix file

storing extra information in tabix files is actually quite easy:

#load library
library(methylKit)

# load table stored in tabix file
my_mat <- headTabix("inst/extdata/ctrl2.txt.bgz")


# write comment-header to new file
write(x = "#ctrl2.txt.bgz","../my_file.txt")
# write table to new file
write.table(x = my_mat,file = "../my_file.txt",
            append = TRUE,col.names = FALSE,
            row.names = FALSE,sep = "\t")

# make tabix out of file
makeMethTabix("../my_file.txt",rm.file = FALSE)

# view table stored in tabix file
headTabix("../my_file.txt.bgz")

# access other information of tabix file
Rsamtools::headerTabix("../my_file.txt.bgz")

# especially header
Rsamtools::headerTabix("../my_file.txt.bgz")$header

Maybe it would also be a good idea to store all slots of a methylDB object in its tabix file, such that we can easily reread all information.

tileMethylCounts and missing data

Hi,

we are currently happy to use MethylKit in our group.

Still, a small issue with tileMethylCounts. One of my colleagues noticed that tileMethylCounts generated a lot of NA-tiles when min.per.group was lower than the number of samples. She was able to identify that it was enough with a single missing value to have it propagated into the associated tile. Is this a design decision?

We followed it up and landed in the regionCounts function. It looks like the sums are done without excluding NA-values. I attach a small patch with the beginning to a fix. It isn't tested and I suspect it might not be complete. This "fix" has the potential to generate a variable number of tiles per sample and I didn't understand how the merging worked. There could be a problem

methylKit.relaxedTiling.patch.txt

Typo error in diffMeth.R

chromsome should be chromosome

,las=2,horiz=T,col=c("magenta","aquamarine4"),main=paste("% of hyper & hypo methylated regions per chromsome",sep=""),xlab="% (percentage)",...)

Error in calculateDiffMeth()

Hi,

I encountered the following error message when using calculateDiffMeth on a methylBase object (attached as zipped RData):

Error in data.frame(subst[, 1:4], tmp$p.value, p.adjusted(tmp$q.value, :
arguments imply differing number of rows: 50020, 0

Session Info:

R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Mavericks 10.9.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] methylKit_1.0.0 GenomicRanges_1.26.4 GenomeInfoDb_1.10.3 IRanges_2.8.2 S4Vectors_0.12.2 BiocGenerics_0.20.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 plyr_1.8.4 XVector_0.14.1 R.methodsS3_1.7.1 bitops_1.0-6
[6] R.utils_2.5.0 tools_3.3.3 zlibbioc_1.20.0 mclust_5.3 tibble_1.3.3
[11] gtable_0.2.0 lattice_0.20-35 fastseg_1.20.0 rlang_0.1.1 Matrix_1.2-10
[16] coda_0.19-1 rtracklayer_1.34.2 stringr_1.2.0 gtools_3.5.0 Biostrings_2.42.1
[21] grid_3.3.3 Biobase_2.34.0 qvalue_2.6.0 data.table_1.10.4 emdbook_1.3.9
[26] XML_3.98-1.9 BiocParallel_1.8.2 limma_3.30.13 ggplot2_2.2.1 reshape2_1.4.2
[31] magrittr_1.5 GenomicAlignments_1.10.1 scales_0.4.1 Rsamtools_1.26.2 MASS_7.3-47
[36] splines_3.3.3 SummarizedExperiment_1.4.0 bbmle_1.0.19 colorspace_1.3-2 numDeriv_2016.8-1
[41] stringi_1.1.5 RCurl_1.95-4.8 lazyeval_0.2.0 munsell_0.4.3 R.oo_1.21.0

best wishes,
Alex

issue_methylKit.rdata.zip

Problem with `percMethylation` function

Hi,
I was trying to use the percMethylation function to create a data-frame of percent methylation values. However, I got the following error:
'''
> class(united.obj)
[1] "methylBaseDB"
attr(,"package")
[1] "methylKit"
> mat <- percMethylation(united.obj)
Error in rbindlist(res) :
Item 1 of list input is not a data.frame, data.table or list
'''

When I looked at the code, and trying to generate the same using getData(), and then the numCs.index, and numTs.index, it worked.

Error While Annotating DMRs

Hi, I am following the tutorial provided, on executing the following command, I am getting an error. Could you please help me in resolving this issue.

gene.obj=readTranscriptFeatures(system.file("extdata", "refseq.hg18.bed.txt", package = "methylKit"))

Reading the table...
Calculating intron coordinates...
Error in (function (cl, name, valueClass) :
assignment of an object of class "NULL" is not valid for @'NAMES' in an object of class "IRanges"; is(value, "characterORNULL") is not TRUE

Colour scale on correlation plots

The correlation plot matrix does not have a colour scale for the colour densities. Could a scale be added to help make more sense of the plots.

improve documentation around pipeline argument for the read() function

From what I understand, the only Bismark output that methylKit works with is the SAM file and that is processed with read.bismark. However, the documentation for read has a pipeline argument with one of the options as "bismark". What sort of file is expected there?

I also saw a discussion about using Bismark cytosine reports (https://groups.google.com/forum/#!topic/methylkit_discussion/0s1DLTWsLyM). Apparently, they can be converted by simply using the code:
awk '{OFS="\t";if($4+0 > 0 || $5+0 >0 ) print $1,$2,$3,$4/($4+$5),$4+$5;}' cytosineReport.txt > outputForMethylKit
That file will work with read, but it requires a custom pipeline setting, so I am still not sure what the "bismark" pipeline setting is for.

can't load bismark coverage file

Hi Altuna.
I have previously use methylkit by importing the .sam files from Bismark with processBismarkAln to make the DMR analysis.
Now, a few years later, I notice that there is an option to import the coverage as well as the cytocine report files from Bismark (methRead), but for some reason I can't load them.
I noticed a discrepancy between the description that you make and what the files actually look like. Is it this possible?
This is your descripcion:

Bismark aligner can output methylation information per base in
#' multiple different formats. This function reads coverage files,
#' which have chr,start,end, number of cytosines (methylated bases)
#' and number of thymines (unmethylated bases).

And this is the cov file:
ChrC 54955 54955 0.32051282051282 42 13062
ChrC 54962 54962 0.127809939102323 17 13284
ChrC 54964 54964 0.180505415162455 24 13272
ChrC 54965 54965 0.135297654840649 18 13286
ChrC 54967 54967 0.0827067669172932 11 13289
ChrC 54971 54971 0.542781756502073 72 13193
ChrC 54977 54977 0.330777326717787 44 13258

Why is there a % column between the end and the methylated Cs?
Am I looking the wrong file? Is from the methylation_extractor program from bismark

The same happens with the cytocine_report file..

Would I have the same results if I use the sam (can I use the bam files now?) instead the coverage files?

Best

Diego

error running methylkit getCorrelation

I am running the script below and I get an error, not sure if it's related to the chromosomal naming or something else. Any ideas?

#!/usr/bin/Rscript
library(methylKit)
library(tools)
args<-commandArgs(TRUE)

# readBismarkCytosineReport function
devtools::source_gist("4839e615e2401d73fe51")

file.list = list("/data/group/test/ILMN_Run001/ILMN_Run001-12345678/WBGS-oxBS-v1-0250ng-PCR10-R01_S1_L003_R1_001.cutB_pe.sorted.mrgd.combo.CpG_report.txt.gz"
,"/data/group/test/ILMN_Run001/ILMN_Run001-12345678/450K-GM12878.nnn.CpG_report.txt.gz"
)
# This reads straight from the output of bismark, works for both strands
myobj = readBismarkCytosineReport(
        file.list,
          sample.id=list(
strsplit(basename("/data/group/test/ILMN_Run001/ILMN_Run001-12345678/WBGS-oxBS-v1-0250ng-PCR10-R01_S1_L003_R1_001.cutB_pe.sorted.mrgd.combo.CpG_report.txt.gz"), '[_]')[[1]][1],
strsplit(basename("/data/group/test/ILMN_Run001/ILMN_Run001-12345678/450K-GM12878.nnn.CpG_report.txt.gz"), '[_]')[[1]][1]
               ),
        assembly="hg38",
        context="CpG",
treatment=c(1,0),
        min.cov=10)

meth = unite(myobj)

pdf(file="/data/group/test/ILMN_Run001/ILMN_Run001-12345678/Methylkit.02.Meth_CorrelationPlot.s02.m10.pdf"); getCorrelation(meth,plot=T); dev.off()

Error:

Rscript /data/group/test/ILMN_Run001/ILMN_Run001-12345678/run_methcorrelation.s02.m10.R
Loading required package: methods
Sourcing https://gist.githubusercontent.com/al2na/4839e615e2401d73fe51/raw/936308ae69a3f60aee57b5acb8d49070132a4f78/readBismarkFiles.R
SHA-1 hash of file is c3f50bff59693c2f9764bea7b1f645a58b1502e7
Loading required package: R.utils
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.7.0 (2015-02-19) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.19.0 (2015-02-27) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'

The following objects are masked from 'package:methods':

    getClasses, getMethods

The following objects are masked from 'package:base':

    attach, detach, gc, load, save

R.utils v2.2.0 (2015-12-09) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'

The following object is masked from 'package:utils':

    timestamp

The following objects are masked from 'package:base':

    cat, commandArgs, getOption, inherits, isOpen, parse, warnings

Loading required package: data.table
Read 57664572 rows and 7 (of 7) columns from 1.331 GB file in 00:01:05
Warning messages:
1: In fread(filepath, ...) :
  Bumped column 1 to type character on data row 54814099, field contains 'MT'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
2: In fread(filepath, ...) :
  Bumped column 1 to type character on data row 368078, field contains 'X'. Coercing previously read values in this column from logical, integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
                              WBGS-oxBS-v1-0250ng-PCR10-R01
WBGS-oxBS-v1-0250ng-PCR10-R01                            NA
450K-GM12878.nnn.CpG                                     NA
                              450K-GM12878.nnn.CpG
WBGS-oxBS-v1-0250ng-PCR10-R01                   NA
450K-GM12878.nnn.CpG                            NA
Error in plot.window(...) : need finite 'xlim' values
Calls: getCorrelation ... localPlot -> plot -> plot.default -> localWindow -> plot.window
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
Execution halted

struggling to install development branch on R 3.2.1

See log below. Any ideas?

[...]
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
* DONE (GenomicAlignments)

The downloaded source packages are in
        �/tmp/Rtmp4G2oBy/downloaded_packages�
'/data/apps/R/3.2.1/lib64/R/bin/R' --no-site-file --no-environ --no-save  \
  --no-restore CMD INSTALL  \
  '/tmp/Rtmp4G2oBy/devtools32d21c86651b/al2na-methylKit-ef14be0'  \
  --library='/data/home/user/R/x86_64-unknown-linux-gnu-library/3.2'  \
  --install-tests 

ERROR: dependencies �emdbook�, �mclust� are not available for package �methylKit�
* removing �/data/home/user/R/x86_64-unknown-linux-gnu-library/3.2/methylKit�
Error: Command failed (1)

feature request: ability to ingest SAMPLE.myCpG.txt.gz compressed files

This is a feature request for methylKit:

In order to reduce the disk usage footprint, it would be convenient if mehylkit had a read operation from compressed SAMPLE.myCpG.txt.gz files. Example:

# Uncompressed txt files
file.list = list(args[1],args[2],args[3],args[4])
myobj = read(file.list,sample.id=list("CASE1","CASE2","ctrl1","ctrl2"),assembly="hg38",pipeline="amp",treatment=c(1,1,0,0))

Maybe an extra flag that detects the gz extension?

file.list = list(args[1],args[2],args[3],args[4])
myobj = read(file.list, compressed_ext=c(".gz",".bz2"),sample.id=list("CASE1","CASE2","ctrl1","ctrl2"),assembly="hg38",pipeline="amp",treatment=c(1,1,0,0))

Apologies if this already exists, or if it's a trivial R code modification...

Analyze DiffMeth with more than 2 treatments...

Hi Altuna,
I have 3 treatments in my experiment, actually 2 treatments and the control. When do I have to make all the paired combinantions (control-treatment1), (control-treatment2) and (treatment1-treatment2)? when I run the unite() function? How do I do it?
Or is it a way to run calculteDiffMeth() function using the object created by unite() function assigning a given pair?

This is my code (I run the 3 context separately):

my methylRawList

CpG.bismark.rdd.bam <- processBismarkAln(file.list1, sample.id = list("NI2","NI3","Cg1","Cg2","OR1","OR2"), treatment =c(0,0,1,1,2,2), assembly = "TAIR10", save.folder = "freq_files1" , save.context = "CpG", read.context = "CpG");

CHG.bismark.rdd.bam <- processBismarkAln(file.list1, sample.id = list("NI2","NI3","Cg1","Cg2","OR1","OR2"), treatment =c(0,0,1,1,2,2), assembly = "TAIR10", save.folder = "freq_files1" , save.context = "CHG", read.context = "CHG");

CHH.bismark.rdd.bam <- processBismarkAln(file.list1, sample.id = list("NI2","NI3","Cg1","Cg2","OR1","OR2"), treatment =c(0,0,1,1,2,2), assembly = "TAIR10", save.folder = "freq_files1" , save.context = "CHH", read.context = "CHH");

my objects

meth_CpG_bismark.rdd.bam <- unite(CpG.bismark.rdd.bam, min.per.group = 2L, destrand = FALSE)
meth_CHG_bismark.rdd.bam <- unite(CHG.bismark.rdd.bam, min.per.group = 2L, destrand = FALSE)
meth_CHH_bismark.rdd.bam <- unite(CHH.bismark.rdd.bam, min.per.group = 2L, destrand = FALSE)

When I run the calculateDiffMeth function in one of the objects this warning appear:

Diff_Meth_CpG.rdd.bam <- calculateDiffMeth(meth_CpG_bismark.rdd.bam)
more than two groups detected:
will calculate methylation difference as the difference of max(x) - min(x),
where x is vector of mean methylation per group per region,but
the statistical test will remain the same.
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: fitted probabilities numerically 0 or 1 occurred
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
4: glm.fit: fitted probabilities numerically 0 or 1 occurred
5: glm.fit: fitted probabilities numerically 0 or 1 occurred

Is ther a way to make pairs in the DiffMeth? or should I create my objects in pairs before? How do I do that?

Best

Diego

diffMethPerChr percentages

After calling diffMethPerChr(), the percentages of hypermethylated and hypomethylated add up to over 100. Is that a bug or a misunderstanding?

Example:

> diffMethPerChr = diffMethPerChr(mkDiff, plot=F, qvalue.cutoff=1.0, meth.cutoff=0)
> diffMethPerChr
$diffMeth.per.chr
     chr number.of.hypermethylated percentage.of.hypermethylated number.of.hypomethylated percentage.of.hypomethylated
1   chr1                      1043                      51.27827                     1110                     54.57227
12  chr2                      1489                      51.88153                     1549                     53.97213
13  chr3                       846                      52.71028                      858                     53.45794
14  chr4                      1314                      50.36412                     1408                     53.96704

`read.bismark` fails on windows

The function read.bismark fails to find the perl script methCall.pl on windows because of the space in the program folder name ("Program Files").
To make it work, I modified read.bismark function to add quotes around the path name (variable ex.loc).

I suggest a mid-pval approach as alternative option to your fast.fisher that uses min likelihood

Dear Altuna Akalin,
I am a master student of biochemistry and user of your package methylKit. Thanks for creating this package. I have a lot of WGBS data to analyze. Your implementation of Fisher's exact test in fast.fisher is a lot faster than R's fisher.test. Still it is considerably slower than an implementation of Fisher's exact test using midpvalues in a twosided test, please read for further information on the topic link
Could you please make an option where one can use the midpval approach, it is about 13 times faster according to my benchmark testing and would save me a lot of time in my script.
You can use this code as template:

super.fast.fisher <- function (cntg_table) {

q <-  cntg_table[1,1] # first field in 2 X 2 contigency table 
m <- cntg_table[1,1] + cntg_table[2,1] # sum of first column in contingency table
n <-  cntg_table[2,1] + cntg_table[2,2] # sum of second column in contingency table 
k <- cntg_table[1,1] + cntg_table[1,2] # sum of first row in contingency table

pval_right <- phyper(q = q, m = m, n = n, k = k, lower.tail = FALSE) + (0.5 * dhyper(q,  m, n, k)) # test on the right side 
                      

pval_left <- phyper(q = q-1 , m = m, n = n, k = k, lower.tail = TRUE) + (0.5 * dhyper(q,  m, n, k)) # test on the left side
                                                               
 
return (ifelse(test = pval_right > pval_left, yes = pval_left * 2, no = pval_right * 2 )) # pvalue of twosided test with mid-pval

Many thank and best regards,

Ivo

build errors in Bioconductor methylKit devel version

Hi,

We are developing two R packages that are using the Bioconductor version of methylKit. We plan to submit our packages to Bioconductor. However, the development version of the methylKit package (Bioconductor 3.5) has some build errors: http://bioconductor.org/checkResults/devel/bioc-LATEST/methylKit/. Those errors are related to functions that have been removed from the IRanges devel package. I have noticed that most functions are not used anymore by methylKit. So, by removing those importations, the problem should be solved.

I wonder if this issue is currently being processed.

Thanks,

Astrid

how to find the differential methylation regions?

sorry to interrupt again. here I have a question, I use methylkit for my rice BS-seq data analysis, and I can find the differential methylated cytosine using calculateDiffMeth code, but I am confused to how to get the DMR? I try through the whole command and did't get the satisifying result? I used the eDMR to analysis my data and only found "CpG" context sequences analyzed, how about the "CHH" "CHG"? how can I get my rice data pass the analysis?
Thanks a lot and looling forward to hear from you!

is it ok to load bismark CpG_report files with non-integer counts?

Hi,

I am inputting CpG_report.txt.gz oxBS and BS bismark files into
methpipe mlml, which uses an EM algorithm to determine the levels of
5-mC and 5-hmC by subtracting the converted and non-converted counts
from the oxBS and BS experiments. This produces three coefficients for
each position, namely:

  1. 5-mC level
  2. 5-hmC level
  3. unmethylated level

After this process, I can produce two output files in
CpG_report.txt.gz format which due to the EM algorithm above, will
sometimes produce counts that are non-integer. E.g. for columns 4th and 5th in Bismark CpG_report.txt.gz:

The genome-wide cytosine methylation output file is tab-delimited in the following format:
==========================================================================================
<chromosome>  <position>  <strand>  <count methylated>  <count non-methylated>  <C-context>  <trinucleotide context>

Given a position with coverage 56, and the following 5-mC, 5-hmC and C
levels, columns 4th and 5th would be non-integers:

0.483333*56 = 27.066648
0.416667*56 = 23.333352
0.1*56      = 5.6
-----------------------
1.0*56      = 56

I think the adjust.methylC function may be doing a similar
manipulation to the counts, and the result is also loaded as a
methylRaw object, but I want to make sure what I am doing here is
kosher.

# adjusting the 5mC levels using 5hmC levels
adjusted.5mC=adjustMethylC(my5mC,my5hmC)

I would like to make sure that methylKit can ingest this non-integer
CpG_report.txt.gz file correctly, and one still will be able to use
the minimum coverage option to filter out low-coverage positions to do
the DMR calling.

I am using the readBismarkCytosineReport function as described below.

Let me know if this is a good way to proceed.

#' Read bismark cytosine report file as a methylKit object
#' 
#' Bismark aligner can output methylation information per base in
#' multiple different formats. This function reads cytosine report files,
#' which have chr,start, strand, number of cytosines (methylated bases) 
#' and number of thymines (unmethylated bases),context, trinucletide context.
#' 
#' \@param location a list or vector of file paths to coverage files
#'     
#' \@param sample.id a list or vector of sample ids
#' \@param assembly a string for genome assembly. Any string would work.
#' \@param treatment if there are multiple files to be read a treatment 
#'                  vector should be supplied.
#' \@param context a string for context of methylation such as: "CpG" or "CHG"
#' \@param min.cov a numeric value for minimum coverage. Bases that have coverage
#' below this value will be removed.
#' 
#' \@return methylRaw or methylRawList objects

readBismarkCytosineReport<-function(location,sample.id,assembly="unknown",treatment,
                                         context="CpG",min.cov=10){
  if(length(location)>1){
    stopifnot(length(location)==length(sample.id),
              length(location)==length(treatment))
  }

  result=list()
  for(i in 1:length(location)){
    df=fread.gzipped(location[[i]],data.table=FALSE)

    # remove low coverage stuff
    df=df[ (df[,4]+df[,5]) >= min.cov ,]




    # make the object (arrange columns of df), put it in a list
    result[[i]]= new("methylRaw",
                     data.frame(chr=df[,1],start=df[,2],end=df[,2],
                                strand=df[,3],coverage=(df[,4]+df[,5]),
                                numCs=df[,4],numTs=df[,5]),
                     sample.id=sample.id[[i]],
                     assembly=assembly,context=context,resolution="base"
    )
  }

  if(length(result) == 1){
    return(result[[1]])
  }else{

    new("methylRawList",result,treatment=treatment)
  }
}

concern about p-values of zero

Hi,

Our lab has been using methylKit for the differential methylation analysis of our 13 bisulfite sequencing samples (7 cases, 6 controls; without covariates but with coverage normalization) and noticed that for a couple of sites we were receiving p-values of zero, as seen in this example:

chr start end strand pvalue qvalue meth.diff
chr1 11160205 11160205 + 4.780530e-04 0.033563177 -15.166690
chr1 11336239 11336239 + 9.241583e-04 0.048401063 6.739688
chr1 11396931 11396931 + 0.000000e+00 0.000000000 -43.781785
chr1 11398012 11398012 - 2.156598e-05 0.005130648 14.402332
chr1 11561902 11561902 - 3.546179e-06 0.001508041 28.381195

To us, such a p-value seems a bit too good to be true, so we were wondering if you could tell us what would lead to a p-value of zero? Is there, for example, a p-value threshold after the passing of which the output is simply rounded to zero? Or might there be another artifact that could be causing this?

Thanks for your time and an otherwise great tool!

Issue with Unite function and large dataset

Hi Akalin,

I'm getting the following error when uniting my sample with the following code:

meth<-unite(object = familyObj, destrand=TRUE, by=.EACHI, save.db = F)

## Error in vecseq(f , len , if (allow.cartesian || notjoin || !anyDuplicated(f , : Join results in 216308346 rows; more than 60136491 = nrow(x)+nrow(i).
Check for duplicate key values in i each of which join to the same
group in x over and over again. If that’s ok, try by=.EACHI to run
j for each group to avoid the large allocation.  If you are sure you
wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please
search for this error message in the FAQ, Wiki, Stack Overflow and
datatable-help for advice.

I've checked to make sure that none of my samples have duplicate chr and coordinates using the following 1 liner
$cut -f 2,3 ${dir}BSTag-Sample-${SampleName}-reprep_CpG.methylKit| sort | uniq -cd

I've also tried replicating the problem with only using 2-3 samples, but I received no error. I then tried to subset my 17 samples to just the first 1,000,000 lines; this received no error. I tried again when I used the first 5 million, and also received no error. I tried 10 million, and I got the same error (albeit less duplicates mentioned). I wanted to shrink the file sizes down so I subset the file to only rows>5,000,000 and less then 10,000,000; this returned the error:

## Error in vecseq(f , len , if (allow.cartesian || notjoin || !anyDuplicated(f , : Join results in 11620518 rows; more than 7736026 = nrow(x)+nrow(i).
Check for duplicate key values in i each of which join to the same
group in x over and over again. If that’s ok, try by=.EACHI to run
j for each group to avoid the large allocation.  If you are sure you
wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please...

I tried to subset the files even further, such as 5,000,000 < rows < 7,500,000 and 7,500,000 < rows < 10,000,000, but this did not replicate the error. So unfortunately, I'm making available some large files which replicate the error.

The following is the code that I used (using the most recent github install of methylKit):

suppressPackageStartupMessages({ library(methylKit)
})
library(RColorBrewer)
library(gplots)
library(knitr)

file.list=list("BS-TAG-Sample-77_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-78_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-79_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-80_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-81_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-82_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-83_CpG_percentFix_5000000.methylKit",
               "BSTag-Sample-84-reprep_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-85_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-86_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-87_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-88_CpG_percentFix_5000000.methylKit",
               "BSTag-Sample-89-reprep_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-90_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-91_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-92_CpG_percentFix_5000000.methylKit",
               "BS-TAG-Sample-93_CpG_percentFix_5000000.methylKit")

familyObj=methRead(file.list, sample.id=list("gm77", "gm78", "gm79", "gm80",
"gm81", "gm82", "gm83", "gm84", "gm85", "gm86", "gm87", "gm88",
"gm89", "gm90", "gm91","gm92", "gm93"), assembly="hg38_decoy_EBV",
treatment=rep(0,17), context="CpG", mincov = 3)

meth_debug<-unite(object = familyObj, destrand=TRUE, by=.EACHI, save.db = F)
# I've also tried
meth_debug<-unite(object = familyObj, destrand=TRUE)

I've placed the files in the following google drive (sorry about the size of the files):
https://drive.google.com/drive/folders/0B1bAk0VjZmbVMURzOVJrWXVraUE?usp=sharing

Thank you for your help,
Andrew

feature request: read input from bismark sorted bams

This is a feature request on the development branch of methylkit.

As far as I can tell from the documentation, methylkit can read the necessary input from bismark output files by running the read.bismark() method, which currently calls a perl script that seeks .sam.gz files sorted by chromosome-position and transforms the content of each .sam.gz into the correct input for methylKit.

I have seen on Trello there is the intention to integrate bamtools c++ method calls, I presume it overlaps with this feature request: for methylkit to read in chromosome-position sorted bams from Bismark.

Thx

an error occurred when merging data

Dear Akalin,

An error occurred when I performed

meth=unite(filtered.myobjDB, destrand=FALSE)

Could you help me to deal with the problem?

Here is the error:
$meth=unite(filtered.myobjDB, destrand=FALSE)

Read 970860 rows and 7 (of 7) columns from 0.031 GB file in 00:00:03
Error in merge.data.table(res, tmp, by = c("V1", "V2", "V3", "V4"), all = all) :
x has some duplicated column name(s): V5.x,V6.x,V7.x,V5.y,V6.y,V7.y. Please remove or rename the duplicate(s) and try again.

Here is my data:
head(filtered.myobjDB)
[[1]]
methylRawDB object with 12585679 rows

chr start end strand coverage numCs numTs
1 chr1 13416 13416 + 12 4 8
2 chr1 13502 13502 + 16 10 6
3 chr1 13527 13527 + 16 8 8
4 chr1 13536 13536 + 16 6 10
5 chr1 16057 16057 + 10 9 1
6 chr1 16069 16069 + 11 11 0

sample.id: bankuai_29
assembly: hg19
context: CpG
resolution: base
dbtype: tabix

[[2]]
methylRawDB object with 11631348 rows

chr start end strand coverage numCs numTs
1 chr1 16242 16242 + 10 5 5
2 chr1 20126 20126 + 20 20 0
3 chr1 20155 20155 + 22 21 1
4 chr1 20205 20205 + 24 18 6
5 chr1 20233 20233 + 25 23 2
6 chr1 20252 20252 + 22 20 2

sample.id: bankuai_30
assembly: hg19
context: CpG
resolution: base
dbtype: tabix

[[3]]
methylRawDB object with 10197712 rows

chr start end strand coverage numCs numTs
1 chr1 13502 13502 + 13 12 1
2 chr1 13527 13527 + 11 6 5
3 chr1 13536 13536 + 11 4 7
4 chr1 16554 16554 + 10 9 1
5 chr1 47175 47175 + 25 19 6
6 chr1 47204 47204 + 30 16 14

sample.id: bankuai_32
assembly: hg19
context: CpG
resolution: base
dbtype: tabix

[[4]]
methylRawDB object with 18457443 rows

chr start end strand coverage numCs numTs
1 chr1 13416 13416 + 21 2 19
2 chr1 13502 13502 + 24 18 6
3 chr1 13527 13527 + 20 7 13
4 chr1 13536 13536 + 19 7 12
5 chr1 14698 14698 + 10 8 2
6 chr1 14709 14709 + 10 8 2

sample.id: bankuai_35
assembly: hg19
context: CpG
resolution: base
dbtype: tabix

[[5]]
methylRawDB object with 19208834 rows

chr start end strand coverage numCs numTs
1 chr1 13416 13416 + 15 7 8
2 chr1 13502 13502 + 17 14 3
3 chr1 13527 13527 + 16 10 6
4 chr1 13536 13536 + 15 10 5
5 chr1 14652 14652 + 11 2 9
6 chr1 14698 14698 + 13 9 4

sample.id: bankuai_34
assembly: hg19
context: CpG
resolution: base
dbtype: tabix

Here is my input data:
file.list=list("/lustre/home/zhangyongbiao/data/methylation/bankuai_29_CpG_symmetric.meth.for.methylKit","/lustre/home/zhangyongbiao/data/methylation/bankuai_30_CpG_symmetric.meth.for.methylKit", "/lustre/home/zhangyongbiao/data/methylation/bankuai_32_CpG_symmetric.meth.for.methylKit", "/lustre/home/zhangyongbiao/data/methylation/bankuai_35_CpG_symmetric.meth.for.methylKit","/lustre/home/zhangyongbiao/data/methylation/bankuai_34_CpG_symmetric.meth.for.methylKit")

myobjDB=methRead(file.list, sample.id=list("bankuai_29","bankuai_30","bankuai_32","bankuai_35","bankuai_34"), assembly="hg19", treatment=c(1,1,1,1,0), context="CpG",dbtype = "tabix", dbdir="/lustre/home/zhangyongbiao/data/methylation/methlDB" )

filtered.myobjDB=filterByCoverage(myobjDB,lo.count=10,lo.perc=NULL,hi.count=NULL,hi.perc=99.9)

argument is of length zero read.transcript.features bug

This may be related to issue #12 . Info below:

Running the following script gives the error below:

#!/usr/bin/Rscript
library(methylKit)
args<-commandArgs(TRUE)
devtools::source_gist("4839e615e2401d73fe51")
file.list = list(args[1],args[2],args[3],args[4])
# myobj = read(file.list,sample.id=list("CASE1","CASE2","ctrl1","ctrl2"),assembly="hg38",pipeline="amp",treatment=c(1,1,0,0))
myobj = readBismarkCytosineReport(file.list,sample.id=list("CASE1","CASE2","ctrl1","ctrl2"),assembly="hg38",treatment=c(1,1,0,0),min.cov=30)
pdf(   file="MethylationStats_Sample1.pdf") ; getMethylationStats(myobj[[1]],plot=T,both.strands=T); dev.off()
pdf(   file="MethylationStats_Sample2.pdf") ; getMethylationStats(myobj[[2]],plot=T,both.strands=T); dev.off()
pdf(file="MethylationStats_Reference1.pdf") ; getMethylationStats(myobj[[3]],plot=T,both.strands=T); dev.off()
pdf(file="MethylationStats_Reference2.pdf") ; getMethylationStats(myobj[[4]],plot=T,both.strands=T); dev.off()

meth = unite(myobj)
pdf(file="Meth_CorrelationPlot.pdf"); getCorrelation(meth,plot=T); dev.off()

myDiff=calculateDiffMeth(meth)
# write.table(file="diff.tsv",myDiff)

gene.obj=read.transcript.features(system.file("/bi/group/cegx/UCSC_database", "refseq.hg38.bed.txt", package = "methylKit"))
ann=annotate.WithGenicParts(myDiff,gene.obj)
write.table(file="annotate.WithGenicParts.tsv",ann)

See below:

          CASE1     CASE2     ctrl1     ctrl2
CASE1 1.0000000 0.9776257 0.3313481 0.2487731
CASE2 0.9776257 1.0000000 0.3276954 0.2451217
ctrl1 0.3313481 0.3276954 1.0000000 0.8153897
ctrl2 0.2487731 0.2451217 0.8153897 1.0000000
null device 
          1 
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
3: glm.fit: fitted probabilities numerically 0 or 1 occurred 
Error in if (grepl("^track", f.line)) (skip = 1) : 
  argument is of length zero
Calls: read.transcript.features -> read.transcript.features
In addition: Warning message:
In file(con, "r") :
  file("") only supports open = "w+" and open = "w+b": using the former
Execution halted

error of calculateDiffMeth function

Hi,

I am running the latest version of methylkit. However, I always encounter the following error for calculateDiffMeth function. The methrw.ut is produced successfully by unite function.

Best,
Pengcheng Yang

================= the error message: =================
methrw.ut=unite(methrw,destrand=FALSE)
methrw.dmb=calculateDiffMeth(methrw.ut,num.cores=5)
Error in pi0 * length(order_rawp) * order_rawp :
non-numeric argument to binary operator
Calls: calculateDiffMeth ... calculateDiffMeth -> fix.q.values.fisher -> SLIMfunc -> QValuesfun
In addition: Warning message:
In parallel::mclapply(my.f.list, f.fisher, mc.cores = num.cores) :
all scheduled cores encountered errors in user code

failed installation, S4Vectors update

Hello,
I tired to install based on manual with:

library(devtools) install_github("al2na/methylKit", build_vignettes=FALSE, repos=BiocInstaller::biocinstallRepos(), dependencies=TRUE,type="source")
But I got this:

Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) : namespace ‘S4Vectors’ 0.12.2 is already loaded, but >= 0.13.13 is required ERROR: lazy loading failed for package ‘methylKit’

I also tried to update S4Vectors with:

source("https://bioconductor.org/biocLite.R") biocLite("S4Vectors")

But the version I got is 0.12.2, I think that's the latest stable version. How do I install version >=0.13.13?

I guess I'm not the only one with this question, you may consider add it in the installation manaul.

Thanks!

Ji

feature request: script to transform bismark methylation output into methylKit input

This is a feature request against the development branch of methylKit:

As far as I can tell from the Illumina BaseSpace MethylKit, one can use methylation calls from bismark methylation extractor as input for methylkit. I presume there is a data transformation process between the bismark output files and the required methylKit input format. I have seen a similar data transformation described for BSMAP here: http://zvfak.blogspot.co.uk/2012/10/how-to-read-bsmap-methylation-ratio.html

This is a feature request to implement this data transformation of bismark methylation extractor files into methylkit input files.

NOTE: It may be that once the data transformation is identified, it's best for @FelixKrueger to have such code on bismark's side.

Thx

merge.data.table error when using unite function

hi, @al2na
Thanks for the great methylKit!

I used methylKit to process our WGBS data, however, a error occured when using unite function.

myData.tiled.1k <- tileMethylCounts(myData, win.size = 1000, step.size = 1000, cov.bases = 5, mc.cores = 2) # It's OK!
myData.tiled.1k.unite <- unite(myData.tiled.1k, min.per.group = 2L, mc.cores = 1)# A error occured
Error in merge.data.table(res, tmp, by = c("V1", "V2", "V3", "V4"), all = all) : 
  x has some duplicated column name(s): V5.x,V6.x,V7.x,V5.y,V6.y,V7.y. Please remove or rename the duplicate(s) and try again.

Let me know if additional information needed!
Please help me go through the problem!

libtool error in the installation of methykit

Dear al2na,

I have installed the latest R and bioclite, but there were libtool error when I were installing methykit package by using your install_github code in R console. The detailed error information is attached below, thanks for your help.

Error information:
../libtool: line 5986: cd: :/ifshk1/BC_CANCER/03user/lixiangchun/Software/INSTALL/hdf5-1.8.16/lib:/ifshk4/BC_PUB/biosoft/PIPE_RD/Package/R-3.1.1/lib64/R/lib:/ifshk1/BC_CANCER/03user/lixiangchun/Software/INSTALL/zlib-1.2.8/lib:/ifshk1/BC_CANCER/03user/lixiangchun/Software/INSTALL/bzip2-1.0.6/lib:/ifshk7/BC_RES/TECH/PMO/zhangbaifeng/eDMR/xz/lib:/ifshk7/BC_RES/TECH/PMO/zhangbaifeng/eDMR/pcre/lib:/ifshk7/BC_RES/TECH/PMO/zhangbaifeng/eDMR/pcre-8.39/curl/lib: No such file or directory
libtool: link: cannot determine absolute directory name of :/ifshk1/BC_CANCER/03user/lixiangchun/Software/INSTALL/hdf5-1.8.16/lib:/ifshk4/BC_PUB/biosoft/PIPE_RD/Package/R-3.1.1/lib64/R/lib:/ifshk1/BC_CANCER/03user/lixiangchun/Software/INSTALL/zlib-1.2.8/lib:/ifshk1/BC_CANCER/03user/lixiangchun/Software/INSTALL/bzip2-1.0.6/lib:/ifshk7/BC_RES/TECH/PMO/zhangbaifeng/eDMR/xz/lib:/ifshk7/BC_RES/TECH/PMO/zhangbaifeng/eDMR/pcre/lib:/ifshk7/BC_RES/TECH/PMO/zhangbaifeng/eDMR/pcre-8.39/curl/lib' make[5]: *** [libgnu.la] Error 1 make[5]: Leaving directory/tmp/RtmpfLBspx/devtools7c1c1c4ee618/Rhtslib/src/htslib/lib'
make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory /tmp/RtmpfLBspx/devtools7c1c1c4ee618/Rhtslib/src/htslib/lib' make[3]: *** [all] Error 2 make[3]: Leaving directory/tmp/RtmpfLBspx/devtools7c1c1c4ee618/Rhtslib/src/htslib/lib'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory /tmp/RtmpfLBspx/devtools7c1c1c4ee618/Rhtslib/src/htslib' make[1]: *** [all] Error 2 make[1]: Leaving directory/tmp/RtmpfLBspx/devtools7c1c1c4ee618/Rhtslib/src/htslib'
make: *** [htslib/libhts.la] Error 2
ERROR: compilation failed for package 'Rhtslib'

  • removing '/ifshk7/BC_RES/TECH/PMO/zhangbaifeng/eDMR/R-3.3.1/library/Rhtslib'
    Error: Command failed (1)

the new version looks great but I get an error message when I try to install:

  • installing source package ‘methylKit’ ...
    ** libs
    make: Nothing to be done for `all'.
    installing to /Library/Frameworks/R.framework/Versions/3.3/Resources/library/methylKit/libs
    ** R
    ** data
    ** inst
    ** tests
    ** preparing package for lazy loading
    ** help
    *** installing help indices
    ** building package indices
    ** installing vignettes
    ** testing if installed package can be loaded
    Error in dyn.load(file, DLLpath = DLLpath, ...) :
    unable to load shared object '/Library/Frameworks/R.framework/Versions/3.3/Resources/library/methylKit/libs/methylKit.so':
    dlopen(/Library/Frameworks/R.framework/Versions/3.3/Resources/library/methylKit/libs/methylKit.so, 6): Library not loaded: /Users/aakalin/Rlibs/Rhtslib/lib/libhts.0.dylib
    Referenced from: /Library/Frameworks/R.framework/Versions/3.3/Resources/library/methylKit/libs/methylKit.so
    Reason: image not found
    Error: loading failed
    Execution halted
    ERROR: loading failed
  • removing ‘/Library/Frameworks/R.framework/Versions/3.3/Resources/library/methylKit’
    Error: Command failed (1)

Read unable to load files

Hello,

methylKit has an issue like so:


> library(methylKit)
> file.list = list("../MethylSig/GSM1384234_Control_1.methylSig","../MethylSig/GSM1384235_Control_2.methylSig",
+ "../MethylSig/GSM1384236_Control_3.methylSig","../MethylSig/GSM1384237_DSS_1.methylSig","../MethylSig/GSM1384238_DSS_2.methylSig",
+ "../MethylSig/GSM1384239_DSS_3.methylSig","../MethylSig/GSM1384240_AOM_1.methylSig","../MethylSig/GSM1384241_AOM_2.methylSig","../MethylSig/GSM1384242_AOM_3.methylSig")
> read(file.list)
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘read’ for signature ‘"list", "missing", "missing"’

with input files looking like:

chrBase  chr   base  strand   coverage freqC freqT
chr1.3000574    chr1        3000574 F   99  0.00    100.00
chr1.3000575    chr1        3000575 F   99  100.00  0.00
chr1.3000726    chr1        3000726 F   99  33.30   66.70

both with and without the header.

is anyone aware of a solution to this?
-DEC

methRead reads all C types ?

I've been using methylKit recently. It's easy to use.
Here is my question.When I use methRead to read bismarkCytosineReport( I used --cytosine_report --CX. So the report is mixed with CpG, CHH and CHG ), I get methylRaw object mixed with the three kinds of C.

Here is my command.
meth=methRead(sample_list,
sample.id=list("test","ctrl"),
assembly="hg19",
treatment=c(1,0),
context="CpG",
pipeline = "bismarkCytosineReport",
header=FALSE,
mincov = 1)

I thought context argument can filter CHH & CHG at first. Now it seems wrong. Am I right ? I think should seperate them before read in R.

cannot coerce class structure annotationByGenicParts to a data.frame

I resolved most of the issues with my script and now got to this point where the annonationByGenicParts fails:

I get a few warnings about non-karyotipic chromosomal entries, but I think it's fair to say those can be easily get rid of.

#!/usr/bin/Rscript
library(methylKit)
args<-commandArgs(TRUE)
devtools::source_gist("4839e615e2401d73fe51")
file.list = list(args[1],args[2],args[3],args[4])
# myobj = read(file.list,sample.id=list("CASE1","CASE2","ctrl1","ctrl2"),assembly="hg38",pipeline="amp",treatment=c(1,1,0,0))
myobj = readBismarkCytosineReport(file.list,sample.id=list("CASE1","CASE2","ctrl1","ctrl2"),assembly="hg38",treatment=c(1,1,0,0),min.cov=30)
pdf(   file="MethylationStats_Sample1.pdf") ; getMethylationStats(myobj[[1]],plot=T,both.strands=T); dev.off()
pdf(   file="MethylationStats_Sample2.pdf") ; getMethylationStats(myobj[[2]],plot=T,both.strands=T); dev.off()
pdf(file="MethylationStats_Reference1.pdf") ; getMethylationStats(myobj[[3]],plot=T,both.strands=T); dev.off()
pdf(file="MethylationStats_Reference2.pdf") ; getMethylationStats(myobj[[4]],plot=T,both.strands=T); dev.off()

meth = unite(myobj)
pdf(file="Meth_CorrelationPlot.pdf"); getCorrelation(meth,plot=T); dev.off()

# cluster all samples using correlation distance and return a tree object for plclust
hc = clusterSamples(meth, dist="correlation", method="ward", plot=FALSE)

# cluster all samples using correlation distance and plot hiarachical clustering
clusterSamples(meth, dist="correlation", method="ward", plot=TRUE)

# screeplot of principal component analysis.
pdf(file="PCASamples.screeplot.pdf"); PCASamples(meth, screeplot=TRUE); dev.off()

# principal component anlaysis of all samples.
pdf(file="PCASamples.pcaxyplot.pdf"); PCASamples(meth); dev.off()

myDiff=calculateDiffMeth(meth)
# write.table(file="diff.tsv",myDiff)

gene.obj=read.transcript.features("/bi/group/cegx/UCSC_database/refseq.hg38.bed.txt")

ann=annotate.WithGenicParts(myDiff,gene.obj)
summary(ann)
write.table(file="annotate.WithGenicParts.tsv",ann)

I get this error after the warnings:

Warning messages:
1: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chr11_KI270832v1_alt, chr11_KI270903v1_alt, chr1_KI270713v1_random, chr22_KI270877v1_alt, chr22_KI270878v1_alt, chr6_GL000251v2_alt, chr6_GL000252v2_alt, chr6_GL000256v2_alt, chrUn_KI270442v1
  - in 'y': chrY
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
2: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chr11_KI270832v1_alt, chr11_KI270903v1_alt, chr1_KI270713v1_random, chr22_KI270877v1_alt, chr22_KI270878v1_alt, chr6_GL000251v2_alt, chr6_GL000252v2_alt, chr6_GL000256v2_alt, chrUn_KI270442v1
  - in 'y': chrY
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
3: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chr11_KI270832v1_alt, chr11_KI270903v1_alt, chr1_KI270713v1_random, chr22_KI270877v1_alt, chr22_KI270878v1_alt, chr6_GL000251v2_alt, chr6_GL000252v2_alt, chr6_GL000256v2_alt, chrUn_KI270442v1
  - in 'y': chrY
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
4: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrY
  - in 'y': chr11_KI270832v1_alt, chr11_KI270903v1_alt, chr1_KI270713v1_random, chr22_KI270877v1_alt, chr22_KI270878v1_alt, chr6_GL000251v2_alt, chr6_GL000252v2_alt, chr6_GL000256v2_alt, chrUn_KI270442v1
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
5: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrY
  - in 'y': chr11_KI270832v1_alt, chr11_KI270903v1_alt, chr1_KI270713v1_random, chr22_KI270877v1_alt, chr22_KI270878v1_alt, chr6_GL000251v2_alt, chr6_GL000252v2_alt, chr6_GL000256v2_alt, chrUn_KI270442v1
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
6: In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': chrY
  - in 'y': chr11_KI270832v1_alt, chr11_KI270903v1_alt, chr1_KI270713v1_random, chr22_KI270877v1_alt, chr22_KI270878v1_alt, chr6_GL000251v2_alt, chr6_GL000252v2_alt, chr6_GL000256v2_alt, chrUn_KI270442v1
  Make sure to always combine/compare objects based on the same reference
  genome (use suppressWarnings() to suppress this warning).
                Length                  Class                   Mode 
                     1 annotationByGenicParts                     S4 
Error in as.data.frame.default(x[[i]], optional = TRUE) : 
  cannot coerce class "structure("annotationByGenicParts", package = "methylKit")" to a data.frame
Calls: write.table ... data.frame -> as.data.frame -> as.data.frame.default
Execution halted

sorry to interrupt

hello, here I have a question, I have the rice BS-seq data, so I want to know if the R packages (methylkit) is suitable for the rice analysis? I used this packages, but when I turned to the step: "myDiff=calculateDiffMeth(meth)", it take over 24 hours to get the results and didn't show any hint that showing anything is wrong. and I am confused, can this packages analyze the rice plant genome?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.