zw-xjtlu / exomepeak2 Goto Github PK

Peak calling and differential methylation for MeRIP-Seq

R 100.00%

exomepeak2's Introduction

exomePeak2

2021-02-07

exomePeak2 provides peak detection and differential methylation for Methylated RNA Immunoprecipitation Sequencing (MeRIP-Seq) data. MeRIP-Seq is a commonly applied sequencing assay that measures the location and abundance of RNA modification sites under specific cellular conditions. In practice, the technique is sensitive to PCR amplification biases commonly found in NGS data. In addition, the efficiency of immunoprecipitation often varies between different IP samples. exomePeak2 can perform peak calling and differential analysis independent of GC content bias and IP efficiency changes.

To install exomePeak2 from Github, type the following command in R console.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install(c("Rsamtools", "GenomicAlignments", "GenomicRanges", 
                       "GenomicFeatures", "DESeq2", "ggplot2", "mclust", "BSgenome", 
                       "Biostrings", "GenomeInfoDb", "BiocParallel", "IRanges", 
                       "S4Vectors", "rtracklayer", "methods", "stats", 
                       "utils", "BiocGenerics", "magrittr", "speedglm", "splines"))

if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")

devtools::install_github("ZW-xjtlu/exomePeak2", build_vignettes = TRUE)

To view the documentation of exomePeak2, type browseVignettes("exomePeak2") after installation.

exomepeak2's People

Contributors

Stargazers

Watchers

Forkers

sunhaifeng123 shutaochen97 tsjzz xia1112 joybio

exomepeak2's Issues

failing due to memory/"caught segment address (nil): cause 'unknown'"

Hello - I've been attempting to run exomepeak2 on hg38 full transcript for chr10 only, single samples provided for ip and input control vs treated however it's been failing consistently due to 1) memory limit issues and 2) now with increased memory it's failing with the error: "*** caught segfault ***
address (nil), cause 'unknown'"

I'm using an HPC linux OS, requesting a single node, 20 threads and 128gb memory. I'm using a conda compiled R4.3 environment. Within R I'm increasing the memory limit to 100GB using the 'ulimit' package.

memory limit issue:

> library(exomePeak2)
> library(DESeq2)
> library(GenomicFeatures)
> library(BSgenome.Hsapiens.UCSC.hg38)
> gc()
           used  (Mb)  gc trigger     (Mb) max used  (Mb)
Ncells  9343582 499.1    19313185   1031.5 11348671 606.1
Vcells 16300605 124.4 13107200000 100000.0 28107829 214.5
> txdb <- makeTxDbFromGFF(
+     file = "/ifs/data/sz3145_gp/genomes/Homo_sapiens/UCSC/hg38/Annotation/gencode.v43.annotation.gtf",
+     format = "gtf",
+     dataSource = "GENCODE GTF",
+     organism = "Homo sapiens")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
> exomePeak2(
+     bam_ip = c(
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304491_dedup.REF_chr10.bam"
+         ),
+     bam_input = c(
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304485_dedup.REF_chr10.bam"
+         ),
+     bam_input_treated = c(   
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304487_dedup.REF_chr10.bam"
+         ),
+     bam_ip_treated = c(
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304493_dedup.REF_chr10.bam"
+         ),
+     txdb = txdb,
+     genome = "BSgenome.Hsapiens.UCSC.hg38",
+     mode = "full_transcript" 
+     )
Extract bin features ... OK
Count reads on bin features ... OK
Identify background features ... OK
Estimate sample sepecific size factors from the background ... OK
Calculate bin GC contents on exons ... Killed

memory increased, caught segment error

> devtools::install_github("krlmlr/ulimit")
> library(ulimit)
> ulimit::memory_limit(100000) 
 soft  hard
1e+05   Inf
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  924928 49.4    1812270 96.8  1328472 71.0
Vcells 2404583 18.4    8388608 64.0  5079774 38.8
> library(exomePeak2)
> library(DESeq2)
> library(GenomicFeatures)
> library(BSgenome.Hsapiens.UCSC.hg38)
> txdb <- makeTxDbFromGFF(
+     file = "/ifs/data/sz3145_gp/genomes/Homo_sapiens/UCSC/hg38/Annotation/gencode.v43.annotation.gtf",
+     format = "gtf",
+     dataSource = "GENCODE GTF",
+     organism = "Homo sapiens")
> exomePeak2(
+     bam_ip = c(
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304491_dedup.REF_chr10.bam"
+         ),
+     bam_input = c(
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304485_dedup.REF_chr10.bam"
+         ),
+     bam_input_treated = c(   
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304487_dedup.REF_chr10.bam"
+         ),
+     bam_ip_treated = c(
+         "/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304493_dedup.REF_chr10.bam"
+         ),
+     txdb = txdb,
+     genome = "BSgenome.Hsapiens.UCSC.hg38",
+     mode = "full_transcript" 
+     )
Extract bin features ... OK
Count reads on bin features ... OK
Identify background features ... OK
Estimate sample sepecific size factors from the background ... OK
Calculate bin GC contents on exons ... OK
Fit GC curves with smoothing splines ... OK
Calculate offset matrix for bins ... OK
Detect peaks with GLM ...
 *** caught segfault ***
address (nil), cause 'unknown' 

Traceback:
 1: t(exp(modelMatrix %*% t(betaRes$beta_mat)))
 2: fitNbinomGLMs(objectNZ, betaTol = betaTol, maxit = maxit, useOptim = useOptim,     useQR = useQR, renameCols = renameCols, modelMatrix = modelMatrix,     minmu = minmu)
 3: nbinomWaldTest(dds)
 4: callPeaks(se, txdb, test_method, p_cutoff, exByGene, bin_size,     motif_based)
 5: force(x)
 6: quiet(.)
 7: callPeaks(se, txdb, test_method, p_cutoff, exByGene, bin_size,     motif_based) %>% quiet
 8: diffAnalysis(bam_IP = bam_ip, bam_input = bam_input, bam_IP_treated = bam_ip_treated,     bam_input_treated = bam_input_treated, txdb = txdb, genome = genome,     bin_size = bin_size, step_size = step_size, fragment_length = fragment_length,     strandness = strandness, test_method = test_method, p_cutoff = p_cutoff,     plot_gc = plot_gc, parallel = parallel, motif_based = motif_based,     motif_sequence = "DRACH")
 9: exomePeak2(bam_ip = c("/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304491_dedup.REF_chr10.bam"),     bam_input = c("/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304485_dedup.REF_chr10.bam"),     bam_input_treated = c("/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304487_dedup.REF_chr10.bam"),     bam_ip_treated = c("/ifs/data/sz3145_gp/fy2306/projects/rotation/data/GSE207663/sambamba_default/unique/GSM6304493_dedup.REF_chr10.bam"),     txdb = txdb, genome = "BSgenome.Hsapiens.UCSC.hg38", mode = "full_transcript")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

As we would prefer to run exomepeak2 on the whole bam file, splitting into chromosomes was a workaround, however, this single chromosome is not running to completion. I've read that exomepeak2 is highly memory intensive, however, it says for large genomes that 4GB memory is needed...

We've successfully run exomepeak2 on exons only, but we need the full transcript data.
Any help/guidance would be greatly appreciated!

请教：关于差异peaks的分析输出结果

尝试着使用exomePeak2进行分析得到了一个初步的差异peaks的分析结果。

我之前使用的分析方法是Macs2进行peaks calling后使用bedtools合并生物学重复，然后再使用bedtools 从bam中得到IP的富集的reads（在这一部分中，没用用到input的数据），最后在使用diffbind等R包进行差异分析。exomePeak2考虑到了生物学重复这一部分，在差异分析这一步也去考虑了input文件，似乎是吧表达也考虑进去进行校正（可参考的只有Liu的一篇文章）。

我的问题是：现在我已经使用exomePeak2完成了control 和treat的差异分析，一般情况下这时候我需要用IGV去表征这样的一个富集结果，现在的结果我放在下面的图中，以最后一行的差异分析结果，他的计算经历了本身的IP相对input的一次计算Modlog，从最终结果来看是treat相对control修饰程度上升，但是reads却没有明显的变化，首先希望能够了解一下这部分的数据的简单的一个计算过程？其次是我更倾向于相信这一部分数据显示的修饰上下调还是reads水平？

最后在peak calling这一步我使用exomePeak2得到了不错的结果，很感谢您的开发，以上就是我的部分疑问，实在是找不到相关的资料，仅能从bioconductor上找到一些。期待您的回复！

Usage of ExomePeak2 for Bacterial & Viral m6A

Hi,

I've been gratefully using ExomePeak2 for my m6A differential methylation analyses.
Although I was focusing on eukaryotic hosts, I am interested in analyzing m6A in ssRNA & dsDNA viruses, and perhaps bacteria in future.
May I ask your advice on parameters I should tweak for these cases? For example, I am using RNA-seq (& MeRIP-seq) reads that contain host and viral sequences.

Thanks,
Taehyung

How to decide the fragment_length

Hi,

Recently I have a question about how to decide my fragment_length:

For example my fragment length peak is 200 (not including adapters), and I take PE150 for sequence, should I change the default fragment_length=100 to 200 ? And how about taking SE50 for sequence?

This really puzzled me these days.

Thanks advance for your reply!

Haifeng
NanJing Medical University
2020-08-04

an error occured

exomePeak2 is so nice a better package for m6a , I installed it correctly and try to use , all bam files have been indexed. I come across the error like this :

My command is :

library(exomePeak2)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
library(BSgenome.Mmusculus.UCSC.mm10)

MeRIP_Seq_Alignment <- scanMeripBAM(
  bam_ip = c("R19008728-QMJ-Thy--control-IP-1_combined_R.sorted.bam", 
             "R19008728-QMJ-Thy--control-IP-2_combined_R.sorted.bam"), 
  bam_input = c("R19008728-QMJ-Thy--control-input-1_combined_R.sorted.bam",
                "R19008728-QMJ-Thy--control-input-2_combined_R.sorted.bam"),
  bam_treated_ip = c("R19008728-QMJ-Thy--KO-IP-1_combined_R.sorted.bam", 
                     "R19008728-QMJ-Thy--KO-IP-2_combined_R.sorted.bam"),
  bam_treated_input = c("R19008728-QMJ-Thy--KO-input-1_combined_R.sorted.bam",
                        "R19008728-QMJ-Thy--KO-input-2_combined_R.sorted.bam"), 
  paired_end = TRUE,
  library_type = "1st_strand") 

SummarizedExomePeaks <- exomePeakCalling(merip_bams = MeRIP_Seq_Alignment,
                                         txdb = TxDb.Mmusculus.UCSC.mm10.ensGene,
                                         bsgenome = Mmusculus,
                                         p_adj_cutoff = 0.05,
                                         logFC_cutoff = 0,
                                         parallel = TRUE)

And then it runs and error :

Generating bins on exons...
Counting reads on bins...
Error in names(res) <- nms : 'names'属性的长度[8]必需和矢量的长度[2]一样

> Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal

I'll very grateful if you can give me any help , thank you !

Sun Haifeng
NanJing Medical University

duplicate 'names(reads)' not allowed

Dear Developer, I am having some problems running exomePeak2 analysis, here is my code:

library(exomePeak2)
gene_anno_gtf <- "/scratch/lb4489/bioindex/gencode.vM33.annotation.gtf"

input_files <- c('/scratch/lb4489/project/eem6a/ctrl_rep1.sorted.bam','/scratch/lb4489/project/eem6a/ctrl_rep2.sorted.bam','/scratch/lb4489/project/eem6a/ctrl_rep3.sorted.bam')
ip_files <- c('/scratch/lb4489/project/eem6a/ctrl_Merip_rep1.sorted.bam','/scratch/lb4489/project/eem6a/ctrl_Merip_rep2.sorted.bam','/scratch/lb4489/project/eem6a/ctrl_Merip_rep3.sorted.bam')
input_treated_files <- c('/scratch/lb4489/project/eem6a/treat_rep1.sorted.bam','/scratch/lb4489/project/eem6a/treat_rep2.sorted.bam','/scratch/lb4489/project/eem6a/treat_rep3.sorted.bam')  
ip_treated_files <- c('/scratch/lb4489/project/eem6a/treat_Merip_rep1.sorted.bam','/scratch/lb4489/project/eem6a/ctrl_Merip_rep2.sorted.bam','/scratch/lb4489/project/eem6a/treat_Merip_rep3.sorted.bam')

exomePeak2(bam_ip = ip_files, bam_input = input_files, 
           bam_ip_treated = ip_treated_files, bam_input_treated = input_treated_files, 
           gff = gene_anno_gtf, genome = "mm39", p_cutoff = 1e-2,
           experiment_name = paste0('exomePeak2_output_EE_vs_SH'))

Here is the output：

Extract bin features ... OK
Count reads on bin features ... Error in .local(features, reads, mode, ignore.strand, ...) : 
  duplicate 'names(reads)' not allowed
Calls: exomePeak2 ... featuresCounts -> summarizeOverlaps -> summarizeOverlaps -> .local
In addition: Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
  The "phase" metadata column contains non-NA values for features of type
  stop_codon. This information was ignored.
Execution halted

Can you help me see what the problem is.

Thanks,
LeeLee

reference genome can't be recognized?

Hi, ZW-xjtlu,

When I ran exomePeak2, it came with a warning about reference genome.
The supplied txdb was loaded from a sqlite file for mm39 which was the same for data alignment.
However, it was said the reference genome wasn't provided, though it was assigned in the command line.
Is this an issue which may cause problems? How can this issue be solved?

The output as below:

The command as below:

exomePeak2(bam_ip=CON_IP_BAM, bam_input=CON_INPUT_BAM, bam_input_treated=TRE_INPUT_BAM, bam_ip_treated=TRE_IP_BAM, gff=gtfpath, txdb =mm39_txdb)

The warning message as below:

In diffAnalysis(bam_IP = bam_ip, bam_input = bam_input, bam_IP_treated = bam_ip_treated, :
Reference genome not provided, GC content bias is left uncorrected.

关于线程问题

你好：

近期我在使用exomePeak包进行peakcalling，由于数据量较大（2个样本，3个生物学重复，每个约20G数据），计算十分缓慢（约1小时计算1%），我使用的服务器性能充足却无法发挥，而我也没有在手册中找到有关多线程的参数。

请问exomePeak2是否有助于加快运算速度，或者能提供线程控制参数呢？如果仍使用exomePeak包有无其他办法提高速度呢？

谢谢指导！

Calculate bin GC contents on exons ... Error in h(simpleError(msg, call))

Hello,
Try to use exomePeak2 for peak calling, yet it output error.
bam files were aligned to ensembl mus musculus GRCm38 GTF which is the same assigned by 'gtfpath'.
genome were aligned as mm10.
It ran for a while then stopped with an error as shown below.
How can I fix it?
Thanks,

exomePeak2(bam_ip = IP_BAM, bam_input = INPUT_BAM, gff = gtfpath, genome = "mm10", save_dir=savepath)
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Extract bin features ... OK
Count reads on bin features ... OK
Identify background features ... OK
Estimate sample sepecific size factors from the background ... OK
Calculate bin GC contents on exons ... Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'letterFrequency': error in evaluating the argument 'x' in selecting a method for function 'XStringSet': The following seqlevels are to be dropped but are currently in use
(i.e. have ranges on them): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, X, Y, MT, GL456210.1, GL456211.1, GL456212.1,
GL456216.1, GL456219.1, GL456221.1, GL456233.1, GL456239.1, GL456350.1,
GL456354.1, GL456372.1, GL456381.1, GL456385.1, JH584292.1, JH584293.1,
JH584294.1, JH584295.1, JH584296.1, JH584297.1, JH584298.1, JH584299.1,
JH584303.1, JH584304.1. Please use the 'pruning.mode' argument to
control how to prune 'x', that is, how to remove the ranges in 'x' that
are on these sequences. For example, do something like:
seqlevels(x, pruning.mode="coarse") <- new_seqlevels
or:
keepSeqlevels(x, new_seqlevels, pruning.mode="coarse")
See ?seqinfo for a description of the pruning modes.
In addition: Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type
stop_codon. This information was ignored.
2: In .merge_two_Seqinfo_objects(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)

请教最新版exomePeak2 （1.7）报错

作者您好，我最近用新版exomePeak2的时候报了这样的错：
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Extract bin features ... OK
Count reads on bin features ... OK
Identify background features ... OK
Estimate sample sepecific size factors from the background ... OK
Detect peaks with GLM ... OK
Count reads on peaks ... OK
Calculate offset matrix for peaks ... OK
Detect differentially modified peaks with interactive GLM ... OK
Error in .local(object, con, format, ...) :
Scores must be non-NA numeric values
Calls: exomePeak2 ... callGeneric -> eval -> eval -> export -> export -> .local

以下是我的代码（3个重复）：
library(exomePeak2)

setwd("/home/wuzh/shMETTL14")
GENE_ANNO_GTF = '/home/wuzh/hisat2_index/hg19/hg19.gtf'

control_Input = c('siControl_1_input.bam','siControl_2_input.bam','siControl_3_input.bam')
control_IP = c('siControl_1_IP.bam','siControl_2_IP.bam','siControl_3_IP.bam')

shmettl14_Input = c('siMETTL14_1_input.bam','siMETTL14_2_input.bam','siMETTL14_3_input.bam')
shmettl14_IP = c('siMETTL14_1_IP.bam','siMETTL14_2_IP.bam','siMETTL14_3_IP.bam')

sep1 <- exomePeak2(
bam_ip = control_IP,
bam_input = control_Input,
bam_ip_treated = shmettl14_IP,
bam_input_treated = shmettl14_Input,
gff = GENE_ANNO_GTF,
#genome = "hg19",
strandness = "1st_strand",
parallel = 12,
save_output = TRUE,
save_dir = getwd(),
experiment_name = "exomePeak2_updated_output_shmettl14")
之前用两个重复的时候好像可以跑通，挺疑惑的。希望您能帮我看一下，十分感激！

Error in manager$availability[[as.character(result$node)]] <- TRUE : wrong args for environment subassignment

老师您好,感谢您开发了这个非常棒的工具。最近我在使用时遇到了这个报错

exomePeak2(bam_ip = IP_BAM,
           bam_input = INPUT_BAM,
           # gff = GENE_ANNO_GTF,
           txdb = TxDb.Hsapiens.UCSC.hg38.knownGene,
           fragment_length = 100,
           parallel = 12,
           p_cutoff = 0.05,
           experiment_name = "NC")
Extract bin features ... OK
Count reads on bin features ... Error in manager$availability[[as.character(result$node)]] <- TRUE :
  wrong args for environment subassignment
In addition: Warning message:
In peakCalling(bam_IP = bam_ip, bam_input = bam_input, txdb = txdb,  :
  Reference genome not provided, GC content bias is left uncorrected.
Error in serialize(data, node$con) : ignoring SIGPIPE signal

试了多次还是一样,不知道是什么原因,感谢老师解答!

whats next with the result (csv,bed,rds)

hi i have use your package for my analysis . as ,

Define file paths to BAM files----

f1 <- "C:/Users/Azka/Desktop/RwithAmmar/azkexomepeak/WT_IP_R1_alignments_final_sorted.bam"
f2 <- "C:/Users/Azka/Desktop/RwithAmmar/azkexomepeak/WT_IP_R2_alignments_final_sorted.bam"
f3 <- "C:/Users/Azka/Desktop/RwithAmmar/azkexomepeak/WT_IP_R3_alignments_final_sorted.bam"
IP_BAM <- c(f1, f2, f3)
f1 <- "C:/Users/Azka/Desktop/RwithAmmar/azkexomepeak/WT_input_R1_alignments_final_sorted.bam"
f2 <- "C:/Users/Azka/Desktop/RwithAmmar/azkexomepeak/WT_input_R2_alignments_final_sorted.bam"
f3 <- "C:/Users/Azka/Desktop/RwithAmmar/azkexomepeak/WT_input_R3_alignments_final_sorted.bam"
INPUT_BAM <- c(f1, f2, f3)

Define file paths to genome and GFF files

#genome <- "C:/Users/Azka/Desktop/RwithAmmar/azkexomepeak/genomic.fna"
gff <- "C:/Users/Azka/Desktop/RwithAmmar/azkexomepeak/GCF_000146045.2_R64_genomic.gff"

#check files
file.exists(INPUT_BAM,IP_BAM,gff,genome)

Peak Calling

exomePeak2(bam_ip = IP_BAM,bam_input = INPUT_BAM, gff = gff )

this simply gaves me csv ,bed and rds .
it gave some error like some transcripts in the input data may not have unique identifiers or may have multiple coding sequences, and that the reference genome was not provided for correction of GC content bias.

kindly guide me what can i edit to perform better and what can i do with the result files , do i need to interprete and make graph by my self or there are some available that give sharp outcomes

thanks advance .

exomePeak2报错

不知道为什么报错，希望得到您的回复

使用的代码如下：

还有这种情况，报完错继续分析

您好，请问有已发表，或者预印版的文章吗

我先仔细看看exomepeak2和其他R包的对比，如果确实有新的优势，我们将会改进pipeline，用您的R包。
谢谢！

How to install

Hi,
Recently, i can not install exomePeak2 package because "RMariaDB" package not correct installed, and MySQL and RMySQL also not installed.
Could you please give some suggestions?

weird result with test_method=DESeq2

Hi, I am trying to use DESeq2 as the test method. The command ran, but it generated 2 records only while the default test method generated >6000 records.
I attached the command and output log below. What could be the potential cause and how can I fix it?

exomePeak2(bam_ip=CON_IP_BAM, bam_input=CON_INPUT_BAM, bam_input_treated=TRE_INPUT_BAM, bam_ip_treated=TRE_IP_BAM, gff=gtfpath, txdb=txdb, test_method="DESeq2", p_cutoff=0.05, save_dir=outputpath)
Extract bin features ... OK
Count reads on bin features ... OK
Identify background features ... OK
Estimate sample sepecific size factors from the background ... OK
Detect peaks with GLM ... gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
OK
Count reads on peaks ... OK
Calculate offset matrix for peaks ... OK
Detect differentially modified peaks with interactive GLM ... gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
OK
GRangesList object of length 2:
$1
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand

1 3 95031787-95031912 +
1 3 95032508-95032531 +
seqinfo: 61 sequences from an unspecified genome; no seqlengths
$2
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand

2 19 5473563-5473712 -

seqinfo: 61 sequences from an unspecified genome; no seqlengths
Warning message:
In diffAnalysis(bam_IP = bam_ip, bam_input = bam_input, bam_IP_treated = bam_ip_treated, :
Reference genome not provided, GC content bias is left uncorrected.

exomePeak2运行出错

以下是我的运行代码：version: 1.14.3
setwd("/data2/XXX/ZJ_MeRIPseq")
library(exomePeak2)

GTF_FILE <- "/data1/database/GRCH38.94/Homo_sapiens.GRCh38.94.chr.gtf"

res <- exomePeak2(
bam_input = c("WT_bam/0h_Input.bam"),
bam_ip = c("WT_bam/0h_IP.bam"),
bam_input_treated = c("WT_bam/4h_Input.bam"),
bam_ip_treated = c("WT_bam/4h_IP.bam"),
genome = "hg38",
gff = GTF_FILE,
save_dir = "exomePeak2_WT_0h_4h"
)

Extract bin features ... OK
Count reads on bin features ... OK
Identify background features ... Error in if (G[1] == 1) { : missing value where TRUE/FALSE needed
Calls: exomePeak2 ... classifyBackground -> Mclust -> eval -> eval -> mclustBIC
In addition: There were 47 warnings (use warnings() to see them)
请问可能出现的问题在哪呢？

使用exomePeak2进行差异分析时报错

作者您好，非常感谢您编写的这个R包。但是我在使用exomePeak2进行差异分析时报错，请问如何解决，非常感谢🙏。

ERROR with 'XML'

Hi
I installed the exomePeak2 with BiocManager, but somehow error occured with 'XML'. I downloaded and installed the XML in my system, while it still not work. Do you have any suggestion?
Thanks in advance!

基因组文件

Question about blockCount

Hello, Zhen.
I have a question about the output of exomePeak2. I discovered that the blockCount of all peaks is equal to one. Is this correct? If not, how should the parameters be adjusted? I used exomePeak2 v1.6.0, and my code is included below:
exomePeak2( bam_ip = ip_bam, bam_input = input_bam, genome = "hg38", gff_dir = GENE_ANNO_GTF, paired_end = TRUE, library_type = "unstranded", fragment_length = 150, parallel = 12, peak_calling_mode = "whole_genome" )
If you could give me some advice, it would be greatly appreciated.

no @SQ lines in the header.

Dear developer,

Thank you for providing a useful tool in RNA analysis, I have successfully used your tool in many projects, which have yielded great results. However, recently I ran into an error which cannot be solved. I applied the almost same code as previous projects, but this time exomePeak2 tells that:

Count reads on bin features ... [samopen] **no @SQ lines in the header.**
Error in names(res) <- `*vtmp*` : 
  'names' attribute [2] must be the same length as the vector [1]
Calls: exomePeak2 ... .dispatchBamFiles -> bplapply -> bplapply -> bplapply -> bplapply
In addition: Warning message:
stop worker failed:
  wrong args for environment subassignment 
Execution halted
Error in serialize(data, node$con) : ignoring SIGPIPE signal
Calls: local ... .send -> <Anonymous> -> sendData.SOCKnode -> serialize
Execution halted

Here are all the other information:

The bam file header (I am not including all of them):

@HD     VN:1.4  SO:coordinate
@SQ     SN:chr1 LN:248956422
@SQ     SN:chr2 LN:242193529
@SQ     SN:chr3 LN:198295559
@SQ     SN:chr4 LN:190214555
@SQ     SN:chr5 LN:181538259
@SQ     SN:chr6 LN:170805979
@SQ     SN:chr7 LN:159345973
@SQ     SN:chr8 LN:145138636
@SQ     SN:chr9 LN:138394717
@SQ     SN:chr10        LN:133797422
@SQ     SN:chr11        LN:135086622
@SQ     SN:chr12        LN:133275309
@SQ     SN:chr13        LN:114364328
@SQ     SN:chr14        LN:107043718
@SQ     SN:chr15        LN:101991189
@SQ     SN:chr16        LN:90338345
@SQ     SN:chr17        LN:83257441
@SQ     SN:chr18        LN:80373285
@SQ     SN:chr19        LN:58617616
@SQ     SN:chr20        LN:64444167
@SQ     SN:chr21        LN:46709983
@SQ     SN:KI270732.1   LN:41543

The R script I used:

message("library...")
library(exomePeak2)
library(devtools)
library(TxDb.Hsapiens.UCSC.hg38.refGene)
library(BSgenome.Hsapiens.UCSC.hg38)

path.bam <- "<bam_path>"
GENE_ANNO_GTF <- "<gtf_path>/gencode.v45.basic.annotation.gtf"

files <- list.files(file.path(path.bam), full.names = TRUE)

files <- files[grep("sorted.rmdup.uniq.bam$",files)]

#nat10 EXT (These are the treated samples)
EXT1 <- files[4]
EXT2 <- files[5]
EXT3 <- files[6]
IP_BAM.EXT <- c(EXT1,EXT2,EXT3)
INPUT_BAM.EXT <- files[8]


#nat10 RC (This is the control group)
RC1 <- files[1]
RC2 <- files[2]
RC3 <- files[3]
IP_BAM.RC <- c(RC1,RC2,RC3)
INPUT_BAM.RC <- files[7]

#make txdb file
message("making txdb")
txdb <- makeTxDbFromGFF(file = "<path to gtf>/gencode.v45.basic.annotation.gtf", 
                        format="gtf", 
                        dataSource="Gencode", 
                        organism="Homo sapiens")
seqinfo(txdb)
seqlevels(txdb)
seqname <- c(paste0("chr",1:22),"chrX","chrY","chrM")
seqlevels(txdb,pruning.mode="coarse") <- seqname

#call peaks for each sample------------------------------------------------------------------------------------
## EXT1
message("calling peaks for EXT1")
dir.create("<out_dir>/EXT1")
exomePeak2(bam_ip = EXT1,
           bam_input = INPUT_BAM.EXT,
           gff = GENE_ANNO_GTF,
           txdb = txdb,
           genome = BSgenome.Hsapiens.UCSC.hg38,
           fragment_length = 150,
           p_cutoff = 0.05,
           save_dir = "<out_dir>/EXT1")

Bam files are generated by STAR, duplicated reads are removed by using sambamba.

STAR \
--runThreadN 24 \
--runMode alignReads \
--readFilesCommand gunzip -c \
--genomeDir <index_dir> \
--readFilesIn ${read1} ${read2} \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix <out_dir>/${sample} \
--quantMode TranscriptomeSAM \
--outSAMattributes All

sambamba markdup -t 2\
    --overflow-list-size 600000 \
    --tmpdir='./' \
    -r ${sample}Aligned.sortedByCoord.out.bam \
    ${sample}_sorted.rmdup.bam

sambamba view -h -t 16 -f bam \
    -F "[XS] == null and not unmapped and not duplicate" \
    ${sample}_sorted.rmdup.bam >  
    ${sample}_sorted.rmdup.uniq.bam

PS
gtf and fa files are downloaded from Gencode:
gencode.v45.basic.annotation.gtf & GRCh38.primary_assembly.genome.fa

Also, all the codes are the same as I used in previous projects. The only difference is that the sample this time is human instead of mouse.

Thank you for your valuable time.

Best,
Feiyang

含有生物学重复的差异peak分析

你好，我看了expmePeak2的说明文档，示例代码对照组有3个ip和input数据，请问这三个数据，是同一个样本的3个lane的数据，还是3个不同的独立样本的数据呢？
如果是同一个样本的3个lane的数据，那想请问一下，对照组和实验组各有3个生物学重复，应该怎么进行差异peak的分析呢？应该如何输入？谢谢

Introns in peak detection of RNA samples

Hi,

I utilized exomePeak2 for peak detection in RNA samples and observed that many methylated regions corresponded to introns. This finding surprised me since I did not specify the mode argument, and according to my understanding, the default is set to "exon."

The workflow involved taking the exomePeak2 bed file output, converting it into a bigWig file, uploading it to the UCSC genome browser custom track, and subsequently discovering the presence of introns.

For this analysis, I used paired-end BAM files as input and an NCBI GTF file (ncbiRefSeq.gtf for Rattus norvegicus). The specified arguments in the exomePeak2 function were as follows:

exomePeak2(bam_ip=control_ip_files,
           bam_input=control_input_files,
           gff=ncbiRefSeq.gtf,
           save_dir="my/experiment/path",
           experiment_name="control_input_ip_diff",
           genome="rn7",
           txdb=makeTxDbFromGFF(ncbiRefSeq.gtf, format="gtf"))

I am also adding the text from my log file when executing the R code:

start
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats

Attaching package: ‘MatrixGenerics’

The following objects are masked from ‘package:matrixStats’:

   colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
   colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
   colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
   colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
   colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
   colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
   colWeightedMeans, colWeightedMedians, colWeightedSds,
   colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
   rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
   rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
   rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
   rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
   rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
   rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
   rowWeightedSds, rowWeightedVars

Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

   IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

   anyDuplicated, aperm, append, as.data.frame, basename, cbind,
   colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
   get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
   match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
   Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
   table, tapply, union, unique, unsplit, which.max, which.min

Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

   expand.grid, I, unname

Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor

   Vignettes contain introductory material; view with
   'browseVignettes()'. To cite Bioconductor, see
   'citation("Biobase")', and for packages 'citation("pkgname")'.


Attaching package: ‘Biobase’

The following object is masked from ‘package:MatrixGenerics’:

   rowMedians

The following objects are masked from ‘package:matrixStats’:

   anyMissing, rowMedians

Warning messages:
1: package ‘exomePeak2’ was built under R version 4.2.2 
2: package ‘SummarizedExperiment’ was built under R version 4.2.2 
3: package ‘MatrixGenerics’ was built under R version 4.2.1 
4: package ‘BiocGenerics’ was built under R version 4.2.1 
5: package ‘IRanges’ was built under R version 4.2.1 
6: package ‘GenomeInfoDb’ was built under R version 4.2.3 
7: package ‘Biobase’ was built under R version 4.2.1 
Loading required package: AnnotationDbi
Warning messages:
1: package ‘GenomicFeatures’ was built under R version 4.2.2 
2: package ‘AnnotationDbi’ was built under R version 4.2.2 
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
 The "phase" metadata column contains non-NA values for features of type
 stop_codon. This information was ignored.

Attaching package: ‘Biostrings’

The following object is masked from ‘package:base’:

   strsplit

Extract bin features ... OK
Count reads on bin features ... OK
Identify background features ... OK
Estimate sample sepecific size factors from the background ... OK
Calculate bin GC contents on exons ... OK
Fit GC curves with smoothing splines ... OK
Calculate offset matrix ... OK
Detect peaks with GLM ... OK
GRangesList object of length 34629:
$`1`
GRanges object with 1 range and 0 metadata columns:
   seqnames              ranges strand
      <Rle>           <IRanges>  <Rle>
 1     chr4 154898070-154898094      +
 -------
 seqinfo: 80 sequences from an unspecified genome; no seqlengths

$`2`
GRanges object with 1 range and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
 2     chrM 7758-9382      +
 -------
 seqinfo: 80 sequences from an unspecified genome; no seqlengths

$`3`
GRanges object with 1 range and 0 metadata columns:
   seqnames            ranges strand
      <Rle>         <IRanges>  <Rle>
 3    chr12 31113498-31114297      -
 -------
 seqinfo: 80 sequences from an unspecified genome; no seqlengths

...
<34626 more elements>
Warning messages:
1: In .merge_two_Seqinfo_objects(x, y) :
 Each of the 2 combined objects has sequence levels not in the other:
 - in 'x': chr19_NW_023637714v1_random, chr19_NW_023637715v1_random, chrUn_NW_023637730v1, chrUn_NW_023637732v1, chrUn_NW_023637734v1, chrUn_NW_023637735v1, chrUn_NW_023637736v1, chrUn_NW_023637737v1, chrUn_NW_023637738v1, chrUn_NW_023637740v1, chrUn_NW_023637741v1, chrUn_NW_023637742v1, chrUn_NW_023637743v1, chrUn_NW_023637744v1, chrUn_NW_023637745v1, chrUn_NW_023637747v1, chrUn_NW_023637748v1, chrUn_NW_023637749v1, chrUn_NW_023637750v1, chrUn_NW_023637751v1, chrUn_NW_023637752v1, chrUn_NW_023637753v1, chrUn_NW_023637756v1, chrUn_NW_023637757v1, chrUn_NW_023637758v1, chrUn_NW_023637760v1, chrUn_NW_023637761v1, chrUn_NW_023637762v1, chrUn_NW_023637765v1, chrUn_NW_023637766v1, chrUn_NW_023637767v1, chrUn_NW_023637768v1, chrUn_NW_023637769v1, chrUn_NW_023637772v1, chrUn_NW_023637773v1, chrUn_NW_023637775v1, chrUn_NW_023637776v1, chrUn_NW_023637779v1, chrUn_NW_023637781v1, chrUn_NW_023637782v1, chrUn_NW_023637783v1, chr [... truncated]
2: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
 GRanges object contains 36 out-of-bound ranges located on sequences
 Ang,LOC108352909, LOC120094777, LOC103691088, LOC108351262,
 LOC103693688, LOC120094123,Vof16, LOC102551251, LOC120097087,
 LOC120098939, LOC108350834, LOC120095272, LOC120093965, LOC120094883,
 LOC120097370, LOC120097467, LOC120094794, Galnt4, LOC120095229,
 LOC108352196, LOC120100348, LOC120096115, LOC120096778, LOC120100476,
 LOC120100393, LOC120103263, LOC120100962, LOC120093625, LOC102552101,
 LOC120100737, LOC103692527, LOC120095249, LOC108350542, LOC120097094,
 LOC120102858, LOC120095738, and LOC120093583. Note that ranges located
 on a sequence whose length is unknown (NA) or on a circular sequence
 are not considered out-of-bound (use seqlengths() and isCircular() to
 get the lengths and circularity flags of the underlying sequences). You
 can use trim() to trim these ranges. See ?`trim,GenomicRanges-method`
 for more information.
finished.

I would greatly appreciate any suggestions or insights.

Error in Identify background with Gaussian Mixture Model

Hi, I am using exomepeak2 for MeRIP-seq differential peak analysis, but I have problem with identify backround step:

Identify background with Gaussian Mixture Model ... Error in colSums(par_samples) :
'x' must be an array of at least two dimensions

my code is:

genome <- BSgenome.Mmusculus.UCSC.mm10

f1 <- "INP1.bam"
f2 <- "INP2.bam"
INPUT_BAM <- c(f1,f2)

f3 <- "M6A1.bam"
f4 <- "M6A2.bam"
IP_BAM <- c(f3,f4)

f5 <- "SKOINP1.bam"
f6 <- "SKOINP2.bam"
TREATED_INPUT_BAM <- c(f5,f6)

f7 <- "SKOM6A1.bam"
f8 <- "SKOM6A2.bam"
TREATED_IP_BAM <- c(f7,f8)

# exon analysis
sep <- exomePeak2(bam_ip = IP_BAM,
                  bam_input = INPUT_BAM,
                  bam_treated_input = TREATED_INPUT_BAM,
                  bam_treated_ip = TREATED_IP_BAM,
                  gff_dir = GENE_ANNO_GTF,
                  bsgenome = genome,
                  paired_end = T,
                  library_type = 'unstranded',
                  p_cutoff = '0.05',
                  p_adj_cutoff = '0.05',
                  parallel = 13,
                  correct_GC_bg = T,
                  save_plot_GC = T,
                  save_plot_analysis = T)
sep

Thankes in advance,
Marco

Problem about strand information

I split BAM into two files: plus (forward) strand and minus (reverse) strand.

However, ~50% peaks on minus strand for plus-strand BAM file, and ~50% peaks on plus strand for minus-strand BAM file.

Why? I am sure that the two bam files have no problem.

使用exomePeak2进行peak calling时，程序被kill掉

作者您好，我最近在服务器上使用exomePeak2时，程序总是被kill，请问一下解决方法。
IFN <- exomePeak2(bam_ip=IP_IFN_BAM, bam_input=Input_IFN_BAM, parallel=1, gff="/data1/gaoronghui/Homo_sapiens/Homo_sapiens.GRCh38.104.chr_patch_hapl_scaff.gtf", genome= "BSgenome.Hsapiens.NCBI.GRCh38")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Extract bin features ... OK
Count reads on bin features ... Killed

Differential Methylation Analysis

Hi,

I am trying to decide between two options of input for differntial methylation analysis with exomePeak2 function.
Consider I have 2 files of control input samples, 2 files of control Immuno-Precipitation (IP) samples, 2 files of treated input samples and 2 files of treated IP samples. Which option is a the right usage of the exomePeak2 function:

do the analysis several times for 4 files with the arguments as follows (example):
bam_ip = control_ip1,
bam_input = control_input1,
bam_input_treated = treated_input1,
bam_ip_treated = treated_ip1
do the analysis one time with all the files passed to the arguments as vectors:
bam_ip = c(control_ip1, control_ip2),
bam_input = c(control_input1, control_input2),
bam_input_treated = c(treated_input1, treated_input2)
bam_ip_treated = c(treated_ip1, treated_ip2)

if the second option is the better one, how does the function handles the diffreneces between the two samples of the same type?

Thankes in advance,
Uri Levy

linux版本的参数少了很多

您好！
请问一下您，我在服务器的ubuntu系统上安装了R，接着安装了exomePeak2 ，显示版本为1.10.0，但是参数相比于Windows的R包少了很多参数，比如paired_end、log2FC_cutoff等，这是为什么呢？
谢谢您万忙之中的回答！谢谢！

peak calling时log2FC的问题

想问下我发现在callPeak的时候结果文件显示ip的ReadsCount.IP要小于input的ReadsCount.INPUT但是log2FoldChange却是一个大于1的正数。这是为什么

exomepeak2 运行三次重复，padj/ score值出现异常

非常感谢作者开发出exomePeak2这么好用的包。我在使用过程中发现，一旦在计算三次重复的的输入值时，会出现奇怪的结果，比如
基本所有peak的padj值都为0.998， score值都等于0.000180458。但是三次重复分开三次运行时，结果却是正常的。下面是我的运行代码，我现在很困惑，希望作者能够帮忙答疑解惑，非常感谢🙏🙏🙏
f1="CK1_IP_.sorted.bam"
f2="CK2_IP_.sorted.bam"
f3="CK3_IP_.sorted.bam"
IP_BAM = c(f1,f2,f3)
ff1="CK1_input.sorted.bam"
ff2="CK2_input.sorted.bam"
ff3="CK3_input.sorted.bam"
INPUT_BAM = c(ff1,ff2,ff3)
ft1="TL1_IP.sorted.bam"
ft2="TL2_IP.sorted.bam"
ft3="TL3_IP.sorted.bam"
TREATED_IP_BAM = c(ft1,ft2,ft3)
fft1="TL1_input.sorted.bam"
fft2="TL2_input.sorted.bam"
fft3="TL3_input.sorted.bam"
TREATED_INPUT_BAM = c(fft1,fft2,fft3)

exomePeak2(bam_ip =IP_BAM,
bam_input =INPUT_BAM,
bam_treated_input =TREATED_INPUT_BAM,
bam_treated_ip =TREATED_IP_BAM,
gff_dir ="IRGSP-1.0.gff")

apply exomePeak2 in m6a-seq2 with user-defined size factors

关于exomePeak2的diff.log2.FC

11 | chr12 | 125626656 | 125626831 | 11 | + | 1 | 175 | 0 | 4.65 | 5.62 | 2.83 | 6.86 | AACS | 1.104841855 | 5.86E-26 | 1.43E-25 | 25.23227465
我的结果中出现了这种情况并且有很多，就是 diff.log2FC该为负值，但是给出的结果为正值，不知道是什么原因

请教

如果指定了双端模式（paired_end=T），fragment length等参数还要考虑吗？

不存在叫‘speedglm’这个名字的程辑包

Package ‘speedglm’ was removed from the CRAN repository.

Removed on 2023-04-01 for policy violation

Please use the canonical form https://cran.r-project.org/package=speedglm to link to this page.

R-4.2.3貌似用不了

Run fail at "Identify background feature" step (+ most recent version of package installation?)

Hi,

I have two issues.

First, one is regarding failure during ExomePeak2 run.
I am analyzing MeRIP-Seq data of different species using exomePeak2_1.10.0. I successfully ran the program with green monkey data. One of the species is human. I am using the most recent version of the genome assemblies (GCF_009914755.1), but it fails with the following error in "background feature calculation process":

Extract bin features ... OK
Count reads on bin features ... OK
Identify background features ... Error in if (G[1] == 1) { : missing value where TRUE/FALSE needed
Calls: run_exomepeak2 ... classifyBackground -> Mclust -> eval -> eval -> mclustBIC

I was trying to debug it by getting into the code of Mclust package, but I got lost. Could you please guide me from here?

========================================================================================
In the meantime, I tried to install the newest version available in Bioconductor package (3.18) in R 4.3.2.
However, some dependencies seem to be not installed in the most recent Bioconductor version.
Could you kindly guide me through the installation of exomePeak2 1.14.3?

The following is my installation steps:

if (!require("BiocManager", quietly = TRUE))
		# install.packages("BiocManager")
		# BiocManager::install(version = "3.18")


# install exomePeak2 dependencies
BiocManager::install(c("Rsamtools", "GenomicAlignments", "GenomicRanges", 
					"GenomicFeatures", "DESeq2", "ggplot2", "mclust", "BSgenome", 
					"Biostrings", "GenomeInfoDb", "BiocParallel", "IRanges", 
					"S4Vectors", "rtracklayer", "methods", "stats", 
					"utils", "BiocGenerics", "magrittr", "speedglm", "splines"),force = TRUE)

# install exomePeak2: https://github.com/ZW-xjtlu/exomePeak2
BiocManager::install("exomePeak2")

Thanks :)

Fix Bioconductor Build ERROR

We have been trying to reach out to you with the maintainer email: [email protected] . You have been unresponsive and the package continues to fail. It is failing in both release and devel since June.

https://bioconductor.org/checkResults/devel/bioc-LATEST/exomePeak2/
https://bioconductor.org/checkResults/release/bioc-LATEST/exomePeak2/

基因组文件

请问如果在比对时用的是ensemble版本的基因组文件，我就不能用这个包做callpeak了吗？

diff peaks width are too long

Hi, I am using exomepeak2 for MeRIP-seq differential peak analysis, here are the code:

# ------------------------------------------------------------------
# call differential peaks
# ------------------------------------------------------------------

TREATED_INPUT_BAM = Input.bam[3:4]
TREATED_IP_BAM = IP.bam[3:4]

INPUT_BAM = Input.bam[1:2]
IP_BAM = IP.bam[1:2]

res <- exomePeak2(bam_ip = IP_BAM,
           bam_input = INPUT_BAM,
           bam_input_treated = TREATED_INPUT_BAM,
           bam_ip_treated = TREATED_IP_BAM,
           gff = GENE_ANNO_GTF,
           fragment_length = 150,
           experiment_name = "DOXvsDMSO3",
           parallel = 20,
           plot_gc = FALSE,
           # test_method = "DESeq2",
           p_cutoff = 1,
           save_dir = "./exomePeak-result/")

But the output diffPeaks.csv shows the peak is too long which is strange:

Can you give some advice? Thanks.

关于exomePeak2的结果疑问

作者你好，我在成功运行exomePeak2后，在exomePeak_output的文件夹中可以找到一个diffpeak.csv的结果。在改csv中，可以看到同一个基因存在多个m6A位点。
我的疑问是，应该怎么将这些同属一个基因的m6A位点合并，已获得单独的列表，以适用于后续的结合RNAseq的差异基因列表，进行两组学的九象限图绘制？
谢谢

exomePeak2进行差异分析时报错

你好，感谢作者开发了exomepeak，使用exomePeak2进行peak calling是正常的，但进行差异甲基化分析的时候就报错
GENE_ANNO_GTF = system.file("extdata", "example.gtf", package="exomePeak2")

f1 = system.file("extdata", "IP1.bam", package="exomePeak2")
f2 = system.file("extdata", "IP2.bam", package="exomePeak2")
f3 = system.file("extdata", "IP3.bam", package="exomePeak2")
f4 = system.file("extdata", "IP4.bam", package="exomePeak2")
IP_BAM = c(f1,f2,f3,f4)
f1 = system.file("extdata", "Input1.bam", package="exomePeak2")
f2 = system.file("extdata", "Input2.bam", package="exomePeak2")
f3 = system.file("extdata", "Input3.bam", package="exomePeak2")

INPUT_BAM = c(f1,f2,f3)
f1 = system.file("extdata", "treated_IP1.bam", package="exomePeak2")
TREATED_IP_BAM = c(f1)
f1 = system.file("extdata", "treated_Input1.bam", package="exomePeak2")
TREATED_INPUT_BAM = c(f1)
res <- exomePeak2(bam_ip = IP_BAM,
bam_input = INPUT_BAM,
bam_ip_treated = TREATED_IP_BAM,
bam_input_treated = TREATED_INPUT_BAM,
gff = GENE_ANNO_GTF,
genome = "hg19")
Error in exomePeak2(bam_ip = IP_BAM, bam_input = INPUT_BAM, bam_ip_treated = TREATED_IP_BAM, :
unused arguments (bam_ip_treated = TREATED_IP_BAM, bam_input_treated = TREATED_INPUT_BAM)

Question about blockSizes and log2FoldChange

Hi, I have some question about exomePeak2. My MeRIPseq data is paired, and mapping the sequencing reads to reference genome by Bowtie2 (bowtie2 -p 100 -x /data/Jlab-YJ/index/grch38_bt2/GRCh38.primary_assembly.genome.index -1 H9-Input_R1_val_1.fq.gz -2 H9-Input_R2_val_2.fq.gz |samtools sort -O bam -@ 128 -o - > H9-m6A-Input.bam). Next, the input and ip BAM files are used to call peak by exomePeak2. The parameters are as follows:

library(BSgenome.Hsapiens.UCSC.hg38)
library("exomePeak2")
setwd("/data/Jlab-YJ/m6A_dataset/H9-m6Aseq/tri_fq/")
gtf=c("/data/Jlab-YJ/index/gencode.v29.annotation.gtf")
bam_ip <- c("H9-m6A-IP.bam")
bam_input <- c("H9-m6A-Input.bam")
exomePeak2(
gff_dir = gtf,
genome ="hg38",
bam_ip = bam_ip,
bam_input = bam_input,
paired_end = TRUE,
library_type = c("unstranded"),
save_dir = "exomePeak2_output1", log2FC_cutoff = 2.5 )

###the results as below:
paired_end = TRUE,
library_type = c("unstranded"),
save_dir = "exomePeak2_output")
Make the TxDb object ... OK
Generate bins on exons ... OK
Count reads on bins ... OK
Calculate GC contents on exons ... OK
Estimate offsets of GC content biases on bins ... OK
Peak Calling with regularized NB GLM ... OK
Count reads on the merged peaks and the control regions ... OK
Estimate offsets of GC content biases on modification peaks/sites ... OK
Calculate peaks/sites statistics with regularized NB GLM ... OK
Result files have been saved in the directory: 'exomePeak2_output'.
class: SummarizedExomePeak
dim: 68866 4
metadata(0):
assays(2): counts GCsizeFactors
rownames(68866): peak_1 peak_100 ... control_34233 control_34234
rowData names(2): GC_content feature_length
colnames(4): SRR1035214.bam SRR1035222.bam SRR1035213.bam
SRR1035221.bam
colData names(3): design_IP design_Treatment sizeFactor
Warning messages:
1: In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type
stop_codon. This information was ignored.
2: In .Seqinfo.mergexy(x, y) :
Each of the 2 combined objects has sequence levels not in the other:

in 'x': chr1_GL383518v1_alt, chr1_GL383519v1_alt, chr1_GL383520v2_alt, chr1_KI270759v1_alt, chr1_KI270760v1_alt, chr1_KI270761v1_alt, chr1_KI270762v1_alt, chr1_KI270763v1_alt, chr1_KI270764v1_alt, chr1_KI270765v1_alt, chr1_KI270766v1_alt, chr1_KI270892v1_alt, chr2_GL383521v1_alt, chr2_GL383522v1_alt, chr2_GL582966v2_alt, chr2_KI270767v1_alt, chr2_KI270768v1_alt, chr2_KI270769v1_alt, chr2_KI270770v1_alt, chr2_KI270771v1_alt, chr2_KI270772v1_alt, chr2_KI270773v1_alt, chr2_KI270774v1_alt, chr2_KI270775v1_alt, chr2_KI270776v1_alt, chr2_KI270893v1_alt, chr2_KI270894v1_alt, chr3_GL383526v1_alt, chr3_JH636055v2_alt, chr3_KI270777v1_alt, chr3_KI270778v1_alt, chr3_KI270779v1_alt, chr3_KI270780v1_alt, chr3_KI270781v1_alt, chr3_KI270782v1_alt, chr3_KI270783v1_alt, chr3_KI270784v1_alt, chr3_KI270895v1_alt, chr3_KI270924v1_alt, chr3_KI270934v1_alt, chr3_KI270935v1_alt, chr3_KI270936v1_alt, chr3_KI270937v1_alt, chr4_GL000257v2_ [... truncated]
3: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
GRanges object contains 64 out-of-bound ranges located on sequences
ENSG00000113558.18,ENSG00000113575.9,ENSG00000272772.1,
ENSG00000116783.14,ENSG00000254685.6,ENSG00000259030.7,
ENSG00000117640.17,ENSG00000162430.16,ENSG00000255054.3,
ENSG00000118058.21,ENSG00000167283.7,ENSG00000285827.1,
ENSG00000119139.19,ENSG00000165060.12,ENSG00000285130.1,
ENSG00000108239.8,
ENSG00000123297.17,ENSG00000123427.16,ENSG00000257921.5,
ENSG00000066135.12,ENSG00000126091.20,ENSG00000284989.1,
ENSG00000127054.20,ENSG00000240731.1,
ENSG00000130234.10,ENSG00000147003.6,ENSG00000285602.1,
ENSG00000135316.17,ENSG00000135317.12,ENSG00000271793.1,
ENSG00000118939.17,ENSG00000136153.19,ENSG00000261553.5,
ENSG00000137726.16,ENSG00000137731.13,ENSG00000255245.4,
ENSG00000138032.20,ENSG00000138079.13,ENSG00000285542.1,
ENSG00000054219.10,ENSG00000241399.6,ENSG00000248672.5,
ENSG00000005100.12,ENSG00000108561.8,
ENSG00000141642.8,ENSG00000141646.13,ENSG00000267699.2,
ENSG00000 [... truncated]
4: In .Seqinfo.mergexy(x, y) :
Each of the 2 combined objects has sequence levels not in the other:
in 'x': chr1_GL383518v1_alt, chr1_GL383519v1_alt, chr1_GL383520v2_alt, chr1_KI270759v1_alt, chr1_KI270760v1_alt, chr1_KI270761v1_alt, chr1_KI270762v1_alt, chr1_KI270763v1_alt, chr1_KI270764v1_alt, chr1_KI270765v1_alt, chr1_KI270766v1_alt, chr1_KI270892v1_alt, chr2_GL383521v1_alt, chr2_GL383522v1_alt, chr2_GL582966v2_alt, chr2_KI270767v1_alt, chr2_KI270768v1_alt, chr2_KI270769v1_alt, chr2_KI270770v1_alt, chr2_KI270771v1_alt, chr2_KI270772v1_alt, chr2_KI270773v1_alt, chr2_KI270774v1_alt, chr2_KI270775v1_alt, chr2_KI270776v1_alt, chr2_KI270893v1_alt, chr2_KI270894v1_alt, chr3_GL383526v1_alt, chr3_JH636055v2_alt, chr3_KI270777v1_alt, chr3_KI270778v1_alt, chr3_KI270779v1_alt, chr3_KI270780v1_alt, chr3_KI270781v1_alt, chr3_KI270782v1_alt, chr3_KI270783v1_alt, chr3_KI270784v1_alt, chr3_KI270895v1_alt, chr3_KI270924v1_alt, chr3_KI270934v1_alt, chr3_KI270935v1_alt, chr3_KI270936v1_alt, chr3_KI270937v1_alt, chr4_GL000257v2_ [... truncated]

When I open the result file (Mod.csv), I find some peaks is lower log2FoldChange than threshold （log2FC_cutoff = 2.5），and the blocksize of some peaks is very large (almost 10K). The attachment file is the result. But the RNA fragment is 100nt before sequencing. I want to know my pepline is right or not, and whether there is a reference pepline ?

I am looking forward for your reply! Thank you very much!
Good luck to you!

output padj with many NAs

Hi, ZhenWei:

The R package exomePeak2 is nice and ideal for m6A-Seq analysis in our group.

But recently a result troubled me:

In the Differential Modification Analysis result: DiffMod.xlsx, the padj column have many NA. And the corresponding pvalue rank from 0.00000135 to 0.999901024.

I dont konw how to deal with these peaks, can you help offer some help?

And a minor question:

What does the DiffModLog2FC mean, it seems derived from "ModLog2FC_treated-ModLog2FC_control" or like log2(IP.Treated/IP.Control) ?

And in your group, waht's the better parameter cutoff for Differential Modification Peaks? pvalue or padj 0.05, and should a cutoff on DiffModLog2FC?

Thank you in advance for your reply!

Haifeng Sun
Nanjing Medical University, Jiangsu Province

The supplementary materials was emailed to you ([email protected]).

Thanks again!

关于RMBase2.0

我看到RMBase2.0网站给出的数据都是 site level
但是我用[exomePeak2]跑完以后依旧是一段一段的
请问我该如何得到类似于RMBase2.0 的结果

关于exomepeak2为什么没有选项使用spike-in进行normlize

您好，

我想问一下，为什么exomepeak2中没有使用spike-in进行normlize的选项，是因为内置的normlize方式更优于使用spike-in，还是需要在运行结果的基础上使用spike-in?

期待您的解答，谢谢！

运行报错

感谢您和团队开发这个给力的工具，我有一个问题一直解决不了

使用清华源镜像

BiocManager::install('exomePeak2')

library(exomePeak2)

set.seed(1)

GENE_ANNO_GTF = system.file("extdata", "example.gtf", package="exomePeak2")

f1 = system.file("extdata", "IP1.bam", package="exomePeak2")
f2 = system.file("extdata", "IP2.bam", package="exomePeak2")
f3 = system.file("extdata", "IP3.bam", package="exomePeak2")
f4 = system.file("extdata", "IP4.bam", package="exomePeak2")
IP_BAM = c(f1,f2,f3,f4)

f1 = system.file("extdata", "Input1.bam", package="exomePeak2")
f2 = system.file("extdata", "Input2.bam", package="exomePeak2")
f3 = system.file("extdata", "Input3.bam", package="exomePeak2")
INPUT_BAM = c(f1,f2,f3)

exomePeak2(bam_ip = IP_BAM,
bam_input = INPUT_BAM,
gff_dir = GENE_ANNO_GTF,
genome = "hg19",
paired_end = FALSE)

peak location and reads count strands

HI ZW-xjtlu,

I was using the newest exomepeak2 from github and met some issues.

We saw some cases like this peak 25895.

1, Is this normal that the peak can locate in the middle of the exon ?

2, May I ask does enomepeak2 strand specific in default ?

We used the default parameter, but we noticed for peak 25895 , which is at negative strand, reads count are all from positive strand.

Thanks

Question about filters on Differential methylated peaks

Good morning,
I have a question about filters that have to be applied on differential methylation results. Default filters peaks that have pvalue>0.0001, nevertheless also padj is evaluated. So my question is: why did you choose pvalue? And it is better to use pvalue or adjusted pvalue?
Thank you, Serena

Calculate GC contents on exons ... Error in h(simpleError(msg, call))

Hi Prof. Zhu,

I tried to run my own data, but there is a error:

Make the TxDb object ... OK
Sorting and indexing BAM files with Rsamtools...[bam_sort_core] merging from 28 files and 1 in-memory blocks...
[bam_sort_core] merging from 36 files and 1 in-memory blocks...
[bam_sort_core] merging from 28 files and 1 in-memory blocks...
[bam_sort_core] merging from 35 files and 1 in-memory blocks...
OK
Generate bins on exons ... OK
Count reads on bins ... OK
Calculate GC contents on exons ... Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'letterFrequency': The following seqlevels are to be dropped but are currently in use (i.e. have ranges on them): 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000213.1,
GL000216.2, GL000218.1, GL000219.1, GL000220.1, GL000225.1, KI270442.1, KI270711.1, KI270713.1, KI270721.1, KI270726.1,
KI270727.1, KI270728.1, KI270731.1, KI270733.1, KI270734.1, KI270744.1, KI270750.1. Please use the 'pruning.mode'
argument to control how to prune 'x', that is, how to remove the ranges in 'x' that are on these sequences. For example,
do something like:
seqlevels(x, pruning.mode="coarse") <- new_seqlevels
or:
keepSeqlevels(x, new_seqlevels, pruning.mode="coarse")
See ?seqinfo for a description of the pruning modes.
In addition: Warning message:
In .Seqinfo.mergexy(x, y) :
The 2 combined objects have no sequence levels in common. (Use
suppressWarnings() to suppress this warning.)

My script is:

library(exomePeak2)
GENE_ANNO_GTF= "D:/experiment/data analysis/fxs project/m6A analysis/2nd m6 seq/Homo_sapiens.GRCh38.107.gtf"
samplenames = c("Glu237","Glu229")
Input_bamPath <- paste0("D:/experiment/data analysis/fxs project/m6A analysis/2nd m6 seq/exomePeak2/2 samples/",samplenames,".input.bam")
IP_bamPath <- paste0("D:/experiment/data analysis/fxs project/m6A analysis/2nd m6 seq/exomePeak2/2 samples/",samplenames,".m6A.bam")

res <- exomePeak2 (bam_ip = IP_bamPath,
bam_input = Input_bamPath,
gff = GENE_ANNO_GTF,
genome = "hg38")
res ## Peak ranges
mcols(res) ## Peak statistics

Could you help me fix it? thanks so much!

Best,

Lu Lu