yeolab / merge_peaks Goto Github PK

View Code? Open in Web Editor NEW

8.0 8.0 5.0 35.62 MB

Pipeline for using IDR to produce a set of peaks given two replicate eCLIP peaks

Python 2.10% Shell 41.70% Perl 15.60% Common Workflow Language 40.59%

merge_peaks's Introduction

merge_peaks

CWL-defined pipeline for using IDR to produce a set of peaks given two replicate eCLIP peaks

Installation:

(For YEOLAB: module load eclipidrmergepeaks/0.1.0)

For all others:

install CWL
- pip install cwl
If Docker is installed, CWL will attempt to pull the following software versions automatically:
(alternatively if Docker is not installed) install packages conventionally:
- You will need Perl (5.10.1+) with the following packages (Recommended: https://perlbrew.pl):
  - cpan install Statistics::Basic
  - cpan install Statistics::Distributions
  - cpan install Statistics::R
- In addition, you will need IDR 2.0.2 and Clipper

Outline of workflow:

Normalize CLIP BAM over INPUT for each replicate (overlap_peakfi_with_bam_PE.cwl)
Peak compression/merging on input-normalized peaks for each replicate (compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat_outputfull.cwl)
Entropy calculation on CLIP and INPUT read probabilities within each peak for each replicate (make_informationcontent_from_peaks.cwl)
Reformat *.full files into *.bed files for each replicate (full_to_bed.cwl)
Run IDR on peaks ranked by entropy (idr.cwl)
Calculates summary statistics at different IDR cutoffs (parse_idr_peaks.cwl)
Normalize CLIP BAM over INPUT using new IDR peak positions (overlap_peakfi_with_bam_PE.cwl)
Identifies reproducible peaks within IDR regions (get_reproducing_peaks.cwl)

Usage:

(see the examples/merge_peaks_1input.yaml or examples/merge_peaks_2inputs.yaml manifest file for a full example). Below is a description of all fields required to be filled out in the manifest file:

First, use the example template to fill out the names and paths pertaining to your samples. The shebang "#!" line will depend on your experimental setup (either 2 replicates with 2 corresponding inputs, or 2 replicates normalized over 1 input).

This should match what was used to call CLIPper peaks.

species: hg19

BAM file containing the merged-barcode (read 2 only) PCR-deduped CLIP reads mapping to the genome for Replicate 1. Replace "rep1" with a unique ID for each rep1.

    - name: "rep1"
      ip_bam: 
        class: File
        path: /home/centos/peCLIP_inputs/ENCFF994WPX.r2.bam

BAM file containing the merged-barcode (read 2 only) PCR-deduped INPUT reads mapping to the genome for Replicate 1.

      input_bam:
        class: File
        path: /home/centos/peCLIP_inputs/ENCFF590UCY.r2.bam

BED file containing the called peak clusters for Replicate 1 Output from either CLIPPER or input-normed peaks. This pipeline will perform input norm internally for you, so it won't really matter which file you use.

      peak_clusters:
        class: File
        path: /home/centos/peCLIP_inputs/ENCFF639MYI.bed6

BAM file containing the merged-barcode (read 2 only) PCR-deduped CLIP reads mapping to the genome for Replicate 2. Replace "rep2" with a unique ID for each rep2.

    - name: "rep2"
      ip_bam: 
        class: File
        path: /home/centos/peCLIP_inputs/ENCFF154BQS.r2.bam

BAM file containing the merged-barcode (read 2 only) PCR-deduped INPUT reads mapping to the genome for Replicate 2.

      input_bam:
        class: File
        path: /home/centos/peCLIP_inputs/ENCFF590UCY.r2.bam

BED file containing the called peak clusters for Replicate 2 Output from CLIPPER or input-normed peaks. This pipeline will perform input norm internally for you.

      peak_clusters:
        class: File
        path: /home/centos/peCLIP_inputs/ENCFF664WCU.bed6

FINAL OUTPUTS

Merged reproducible peaks are reported as:

rep1.vs.rep2.bed

Where rep1 and rep2 are the user-defined names in the manifest.

To run the workflow:

Ensure that the yaml file is accessible and that wf_get_reproducible_eclip_peaks.cwl is in your $PATH.
Type: ./merge_peaks_2inputs.yaml

Outputs

merged_peaks_bed: this is the BED6 file containing reproducible peaks as determined by entropy-ordered peaks between two replicates.
- chrom
- start
- end
- minimum of the -log10 p-value between two replicates (coolumn 4)
- geomean of the log2 fold changes (column 5)
- strand This is probably what will be useful.
*.full files: these tabbed outputs have the following columns (in order):
- chromosome
- start
- end
- name (colon separated region)
- reads in CLIP
- reads in INPUT
- p-value
- chi value or (F)isher
- (F)isher or (C)hi square test
- enriched or depleted
- negative log10p value (400 if above certain threshold)
- log2 fold change
- entropy
idr.out: output from IDR
idr.out.bed: output from IDR as a bed file
*.custombed: contains individual replicate information. The headers are:
- IDR region (entire IDR identified reproducible region)
- peak (reproducible peak region)
- geomean of the l2fc
- rep1 log2 fold change
- rep2 log2 fold change
- rep1 -log10 pvalue
- rep2 -log10 pvalue

Notes:

The current conda version of perl installed using create_environment_merge_peaks.sh will install perl v5.22.0, which is different from the version tested on TSCC (5.10.1). Since 5.18, there have been slight changes resulting in hash keys being accessed in a non-deterministic way. Installing 5.22.0 will result in minor changes from the reference, but will otherwise give similar outputs. Included is a script run_perlbrew_perl5.10.1.sh which will attempt to install perl 5.10.1, which will give you deterministic results.
Custombed files are staged for deprecation, we don't usually use this.

merge_peaks's People

Contributors

Stargazers

Watchers

Forkers

xjyx peaceben vannostrandlab algaebrown zhongguosaiting

merge_peaks's Issues

contains undefined reference?

IDR analysis error

Hi ,dear yeolab ,i'm doing the IDR analysis following your eclip-seq processing pipeline V2.2 ,i got the following error

"Could not load extension schema http://schema.org/docs/schema_org_rdfa.html: Error fetching http://schema.org/docs/schema_org_rdfa.html: HTTPConnectionPool(host='schema.org', port=80): Max retries exceeded with url: /docs/schema_org_rdfa.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fca02ae0130>: Failed to establish a new connection: [Errno 110] Connection timed out'))
Warning: Field $schemas contains undefined reference to http://schema.org/docs/schema_org_rdfa.html
../cwl/samtools-index.cwl:54:3: Warning: Field class contains undefined reference to
http://schema.org/CreativeWork "
../cwl/samtools-index.cwl:69:7: Warning: checking item
Warning: Field class contains undefined reference to
http://schema.org/Organization
../cwl/samtools-index.cwl:65:5: Warning: checking item
Warning: Field class contains undefined reference to
http://schema.org/Organization
../cwl/samtools-index.cwl:59:3: Warning: Field class contains undefined reference to
http://schema.org/Person

i donot know what's wrong with it ,can you help me? Thank you very much!

Which file do I use as the result?

In the result, there are mainly *.idr.out, *.idr.out.normed.bed and *.custombed files. According to Supplementary Protocol 2: eCLIP-seq Processing Pipeline, the *.custombed file should be the final result and contain individual replicate information. However, in the Readme, file *.custombed is told that it cannot be used. So in which file should I use the results to annotate genes? If I can get a reply, I would be extremely grateful.

output file

Among these which one is the output file as I don't see any "custombed" files ?

01v02.idr.out                     ENCODE4.INPUT.umi.r1.fq.genome-mappedSoSo.rmDupSo.readnum                                                  ENCODE4.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.bed.full
01v02.IDR.out.0102merged.01.full  ENCODE4.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.bed                                     ENCODE4.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.readnum
01v02.IDR.out.0102merged.02.full  ENCODE4.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.bed.compressed.bed                      rep1.vs.rep2.bed
01v02.idr.out.bed                 ENCODE4.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.bed.compressed.bed.entropy.bed          rep1.vs.rep2.consistency
01v02.idr.out.normed.bed          ENCODE4.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.bed.compressed.bed.entropy.excessreads  rep1.vs.rep2.rescue_ratio
01v02.idr.out.normed.bed.full     ENCODE4.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.bed.compressed.bed.entropy.full

ERROR ；How to solve this problem ？

ERROR : Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.40/version: dial unix /var/run/docker.sock: connect: permission denied
Workflow error, try again with --debug for more information:
Docker is not available for this tool, try --no-container to disable Docker, or install a user space Docker replacement like uDocker with --user-space-docker-cmd.: Cannot communicate with docker daemon: Command '['docker', 'version']' returned non-zero exit status 1

overlap_peakfi_with_bam.pl

I'm getting the following issue

WARNING: Overriding HOME environment variable with SINGULARITYENV_HOME is not permitted
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LANGUAGE = (unset),
        LC_ALL = (unset),
        LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").
reading peak file /var/lib/cwl/stgf7e182df-5cae-4505-b70d-9330e5e663fd/10249_sample3_10249_sample4_ready.split1.peakClusters.bed
now doing expt /var/lib/cwl/stg0607570c-da94-4f28-b3f1-fd8352700c76/10249_sample3_10249_sample4_ready.split1.bam
R1 strand error 403
Use of uninitialized value $strand in concatenation (.) or string at /opt/merge_peaks/bin/perl/overlap_peakfi_with_bam.pl line 402, <B> line 107.
R1 strand error 403

Command is:

cwltool --singularity /groups/cgsd/alexandre/eclip/workflow/merge_peaks/cwl/wf_full_IDR_pipeline_2inputs_scatter.cwl /groups/cgsd/alexandre/eclip/workflow/merge_peaks/analysis.yml

My yaml is:

#!/usr/bin/env eCLIP_full_IDR_pipeline_2inputs_scatter_singleNode

species: hg19
samples:
  - 
    - name: "10249_sample1_10249_sample2_merged_clip_and_input"
      ip_bam: 
        class: File
        path: /groups/cgsd/alexandre/cromwell-executions/SamtoolsMerge/f355824c-d131-4077-ac49-9b5877ee7dba/call-View/shard-0/execution/10249_sample1_10249_sample2_ready.bam
      input_bam:
        class: File
        path: /groups/cgsd/alexandre/cromwell-executions/SamtoolsMerge/f355824c-d131-4077-ac49-9b5877ee7dba/call-View/shard-2/execution/10249_sample5_10249_sample6_ready.bam
      peak_clusters:
        class: File
        path: /groups/cgsd/alexandre/cromwell-executions/SamtoolsMerge/f355824c-d131-4077-ac49-9b5877ee7dba/call-Clipper/shard-0/execution/10249_sample1_10249_sample2_ready.bed
    - name: "10249_sample3_10249_sample4_merged_clip_and_input"
      ip_bam: 
        class: File
        path: /groups/cgsd/alexandre/cromwell-executions/SamtoolsMerge/f355824c-d131-4077-ac49-9b5877ee7dba/call-View/shard-1/execution/10249_sample3_10249_sample4_ready.bam
      input_bam:
        class: File
        path: /groups/cgsd/alexandre/cromwell-executions/SamtoolsMerge/f355824c-d131-4077-ac49-9b5877ee7dba/call-View/shard-3/execution/10249_sample7_10249_sample8_ready.bam
      peak_clusters:
        class: File
        path: /groups/cgsd/alexandre/cromwell-executions/SamtoolsMerge/f355824c-d131-4077-ac49-9b5877ee7dba/call-Clipper/shard-1/execution/10249_sample3_10249_sample4_ready.bed
chrom_sizes:
  class: File
  path: /groups/cgsd/alexandre/eclip/workflow/chrom.sizes

And images used are:

brianyee_clipper:5d865bb.sif
brianyee_merge_peaks:0.1.0.sif
brianyee_samtools:1.6.sif

Minimum number of peaks in output file

Hi,

I ran into an issue while executing the pipeline on my local system -

ValueError: Peak files must contain at least 20 peaks post-merge

My data had only 7 peaks post-merge. I was wondering if this might cause the downstream pipeline to fail and exit with just one file '01v02.idr.out', as is happening in my situation.

Thanks!

Question on input normalization

I'm working through the analysis of an eCLIP experiment using your pipelines. Everything got completed including the merge peaks pipeline.
However, a graduate student in the lab pointed out something in regards to input normalization. In some cases, we are seeing read coverage in the IP samples but not the input samples for certain reproducible peaks. When I look through the output files (full files after input normalization), I see that the input shows a read of '1' despite there being no coverage when looking at a BedGraph file for the input.
I've looked through some of the perl scripts to see if I can figure out what's happening but I'm not fluent enough with perl to get a grasp on this. My thought is that if there are reads in the IP samples but not the input, the input is assigned a read count of 1. Perhaps if we sequenced deeper on the input, we'd see some small coverage.
One caveat to our experiment is that we had a fairly low sequencing depth overall.
Attached is an example of our files in IGV.

what result do we need to do downstream analysis

Hi!
I have a question during i use merge_peaks to normalize peak called by CLIPper. There are several steps in example of merge_peaks, but which result do we need to do downstream analysis? Or, which peak file could be considered as confident result of a RBP binding sites?
Looking forward for your reply! Thank you very much!

Error with cwltoil

I am getting the following error on running the yaml file with specfications:

RuntimeError: Please run with "toil-cwl-runner" instead of "cwltoil" (which has been removed)

However, in the main code on eCLIP_merge_peaks, I do not see toil-cwl-runner as an option, so I can't add it. Is this a version conflict?

Relaxing log2fc and log10p?

Hi Brian-
First off, really appreciate the addition of a method for running the pipeline on AWS -- there was some weirdness getting it to run correctly on our cluster which we never quite figured out, but an AWS instance worked great.

I ran the merge_peaks workflow on some eCLIP data we generated, and the final output bed file is pretty sparse -- only a total of ~60 peaks, some of which didn't get called even though on visual inspection there are clear peaks called by CLIPper in IP samples but not in the input. I tried changing the l2fc_cutoff and l10p_cutoff declarations in parse_idr_peaks.pl, but I just got the exact same number of peaks upon rerunning it. Looking at the file some more, are those variables actually used? Or is the 'next unless...' statement on line 63 where the FC/p value cutoffs are actually stated? (I would try it myself but I am racking up a pretty good AWS bill...)

thanks!
-Lynn

Parallel Processing Option

Is there a multi-threading option for the 'merge_peaks' pipeline? Just wondering if it's possible to speed it up a bit, it took 24 hours to complete and my BAM files aren't that big. I think the bottleneck is the IDR component which seems to call 'clipper' repeatedly, I know clipper is able to be run in parallel so is the anyway to pass the '--processors' flag to clipper without having to modify a bunch of files?

Failed Jobs

Hi,

On executing the yaml file, all 7 jobs are failing with the following error message:

toil.leader.FailedJobsException: The job store 'file:/scratch/sanat.mishra/Thesis/eclip/merge_peaks/examples/AARS/.tmp/cwltoil_jobstore' contains 7 failed jobs: 'CWLWorkflow' kind-CWLWorkflow/instance-07ufg5ik, 'CWLWorkflow' kind-CWLWorkflow/instance-te8ikfkj, 'CWLJob' samtools view kind-CWLJob/instance-mlvmb13t, 'CWLJob' samtools view kind-CWLJob/instance-ise3vpe9, 'CWLWorkflow' kind-CWLWorkflow/instance-2zfokdt3, 'CWLJob' samtools view kind-CWLJob/instance-b_j5t4td, 'CWLJob' samtools view kind-CWLJob/instance-csv4vy5v

Additionally, for each job, the following is also included -

[2021-08-26T20:43:08+0200] [MainThread] [W] [toil.leader] Job 'CWLJob' samtools view kind-CWLJob/instance-b_j5t4td is completely failed
[2021-08-26T20:43:08+0200] [MainThread] [W] [toil.leader] Job failed with exit value 127: 'CWLJob' samtools view kind-CWLJob/instance-csv4vy5v
Exit reason: None
[2021-08-26T20:43:08+0200] [MainThread] [W] [toil.leader] No log file is present, despite job failing: 'CWLJob' samtools view kind-CWLJob/instance-csv4vy5v
[2021-08-26T20:43:08+0200] [MainThread] [W] [toil.job] Due to failure we are reducing the remaining try count of job 'CWLJob' samtools view kind-CWLJob/instance-csv4vy5v with ID kind-CWLJob/instance-csv4vy5v to 0

What's going wrong? Is there an issue with samtools?

disable Docker?

Hi,

Thanks for the nice tool! I have a question how to disable Docker when running the .yaml?

Here's the error information:
cwltool.errors.WorkflowException: Docker is not available for this tool, try --no-container to disable Docker, or install a user space Docker replacement like uDocker with --user-space-docker-cmd.: Command '['docker', 'images', '--no-trunc', '--all']' returned non-zero exit status 1.

I did a modification to wf/eCLIP_merge_peaks in 210, by replacing '--user-space-docker-cmd udocker ' with '--no-container' but get the same error information again. Since I'm using on the SGE, it's impossible to use sudo Docker, can you give any suggestions on that?

Many thanks!
Iris

overlap_peakfi_with_bam.pl fails

This fails, my guess is lack of disk space. I run the pipeline with the cachedir parameter pointing to a place with lots of space. This step of the workflow is still trying to use /tmp.
^[[1;30mINFO^[[0m [job input_norm] cachedir/68aca4df6a4ff0d05a29902c91766026$ overlap_peakfi_with_bam.pl
/tmp/9i9jebnh/stgca92ad1c-426d-4696-b296-624d86e7773e/K562_eclip_reps.id_3593.A01.r1.fq.genome-mappedSo.rmDupSo.merged.r2.bam
/tmp/9i9jebnh/stg29389bfd-e06f-4e4c-8b6c-930404d4582f/K562_eclip_reps.input_3592.NIL.r1.fq.genome-mappedSo.rmDupSo.r2.bam
/tmp/9i9jebnh/stg4722d23e-3e7d-4d6f-a56b-60ec8cae2549/K562_eclip_reps.id_3593.A01.r1.fq.genome-mappedSo.rmDupSo.merged.r2.peakClusters.bed
/tmp/9i9jebnh/stga1f6fbd5-202f-4104-921a-38ff315be473/K562_eclip_reps.id_3593.A01.r1.fq.genome-mappedSo.rmDupSo.merged.r2.readnum
/tmp/9i9jebnh/stg62c392a2-b631-4c14-a1a5-d2782a676cdb/K562_eclip_reps.input_3592.NIL.r1.fq.genome-mappedSo.rmDupSo.r2.readnum
K562_eclip_reps.id_3593.A01.r1.fq.genome-mappedSo.rmDupSo.merged.r2.peakClusters.normed.bed

The error is:
now doing input /tmp/9i9jebnh/stg29389bfd-e06f-4e4c-8b6c-930404d4582f/K562_eclip_reps.input_3592.NIL.r1.fq.genome-mappedSo.rmDupSo.r2.bam
print() on closed filehandle OUT at eclipWorkflow/merge_peaks/bin/perl/overlap_peakfi_with_bam.pl line 109.
print() on closed filehandle OUTFULL at eclipWorkflow/merge_peaks/bin/perl/overlap_peakfi_with_bam.pl line 110.