koesgroup / snakemake_chipseq_pe

Pipeline for the analysis of PE ChIP-seq data

License: Creative Commons Attribution Share Alike 4.0 International

Shell 1.88% Python 49.92% TeX 48.20%

chip-seq-pipelines snakemake

snakemake_chipseq_pe's People

tijsbliek elisevanbree zm-git-dev irenexzwen

snakemake_chipseq_pe's Issues

Add a DAG of the pipeline in the README

To give a quick overview of the pipeline, I think it would be nice to include its DAG in the README.

Add protected and temporary to the relevant outputs

There are a lot of files that can be removed once the pipeline is finished. Here

Use all treatment and control

To test the peak calling rules, I ran the bed branch on real samples. The good news is that the peak calling works well. There is still the issue with the indexing of the bam file, see #8.

The problem is that the pipeline only called peaks for ATAC1 vs ATAC4, while it should also have worked for all treatment and control samples.

treatment = 'ATAC1', 'ATAC2', 'ATAC3'
control = 'ATAC4', 'ATAC5', 'ATAC6'

I expect the files ATAC2 vs ATAC5 and ATAC3 vs ATAC6 as well.
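
One way to get every pair is to match the two lists position by position. The following is a minimal sketch, assuming the treatment and control lists are given in matching order. The helper name `paired_comparisons` and the output file pattern are hypothetical and not taken from the pipeline.

```python
# Treatment and control samples as listed in this issue.
TREATMENT = ["ATAC1", "ATAC2", "ATAC3"]
CONTROL = ["ATAC4", "ATAC5", "ATAC6"]


def paired_comparisons(treatment, control):
    """Pair the i-th treatment sample with the i-th control sample."""
    if len(treatment) != len(control):
        raise ValueError("treatment and control lists must have the same length")
    return list(zip(treatment, control))


# Hypothetical output file pattern, for illustration only.
expected = [
    f"{t}_vs_{c}_peaks.narrowPeak"
    for t, c in paired_comparisons(TREATMENT, CONTROL)
]
print(expected)
```

In a Snakefile, the same position-wise pairing could probably be obtained by passing `zip` as the second argument of `expand`, instead of the default that combines every treatment with every control.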

Create one .yaml environment file per rule

To increase reproducibility and avoid the need to install and activate a conda virtual environment before running Snakemake, it would be good to create one environment file per rule (envs/rule1.yaml) so that Snakemake can be run with the --use-conda argument.

Implement singularity

Add singularity management within Snakemake.

From the documentation:

Singularity enables users to have full control of their environment. Singularity containers can be used to package entire scientific workflows, software and libraries, and even data. This means that you don’t have to ask your cluster admin to install anything for you - you can put it in a Singularity container and run.

Single end

For now, the Snakemake pipeline only works for PE sequencing. It would be good to make it work for single-end data as well.

Make a documentation for deeptools

Documentation for deeptools already exists in its repository. I think it would be nice to include part of it in the README file, or at least link to it, in order to explain the purpose of the figures generated by the pipeline and how to interpret them.

Mistake

https://github.com/KoesGroup/Snakemake_ChIPseq/blob/86cdaa227fb3ab10e57d4711b6f462931535056c/Snakefile#L47-L48

Small mistake to be fixed on the bed branch: CASES should use treatment and CONTROLS should use control.

CASES = get_samples_per_treatment(treatment="treatment")
CONTROLS = get_samples_per_treatment(treatment="control")

bedgraph -bga

Hello Jihed,
About the pull request #3:
Why do you want to report regions with zero coverage (bedtools genomecov -bga)? Is there a specific reason? This will significantly increase the size of your bedgraph files.
I think you could simply use the -bg option there and only report the positions with some coverage.
Hope it helps,
Cheers
Marc
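
To illustrate the difference: `-bga` also reports intervals with zero coverage, while `-bg` only reports covered intervals. Below is a small sketch that mimics this by dropping zero-coverage lines from bedGraph text. The function name and the sample data are made up for illustration.

```python
def drop_zero_coverage(bedgraph_lines):
    """Keep only bedGraph lines (chrom, start, end, coverage) with non-zero coverage."""
    kept = []
    for line in bedgraph_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        if float(fields[3]) != 0:
            kept.append(line)
    return kept


# Made-up lines resembling `-bga` output: zero-coverage intervals are included.
bga_like = [
    "chr1\t0\t100\t0",
    "chr1\t100\t150\t3",
    "chr1\t150\t400\t0",
    "chr1\t400\t420\t1",
]
print(drop_zero_coverage(bga_like))
```

On a sparsely covered genome most intervals fall in the zero-coverage category, which is why the `-bga` files grow so much.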

Handle GTF or GFF formats

For tomato, one has to generate the GTF file from the GFF3 format.
For other species, you can provide a GTF file directly.
The rules have to be changed in 'external_data.smk'.
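
A minimal sketch of how the format could be detected from the file name, so that the conversion only runs when needed. It assumes the annotation format can be trusted from the file extension. The function name and file names are hypothetical.

```python
from pathlib import Path

GFF_SUFFIXES = {".gff", ".gff3"}


def needs_gtf_conversion(annotation_path):
    """Return True for GFF/GFF3 annotations, False for GTF, based on the extension."""
    suffixes = [s.lower() for s in Path(annotation_path).suffixes]
    # Ignore a trailing compression suffix such as ".gz".
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes:
        raise ValueError(f"cannot guess annotation format of {annotation_path!r}")
    extension = suffixes[-1]
    if extension in GFF_SUFFIXES:
        return True
    if extension == ".gtf":
        return False
    raise ValueError(f"unsupported annotation format: {extension}")


print(needs_gtf_conversion("annotation.gff3"))  # GFF3 input: conversion needed
print(needs_gtf_conversion("annotation.gtf"))   # GTF input: use as-is
```

A conversion tool such as gffread could then be called only when the function returns True.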

HPC Cluster execution

Transform the Snakemake pipeline so that it can be executed on a cluster environment such as LISA (SURF). On LISA, the batch job management system is SLURM.

Implement multiQC

Add MultiQC at the end of the pipeline to produce html reports

Implement genome visualization of the data in the pipeline

I have found this repository, which seems to make it possible to produce nice genome browser views from bigwig files, bed files, etc.

This might be a good thing to add in the next release!

Write a publication in the "The Journal of Open Source Software"

We should include a CITATION file in the main repository to indicate how to cite the pipeline.
To have a proper scientific publication, we could write a paper for the "Journal of Open Source Software".
That way, the pipeline could be properly cited. These papers are short (about 3 pages) and easy to write.

See an example:
http://joss.theoj.org/papers/6eb3ba7dddbdab8788a430eb62fc3841

Citation would look like:
Bennett et al., (2018). restez: Create and Query a Local Copy of GenBank in R. Journal of Open Source Software, 3(31), 1102, https://doi.org/10.21105/joss.01102

Deeptools : Correlation plot

Implement the correlation plot using the deeptools:

multiBamSummary
plotCorrelation

Define the best correlation method
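
A minimal sketch of the two deepTools calls, written as a Python helper that only builds the command lines. The helper name, output paths and bam file names are assumptions for illustration. Spearman is used as the default here only because it is rank-based and less sensitive to outliers than Pearson; which method fits this pipeline best still has to be decided.

```python
def correlation_commands(
    bam_files,
    npz="results/multiBamSummary.npz",
    plot="results/correlation_heatmap.png",
    method="spearman",
):
    """Build the multiBamSummary and plotCorrelation command lines."""
    if method not in ("spearman", "pearson"):
        raise ValueError("method must be 'spearman' or 'pearson'")
    # Step 1: read coverage per genomic bin for all bam files.
    summary_cmd = [
        "multiBamSummary", "bins",
        "--bamfiles", *bam_files,
        "--outFileName", npz,
    ]
    # Step 2: pairwise correlation of the binned coverage, drawn as a heatmap.
    plot_cmd = [
        "plotCorrelation",
        "--corData", npz,
        "--corMethod", method,
        "--whatToPlot", "heatmap",
        "--plotFile", plot,
    ]
    return summary_cmd, plot_cmd


# Hypothetical bam files, named after the pattern seen in the pipeline logs.
bams = [
    "results/mapped/ChIP1_L1.sorted.rmdup.bam",
    "results/mapped/ChIP2_L1.sorted.rmdup.bam",
]
for command in correlation_commands(bams):
    print(" ".join(command))
```

The returned lists could be passed to `subprocess.run` or joined into the `shell` section of a rule.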

order in the rules

When running the Snakefile with multiple cores, it seems that the indexing of the bam file is done later than the rules that use it.
'results/mapped/ChIP1_L1.sorted.rmdup.bam' does not appear to have an index. You MUST index the file first!

'results/mapped/ChIP1_L1.sorted.rmdup.bam' does not appear to have an index. You MUST index the file first!
    Error in rule bamcompare:
        jobid: 3
        output: results/bamcompare/log2_ChIP1_ChIP2_L1.bamcompare.bw

RuleException:
CalledProcessError in line 249 of /Users/Jihed/Desktop/DMC1_ChIPseq/Snakefile:
Command ' set -euo pipefail;  bamCompare -b1 results/mapped/ChIP1_L1.sorted.rmdup.bam -b2 results/mapped/ChIP2_L1.sorted.rmdup.bam -o results/bamcompare/log2_ChIP1_ChIP2_L1.bamcompare.bw ' returned non-zero exit status 1
  File "/Users/Jihed/Desktop/DMC1_ChIPseq/Snakefile", line 249, in __rule_bamcompare
  File "/anaconda3/envs/bigwig/lib/python3.5/concurrent/futures/thread.py", line 55, in run

The rule producing the bam index files is `rule bam_index`, which takes the sorted bam files as input. Its output is required by the `rule_all`, so the files should be produced at some point, and the rule appears in the DAG. However, I cannot find the `*.bai` files in the output folder.
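
Snakemake orders jobs by their declared input and output files, so one possible fix is to list the index files as inputs of the rule that needs them. Below is a minimal sketch of a helper that returns both the bam files and their indexes. It assumes the index is named `<bam>.bai`, which is the default naming of `samtools index`. The helper name and dictionary keys are hypothetical.

```python
def bamcompare_inputs(treatment, control, mapped_dir="results/mapped"):
    """Return the bam files and their index files for one bamCompare job."""
    inputs = {}
    for label, sample in (("b1", treatment), ("b2", control)):
        bam = f"{mapped_dir}/{sample}.sorted.rmdup.bam"
        inputs[label] = bam
        # Assumed index name: the bam path with ".bai" appended.
        inputs[f"{label}_index"] = bam + ".bai"
    return inputs


# Sample names taken from the error message above.
for name, path in bamcompare_inputs("ChIP1_L1", "ChIP2_L1").items():
    print(name, path)
```

With the index files declared as inputs, the job cannot start before the indexing rule has produced them, regardless of the number of cores.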

Some logs are incomplete

Some of the logs produced by the pipeline (branch develop) are empty files:

bedgraph.log
bamcompare.log
macs2 narrowPeak
samtools rmdup

Move the definitions of the group for deep tools to the configuration file

On the Deeptools branch:
For now, the groups used to build the matrix are defined in the Snakefile, so the Snakefile has to be edited to change them.
Until a handier way to define the groups is found, I should move the group definitions to the configuration file and refer to them in the Snakefile.
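
A minimal sketch of what this could look like on the Python side. The `config` dictionary below stands in for the one Snakemake builds from the configuration file. The key `deeptools_groups`, the group names and the sample names are assumptions for illustration.

```python
# Stand-in for the dictionary loaded from the configuration file.
config = {
    "deeptools_groups": {
        "group1": ["ChIP1", "ChIP2"],
        "group2": ["ChIP3"],
    }
}


def get_group_samples(config, group, key="deeptools_groups"):
    """Return the samples of one group as defined in the configuration."""
    groups = config.get(key)
    if not groups:
        raise KeyError(f"no '{key}' section found in the configuration")
    if group not in groups:
        raise KeyError(f"group '{group}' is not defined under '{key}'")
    return list(groups[group])


print(get_group_samples(config, "group1"))
```

Changing the groups then only requires editing the configuration file, and the Snakefile stays untouched.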

test

to test the automation