ENCODE ATAC-seq pipeline

Introduction

This pipeline is designed for automated end-to-end quality control and processing of ATAC-seq and DNase-seq data. It can be run on compute clusters with job submission engines as well as on standalone machines, and it inherently makes use of parallelized/distributed computing. Installation is straightforward because most dependencies are installed automatically. The pipeline can be run end-to-end, starting from raw FASTQ files all the way to peak calling and signal track generation, using a single caper submit command. One can also start the pipeline from intermediate stages (for example, using alignment files as input). The pipeline supports both single-end and paired-end data as well as replicated or non-replicated datasets. The outputs produced by the pipeline include 1) formatted HTML reports with quality control measures specifically designed for ATAC-seq and DNase-seq data, 2) analysis of reproducibility, 3) stringent and relaxed thresholding of peaks, and 4) fold-enrichment and p-value signal tracks. The pipeline also supports detailed error reporting and allows for easy resumption of interrupted runs. It has been tested on some human, mouse and yeast ATAC-seq datasets as well as on human and mouse DNase-seq datasets.

The ATAC-seq pipeline protocol specification is here. Some parts of the ATAC-seq pipeline were developed in collaboration with Jason Buenrostro, Alicia Schep and Will Greenleaf at Stanford.

Features

  • Portability: The pipeline can be run on different cloud platforms such as Google, AWS and DNAnexus, as well as on cluster engines such as SLURM, SGE and PBS.
  • User-friendly HTML report: In addition to the standard outputs, the pipeline generates an HTML report with a tabular representation of quality metrics, including alignment/peak statistics and FRiP, along with many useful plots (IDR/TSS enrichment). See the example HTML report and the JSON file used to generate it.
  • Supported genomes: The pipeline needs genome-specific data such as aligner indices, a chromosome sizes file and a blacklist. We provide a genome database downloader/builder for hg38, hg19, mm10 and mm9. You can also use this builder to build a genome database from FASTA for your custom genome.

Installation

  1. Git clone this pipeline.

    IMPORTANT: use ~/atac-seq-pipeline/atac.wdl as [WDL] in Caper's documentation.

    $ cd
    $ git clone https://github.com/ENCODE-DCC/atac-seq-pipeline
  2. Install the pipeline's Conda environment if you want to use Conda instead of Docker/Singularity. Conda is recommended on local computers and HPCs (e.g. Stanford Sherlock/SCG).

    IMPORTANT: use encode-atac-seq-pipeline as [PIPELINE_CONDA_ENV] in Caper's documentation.

  3. Install Caper. Caper is a Python wrapper for Cromwell. Skip this step if you have installed the pipeline's Conda environment, because Caper is already included in it.

    IMPORTANT: Make sure that you have python3(> 3.4.1) installed on your system.

    $ pip install caper  # use pip3 if it doesn't work
  4. Follow Caper's README carefully and find the instructions for your platform.

    IMPORTANT: Configure your Caper configuration file ~/.caper/default.conf correctly for your platform.
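
The Python requirement in step 3 can be checked before running pip. The following is a minimal sketch, not part of the pipeline; the function name python_is_new_enough is hypothetical, and the only fact taken from the text above is the bound "> 3.4.1". It avoids f-strings so that it still parses on old interpreters.

```python
import sys

# Exclusive lower bound taken from the installation note: python3 (> 3.4.1).
MIN_EXCLUSIVE = (3, 4, 1)


def python_is_new_enough(version_info=None):
    """Return True if the version is strictly greater than 3.4.1.

    python_is_new_enough is a hypothetical helper, not a Caper function.
    """
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor, micro); later fields are release metadata.
    return tuple(version_info[:3]) > MIN_EXCLUSIVE


current = ".".join(str(part) for part in sys.version_info[:3])
if python_is_new_enough():
    print("Python " + current + " satisfies the > 3.4.1 requirement")
else:
    print("Python " + current + " is too old; install a newer python3")
```

Run it with the same interpreter that pip will install into, for example python3 check_version.py (the file name is arbitrary).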

Test input JSON file

Use https://storage.googleapis.com/encode-pipeline-test-samples/encode-atac-seq-pipeline/ENCSR356KRQ_subsampled_caper.json as [INPUT_JSON] in Caper's documentation.

Input JSON file

IMPORTANT: DO NOT BLINDLY USE A TEMPLATE/EXAMPLE INPUT JSON. READ THROUGH THE FOLLOWING GUIDE TO MAKE A CORRECT INPUT JSON FILE.

An input JSON file specifies all the input parameters and files that are necessary for successfully running this pipeline. This includes the paths to the genome reference files and the raw FASTQ data files. Please make sure to specify absolute paths rather than relative paths in your input JSON files.

  1. Input JSON file specification (short)
  2. Input JSON file specification (long)
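
The advice to use absolute paths can be checked mechanically before submitting a run. The sketch below walks a parsed input JSON and reports path-like strings that are neither absolute nor remote URIs. It is an illustration under stated assumptions: the helper names, the heuristic for what counts as a path, the list of accepted URI schemes, and the example keys and file paths are all hypothetical, so consult the input JSON specification for the actual parameter names.

```python
import json
import os

# URI schemes treated as already absolute (an assumption of this sketch).
REMOTE_PREFIXES = ("gs://", "s3://", "http://", "https://")

# File suffixes used by the path heuristic (also an assumption).
PATH_SUFFIXES = (".fastq.gz", ".fastq", ".fq.gz", ".bam", ".tsv", ".json")


def looks_like_path(value):
    """Heuristic: a string names a file if it has a slash or a known suffix."""
    return "/" in value or value.endswith(PATH_SUFFIXES)


def find_relative_paths(node, key="<root>"):
    """Return (key, value) pairs for path-like strings that are relative."""
    problems = []
    if isinstance(node, dict):
        for child_key, child in node.items():
            problems.extend(find_relative_paths(child, child_key))
    elif isinstance(node, list):
        for child in node:
            problems.extend(find_relative_paths(child, key))
    elif isinstance(node, str) and looks_like_path(node):
        if not node.startswith(REMOTE_PREFIXES) and not os.path.isabs(node):
            problems.append((key, node))
    return problems


# Hypothetical input JSON; keys and paths are illustrative only.
example = json.loads("""
{
  "atac.title": "example run",
  "atac.genome_tsv": "/data/genome/hg38.tsv",
  "atac.fastqs_rep1_R1": ["fastq/rep1_R1.fastq.gz"],
  "atac.fastqs_rep1_R2": ["/data/fastq/rep1_R2.fastq.gz"]
}
""")

for key, value in find_relative_paths(example):
    print("relative path in " + key + ": " + value)
```

Only the relative entry is reported here. To check a real file, replace the embedded example with json.load on your own input JSON.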

Running a pipeline on DNAnexus

You can also run this pipeline on DNAnexus without using Caper or Cromwell. There are two ways to build a workflow on DNAnexus based on our WDL.

  1. dxWDL CLI
  2. DNAnexus Web UI

How to organize outputs

Install Croo. You can skip this installation if you have installed the pipeline's Conda environment and activated it. Make sure that you have python3(> 3.4.1) installed on your system. Then find the metadata.json file in Caper's output directory and pass it to Croo.

$ pip install croo
$ croo [METADATA_JSON_FILE]
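
Locating metadata.json can be scripted when the output directory holds several runs. The sketch below searches a directory tree recursively, because the exact layout of Caper's output directory is not described here. The function name find_metadata_json and the directory names in the demonstration are hypothetical.

```python
import os
import tempfile


def find_metadata_json(output_dir):
    """Return sorted paths of every file named metadata.json under output_dir."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(output_dir):
        if "metadata.json" in filenames:
            matches.append(os.path.join(dirpath, "metadata.json"))
    return sorted(matches)


# Demonstration on a throwaway directory tree with a made-up layout.
with tempfile.TemporaryDirectory() as root:
    run_dir = os.path.join(root, "atac", "hypothetical-workflow-id")
    os.makedirs(run_dir)
    with open(os.path.join(run_dir, "metadata.json"), "w") as handle:
        handle.write("{}")
    for path in find_metadata_json(root):
        print(os.path.relpath(path, root))
```

Each path found this way can be used as [METADATA_JSON_FILE] in the croo command above.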

How to make a spreadsheet of QC metrics

Install qc2tsv. Make sure that you have python3(> 3.4.1) installed on your system.

Once you have organized the output with Croo, you will find the pipeline's final output file qc/qc.json, which contains all QC metrics. Simply pass multiple qc.json files to qc2tsv. It accepts various URIs such as local paths, gs:// and s3://.

$ pip install qc2tsv
$ qc2tsv /sample1/qc.json gs://sample2/qc.json s3://sample3/qc.json ... > spreadsheet.tsv

QC metrics for each experiment (qc.json) are split into multiple rows in the spreadsheet: one row for the overall experiment plus one row for each biological replicate.
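
The row layout described above can be illustrated with a small sketch. This is not qc2tsv's implementation, and the nested qc.json layout it assumes (replicate subtrees keyed as rep1, rep2, and so on) is a guess made for illustration. The function names and the example metrics are hypothetical.

```python
import re

# Assumed naming of replicate keys inside qc.json (hypothetical).
REPLICATE_KEY = re.compile(r"^rep\d+$")


def split_into_rows(qc):
    """Split a nested QC dict into one overall row plus one row per replicate."""
    rows = {"overall": {}}

    def walk(node, column_prefix, row_name):
        for key, value in node.items():
            if row_name == "overall" and REPLICATE_KEY.match(key):
                # A replicate key selects the row and is left out of the column name.
                target_row, prefix = key, column_prefix
            else:
                target_row, prefix = row_name, column_prefix + key + "."
            if isinstance(value, dict):
                walk(value, prefix, target_row)
            else:
                rows.setdefault(target_row, {})[prefix.rstrip(".")] = value

    walk(qc, "", "overall")
    return rows


def to_tsv(rows):
    """Render rows as TSV text: the overall row first, then replicates in order."""
    columns = sorted({column for row in rows.values() for column in row})
    replicates = sorted((name for name in rows if name != "overall"),
                        key=lambda name: int(name[3:]))
    lines = ["\t".join(["row"] + columns)]
    for name in ["overall"] + replicates:
        cells = [str(rows[name].get(column, "")) for column in columns]
        lines.append("\t".join([name] + cells))
    return "\n".join(lines)


# Made-up metrics for an experiment with two biological replicates.
example_qc = {
    "general": {"title": "demo"},
    "align": {"rep1": {"mapped": 10}, "rep2": {"mapped": 12}},
}
print(to_tsv(split_into_rows(example_qc)))
```

For this example the table has a header line followed by three rows: overall, rep1 and rep2.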

Contributors

leepc12, ottojolanki, vervacity, strattan, akundaje, karl616