
CLAPAnalysis_sge

This is a fork of the CLAPAnalysis pipeline, modified to run on the UCL cluster Myriad. It performs the same workflow as the original pipeline (described below).

To use the pipeline on Myriad:

1. Download the pipeline

As the authors mentioned: "For reproducibility, we recommend keeping the pipeline, input, and output directories together. In other words, the complete directory should look like this GitHub repository with an extra workup subdirectory created upon running this pipeline."

git clone https://github.com/lconde-ucl/CLAPAnalysis_sge.git

2. Navigate to the main folder and add your FASTQ files (these must follow the _R{1,2}.fastq.gz naming convention)

E.g.:

cd CLAPAnalysis_sge/clap_pipeline

mkdir fastqs
cd fastqs/
rsync -arv rds2:/rdss/rd01/ritd-ag-project-rd011h-shend55/PRC2_TEMP/SAFA_MinusTag_CLIP_1.fastq.gz .
rsync -arv rds2:/rdss/rd01/ritd-ag-project-rd011h-shend55/PRC2_TEMP/SAFA_MinusTag_CLIP_2.fastq.gz .
ln -s SAFA_MinusTag_CLIP_1.fastq.gz SAFA_MinusTag_CLIP_R1.fastq.gz 
ln -s SAFA_MinusTag_CLIP_2.fastq.gz SAFA_MinusTag_CLIP_R2.fastq.gz 

3. Create the sample file (samples.json)

./fastq2json.py --fastq_dir fastqs/
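
This should produce a samples.json similar to the following (paths are illustrative, based on the example sample above; see the full format description in the original documentation below):

{
   "SAFA_MinusTag_CLIP": {
     "R1": ["fastqs/SAFA_MinusTag_CLIP_R1.fastq.gz"],
     "R2": ["fastqs/SAFA_MinusTag_CLIP_R2.fastq.gz"]
   }
}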

4. Load snakemake/7.32.4 (needs to be <v8 because the --cluster options were removed in Snakemake v8)

module load blic-modules
module load snakemake/7.32.4 
activate_snakemake
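
To confirm the correct version is active:

snakemake --version   # should print 7.32.4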

5. Modify the config.yaml (optional)

  • Email address: add an email address if you want to receive an email when the pipeline fails (default: none)
  • Assembly: specify the assembly, hg38, mm10 or mixed (default: mixed)

The corresponding defaults in config.yaml are:

email: ""
assembly: "mixed"
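
For example, to receive failure notifications and align to human only (the email address is a placeholder):

email: "your.name@ucl.ac.uk"
assembly: "hg38"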

6. Run the pipeline

./run_pipeline.sh 
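
run_pipeline.sh forwards any extra arguments to snakemake (see the original documentation below), so a dry run is a useful first check:

./run_pipeline.sh --dry-run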

7. The output will be in the workup folder

E.g.:

./CLAPAnalysis_sge/clap_pipeline/workup/
├── [ 68K]  alignments
│   ├── [4.0K]  SAFA_MinusTag_CLIP.part_000._STARgenome/
│   ├── [4.0K]  SAFA_MinusTag_CLIP.part_001._STARgenome/
│   ├── [4.0K]  SAFA_MinusTag_CLIP.part_002._STARgenome/
│   ├── [4.0K]  SAFA_MinusTag_CLIP.part_003._STARgenome/
│   ├── [4.0K]  SAFA_MinusTag_CLIP.part_004._STARgenome/
│   ├── [4.0K]  SAFA_MinusTag_CLIP.part_005._STARgenome/
│   ├── [4.0K]  SAFA_MinusTag_CLIP.part_006._STARgenome/
│   ├── [4.0K]  SAFA_MinusTag_CLIP.part_007._STARgenome/
│   ├── [4.0K]  SAFA_MinusTag_CLIP.part_008._STARgenome/
│   ├── [4.0K]  SAFA_MinusTag_CLIP.part_009._STARgenome/
│   ├── [4.0K]  SAFA_MinusTag_CLIP.part_010._STARgenome/
│   └── [4.0K]  SAFA_MinusTag_CLIP.part_011._STARgenome/
├── [4.0K]  fastqs/
├── [ 16K]  logs/
│   └── [ 12K]  cluster/
├── [4.0K]  splitfq/
└── [ 16K]  trimmed/

 112K used in 18 directories

The documentation from the original pipeline is below.





Overview

This pipeline performs a tiered RNA alignment: first to repetitive and structural RNAs (rRNAs, snRNAs, snoRNAs, tRNAs), followed by unique alignment to an appropriate reference genome.

Quick Start

This pipeline assumes an existing conda installation and is written as a Snakemake workflow. To install Snakemake with conda, run

conda env create -f envs/snakemake.yaml
conda activate snakemake

to create and activate a conda environment named snakemake. Once all the input files are ready, run the pipeline in a SLURM cluster environment with

./run_pipeline.sh

After the pipeline finishes, you can explore mapped alignments in workup/alignments and calculate enrichments using scripts/Enrichment.jar by passing the arguments:

java -jar Enrichment.jar <sample.bam> <input.bam> <genefile.bed> <savename.windows> <sample read count> <input read count>
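
For example, a hypothetical invocation (the input BAM, gene BED file, and read counts below are placeholders; read counts can be obtained with samtools view -c on each BAM):

java -jar scripts/Enrichment.jar \
    workup/alignments/SAFA_MinusTag_CLIP.merged.mapped.bam \
    workup/alignments/Input.merged.mapped.bam \
    assets/genes.bed \
    SAFA_MinusTag_CLIP.windows \
    12000000 48000000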

Other common usage notes:

  • To run the pipeline for input RNA samples, replace Snakefile with Snakefile_for_input under --snakefile in ./run_pipeline.sh.

  • To run the pipeline on a local computer (e.g., a laptop), comment out or remove the --cluster-config cluster.yaml and --cluster "sbatch ..." arguments within ./run_pipeline.sh, and set the number of jobs -j <#> to the number of local processors available.

  • run_pipeline.sh passes any additional arguments to snakemake. For example, run ./run_pipeline.sh --dry-run to perform a dry run, or ./run_pipeline.sh --forceall to force (re)execution of all rules regardless of past output.

Pipeline

The pipeline relies on scripts written in Java, Bash, and Python. Versions of Python are specified in conda environments described in envs/, along with other third-party programs and packages that this pipeline depends on.

Workflow

  1. Define samples and paths to FASTQ files (fastq2json.py or manually generate samples.json)
  2. Split FASTQ files into chunks for parallel processing (set num_chunks in config.yaml)
  3. Adaptor trimming (Trim Galore!)
  4. Tiered RNA alignment workflow (sketched after this list):
    1. Alignment to repetitive and structural RNAs (Bowtie2)
    2. Convert unmapped reads to FASTQ files (samtools)
    3. Alignment to unique RNAs in reference genome (STAR)
  5. Chromosome relabeling (add "chr") and filtering (removing non-canonical chromosomes)
  6. PCR deduplication (Picard)
  7. Repeat masking (based on UCSC blacklists)
  8. Merge all BAMs from initial chunking (samtools)

Directory structures

We will refer to 4 directories:

  1. Working directory: We follow the Snakemake documentation in using the term "working directory".

    • This is also where Snakemake creates a .snakemake directory within which it installs conda environments and keeps track of metadata regarding the pipeline.
  2. Pipeline directory: where the software (including the Snakefile and scripts) resides

    • envs/
    • scripts/
    • fastq2json.py
    • Snakefile
  3. Input directory: where configuration and data files reside

    • assets/
    • data/
    • cluster.yaml
    • config.yaml: paths are specified relative to the working directory
    • samples.json: paths are specified relative to the working directory
    • run_pipeline.sh: the paths in the arguments --snakefile <path to Snakefile>, --cluster-config <path to cluster.yaml>, and --configfile <path to config.yaml> are relative to where you run run_pipeline.sh
  4. Output or workup directory (workup/): where this workup directory is placed can be changed in config.yaml

    • alignments/
    • fastqs/
    • logs/
    • splitfq/
    • trimmed/

For reproducibility, we recommend keeping the pipeline, input, and output directories together. In other words, the complete directory should look like this GitHub repository with an extra workup subdirectory created upon running this pipeline.

Input Files

  1. config.yaml: YAML file containing the processing settings and paths of required input files. As noted above, paths are specified relative to the working directory.

    • output_dir: path to create the output directory <output_dir>/workup within which all intermediate and output files are placed.
    • temp_dir: path to a temporary directory, such as used by the -T option of GNU sort
    • samples: path to samples.json file
    • repeat_bed:
      • mm10: path to mm10 genomic regions to ignore, such as UCSC blacklist regions; reads mapping to these regions are masked
      • hg38: path to hg38 genomic regions to ignore, such as UCSC blacklist regions; reads mapping to these regions are masked
      • mixed: path to mixed hg38+mm10 genomic regions to ignore, such as UCSC blacklist regions; reads mapping to these regions are masked
    • bowtie2_index: per-assembly paths to the Bowtie 2 index prefixes (see item 4 below)
    • assembly: currently supports "mm10", "hg38", or "mixed"
    • star_index: per-assembly paths to the STAR indexes (see item 5 below)
    • num_chunks: integer giving the number of chunks to split FASTQ files from each sample into for parallel processing
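
    Putting these keys together, a hypothetical config.yaml might look like this (paths and prefixes are placeholders; key names follow the list above):

      output_dir: "."
      temp_dir: "/tmp"
      samples: "samples.json"
      assembly: "mixed"
      num_chunks: 12
      repeat_bed:
        mm10: "assets/repeats.RNA.mm10.bed"
        hg38: "assets/repeats.RNA.hg38.bed"
        mixed: "assets/repeats.RNA.combined.hg38.mm10.bed"
      bowtie2_index:
        mm10: "assets/index_mm10/<mm10_index_prefix>"
        hg38: "assets/index_hg38/<hg38_index_prefix>"
        mixed: "assets/index_mixed/<mixed_index_prefix>"
      star_index:
        mm10: "assets/index_mm10/<mm10_star_prefix>"
        hg38: "assets/index_hg38/<hg38_star_prefix>"
        mixed: "assets/index_mixed/<mixed_star_prefix>"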
  2. samples.json: JSON file with the location of FASTQ files (read1, read2) to process.

    • config.yaml key to specify the path to this file: samples

    • This can be prepared using fastq2json.py --fastq_dir <path_to_directory_of_FASTQs> or manually formatted as follows:

      
      {
         "sample1": {
           "R1": ["<path_to_data>/sample1_R1.fastq.gz"],
           "R2": ["<path_to_data>/sample1_R2.fastq.gz"]
         },
         "sample2": {
           "R1": ["<path_to_data>/sample2_R1.fastq.gz"],
           "R2": ["<path_to_data>/sample2_R2.fastq.gz"]
         },
         ...
      }
      
    • The pipeline (in particular, the script scripts/bash/split_fastq.sh) currently only supports one read 1 (R1) and one read 2 (R2) FASTQ file per sample.

      • If there are multiple FASTQ files per read orientation per sample (for example, if the same sample was sequenced multiple times, or it was split across multiple lanes during sequencing), the FASTQ files will first need to be concatenated together, and the paths to the concatenated FASTQ files should be supplied in the JSON file.
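
        For example (hypothetical lane-split file names; gzip files can be concatenated directly with cat):

          cat sample1_L001_R1.fastq.gz sample1_L002_R1.fastq.gz > sample1_R1.fastq.gz
          cat sample1_L001_R2.fastq.gz sample1_L002_R2.fastq.gz > sample1_R2.fastq.gz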
    • Each sample is processed independently, generating independent BAM files.

  3. assets/repeats.RNA.mm10.bed, assets/repeats.RNA.hg38.bed, assets/repeats.RNA.combined.hg38.mm10.bed: blacklisted repetitive genomic regions (i.e., poor alignments to rRNA regions) for CLIP and CLAP data.

  4. assets/index_mm10/*.bt2, assets/index_hg38/*.bt2, assets/index_mixed/*.bt2: Bowtie 2 genome index

    • config.yaml key to specify the path to the index: bowtie2_index: {'mm10': <mm10_index_prefix>, 'hg38': <hg38_index_prefix>, 'mixed': <mixed_index_prefix>}
      • Bowtie2 indexes for repetitive and structural RNAs may be custom generated from a FASTA file containing desired annotations.
  5. assets/index_mm10/*.star, assets/index_hg38/*.star, assets/index_mixed/*.star: STAR genome index

    • config.yaml key to specify the path to the index: star_index: {'mm10': <mm10_index_prefix>, 'hg38': <hg38_index_prefix>, 'mixed': <mixed_index_prefix>}
      • The combined (mixed) genome build can be concatenated from the standard GRCm38 (mm10) and GRCh38 (hg38) builds.

Output Files

  1. Merged mapped BAM Files for individual proteins (workup/alignments/*.merged.mapped.bam)

  2. Window enrichment tables computed from CLIP/CLAP sample BAMs and input BAMs

    • These are generated independently of the Snakemake pipeline.
    • Example BED files used to survey enrichments are provided in the assets folder.
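
A hypothetical quick check of a merged BAM, which also yields the read counts that Enrichment.jar expects (assumes samtools is available; the sample name is taken from the Myriad example above):

samtools flagstat workup/alignments/SAFA_MinusTag_CLIP.merged.mapped.bam
samtools view -c workup/alignments/SAFA_MinusTag_CLIP.merged.mapped.bam   # read count for Enrichment.jar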

Credits

Adapted from the SPRITE, RNA-DNA SPRITE, and ChIP-DIP pipelines by Isabel Goronzy (@igoronzy).

Other contributors

  • lconde-ucl
  • mitchguttman
