Giter VIP home page Giter VIP logo

rna-seq-hisat2-featurecounts's Introduction

RNA-seq (Hisat2-FeatureCounts-DeSeq2) based on Snakemake

License: MIT Snakemake Version Bioconda DOI

Introduction

This Snakemake workflow is designed for the analysis of RNA-seq data (Hisat2-FeatureCounts-DeSeq2). It includes steps for quality control, read alignment, quantification, and differential expression analysis. This workflow handles paired-end data. We used the data published by our previous data.

If you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this (original) repository.

Overview

The standard workflow performs the following steps:

  1. Quality control of raw reads using FastQC and aggregated report with MultiQC
  2. Trim reads with trim_galore
  3. Download the reference genome and gif
  4. Build HIAST2 index and align reads to reference genome
  5. Quantify gene expression levels using FeatureCounts
  6. Merge all counts into a single file
  7. Normalize counts using TPM and FPKM
  8. Perform differential expression analysis using DeSeq2

Requirment

Install conda/miniconda

  1. Please carefully read the related docs in the miniconda web https://docs.conda.io/projects/miniconda/en/latest/

  2. Download the Linux or macOS files
    Linux x86_64 https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    Linux arch64 https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh

  3. install miniconda

mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm -rf ~/miniconda3/miniconda.sh

Install snakemake using mamba/conda

  1. Install snakemake and conda enviroment
    snakemake need mamba to create the enviroment.
    If you have conda installed, you can install mamba via conda.

Using mamba will speed up, snakemake is using mamba by default, if not installed, it will report an error, and you need to use conda according to the prompts.

# install mamba  
conda install -y -c conda-forge mamba  

# create the directory
mkdir snakemake
cd snakemake
git clone https://github.com/erwinQiao/rna-seq-hisat2-deseq2.git

# if you canot use git, you can download the rna-seq-hisat2-deseq2.zip and unzip it.  
# or you can git from gitee  
git clone [email protected]:erwinqiao/rna-seq-hisat2-deseq2.git
  1. Install the required snakemake
# install snakemake via mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake 

# install snakemake via conda
conda create -c conda-forge -c bioconda -n snakemake snakemake

# activate the conda enviroment
conda activate snakemake

Run workflow

Step1 Activate the conda enviroment

conda activate snakemake

# checke snakemake version
snakemake --version
snakemake -v

If you are slow to build env, you can create the environment first and then install snakemake

conda create -n snakemake  

conda activate snakemake  

conda install -c conda-forge -c bioconda snakemake
mamba install -c conda-forge -c bioconda snakemake

Attention: Syntax for Conda has changed in Snakemake 8.0 and later!

In the new versions of Snakemake, the syntax for using Conda environments has been adjusted. Please check the help:

snakemake --help

Step2 Input samples.csv etal

You need to verify if the data names, groups, and samples are consistent with sample.csv.

Meanwhile,you need to create a directory names reads in dir (rna-seq-hisat2-featureCounts). You just need to make sure that the 'reads' folder and the 'workflow' are in the same directory.

In the config/config.yaml, you can choose the ensembl using homo_sapiens 110 fasta (gtf) or genecode human 44 fasta(gtf). You need to fill by your choose.

Step3 Run workflow

# set the cores nubmer by yourself

snakemake --cores number -s workflow/Snakefile --use-conda  

Step4 Rerun

if you accidentally interrupt the program and have unfinished tasks, you can use the following command to resume the execution and ensure that the incomplete tasks are successfully completed:

snakemake --cores 90 --use-conda -s workflow/Snakefile --rerun-incomplete

Inverstagte results

Check the process

When you run the workflow, you can see the process of the workflow in the terminal.

Job stats:  
job                      count
------------------------  -------
all                             1  
boxplot                         1
fastqc                         12  
featuresCounts                  6  
get_ensembl_genome_fasta        1  
get_ensembl_gtf                 1  
hisat2                          6  
hisat2_ensembl_index            1  
mergeCounts                     1  
multiqc                         1  
rename                          6  
trim_galore                     6  
total                          43 

If you have resources, you can skip the steps of downloading the genome and building the index during the process, thus reducing both space and time. However, you must ensure that the files are complete.

Job stats:  
job                     count  
--------------------  -------
all                         1  
boxplot                     1
fastqc                     30  
featuresCounts             15  
hisat2                     15  
hisat2_ensembl_index        1  
mergeCounts                 1  
multiqc                     1  
rename                     15  
trim_galore                15  
total                      95  

Check the results

The results are in the results folder. Use tree -h -dto check the tree of the results

The results contains three folders of logs, results, resources. The contents of each folder refer to the following contents


├── [4.0K]  config
├── [4.0K]  logs
│   ├── [4.0K]  fastqc
│   ├── [4.0K]  featureCounts
│   ├── [4.0K]  hisat2
│   │   ├── [4.0K]  align
│   │   └── [4.0K]  index
│   ├── [4.0K]  mergeCount
│   ├── [4.0K]  multiqc
│   ├── [4.0K]  plot
│   ├── [4.0K]  resources
│   └── [4.0K]  trim_galore
├── [4.0K]  reads
├── [ 16K]  resources
│   └── [ 28K]  hisat2_index_Homo_sapiens
├── [4.0K]  results
│   ├── [4.0K]  featureCounts
│   ├── [4.0K]  mapped
│   ├── [4.0K]  mergeCounts
│   ├── [4.0K]  plot
│   ├── [4.0K]  qc
│   │   ├── [4.0K]  fastqc
│   │   └── [4.0K]  multiqc_data
│   └── [4.0K]  trimmed
└── [4.0K]  workflow
    ├── [4.0K]  envs
    ├── [4.0K]  rules
    └── [4.0K]  scripts
        └── [4.0K]  __pycache__

The config folder contains the configuration file of the pipeline.
The logs folder contains the log files of the pipeline.
The reads folder contains the raw reads.
The resources folder contains the resources of the pipeline.
The results folder contains the results of the pipeline.
The workflow folder contains the workflow of the pipeline.

The main limitation of this pipeline is that it requires a relatively large hard disk. Since tmp is not used, all files are retained for subsequent use. At the same time, gene.fasta, index and DESEq2 are not added because of the problem of multi-group comparison.

Check the DAG

You can check the DAG of the pipeline by running the following command:

snakemake --dag results/plot/ViolinplotFPKM.pdf | dot -Tsvg > dag.svg

dag

rna-seq-hisat2-featurecounts's People

Contributors

erwinqiao avatar

Stargazers

 avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.