RNA-seq analysis

General sequencing data analysis materials

Next-Gen Sequence Analysis Workshop (2015) held by Titus Brown (now at UC Davis)
Fall 2015, BMMB 852: Applied Bioinformatics by Istvan Albert from Penn State University. He developed the ever-popular Biostars
Steven Turner at UVA is maintaining a list of training opportunities for genomic data analysis
Jeff Leek group's recommended genomic papers
Awesome tutorial on NGS file formats
UVA Bioconnector Workshops
Explaining your errors QC fail

RNA-seq specific

Introduction to RNA-seq analysis youtube video
RNAseq differential expression analysis – NGS2015
Kallisto and sleuth tutorial: blazing fast RNA-seq analysis by Lior Pachter's lab. A sleuth for RNA-Seq
pathway analysis using GAGE
Tutorial: RNA-seq differential expression & pathway analysis with Sailfish, DESeq2, GAGE, and Pathview
A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis
RNA-seq tutorial wiki Informatics for RNA-seq: A web resource for analysis on the cloud.
RNA-seqlopedia Great introduction of RNA-seq from sample preparation to data analysis
RNAseq data analysis from data carpentry
paper: Isoform prefiltering improves performance of count-based methods for analysis of differential transcript usage
paper: A survey of best practices for RNA-seq data analysis
paper: Cross-platform normalization of microarray and RNA-seq data for machine learning applications. Tool
review: Translating RNA sequencing into clinical diagnostics: opportunities and challenges

RNA-seq experimental design

Quality Control

QoRTs: a comprehensive toolset for quality control and data processing of RNA-Seq experiments
QUaCRS
RSeQC RNA-seq data QC
RNA-SeqQC

Normalization, quantification, and differential expression

A Comparison of Methods: Normalizing High-Throughput RNA Sequencing Data
Errors in RNA-Seq quantification affect genes of relevance to human disease
A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification
Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data
paper: Union Exon Based Approach for RNA-Seq Gene Quantification: To Be or Not to Be?
paper: The impact of amplification on differential expression analyses by RNA-seq Computational removal of read duplicates is not recommended for differential expression analysis.
paper: Normalization of RNA-seq data using factor analysis of control genes or samples: About spike-ins control and R normalization strategy - remove unwanted variation (RUV).
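
Several of the comparisons above include the median-of-ratios approach used by DESeq. As an illustration only, here is a minimal pure-Python sketch of that idea, assuming a small gene-by-sample count table; the function name `size_factors` and the toy data are hypothetical and not taken from any of the papers.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a dict of gene -> per-sample raw counts."""
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        # Genes with a zero count in any sample are skipped, because their
        # geometric mean across samples is zero and the ratio is undefined.
        if any(c <= 0 for c in row):
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    # One factor per sample: the median ratio of its counts to the
    # per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]


# Toy data (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
toy_counts = {
    "geneA": [100, 200],
    "geneB": [50, 100],
    "geneC": [10, 20],
    "geneD": [0, 5],
}
factors = size_factors(toy_counts)
normalized = {
    gene: [c / f for c, f in zip(row, factors)]
    for gene, row in toy_counts.items()
}
```

Dividing each sample's counts by its factor puts the samples on a common scale; real packages add further steps that this sketch leaves out.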

Traditional way of RNA-seq analysis

Two nature protocols for RNA-seq analysis
Count-based differential expression analysis of RNA sequencing data using R and Bioconductor. Based on DESeq and edgeR.
Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks

A nice tutorial from F1000Research: RNA-Seq workflow: gene-level exploratory analysis and differential expression, from Michael Love, who is the author of DESeq2.

A post from Nextgeneseek

QuickRNASeq lifts large-scale RNA-seq data analyses to the next level of automation and interactive visualization

The three papers largely replace earlier tools from Salzberg's group (Bowtie/TopHat, Cufflinks, and Cuffdiff).
They offer a new way to go from raw RNA-seq reads to differential expression analysis:
align RNA-seq reads to the genome (HISAT instead of Bowtie/TopHat, or STAR),
assemble transcripts and estimate expression (StringTie instead of Cufflinks), and
perform differential expression analysis (Ballgown instead of Cuffdiff).
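
The three steps can be sketched as a small Python helper that only builds the command lines, without running anything. This assumes the `hisat2`, `samtools` and `stringtie` executables with their commonly documented flags; the intermediate `samtools sort` step, the function name `build_pipeline` and all file names are assumptions made for illustration, so check them against the versions you have installed.

```python
import shlex


def build_pipeline(sample, index_prefix, annotation_gtf, reads_1, reads_2, threads=4):
    """Return the alignment, sorting and assembly commands as argument lists."""
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    return [
        # Step 1: spliced alignment to the genome. --dta asks HISAT2 to report
        # alignments in a form suited to downstream transcript assemblers.
        ["hisat2", "--dta", "-p", str(threads), "-x", index_prefix,
         "-1", reads_1, "-2", reads_2, "-S", sam],
        # Assumed helper step: StringTie expects coordinate-sorted alignments.
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        # Step 2: estimate expression against the reference annotation.
        # -e restricts output to annotated transcripts; -B writes the
        # table files that Ballgown reads.
        ["stringtie", bam, "-G", annotation_gtf, "-e", "-B",
         "-p", str(threads), "-o", f"{sample}/{sample}.gtf"],
    ]


# Hypothetical file names, for illustration only.
commands = build_pipeline(
    "sample1", "genome_index", "annotation.gtf",
    "sample1_R1.fastq.gz", "sample1_R2.fastq.gz",
)
for cmd in commands:
    print(shlex.join(cmd))
```

Step 3 (Ballgown) runs in R on the per-sample directories written by StringTie, so it is not shown here.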

RapMap: A Rapid, Sensitive and Accurate Tool for Mapping RNA-seq Reads to Transcriptomes. From Sailfish group.

BitSeq Transcript isoform level expression and differential expression estimation for RNA-seq

For mapping-based methods, the raw reads are usually mapped to the transcriptome or the genome (genome alignment needs to model gaps at exon-exon junctions), and then gene- or transcript-level counts are obtained by:

HTSeq-count: one of the most popular counting tools, but it is slow.
featureCounts: much faster, uses multiple threads.
VERSE: built on featureCounts, integrates HTSeq.
eXpress.
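
To make the idea of counting concrete, here is a deliberately simplified sketch: a read is assigned to a gene only if it overlaps exactly one gene interval. This is not how any of the tools above are implemented (they work on exons, strands and alignment files); the function name `count_reads`, the half-open coordinates and the toy data are all assumptions.

```python
from collections import Counter


def count_reads(reads, genes):
    """Count reads per gene; reads hitting several genes or none are set aside.

    reads: iterable of (chrom, start, end), half-open coordinates.
    genes: dict of gene name -> (chrom, start, end), half-open coordinates.
    """
    counts = Counter({name: 0 for name in genes})
    for chrom, start, end in reads:
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and end > g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif hits:
            counts["__ambiguous"] += 1  # overlaps more than one gene
        else:
            counts["__no_feature"] += 1  # overlaps no annotated gene
    return counts


# Toy annotation and reads (hypothetical).
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 180, 300)}
reads = [
    ("chr1", 110, 160),  # geneA only
    ("chr1", 120, 170),  # geneA only
    ("chr1", 150, 190),  # overlaps geneA and geneB
    ("chr1", 250, 290),  # geneB only
    ("chr2", 10, 60),    # no gene on chr2
]
counts = count_reads(reads, genes)
```

The linear scan over all genes for every read is fine for a toy example; real counters use interval indexes to stay fast on millions of reads.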

Finally, differential expression is carried out by

DESeq2
edgeR
limma voom
EBSeq An R package for gene and isoform differential expression analysis of RNA-seq data
JunctionSeq differential usage of exons and splice junctions in High-Throughput, Next-Generation RNA-Seq datasets. The methodology is heavily based on the DEXSeq bioconductor package. The core advantage of JunctionSeq over other similar tools is that it provides powerful automated tools for generating readable and interpretable plots and tables to facilitate the interpretation of the results. An example results report is available here.
MetaSeq Meta-analysis of RNA-Seq count data in multiple studies
derfinder Annotation-agnostic differential expression analysis of RNA-seq data at base-pair resolution
DGEclust is a program for clustering and differential expression analysis of expression data generated by next-generation sequencing assays, such as RNA-seq, CAGE and others
Degust: Perform RNA-seq analysis and visualisation. Simply upload a CSV file of read counts for each replicate; then view your DGE data.
Vennt Dynamic Venn diagrams for Differential Gene Expression.
Glimma Interactive HTML graphics for RNA-seq data

Extra Notes

Benchmarking

bcbio.rnaseq
RNAseqGUI. I have used it several times; it looks good.
compcodeR
paper: Benchmark Analysis of Algorithms for Determining and Quantifying Full-length mRNA Splice Forms from RNA-Seq Data
paper: Comparative evaluation of isoform-level gene expression estimation algorithms for RNA-seq and exon-array platforms
paper: A benchmark for RNA-seq quantification pipelines

Map free

RNASkim
Salmon: Accurate, Versatile and Ultrafast Quantification from RNA-seq Data using Lightweight-Alignment. It is the successor of Sailfish. I have used Sailfish once, and it is super-fast! Salmon is supposed to be even better. tutorial
Kallisto from Lior Pachter's lab. paper: Near-optimal probabilistic RNA-seq quantification
sleuth works with Kallisto for differential expression.
Differential analysis of RNA-Seq incorporating quantification uncertainty: sleuth
Reanalysis of published RNA-Seq data using kallisto and sleuth based on shiny.
tximport: import and summarize transcript-level estimates for gene-level analysis now on bioconductor
F1000Research paper: Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences, from Mike Love et al.

Blog posts on Kallisto

A Biostars post: Do not feed rounded estimates of gene counts from kallisto into DESeq2 (please make sure you read through all the comments; there is now a suggested workflow for feeding rounded estimates of gene counts to DESeq etc.)

There is some confusion in the answers to this question that hopefully I can clarify with the three comments below:

kallisto produces estimates of transcript level counts, and therefore to obtain an estimate of the number of reads from a gene the correct thing to do is to sum the estimated counts from the constituent transcripts of that gene. Of note in the language above is the word "estimate", which is necessary because in many cases reads cannot be mapped uniquely to genes. However insofar as obtaining a good estimate, the approach of kallisto (and before it Cufflinks, RSEM, eXpress and other "transcript level quantification tools") is superior to naïve "counting" approaches for estimating the number of reads originating from a gene. This point has been argued in many papers; among my own papers it is most clearly explained and demonstrated in Trapnell et al. 2013.

Although estimated counts for a gene can be obtained by summing the estimated counts of the constituent transcripts from tools such as kallisto, and the resulting numbers can be rounded to produce integers that are of the correct format for tools such as DESeq, the numbers produced by such an approach do not satisfy the distributional assumptions made in DESeq and related tools. For example, in DESeq2, counts are modeled "as following a negative binomial distribution". This assumption is not valid when summing estimated counts of transcripts to obtain gene level counts, hence the justified concern of Michael Love that plugging in sums of estimated transcript counts could be problematic for DESeq2. In fact, even the estimated transcript counts themselves are not negative binomial distributed, and therefore also those are not appropriate for plugging into DESeq2. His concern is equally valid with many other "count based" differential expression tools.

Fortunately there is a solution for performing valid statistical testing of differential abundance of individual transcripts, namely the method implemented in sleuth. The approach is described here. To test for differential abundance of genes, one must first address the question of what that means. E.g. is a gene differential if at least one isoform is? or if all the isoforms are? The tests of sleuth are performed at the granularity of transcripts, allowing for downstream analysis that can capture the varied questions that might make biological sense in specific contexts.

In summary, please do not plug in rounded estimates of gene counts from kallisto into DESeq2 and other tools. While it is technically possible, it is not statistically advisable. Instead, you should use tools that make valid distributional assumptions about the estimates.
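
For reference, the distributional assumption discussed in the post can be written out. This is a sketch of the negative binomial model as it is usually stated for DESeq2; the symbols are chosen here for illustration and do not come from the post itself.

```latex
K_{ij} \sim \mathrm{NB}(\mu_{ij}, \alpha_i), \qquad
\mu_{ij} = s_j \, q_{ij}, \qquad
\operatorname{Var}(K_{ij}) = \mu_{ij} + \alpha_i \, \mu_{ij}^2
```

Here $K_{ij}$ is the read count for gene $i$ in sample $j$, $s_j$ is a sample-specific size factor, $q_{ij}$ is proportional to the expected true abundance, and $\alpha_i$ is the gene-wise dispersion. Estimated counts from a quantifier are outputs of an inference step rather than direct draws from this model, which is the concern the post raises.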

However, Charlotte Soneson, Mike Love and Mark Robinson showed in an F1000Research paper, Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences, that rounded gene-level values derived from transcript-level estimates can be fed into DESeq2 and similar tools for gene-level differential expression, and that this is valid and in many ways preferable.

Thanks Rob Patro for pointing it out!
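
The gene-level summarization discussed above (summing the estimated counts of a gene's constituent transcripts, then optionally rounding) can be sketched in a few lines. This is a toy illustration, not the tximport implementation; `summarize_to_gene`, the transcript IDs and the numbers are hypothetical.

```python
from collections import defaultdict


def summarize_to_gene(est_counts, tx2gene):
    """Sum transcript-level estimated counts into gene-level estimated counts."""
    gene_counts = defaultdict(float)
    for transcript, count in est_counts.items():
        # Raises KeyError if a transcript is missing from the mapping,
        # which is safer than silently dropping it.
        gene_counts[tx2gene[transcript]] += count
    return dict(gene_counts)


# Hypothetical estimated counts for three transcripts of two genes.
est_counts = {"tx1": 10.4, "tx2": 5.3, "tx3": 7.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = summarize_to_gene(est_counts, tx2gene)
# Rounding produces integers in the format expected by count-based tools.
rounded = {gene: round(count) for gene, count in gene_counts.items()}
```

Whether the rounded sums are appropriate input for a given tool is exactly the question debated above, so treat this as a picture of the arithmetic, not a recommendation.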

artemis: RNAseq analysis, from raw reads to pathways, typically in a few minutes. Mostly by wrapping Kallisto and caching everything we possibly can.
isolator: Rapid and robust analysis of RNA-Seq experiments.

Isolator has a particular focus on producing stable, consistent estimates. Maximum likelihood approaches produce unstable point estimates: small changes in the data can result in drastically different results, conflating downstream analysis like clustering or PCA. Isolator produces estimates that are, in general, simultaneously more stable and more accurate than other methods.

Batch effects

TACKLING BATCH EFFECTS AND BIAS IN TRANSCRIPT EXPRESSION by Mike Love
paper: Tackling the widespread and critical impact of batch effects in high-throughput data by Jeffrey T. Leek in Rafael A. Irizarry's lab.
A reanalysis of mouse ENCODE comparative gene expression data
Is it species or is it batch? They are confounded, so we can't know
Mouse / Human Transcriptomics and Batch Effects
Meta-analysis of RNA-seq expression data across species, tissues and studies: Interspecies clustering by tissue is the predominantly observed pattern among various studies under various distance metrics and normalization methods
Surrogate Variable Analysis: SVA bioconductor
Paper Summary: Systematic bias and batch effects in single-cell RNA-Seq data

Databases

ReCount is an online resource consisting of RNA-seq gene count datasets built using the raw data from 18 different studies
The Digital Expression Explorer (DEE) is a repository of digital gene expression profiles mined from public RNA-seq data sets. These data are obtained from NCBI Short Read Archive.
blog post for it
SHARQ Search public, human, RNA-seq experiments by cell, tissue type, and other features | Indexing 19807 files
RESTful RNA-seq Analysis API A simple RESTful API to access analysis results of all public RNAseq data for nearly 200 species in European Nucleotide Archive.
intropolis is a list of exon-exon junctions found across 21,504 human RNA-seq samples on the Sequence Read Archive (SRA) from spliced read alignment to hg19 with Rail-RNA. Two files are provided:
ExpressionAtlas bioconductor package:

This package is for searching for datasets in EMBL-EBI Expression Atlas, and downloading them into R for further analysis. Each Expression Atlas dataset is represented as a SimpleList object with one element per platform. Sequencing data is contained in a SummarizedExperiment object, while microarray data is contained in an ExpressionSet or MAList object.

GTEx Resources in the UCSC Browser signal track on trackhub
batch recompute ~20,000 RNA-seq samples from large sequencing projects such as TCGA, TARGET and GTEx. Used hg38 and GENCODE v21 as annotation.

Gene Set enrichment analysis

Pathway analysis

[Statistical analysis and visualization of functional profiles for gene and gene clusters: bioconductor clusterProfiler](http://www.bioconductor.org/packages/devel/bioc/html/clusterProfiler.html) by GuangChuang Yu from the University of Hong Kong. It can do many jobs and produces GSEA-like figures. It is very useful and I will give it a try besides GAGE.
DAVID: The Database for Annotation, Visualization and Integrated Discovery (DAVID). Updated in 2016!

Fusion gene detection

fusioncatcher
PRADA from our lab
Fusion Matcher: Match predicted fusions according to chromosomal location or gene annotation(s)
paper: Comprehensive evaluation of fusion transcript detection algorithms and a meta-caller to combine top performing methods in paired-end RNA-seq data
paper: Comparative assessment of methods for the fusion transcripts detection from RNA-Seq data
chimera A package for secondary analysis of fusion products.
Pegasus Fusion Annotation and Prediction.
Oncofuse is a framework designed to estimate the oncogenic potential of de-novo discovered gene fusions. It uses several hallmark features and employs a bayesian classifier to provide the probability of a given gene fusion being a driver mutation.

Alternative splicing

SplicePlot: a tool for visualizing alternative splicing Sashimi plots
Multivariate Analysis of Transcript Splicing (MATS)
SNPlice is a software tool to find and evaluate the co-occurrence of single-nucleotide-polymorphisms (SNP) and altered splicing in next-gen mRNA sequence reads. SNPlice requires, as input: genome aligned reads, exon-intron-exon junctions, and SNPs. exon-intron-exon junctions and SNPs may be derived from the reads directly, using, for example, TopHat2 and samtools, or they may be derived from independent sources
Visualizing Alternative Splicing github page
spladder Tool for the detection and quantification of alternative splicing events from RNA-Seq data
SUPPA This tool generates different Alternative Splicing (AS) events and calculates the PSI ("Percentage Spliced In") value for each event exploiting the fast quantification of transcript abundances from multiple samples
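
As a rough illustration of the PSI idea mentioned for SUPPA, the value for an event can be computed from transcript abundances as the abundance of the transcripts that include the event divided by the abundance of all transcripts involved in it. This is a minimal sketch under that assumption; the function name `psi`, the transcript IDs and the TPM values are hypothetical, and the real tool handles event definition and multiple samples.

```python
def psi(tpm, inclusion, total):
    """Percentage Spliced In for one event, from transcript-level TPM values.

    tpm: dict of transcript ID -> abundance (TPM).
    inclusion: transcripts that contain the event's inclusion form.
    total: all transcripts involved in the event (inclusion and skipping forms).
    """
    denominator = sum(tpm[t] for t in total)
    if denominator == 0:
        return None  # PSI is undefined when none of the transcripts is expressed
    return sum(tpm[t] for t in inclusion) / denominator


# Hypothetical abundances: tx1 includes the exon, tx2 skips it,
# tx3 belongs to the same gene but is not involved in this event.
tpm = {"tx1": 30.0, "tx2": 10.0, "tx3": 60.0}
event_psi = psi(tpm, inclusion=["tx1"], total=["tx1", "tx2"])
unexpressed = psi({"tx1": 0.0, "tx2": 0.0}, inclusion=["tx1"], total=["tx1", "tx2"])
```

Because the calculation only needs transcript abundances, it can reuse the output of fast quantifiers instead of re-aligning reads.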

microRNAs and non-coding RNAs

miARma-Seq workflow miRNA-Seq And RNA-Seq Multiprocess Analysis tool, a comprehensive pipeline analysis suite designed for mRNA, miRNA and circRNA identification and differential expression analysis, applicable to any sequenced organism.
All the tools you need to analyse your miRNAs: tools4miRNAs
paper: Evaluation of microRNA alignment techniques

Transcriptional pausing

GRO-seq
RNApol2 ChIP-seq
iRNA-seq: computational method for genome-wide assessment of acute transcriptional regulation from total RNA-seq data

Allele-specific expression

Single cell RNA-seq

paper: Design and computational analysis of single-cell RNA-sequencing experiments
On the widespread and critical impact of systematic bias and batch effects in single-cell RNA-Seq data
review: Single-cell genome sequencing: current state of the science
Ginkgo A web tool for analyzing single-cell sequencing data.
Seurat is an R package designed for the analysis and visualization of single cell RNA-seq data. It contains easy-to-use implementations of commonly used analytical techniques, including the identification of highly variable genes, dimensionality reduction (PCA, ICA, t-SNE), standard unsupervised clustering algorithms (density clustering, hierarchical clustering, k-means), and the discovery of differentially expressed genes and markers.
R package for the statistical assessment of cell state hierarchies from single-cell RNA-seq data
Monocle Differential expression and time-series analysis for single-cell RNA-Seq and qPCR experiments.
Single Cell Differential Expression: bioconductor package scde
Sincera: A Computational Pipeline for Single Cell RNA-Seq Profiling Analysis. Bioconductor package will be available soon.
MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data
scDD: A statistical approach for identifying differential distributions in single-cell RNA-seq experiments
Inference and visualisation of Single-Cell RNA-seq Data as a hierarchical tree structure: bioconductor CellTree
Fast and accurate single-cell RNA-Seq analysis by clustering of transcript-compatibility counts by Lior Pachter et al.
cellity: Classification of low quality cells in scRNA-seq data using R.
bioconductor: using scran to perform basic analyses of single-cell RNA-seq data
scater: single-cell analysis toolkit for expression with R
Monovar: single-nucleotide variant detection in single cells
paper: Comparison of methods to detect differentially expressed genes between single-cell populations

Single cell RNA-seq clustering

Geometry of the Gene Expression Space of Individual Cells
pcaReduce: Hierarchical Clustering of Single Cell Transcriptional Profiles.
Single-Cell Consensus Clustering bioconductor package
CountClust: Clustering and Visualizing RNA-Seq Expression Data using Grade of Membership Models. Fits grade of membership models (GoM, also known as admixture models) to cluster RNA-seq gene expression count data, identifies characteristic genes driving cluster memberships, and provides a visual summary of the cluster memberships
FastProject: A Tool for Low-Dimensional Analysis of Single-Cell RNA-Seq Data
SNN-Cliq Identification of cell types from single-cell transcriptomes using a novel clustering method
