seqgendiff's Introduction

RNA-Seq Generation/Modification for Simulation

This package takes real RNA-seq data (either single-cell or bulk) and alters it by adding signal to it. The signal takes the form of a generalized linear model with a log (base-2) link function under a Poisson / negative binomial / mixture of negative binomials distribution. The advantage of simulating data this way is that you can see how your method behaves when the simulated data exhibit common (and annoying) features of real data, without having to specify those features a priori. We call this way of adding signal “binomial thinning”.
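To make the idea concrete, here is a minimal base-R sketch of binomial thinning for a single gene. It is an illustration only, not the package's implementation: the Poisson stand-in data, the sample size, and the names `x`, `beta`, and `keep_prob` are assumptions chosen for the example.

```r
set.seed(1)

## Stand-in for real counts: one gene measured in 10,000 samples.
## (Poisson with mean 100 is an assumption made only for this illustration.)
n <- 10000
y <- rpois(n = n, lambda = 100)

## Assign half the samples to group 1 and choose a log2 effect size.
x <- rep(c(0, 1), each = n / 2)
beta <- -1  ## group 1 should end up with about half the expression of group 0

## Binomial thinning: keep each read in group 1 with probability 2^beta.
## Group 0 has keep probability 2^0 = 1, so its counts are left untouched.
keep_prob <- 2^(beta * x)
ythin <- rbinom(n = n, size = y, prob = keep_prob)

## The estimated log2 fold change between groups should be close to beta.
est <- log2(mean(ythin[x == 1]) / mean(ythin[x == 0]))
est
```

Because reads are only ever removed, the thinned counts never exceed the original counts, and whatever structure the original data had in the untouched group is preserved.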

The main functions are:

select_counts(): Subsample the columns and rows of a real RNA-seq count matrix. You would then feed this sub-matrix into one of the thinning functions below.
thin_diff(): The function most users should be using for general-purpose binomial thinning. For the special applications of the two-group model or library/gene thinning, see the functions listed below.
thin_2group(): The specific application of thinning in the two-group model.
thin_lib(): The specific application of library size thinning.
thin_gene(): The specific application of total gene expression thinning.
thin_all(): The specific application of thinning all counts.
effective_cor(): Returns an estimate of the actual correlation between the surrogate variables and a user-specified design matrix.
ThinDataToSummarizedExperiment(): Converts a ThinData object to a SummarizedExperiment object.
ThinDataToDESeqDataSet(): Converts a ThinData object to a DESeqDataSet object.
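
As a rough sketch of how these functions fit together, the following subsamples a count matrix with select_counts() and then adds a two-group signal with thin_2group(). The Poisson matrix is a stand-in for real data, and the particular values chosen for prop_null and the effect-size standard deviation are assumptions made for illustration; see the package documentation for the full set of arguments.

```r
library(seqgendiff)
set.seed(1)

## Stand-in for a real count matrix (genes in rows, samples in columns).
## In practice you would use real RNA-seq counts here.
Y <- matrix(rpois(n = 1000 * 50, lambda = 100), nrow = 1000, ncol = 50)

## Subsample 500 genes and 20 samples.
Ysub <- select_counts(mat = Y, nsamp = 20, ngene = 500)

## Two-group thinning: about 90% null genes; non-null genes get
## normally distributed log2 effects (sd = 0.8 is an arbitrary choice).
thout <- thin_2group(mat = Ysub,
                     prop_null = 0.9,
                     signal_fun = stats::rnorm,
                     signal_params = list(mean = 0, sd = 0.8))

dim(thout$mat)            ## modified count matrix
table(thout$designmat)    ## group assignment of each sample
mean(thout$coefmat == 0)  ## proportion of null genes
```

The returned object holds the modified counts together with the design and coefficient matrices that generated the signal, so the true effects are known when you evaluate a method.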

If you find a bug or want a new feature, please submit an issue.

Check out NEWS for updates.

Installation

To install from CRAN, run the following code in R:

install.packages("seqgendiff")

To install the latest development version of seqgendiff from GitHub, run the following code in R:

install.packages("devtools")
devtools::install_github("dcgerard/seqgendiff")

To get started, check out the vignettes by running the following in R:

library(seqgendiff)
browseVignettes(package = "seqgendiff")

Or you can check out the vignettes I post online: https://dcgerard.github.io/seqgendiff/.

Citation

If you use this package, please cite:

Gerard, D (2020). “Data-based RNA-seq simulations by binomial thinning.” BMC Bioinformatics. 21(1), 206. doi: 10.1186/s12859-020-3450-9.

A BibTeX entry for LaTeX users is

@article{gerard2020data,
    author = {Gerard, David},
    title = {Data-based {RNA}-seq simulations by binomial thinning},
    year = {2020},
    volume={21},
    number={1},
    pages={206},
    doi = {10.1186/s12859-020-3450-9},
    publisher = {BioMed Central Ltd},
    journal = {BMC Bioinformatics}
}

Code of Conduct

Please note that the ‘seqgendiff’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

seqgendiff's Issues

wording in simulation vignette

Thanks for the code, I like it. First, I confess to not having fully read the paper. I see it mounts a staunch defense of itself, but I need more time to look at the details.

I had to read the simulation vignette a few times before I understood what you were up to.

I gather now that you treat the usual biological explanatory variables (the variables of interest) as batch effects, and so subject to collection (well, modelling) by the surrogate variables from sva(), and then apply the thin_2group() procedure. It's a great idea, but I think describing the explanatory variables as batch effects or unwanted variation might help set the scene better.

Also, at the end you say SVA2 "gets at" the original biological explanatory variable, which is interesting, but I would have described it as actually "catching it", because that is what we want in this simulation context. Of course it's very interesting, because it's clear SVA can sometimes steal signal from the variable of interest, which is usually quite undesirable. I look forward to finishing your paper!

Add Vignette with Multiple Groups and Specifying Null Genes

A frequent question concerns setting up multiple groups and specifying the null genes. Adding a vignette would help out a lot.

Here is the script I usually send folks, and this could easily be turned into a vignette if it's cleaned up a little.

library(seqgendiff)
set.seed(1)

## Generate some simulated data. In practice you would use real data.
n <- 100
p <- 1000
Y <- rpois(n = n*p, lambda = 100)
dim(Y) <- c(p, n)

## Subsample individuals and genes so that the simulation does not depend
## on quirks of any particular subset.
Ysub <- select_counts(mat = Y, nsamp = 50, ngene = 1000)

## Generate design and coefficient matrices
group <- sample(c(1:4), size = ncol(Ysub), replace = TRUE)
group <- as.factor(group)
X <- model.matrix(~group)
X <- X[, -1, drop = FALSE] ## Remove intercept
betamat <- rnorm(ncol(X) * nrow(Ysub))
dim(betamat) <- c(nrow(Ysub), ncol(X))

## Choose which genes are null by setting those rows in betamat to 0
nullvec <- sample(x = c(TRUE, FALSE),
                  size = nrow(betamat),
                  replace = TRUE,
                  prob = c(0.9, 0.1))
betamat[nullvec, ] <- 0

## X is the design matrix representing group.
X

## betamat is the coefficient matrix containing effect sizes for
## each group at each gene
head(betamat)

## Thin
thout <- thin_diff(mat = Ysub, design_perm = X, coef_perm = betamat)
Ynew <- thout$mat
Xnew <- thout$designmat
betanew <- thout$coefmat

## Compare linear regression estimates with true coefficients
coefest <- t(coef(lm(log2(t(Ynew) + 0.5) ~ Xnew))[-1, , drop = FALSE])
plot(coefest, betanew)
abline(0, 1, lty = 2, col = 2)
