seqgendiff's Introduction

RNA-Seq Generation/Modification for Simulation

License: GPL v3. Lifecycle: stable.

This package takes real RNA-seq data (either single-cell or bulk) and alters it by adding signal. The signal takes the form of a generalized linear model with a log (base-2) link function under a Poisson, negative binomial, or mixture-of-negative-binomials distribution. The advantage of simulating data this way is that you can see how your method behaves when the simulated data exhibit common (and annoying) features of real data, without having to specify these features a priori. We call this way of adding signal “binomial thinning”.
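To give a feel for the idea, here is a minimal base-R sketch of binomial thinning in a two-group setting. It is an illustration only, not the package's implementation: the toy Poisson counts, the group indicator `x`, and the effect sizes `beta` are all assumptions made up for this example. It relies on the fact that if a count is Poisson with mean λ and each read is kept independently with probability q, the thinned count is Poisson with mean qλ.

```r
set.seed(1)

## Toy "real" counts: 200 genes by 20 samples (in practice, use real data).
p <- 200
n <- 20
Y <- matrix(rpois(n = p * n, lambda = 100), nrow = p, ncol = n)

## Hypothetical two-group indicator and log2 effect sizes.
## The first 20 genes are non-null; the rest are null (effect of 0).
x <- rep(c(0, 1), each = n / 2)
beta <- c(rep(2, 20), rep(0, p - 20))

## Thinning probabilities 2^(beta * x), rescaled per gene so all are <= 1.
eta <- outer(beta, x)               # p x n matrix of log2 effects
q <- 2^(eta - apply(eta, 1, max))   # subtract each gene's row maximum

## Binomial thinning: keep each read independently with probability q.
Ythin <- matrix(rbinom(n = p * n, size = Y, prob = q), nrow = p, ncol = n)

## Crude log2 fold-change estimates between the two groups.
lfc <- log2(rowMeans(Ythin[, x == 1]) + 0.5) -
  log2(rowMeans(Ythin[, x == 0]) + 0.5)
```

Because thinning only ever removes reads, `Ythin` never exceeds `Y`, and null genes are left untouched. The estimated fold changes for the first 20 genes should sit close to the assumed effect size of 2.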

The main functions are:

  • select_counts(): Subsample the columns and rows of a real RNA-seq count matrix. You would then feed this sub-matrix into one of the thinning functions below.
  • thin_diff(): The function most users should be using for general-purpose binomial thinning. For the special applications of the two-group model or library/gene thinning, see the functions listed below.
  • thin_2group(): The specific application of thinning in the two-group model.
  • thin_lib(): The specific application of library size thinning.
  • thin_gene(): The specific application of total gene expression thinning.
  • thin_all(): The specific application of thinning all counts.
  • effective_cor(): Returns an estimate of the actual correlation between the surrogate variables and a user-specified design matrix.
  • ThinDataToSummarizedExperiment(): Converts a ThinData object to a SummarizedExperiment object.
  • ThinDataToDESeqDataSet(): Converts a ThinData object to a DESeqDataSet object.
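As a rough illustration of how the functions above fit together, the following sketch subsamples a count matrix and then applies two-group thinning and library-size thinning. The simulated counts and the chosen argument values (`prop_null = 0.9`, a standard normal `signal_fun`, and thinning each library by one log2 unit) are assumptions for this example; consult the function documentation for the full set of arguments. The code is skipped if seqgendiff is not installed.

```r
## Runs only if seqgendiff is available.
if (requireNamespace("seqgendiff", quietly = TRUE)) {
  library(seqgendiff)
  set.seed(1)

  ## Toy counts: 500 genes by 20 samples (in practice, use real data).
  Y <- matrix(rpois(n = 500 * 20, lambda = 100), nrow = 500)

  ## Subsample 200 genes and 10 samples.
  Ysub <- select_counts(mat = Y, nsamp = 10, ngene = 200)

  ## Two-group thinning: about 90% null genes, N(0, 1) log2 effects otherwise.
  th2 <- thin_2group(mat = Ysub,
                     prop_null = 0.9,
                     signal_fun = stats::rnorm,
                     signal_params = list(mean = 0, sd = 1))

  ## Library-size thinning: thin every sample by one log2 unit (about half).
  thl <- thin_lib(mat = Ysub, thinlog2 = rep(1, ncol(Ysub)))

  ## Thinned counts, design matrix, and true coefficients from thin_2group().
  Ynew <- th2$mat
  Xnew <- th2$designmat
  betanew <- th2$coefmat
}
```

The `mat`, `designmat`, and `coefmat` elements are the same ones used with `thin_diff()` output, so downstream code can treat the results of the different thinning functions alike.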

If you find a bug or want a new feature, please submit an issue.

Check out NEWS for updates.

Installation

To install from CRAN, run the following code in R:

install.packages("seqgendiff")

To install the latest development version of seqgendiff from GitHub, run the following code in R:

install.packages("devtools")
devtools::install_github("dcgerard/seqgendiff")

To get started, check out the vignettes by running the following in R:

library(seqgendiff)
browseVignettes(package = "seqgendiff")

Or you can check out the vignettes I post online: https://dcgerard.github.io/seqgendiff/.

Citation

If you use this package, please cite:

Gerard, D (2020). “Data-based RNA-seq simulations by binomial thinning.” BMC Bioinformatics. 21(1), 206. doi: 10.1186/s12859-020-3450-9.

A BibTeX entry for LaTeX users is

@article{gerard2020data,
    author = {Gerard, David},
    title = {Data-based {RNA}-seq simulations by binomial thinning},
    year = {2020},
    volume={21},
    number={1},
    pages={206},
    doi = {10.1186/s12859-020-3450-9},
    publisher = {BioMed Central Ltd},
    journal = {BMC Bioinformatics}
}

Code of Conduct

Please note that the ‘seqgendiff’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

seqgendiff's Issues

wording in simulation vignette

Thanks for the code, I like it. First, I confess to not having fully read the paper. I see it mounts a staunch defense of itself, but I need more time to look at the details.

I had to read the simulation vignette a few times before I understood what you were up to.

I gather now that you decide to treat the usual biological explainer variables (the variables of interest) as batch effects, so that they are subject to collection (well, modelling) in the surrogate variables by the sva() function, and then you impose the thin_2group() procedure. It's a great idea, but I think describing the explainer variables as batch effects or unwanted variance might help set the scene better.

Also, at the end you say SVA2 "gets at" the original biological explainer variable. That is interesting, but I would have described it as actually "catching it", because that is what we want in this simulation context. Of course it's very interesting, because it's clear SVA can sometimes end up stealing signal from the variable of interest, which is usually quite undesirable. I look forward to finishing reading your paper!

Add Vignette with Multiple Groups and Specifying Null Genes

A frequent question concerns setting up multiple groups and specifying the null genes. Adding a vignette would help out a lot.

Here is the script I usually send folks, and this could easily be turned into a vignette if it's cleaned up a little.

library(seqgendiff)
set.seed(1)

## Generate some simulated data. In practice you would use real data.
n <- 100
p <- 1000
Y <- rpois(n = n*p, lambda = 100)
dim(Y) <- c(p, n)

## Subsample individuals and genes so that the simulation does not depend
## on quirks of a particular set of individuals or genes.
Ysub <- select_counts(mat = Y, nsamp = 50, ngene = 1000)

## Generate design and coefficient matrices
group <- sample(c(1:4), size = ncol(Ysub), replace = TRUE)
group <- as.factor(group)
X <- model.matrix(~group)
X <- X[, -1, drop = FALSE] ## Remove intercept
betamat <- rnorm(ncol(X) * nrow(Ysub))
dim(betamat) <- c(nrow(Ysub), ncol(X))

## Choose which genes are null by setting those rows in betamat to 0
nullvec <- sample(x = c(TRUE, FALSE),
                  size = nrow(betamat),
                  replace = TRUE,
                  prob = c(0.9, 0.1))
betamat[nullvec, ] <- 0

## X is the design matrix representing group.
X

## betamat is the coefficient matrix containing effect sizes for
## each group at each gene
head(betamat)

## Thin
thout <- thin_diff(mat = Ysub, design_perm = X, coef_perm = betamat)
Ynew <- thout$mat
Xnew <- thout$designmat
betanew <- thout$coefmat

## Compare linear regression estimates with true coefficients
coefest <- t(coef(lm(log2(t(Ynew) + 0.5) ~ Xnew))[-1, , drop = FALSE])
plot(coefest, betanew)
abline(0, 1, lty = 2, col = 2)
