thelovelab / fishpond Goto Github PK

View Code? Open in Web Editor NEW

24.0 3.0 9.0 7.18 MB

Differential expression and allelic analysis, nonparametric statistics

Home Page: https://thelovelab.github.io/fishpond

R 97.26% TeX 2.74%

rnaseq bioconductor statistics genomics transcriptomics gene-expression salmon scrnaseq

fishpond's Introduction

fishpond

Fishpond: downstream methods and tools for expression data

Fishpond contains a method, swish(), for differential transcript and gene expression analysis of RNA-seq data using inferential replicates. Also the package contains utilities for working with Salmon, alevin, and alevin-fry quantification data, including loadFry().

Quick start

The following paradigm is used for running a Swish analysis:

y <- tximeta(coldata) # reads in counts and inf reps
y <- scaleInfReps(y) # scales counts
y <- labelKeep(y) # labels features to keep
set.seed(1) # for reproducibility
y <- swish(y, x="condition") # simplest Swish case

How does Swish work

Swish accounts for inferential uncertainty in expression estimates by averaging test statistics over a number of inferential replicate datasets, either posterior samples or bootstrap samples. This is inspired by a method called SAMseq, hence we named our method Swish, for "SAMseq With Inferential Samples Helps". Averaging over inferential replicates produces a different test statistic than what one would obtain using only point estimates for expression level.

For example, one of the tests possible with swish() is a correlation test of expression level over a condition variable. We can visualize the distribution of inferential replicates with plotInfReps():

The test statistic is formed by averaging over these sets of data:

p-values and q-values are computed through permutation of samples (see vignette for details on permutation schemes).

The Swish method is described in the following publication:

Zhu, A., Srivastava, A., Ibrahim, J.G., Patro, R., Love, M.I. "Nonparametric expression analysis using inferential replicate counts" Nucleic Acids Research (2019) 47(18):e105 PMC6765120

The SEESAW method for allelic expression analysis is described in the following preprint:

Euphy Wu, Noor P. Singh, Kwangbom Choi, Mohsen Zakeri, Matthew Vincent, Gary A. Churchill, Cheryl L. Ackert-Bicknell, Rob Patro, Michael I. Love. "Detecting isoform-level allelic imbalance accounting for inferential uncertainty" bioRxiv (2022) doi: 10.1101/2022.08.12.503785

Installation

This package can be installed via Bioconductor:

BiocManager::install("fishpond")

Funding

This work was funded by NIH NHGRI R01-HG009937.

fishpond's People

Contributors

Stargazers

Watchers

Forkers

k3yavi wzacs1 lianos varix fennecfish dongzehe antoniogps josephlalli npsdc

fishpond's Issues

gene_ids after summarising counts with loadFry

Thank you for making Fishpond! I have been using the loadFry function to aggregate USA counts produced by Alevin Fry (adding up U+S+A). I would have expected that the summarised counts table would have a third of the gene_ids but actually there are fewer. The only filtering I found in the documentation was nonzero but that defaults to FALSE. Looking up some of the ENSGs that are missing from the collated output, it seems that they are pseudogenes. Is this some additional filtering that loadFry does?

One-sample t-test for paired design

R CMD check WARNING: Requires (indirectly) orphaned package: ‘gtools’

Hello @mikelove,

I got a warning when doing R CMD check for the roe package saying that an orphaned package is indirectly required. gtools is a dependency in fishpond.

  ❯ checking package dependencies ... WARNING
  Requires (indirectly) orphaned package: ‘gtools’

I think this happened very recently because I did not see this warning last month. The only effect is that this warning made my github actions unhappy :(.

In the package's webpage, it shows that the maintainer is ORPHANED.

Version: | 3.9.2.2
-- | --
Depends: | R (≥ 2.10), methods, stats, utils
Suggests: | car, gplots, knitr, rstudioapi, SGP, taxize
Published: | 2022-06-13
Author: | Gregory R. Warnes, Ben Bolker, Thomas Lumley and CRAN team
Maintainer: | ORPHANED

Just want to report this. As gools is a widely used package, maybe someone will take the maintainership soon.

Dongze

Unused argument (diffAI = TRUE)

Hi,

I am trying to create a stimulation data following the guide, but I got a error:

library(fishpond)
set.seed(1)
y <- makeSimSwishData(diffAI=TRUE, n=12)
colData(y)

Error in makeSimSwishData(diffAI = TRUE, n = 12) : 
  unused argument (diffAI = TRUE)

I think it's because the version of fishpond seems too old (2.2.0), but this seems to be the latest version in BiocManager on the Windows platform?

Regrads.

Zhang

Extend to ordinal conditions

Briefly looking through the code here, it seems like the basic algorithm (correlation + bootstrapping) is extendable to more than two conditions. This would in effect turn your DE software into a potential engine for uncertainty-aware eQTL/sQTL analyses.

I’m sure I’ve missed some major hurdle - as this work is at the far end of my bioinformatics knowledge - but if not, is this a feature you are considering adding?

Error in infRepError(infRepIdx) : there are no inferential replicates in the assays of 'y'

Hi Mike,
Can you give some hints for my error message?

coldata <- read.csv(file.path(dir, "coldata_mark.csv"))
coldata
sample condition type
1 A1 adequate Female
2 A2 adequate Female
3 A3 adequate Female
4 D1 deficient Female
5 D2 deficient Female
6 D3 deficient Female
7 MA1 adequate Male
8 MA2 adequate Male
9 MA3 adequate Male
10 MD1 deficient Male
11 MD2 deficient Male
12 MD3 deficient Male
names(coldata)
[1] "sample" "condition" "type"

names(coldata) <- c("names","condition","type")

coldata$files <- file.path(dir, "quants_mus", coldata$names, "quant.sf")
all(file.exists(coldata$files))
[1] TRUE
suppressPackageStartupMessages(library(SummarizedExperiment))
se <- tximeta(coldata)
importing quantifications
reading in files with read.delim (install 'readr' package for speed up)
1 2 3 4 5 6 7 8 9 10 11 12
found matching transcriptome:
[ GENCODE - Mus musculus - release M24 ]
building TxDb with 'GenomicFeatures' package
Import genomic features from the file as a GRanges object ... trying URL 'ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M24/gencode.vM24.annotation.gtf.gz'
Content type 'unknown' length 28499422 bytes (27.2 MB)
==================================================
OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
generating transcript ranges
fetching genome info for GENCODE
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type stop_codon. This information was
ignored.
assayNames(se)
[1] "counts" "abundance" "length"
head(rownames(se))
[1] "ENSMUST00000193812.1" "ENSMUST00000082908.1" "ENSMUST00000162897.1" "ENSMUST00000159265.1" "ENSMUST00000070533.4"
[6] "ENSMUST00000192857.1"

female samples

y <- se
#y <- y[,y$condition %in% c("naive","IFNg")]
y <-y[,y$type %in% c("Female")]
y
class: RangedSummarizedExperiment
dim: 140948 6
metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo
assays(3): counts abundance length
rownames(140948): ENSMUST00000193812.1 ENSMUST00000082908.1 ... ENSMUST00000082422.1 ENSMUST00000082423.1
rowData names(3): tx_id gene_id tx_name
colnames(6): A1 A2 ... D2 D3
colData names(3): names condition type
y$condition
[1] "adequate" "adequate" "adequate" "deficient" "deficient" "deficient"
y$type
[1] "Female" "Female" "Female" "Female" "Female" "Female"
y <- scaleInfReps(y)
Error in infRepError(infRepIdx) :
there are no inferential replicates in the assays of 'y'

Using loadFry for HTO quantifications

Hi @DongzeHE and @mikelove -
I'm currently working through this alevin-fry vignette for processing HTO/ADT data from 10X.

The tutorial was written I believe before loadFry was incorporated here into fishpond, and the authors include their own load_fry function for reading alevin-fry quantifications (RNA, ADT, HTO) into R.

I can load my RNA data just fine with fishpond::loadFry but I can't figure out how to load the HTO data that way, I get an error:

locating quant file
Reading meta data
USA mode: FALSE
Processing 10 genes and 10013 barcodes
Not in USA mode, ignore argument outputFormat
Constructing output SingleCellExperiment object
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'j' in selecting a method for function '[': error in evaluating the argument 'x' in selecting a method for function 'colSums': unable to find an inherited method for function 'assay' for signature '"SingleCellExperiment", "logical"'

But their load_fry function works fine for this, so it seems like the two functions have deviated somewhat.

Is there some parameter I can specify to fishpond::loadFry to properly load the HTO object? I've even tried a custom format specifying only "spliced" counts (since that's what their vignette suggests) but I get the same error.

Thanks!

Allelic plots should offer to customize allelic labels

matrixStats::rowRanks(): Specify argument 'ties.method = "max"' to avoid being affected by future updates

matrixStats will align the default ties.method for colRanks() and rowRanks() with that of base::rank(), cf. HenrikBengtsson/matrixStats#142. Because of this, I ran a few reverse dependency checks to see what could break, and I spotted fishpond. It doesn't specify ties.method, so its results will be affected when we change the default:

https://github.com/mikelove/fishpond/search?q=rowRanks

Wrong index in randomSamplesToRemove()

Hello @mikelove , this issue is related to a question that I posted on Bioconductor (link). I did some debugging in RStudio, and I believe there is a bug in randomSamplesToRemove(). Note that the version of fishpond that I am using is 2.6.0.

The inputs I have for the function are as follows.

> condition
[1] neg_control neg_control neg_control neg_control
[5] neg_control 1uM_MeCHO 1uM_MeCHO 1uM_MeCHO
[9] 1uM_MeCHO 1uM_MeCHO 1uM_MeCHO
Levels: neg_control 1uM_MeCHO
> covariate
[1] heterozygous heterozygous heterozygous wildtype
[5] wildtype heterozygous heterozygous heterozygous
[9] wildtype wildtype wildtype
Levels: heterozygous wildtype
> tab
                  covariate
condition    heterozygous  wildtype
   neg_control         3         2
   1uM_MeCHO           3         3

There is an imbalance in the covariate in the negative control (neg_control) group, and I believe that the intention is to remove a heterozygous sample from the negative control group. The current code extracts tab[1,i] and tab[2,i]. In my case, it extracts tab[1,1] and tab[2,1], and the values are 3 and 3 respectively. Their subtraction is then the size input to sample(), meaning that no sample will actually be selected to be removed from the imbalanced group. Ultimately, this causes an error downstream.

Therefore, I believe this line of code should have been

cond1small <- tab[1,i] < tab[i,2]

Likewise, this should have been

tab[i,2] - tab[1,i]

And, this should also have been

tab[1,i] - tab[i,2]

Thank you for your time and consideration!

support of sparsed Matrix output of alevin

Hi,
Currently the default output of salmon alevin is sparsed Matrix, but fishpond can deal only with regular matrix. E.g. trying to apply of scaleInfReps (v. 1.3.10) to salmon alevin (v.1.1.0) output file results to this:

Error in colSums(counts) :
  'x' must be an array of at least two dimensions

Error in dbFileConnect(file)

Hi,
When I try to run your sample code and data, I found the following error messages, do you know what my problem is? I can successfully load all required R packages. Thank you

-------error message below_---------

suppressPackageStartupMessages(library(SummarizedExperiment))
se <- tximeta(coldata)
importing quantifications
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
found matching transcriptome:
[ Gencode - Homo sapiens - release 29 ]
loading existing TxDb created: 2020-06-05 21:23:05
Error in dbFileConnect(file) :
DB file '/Users/xxx/Library/Caches/BiocFileCache/13d2eae3fb0a_13d2eae3fb0a.sqlite' not found

How to output the scaled counts for each sample

Dear @mikelove

Thank you for your package. This is my first time to run Swish ( following the vignette, https://bioconductor.org/packages/devel/bioc/vignettes/fishpond/inst/doc/swish.html) using inferential replicates computed by salmon.

First, when running salmon, I set --validateMappings --seqBias --gcBias --numBootstraps 100 to generate bootstrap samples. Here, you typically recommend 20-30 inferential replicates, but I set --numBootstraps 100, would this have an effect on the quantification results?

Second, after running following commands, how to save the scaled counts to a file? The se object contains following metadata.

class: SummarizedExperiment 
dim: 14796 14 
metadata(4): tximetaInfo countsFromAbundance infRepsScaled preprocessed
assays(103): counts abundance ... infRep99 infRep100
rownames(14796): gene-A1CF gene-A2M ... gene-ZZEF1 gene-ZZZ3
rowData names(7): log10mean keep ... locfdr qvalue
colnames(14): C1 C2 ... DXM6 DXM7
colData names(2): names condition

library(tximeta)
library(fishpond)
library(GenomicRanges)
library(SummarizedExperiment)
data <- read.table("sample.inf",head=T)
dir <- getwd()
data$files<-file.path(dir,data$names,"quant.sf")
all(file.exists(data$files))
data <- data[data$condition %in% c("C","DXM"),]
data$condition <- factor(data$condition, levels=c("C","DXM"))
tx2gene <- read.table("tran_to_gene", head=F, col.names=c("TXNAME","GENEID"))
se <- tximeta(data, skipMeta=TRUE, txOut=FALSE, tx2gene=tx2gene)
se<-scaleInfReps(se)
se <- labelKeep(se)
se <- se[mcols(se)$keep,]
set.seed(1)
se <- swish(se, x="condition")

Looking forward to your reply, thank you.
Yours sincerely.

Zheng zhuqing

Have all 0 LFC for all infReps, remove these first

Hi,

I got a error when test dynamic allelic imbalance,

y <- swish(y, x="allele", pair="samples",
           cov="time", cor="pearson")

Error in swishCorPair(infRepsArray, condition, pair, covariate, cor, nperms,  : 
  5220 features have all 0 LFC for all infReps, remove these first

However, I am not sure how to remove all 0 LFC for all infReps, could you please give me more infomation?

Regrads,

Zhang

plotInfReps y axis mislabel

Hello,

First I want to thank you for the great package!

When I was trying to plot the expression of transcripts using plotInfReps, I noticed that the y axis label was 'counts' in stead of 'scaled counts' by default although I've run scaleInfReps already in the first step. I guess this was due to a mistyped "infRepsScaled" in these lines?

https://github.com/mikelove/fishpond/blob/ca8c3f6b77b20991bcbd5de90a186544df3871a6/R/helper.R#L160-L163