lucylgao / clusterpval Goto Github PK

View Code? Open in Web Editor NEW

15.0 3.0 5.0 64 KB

License: Other

R 100.00%

clusterpval's Introduction

clusterpval: Inference for Estimated Clusters in R

See http://lucylgao.com/clusterpval for tutorials and examples. See https://arxiv.org/abs/2012.02936 for the preprint.

Install

Make sure that devtools is installed by running install.packages("devtools"), then type

devtools::install_github("lucylgao/clusterpval")

clusterpval's People

Contributors

Stargazers

Watchers

Forkers

jeskowagner jannetty rosyluo gonzalez-delgado

clusterpval's Issues

Applying clusterpval in single cell data

Hi Lucy,

I am interested in using your statistical model to test gene expression differences between clusters.
My input is a matrix where: columns correspond to cells, rows to genes. Usually I would apply a (row-wise) wilcox test in order to check if the mean expression of one group of cells (belonging to a cluster) is significantly higher/lower than in the other group. This way, I can obtain a list of genes significantly up/down regulated between these two groups.

I have been testing your clusterpval tool but I am not sure how to apply it in cases like this. Can you help me?

Best,
Coral

How to apply this tool?

Dear authors,

Thank you for writing this package. I could understand the motivation behind it, however, owing to my lack of in-depth knowledge in this area, I find it hard to imagine how to include this package in my existing scRNAseq analysis workflow.

The vignette showed how to test for the difference between two clusters. But then what should I do if I have 10 clusters? Does this mean that I should perform a pairwise comparison of all 10 clusters (10 choose 2 = 45 tests)?

And if my p-value is above a certain threshold, can I use it to justify the merging of a pair of clusters?

Thanks for your insights.

Best,
Mikhael

Per-dimension difference in means

Hi,

I am interested in using this interesting approach for post-clustering inference, but I need to test the difference in means in each dimension separately (for differential expression analysis).

Is this possible/appropriate using this method?

Thanks in advance!

Deal with Seurat object

Hi Lucy,
Thank you so much for developing this tool.
I currently identify clusters through Seurat and I would like to use your tools to test the difference in mean of two clusters. I think I might use test_clusters_approx function for this:

_X1<-as.matrix(combined@assays[["RNA"]]@CountS)
cluster<[email protected][["seurat_clusters"]]
function_cluster <- function(X) {
combined<-CreateSeuratObject(X)
combined <- NormalizeData(combined, normalization.method = "LogNormalize", scale.factor = 10000)
combined <- FindVariableFeatures(combined, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(combined)
variable_gene<-combined@assays[["RNA"]]@var.features
combined<-ScaleData(combined, features = all.genes)
combined <- RunPCA(combined,features = all.genes)
combined <- RunUMAP(combined,features = all.genes)
combined <- RunTSNE(combined,features = all.genes)
combined <- FindNeighbors(combined,features = all.genes)
combined <- FindClusters(combined, resolution = 0.5)
return([email protected][["seurat_clusters"]])
}

cluster<-function_cluster(X1)
test_clusters_approx(X1, k1=1, k2=2, cl=cluster,cl_fun = function_cluster,ndraws=10000)_

combined is the Seurat object, when i turn into the final step, the function test_clusters_approx keeps running the cluster function over and over again...

Do you have any solutions for that?

Thank you!

speeding up `test_clusters_approx`

Like others who have opened issues, I'm interested in running this neat method to look at the difference of means in my clusters from single cell data. (Well, really I would like to adjust p-values for differential expression analysis, similar to #3 and #4, but as I understand it that extension is not possible right now). However running test_clusters_approx is painfully slow, as in #2. I took a look at the source code and it seems like the function could easily be sped up by running the piece where it runs cl_fun in a for loop in parallel (using future or any other parallellization package):

From

for(j in 1:ndraws) {
    if(phi[j] < 0) next
    
    # Compute perturbed data set
    Xphi <- X
    Xphi[cl == k1, ] <- t(orig_k1 + (phi[j] - stat)*k1_constant)
    Xphi[cl == k2, ] <- t(orig_k2 + (phi[j] - stat)*k2_constant)
    
    # Recluster the perturbed data set
    cl_Xphi <- cl_fun(Xphi)
    if(preserve_cl(cl, cl_Xphi, k1, k2)) {
      log_survives[j] <- -(phi[j]/scale_factor)^2/2 + (q-1)*log(phi[j]/scale_factor) - (q/2 - 1)*log(2) - log(gamma(q/2)) - log(scale_factor) -
        stats::dnorm(phi[j], mean=stat, sd=scale_factor, log=TRUE)
    }
  }

future_lapply(X = 1:ndraws, FUN = function(j) {
    if(phi[j] < 0) next
    
    # Compute perturbed data set
    Xphi <- X
    Xphi[cl == k1, ] <- t(orig_k1 + (phi[j] - stat)*k1_constant)
    Xphi[cl == k2, ] <- t(orig_k2 + (phi[j] - stat)*k2_constant)
    
    # Recluster the perturbed data set
    cl_Xphi <- cl_fun(Xphi)
    if(preserve_cl(cl, cl_Xphi, k1, k2)) {
      log_survives[j] <- -(phi[j]/scale_factor)^2/2 + (q-1)*log(phi[j]/scale_factor) - (q/2 - 1)*log(2) - log(gamma(q/2)) - log(scale_factor) -
        stats::dnorm(phi[j], mean=stat, sd=scale_factor, log=TRUE)
    }
  }

Would you be willing to implement this? I am also happy to make the modifications and submit a pull request for it.

Consider alternative dist alogrithm

Hello,

First off: thanks for the great paper, talks and package. The concept of computing p-values on clusters has been very useful to me.

When using large datasets I sometimes notice that the computation of p-values is very time demanding, however. I therefore profiled where I lose the most time, and found stats::dist to be the main offender when using

test_hier_clusters_exact(X=X, link="average")

So I played around with alternative ways to yield the same result as stats::dist but faster. I stumbled across this post by Florian Privé who deserves full credit for the following implementation, which I adapted to return the exact same data as stats::dist for consistency.

prive_dist = function(X){
    # Credit: Florian Privé (https://stackoverflow.com/a/59687204) 
    suppressWarnings(as.dist(sqrt(outer(rowSums(X^2), rowSums(X^2), '+') - tcrossprod(X, 2 * X))))
}

For large dataset, Privé's implementation is much faster, but performs worse for smaller datasets. See the following:

library(microbenchmark) # used for benchmarking

# which dataset sizes to test for distance computation
N = 10^(2:6)

# wrapper to benchmark a dataset of a given size
times = function(N){
    print(N)
    X = matrix(runif(N), nrow=sqrt(N))
    timings = microbenchmark(stats::dist(X), 
                             prive_dist(X),
                             times=10)
    print(timings, unit="relative")
}

sapply(N, times)

To which the output looks like this

[1] 100
Unit: relative
           expr      min       lq     mean   median      uq     max neval cld
 stats::dist(X) 1.000000 1.000000 1.000000 1.000000 1.00000 1.00000    10  a 
  prive_dist(X) 6.718533 6.702482 6.758787 5.977713 5.77667 8.49686    10   b
[1] 1000
Unit: relative
           expr      min      lq     mean   median       uq      max neval cld
 stats::dist(X) 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000    10  a 
  prive_dist(X) 1.366609 1.36663 1.411547 1.397399 1.460971 1.445506    10   b
[1] 10000
Unit: relative
           expr      min       lq     mean   median       uq     max neval cld
 stats::dist(X) 3.707272 7.008658 4.558213 6.926236 6.633612 1.34467    10   b
  prive_dist(X) 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000    10  a 
[1] 1e+05
Unit: relative
           expr      min       lq    mean   median       uq     max neval cld
 stats::dist(X) 16.49259 15.07916 16.4001 13.62932 19.25519 14.8057    10   b
  prive_dist(X)  1.00000  1.00000  1.0000  1.00000  1.00000  1.0000    10  a
[1] 1e+06
Unit: relative
           expr      min      lq     mean   median       uq      max neval cld
 stats::dist(X) 73.23724 55.5484 54.38444 54.05382 50.83549 46.74125    10   b
  prive_dist(X)  1.00000  1.0000  1.00000  1.00000  1.00000  1.00000    10  a

Arguably, the worse performance in small datasets is negligible given that it would still run fast in absolute terms, whereas the performance gain in larger datasets is massive.

One might want to think about implementing it in a way that you always get best performance. For example like this:

custom_dist = function(X){
    if(length(X) < 1000){
        return(stats::dist(X))
    } else {
        return(prive_dist(X))
    }
}

If you think this is a good idea I could submit a PR, probably next week.

Thanks again for the package and al your work! Best wishes,

Jesko

test_complete_hier_clusters_approx and test_clusters_approx return NaN for pval and stderr if run on datasets with more than 343 cols

Thank you for your work and for making this terrific R package!

I noticed when trying to run test_clusters_approx on the Zheng dataset from your clusterpval-experiments repo that test_complete_hier_clusters_approx and test_clusters_approx return NaN for pval and stderr on datasets with many columns. I think the gamma(q/2) on line 278 (in test_complete_hier_clusters_approx) and line 426 (in test_clusters_approx) will cause a float overflow if q is greater than 343 (if the input data has 344 or more columns). This float overflow causes the log_survives value to always be infinity, which I think makes the pval and stderr return NaN.

Because the gamma(q/2) value is immediately logged, and we know that q/2 will always be either a positive integer or a positive half integer (and thus we can calculate the particular values of the gamma function at each possible value of q/2), I think we can calculate the log(gamma(q/2)) value ahead of time taking advantage of the fact that log(a*b) = log(a) + log(b) and log(a/b) = log(a) - log(b).

I wrote a few helper functions that calculate the log(gamma(q/2)) values directly without overflowing a float (I will submit a pull request and link this issue). When using these helper functions both functions return float values for pval and stderr. Please let me know if I have misunderstood anything and thank you very much again for this R package!

lucylgao / clusterpval Goto Github PK

clusterpval's Introduction

clusterpval: Inference for Estimated Clusters in R

Install

clusterpval's People

Contributors

Stargazers

Watchers

Forkers

clusterpval's Issues

Applying clusterpval in single cell data

How to apply this tool?

Per-dimension difference in means

Deal with Seurat object

speeding up `test_clusters_approx`

Consider alternative dist alogrithm

test_complete_hier_clusters_approx and test_clusters_approx return NaN for pval and stderr if run on datasets with more than 343 cols

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent