dorothea's Introduction

DoRothEA: collection of human and mouse regulons

Overview

DoRothEA is a gene regulatory network (GRN) containing signed transcription factor (TF) - target gene interactions. DoRothEA regulons, each the collection of a TF and its transcriptional targets, were curated and collected from different types of evidence for both human and mouse. A confidence level was assigned to each TF-target interaction based on the amount of supporting evidence.

These regulons, coupled with any statistical method, can be used to infer TF activities from bulk or single-cell transcriptomics.

This is an R package for storing the regulons. To infer TF activities, please check out decoupleR, available in both R and Python.
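
The regulons loaded from this package (or retrieved via decoupleR) can be combined with any of decoupleR's statistical methods. A minimal sketch, assuming decoupleR is installed and that `expr` is a placeholder for a gene-by-sample matrix of normalised expression values:

library(decoupleR)

# Retrieve the DoRothEA regulons at the desired confidence levels
net <- get_dorothea(organism = "human", levels = c("A", "B", "C"))

# Infer TF activities with a univariate linear model (one of several available methods)
tf_acts <- run_ulm(mat = expr, network = net,
                   .source = "source", .target = "target", .mor = "mor",
                   minsize = 5)
head(tf_acts)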

Installation

DoRothEA is available in Bioconductor. Alternatively, the development version can be installed from the GitHub repository:

## To install the package from Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("dorothea")

## To install the development version from the Github repo:
devtools::install_github("saezlab/dorothea")

Updates

Since the original release, we have implemented some extensions in DoRothEA:

  1. Extension to mouse: Originally DoRothEA was developed for application to human data. In a benchmark study we showed that DoRothEA is also applicable to mouse data, as described in Holland et al., 2019. Accordingly, we included new parameters to run the mouse version of DoRothEA by mapping the human genes to their mouse orthologs.

  2. Extension to single-cell RNA-seq data: We showed that DoRothEA can be applied to scRNA-seq data, as described in Holland et al., 2020.

  3. Extension to other databases: We have released a new literature-based GRN with increased coverage and better performance at identifying perturbed TFs, called CollecTRI. We encourage users to use CollecTRI instead of DoRothEA. Vignettes on how to obtain activities are available in the decoupleR package; a minimal loading sketch follows this list.
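
As a rough illustration of points 1 and 3, both resources can be loaded in a few lines. This is a sketch that assumes the dorothea and decoupleR packages are installed:

# Mouse regulons bundled with this package (human genes mapped to mouse orthologs)
data(dorothea_mm, package = "dorothea")
head(dorothea_mm)

# CollecTRI, the recommended successor resource, retrieved through decoupleR
collectri <- decoupleR::get_collectri(organism = "human", split_complexes = FALSE)
head(collectri)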

License

DoRothEA is intended only for academic use as it contains resources whose licenses do not permit commercial use. However, we developed a non-academic version of DoRothEA by removing those resources (mainly TRED from the curated databases). You can find the non-academic package with the regulons here.

Citation

If you use the DoRothEA resource, please cite:

Garcia-Alonso L, Holland CH, Ibrahim MM, Turei D, Saez-Rodriguez J. Benchmark and integration of resources for the estimation of human transcription factor activities. Genome Research. 2019. DOI: 10.1101/gr.240663.118.

If you infer TF activities, please cite:

Badia-i-Mompel P., Vélez Santiago J., Braunger J., Geiss C., Dimitrov D., Müller-Dott S., Taus P., Dugourd A., Holland C.H., Ramirez Flores R.O. and Saez-Rodriguez J. decoupleR: Ensemble of computational methods to infer biological activities from omics data. 2022. Bioinformatics Advances. DOI: 10.1093/bioadv/vbac016

If you use the CollecTRI resource, please cite:

Müller-Dott S., Tsirvouli E., Vázquez M., Ramirez Flores R.O., Badia-i-Mompel P., Fallegger R., Lægreid A. and Saez-Rodriguez J. Expanding the coverage of regulons from high-confidence prior knowledge for accurate estimation of transcription factor activities. bioRxiv. 2023. DOI: 10.1101/2023.03.30.534849


dorothea's Issues

Request: tables from Garcia-Alonso et al 2019

Please cancel my request - it was solved by Corrigendum: Benchmark and integration of resources for the estimation of human transcription factor activities.
Many thanks.
Best,
Jérémie

TF Enrichment Scores

Thank you for developing this great tool!

For the TF enrichment scores from the single-cell vignette (https://bioconductor.org/packages/release/data/experiment/vignettes/dorothea/inst/doc/single_cell_vignette.html#introduction), do positive values mean positive enrichment and negative values mean negative enrichment of the respective TFs?

I am trying to take the ratio of TF enrichment scores for two different cell types and want to understand how best to do this/interpret the results.

access to dorothea_hs?

Hi, maybe it is an obvious question, but is there any way to access the dorothea_hs dataset (containing TF-target interactions with confidence levels and A-E scoring)? It would be interesting for me to explore which genes are taken into consideration to define the significant regulons found in a single-cell experiment. Thanks.
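
The regulons ship as lazy-loaded data objects, so the interactions behind any regulon can be inspected directly. A minimal sketch; "STAT1" below is just an example TF:

library(dorothea)
library(dplyr)

data(dorothea_hs, package = "dorothea")

# All targets, confidence levels and modes of regulation behind one regulon
dorothea_hs %>% filter(tf == "STAT1")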

Some conflicts of gene symbol

Thanks for your interesting tools.
I found some gene symbol conflicts in the provided human regulons. For example, BHLHE40 and its previous HGNC symbols STRA13 / BHLHB2 can all be found in the regulons.
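
A quick way to see how the alias symbols show up in the shipped regulons, as a small sketch using the symbols mentioned above:

library(dorothea)
library(dplyr)

data(dorothea_hs, package = "dorothea")

# Count interactions listed under each of the (alias) symbols
dorothea_hs %>%
  filter(tf %in% c("BHLHE40", "STRA13", "BHLHB2")) %>%
  count(tf)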

run_viper() ERROR: no method for coercing this S4 class to a vector

Hi, I am running the example of run_viper() but it reports the error "Error in as.vector(data) : no method for coercing this S4 class to a vector". The same error also comes up when I run run_viper() with my own Seurat object. Would you help me with this?

CODE

library(bcellViper)
data(bcellViper, package = "bcellViper")
data(dorothea_hs, package = "dorothea")
tf_activities <- run_viper(dset, dorothea_hs,
                           options =  list(method = "scale", minsize = 4,
                           eset.filter = FALSE, cores = 1,
                           verbose = FALSE))

ERROR

Error in as.vector(data) :
no method for coercing this S4 class to a vector

VERSION

R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

other attached packages:
[1] bcellViper_1.38.0 Signac_1.12.0 Biobase_2.62.0 BiocGenerics_0.48.1 Seurat_5.0.2 SeuratObject_5.0.1 sp_2.1-3
[8] pheatmap_1.0.12 decoupleR_2.8.0 dorothea_1.14.1

Input format

Dear developers,

What is the optimal input format for DoRothEA (bulk transcriptome analysis)? Is it FPKM, normalised counts, or log-normalised counts?
In modern annotations we have up to 50,000 genes and many of them are not expressed. Should I filter out the lowly expressed genes before the analysis?
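
One common preprocessing route, sketched below under the assumption that `dds` is a DESeqDataSet built from raw counts, is to drop essentially unexpressed genes and pass variance-stabilised (or otherwise log-like normalised) values to run_viper():

library(DESeq2)

# Remove genes with almost no counts before the transformation
dds <- dds[rowSums(counts(dds)) >= 10, ]

# Variance-stabilising transformation; the resulting matrix can be used as input
vsd <- vst(dds, blind = FALSE)
expr <- assay(vsd)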

install error

Dear Team,

I am struggling to install dorothea from Bioconductor, GitHub or via conda. I get the error: package 'dorothea' is not available (for R version 3.6.1)

Please could you advise on how I get around this?

Many thanks,
Oliver

> BiocManager::install("dorothea")
Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
Installing package(s) 'dorothea'
Old packages: 'rJava', 'Seurat'
Update all/some/none? [a/s/n]:
Update all/some/none? [a/s/n]:
n
Warning message:
package 'dorothea' is not available (for R version 3.6.1)
> devtools::install_github("saezlab/dorothea")
Downloading GitHub repo saezlab/dorothea@HEAD
✔  checking for file '/tmp/slurm_18930860/RtmpkBYJrq/remotes205ac5b742d5b/saezlab-dorothea-6e2b837/DESCRIPTION' (533ms)
─  preparing 'dorothea':
✔  checking DESCRIPTION meta-information ...
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  looking to see if a 'data/datalist' file should be added
─  building 'dorothea_1.3.0.tar.gz' (3s)
   Warning: invalid uid value replaced by that for user 'nobody'
   Warning: invalid gid value replaced by that for user 'nobody'
   
ERROR: this R is version 3.6.1, package 'dorothea' requires R >= 4.0
Error: Failed to install 'dorothea' from GitHub:
  (converted from warning) installation of package '/tmp/slurm_18930860/RtmpkBYJrq/file205ac225b1ee6/dorothea_1.3.0.tar.gz' had non-zero exit status
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
 
Matrix products: default
BLAS/LAPACK: /camp/lab/luscomben/home/users/ziffo/.conda/envs/rtest/lib/libopenblasp-r0.3.7.so
 
locale:
[1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C             
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8   
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8  
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                
 [9] LC_ADDRESS=C               LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C      
 
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    
 
loaded via a namespace (and not attached):
[1] BiocManager_1.30.10 compiler_3.6.1      tools_3.6.1 

Calculating TF scores and inference on SEURAT clusters

Hi everyone,

Thank you a lot for the package,
I followed the tutorial in which the clustering is redone on the TF activities.
I got interesting results; however, I would like to keep my initial Seurat clustering and compute the TF activity scores (and the downstream steps) on those clusters.

How would I do that?

Thank you a lot,
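
One possible approach, sketched under the assumption that `pbmc` is a Seurat object already clustered on the RNA assay and `regulon` is a filtered DoRothEA regulon table: keep the original identities and only summarise the TF activities per cluster.

library(Seurat)
library(dorothea)

pbmc <- run_viper(pbmc, regulon,
                  options = list(method = "scale", minsize = 4, verbose = FALSE))

# Keep the original RNA-based clusters instead of re-clustering on TF activities
Idents(pbmc) <- "seurat_clusters"

# Mean TF activity per existing cluster from the "dorothea" assay
avg_tf <- AverageExpression(pbmc, assays = "dorothea")$dorothea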

Problem with object sizes.

Dear dorothea developers.

I am facing a size issue when working with large Seurat objects (in my case, 27584 rows and 76680 columns). Because dorothea::run_viper.Seurat() makes use of as.matrix(), it returns the following error:

"Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105"

This is due to the size of the object, as rows*columns exceeds .Machine$integer.max (2147483647).

Would it be possible to use Seurat::as.sparse() or similar instead of as.matrix()? If not, how can I solve this issue?

Best,
Enrique
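
A quick diagnostic, as a minimal sketch assuming `seurat_obj` stands for the large object in question: check whether dense coercion would exceed R's 2^31-1 element limit before calling run_viper(). If it does, subsetting to fewer cells or to highly variable genes keeps the product below the limit.

m <- Seurat::GetAssayData(seurat_obj, assay = "RNA", slot = "data")

# TRUE means as.matrix() will fail with the Cholmod 'problem too large' error
prod(dim(m)) > .Machine$integer.max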

Question regarding input into run_viper()

Hello, I am using the DoRothEA package for bulk RNA-seq analysis. I have been working with the DESeq2 package. Can I use vst-normalised data as the input to run_viper, or does the input have to be non-transformed, non-pre-filtered counts?

Thanks

run_viper input

Dear team,

Thank you for this excellent package.

I am trying to use run_viper function with the dorothea regulons on two different datasets and look for overlap between the two datasets.

For the gene_expression input to run_viper, is this expecting normalised counts? I was planning to use the variance-stabilised values from DESeq2 (http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#variance-stabilizing-transformation) for this - is this correct?

Because I am interested in the enrichment similarities between two datasets would you advise that the input should contain the gene counts for all samples from both datasets together like:

tf_activities <- run_viper(assay(vsd_datasets1and2), regulons, options =  list(method = "scale", minsize = 25, eset.filter = FALSE, cores = 4)) 

Or should I perform run_viper separately for both datasets and then merge the output matrices, i.e.:

tf_activities.dataset1 <- run_viper(assay(vsd_dataset1), regulons,options =  list(method = "scale", minsize = 25, eset.filter = FALSE, cores = 4))
tf_activities.dataset2 <- run_viper(assay(vsd_dataset2), regulons,options =  list(method = "scale", minsize = 25, eset.filter = FALSE, cores = 4))
tf_activities..merged <- as_tibble(tf_activities.dataset1, rownames = "TF") %>% full_join(as_tibble(tf_activities.dataset2, rownames = "TF"), by = "TF")

Presumably, if running them together is better, then I should run DESeq2's vst function on all samples together too?

Many thanks for your help!
Oliver

Retrieve genes used for TF ID

Dear Dorothea's team,
Is it possible to obtain a list of the genes from the input data that have been used to identify a determined TF in each cell (or cluster)?
Thanks for such a good tool,
Cheers

problem too large error

Hi:

I tried to run the run_viper step but encountered this error. I have about 100k cells; is that too many? What is the limit?

thanks

Error in asMethod(object): Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Traceback:

  1. run_viper(t_only, regulon, options = list(method = "scale", minsize = 4,
    . eset.filter = FALSE, cores = 1, verbose = FALSE))
  2. run_viper.Seurat(t_only, regulon, options = list(method = "scale",
    . minsize = 4, eset.filter = FALSE, cores = 1, verbose = FALSE))
  3. as.matrix(Seurat::GetAssayData(input, assay = "RNA", slot = "data"))
  4. as.matrix.Matrix(Seurat::GetAssayData(input, assay = "RNA", slot = "data"))
  5. as(x, "matrix")
  6. asMethod(object)

Installing DoRothEA

Dear authors,

I would like first to thank you for your work and for making it available publicly.

I would like to perform some tests using your tool but I encountered some difficulties installing it.

Just to let you know, I first tried to install the Bioconductor version of DoRothEA but it seems that it is not available for either R 3.6.1 or R 3.6.3.
I also wasn't able to find the Bioconductor help vignette using the keyword "DoRothEA".
Is DoRothEA maintained in Bioconductor?

I then turned to the GitHub version and figured out that R version >= 4.0 is needed to install DoRothEA; however, I don't have this R version and I cannot update the shared resources we have at my institution.

Do you plan to make DoRothEA available for lower R versions?

Thank you very much in advance for your response.

Best regards

NES using DESEQ2

Hi,
I was just wondering if you could help with a query I have. I was following your vignette for bulk RNA-seq TF analysis, and when I reached the part involving NES I wasn't sure which column to use from my data. In the vignette you use the t column, coming from limma differential gene expression analysis; however, I used DESeq2 for my differential gene expression analysis and don't have a t column. Which column should I use instead? Is the 'stat' column the equivalent in DESeq2?

Thanks,
Conor
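
If the Wald statistic is indeed used in place of limma's t, a minimal sketch (assuming `dds` is a fitted DESeqDataSet) for building the input would be:

library(DESeq2)

res <- results(dds)

# The 'stat' column (Wald statistic) plays the role of the vignette's t column
deg_stat <- data.frame(ID = rownames(res), t = res$stat)
deg_stat <- deg_stat[!is.na(deg_stat$t), ]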

Integrated data from seurat

Hi all,

should I be using the integrated data slot or the non-integrated data slot after Seurat integration? My batch effect is very strong as the samples come from different experiments. Thanks

Resulted scaled data

Hi
I am using your package to infer TF activity from scRNA-seq analysis.
I realised from reading the vignette and other GitHub issues that the TF scores, as found in the dorothea assay, are scaled.
I have a few questions regarding this matter (and some other matters, too):

  1. DE TFs:
    You suggested before to obtain DE TFs using Seurat's FindAllMarkers, but this function is made to find differentially expressed genes among normalised counts, not scaled counts. What would be the best approach? (see the sketch after this list)

  2. TF confidence level
    Do you recommend using only A,B,C confidence levels, like in the vignette?

  3. TF enrichment. Though this question isn't entirely related, I'm going to ask anyway.
    Say I generated a cluster-specific list of DE TFs. How can I test for enrichment (like GO for genes, etc.)?
    Or, as a more general question, what downstream analyses are possible after acquiring TF scores?

Thanks!
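
For point 1, one option (a sketch, assuming `pbmc` carries the "dorothea" assay created by run_viper()) is to run the marker test directly on that assay's data slot, which holds the VIPER activity scores rather than counts:

library(Seurat)

DefaultAssay(pbmc) <- "dorothea"

# Differential TF activity per cluster; fold-change and detection-rate thresholds
# are not meaningful for activity scores, so they are effectively disabled here
tf_markers <- FindAllMarkers(pbmc, assay = "dorothea", slot = "data",
                             only.pos = TRUE, logfc.threshold = 0,
                             min.pct = 0, test.use = "wilcox")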

Error with run_viper wrapper and viper command

I'm running into an error when trying to run the "run_viper" command below.

run_viper(deg_t, dorothea_hs, tidy=T, options = list(minsize = 5, eset.filter = FALSE, cores = 1, verbose = TRUE, nes = TRUE))

The output I get is as follows:

Computing the association scores
Computing regulons enrichment with aREA
Error in wts^2 : non-numeric argument to binary operator

I've attached the list of deg_t genes, their t values, and the dorothea_hs file used. I have also run viper successfully with other sets of genes from this same dataset and am trying to determine why this one failed.

deg_t.csv
dorothea_hs.csv

Please, check the CTFRs data to make it compatible with the library

Hi,

The current version of the Consensus TF regulons data (CTFRs_v122016.rdata) is not compatible with the package of functions from lib_enrichment_scores.r. CTFRs_v122016.rdata is a list of two elements: geneset$NAME is a vector and geneset$GENES is a list of vectors.

First step during SLEA() is SLEA.clean_genesets(). This function expects to manage a list whose vectors contain gene names as names of each element of the vector (i.e. gene names from
names(geneset$GENES[[TF]]))). However, lapply(geneset$GENES,names) leads to NULLs because the different vectors are unnamed. Because of that, no regulons are kept for the analysis.

> load("./data/CTFRs_v122016.rdata")
> source("./src/lib_enrichment_scores.r")
> all(unlist(lapply(geneset$GENES,function(x) is.null(names(x)))))
[1] TRUE
# This is the source of the following issue:
> SLEA.clean_genesets(genesets=geneset, E=exprs(eSet))
Removing targets under more than 10  TF
	 0  targets keept
	 0  targets removed
Filtering genesets: removing targets not in the expression matrix
Removing gene sets with less than  3  genes
	 793  gene sets removed
	 0  gene sets used covering  0  genes in the expression matrix
$NAME
character(0)

$GENES
named list()

If we named them, SLEA.clean_genesets() works well. Then, some of the methods finish properly. For instance, for GSVA:

# To name each vector from the list
> geneset2 <- geneset
> geneset2$GENES <- lapply(geneset$GENES,function(x) setNames(x,x))
# Test if it works out with this new 'geneset2'
> SLEA.mat <- SLEA(E=exprs(eSet),genesets=geneset2,method="GSVA",M = NULL, permutations = 1000, filter_E = F)
Removing targets under more than 10  TF
	 7978  targets keept
	 18  targets removed
Filtering genesets: removing targets not in the expression matrix
Removing gene sets with less than  3  genes
	 667  gene sets removed
	 126  gene sets used covering  6124  genes in the expression matrix
Getting Enrichment Scores
Calculating SLEA scores using GSVA

Done!

However, VIPER does not work yet; it seems to expect geneset$GENES to be a list of numeric vectors with the gene names as names of the vector:

> SLEA.mat <- SLEA(E=exprs(eSet),genesets=geneset2,method="VIPER",M = NULL, permutations = 1000, filter_E = F)
Removing targets under more than 10  TF
	 7978  targets keept
	 18  targets removed
Filtering genesets: removing targets not in the expression matrix
Removing gene sets with less than  3  genes
	 667  gene sets removed
	 126  gene sets used covering  6124  genes in the expression matrix
Getting Enrichment Scores
Calculating SLEA scores using VIPER
Error in x/length(x) : non-numeric argument to binary operator
Called from: matrix(x/length(x), nrow = 1, ncol = length(x))

Hope that it helps! Thank you very much for your attention :)

All the best,
Javier

No bug, just a question :-)

I am trying to redo an analysis from a paper. There they state that they selected the TFs with the highest confidence from DoRothEA and end up with 168 TFs (and 2602 unique targets). Working with the latest version under R 4.1, I could find 96 TFs for confidence level A and 22 for B, adding up to only 118. Adding level C would give 271... I also tried with dorothea v1.0.0, but that gave me roughly the same numbers. Do you have any idea which version they have been using? Where can I obtain older versions of dorothea?
Thanks in advance!!
Aldo
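
The counts for any installed version can be reproduced directly from the shipped table; a minimal sketch:

library(dorothea)
library(dplyr)

data(dorothea_hs, package = "dorothea")

# TFs with at least one interaction at each confidence level
dorothea_hs %>% distinct(tf, confidence) %>% count(confidence)

# Number of distinct TFs when keeping only the two highest levels
dorothea_hs %>% filter(confidence %in% c("A", "B")) %>% distinct(tf) %>% nrow()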

Integration of regulon activities across different datasets

Dear authors, thank you for developing dorothea, I am excited to use it for my research.
I would like to ask a question concerning the integration of different datasets to compare regulon activities, I hope this is the right place.

Besides merging all the data during the upstream processing, is there a strategy to compare the NES obtained in a dataset with another one? Something tells me that I cannot directly do that, can you confirm?

For instance, I found the regulons reported in https://dorothea.opentargets.io/#/ very useful, and I would like to measure and compare regulon activity in another set of samples (and then, for example, find the most similar cell line). But let's suppose I now measure the same regulons on a set of immune cells: could I integrate the NES from those two sources? What would be your recommendation to achieve that, possibly starting from processed data?

Cluster using the TF regulons

Hi,

I have a question regarding DoRothEA in R. The output shows the variable TFs when plotting the heatmap. I wonder how I can collect the list of corresponding regulons (TF targets) in each cluster. Could you please give me a piece of advice?

Thanks,

Error in if (names(regulon[[1]])[1] == "tfmode") regulon <- list(regulon = regulon) : argument is of length zero

I get this error. It seems to be looking for something that's not there.

dorothea_regulon_human <- get(data("dorothea_hs", package = "dorothea"))

str(dorothea_regulon_human)
tibble [486,751 × 4] (S3: tbl_df/tbl/data.frame)
$ tf : chr [1:486751] "ADNP" "ADNP" "ADNP" "ADNP" ...
$ confidence: chr [1:486751] "D" "D" "D" "D" ...
$ target : chr [1:486751] "ATF7IP" "DYRK1A" "TLK1" "ZMYM4" ...
$ mor : num [1:486751] 1 1 1 1 1 1 1 1 1 1 ...

viper.scores <- viper(dat, dorothea_regulon_human)
Computing the association scores
Error in if (names(regulon[[1]])[1] == "tfmode") regulon <- list(regulon = regulon) :
argument is of length zero

names(regulon[[1]])[1]
NULL
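
A likely fix, sketched here: viper() expects its own regulon list format rather than the dorothea tibble, so the table (optionally filtered by confidence) is converted first with dorothea::df2regulon(); `dat` stands for the expression matrix used above:

library(dorothea)
library(dplyr)

regulon_df <- dorothea_regulon_human %>%
  filter(confidence %in% c("A", "B", "C"))
viper_regulon <- df2regulon(regulon_df)

viper.scores <- viper::viper(dat, viper_regulon, verbose = FALSE)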

sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] viper_1.22.0 Biobase_2.48.0 BiocGenerics_0.34.0 dorothea_1.0.0 magrittr_1.5

loaded via a namespace (and not attached):
[1] fastmatch_1.1-0 rnndescent_0.0.7 plyr_1.8.6 igraph_1.2.5 splines_4.0.2
[6] BiocParallel_1.22.0 GenomeInfoDb_1.24.2 ggplot2_3.3.2 scater_1.16.2 urltools_1.7.3
[11] digest_0.6.25 htmltools_0.5.0 GOSemSim_2.14.0 viridis_0.5.1 GO.db_3.11.4
[16] fansi_0.4.1 memoise_1.1.0 mixtools_1.2.0 limma_3.44.3 annotate_1.66.0
[21] graphlayouts_0.7.0 matrixStats_0.56.0 sccore_0.1 enrichplot_1.8.1 prettyunits_1.1.1
[26] colorspace_1.4-1 blob_1.2.1 ggrepel_0.8.2 pagoda2_0.1.1 xfun_0.15
[31] dplyr_1.0.0 crayon_1.3.4 RCurl_1.98-1.2 jsonlite_1.7.0 scatterpie_0.1.4
[36] genefilter_1.70.0 brew_1.0-6 survival_3.2-3 glue_1.4.1 polyclip_1.10-0
[41] gtable_0.3.0 zlibbioc_1.34.0 XVector_0.28.0 DelayedArray_0.14.0 kernlab_0.9-29
[46] BiocSingular_1.4.0 Rook_1.1-1 SingleCellExperiment_1.10.1 scales_1.1.1 DOSE_3.14.0
[51] DBI_1.1.0 edgeR_3.30.3 Rcpp_1.0.5 viridisLite_0.3.0 xtable_1.8-4
[56] progress_1.2.2 gridGraphics_0.5-0 reticulate_1.16-9000 dqrng_0.2.1 bit_1.1-15.2
[61] europepmc_0.4 rsvd_1.0.3 stats4_4.0.2 httr_1.4.1 fgsea_1.14.0
[66] RColorBrewer_1.1-2 ellipsis_0.3.1 pkgconfig_2.0.3 XML_3.99-0.4 farver_2.0.3
[71] locfit_1.5-9.4 labeling_0.3 ggplotify_0.0.5 tidyselect_1.1.0 rlang_0.4.7
[76] reshape2_1.4.4 later_1.1.0.1 AnnotationDbi_1.50.1 munsell_0.5.0 tools_4.0.2
[81] cli_2.0.2 downloader_0.4 generics_0.0.2 RSQLite_2.2.0 ggridges_0.5.2
[86] stringr_1.4.0 fastmap_1.0.1 yaml_2.2.1 knitr_1.29 bit64_0.9-7
[91] tidygraph_1.2.0 bcellViper_1.24.0 purrr_0.3.4 dendextend_1.13.4 ggraph_2.0.3
[96] pbapply_1.4-2 mime_0.9 scran_1.16.0 DO.db_2.9 xml2_1.3.2
[101] compiler_4.0.2 rstudioapi_0.11 beeswarm_0.2.3 e1071_1.7-3 tibble_3.0.3
[106] statmod_1.4.34 tweenr_1.0.1 geneplotter_1.66.0 stringi_1.4.6 lattice_0.20-41
[111] Matrix_1.2-18 vctrs_0.3.1 pillar_1.4.6 lifecycle_0.2.0 BiocManager_1.30.10
[116] triebeard_0.3.0 BiocNeighbors_1.6.0 data.table_1.12.8 cowplot_1.0.0 bitops_1.0-6
[121] irlba_2.3.3 httpuv_1.5.4 conos_1.3.0 GenomicRanges_1.40.0 qvalue_2.20.0
[126] R6_2.4.1 promises_1.1.1 KernSmooth_2.23-17 gridExtra_2.3 vipor_0.4.5
[131] IRanges_2.22.2 assertthat_0.2.1 MASS_7.3-51.6 SummarizedExperiment_1.18.2 rjson_0.2.20
[136] DESeq2_1.28.1 cacoa_0.1 S4Vectors_0.26.1 GenomeInfoDbData_1.2.3 hms_0.5.3
[141] clusterProfiler_3.16.0 grid_4.0.2 class_7.3-17 tidyr_1.1.0 DelayedMatrixStats_1.10.1
[146] rvcheck_0.1.8 segmented_1.2-0 ggforce_0.3.2 base64enc_0.1-3 shiny_1.5.0
[151] tinytex_0.24 ggbeeswarm_0.6.0

Scaled data in dorothea

Hello, I was using dorothea to infer TFs in my scRNA-seq dataset. However, I don't understand why after scaling data, values for many TFs are negative in the majority of clusters.

dorothea_regulon_human <- get(data("dorothea_hs", package = "dorothea"))
regulon <- dorothea_regulon_human %>% dplyr::filter(confidence %in% c("A","B","C"))
pbmc <- run_viper(pbmc, regulon, options = list(method = "scale", minsize = 4, verbose = FALSE)) 	
DefaultAssay(object = pbmc) <- "dorothea"     
pbmc <- ScaleData(pbmc)
VlnPlot(pbmc, features=c('E2F4'), group.by='seurat_clusters')


I imagine it does not matter so much if the differences between clusters are clear, but how is this different from scaling the data directly in the RNA assay? Thank you

Resume Distributing Tabular Data

An older version of this repository (it appears the git history has been purged) hosted tabular versions of the DoRothEA database from 20180915. More specifically, I was relying on data persisting at the following URL:

https://github.com/saezlab/DoRothEA/blob/master/data/TFregulons/consensus/table/database_normal_20180915.csv.zip?raw=true

My use case was to convert this data to BEL for reuse in larger biological networks (code at https://github.com/bio2bel/bio2bel/blob/master/src/bio2bel/sources/tfregulons.py) as part of the Bio2BEL project, which @deeenes and @Nic-Nic have participated.

Would you be willing to resume distributing the database as a CSV to enable users who aren't using R to access the data? Or maybe there's a link somewhere to a Zenodo archive that I missed, since distributing data through GitHub isn't optimal?

Trace back the evidence

Thank you for developing a great package.

I have a specific TF of interest and would like to know what evidence the interactions are based on (which ChIP-seq data, which paper, etc.). Would this be possible?

Thank you so much for your help!
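
One place to start, as a sketch assuming the bundled entire_database object carries per-interaction evidence columns alongside tf and target, is to inspect those columns for the TF of interest ("STAT3" is only an example):

library(dorothea)
library(dplyr)

entire_database %>%
  filter(tf == "STAT3") %>%
  glimpse()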

Discrepancy in dataset sizes

The entire_database data frame contains 1,019,220 unique TF-target pairs, while dorothea_hs contains only 454,504; 564,716 interactions from the former are missing from the latter. The missing interactions span all confidence levels:

library(dorothea)
library(dplyr)

entire_database %>%
left_join(dorothea_hs, by = c('tf', 'target')) %>%
filter(is.na(confidence.y)) %>%
count(confidence.x)
# # A tibble: 5 x 2
#   confidence.x      n
#   <chr>         <int>
# 1 A                32
# 2 B              8073
# 3 C             10690
# 4 D            260365
# 5 E            285556

In addition, for 1,638 interactions the confidence level differs between entire_database and dorothea_hs:

entire_database %>%
inner_join(dorothea_hs, by = c('tf', 'target')) %>%
filter(confidence.x != confidence.y) %>%
nrow
# [1] 1638

I could not find the reason for these differences, and I am wondering whether this is normal.

The cell annotation needs to be corrected after using "dorothea" for FindClusters

Thanks for sharing, DoRothEA is very useful for me.
In the demo script "TF activity inference from scRNA-seq data with DoRothEA as regulon resource." (https://saezlab.github.io/dorothea/articles/single_cell_vignette.html).
After using "dorothea" for FindClusters, the clusters are not the same as the ones previously obtained using "RNA" for FindClusters, but the vignette reuses the same cell type identity assignment, resulting in wrong cell annotations that need to be corrected:
`

We compute the Nearest Neighbours to perform cluster

Assigning cell type identity to clusters

new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T",
"FCGR3A+ Mono", "NK", "DC", "Platelet") # This is a wrong cell annotation.

`

df2regulon function returns list instead of regulon object

Dear developers,

I'm trying to run msVIPER on a DESeq2 dataset. When converting the dorothea regulons to a format suitable for msVIPER, I run into the problem that the output of the code below (thanks Christian for pointing me to this function) is a list and not a regulon object. Could you please help me further?

# Load dorothea
library("dorothea")
# Load mm data from dorothea
data(dorothea_mm, package ="dorothea")
# Convert to the format required by viper
viper_regulons <- dorothea::df2regulon(dorothea_mm)

Thanks,
Justin

Error in mutate()

Hi! I am getting the following errors when I run get_dorothea. I'm a newbie, so any tips would help! Thank you!

net <- get_dorothea(organism='mouse', levels=c('A', 'B', 'C'))
Error in mutate():
ℹ In argument: n_references = ifelse(...).
Caused by error in map():
ℹ In index: 1.
Caused by error in .f():
! Arguments in ... must be passed by position, not name.
✖ Problematic argument:
• outsep = outsep
Run rlang::last_trace() to see where the error occurred.

Preprocessing for Bulk RNA-Seq

Dear authors,

Thank you for making such an easy-to-use package. What is the recommended preprocessing for bulk RNA-Seq data?

Thanks in advance.

data set 'dorothea_hs_pancancer' not found

Hi There,
TCGA regulons are not accessible. Please help.

Code:
data(dorothea_hs_pancancer, package = "dorothea")
Warning message:
In data(dorothea_hs_pancancer, package = "dorothea") :
data set 'dorothea_hs_pancancer' not found

─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 4.0.3 (2020-10-10)
os Red Hat Enterprise Linux Server 7.5 (Maipo)
system x86_64, linux-gnu
ui RStudio
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Chicago
date 2021-08-09
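
A quick check of which data objects a given installation actually ships, as a minimal sketch:

# List the datasets bundled with the installed dorothea version
data(package = "dorothea")$results[, "Item"]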

Reliability of mode of regulation

Hi,

I wonder if it makes sense to create a Cytoscape TF-target network based on DoRothEA interactions. In particular, what do you think of using MoR for edge attributes? I assume the MoR information is reliable to use, yet the question comes from a possibly contradictory statement in the paper:

Here, when the MoR of the TF–target interaction was not defined by the original data set (i.e., those derived from TFBS predictions, ChIP-seq data and most of the curated databases), we assumed a positive regulatory effect of the TF on the target. However, if the TF is known to be a global repressor (data extracted from UniProt) (Supplemental Table S1), the interactions are assumed to have a negative regulatory effect.

How about using Omnipath TF-target interactions and filtering by curation effort >2 ?

Thanks for the resource!

connection error when retrieve the data

After successfully installing the required R packages, I simply tried to retrieve the TF-target interactions using the command

net <- decoupleR::get_dorothea(levels = c('A', 'B', 'C', 'D'))

however, an error occurs: 'Error in open.connection(3L, "rb") : HTTP error 500.'

How can I solve this?

database for TCGA samples

Hi,

In your paper "Benchmark and integration of resources for the estimation of human transcription factor activities" you mentioned
"These resources are available in OmniPath (Türei et al. 2016; www.omnipathdb .org) and at https://saezlab.github.io/DoRothEA/. The normal collection comprises 1,077,121 TF–target candidate regulatory interactions between 1402 TFs and 26,984 targets. The pancancer collection includes 636,753 TF–target candidate regulatory interactions between 1412 TFs and 26,939 targets".

Is the dataset "dorothea_hs" derived from normal samples or cancer samples? Where can we download the TF-target interactions for pancancer samples?

Thanks!

Single-cell vignette needs update

In the chunk "tf activity", there's a mismatch between CellsClusters$cell and rownames(viper_scores_df):

str(CellsClusters)
'data.frame': 2638 obs. of 2 variables:
$ cell : chr "AAACATACAACCAC-1" "AAACATTGAGCTAC-1" "AAACATTGATCAGC-1" "AAACCGTGCTTCCG-1" ...
$ cell_type: chr "Naive CD4 T" "CD8 T" "FCGR3A+ Mono" "Memory CD4 T" ...

str(viper_scores_df)
num [1:2638, 1:288] 1.04 -2.02 1.17 1.36 2.11 ...
attr(*, "dimnames")=List of 2
..$ : chr [1:2638] "AAACATACAACCAC.1" "AAACATTGAGCTAC.1" "AAACATTGATCAGC.1" "AAACCGTGCTTCCG.1" ...
..$ : chr [1:288] "AHR" "AR" "ARID1A" "ARID1B" ...

Therefore:

viper_scores_clusters <- viper_scores_df %>%
data.frame() %>%
rownames_to_column("cell") %>%
gather(tf, activity, -cell) %>%
inner_join(CellsClusters)

str(viper_scores_clusters)
'data.frame': 0 obs. of 4 variables:
$ cell : chr
$ tf : chr
$ activity : num
$ cell_type: chr

Can be corrected by:

CellsClusters <- data.frame(cell = names(Idents(pbmc)),
cell_type = as.character(Idents(pbmc)),
stringsAsFactors = FALSE) %>%
dplyr::mutate(., cell = sapply(.$cell, function(x) sub("-", ".", x)) %>% unname())

str(CellsClusters)
'data.frame': 2638 obs. of 2 variables:
$ cell : chr "AAACATACAACCAC.1" "AAACATTGAGCTAC.1" "AAACATTGATCAGC.1" "AAACCGTGCTTCCG.1" ...
$ cell_type: chr "Naive CD4 T" "CD8 T" "FCGR3A+ Mono" "Memory CD4 T" ...

str(viper_scores_clusters)
'data.frame': 759744 obs. of 4 variables:
$ cell : chr "AAACATACAACCAC.1" "AAACATTGAGCTAC.1" "AAACATTGATCAGC.1" "AAACCGTGCTTCCG.1" ...
$ tf : chr "AHR" "AHR" "AHR" "AHR" ...
$ activity : num 1.04 -2.02 1.17 1.36 2.11 ...
$ cell_type: chr "Naive CD4 T" "CD8 T" "FCGR3A+ Mono" "Memory CD4 T" ...

Question on mRNA expression versus DoRothEA TF activity

Hello!

Recently I applied your package to some single cell data. We have a transcription factor that has relatively high and specific expression in a subset of cells. However, after I apply the package, the resulting heatmap tells me that this population actually has the lowest expression of the transcription factor. Is the package working as it should, or am I misinterpreting the heatmap output or the algorithm?

Thanks,

Casey

Dorothea using "RNA" assay by default instead of providing the option to indicate the real assay to use.

Dear Dorothea developers,

Digging into the code a little bit, I realised the following potential error in the code for dorothea::run_viper(), when applied to an object of class Seurat.

run_viper.Seurat <- function(input, regulons, options = list(), tidy = FALSE) {
  if (tidy) {
    tidy <- FALSE
    warning("The argument 'tidy' cannot be TRUE for 'Seurat' objects. ",
            "'tidy' is set to FALSE")
  }
  mat <- as.matrix(Seurat::GetAssayData(input, assay = "RNA", slot = "data"))

  tf_activities <- run_viper(mat, regulons = regulons, options = options,
                             tidy = FALSE)

  # include TF activities in Seurat object
  dorothea_assay <- Seurat::CreateAssayObject(data = tf_activities)
  Seurat::Key(dorothea_assay) <- "dorothea_"
  input[["dorothea"]] <- dorothea_assay
  return(input)
}

As you can see, when defining the mat object, it retrieves by default the RNA assay and its data slot. The data slot stores the normalised, non-centred data, as far as I am aware. Therefore, I assume the VIPER scores are meant to be computed on normalised data.

Then, in the case of users deviating from Seurat's standard vignette, for instance using the new normalisation methodology they suggest (regularised negative binomial regression via the SCTransform package, now integrated into Seurat), the assay will change from RNA to SCT. The data slot of the RNA assay will then still contain the raw counts, and those will be used in VIPER (reference).

I would therefore strongly suggest implementing an assay parameter, in the same way as in Progeny, to allow the user to select the assay in which the normalised data are stored (and to cover the case where the user has renamed the assay for any reason).

This is how I modified it to take any assay desired:

run_viper <- function(input, 
                          regulons, 
                          options = list(), 
                          tidy = FALSE,
                          assay.use = assay.use) {
        UseMethod("run_viper")
    }
    run_viper.Seurat <- function(input, 
                                 regulons, 
                                 options = list(), 
                                 tidy = FALSE,
                                 assay.use = "RNA") {
        if (tidy) {
            tidy <- FALSE
            warning("The argument 'tidy' cannot be TRUE for 'Seurat' objects. ",
                    "'tidy' is set to FALSE")
        }
        
        mat <- as.matrix(Seurat::GetAssayData(input, assay = assay.use, slot = "data"))
        
        tf_activities <- dorothea::run_viper(mat, regulons = regulons, options = options,
                                   tidy = FALSE)
        
        # include TF activities in Seurat object
        dorothea_assay <- Seurat::CreateAssayObject(data = tf_activities)
        Seurat::Key(dorothea_assay) <- "dorothea_"
        input[["dorothea"]] <- dorothea_assay
        
        return(input)
    }

Best,
Enrique
