cbg-ethz / dce

Finding the causality in biological pathways

Home Page: https://cbg-ethz.github.io/dce/

R 80.95% Python 14.38% Jupyter Notebook 2.65% Shell 0.19% MATLAB 1.63% M 0.20%

causality bioconductor r

dce's Introduction

dce

Compute differential causal effects on (biological) networks. Check out our vignettes for more information.

Publication: https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btab847/6470558

Installation

Install the latest stable version from Bioconductor:

BiocManager::install("dce")

Install the latest development version from GitHub:

remotes::install_github("cbg-ethz/dce")

Project structure

.: R package
inst/scripts/: Snakemake workflows for all investigations in publication
- crispr_benchmark: Real-life data validation
- gtex_validation: Deconfounding validation
- ovarian_cancer: How does Ovarian Cancer dysregulate pathways?
- synthetic_benchmark: Synthetic data validation
- tcga_pipeline: Compute effects for loads of data from TCGA

Development notes

Check package locally:
- Rscript -e "lintr::lint_package()"
- Rscript -e "devtools::test()"
- Rscript -e "devtools::check(error_on = 'warning')"
- R CMD BiocCheck
Documentation:
- Build locally: Rscript -e "pkgdown::build_site()"
- Deploy: Rscript -e "pkgdown::deploy_to_branch(new_process = FALSE)"
Bioconductor:
- The bioc branch stores changes specific to Bioconductor releases
- Update workflow (after git remote add upstream git@git.bioconductor.org:packages/dce.git):
  - git checkout bioc
  - git merge master
  - git push upstream bioc:master

dce's Issues

Why do large networks (>=200 nodes) only produce results when sparse in DCE reconstruction benchmarking?

Merge pathway nodes

In some cases it might be biologically meaningful to merge certain pathway nodes (e.g. protein complexes). Merging could also help computationally, because it sums low-count vectors.

See section 2.2.2 and figure 2 in https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-20.

Input DAGs may not satisfy faithfulness assumption

Open issues:

Meeting Notes:

better understand difference between separate and joint models (they are actually equivalent: https://books.google.ch/books?id=zyjWBgAAQBAJ&pg=PA137&lpg=PA137&dq=regression+for+to+separate+data+sets+indicator&source=bl&ots=OZiF7M0ShS&sig=ACfU3U3f5mni4Zj7-xY-RdvMsw8eVssHTQ&hl=de&sa=X&ved=2ahUKEwjahYC16e7oAhXM-KQKHR6fCdcQ6AEwAXoECAsQLw#v=onepage&q=regression%20for%20to%20separate%20data%20sets%20indicator&f=false)
try likelihood ratio test model (for single edge)
- with delta vs without delta (with only one delta?)
log link function may be a better idea
try partial correlation with NB assumption
benchmark: set 90% of ground truth DCEs to 0 (makes setting more biologically relevant, maybe AUC can be used)
where to get DAGs from (KEGG, ...)
simulations: sampling beta and subtracting minimum biases beta (?)

Compare multiple performance measurements in benchmarking

bug in generating graph?

graph1 <- create_random_DAG(n=p, prob=.05, lB=1)
Warning message:
In runif(length(negedges), min = lB[1], max = lB[2]) : NAs produced

The documentation says that lB is the lower bound and uB is the upper bound. I suppose you want to use those rather than lB[1] and lB[2].
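The warning can be reproduced in base R without dce. The sketch below only assumes what the report states: lB is passed as a scalar, and the sampling call indexes lB[1] and lB[2]. The value of uB is made up for illustration.

```r
# lB is a scalar, as in the call create_random_DAG(n=p, prob=.05, lB=1)
lB <- 1
uB <- 2  # illustrative upper bound, not taken from the report

# Indexing past the end of a length-1 vector yields NA
second <- lB[2]

# With max = NA, runif() returns NaN and warns "NAs produced"
broken <- suppressWarnings(runif(3, min = lB[1], max = lB[2]))

# Using separate lower and upper bounds gives valid draws
fixed <- runif(3, min = lB, max = uB)
```

If the function is meant to accept a length-2 lB as well, a length check on lB before sampling would be another option.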

How to include library size in the model?

The simulations probably do not suffer much, if at all, from this. For real data, however, we cannot simply normalize the data, because the glm needs counts.
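One common way to keep raw counts while accounting for sequencing depth is an offset term in the regression. The sketch below is not necessarily what dce does. It assumes a log link, and the simulated data and variable names (lib_size, x, y) are made up for illustration.

```r
library(MASS)

set.seed(1)
n <- 500

# Simulated library sizes and a covariate (illustrative values)
lib_size <- round(runif(n, 5e5, 2e6))
x <- rnorm(n)

# Expected count scales with library size; the true effect of x is 0.5
mu <- lib_size * exp(-10 + 0.5 * x)
y <- rnbinom(n, size = 10, mu = mu)

# offset(log(lib_size)) enters the linear predictor with coefficient 1,
# so the response stays a raw count and no normalization is needed
fit <- glm.nb(y ~ x + offset(log(lib_size)))
coef(fit)
```

The offset has this multiplicative interpretation only under a log link. With an identity link, library size would have to enter the model differently.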

Real pathways are not always DAGs

How to deal with this problem?

Ideas:

transform pathway to DAG (how? and how to validate the result?)
adapt method (dynamic Bayesian network?)
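As a starting point for the first idea, one naive transformation is to drop the back edges found by a depth-first search. The helper break_cycles below is hypothetical and not part of dce. Which edge gets removed depends on the traversal order, so this sketch does not address biological validity.

```r
# Hypothetical helper: remove DFS back edges from an adjacency matrix
# (adj[i, j] != 0 means an edge i -> j) so that the result is acyclic.
break_cycles <- function(adj) {
  n <- nrow(adj)
  state <- rep(0L, n)  # 0 = unvisited, 1 = on the DFS stack, 2 = finished

  visit <- function(v) {
    state[v] <<- 1L
    for (w in which(adj[v, ] != 0)) {
      if (state[w] == 1L) {
        adj[v, w] <<- 0  # edge points back to an ancestor: it closes a cycle
      } else if (state[w] == 0L) {
        visit(w)
      }
    }
    state[v] <<- 2L
  }

  for (v in seq_len(n)) {
    if (state[v] == 0L) visit(v)
  }
  adj
}

# Example: the cycle A -> B -> C -> A
nodes <- c("A", "B", "C")
cyc <- matrix(0, 3, 3, dimnames = list(nodes, nodes))
cyc["A", "B"] <- 1
cyc["B", "C"] <- 1
cyc["C", "A"] <- 1

dag <- break_cycles(cyc)
```

In this example the search starts at A, so the edge C -> A is the one that is dropped. Starting elsewhere would remove a different edge.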

How to proceed, if samples in tissue < genes in pathway?

Possible options: subsampling of genes, or prefiltering of genes based on expression or position in the pathway?

Random DAG creation

Is the somewhat complicated DAG creation in dce::create_random_DAG more or less similar to this much simpler implementation:

node_num <- 10
edge_prob <- .9
eff_min <- .2
eff_max <- 1.4

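# sample a 0/1 adjacency matrix
tmp <- matrix(rbinom(node_num * node_num, 1, edge_prob), node_num, node_num)
# keep only the strict upper triangle; diag = TRUE also removes self-loops
tmp[lower.tri(tmp, diag = TRUE)] <- 0
# assign a random effect size to each remaining edge
tmp[tmp != 0] <- runif(sum(tmp != 0), min = eff_min, max = eff_max)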

dce::plot_network(tmp)

How to choose the "correct" link function?

Computing Causal Effects using glm requires us to use a link function. Common choices are identity or log.
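The two links imply different meanings for the same coefficient. The sketch below uses made-up coefficients and no fitting. It shows that under the identity link a unit step in the parent adds a constant amount to the mean, while under the log link it multiplies the mean by a constant factor.

```r
# Parent expression values in equal steps (illustrative)
A <- c(10, 20, 30)

# Identity link: mu = beta0 + beta1 * A
beta0 <- 5
beta1 <- 0.1
mu_identity <- beta0 + beta1 * A

# Log link: log(mu) = gamma0 + gamma1 * A
gamma0 <- log(5)
gamma1 <- 0.05
mu_log <- exp(gamma0 + gamma1 * A)

# Additive: constant differences between consecutive means
diff(mu_identity)

# Multiplicative: constant ratios between consecutive means
mu_log[-1] / mu_log[-length(mu_log)]
```

A practical consequence is that the identity link can produce a non-positive mean when effects are negative, whereas the log link always keeps the mean positive.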

`glm.nb` throws warnings and errors when data is naughty

The following code generates warnings and an error if the value flip is applied.

A <- rnbinom(100, size=100, mu=1000)
B <- rnbinom(100, size=100, mu=0.1*A)

# value flip
A[1] <- 20
B[1] <- 2000

MASS::glm.nb(B ~ A, link="identity")

The error is "no valid set of coefficients has been found: please supply starting values".

A subset of the warnings:

1: In log(y/mu) : NaNs produced
2: step size truncated due to divergence
3: In log(y/mu) : NaNs produced
4: step size truncated due to divergence
5: glm.fit: algorithm did not converge
6: In log(pmax(1, y)/mu) : NaNs produced
7: In log((y + .Theta)/(mu + .Theta)) : NaNs produced
[..]
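When many edges are fitted in a loop, one failing fit should not stop the whole run. A possible pattern is to wrap the call, collect its warnings, and return NULL on error. The helper safe_glm_nb below is hypothetical and not part of dce or MASS. It uses the default log link and made-up data.

```r
library(MASS)

# Hypothetical helper: fit glm.nb, record warnings instead of printing them,
# and return fit = NULL if the fit stops with an error.
safe_glm_nb <- function(formula, data) {
  warnings_seen <- character(0)
  fit <- tryCatch(
    withCallingHandlers(
      glm.nb(formula, data = data),
      warning = function(w) {
        warnings_seen <<- c(warnings_seen, conditionMessage(w))
        invokeRestart("muffleWarning")
      }
    ),
    error = function(e) NULL
  )
  list(fit = fit, warnings = warnings_seen)
}

set.seed(42)
d <- data.frame(A = rnbinom(100, size = 10, mu = 50))
d$B <- rnbinom(100, size = 10, mu = exp(1 + 0.02 * d$A))

ok <- safe_glm_nb(B ~ A, d)               # regular fit
bad <- safe_glm_nb(B ~ not_a_column, d)   # errors inside glm.nb, fit is NULL
```

The caller can then inspect the warnings element to decide whether to keep, refit, or discard an edge.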

How to benchmark perturbed ground truth?

Ground truth:
A -> B -> C

Perturbed:
A -> B -> C; A -> C

dce(A,C) is the same in both settings. However, under the ground truth no dce(A,C) would be computed, since there is no edge A -> C. Is the dce(A,C) estimated on the perturbed graph a false positive or not?

Simulating negative binomial read counts on DAGs

Idea 1

beta > 0 describes the relative change. 0.5 corresponds to halving and 2 to doubling the expression levels.
This is problematic because it requires a transformation of causal effects which is non-trivial (but possibly somehow doable?).

Idea 2

beta can be both positive and negative. Counts are propagated by multiplying beta with mean-standardized counts and adding noise.
This is problematic because standardizing might introduce artefacts and can lead to mu < 0 (which yields NaN counts).

beta <- -1.2

set.seed(42)
A.nb <- rnbinom(1000, size=10, mu=10)

B.nb <- beta * A.nb + rnbinom(1000, size=10, mu=10) # leads to negative counts
B.nb <- rnbinom(1000, size=10, mu=mean(A.nb) + beta * A.nb) # leads to negative mu, thus NA counts
B.nb <- rnbinom(1000, size=10, mu=10) + beta * scale(A.nb, scale=FALSE) # leads to negative counts
B.nb <- rnbinom(1000, size=10, mu=mean(A.nb) + beta * scale(A.nb, scale=FALSE)) # leads to negative mu, thus NA counts

MASS::glm.nb(B.nb ~ A.nb, link="identity")

Idea 3

Use a mean function for mu of rnbinom. This requires an appropriate link function during the regression.

beta <- -1.2

set.seed(42)
A.nb <- rnbinom(1000, size=10, mu=10)

B.nb <- rnbinom(1000, size=10, mu=exp(log(10) + beta * (A.nb - mean(A.nb)))) # link function keeps mu positive, exp can lead to extreme values

MASS::glm.nb(B.nb ~ A.nb, link="log")
glm(B.nb ~ A.nb, family=MASS::negative.binomial(theta=10, link="log"))
glm2::glm2(B.nb ~ A.nb, family=MASS::negative.binomial(theta=10, link="log"))

Using causal effects as edge weights can be misleading

Consider the following graph:

A -> B -> C
|         ^
|_________|

Then the edge weight of A->C is not the total causal effect (CE) from A to C, but rather this total CE minus the CE along the path A->B->C.
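The difference can be made concrete under the assumption of a linear model, where the effect along a path is the product of its edge weights and the total effect is the sum over all directed paths. The weights below are made up for illustration.

```r
nodes <- c("A", "B", "C")

# W[i, j] is the direct effect (edge weight) of i on j
W <- matrix(0, 3, 3, dimnames = list(nodes, nodes))
W["A", "B"] <- 2
W["B", "C"] <- 3
W["A", "C"] <- 0.5

# For a DAG, (I - W)^-1 = I + W + W^2 + ..., so subtracting I leaves
# the sum of path products, i.e. the total effects in a linear model
total <- solve(diag(3) - W) - diag(3)
dimnames(total) <- dimnames(W)

direct_AC <- W["A", "C"]                  # 0.5
indirect_AC <- W["A", "B"] * W["B", "C"]  # 2 * 3 = 6
total["A", "C"]                           # direct + indirect = 6.5
```

Here the edge weight of A->C (0.5) differs from the total effect of A on C (6.5) by exactly the contribution of the path A->B->C.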

A few approaches from literature

Working with DE vs non-DE genes

Choose real pathways (from GO) and randomly select some. The union of the contained genes (+ noise) is labeled as significant. Then run Fisher's exact test (FET) for all pathways. (https://academic.oup.com/bioinformatics/article/22/13/1600/193669)
Use real pathways (from MSigDB), randomly select one, and label its members as DE. (https://www.biorxiv.org/content/10.1101/659920v1)

Working with gene expression values

Sample from normal distribution. Gene sets of varying differential expression and correlation patterns are simulated. (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-47)
Sample expression values from normal distribution. (https://www.nature.com/articles/s41540-017-0007-2)
Sample fold changes from normal/uniform distribution. Then draw normalized estimated expression levels from normal distribution. (https://academic.oup.com/bioinformatics/article/25/2/211/218259)
Log-expression values sampled from multivariate normal. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3458527/)
Start from real expression data. Standardize data. Add signal to subsets. (https://link.springer.com/article/10.1186/s12859-019-3146-1)
Draw expression from normal distribution. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5939912/)

Other simulation papers

Draw expression level from gamma distribution. (https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz752/5584234)
Draw expression levels from negative binomial. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4635655/)
Model interaction kinetics. Uses underlying network. (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-43)
Use stochastic simulations with delay. Uses underlying network. (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1731-8)
Draw from negative binomial. (https://bioconductor.org/packages/release/bioc/vignettes/compcodeR/inst/doc/compcodeR.pdf)