osofr / simcausal Goto Github PK

View Code? Open in Web Editor NEW

62.0 4.0 11.0 732 KB

Simulating Longitudinal and Network Data with Causal Inference Applications

R 100.00%

simulated-network simulating-data structural-equations sem counterfactual-data

simcausal's People

Contributors

Stargazers

Watchers

Forkers

migariane anhnguyendepocen bsaul diadochos guhjy mcfowland xiaomangmang yadevi cbundi romainkp fkgruber

simcausal's Issues

Removed from CRAN

Hi,
Just to let the author know that simcausal has been removed from CRAN:
https://cran.r-project.org/web/packages/simcausal/index.html

Mediation Analysis with simcausal

Hi,
I'd like to use simcausal for mediation analysis but wonder how this can best be done. For natural direct and indirect effects I need to simulate the crossworld counterfactual Y(a, M(a*)), - A is the binary exposure variable and M the mediator. M(a*) is the level of the mediator that it would have naturally been under the reference intervention A=a*.

I'm also particularly interested in simulating randomized interventional analogues of the natural effects in the presence of exposure-induced confounding of the relationship of M and the outcome Y. These effects require the counterfactual Y(a, G(m|a*, c)), with G(m|a*, c) denoting a random draw from the distribution of M given A=a* and non-exposure-induced confounders C=c.

What I did so far was rather intuitive. For the interventional effects, I need to draw from the distribution G(m|a*, c) - therefore I simulated data first and then specified a model (lm or glm,e.g.) for M|a, c for each counterfactual dataset a1 and a0. I used the estimated parameters for specifying G(m|a*, c), G(m|a, c) and the respective outcome variables in new nodes and simulated the counterfactual dataset again to evaluate the interventional effects. Below is a simple example for illustration:

################################################################
#Simple Mediation process with an exposure-induced confounder L
#An "intuitive" approach
################################################################

set seed for reproducability

set.seed(41188)

M <- DAG.empty()

M <- M +
node("a", # Binary exposure variable
distr = "rbern",
prob = 0.53) +
node("c", # Confounder variable C
distr="rnorm", mean=5, sd=2)+
node("l", #Binary confounder variable L -> exposure induced
distr="rbern" ,
prob=ifelse(c>=5, 0.6-0.3a, 0.7-0.3a))+
node("m", #Mediator variable M
distr = "rnorm",
mean=10-2a+c+5l,
sd=3)+
node("u.Y", #Latent variable
distr = "rnorm", mean = 0, sd = 100)+
node("y", #Outcome Variable Y
distr = "rnorm", mean = 2000-200a+10m-5am + 50l -20al+ 15c + u.Y, sd=200)
Mset <- set.DAG(M, latent.v = c("u.Y"))
str(Mset)

specify the two interventions

a1 <- node("a", distr = "rbern", prob = 1)
Mset <- Mset + action("a1", nodes = a1)
a0 <- node("a", distr = "rbern", prob = 0)
Mset <- Mset + action("a0", nodes = a0)

counterfactual data to estimate conditional distribution M|a,c and M|a*,c

dat <- simcausal::sim(DAG = Mset, actions = c("a1", "a0"), n = 10000000, rndseed = 345)

look at counterfactual data

head(dat[["a1"]]); head(dat[["a0"]])

plotDAG(Mset, xjitter = 0.2, yjitter = 0.2,
edge_attrs = list(width = 0.5, arrow.width = 0.5, arrow.size = 0.3),
vertex_attrs = list(size = 20, label.cex = 0.8))

#Evaluate distribution M|a,c and M|a*,c

summary(m0<-lm(m~c, data=dat[["a0"]]))
sd.m0<-summary(m0)$sigma; sd.m0
coef.m0<-coefficients(m0); coef.m0

summary(m1<-lm(m~c, data=dat[["a1"]]))
coef.m1<-coefficients(m1); coef.m1
sd.m1<-summary(m1)$sigma; sd.m1

M.r<- M +
node("m.0", distr = "rnorm", mean = 13.7465169 + 0.9006255c, sd=3.828051)+ #insert values by hand
node("m.1", distr = "rnorm", mean = 10.2433650 + 0.9009554 c, sd=3.826656) +
#node("m.0", distr = "rnorm", mean = coef.m0[1]+coef.m0[2]c, sd=sd.m0)+ this syntax doesn't work
#node("m.1", distr = "rnorm", mean =coef.m1[1] + coef.m1[2]c, sd=sd.m1)+
node("y.m0", distr = "rnorm",
mean = 2000-200a+10m.0-5am.0 + 50l -20al+ 15c + u.Y, sd=200)+
node("y.m1", distr = "rnorm",
mean = 2000-200a+10m.1-5am.1 + 50l -20al+ 15c + u.Y, sd=200)
Mset.r <- set.DAG(M.r, latent.v = c("u.Y"))
str(Mset.r)

specify the two interventions

a1 <- node("a", distr = "rbern", prob = 1)
Mset.r <- Mset.r + action("a1", nodes = a1)
a0 <- node("a", distr = "rbern", prob = 0)
Mset.r <- Mset.r + action("a0", nodes = a0)

##################

Evaluate TRUTH

##################
#Simulate counterfactual dataset
dat <- simcausal::sim(DAG = Mset.r, actions = c("a1", "a0"), n = 10000000, rndseed = 345)

#Evaluate true randomized interventional analogue for the direct effect

Mset.r <- set.targetE(Mset.r, outcome = "y.m0", param = "a1")
y_r.m0_a1<-eval.target(Mset.r, data = dat)$res ; y_r.m0_a1

Mset.r <- set.targetE(Mset.r, outcome = "y.m0", param = "a0")
y_r.m0_a0<-eval.target(Mset.r, data = dat)$res; y_r.m0_a0
RIA.NDE<-y_r.m0_a1-y_r.m0_a0; RIA.NDE

#Evaluate true randomized interventional analogue for the indirect effect

Mset.r <- set.targetE(Mset.r, outcome = "y.m1", param = "a1")
y_r.m1_a1<-eval.target(Mset.r, data = dat)$res; y_r.m1_a1

RIA.NIE<-y_r.m1_a1-y_r.m0_a1; RIA.NIE

#Evaluate randomized interventional analogue for total effect:
RIA.TE<-y_r.m1_a1-y_r.m0_a0
RIA.TE == RIA.NIE + RIA.NDE
###################################

I'm not quite sure, if this code is ok or if there are more efficient solutions?
Thanks a lot for suggestions!

fix dependency for rmvnorm

There is an attempt to load mvtnorm namespace in examples -

simcausal/man/node.Rd

Line 285 in d0c5d8b

require("mvtnorm")

As mvtnorm is suggested dependency, it should not be required to run examples according to R-exts. This can be conditionally escaped with requireNamespace.
It is possible there are other cases of that issue in other examples, this one was to first that fails.

> #---------------------------------------------------------------------------------------
> # EXAMPLE 7: Multivariate random variables
> #---------------------------------------------------------------------------------------
> require("mvtnorm")
Loading required package: mvtnorm
Warning in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE,  :
  there is no package called ‘mvtnorm’
> D <- DAG.empty()
> 
> # 3 dimensional normal (uncorrelated), using rmvnorm function from rmvnorm package:
> D <- D +
+   node(c("X1","X2","X3"), distr = "rmvnorm", mean = c(0,1,2))
Error in node(c("X1", "X2", "X3"), distr = "rmvnorm", mean = c(0, 1, 2)) : 
  rmvnorm: this node distribution function could not be located
Execution halted

Defining an deterministic action

I am trying to specify an action that assigns a value of "abar" to E[t] while a time-varying covariate C[t-1] is 1, and 0 otherwise. The DAG is defined as follows:

tmax<-10
S1.E.1 <-substitute(plogis(-0.30 + 1.00*W1 + 0.08*W2))
S1.E   <-substitute(plogis(-0.30 + 1*E[t-1]))
S1.D.1 <-substitute(plogis(-2.50 - 0.10*W1 + 0.20*W2 + 0.07*E[t]))
S1.D   <-substitute(plogis(-2.50 - 0.10*W1 + 0.20*W2 + 0.07*E[t]))
S1.Y.1 <-substitute(plogis(-2.50 - 0.50*W1 + 0.50*W2 + 0.10*E[t] + 1.00*S + 0.20*S*E[t]))
S1.Y   <-substitute(plogis(-2.50 - 0.05*W1 - 0.25*W2 + 0.30*E[t] + 0.10*Ebar[t-1] + 2.00*H[t-1] + 2.00*S + 0.20*S*Ebar[t-1]))
S1.H.1 <-substitute(plogis(-3.00 + 0.10*W1 + 0.50*W2 + 2.00*S + 0.25*S*E[t]))
S1.H   <-substitute(plogis(-3.00 + 0.10*W1 + 0.50*W2 + 2.00*S + 0.25*S*E[t] + 0.50*Ebar[t-1]))
S1.C.1 <-substitute(plogis( 4.00 - 0.05*W1 - 0.25*W2)) #staying at work
S1.C   <-S1.C.1
D1 <- DAG.empty() +
  node("S",      distr = "rconst", const = 0)+
  node("TTH",    distr = "runif",  min = -2, max = 10)+
  node("C.star", distr = "rconst", const = floor(TTH))+
  node("W1",     distr = "rbern",  prob  = 0.20) +
  node("W2",     distr = "rbern",  prob  = 0.20) +

  node("E",      t=1,       distr = "rbern",  prob=.(S1.E.1))+
  node("D",      t=1,       distr = "rbern",  prob=.(S1.D.1),EFU=TRUE)+
  node("Y",      t=1,       distr = "rbern",  prob=.(S1.Y.1), EFU=TRUE)+
  node("Ebar",   t=1,       distr = "rconst", const=E[1])+
  node("H",      t=1,       distr = "rbern",  prob=.(S1.H.1))+
  node("C",      t=1,       distr = "rbern",  prob=.(S1.C.1))+

  node("E",      t=2:tmax,  distr="rbern",    prob=ifelse(C[t-1]==0,0,.(S1.E)))+
  node("D",      t=2:tmax,  distr="rbern",    prob=.(S1.D), EFU=TRUE)+
  node("Y",      t=2:tmax,  distr="rbern",    prob=.(S1.Y), EFU=TRUE)+ 
  node("Ebar",   t=2:tmax,  distr="rconst",   const=sum(E[1:t])) + 
  node("H",      t=2:tmax,  distr="rbern",    prob=ifelse(H[t-1]==1,1,.(S1.H)))+
  node("C",      t=2:tmax,  distr="rbern",    prob=ifelse(C[t-1]==0,0,.(S1.C)))
RCDAG1 <- set.DAG(D1)

I tried to generate an observed dataset as follows:

K <- 10
act_dynAD0 <-c(
  node("E", t = 1:K, distr = "rconst", const = ifelse(t==1,abar,ifelse(C[t-1]==1, abar, 0))),
  node("D", t = 1:K, distr = "rconst", const = 0)) 
DynDact   <-  RCDAG1 + action("DynE0D0", nodes = act_dynAD0, abar = 0) + action("DynE1D0", nodes = act_dynAD0, abar = 1)
dt    <-  sim(DynDact,actions = c("DynE0D0","DynE1D0"), n = 1000, rndseed = 4)

And obtained the following warning:

Warning messages:
1: In rbinom(n = n, prob = prob, size = 1) : NAs produced
2: In rbinom(n = n, prob = prob, size = 1) : NAs produced
3: In rbinom(n = n, prob = prob, size = 1) : NAs produced
4: In rbinom(n = n, prob = prob, size = 1) : NAs produced
5: In rbinom(n = n, prob = prob, size = 1) : NAs produced

Many of the E[t] columns contain NAs. Would appreciate your help in resolving this.

latent variables still show up in data when simulating in long format

latent.v does not appear to work when output data is in long format

sim(DAG, wide=FALSE)

Gaussian var affecting categorical var

Hi,

I want to have a Gaussian variable affecting a categorical variable. I tried this which seems to work

C <- rnorm(3,0,2) # or even rnorm(1,0,2)
rcat.b1(250, softmax(C * c(0.5, 0.25, 0.25)))

But running the below only gets me category 4, while there are only three categories?

logsumexp <- function (x) {
  y = max(x)
  y + log(sum(exp(x - y)))
}

softmax <- function (x) {
  exp(x - logsumexp(x))
}

D <- DAG.empty()
N <- 250

D <- D +
  # Confounder with Gaussian distribution
  node("C",
      distr = "rnorm",
      mean = 0,
      sd = 2) +

  # Set BA to categorical (c=3).
  # Make sure it starts at 1, to ensure sane priors later.
  # Let C be a causal effect on BA.
  node("BA", 
    distr = "rcat.b1",
    probs = softmax(C * c(0.5, 0.25, 0.25)))

Dset <- set.DAG(D, vecfun = c("softmax"))
d <- sim(Dset, n = N)

r$> head(d)
  ID          C BA
1  1 -1.3004380  4
2  2  3.4434901  4
3  3 -1.7159391  4
4  4  3.7295937  4
5  5 -1.7967269  4
6  6 -0.8985413  4

Have I misunderstood something? :)

Default values for non-existing time-points

Allow non-existing time-point references to default to some value, for example, instead of doing this

node("TI", t=0:7, distr="rbern", prob=plogis(-5 - 0.3*CVD + 0.5*A1C[t] + 1.5*{if (t==0) {0} else {TI[t-1]}}))

write just this, with TI[t-1] at t=0 defaulting to 0:

node("TI", t=0:7, distr="rbern", prob=plogis(-5 - 0.3*CVD + 0.5*A1C[t] + 1.5*TI[t-1]))

Performing intervention in ver. 0.5.0

I am trying to perform a deterministic intervention on a DAG.

library(simcausal)
library(rje)
D <- DAG.empty()

D <- D +
  node("U1", distr="runif"
  )

D <- D +
  node("U2", distr="runif"
       )

D <- D +
  node("UA", distr="runif"
  )

D <- D +
  node("UY", distr="runif"
  )

D <- D +
  node("W1", distr="rconst",
       const = as.numeric(U1 < 0.5))

D <- D +
  node("W2", distr="rconst",
       const = as.numeric(U2 < 0.5))

D <- D +
  node("A", distr="rconst",
       const = as.numeric(UA < expit(-0.5 + W1 - 1.5 * W2)) )

D <- D +
  node("Y", distr="rconst",
       const = as.numeric(UY < expit(-0.75 + W1 - 2 * W2 + 2.5 * A + A*W1)),
       EFU = TRUE)

setDl2 <- set.DAG(D)
# ----------------------------------------------------------------------
plotDAG(setDl2, xjitter = 0.32, yjitter = 0.03,
        edge_attrs = list(width = 0.5, arrow.width = 0.4, arrow.size = 0.8),
        vertex_attrs = list(size = 19, label.cex = 1.5))
title(main = "SCM graph")
# ----------------------------------------------------------------------
# intervention
actN_A <- node("A",distr="rconst", const = theta)

D_act <- D + action("A_th0", nodes=actN_A, theta=0)
D_act <- D_act + action("A_th1", nodes=actN_A, theta=1)
D_act <- D_act + action("A_no_intervene", nodes=node("A", distr="rconst", const = as.numeric(UA < expit(-0.5 + W1 - 1.5 * W2)) ))

ate.data <- simfull(A(D_act), n=1e5 , rndseed = 252)

The D_act assignment fails to run under 0.5.0 while suceeds under 0.4.0. Could you please take a look when you have time?

double ID column

Simulated data contains two ID columns

Error in rcategor and rcategor.int for large numbers of groups

I'd like to use rcategor (or rcategor.int) to assign individuals to a large number of categories, but the following example illustrates an error:

D <- DAG.empty()
D1 <- D + node('group',
              distr = 'rcategor.int',
              probs = rep(1/50, 50)) 
D2 <- D + node('group',
               distr = 'rcategor.int',
               probs = rep(1/55, 55)) 
D1 <- set.DAG(D1)
D2 <- set.DAG(D2)

D1 is fine. D2, however, presents the following error:

I'm just starting to play with the package. It's great so far!

igraph Compatability

Is there a way to convert the DAG in the "Brief Overview" section to an igraph object?

I did try using igraph::as.igraph() and it doesn't seem to work.

#Simulating from a fixed network

The current simulation seems to only allow us to fix the hyper parameters of the network and whenever we simulate, it will first generate a new network and then simulate the covariates.

However, we want to be able to simulate from a fixed network. I.E at most simulate our network according to hyper parameters once, then fix such network, and simulate many different realizations of the covariates and outcomes.

So, I add such functionality by changing the simFromDAG function.

Understanding Non-standard evaluation in 0.5.1

The updates in v0.5.1 break my code, and I trying to understand why. I'm simulating a space-time structure, so I have two indexes s and t. simcausal understands a single dimension, so I map s and t to a single dimension via the function tFUN.

For each node in the DAG, I have a dataset of parameters that I lookup for s, t.

Here's a short example:

for(t in t_start:TT ){ # Per month
    for(s in s_start:SS ){ # Per location

      tpos  <- tFUN(s, t, SS)
      ## Initial month values ##

      if(s == 1 & t == 0){
       D <- D + node('L1',
               distr = 'rnorm',
               mean  = .(lookup_parameter(pp, node = 'L1', s = s, t = t, variable = '(Intercept)') ),
               sd    = .(lookup_parameter(pp, node = 'L1', s = s, t = t, variable = 'mse') ),
               t     = tpos)
 }
}

lookup_parameter basically uses [ to pick off the correct element in the dataset.

With the new updates, the values for s and t getting passed to lookup_parameter are not what I expected, and I get a One of the distribution parameters evaluated to non-standard vector error. t is the value I'd expect from simcausal rather then the value from the loop. s is a mystery altogether. The other arguments (node and variable pass to lookup_parameter correctly).

I think for my purposes (I'm trying to wrap up this study), I should just roll back to v0.5.0. I figured I would bring this up, in case there is something obvious I'm missing.

Thanks!

cannot find function 'melt'

I've run into the issue a few times, where when I call simobs get a cannot find function 'melt' error. When I run library(data.table), the issue goes away. I have not pinpointed if I have other libraries load that may cause the conflict, but I think changing melt to data.table::melt on line 749 of simulation.R may fix the issue. Though I see that data.table::melt has been commented out.

plot function for DAGs

add S3 method plot for DAG object.

using as.integer() in node formulas prints to the terminal:

using as.integer() in node formulas prints it to the terminal:

library(simcausal)
D <- DAG.empty()
D <- D +
         node("Trexp", distr = "rexp", rate = 1) +
         node("Cweib", distr = "rweibull", shape = 0.5, scale = 1) +
         node("T", distr = "rconst", const = as.integer(Trexp*100))
setD <- set.DAG(D)
dat <- sim(setD, n=100)

Evaluation of categorical by variable name

THIS no longer works, because the dimensions of catprob.W0 and catprob.W1 (1 row) do not match the dims of W (n rows)

  D <- DAG.empty()
  D <- D + node("W", distr = "rbern", prob = 0.3)
  catprob.W0 <- cbind(0.7,0.1,0.2)
  catprob.W1 <- cbind(0.2,0.1,0.7)
  D <- D + node("Cat3", distr = "rcategor.int", probs = (W==0)*.(catprob.W0) + (W==1)*.(catprob.W1))
  checkException(Dset2 <- set.DAG(D))

This works however, because simcausal gets a chance to parse c(0.7,0.1,0.2) into a matrix with appropriate number of rows:

  D <- DAG.empty()
  D <- D + node("W", distr = "rbern", prob = 0.3)
  catprob.W0 <- cbind(0.7,0.1,0.2)
  catprob.W1 <- cbind(0.2,0.1,0.7)
  D <- D + node("Cat3", distr = "rcategor.int", probs = (W==0)*c(0.7,0.1,0.2) + (W==1)*c(0.2,0.1,0.7))
  Dset2 <- set.DAG(D)

Removed from CRAN again

https://cran.r-project.org/web/packages/simcausal/index.html

osofr / simcausal Goto Github PK

simcausal's People

Contributors

Stargazers

Watchers

Forkers

simcausal's Issues

set seed for reproducability

specify the two interventions

counterfactual data to estimate conditional distribution M|a,c and M|a*,c

look at counterfactual data

specify the two interventions

Evaluate TRUTH

Recommend Projects

Recommend Topics

Recommend Org