Giter VIP home page Giter VIP logo

dynamite's Introduction

dynamite: Bayesian Modeling and Causal Inference for Multivariate Longitudinal Data

Project Status: Active – The project has reached a stable, usable state and is being actively developed. R-CMD-check Codecov test coverage Status at rOpenSci Software Peer Review dynamite status badge dynamite CRAN badge

The dynamite R package provides an easy-to-use interface for Bayesian inference of complex panel (time series) data comprising of multiple measurements per multiple individuals measured in time via dynamic multivariate panel models (DMPM). The main features distinguishing the package and the underlying methodology from many other approaches are:

  • Support for regular time-invariant effects, group-level random effects, and time-varying effects modeled via Bayesian P-splines.
  • Joint modeling of multiple measurements per individual (multiple channels) based directly on the assumed data-generating process. Individual channels can be univariate or multivariate.
  • Support for various distributions: Currently Gaussian, Multivariate Gaussian, Student t, Categorical, Ordered, Multinomial, Poisson, Bernoulli, Binomial, Negative Binomial, Gamma, Exponential, and Beta distributions are available, and these can be combined arbitrarily in multichannel models.
  • Allows evaluating realistic long-term counterfactual predictions that take into account the dynamic structure of the model by efficient posterior predictive distribution simulation.
  • Transparent quantification of parameter and predictive uncertainty due to a fully Bayesian approach.
  • Various visualization methods including a method for drawing and producing a TikZ code of the directed acyclic graph (DAG) of the model structure.
  • User-friendly and efficient R interface with state-of-the-art estimation via Stan. Both rstan and cmdstanr backends are supported, with both parallel chains and within-chain parallelization.

The dynamite package is developed with the support of the Research Council of Finland grant 331817 (PREDLIFE). For further information on DMPMs and the dynamite package, see the related papers:

  • Helske J. and Tikka S. (2024). Estimating Causal Effects from Panel Data with Dynamic Multivariate Panel Models. Advances in Life Course Research, 60, 100617. (Journal version, SocArXiv preprint)
  • Tikka S. and Helske J. (2024). dynamite: An R Package for Dynamic Multivariate Panel Models. (arXiv preprint)

Installation

You can install the most recent stable version of dynamite from CRAN or the development version from R-universe by running one the following lines:

install.packages("dynamite")
install.packages("dynamite", repos = "https://ropensci.r-universe.dev")

Example

A single-channel model with time-invariant effect of z, time-varying effect of x, lagged value of the response variable y and a group-specific random intercepts:

set.seed(1)
library("dynamite")
gaussian_example_fit <- dynamite(
  obs(y ~ -1 + z + varying(~ x + lag(y)) + random(~1), family = "gaussian") +
    splines(df = 20),
  data = gaussian_example, time = "time", group = "id",
  iter = 2000, chains = 2, cores = 2, refresh = 0
)

Summary of the model:

print(gaussian_example_fit)
#> Model:
#>   Family   Formula                                       
#> y gaussian y ~ -1 + z + varying(~x + lag(y)) + random(~1)
#> 
#> Correlated random effects added for response(s): y
#> 
#> Data: gaussian_example (Number of observations: 1450)
#> Grouping variable: id (Number of groups: 50)
#> Time index variable: time (Number of time points: 30)
#> 
#> NUTS sampler diagnostics:
#> 
#> No divergences, saturated max treedepths or low E-BFMIs.
#> 
#> Smallest bulk-ESS: 661 (sigma_nu_y_alpha)
#> Smallest tail-ESS: 1058 (sigma_nu_y_alpha)
#> Largest Rhat: 1.003 (sigma_y)
#> 
#> Elapsed time (seconds):
#>         warmup sample
#> chain:1  5.824  3.531
#> chain:2  5.669  3.612
#> 
#> Summary statistics of the time- and group-invariant parameters:
#> # A tibble: 6 × 10
#>   variable      mean median      sd     mad     q5   q95  rhat ess_bulk ess_tail
#>   <chr>        <dbl>  <dbl>   <dbl>   <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>
#> 1 beta_y_z    1.97   1.97   0.0116  0.0112  1.95   1.99   1.00    2815.    1434.
#> 2 sigma_nu_y… 0.0944 0.0933 0.0114  0.0107  0.0780 0.114  1.00     661.    1058.
#> 3 sigma_y     0.198  0.198  0.00373 0.00362 0.192  0.204  1.00    2580.    1254.
#> 4 tau_alpha_y 0.212  0.205  0.0483  0.0432  0.146  0.301  1.00    1731.    1606.
#> 5 tau_y_x     0.364  0.355  0.0740  0.0648  0.266  0.494  1.00    2812.    1504.
#> 6 tau_y_y_la… 0.107  0.105  0.0219  0.0213  0.0781 0.148  1.00    2387.    1682.

Posterior estimates of time-varying effects:

plot(gaussian_example_fit, types = c("alpha", "delta"), scales = "free")

And group-specific intercepts (for first 10 groups):

plot(gaussian_example_fit, types = "nu", groups = 1:10)

Traceplots and density plots for time-invariant parameters:

plot(gaussian_example_fit, plot_type = "trace", types = "beta")

Posterior predictive samples for the first 4 groups (using the samples based on the posterior distribution of the model parameters and observed data on the first time point):

library("ggplot2")
pred <- predict(gaussian_example_fit, n_draws = 100)
pred |>
  dplyr::filter(id < 5) |>
  ggplot(aes(time, y_new, group = .draw)) +
  geom_line(alpha = 0.25) +
  # observed values
  geom_line(aes(y = y), colour = "tomato") +
  facet_wrap(~id) +
  theme_bw()

Visualizing the model structure as a DAG (a snapshot at time t):

plot(gaussian_example_fit, plot_type = "dag", show_covariates = TRUE)

For more examples, see the package vignettes and the blog post about dynamite.

Related packages

  • The dynamite package uses Stan via rstan and cmdstanr (see also https://mc-stan.org), which is a probabilistic programming language for general Bayesian modelling.
  • The brms package also uses Stan, and can be used to fit various complex multilevel models.
  • Regression modeling with time-varying coefficients based on kernel smoothing and least squares estimation is available in package tvReg. The tvem package provides similar functionality for gaussian, binomial and poisson responses with mgcv backend.
  • plm contains various methods to estimate linear models for panel data, e.g., fixed effect models.
  • lavaan provides tools for structural equation modeling, and as such can be used to model various panel data models as well.

Contributing

Contributions are very welcome, see CONTRIBUTING.md for general guidelines.

dynamite's People

Contributors

helske avatar jeroen avatar mpadge avatar santikka avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Forkers

mpadge

dynamite's Issues

Add checks for time variable

Add checks and informative error messages regarding the time indexing variable, with potential coercion of (ordered) factors to numeric.

Optimize code for performance

I think there is some potential for performance improvements by rewriting the definitions of beta/a/likelihood, need to test and profile a bit.

This talk gives some potentially useful info on how autodiff etc works in stan:
There are some speed comparisons here: https://www.youtube.com/watch?v=EJLXoDBoVko

I wonder if it would be beneficial to write our own likelihood functions based on spline coefficients directly with analytical gradients. Lots of work though, and no guarantees it helps...

Add support for std_normal()

Missed this earlier, needs bit tweaking in vectorization due to zero arguments. I'll work on this in the random branch where the other changes to the data/prior generation were made.

Support for multiple `random` components?

If I understand correctly, it is currently possible to only add a single random component to the model. How does one then specify, for example, correlated random intercepts between two channels, and a separate uncorrelated random intercept for a third channel? Should we try to support this?

Add option to model the correlation of random intercepts between channels

If we assume that the random intercept corresponds to some latent "trait" of a group, then it is reasonable also to assume that these could be correlated.

This should be relatively simple to do if we define the common correlation in stanblocks instead of stanblocks_families, and add an argument correlated_nu to dynamite. Other option would be to define new function random(channels = character(), correlated = logical(1)) similarly as lags() and splines() which would be used to add random intercept components instead of inside obs. This latter option would be perhaps more elegant and in line with the overall style of the package but I think it would make it more difficult to define priors etc.

lags() shifts the starting time compared to lag()

set.seed(1)
n_id <- 50
n_time <- 20
d <- data.frame(y = sample(factor(c("a", "b", "c", "d")), size = n_time * n_id, replace = TRUE),
                x = rnorm(n_time * n_id),
                time = 1:n_time, id = rep(1:n_id, each = n_time))

f1 <-
  obs(x ~ 1, family = gaussian()) +
  obs(y ~ 1, family = categorical()) +
  lags()

f2 <-
  obs(x ~ lag(x) + lag(y), family = gaussian()) +
  obs(y ~ lag(x) + lag(y), family = categorical())

> head(dynamite(f1, d, "id", "time", debug = list(no_compile = TRUE))$data)
Key: <id, time>
        y           x  time    id     x_lag_1 y_lag_1
   <fctr>       <num> <int> <int>       <num>  <fctr>
1:      a  0.07730312     1     1          NA    <NA>
2:      d -0.29686864     2     1          NA    <NA>
3:      c -1.18324224     3     1 -0.29686864       d
4:      a  0.01129269     4     1 -1.18324224       c
5:      b  0.99160104     5     1  0.01129269       a
6:      a  1.59396745     6     1  0.99160104       b
> head(dynamite(f2, d, "id", "time", debug = list(no_compile = TRUE))$data)
Key: <id, time>
        y           x  time    id      x_lag1 y_lag1
   <fctr>       <num> <int> <int>       <num> <fctr>
1:      a  0.07730312     1     1          NA   <NA>
2:      d -0.29686864     2     1  0.07730312      d
3:      c -1.18324224     3     1 -0.29686864      d
4:      a  0.01129269     4     1 -1.18324224      c
5:      b  0.99160104     5     1  0.01129269      a
6:      a  1.59396745     6     1  0.99160104      b

Single sequence case throws an error

Something is broken in in case of a single id:

library(dynamite)
d <- data.frame(y = rnorm(20), x = rnorm(20), id = 1, time = 1:20)
fit<-dynamite(
  obs(y ~ -1 + varying(~x), family=gaussian()) + splines(df = 100), 
  d, id, time, chains = 1)
Error in apply(x, 1, function(y) { : dim(X) must have a positive length

Traceback shows that the problem occurs at

apply(x, 1, function(y) {
all(!is.na(y))
}) at convert_data.R#101

This is probably related to automatic dimension drop. But I wouldn't be surprised if there are other issues with single id case as well, at least I haven't tested that for a while.

lags drops fixed covariates

> f <- obs(y2 ~ x2, family = gaussian()) + lags(k = 1, type = "fixed")
> get_priors(f, test_data, "group", "time")
         parameter response                                       prior  type category
1         alpha_y2       y2 normal(0.902522382199484, 3.48489844539401) alpha         
2 beta_y2_y2_lag_1       y2                 normal(0, 1.77635956526457)  beta         
3         sigma_y2       y2              exponential(0.573904815689363) sigma 

vs

> f <- obs(y2 ~ x2 + lag(y2), family = gaussian())
> get_priors(f, test_data, "group", "time")
        parameter response                                       prior  type category
1        alpha_y2       y2 normal(0.902522382199484, 3.48489844539401) alpha         
2      beta_y2_x2       y2                 normal(0, 3.10695826741562)  beta         
3 beta_y2_y2_lag1       y2                 normal(0, 1.77635956526457)  beta         
4        sigma_y2       y2              exponential(0.573904815689363) sigma      

Predict with `type=function`?

With large data, the newdata object in predict can become huge as it has n_time x n_id x n_draws rows. But often we are not interested in all id-specific trajectories (or if we are, we can probably create a smaller newdata and sample based on that), but in estimating causal effects we will eventually average over individuals at each iteration to extract say mean and variance of interventional distribution. So in order to save memory, we could provide a type (or new predict_something function) which would automatically apply some function to samples at each time&iteration combination, instead of storing all the samples, i.e., we would only need to keep track of samples from the previous and current time points and these summaries.

Incorrect indexing in predict

Due to R's recycling rules, indexing is inverted in predict for all other parameters except alpha, beta, delta and nu. One solution is to flip the indexing of newdata, as we no longer require the data to be contiguous in terms of the time index, because lags can be computed using an arbitrary offset instead of -1.

Predict method with unbalanced data

In dynamite, the data is expanded to include missing time points, whereas in predict the newdata was taken as is, which caused issues due to different number of ids per time index and incorrect handling of gaps in the data. I added fill_time_predict function and call to this in parse_newdata in a2fc06e, but now in the case when newdata is NULL, clear_nonfixed will set some groups full of NAs the starting time in the original data was not the first time point in the data. I first thought this was a bug, but this does work as designed. But in predict, perhaps a better way would be to only fill the gaps for each individual, and not expand beyond the start and end time of each individual so we would still get predictions for everyone (Although I can see this being bit confusing in some cases). This would need some changes at least to generate_sim_call.

lags() removes varying components

The varying coefficient for x is currently transformed to fixed:

d <- data.frame(y=1:10,x=1:10,id=1,time=1:10)
> get_priors(obs(y~ -1 + varying(~x), family = gaussian()) + lags() + splines(df=5), d, "id", "time")
       parameter response                          prior  type category
1        alpha_y        y    normal(1, 6.05530070819498) alpha         
2       beta_y_x        y                   normal(0, 2)  beta         
3 beta_y_y_lag_1        y    normal(0, 2.21108319357027)  beta         
4        sigma_y        y exponential(0.330289129537908) sigma    

Without lags() it works:

> get_priors(obs(y~ -1 + varying(~x), family = gaussian()) + splines(df=5), d, "id", "time")
    parameter response                          prior      type category
1     alpha_y        y    normal(1, 6.05530070819498)     alpha         
2 tau_alpha_y        y                   normal(0, 1) tau_alpha         
3   delta_y_x        y                   normal(0, 2)     delta         
4     tau_y_x        y                   normal(0, 1)       tau         
5     sigma_y        y exponential(0.330289129537908)     sigma   

Binomial formula without `trials` should throw an early error

set.seed(0)
timepoints <- 10
individuals <- 5
total_obs <- timepoints * individuals

test_data <- data.frame(
  time = 1:timepoints,
  group = gl(individuals, timepoints),
  trials = sample(50:100, size = total_obs, replace = TRUE)
) |>
  dplyr::mutate(
    y = rbinom(n = total_obs, size = trials, prob = 0.75),  
    x = rnorm(total_obs)
  )

Without trials does not throw an error:

fit<-dynamite(obs(y ~ x, "binomial"), 
  data = test_data, "group", "time", debug = list(no_compile = TRUE))

But there is (obviously) no trials_y in fit$stan$model_vars so sampling will fail.

Add support for beta distribution

This can be useful for modelling data which is constrained to say between 0-100 (after scaling to unit interval). There is a beta_proportion distribution in Stan for this.

standard deviations `tau_alpha` are not correctly estimated

Need to hunt down the issue, at least one issues is incorrect indexing, now we have in the parameters block:

omega_alpha_g[1] = omega_alpha_1_g;
omega_alpha_g[2:D] = omega_raw_alpha_g;

And in model

omega_raw_alpha_g[2] ~ normal(omega_alpha_1_g, tau_alpha_g);
for (i in 3:(D - 1)) {
  omega_raw_alpha_g[i] ~ normal(omega_raw_alpha_g[i - 1], tau_alpha_g);
}

Allow removal of lags in formula syntax

Get the signs of the lag terms present in the formula, and add them to lag_map, and then add the corresponding term to the predictors if the sign is positive and remove it if the sign is negative.

Binomial channel without covariates throws an error

set.seed(0)
timepoints <- 10
individuals <- 5
total_obs <- timepoints * individuals

test_data <- data.frame(
  time = 1:timepoints,
  group = gl(individuals, timepoints),
  trials = sample(50:100, size = total_obs, replace = TRUE)
) |>
  dplyr::mutate(
    y = rbinom(n = total_obs, size = trials, prob = 0.75),  
    x = rnorm(total_obs)
  )

Case of no covariates throws an error:

fit<-dynamite(obs(y ~ trials(trials), "binomial"), 
  data = test_data, "group", "time", debug = list(no_compile = TRUE))
Error in reformulate(attr(termobj, "term.labels")[-dropx], response = if (keep.response) termobj[[2L]], : 
'termlabels' must be a character vector of length at least one

Allow newdata in fitted?

Currently, predict will only simulate for the NA values in the data, or in the default case everything given the first fixed time points, whereas fitted computes new predictions given the fixed (original) data. But often you might want to compute "one-step-ahead" (in case of maximum lag of one) predictions under different scenarios for each time point (in case of time-varying coefficients). For example, in the categorical case it can be of interest to compute how the probability of transitioning from one category to other changes in time. This is currently not possible, as fitted does not accept newdata argument (except by repeated calls to predict with rolling NA pattern).

I think it would be good idea to allow newdata in fitted as well. So basically fitted and predict with type="mean" would be identical except that fitted always conditions on the original newdata whereas predict conditions on the simulated values of the previous time point(s).

Interaction term with lagged variable throws an error

Running the multichannel_example now fails:

multichannel_example_fit <- dynamite(
    f, multichannel_example, "id", "time",
    chains = 1, cores = 1, iter = 2000, warmup = 1000, init = 0, refresh = 0,
    thin = 5, save_warmup = FALSE)
Error in `verify_lag()` at dynamite/R/lags.R:110:8:
! Invalid shifted value expression `1):lag(logp, 1`.
Run `rlang::last_error()` to see where the error occurred.

This is probably related to changes in gregexec pattern in extract_lags in 3600ef1.

Fitted and predict methods fail if model does not contain lags

If fixed = 0, then idx <- as.integer(newdata[ ,.I[newdata[[time_var]] == fixed]]) will return integer(0) and subsequent code errors.

Based on that line it also seems that currently, these methods do not work if the time_var is something else than 1, 2, 3, ....

Disallow the use of the 'as-is' function I(.) in formulas

I think we should disable the as-is function for the following reasons:

  1. Formula syntax would be more robust, i.e., no new variables are generated on the fly that we are not in control of.
  2. Users would be encouraged to use deterministic channels for transformations, so that their predictions will also work.
  3. Transformations of fixed predictors that are independent of the responses can be done beforehand anyway.

Create more test data

Need to create test data and models for all distributions. Perhaps one large case?

Confirm valid model structure

All channels should only depend on previous responses, previous covariates or current covariates. For example, the following should throw an error:

obs(y1 ~ y2) + obs(y2 ~ x)

Stan error on example code in vignette

Hello!

I'm exploring the package and found an error when running the example, here:

```{r, eval = FALSE}
multichannel_example_fit <- dynamite(
f, multichannel_example, "id", "time",
chains = 1, cores = 1, iter = 2000, warmup = 1000, init = 0, refresh = 0,
thin = 5, save_warmup = FALSE)
```

The error message is:

Compiling Stan model.SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Variable identifier (name) may not be reserved word
    found identifier=gamma_p
Variable identifier (name) may not be reserved word
    found identifier=gamma_p
 error in 'model5038607b6a32_570ee1c5e15b6fdb8d3b51191f12f9f8' at line 67, column 13
  -------------------------------------------------
    65:   // Define the first alpha using mean a_p
    66:   {
    67:     vector[3] gamma_p;
                    ^
    68:     gamma_p[{1, 2, 3}] = beta_p;
  -------------------------------------------------

PARSER EXPECTED: <identifier>
Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  failed to parse Stan model '570ee1c5e15b6fdb8d3b51191f12f9f8' due to the above error.

R version, session info:

Click me

platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          2.2                         
year           2022                        
month          10                          
day            31                          
svn rev        83211                       
language       R                           
version.string R version 4.2.2 (2022-10-31)
nickname       Innocent and Trusting  

Session info:

R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.11.0
LAPACK: /usr/lib/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dynamite_1.0.1

loaded via a namespace (and not attached):
 [1] rstan_2.21.7         tidyselect_1.2.0     xfun_0.35            purrr_0.3.5          colorspace_2.0-3     vctrs_0.5.1          generics_0.1.3      
 [8] htmltools_0.5.3      usethis_2.1.6        stats4_4.2.2         loo_2.5.1            yaml_2.3.6           utf8_1.2.2           rlang_1.0.6         
[15] pkgbuild_1.4.0       pillar_1.8.1         glue_1.6.2           DBI_1.1.3            distributional_0.3.1 matrixStats_0.63.0   lifecycle_1.0.3     
[22] posterior_1.3.1      munsell_0.5.0        gtable_0.3.1         codetools_0.2-18     evaluate_0.18        inline_0.3.19        knitr_1.41          
[29] callr_3.7.3          fastmap_1.1.0        ps_1.7.2             parallel_4.2.2       fansi_1.0.3          Rcpp_1.0.9           backports_1.4.1     
[36] scales_1.2.1         checkmate_2.1.0      RcppParallel_5.1.5   StanHeaders_2.21.0-7 abind_1.4-5          farver_2.1.1         fs_1.5.2            
[43] gridExtra_2.3        tensorA_0.36.2       ggplot2_3.4.0        digest_0.6.30        processx_3.8.0       dplyr_1.0.10         grid_4.2.2          
[50] cli_3.4.1            tools_4.2.2          magrittr_2.0.3       tibble_3.1.8         crayon_1.5.2         pkgconfig_2.0.3      data.table_1.14.6   
[57] prettyunits_1.1.1    assertthat_0.2.1     rmarkdown_2.18       rstudioapi_0.14      R6_2.5.1             compiler_4.2.2      

Are there other examples I could run to explore the packages functionality?

Thanks!

Use functions instead of quoted expressions in `stanblocks_families.R`?

Unlike predict, there is no performance overhead resulting from function calls, because each Stan code block expression is called only once per model anyway. Converting these expressions back to functions would make debugging easier and the blocks would also be evaluated by covr.

However, it is annoying to have many functions with the exact same arguments. I suggest we create one "function template" with the full list of arguments, and manually copy these arguments to all of the other functions using formals in a loop.

No prior on lambda

Currently, the user cannot define prior for lambda. Will fix this when working on #33 .

Support deterministic variables

Add support for deterministic variables (auxiliary channels) via aux, e.g., aux(cumulative_work ~ lag(fulltime) + lag(cumulative_work))

Rewrite print method

Current print method for dynamite object is not working correctly. It would also make sense to print something else than the original codes. Perhaps some overall summary of the model, i.e. summary of convergence checks (e.g. as in bssm), number of chains/samples/parameters, model formula etc.

coefficient names and/or priors are incorrect

It seems that with multiple channels the names of the coefficients are sometimes incorrectly extracted from the coef_names. For example:

get_priors(
    obs(y5 ~ x4, family = gaussian()) +
    obs(y1 ~ x1 + x2 + varying(~x4) + lag(y1, 1), family = categorical()) +
    splines(), test_data, ID, time)
             parameter response                          prior  type category
1     beta_(Intercept)       y5                  normal(0, 10)  beta         
2              beta_x4       y5    normal(0, 2.13047007075835)  beta         
3                sigma       y5 exponential(0.550867841105406) sigma         
4              beta_x4       y1    normal(0, 2.13047007075835)  beta        1
5              beta_x1       y1    normal(0, 2.04026413534608)  beta        1
6             beta_x22       y1                   normal(0, 4)  beta        1
7             beta_x23       y1                   normal(0, 4)  beta        1
8             beta_x24       y1                   normal(0, 4)  beta        1
9     beta_(Intercept)       y1                   normal(0, 5) delta        1
10 beta_I(lag_(y1, 1))       y1    normal(0, 3.60555127546399) delta        1
11     tau_(Intercept)       y1                   normal(0, 1)   tau         
12  tau_I(lag_(y1, 1))       y1                   normal(0, 1)   tau         

(btw, all the coefficients are still all named as "beta" instead of time-varying ones being "delta")

Code for a model with only intercept does not work

Currently the generated code is something like y[, t] ~ normal_id_glm(X[t][, {}], alpha_y[t], gamma_y, sigma_y);. Perhaps it is fine to just throw an error, as this is not a particularly important special case.

Varying intercept only fails

Formulas such as obs(y ~ varying(~1)) throws an error probably because the intercept is not captured as "predictor":

fit <- dynamite(
    obs(y ~ -1 + varying(~1), family = categorical()) +
        splines(df = 10),
    data.frame(y=1:10,id=1,time=1:10), id, time)

Error in varying(\~1) : could not find function "varying"

Remove unused columns from data to conserve memory

If newdata is NULL in predict, all columns in the original data will be expanded to the full size. It would make sense to first determine which columns are needed for the simulation, and then drop the rest (also if newdata is supplied by the user, there might be unused columns).

lag(y) + lags() leads to extra covariate

See

f <- obs(y ~ x, family = "gaussian") +
    obs(w ~ x + lag(y), family = "exponential")  +
    lags()
set.seed(0)
d <- data.frame(
    y = rnorm(5),
    w = rexp(5),
    x = rnorm(5),
    z = seq.int(5)
)
fit <- dynamite(f, d, time = "z", debug = list(no_compile = TRUE))
str(fit$stan$model_vars$w)
List of 28
 $ resp                 : chr "w"
 $ L_fixed              : int [1:4(1d)] 1 4 2 3
 $ L_varying            : int[0 (1d)] 
 $ J                    : int [1:4(1d)] 1 2 3 4
 $ J_fixed              : int [1:4(1d)] 1 4 2 3
  ..- attr(*, "dimnames")=List of 1
  .. ..$ : chr [1:4] "x" "lag(y, 1)" "y_lag1" "w_lag1"
 $ J_varying            : int[0 (1d)] 
 $ K                    : int 4
 $ K_fixed              : int 4
 $ K_varying            : int 0
 $ has_missing          : logi FALSE
 $ obs                  : chr ""
 $ has_fixed_intercept  : logi TRUE
 $ has_varying_intercept: logi FALSE
 $ has_random_intercept : logi FALSE
 $ has_fixed            : logi TRUE
 $ has_varying          : logi FALSE
 $ lb                   : num 0
 $ shrinkage            : logi FALSE
 $ noncentered          : logi FALSE
 $ has_offset           : logi FALSE
 $ has_trials           : logi FALSE
 $ alpha_prior_distr    : chr "normal(0.21, 2)"
 $ beta_prior_npars     : num 2
 $ beta_prior_pars      : num [1:4, 1:2] 0 0 0 0 4 4 4 4
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "x" "lag(y, 1)" "y_lag1" "w_lag1"
  .. ..$ : NULL
 $ beta_prior_distr     : chr "normal"
 $ write_beta           : logi TRUE
 $ write_delta          : logi FALSE
 $ write_tau            : logi FALSE

Allow group-specific values of `past` in `aux`

So something like this:

obs(y ~ lag(x)) + aux(x ~ lag(x) + log(y) + past(z))

Probably the best option is to look for z is in the data and use the first values for each group. This keeps all the data used in the modelling in the same place. Small imperfection is that the rest of the z are ignored but are needlessly kept around. On the other hand, it makes it natural to use something like past(log(y)) in the example above. Another option would be to add an additional aux_data argument to dynamite.

Add "Remotes" field to DESC

@santikka Your submission to rOpenSci (ropensci/software-review#554) failed checks because you've got minimal versions for both rstan and data.table that are beyond current CRAN versions (respectively 2.21.5 and 1.14.2, as of 18 Aug 2022). Your package fails to install because these version requirements can not be met.

You need to add these extra lines to your DESCRIPTION file:

Remotes:
  stan-dev/rstan,
  Rdatatable/data.table

That then tells the r-lib suite of installers (devtools, remotes) where to find those more recent versions, and so enables them to be installed. See docs of the remotes package for more info. Thanks 😄

Deterministic channel data type

data.table is quite strict when it comes to column types. If a deterministic channel is NA before time index fixed + 1, then we can't know the column type beforehand, which might cause an error later on. I think we should add an additional argument to deterministic channels that will predefine the column type (factor, integer or real).

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.