seananderson / glmmfields Goto Github PK

R package for Bayesian spatial and spatiotemporal GLMMs with possible extremes

R 89.29% Stan 10.66% C++ 0.05%

r spatial-analysis extremes spatiotemporal ecology

glmmfields's Introduction

glmmfields

The glmmfields R package implements Bayesian spatiotemporal models that allow for extreme spatial deviations through time. It uses a predictive process approach with random fields implemented through a multivariate-t distribution instead of a multivariate normal. The models are fit with Stan.

We published a paper describing the model and package in Ecology:

Anderson, S. C., Ward, E. J. 2019. Black swans in space: modelling spatiotemporal processes with extremes. 100(1):e02403. https://doi.org/10.1002/ecy.2403

You can install the CRAN version of the package with:

install.packages("glmmfields")

If you have a C++ compiler installed, you can install the development version of the package with:

# install.packages("remotes")
remotes::install_github("seananderson/glmmfields", build_vignettes = TRUE)

glmmfields can also fit spatial GLMs with Stan. See the vignette:

vignette("spatial-glms", package = "glmmfields")

An example spatiotemporal model

library(glmmfields)
#> Loading required package: Rcpp
library(ggplot2)

Simulate data:

set.seed(42)
s <- sim_glmmfields(
  df = 2.8, n_draws = 12, n_knots = 12, gp_theta = 2.5,
  gp_sigma = 0.2, sd_obs = 0.1
)
head(s$dat)
#>   time pt           y      lon      lat station_id
#> 1    1  1  0.02818963 9.148060 6.262453          1
#> 2    1  2 -0.21924739 9.370754 2.171577          2
#> 3    1  3 -0.34719485 2.861395 2.165673          3
#> 4    1  4 -0.15785483 8.304476 3.889450          4
#> 5    1  5 -0.04703617 6.417455 9.424557          5
#> 6    1  6 -0.23904924 5.190959 9.626080          6

print(s$plot)

Fit the model:

options(mc.cores = parallel::detectCores()) # for parallel processing
m <- glmmfields(y ~ 0,
  data = s$dat, time = "time",
  lat = "lat", lon = "lon",
  nknots = 12, estimate_df = TRUE, iter = 800, seed = 1
)

print(m)
#> Inference for Stan model: glmmfields.
#> 4 chains, each with iter=800; warmup=400; thin=1; 
#> post-warmup draws per chain=400, total post-warmup draws=1600.
#> 
#>             mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
#> df[1]       3.72    0.04 1.47    2.08    2.67    3.37    4.28    7.48  1331    1
#> gp_sigma    0.30    0.00 0.04    0.22    0.27    0.30    0.32    0.39   525    1
#> gp_theta    2.58    0.00 0.07    2.46    2.54    2.58    2.63    2.71  1434    1
#> sigma[1]    0.10    0.00 0.00    0.09    0.10    0.10    0.10    0.10  2207    1
#> lp__     2291.28    0.42 9.59 2270.34 2285.12 2291.59 2297.73 2308.74   521    1
#> 
#> Samples were drawn using NUTS(diag_e) at Mon Feb 13 12:45:42 2023.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split chains (at 
#> convergence, Rhat=1).

Plot:

plot(m, type = "prediction") + scale_color_gradient2()

plot(m, type = "spatial-residual")

Predictions:

# link scale:
p <- predict(m)
head(p)
#> # A tibble: 6 × 3
#>   estimate conf_low conf_high
#>      <dbl>    <dbl>     <dbl>
#> 1  -0.0283  -0.0868    0.0273
#> 2  -0.291   -0.365    -0.220 
#> 3  -0.397   -0.448    -0.346 
#> 4  -0.196   -0.266    -0.123 
#> 5  -0.0370  -0.110     0.0360
#> 6  -0.214   -0.294    -0.140

# posterior predictive intervals on new observations (include observation error):
p <- predictive_interval(m)
head(p)
#> # A tibble: 6 × 3
#>   estimate conf_low conf_high
#>      <dbl>    <dbl>     <dbl>
#> 1  -0.0283   -0.236   0.181  
#> 2  -0.291    -0.507  -0.0904 
#> 3  -0.397    -0.596  -0.206  
#> 4  -0.196    -0.392   0.00154
#> 5  -0.0370   -0.239   0.172  
#> 6  -0.214    -0.423  -0.00659

Use the tidy method to extract parameter estimates as a data frame:

x <- tidy(m, conf.int = TRUE)
head(x)
#> # A tibble: 6 × 5
#>   term                     estimate std.error conf.low conf.high
#>   <chr>                       <dbl>     <dbl>    <dbl>     <dbl>
#> 1 df[1]                      3.37     1.47      2.08      7.48  
#> 2 gp_sigma                   0.295    0.0432    0.216     0.388 
#> 3 gp_theta                   2.58     0.0662    2.46      2.71  
#> 4 sigma[1]                   0.0979   0.00214   0.0939    0.102 
#> 5 spatialEffectsKnots[1,1]  -0.110    0.0341   -0.175    -0.0442
#> 6 spatialEffectsKnots[2,1]  -0.230    0.0386   -0.305    -0.155

Make predictions on a fine-scale spatial grid:

pred_grid <- expand.grid(
  lat = seq(min(s$dat$lat), max(s$dat$lat), length.out = 25),
  lon = seq(min(s$dat$lon), max(s$dat$lon), length.out = 25),
  time = unique(s$dat$time)
)

pred_grid$prediction <- predict(m,
  newdata = pred_grid, type = "response", iter = 100, estimate_method = "median"
)$estimate

ggplot(pred_grid, aes(lon, lat, fill = prediction)) +
  facet_wrap(~time) +
  geom_raster() +
  scale_fill_gradient2()

References

Anderson, S. C., Ward, E. J. 2019. Black swans in space: modelling spatiotemporal processes with extremes. 100(1):e02403. https://doi.org/10.1002/ecy.2403

Latimer, A. M., S. Banerjee, H. Sang Jr, E. S. Mosher, and J. A. Silander Jr. 2009. Hierarchical models facilitate spatial analysis of large data sets: a case study on invasive plant species in the northeastern United States. Ecology Letters 12:144–154.

Shelton, A. O., J. T. Thorson, E. J. Ward, and B. E. Feist. 2014. Spatial semiparametric models improve estimates of species abundance and distribution. Canadian Journal of Fisheries and Aquatic Sciences 71:1655–1666.

NOAA Disclaimer

This repository is a scientific product and is not official communication of the National Oceanic and Atmospheric Administration, or the United States Department of Commerce. All NOAA GitHub project code is provided on an ‘as is’ basis and the user assumes responsibility for its use. Any claims against the Department of Commerce or Department of Commerce bureaus stemming from the use of this GitHub project will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by the Department of Commerce. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC or the United States Government.

glmmfields's People

Contributors

Stargazers

Watchers

Forkers

hbwxf jelsner bgoodri datafusion18 waternk climostatistics qiubangzhu ecustwy alotfata andrjohns nemochina2008

glmmfields's Issues

station_id argument is redundant?

It occurred to me, while staring at the main function call, that the station_id argument may be redundant. Couldn't we just create that index internally by pasting together the latitude and longitude coordinates and turning that into a factor and then integer?

This would also eliminate the possibility for the user to mess things up by changing the coordinate station ID matching somewhere in the data set.

Unless there is some situation I'm not thinking of, I suggest we remove the argument and add that column to the data frame within the package.

Import S3 generics from rstantools rather than rstanarm

I believe you can import the relevant S3 generics (i.e. posterior_linpred) from the rstantools package rather than the rstanarm package. If that is not completely the case, let us know and we can probably move stuff. Installing rstanarm is a relatively big cost for people that are not actually using it to fit models.

Matern covariance calculations error

There seems to be an issue with the Matern covariance calculations. I realized that the unit tests before had matern_kappa = 0.5, which internally switched to the exponential code. Maybe a matrix needs to be transposed? Not sure. I haven't dug into this yet. Example:

  gp_sigma <- 0.2
  sigma <- 0.1
  df <- 10
  gp_theta <- 1.2
  n_draws <- 4
  nknots <- 9
  matern_kappa <- 1.5
  n_data_points <- 50
  set.seed(1)

  s <- sim_glmmfields(df = df, n_draws = n_draws, gp_theta = gp_theta,
    gp_sigma = gp_sigma, sd_obs = sigma, n_knots = nknots, n_data_points = n_data_points,
    covariance = "matern", matern_kappa = matern_kappa)
 
  m <- glmmfields(y ~ 1, data = s$dat, time = "time",
    lat = "lat", lon = "lon", nknots = nknots,
    iter = 1, chains = 1,
    estimate_df = FALSE, fixed_df_value = df,
    covariance = "matern", matern_kappa = matern_kappa)

# > Unrecoverable error evaluating the log probability at the initial value.
# > Exception: multiply: Columns of m1 (9) and Rows of m2 (50) must match 
# > in size  Found before start of program.

That example is pushed as a unit test here: d6f94d9

Fix deep copy warnings

DIAGNOSTIC(S) FROM PARSER:
Info: left-hand side variable (name=SigmaOffDiag) occurs on right-hand side of assignment, causing inefficient deep copy to avoid aliasing.
Info: left-hand side variable (name=y_hat) occurs on right-hand side of assignment, causing inefficient deep copy to avoid aliasing.

https://discourse.mc-stan.org/t/deep-copy-warning/885/11

Is there a way to include an offset in a Poisson/NB model?

My spatial unit is US counties, and I would like to include population of the county as an offset in my NB model. Is there a way to do that with your package? I did not see it mentioned in the documentation.

predict does not return correct predictions if newdata specified

I thought I was going crazy. Model fit comparisons between correctly and incorrectly specified just weren't making sense (the incorrect model was often better). Then occasionally in some scripts they would make sense. Then they would stop working again.

I finally discovered the reason. I think the predict.rrfield function is somehow resorting the data if newdata is specified. It works correctly otherwise.

There's a simple unit test here that is broken: https://github.com/seananderson/rrfields/blob/bea28dda856bf64afe0d1c187139ac81c412e530/tests/testthat/test-predict.R

The solution isn't obvious to me yet, but I'm still digging into it.

Binomial regression

Hi,

Thank you for putting this package together.
I'm trying to fit a binomial model but unsure how to specify the number of failures (or total trials).

data:

> d_sample
  y2                        id      lat      lon   N
1 25 IND0070940639070004_Del_U 28.83111 77.16795 100
2 37 IND0070940639100003_Del_U 28.80293 77.12073 120
3 25 IND0070940639110007_Del_U 28.78427 77.20626 130
4 39 IND0070940639130005_Del_U 28.77438 77.16210 100
5 31 IND0070940639140018_Del_U 28.76010 77.14313 110
6 23 IND0070940639200027_Del_U 28.76169 77.01645  90

models:

I'm not specifying N here.

m1 <- glmmfields(y2 ~ as.factor(id), 
           data = d_sample,  family = binomial(), lat = "lat", lon = "lon", 
           nknots = 2, iter = 500,  
           prior_sigma = half_t(3, 0, 3),
           prior_gp_theta = half_t(3, 0, 10),
           prior_gp_sigma = half_t(3, 0, 3),)

gives

Chain 1: Rejecting initial value:
Chain 1:   Error evaluating the log probability at the initial value.
Chain 1: Exception: bernoulli_logit_lpmf: n[1] is 25, but must be in the interval [0, 1]  (in 'model_glmmfields' at line 224)

Surprised that is using the bernoulli_logit_lpmf instead of binomial_logit_lpmf.

Here's another model

> m2 <- glmmfields(cbind(y2, N - y2) ~ as.factor(id), 
              data = d_sample, family = binomial(), lat = "lat", lon = "lon", 
              nknots = 2, iter = 500,  
              prior_sigma = half_t(3, 0, 3),
              prior_gp_theta = half_t(3, 0, 10),
              prior_gp_sigma = half_t(3, 0, 3),)

gives

Error in new_CppObject_xp(fields$.module, fields$.pointer, ...) : 
  Exception: mismatch in dimension declared and found in context; processing stage=data initialization; variable name=stationID; position=0; dims declared=(12); dims found=(6)  (in 'model_glmmfields' at line 6)

Session info

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] yaml_2.2.1       pool_0.1.4.3     DBI_1.1.0        lubridate_1.7.9  readxl_1.3.1     lme4_1.1-23      Matrix_1.2-18    here_0.1        
 [9] styler_1.3.2     cowplot_1.0.0    broom_0.7.0      geojsonio_0.9.2  dplyr_1.0.0      ggplot2_3.3.2    glmmfields_0.1.4 Rcpp_1.0.5

Add regularized horseshoe prior for beta parameters (feature request)

Hello all,

I have a modeling task where I have quite a few potential predictors (11), but I need a simpler model and don't have a theoretical justification for simpler combinations of predictors. I'm interested in using a regularized horseshoe prior (e.g. rstanarm::hs) for regularization, but it isn't supported in glmmfields. Is it reasonable to you implement such a prior? An alternative approach to variable selection I'm interested in is projection predictive feature selection (i.e. projpred r package), but I imagine that is a bit more difficult to implement.

Thanks,
Connor

Problem error when running Binomial regression

Hi there, I found the following error when trying to run a binomial spatial regression using a data frame. Not sure what is the problem.

My code is:

m_glmmf <- glmmfields(cbind(y, trial - y)~X, data = dta$df, time = NULL,
family = binomial(link = "logit"),
lat = "lat", lon = "lon", nknots = 12,
iter = 2000, chains = 2, prior_beta = student_t(3, 0, 5),
prior_intercept = student_t(3, 0, 5),
covariance = "exponential",
control = list(adapt_delta = 0.95))

and the error I got:
Error in new_CppObject_xp(fields$.module, fields$.pointer, ...) :
Exception: mismatch in dimension declared and found in context; processing stage=data initialization; variable name=stationID; position=0; dims declared=(400); dims found=(200) (in 'model_glmmfields' at line 6)

CRAN Check Failure for Upcoming broom Release

Hi there! The broom dev team just ran reverse dependency checks on the upcoming broom 0.7.0 release and found new errors/test failures for the CRAN version of this package. I've pasted the results below, which seem to result from our planned deprecation of tidiers for mixed models.

Newly broken

checking dependencies in R code ... NOTE

Missing or unexported object: ‘broom::tidyMCMC’

The tidyMCMC function now lives in the broom.mixed package.

We hope to submit this new version of the package to CRAN in the coming weeks. If you encounter any problems fixing these issues, please feel free to reach out!🙂

Can glmmfields be used to fit 1-D models?

Someone asked me the other day about whether glmmfields can fit 1-D spatial models (like along a coastline). The answer is yes -- it's maybe not the fastest implementation of 1-D gaussian process models, but we can do that by setting latitude or longitude to some constant. For example,

s <- sim_glmmfields(n_draws = 12, n_knots = 12, gp_theta = 1.5,
gp_sigma = 0.2, sd_obs = 0.2)

s$dat$lon = 0 # set to arbitrary constant

m <- glmmfields(y ~ 0, time = "time",
lat = "lat", lon = "lon", data = s$dat,
nknots = 12, iter = 1000, chains = 4, seed = 1)

Absence of nb2_phi in negative binomial glmmfelds output

Hi,

I was running the glmmfields function using the nbinom2 family and the output doesn't seem to provide a value for the dispersion parameter (nb2_phi). I reran with the Gamma family to check the output and it did produce a CV value. I was wondering if the issue lies in the stan_pars function setting the nb2_phi using "nb2" rather than "nbinom2". Many thanks in advance.