

incidence-mapper

seattleflu/incidence-mapper performs geospatial modeling of epidemiological and incidence data using methods based on R-INLA, and provides an API service to deliver modeled data to seattleflu-viz (to come).

R packages encapsulate key aspects of the workflow.

offline model training

  • dbViewR provides functions to retrieve data from the research database and geospatial data repositories and format it for downstream use.
  • incidenceMapR is the core modeling package. It takes data from dbViewR, parses it to define models, runs the model training code (which can take a while!), formats the output for downstream use, and saves and registers trained model artifacts for future recall.
  • modelVisualizeR provides a few simple plotting functions and scripts for visualizing model outputs locally.

webservice to deliver model results

  • modelServR interacts with the api_service to find requested model data and return it to the visualization service. (This may migrate over to the Hutch or into the python layer in the future.)

workflow software testing

  • workflowTestR is a placeholder for unit testing of the incidence-mapper workflow.

Cognitive tasks

Incidence-mapper exists to perform three different classes of routine tasks on enrollment and pathogen incidence data.

  • smooth and interpolate among observed data to regularize sparse observations. This is useful for providing more accurate estimates of mean and variance at smaller spatial/temporal/age scales than provided by raw counts/fractions, but may often be redundant with raw data binning at larger scales.
  • latent field inference to infer population-wide infectious disease processes that are driving our observations. This is the most challenging and most powerful aspect of the modeling, and is critical for proper public health understanding. For example, what is the space-time series of flu incidence in the whole population, given the non-representative space-and-age distributions of study participants? In this language, the true incidence is a latent field (set of unobserved variables with some structure) that drives the observed data.
  • quantify effects of factors on outcomes using generalized linear regression, after adjusting for confounders (i.e., covariates that aren't of interest). This is useful for individual-level understanding of things like vaccine efficacy, and for predicting individual priors for DeDx based on measured covariates. This is not a priority for April, as it applies to individuals and isn't the fundamental "flu map".

Model structure

Model definition

Models are defined by:

  • the pathogen of interest, either a specific one from the database (assuming it gets populated..., like FLU_A_H1 or h3n2) or all (for all samples regardless of pathogen), or unknown (the default I have until the taqman data is made available).
  • the strata (/facets/covariates), like [encountered_week, residence_puma] or [flu_shot, residence_census_tract]
  • either the type of model (smooth or latent) or, if model type makes little sense on your end, the outcome you want models of: count and fraction (observed), or intensity (latent).
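For example, a hypothetical request for a smooth model faceted by site and week might look like the JSON queries used elsewhere in this repo (field names here are illustrative, not a fixed schema):

```json
{
  "model_type": "smooth",
  "observed": [
    "residence_census_tract",
    "site_type",
    "encountered_week"
  ],
  "pathogen": [
    "all"
  ]
}
```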

Timeseries models

All examples included in this commit are timeseries models by encountered_week. Neither model type needs to include time in general, but all the models for May 22 will. To collapse over time and produce a static map, it is approximately valid to average outcomes over time. Long-term, there are more interesting, and more correct, things to do, but averaging isn't bad and retains interesting meaning for people exploring the data.

Smooth models

Smooth models are interpolations and extrapolations from the observed data. For the purposes of visualization, these models complement raw data summaries in two ways:

  1. They use the statistical properties of the entire sample to regularize estimates for small areas and times. This goes by many names in the statistical modeling literature, but two useful ones are "partial pooling" and "small area estimation". The benefit of partially-pooled estimates is that they reduce the influence of small-sample statistics on visual outlier detection. (For example, we shouldn't really believe that places with no measurements have no incidence.)
  2. For human comprehension, they "fill in the map" to emphasize larger-scale patterns that are more likely to be real (as in, there's a lot of aggregate evidence) and down-play local variation for small samples. This aspect will be useful for understanding where and when we gathered data, so we can rationally plan to expand next year.
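A hedged sketch of what partial pooling does (my notation, not the exact model used here): the smoothed estimate for a small area shrinks toward the overall mean in proportion to how little data that area has,

```latex
\hat{\theta}_i = w_i \,\bar{y}_i + (1 - w_i)\,\bar{y},
\qquad
w_i = \frac{n_i}{n_i + \kappa},
```

where $\bar{y}_i$ is the raw estimate in area $i$, $n_i$ is its sample size, $\bar{y}$ is the overall mean, and $\kappa$ is governed by the estimated between-area variance. Areas with few observations (small $n_i$) are pulled strongly toward the pooled mean, which is why partially-pooled maps don't report zero incidence for places with no measurements.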

The smooth models in this repo are faceted by factors about our participants, like site_type (where the sample was taken), sex (male or female), and flu_shot (self-reported "have you had flu vaccine in last year?" or records, 1=yes, 0=no). It would be nice to have a selector for factors to facet by, but tell me if you want the deployed models to have fewer facets. From my point of view, the only non-negotiable facet is site_type, as differences in the total counts and residence locations captured by our different collection modes are very important to our study partners for understanding what we collected this year.

Latent field models

Latent field models represent an inference of the underlying force of infection in the total population over space and time, after adjusting for features of our sampling process and factors associated with our participants. The "latent field" is estimated during the estimation of the observed models above. It represents a model of the residual variation in the data that is not explained by observed factors like site_type, sex, or flu_shot. From the inferred latent field, we can produce an output I'm calling modeled_intensity, which is an un-normalized estimate of total population incidence. I'm not yet producing normalized incidence estimates because we (1) haven't linked to census population data yet and (2) haven't worked out what we can about "denominator data" for our sample (what fraction of people who could've participated were sick enough to participate and willing to enroll?). Regardless, the relative information in the modeled_intensity, both in time and space, can be interpreted similarly to true incidence, and through the modeled intensity, we get a picture of the dynamics of transmission.

The most important thing the latent field model adjusts for is an estimate of the "catchment" of each site_type. The catchment of each site estimates how likely people at each residence location are to have participated in our study, independent of whether they are sick with the pathogen of interest.
For example, to estimate the catchment of kiosk collection for h1n1pdm, we infer a smooth model for the expected number of participants in each residence location with all non-h1n1pdm infections who participate in kiosk sampling, aggregated over the entire study duration (no time-dependence). We take this as a measure of the rate at which people would interact with a kiosk, independent of being sick with h1n1pdm, the pathogen of interest, and averaged over the variations in space and time of the many other pathogens that also drive participation.
The key assumptions are that (1) averaging over the dynamics of many pathogens that are (2) independent of the specific pathogen being modeled reveals the underlying access to kiosks and willingness to participate of people in each mapped residence location.

Given the estimated catchment of each site_type, the observed number of h1n1pdm cases over space and time is modeled relative to the catchment. In places where we get a lot of samples for lots of reasons, a few h1n1pdm samples represent a low intensity of transmission. But in places where we get few samples other than h1n1pdm, we infer that flu intensity at that residence location and time is high. The latent field model collectively estimates this intensity across the whole map, thus inferring space-time properties of the total population epidemic.
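Schematically (my notation; the actual INLA specification differs in detail), the catchment adjustment above amounts to modeling observed counts as

```latex
y_{st} \sim \mathrm{Poisson}\!\left( c_{s} \, \lambda_{st} \right),
```

where $y_{st}$ is the observed case count at residence location $s$ and week $t$, $c_{s}$ is the estimated catchment, and $\lambda_{st}$ is the latent modeled_intensity. The same count $y_{st}$ implies a high $\lambda_{st}$ where the catchment is small, and a low one where it is large.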

Year 2 to-do

Provide science documentation

What are the models (in words, equations, and code) and what do they do?

Age-dependent models

Depending on the outcome of interest, age is either like a factor or like time, and so the models and viz will need to depend on the cognitive task.

Improve vaccine-efficacy models

These will be an example of a different cognitive task, where the inference of interest is a parameter and not a direct transformation of data. Line, bar, and map chart viz components will likely be similar, but context will need to be different. This and other intervention effectiveness summaries will be really important for Year 2.

Incorporate census demographic data

  • locate appropriate data source
  • add query and formatting to dbViewR
  • incorporate in incidenceMapR where appropriate

incidenceMapR improvements

  • adjacency networks
    • define lowest spatial scale based on base shapes, instead of residence_census_tract (census tract) only
    • allow adjacency network models beyond default contiguous nearest-neighbor
  • many more!

incidence-mapper's People

Contributors

devclinton, famulare, grhuynh, olesya13, tsibley


incidence-mapper's Issues

New issue with permissions in RStudio server

As of this evening, when trying to open a project in RStudio, I get an error (screenshot omitted from this export).

This started before the rebuild of #33, so I have no idea why. It could be a VPN thing as I've only tried from home today, but I don't see why....

selectFromDB:: hook to real data

Once we have the handoff credentials set up, we need to add SQL hooks to the real database. We should also maintain the ability to connect to the simulated data for prototyping and model validation.

returnModel needs to be less fragile

The model lookup operation in returnModel is very fragile, as it requires an exact string match on a JSON query. This should be replaced with a proper artifact management database. Changes here are coupled with saveModel: #5

consider parallelizing the inla.make.lincomb for loop in latentFieldModel

parallel loops in R; https://www.r-bloggers.com/parallel-r-loops-for-windows-and-linux/

The loop around here is slow for the GEOID + time models, and it could easily be parallelized. The loop itself is unlikely to change, as it follows the INLA-recommended way to do a complex linear combinations pass.

But this is far from urgent. While slow, it's still small time compared to fitting the model itself.

smoothModel: should factor levels be IID random effects instead of fixed effects?

Right now, I'm smoothing each factor level independently, given the latent field hyperparameters (one field replicate for each factor level). This is equivalent to a random intercept with infinite variance hyperprior.

It probably makes sense to treat levels as IID random effects, so that the model can pool if data is weak, and asymptote to the fixed effect model when data is strong.

Determine best workflow for model deployment

If the underlying model packages did not update but new trained model binaries are produced, we need a good deployment procedure so these can be deployed easily. Ideally, it should be doable atomically from the end of a model training session, i.e.:

trained_model
.....
upload_new_model asadsasd34424234234234 model.rds

latentFieldModel - should catchment be an offset or a covariate?

I'm estimating the samplingLocation catchments with the all-disease sampling location maps. If we believe the number of people sampled is proportional to the total population, then the log(catchments) should enter as an offset (or an "E" variable for INLA-poisson).

But, if we aren't sure how the number of participants represents the true denominator, it's more reasonable to input log(catchment) as a covariate and see what the data say about the impact of the catchment on disease. This is what is implemented in the code currently, but it's an open question which is more valid (both in theory, to me, and pending future empirical validation through improved survey design).
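In symbols (my notation), the two options differ in whether the catchment coefficient is fixed or estimated:

```latex
\text{offset:}\quad \log \mu_{st} = \log c_{s} + \eta_{st},
\qquad
\text{covariate:}\quad \log \mu_{st} = \beta \log c_{s} + \eta_{st},
```

where $\mu_{st}$ is the expected count, $c_{s}$ the catchment, $\eta_{st}$ the latent field plus other covariates, and $\beta$ an estimated coefficient. The offset form is the special case $\beta = 1$; letting the data estimate $\beta$ hedges against the catchment not being proportional to the true denominator.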

API bugs

In the Swagger UI, this query returns a valid csv but an empty json:

{
  "model_type": "inla",
  "observed": [
    "residence_census_tract",
    "site_type",
    "pathogen",
    "flu_shot",
    "sex",
    "encountered_week"
  ],
  "pathogen": [
    "all"
  ]
}

clean up saveModel handling of model types

saveModel works, but does a lot of multiple counting with model_type and latent=TRUE that is also redundant with model$modelDefinition$type.

The saveModel model type strings are also not the same as the incidenceMapR model type strings; this should be made clearer and more maintainable.

incidenceMapR probably can't do besag/bym2 models with shapes other than GEOID

So far, I've been assuming that we'll only do neighbor smoothing (Besag/BYM2/CAR) at the GEOID level, and that higher levels just get iid random effects pooling. We probably want the option to explore higher scale smoothing. While it's possible the R stack will "just work" with non-GEOID shapefiles, I expect it won't and we'll have to clean it up.

Related: #4

Incorporate census demographic data

We need to pull in census demographic data to move toward understanding denominators and eventually producing post-stratified incidence estimates for different population groupings, instead of just the latent field estimate.

This will require additional data curation and data-model integration code.

RDS saving/fetching

  • Allow uploading RDS files with model store
  • Allow downloading RDS files from model store

GEOID x time models need too much memory

Documenting for future reference.

Issue: 9276b73

King County at the census tract level for 9 months for six pathogens requires at least 120GB of memory to run. I knew the memory footprint would be an issue, but I didn't have a good way to estimate it, so now we know.

The short term approach to dealing with this is to do each pathogen one at a time, instead of all pathogens simultaneously. This doesn't allow borrowing across pathogens to estimate the latent field hyperparameters, but it's an open modeling question whether that's a good idea or not, and benchmarking against simulated data is a longer-term to-do.


Turns out the problem was greatly exacerbated by a major bug in appendSmoothModel fixed here: fdeeb1d

The issue of the appropriateness of borrowing power across pathogens still stands.

incidenceMapR formula builder should be its own function

smoothModel, latentFieldModel, (and effectsModel to come) share a lot of code for building the formula from the input table. There are some minor differences, but at some point, the formula logic (at least the random effects part, which I think is identical and likely to stay that way) should get pulled into its own function for better maintainability.

reduce disk footprint of large RDS model saves

The geoid x time inla model objects can take up to 2GB on disk when saved as RDS. There's plenty of inefficiencies and redundancies in the model object right now, so we should try to shrink it, eventually.

WHERE ... IN statements to model requests are required for pathogen field

Is it possible to get IN statements into the model lookup soon?

(Bad JSON, but like)

{
 "model_type": "smooth",
 "observed": [
   "pathogen": ["all"],
   "residence_cra_name",
   "site_type"
 ]
}

This is deeply important for pathogen, but is optional for now on any other field, as it's part of the deployed model definition:

WHERE =list(COLUMN='pathogen', IN=PATHOGEN),

saveModel needs to handle multiple model types

Right now, saveModel can only output a single csv. This works for smoothingModels, but probably will not work for latentField models in which there are both smoothing outputs and latentField outputs. I'm going to try to keep everything in one file as it makes provenance easier to track and retrieval for later manipulation in REACT easier, but I'm not sure that's possible.

Add "spatial_domain" to model definition and lookup

I realize now we need to be explicit about the spatial domain of the models (like seattle, king_county, or wa_state), but this isn't being tracked anywhere.

The domain has to be added to modelDefinitions, and extracted/referenced in modelServR where appropriate, mirroring what we did in #53.

DBI::dbConnect eventually times out on laptop and hangs on server

I'm unable to connect to the database from either my laptop (Windows 10) or my workstation (Ubuntu). On the laptop, it hangs for about a minute before throwing an error:


rawData <- DBI::dbConnect(RPostgres::Postgres(),

> rawData <- DBI::dbConnect(RPostgres::Postgres(), 
                     host=host_string, 
                     dbname = dbname_string, 
                     port=pgpass_file[[host_index]][2],  
                     user=pgpass_file[[host_index]][4], 
                     password=pgpass_file[[host_index]][5])

Error in connection_create(names(opts), as.vector(opts)) : 
  server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

On the docker image on the server, it hangs (as in, I have to kill RStudio with htop) and never returns the error. In both cases, everything up to the dbConnect line works as far as I can tell.

Dockerfile.RBuildEnv issues with dependencies

geojsonio isn't installing properly in the docker image. Multiple packages' dependency libraries aren't coming through, despite having been added to the build environment file.

Called interactively:

> install.packages('geojsonio')

* installing *source* package ‘protolite’ ...
** package ‘protolite’ successfully unpacked and MD5 sums checked
Package protobuf was not found in the pkg-config search path.
Perhaps you should add the directory containing `protobuf.pc'
to the PKG_CONFIG_PATH environment variable
Package 'protobuf', required by 'world', not found
Using PKG_CFLAGS=
Using PKG_LIBS=-lprotobuf
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because protobuf was not found. Try installing:
 * deb: libprotobuf-dev (Debian, Ubuntu, etc)
 * rpm: protobuf-devel (Fedora, EPEL)
 * csw: protobuf_dev (Solaris)
 * brew: protobuf (OSX)
If protobuf is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a protobuf.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package ‘protolite’
* removing ‘/usr/local/lib/R/site-library/protolite’
Warning in install.packages :
  installation of package ‘protolite’ had non-zero exit status
* installing *source* package ‘V8’ ...
** package ‘V8’ successfully unpacked and MD5 sums checked
Using PKG_CFLAGS=-I/usr/include/v8 -I/usr/include/v8-3.14
Using PKG_LIBS=-lv8 -lv8_libplatform
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because  was not found. Try installing:
 * deb: libv8-dev or libnode-dev (Debian / Ubuntu)
 * rpm: v8-devel (Fedora, EPEL)
 * brew: v8 (OSX)
 * csw: libv8_dev (Solaris)
To use a custom libv8, set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package ‘V8’
* removing ‘/usr/local/lib/R/site-library/V8’
Warning in install.packages :
  installation of package ‘V8’ had non-zero exit status
* installing *source* package ‘jqr’ ...
** package ‘jqr’ successfully unpacked and MD5 sums checked
Using PKG_CFLAGS=
Using PKG_LIBS=-ljq
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libjq was not found.
On Ubuntu 14.04 or 16.04 you can use the PPA:
  sudo add-apt-repository -y ppa:opencpu/jq
  sudo apt-get update
  sudo apt-get install libjq-dev
On other sytems try installing:
 * deb: libjq-dev (Debian, Ubuntu 16.10 and up).
 * rpm: jq-devel (Fedora, EPEL)
 * csw: libjq_dev (Solaris)
 * brew: jq (OSX)
If  is already installed set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package ‘jqr’
* removing ‘/usr/local/lib/R/site-library/jqr’
Warning in install.packages :
  installation of package ‘jqr’ had non-zero exit status
ERROR: dependencies ‘protolite’, ‘jqr’ are not available for package ‘geojson’
* removing ‘/usr/local/lib/R/site-library/geojson’
Warning in install.packages :
  installation of package ‘geojson’ had non-zero exit status
ERROR: dependencies ‘V8’, ‘geojson’, ‘jqr’ are not available for package ‘geojsonio’
* removing ‘/usr/local/lib/R/site-library/geojsonio’
Warning in install.packages :
  installation of package ‘geojsonio’ had non-zero exit status

dbConnect throws new error on IDMPPWK....

On IDMPPWK..., running this block

> Sys.setenv(PGSERVICEFILE = file.path(credentials_path, ".pg_service.conf"),
                PGPASSFILE    = file.path(credentials_path, ".pgpass"))

>     rawData <- DBI::dbConnect(RPostgres::Postgres(), service = "seattleflu-production")

yields this error:

WARNING: password file "/home/rstudio/seattle_flu/.pgpass" has group or world access; permissions should be u=rw (0600) or less
Error in connection_create(names(opts), as.vector(opts)) : 
  fe_sendauth: no password supplied

.pgpass and pg_service.conf are in the correct directory, and the script here does work:

#location of your pg_service and pgpass file - users should edit this as appropriate

modelServR will need to handle more model types

Hooking up new model types, as will be required on branch https://github.com/seattleflu/incidence-mapper/tree/famulare/flu-vax-efficacy, is basically impossible with the current modelServR::saveModel design, which assumes two model types. Fixing saveModel to be more extensible and maintainable is necessary, and merging in the flu-vax-efficacy models is a good prompt to do so. However, the viz can't do anything with these models right now, and they are fast to run, so I don't need this functionality until after clean-alpha-deployment is finished.

Use ISO weeks for international consistency

Great to see the move to binning by week! I think that we should be using ISO week definitions and the standard 2019-W02 format not 2019_W02. ISO weeks are slightly different from CDC weeks, as described most succinctly:

The US CDC defines epidemiological week (known as MMWR week) as seven days beginning with Sunday and ending with Saturday. The WHO defines epidemiological week, based on ISO week, as seven days beginning with Monday and ending with Sunday. In either case, the end of the first epidemiological week of the year by definition must fall at least four days into the year. Most years have 52 epidemiological weeks, but some have 53.

It's my understanding that ISO weeks are used more widely around the world, and just like we use the metric system in science, it seems we should use international datetime standards as well.

Changing this line to do:

db$iso_week <- format(db$encountered_date, "%G-W%V")

will use the datetime (strftime) formatting codes for ISO week year and week number.
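For cross-checking, the same ISO week formatting codes work in Python's strftime (a quick sketch, not part of this codebase):

```python
from datetime import date

# %G is the ISO week-based year, %V the ISO week number (01-53).
# 2019-01-08 falls in ISO week 2 of 2019: week 1 is the week
# containing the year's first Thursday (2019-01-03).
print(date(2019, 1, 8).strftime("%G-W%V"))    # 2019-W02

# Near year boundaries, ISO year and calendar year can differ:
# 2018-12-31 is a Monday and belongs to ISO week 1 of 2019.
print(date(2018, 12, 31).strftime("%G-W%V"))  # 2019-W01
```

The second example is exactly why %G (ISO week-based year) must be used instead of %Y when formatting ISO weeks.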

factors in smoothModel should be handled as multiple likelihoods

Comment here:

# factors interactions? I don't think we want factor interactions in smoothing, but this is placeholder to think about it...

In modelTestR::testLatentField (at the "# first, joint catchment maps" comment), I found that using multiple likelihoods (http://www.r-inla.org/models/tools#TOC-Models-with-more-than-one-type-of-likelihood) with replicated random effects works really nicely for independently smoothing each factor level. It's fast and less prone to overfitting or under-fitting.

fix ridiculous GEOIDRow and timeRow_GEOID grouped variable naming convention

The naming convention I chose for indices on grouped random effects is very annoying to work with for linear combinations of latent fields in incidenceMapR::latentFieldModel.R. I finished wrestling with it for now, but this needs a refactor everywhere it appears.

The natural time to do it is when we hook into real data, as there is other column-naming stuff that needs reworking too.

fix puma indexing to handle multiple standard column names

Some shapefile data sources label PUMA columns as PUMA5CE (King County) and others as PUMACE10 (R::tigris), but they have the same 5-digit codes. We could standardize this either in the shapes or in the code that reads them. I lean toward the code, as I'd like not to break standards from public databases where possible, even if the standards are incompatible between public sources.
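One way the read-time standardization could look (a hedged sketch in Python; the actual packages are R, and the alias set and target name "puma" are my choices for illustration):

```python
# Known aliases for the 5-digit PUMA code column across sources.
PUMA_ALIASES = ("PUMA5CE", "PUMACE10")

def normalize_puma_column(columns):
    """Return a rename mapping that maps whichever PUMA alias is
    present to a single standard name, leaving other columns alone."""
    renames = {c: "puma" for c in columns if c in PUMA_ALIASES}
    if len(renames) != 1:
        raise ValueError(
            f"expected exactly one PUMA column, found {sorted(renames)}")
    return renames

# e.g. tigris-style columns:
print(normalize_puma_column(["GEOID10", "PUMACE10", "NAME10"]))
# {'PUMACE10': 'puma'}
```

Because the source files are left untouched, each public provider's own column standard survives; only our in-memory tables get the uniform name.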

clean up mishmash of camelCase and snake_case and r.case in codebase

This is low priority, and entirely my fault, but the R code is a mess of camelCase and snake_case for functions and variable names (and an occasional "r.case" inherited from INLA). All data comes in and out in snake_case, and the majority of collaborators either prefer snake_case or don't care. So, we should eventually clean up the R code to use snake_case exclusively.
