
incidence-mapper's People

Contributors

devclinton, famulare, grhuynh, olesya13, tsibley


Forkers

meslubi2021

incidence-mapper's Issues

consider parallelizing the inla.make.lincomb for loop in latentFieldModel

Reference on parallel loops in R: https://www.r-bloggers.com/parallel-r-loops-for-windows-and-linux/

The inla.make.lincomb loop in latentFieldModel is slow for the GEOID + time models, and it could easily be parallelized. The loop body itself is unlikely to change, since it follows the INLA-recommended way to build complex linear combinations.

But this is far from urgent. While slow, it's still a small fraction of the time spent fitting the model itself.
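
A minimal sketch of one way to parallelize it with foreach/doParallel (not the current implementation; buildLincombWeights() and lincombTable are hypothetical stand-ins for whatever the existing loop body computes per iteration):

# Sketch only: build the linear combinations across worker processes.
library(foreach)
library(doParallel)

cl <- parallel::makeCluster(parallel::detectCores() - 1)
registerDoParallel(cl)

lincombs <- foreach(k = seq_len(nrow(lincombTable)), .packages = "INLA") %dopar% {
  weights <- buildLincombWeights(lincombTable[k, ])  # hypothetical helper: named list of weight vectors
  do.call(INLA::inla.make.lincomb, weights)
}

parallel::stopCluster(cl)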

clean up mishmash of camelCase and snake_case and r.case in codebase

This is low priority, and entirely my fault, but the R code is a mess of camelCase and snake_case for functions and variable names (and an occasional "r.case" inherited from INLA). All data comes in and out in snake_case, and the majority of collaborators either prefer snake_case or don't care. So, we should eventually clean up the R code to use snake_case exclusively.

DBI::dbConnect eventually times out on laptop and hangs on server

I'm unable to connect to the database from either my laptop (Windows 10) or my workstation (Ubuntu). On the laptop, it hangs for about a minute before throwing an error:


> rawData <- DBI::dbConnect(RPostgres::Postgres(), 
                     host=host_string, 
                     dbname = dbname_string, 
                     port=pgpass_file[[host_index]][2],  
                     user=pgpass_file[[host_index]][4], 
                     password=pgpass_file[[host_index]][5])

Error in connection_create(names(opts), as.vector(opts)) : 
  server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

On the Docker image on the server, it hangs (as in, I have to kill RStudio with htop) and never returns an error. In both cases, everything up to the dbConnect line works as far as I can tell.

clean up saveModel handling of model types

saveModel works, but it does a lot of redundant bookkeeping with model_type and latent=TRUE that duplicates model$modelDefinition$type.

The saveModel model type strings are also not the same as the incidenceMapR model type strings; this should be made clearer and more maintainable.

modelServR will need to handle more model types

Hooking up new model types, as will be required on branch https://github.com/seattleflu/incidence-mapper/tree/famulare/flu-vax-efficacy, is basically impossible with the current modelServR::saveModel design, which assumes two model types. Making saveModel more extensible and maintainable is necessary, and merging in the flu-vax-efficacy models is a good prompt to do so. However, the viz can't do anything with these models right now, and they are fast to run, so I don't need this functionality until after clean-alpha-deployment is finished.

dbConnect throws new error on IDMPPWK....

On IDMPPWK..., running this block

> Sys.setenv(PGSERVICEFILE = file.path(credentials_path, ".pg_service.conf"),
                PGPASSFILE    = file.path(credentials_path, ".pgpass"))

>     rawData <- DBI::dbConnect(RPostgres::Postgres(), service = "seattleflu-production")

yields this error:

WARNING: password file "/home/rstudio/seattle_flu/.pgpass" has group or world access; permissions should be u=rw (0600) or less
Error in connection_create(names(opts), as.vector(opts)) : 
  fe_sendauth: no password supplied

.pgpass and pg_service.conf are in the correct directory, and the script here does work:

#location of your pg_service and pgpass file - users should edit this as appropriate
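
The warning above is the likely culprit: libpq refuses to use a .pgpass file that has group or world access, which then surfaces as "fe_sendauth: no password supplied". A sketch of the fix from R, assuming credentials_path is defined as above:

# Tighten .pgpass permissions so libpq will read it (0600 = owner read/write only)
Sys.chmod(file.path(credentials_path, ".pgpass"), mode = "0600")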

API bugs

In the Swagger UI, this query returns a valid CSV but an empty JSON:

{
  "model_type": "inla",
  "observed": [
    "residence_census_tract",
    "site_type",
    "pathogen",
    "flu_shot",
    "sex",
    "encountered_week"
  ],
  "pathogen": [
    "all"
  ]
}

incidenceMapR formula builder should be its own function

smoothModel, latentFieldModel, (and effectsModel to come) share a lot of code for building the formula from the input table. There are some minor differences, but at some point, the formula logic (at least the random effects part, which I think is identical and likely to stay that way) should get pulled into its own function for better maintainability.
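
A rough sketch of what a shared helper could look like (the function name and the exact terms are illustrative, not existing code; GEOIDRow/timeRow follow the current index naming):

# Sketch: build the shared random-effects formula from the input table.
# `graph` must be available in the environment where the formula is used.
buildRandomEffectsFormula <- function(inputData, response = "count") {
  terms <- character(0)
  if ("GEOIDRow" %in% names(inputData)) {
    terms <- c(terms, "f(GEOIDRow, model = 'bym2', graph = graph)")
  }
  if ("timeRow" %in% names(inputData)) {
    terms <- c(terms, "f(timeRow, model = 'rw2')")
  }
  as.formula(paste(response, "~", paste(terms, collapse = " + ")))
}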

Add "spatial_domain" to model definition and lookup

I realize now we need to be explicit about the spatial domain of the models (like seattle, king_county, or wa_state), but this isn't being tracked anywhere.

The domain has to be added to modelDefinitions, and extracted/referenced in modelServR where appropriate, mirroring what we did in #53.

smoothModel: should factor levels be IID random effects instead of fixed effects?

Right now, I'm smoothing each factor level independently, given the latent field hyperparameters (one field replicate for each factor level). This is equivalent to a random intercept with infinite variance hyperprior.

It probably makes sense to treat levels as IID random effects, so that the model can pool when data is weak and approach the fixed-effect model when data is strong.
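
For concreteness, a hedged sketch of the two formulations in INLA syntax (variable names are placeholders):

# Current behaviour, roughly: independent smoothing per level via replicated fields,
# equivalent to per-level intercepts with an infinite-variance prior.
# formula_fixed <- count ~ factorLevel +
#   f(GEOIDRow, model = "bym2", graph = g, replicate = levelIdx)

# Proposed: IID random intercepts over levels, so levels pool when data are weak
# and approach the fixed-effect fit when data are strong.
formula_iid <- count ~ f(levelIdx, model = "iid") +
  f(GEOIDRow, model = "bym2", graph = g, replicate = levelIdx)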

Use ISO weeks for international consistency

Great to see the move to binning by week! I think we should be using ISO week definitions and the standard 2019-W02 format, not 2019_W02. ISO weeks are slightly different from CDC weeks, as described most succinctly:

The US CDC defines epidemiological week (known as MMWR week) as seven days beginning with Sunday and ending with Saturday. The WHO defines epidemiological week, based on ISO week, as seven days beginning with Monday and ending with Sunday. In either case, the end of the first epidemiological week of the year by definition must fall at least four days into the year. Most years have 52 epidemiological weeks, but some have 53.

It's my understanding that ISO weeks are used more widely around the world, and just like we use the metric system in science, it seems we should use international datetime standards as well.

Changing this line to:

db$iso_week <- format(db$encountered_date, "%G-W%V")

will use the strftime formatting codes for the ISO week-based year (%G) and ISO week number (%V).
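
A quick check of the format codes (note that %G/%V support depends on the platform's strftime):

# ISO week-based year (%G) and ISO week number (%V); both dates fall in ISO week 2019-W01
dates <- as.Date(c("2018-12-31", "2019-01-06"))
format(dates, "%G-W%V")
#> [1] "2019-W01" "2019-W01"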

New issue with permissions in RStudio server

As of this evening, when trying to open a project in RStudio, I get an error (screenshot not reproduced here).

This started before the rebuild of #33, so I have no idea why. It could be a VPN thing as I've only tried from home today, but I don't see why....

returnModel needs to be less fragile

The model lookup operation in returnModel is very fragile, as it requires an exact string match on a JSON query. This should be replaced with a proper artifact management database. Changes here are coupled with saveModel: #5

selectFromDB:: hook to real data

Once we have the handoff credentials set up, we need to add SQL hooks to the real database. We should also maintain the ability to connect to the simulated data for prototyping and model validation.

GEOID x time models need too much memory

Documenting for future reference.

Issue: 9276b73

King County at census tract level for 9 months and six pathogens requires at least 120GB of memory to run. I knew memory footprint would be an issue, but I didn't have a good way to estimate it, so now we know.

The short term approach to dealing with this is to do each pathogen one at a time, instead of all pathogens simultaneously. This doesn't allow borrowing across pathogens to estimate the latent field hyperparameters, but it's an open modeling question whether that's a good idea or not, and benchmarking against simulated data is a longer-term to-do.


Turns out the problem was greatly exacerbated by a major bug in appendSmoothModel fixed here: fdeeb1d

The issue of the appropriateness of borrowing power across pathogens still stands.

WHERE ... IN statements in model requests are required for the pathogen field

Is it possible to get IN statements into the model lookup soon?

(Bad JSON, but like)

{
 "model_type": "smooth",
 "observed": [
   "pathogen": ["all"],
   "residence_cra_name",
   "site_type"
 ]
}

This is deeply important for pathogen, but is optional for now on any other field, as it's part of the deployed model definition:

WHERE =list(COLUMN='pathogen', IN=PATHOGEN),

factors in smoothModel should be handled as multiple likelihoods

Comment here:

# factors interactions? I don't think we want factor interactions in smoothing, but this is placeholder to think about it...

In modelTestR::testLatentField (at the "# first, joint catchment maps" section), I found that using multiple likelihoods (http://www.r-inla.org/models/tools#TOC-Models-with-more-than-one-type-of-likelihood) with replicated random effects works really nicely for independently smoothing each factor level. It's fast and less prone to overfitting or underfitting.
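
For reference, a minimal sketch of the multiple-likelihood pattern with a replicated random effect (column names, the besag model, and the graph g are placeholders, not the testLatentField code):

# Two factor levels smoothed independently via two Poisson likelihoods that share
# latent-field hyperparameters through a replicated spatial effect.
library(INLA)

dA <- subset(df, level == "A"); dB <- subset(df, level == "B")
nA <- nrow(dA); nB <- nrow(dB)

# Response matrix: one column per likelihood, NA elsewhere.
Y <- matrix(NA, nrow = nA + nB, ncol = 2)
Y[seq_len(nA), 1] <- dA$count
Y[nA + seq_len(nB), 2] <- dB$count

stacked <- list(Y = Y,
                GEOIDRow = c(dA$GEOIDRow, dB$GEOIDRow),
                levelIdx = rep(1:2, times = c(nA, nB)))

fit <- inla(Y ~ f(GEOIDRow, model = "besag", graph = g, replicate = levelIdx),
            family = c("poisson", "poisson"),
            data = stacked)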

latentFieldModel - should catchment be an offset or a covariate?

I'm estimating the samplingLocation catchments with the all-disease sampling location maps. If we believe the number of people sampled is proportional to the total population, then the log(catchments) should enter as an offset (or an "E" variable for INLA-poisson).

But if we aren't sure how well the number of participants represents the true denominator, it's more reasonable to input log(catchment) as a covariate and see what the data say about the impact of catchment on disease. This is what is currently implemented in the code, but it's an open question which approach is more valid, both in theory (to me) and pending future empirical validation through improved survey design.
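
To make the two options concrete, a sketch in INLA terms (variable and column names are placeholders for the actual model data):

# (a) catchment as exposure/offset: counts assumed proportional to catchment
fit_offset <- inla(count ~ 1 + f(GEOIDRow, model = "bym2", graph = g),
                   family = "poisson", E = modelData$catchment, data = modelData)

# (b) catchment as covariate: let the data estimate its effect (current implementation)
fit_covar <- inla(count ~ 1 + logCatchment + f(GEOIDRow, model = "bym2", graph = g),
                  family = "poisson", data = modelData)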

fix puma indexing to handle multiple standard column names

Some shapefile data sources label PUMA columns as PUMA5CE (King County) and others as PUMACE10 (R::tigris), but they contain the same 5-digit codes. We could either standardize this in the shapes or in the code that reads them. I lean toward doing it in the code, as I'd prefer not to break standards from public databases where possible, even if the standards are incompatible between public sources.
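
A sketch of doing it in the code that reads the shapes (the target name puma_geoid is arbitrary):

# Normalize whichever PUMA column a source provides to a single internal name.
normalizePumaColumn <- function(shp, target = "puma_geoid") {
  known <- c("PUMA5CE", "PUMACE10")
  hit <- intersect(known, names(shp))
  if (length(hit) == 1) names(shp)[names(shp) == hit] <- target
  shp
}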

Dockerfile.RBuildEnv issues with dependencies

geojsonio isn't installing properly in the Docker image. Several packages' system dependency libraries aren't coming through, despite having been added to the build environment file.

Called interactively:

> install.packages('geojsonio')

* installing *source* package ‘protolite’ ...
** package ‘protolite’ successfully unpacked and MD5 sums checked
Package protobuf was not found in the pkg-config search path.
Perhaps you should add the directory containing `protobuf.pc'
to the PKG_CONFIG_PATH environment variable
Package 'protobuf', required by 'world', not found
Using PKG_CFLAGS=
Using PKG_LIBS=-lprotobuf
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because protobuf was not found. Try installing:
 * deb: libprotobuf-dev (Debian, Ubuntu, etc)
 * rpm: protobuf-devel (Fedora, EPEL)
 * csw: protobuf_dev (Solaris)
 * brew: protobuf (OSX)
If protobuf is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a protobuf.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package ‘protolite’
* removing ‘/usr/local/lib/R/site-library/protolite’
Warning in install.packages :
  installation of package ‘protolite’ had non-zero exit status
* installing *source* package ‘V8’ ...
** package ‘V8’ successfully unpacked and MD5 sums checked
Using PKG_CFLAGS=-I/usr/include/v8 -I/usr/include/v8-3.14
Using PKG_LIBS=-lv8 -lv8_libplatform
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libv8 was not found. Try installing:
 * deb: libv8-dev or libnode-dev (Debian / Ubuntu)
 * rpm: v8-devel (Fedora, EPEL)
 * brew: v8 (OSX)
 * csw: libv8_dev (Solaris)
To use a custom libv8, set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package ‘V8’
* removing ‘/usr/local/lib/R/site-library/V8’
Warning in install.packages :
  installation of package ‘V8’ had non-zero exit status
* installing *source* package ‘jqr’ ...
** package ‘jqr’ successfully unpacked and MD5 sums checked
Using PKG_CFLAGS=
Using PKG_LIBS=-ljq
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libjq was not found.
On Ubuntu 14.04 or 16.04 you can use the PPA:
  sudo add-apt-repository -y ppa:opencpu/jq
  sudo apt-get update
  sudo apt-get install libjq-dev
On other sytems try installing:
 * deb: libjq-dev (Debian, Ubuntu 16.10 and up).
 * rpm: jq-devel (Fedora, EPEL)
 * csw: libjq_dev (Solaris)
 * brew: jq (OSX)
If libjq is already installed set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package ‘jqr’
* removing ‘/usr/local/lib/R/site-library/jqr’
Warning in install.packages :
  installation of package ‘jqr’ had non-zero exit status
ERROR: dependencies ‘protolite’, ‘jqr’ are not available for package ‘geojson’
* removing ‘/usr/local/lib/R/site-library/geojson’
Warning in install.packages :
  installation of package ‘geojson’ had non-zero exit status
ERROR: dependencies ‘V8’, ‘geojson’, ‘jqr’ are not available for package ‘geojsonio’
* removing ‘/usr/local/lib/R/site-library/geojsonio’
Warning in install.packages :
  installation of package ‘geojsonio’ had non-zero exit status
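
Based on the package names in the ANTICONF messages above, something like the following in Dockerfile.RBuildEnv should cover these system dependencies (assuming a Debian/Ubuntu base image; untested):

RUN apt-get update && apt-get install -y \
    pkg-config \
    libprotobuf-dev \
    libv8-dev \
    libjq-dev \
 && rm -rf /var/lib/apt/lists/*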

Incorporate census demographic data

We need to pull in census demographic data to move toward understanding denominators and eventually producing post-stratified incidence estimates for different population groupings, instead of just the latent field estimate.

This will require additional data curation and data-model integration code.

RDS saving/fetching

  • Allow uploading RDS files with model store
  • Allow downloading RDS files from model store

saveModel needs to handle multiple model types

Right now, saveModel can only output a single CSV. This works for smoothing models, but probably will not work for latentField models, in which there are both smoothing outputs and latentField outputs. I'm going to try to keep everything in one file, as that makes provenance easier to track and later retrieval for manipulation in REACT easier, but I'm not sure that's possible.

reduce disk footprint of large RDS model saves

The GEOID x time inla model objects can take up to 2GB on disk when saved as RDS. There are plenty of inefficiencies and redundancies in the model object right now, so we should eventually try to shrink it.
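
Two low-effort things worth trying first (a sketch; whether the dropped pieces are truly unused downstream is an assumption to verify):

# 1. Heavier compression when saving (slower, but much smaller on disk)
saveRDS(model, file = model_path, compress = "xz")

# 2. Strip bulky INLA internals before saving, if nothing downstream reads them;
#    `fit` stands for the raw inla() result stored inside the model object.
fit$misc$configs <- NULL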

fix ridiculous GEOIDRow and timeRow_GEOID grouped variable naming convention

The naming convention I chose for indices on grouped random effects is very annoying to work with for linear combinations of latent fields in incidenceMapR::latentFieldModel.R. I finished wrestling with it for now, but this needs a refactor everywhere it appears.

The natural time to do it is when we hook into real data, as there is other column-naming stuff that needs reworking then.

incidenceMapR probably can't do besag/bym2 models with shapes other than GEOID

So far, I've been assuming that we'll only do neighbor smoothing (Besag/BYM2/CAR) at the GEOID level, and that higher levels just get iid random effects pooling. We probably want the option to explore higher scale smoothing. While it's possible the R stack will "just work" with non-GEOID shapefiles, I expect it won't and we'll have to clean it up.

Related: #4
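
If/when we do want besag/bym2 on other shapes, building the neighbor graph itself is straightforward; a sketch with sf/spdep follows (paths and names are placeholders). The harder part is the index bookkeeping inside incidenceMapR.

# Build an INLA-readable neighbor graph from an arbitrary polygon shapefile.
library(sf)
library(spdep)

shapes <- st_read("data/other_shapes.shp")    # placeholder path
nb <- poly2nb(shapes)                         # contiguity neighbors
nb2INLA("other_shapes.graph", nb)             # write graph file for INLA
g <- INLA::inla.read.graph("other_shapes.graph")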

Determine best workflow for model deployment

If the underlying model packages haven't changed but new trained model binaries are produced, we need a good deployment procedure so these can be deployed easily. Ideally, it should be possible to do this atomically from the end of a model training session, e.g.:

trained_model
.....
upload_new_model asadsasd34424234234234 model.rds
