seattleflu / incidence-mapper
R interface to database, map model training, and model data API Server
License: MIT License
Parallel loops in R: https://www.r-bloggers.com/parallel-r-loops-for-windows-and-linux/
The loop around here is slow for the GEOID + time models, and it could easily be parallelized. The loop itself is unlikely to change, as it follows the INLA-recommended way to do a complex linear-combinations pass. But this is far from urgent: while slow, it's still small compared to the time spent fitting the model itself.
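A minimal sketch of what parallelizing such a loop could look like, using only the base `parallel` package so it works on both Windows and Linux (matching the r-bloggers link above). The `make_lincomb_row` body is a hypothetical placeholder, not the real INLA linear-combination step:

```r
library(parallel)

# Hypothetical stand-in for the per-GEOID linear-combination step;
# the real body lives in the INLA linear-combinations loop.
make_lincomb_row <- function(i) {
  sum(seq_len(i))  # placeholder computation
}

# Serial version (current behavior)
serial <- lapply(1:100, make_lincomb_row)

# Parallel version: a PSOCK cluster works on both Windows and Linux
cl <- makeCluster(2)
parallel_res <- parLapply(cl, 1:100, make_lincomb_row)
stopCluster(cl)

identical(serial, parallel_res)  # TRUE
```

One caveat: whether the fitted INLA object can be shipped to workers cheaply would need checking, since large model objects make the serialization overhead significant.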
This is low priority, and entirely my fault, but the R code is a mess of camelCase and snake_case for functions and variable names (and an occasional "r.case" inherited from INLA). All data comes in and out in snake_case, and the majority of collaborators either prefer snake_case or don't care. So, we should eventually clean up the R code to use snake_case exclusively.
Rename model store api from pathogen_models to models
I'm unable to connect to the database from either my laptop (Window10) or my workstation (Ubuntu). On the laptop, it hangs for about a minute before throwing an error:
incidence-mapper/dbViewR/R/selectFromDB.R
Line 74 in fd2c5be
> rawData <- DBI::dbConnect(RPostgres::Postgres(),
                            host = host_string,
                            dbname = dbname_string,
                            port = pgpass_file[[host_index]][2],
                            user = pgpass_file[[host_index]][4],
                            password = pgpass_file[[host_index]][5])
Error in connection_create(names(opts), as.vector(opts)) :
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
On the docker image on the server, it hangs (as in, I have to kill RStudio with htop) and never returns an error. In both cases, everything up to the dbConnect line works as far as I can tell.
If we build on INLA long term, we should explore integration with the PARDISO libraries for efficient sparse matrix calculations: https://www.pardiso-project.org/.
saveModel works, but does a lot of multiple counting with model_type and latent=TRUE that is also redundant with model$modelDefinition$type. The saveModel model type strings are also not the same as the incidenceMapR model type strings; this should be made clearer and more maintainable.
The demo data has quintile as an option, but that was post-processed in a script. Add that to appendModelData.R.
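A minimal sketch of what the quintile step could look like inside appendModelData.R, using only base R. The column name `modeled_intensity` is an assumption for illustration, not the real schema:

```r
# Hypothetical sketch: assign each modeled value to a quintile (1-5),
# as the demo post-processing script did. The column name
# `modeled_intensity` is an assumption, not the real schema.
add_quintile <- function(df, col = "modeled_intensity") {
  breaks <- quantile(df[[col]], probs = seq(0, 1, 0.2), na.rm = TRUE)
  df$quintile <- cut(df[[col]], breaks = breaks,
                     include.lowest = TRUE, labels = FALSE)
  df
}

df <- data.frame(modeled_intensity = c(0.1, 0.4, 0.2, 0.9, 0.7,
                                       0.3, 0.5, 0.8, 0.6, 1.0))
df <- add_quintile(df)
table(df$quintile)  # two values per quintile for this example
```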
Hooking up new model types, as will be required on branch https://github.com/seattleflu/incidence-mapper/tree/famulare/flu-vax-efficacy, is basically impossible with the current modelServR::saveModel design, which assumes two model types. Fixing saveModel to be more extensible and maintainable is necessary, and merging in the flu-vax-efficacy models is a good prompt to do so. However, the viz can't do anything with these models right now, and they are fast to run, so I don't need this functionality until after clean-alpha-deployment is finished.
I need to select models for deployment and write build scripts.
On IDMPPWK..., running this block
> Sys.setenv(PGSERVICEFILE = file.path(credentials_path, ".pg_service.conf"),
PGPASSFILE = file.path(credentials_path, ".pgpass"))
> rawData <- DBI::dbConnect(RPostgres::Postgres(), service = "seattleflu-production")
yields this error:
WARNING: password file "/home/rstudio/seattle_flu/.pgpass" has group or world access; permissions should be u=rw (0600) or less
Error in connection_create(names(opts), as.vector(opts)) :
fe_sendauth: no password supplied
.pgpass and pg_service.conf are in the correct directory, and the script here does work:
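The warning line is the actual failure: libpq refuses to use a `.pgpass` file with group/world access, so no password gets supplied. A one-line fix from R, using base `Sys.chmod` (the `tempfile()` here stands in for `file.path(credentials_path, ".pgpass")`):

```r
# libpq ignores .pgpass unless it is user-only; tighten the mode from R
# (equivalent to `chmod 0600`). tempfile() stands in for the real path.
pgpass <- tempfile()
writeLines("host:5432:db:user:secret", pgpass)
Sys.chmod(pgpass, mode = "0600")
format(file.info(pgpass)$mode)  # "600"
```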
In the Swagger UI, this query returns a valid CSV but an empty JSON:
{
"model_type": "inla",
"observed": [
"residence_census_tract",
"site_type",
"pathogen",
"flu_shot",
"sex",
"encountered_week"
],
"pathogen": [
"all"
]
}
smoothModel, latentFieldModel, (and effectsModel to come) share a lot of code for building the formula from the input table. There are some minor differences, but at some point, the formula logic (at least the random effects part, which I think is identical and likely to stay that way) should get pulled into its own function for better maintainability.
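A minimal sketch of what the shared helper could look like. All names here (`build_random_effects_formula`, `positive_count`) are hypothetical; note that INLA's `f()` terms only need INLA at fit time, so constructing the formula itself is plain R:

```r
# Hypothetical helper: build the shared random-effects part of an
# INLA-style formula from a vector of effect names.
build_random_effects_formula <- function(response, effects,
                                         model = "iid") {
  terms <- sprintf('f(%s, model = "%s")', effects, model)
  as.formula(paste(response, "~", paste(terms, collapse = " + ")))
}

fm <- build_random_effects_formula("positive_count",
                                   c("site_type", "encountered_week"))
class(fm)  # "formula"
```

smoothModel, latentFieldModel, and effectsModel could then all call this for the random-effects part and add their model-specific terms on top.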
I realize now we need to be explicit about the spatial domain of the models (like seattle, king_county, or wa_state), but this isn't being tracked anywhere. The domain has to be added to modelDefinitions, and extracted/referenced in modelServR where appropriate, mirroring what we did in #53.
latent field should be output with appropriate link function
Right now, I'm smoothing each factor level independently, given the latent field hyperparameters (one field replicate for each factor level). This is equivalent to a random intercept with infinite variance hyperprior.
It probably makes sense to treat levels as IID random effects, so that the model can pool if data is weak, and asymptote to the fixed effect model when data is strong.
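The two alternatives, sketched as INLA-style formulas (INLA is only needed at fit time, so these parse in plain R). Variable names (`count`, `geoid_index`, `level_index`, `neighbor_graph`) are illustrative, not the real schema:

```r
# Current behavior: one latent-field replicate per factor level, i.e. a
# random intercept with an effectively infinite-variance hyperprior.
fm_replicate <- count ~ f(geoid_index, model = "bym2",
                          graph = neighbor_graph,
                          replicate = level_index)

# Proposed: add the levels as an IID random effect, so the model pools
# toward the fixed-effect fit when data are weak.
fm_iid <- count ~ f(geoid_index, model = "bym2",
                    graph = neighbor_graph,
                    replicate = level_index) +
                  f(level_index, model = "iid")

class(fm_iid)  # "formula"
```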
Great to see the move to binning by week! I think we should be using ISO week definitions and the standard 2019-W02 format, not 2019_W02. ISO weeks are slightly different from CDC weeks, as described most succinctly:
The US CDC defines epidemiological week (known as MMWR week) as seven days beginning with Sunday and ending with Saturday. The WHO defines epidemiological week, based on ISO week, as seven days beginning with Monday and ending with Sunday. In either case, the end of the first epidemiological week of the year by definition must fall at least four days into the year. Most years have 52 epidemiological weeks, but some have 53.
It's my understanding that ISO weeks are used more widely around the world, and just like we use the metric system in science, it seems we should use international datetime standards as well.
Changing this line to:
db$iso_week <- format(db$encountered_date, "%G-W%V")
will use the strftime formatting codes for the ISO week-based year (%G) and week number (%V).
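A quick spot check of those codes at a year boundary, where the ISO week-based year (%G) diverges from the calendar year (%Y). This relies on the platform strftime supporting %G/%V, which glibc does; behavior on older Windows builds may differ:

```r
# At a year boundary, the ISO week-based year differs from the calendar year:
format(as.Date("2018-12-31"), "%G-W%V")  # "2019-W01" (Monday of ISO week 1, 2019)
format(as.Date("2019-01-08"), "%G-W%V")  # "2019-W02"
format(as.Date("2019-12-30"), "%G-W%V")  # "2020-W01"
```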
expandDB will need to know what validColumnData looks like for the real data. And again, we should maintain a parallel path to simulated data.
incidence-mapper/dbViewR/R/expandDB.R
Line 18 in 98e55a8
We need to agree on shapes, but that requires choosing a valid domain. See seattleflu/seattle-geojson#1 for discussion.
As of this evening, when trying to open a project in RStudio, I get an error. This started before the rebuild of #33, so I have no idea why. It could be a VPN thing, as I've only tried from home today, but I don't see why....
the model lookup operation in returnModel is very fragile as it requires an exact string match for a JSON query. This should be replaced with a proper artifact management database. Changes here are coupled with saveModel: #5
Awaiting guidance on exporting linear combinations of latent field here:
https://groups.google.com/forum/#!topic/r-inla-discussion-group/_e2C2L7Wc30
After we get the mapping server going and the database fully populated, I intend to use/extend incidenceMapR to explore vaccine efficacy patterns based on self-reported flu_shot and the test-negative design. This should be a good test of the information power in our study.
Filing this ticket in advance of finding problems when we connect to the research db #2 ...
The prototype has a few field names hard-coded as validColumnNames. While I can harmonize the simulated-data and real-data columns, it would be nice to curate column names better, or be more responsive to them, across data sources.
Need to interact with/be informed by https://github.com/seattleflu/genomic-incidence-tracker/blob/master/data/variableChoices.json
Once we have the handoff credentials set up, we need to add SQL hooks to the real database. We should also maintain the ability to connect to the simulated data for prototyping and model validation.
For better readability and interaction with Auspice, switch to snake_case from camelCase in any JSON blobs that get moved around (and thus in much of the code). https://seattle-flu-study.slack.com/archives/CFAEU7MQV/p1554238354020400?thread_ts=1554238173.019700&cid=CFAEU7MQV
Documenting for future reference.
Issue: 9276b73
King County at census tract level for 9 months for six pathogens requires at least 120GB of memory to run. I knew the memory footprint would be an issue, but I didn't have a good way to estimate it; now we know.
The short term approach to dealing with this is to do each pathogen one at a time, instead of all pathogens simultaneously. This doesn't allow borrowing across pathogens to estimate the latent field hyperparameters, but it's an open modeling question whether that's a good idea or not, and benchmarking against simulated data is a longer-term to-do.
Turns out the problem was greatly exacerbated by a major bug in appendSmoothModel fixed here: fdeeb1d
The issue of the appropriateness of borrowing power across pathogens still stands.
Is it possible to get IN statements into the model lookup soon? (Bad JSON, but something like:)
{
"model_type": "smooth",
"observed": [
"pathogen": ["all"],
"residence_cra_name",
"site_type"
]
}
This is deeply important for pathogen, but is optional for now on any other field, as it's part of the deployed model definition:
incidence-mapper/dbViewR/R/selectFromDB.R
Line 46 in 67f7422
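A minimal sketch of what IN-style matching could look like on the R side: treat each query field as a set of allowed values and filter with `%in%`, instead of requiring an exact string match. The `query` and `models` objects here are invented illustrations, not the real lookup schema:

```r
# Hypothetical IN-style lookup: each query field is a set of allowed
# values; keep rows whose stored value falls in that set.
query <- list(pathogen  = c("all"),
              site_type = c("clinic", "hospital"))

models <- data.frame(pathogen  = c("all", "all", "h3n2"),
                     site_type = c("clinic", "childcare", "hospital"),
                     stringsAsFactors = FALSE)

keep <- Reduce(`&`, lapply(names(query), function(field) {
  models[[field]] %in% query[[field]]
}))
models[keep, ]  # only the pathogen == "all" & site_type == "clinic" row
```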
It would be nice to get shape references at all scales from the database, but until then, we use this lookup operation like in seattleflu/seattle-geojson#5
In model outputs, format dates as YYYY-MM-DD instead of decimal years.
Right now, latentFieldModel.R and smoothModel.R require spatial shp data even if the model being built doesn't. shp should be optional, and modelServR::saveModel() should be compatible with the future update.
Comment here:
In modelTestR::testLatentField, I found that using multiple likelihoods (http://www.r-inla.org/models/tools#TOC-Models-with-more-than-one-type-of-likelihood) with replicated random effects works really nicely for independently smoothing each factor level. Fast and less prone to overfitting or under-fitting.
I'm estimating the samplingLocation catchments with the all-disease sampling location maps. If we believe the number of people sampled is proportional to the total population, then log(catchment) should enter as an offset (or an "E" variable for INLA-Poisson).
But if we aren't sure how the number of participants represents the true denominator, it's more reasonable to input log(catchment) as a covariate and see what the data say about the impact of the catchment on disease. This is what is implemented in the code currently, but it's an open question which is more valid (both in theory, to me, and for future empirical validation through improved survey design).
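The two catchment formulations, sketched as plain formulas (variable names `positive_count`, `smooth_terms`, `log_catchment` are illustrative):

```r
# Known denominator: log(catchment) enters with a fixed coefficient of 1
# via offset() (or INLA's E argument for a Poisson likelihood).
fm_offset    <- positive_count ~ smooth_terms + offset(log_catchment)

# Uncertain denominator: let the data estimate the catchment coefficient.
fm_covariate <- positive_count ~ smooth_terms + log_catchment

class(fm_offset)  # "formula"
```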
Some shapefile data sources label PUMA columns as PUMA5CE (King County) and others as PUMACE10 (R::tigris), but they have the same 5-digit codes. We could standardize this either in the shapes or in the code that reads them. I lean toward the code, as I'd prefer not to break standards from public databases where possible, even when those standards are incompatible between sources.
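A minimal sketch of the read-time normalization, leaving the source shapefiles untouched. The internal name `puma_id` and the helper are invented for illustration:

```r
# Hypothetical sketch: normalize the PUMA column name at read time so the
# rest of the code sees a single internal name (`puma_id` is invented).
standardize_puma_column <- function(df) {
  aliases <- c("PUMA5CE", "PUMACE10")   # King County vs R::tigris
  found <- intersect(aliases, names(df))
  if (length(found) == 1) {
    names(df)[names(df) == found] <- "puma_id"
  }
  df
}

kc     <- data.frame(PUMA5CE  = "11601", stringsAsFactors = FALSE)
tigris <- data.frame(PUMACE10 = "11601", stringsAsFactors = FALSE)
standardize_puma_column(kc)$puma_id      # "11601"
standardize_puma_column(tigris)$puma_id  # "11601"
```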
geojsonio isn't properly installing in the docker image. Several packages' dependency libraries aren't coming through, despite having been added to the build environment file.
Called interactively:
> install.packages('geojsonio')
installing *source* package ‘protolite’ ...
** package ‘protolite’ successfully unpacked and MD5 sums checked
Package protobuf was not found in the pkg-config search path.
Perhaps you should add the directory containing `protobuf.pc'
to the PKG_CONFIG_PATH environment variable
Package 'protobuf', required by 'world', not found
Using PKG_CFLAGS=
Using PKG_LIBS=-lprotobuf
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because protobuf was not found. Try installing:
* deb: libprotobuf-dev (Debian, Ubuntu, etc)
* rpm: protobuf-devel (Fedora, EPEL)
* csw: protobuf_dev (Solaris)
* brew: protobuf (OSX)
If protobuf is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a protobuf.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package ‘protolite’
* removing ‘/usr/local/lib/R/site-library/protolite’
Warning in install.packages :
installation of package ‘protolite’ had non-zero exit status
* installing *source* package ‘V8’ ...
** package ‘V8’ successfully unpacked and MD5 sums checked
Using PKG_CFLAGS=-I/usr/include/v8 -I/usr/include/v8-3.14
Using PKG_LIBS=-lv8 -lv8_libplatform
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because was not found. Try installing:
* deb: libv8-dev or libnode-dev (Debian / Ubuntu)
* rpm: v8-devel (Fedora, EPEL)
* brew: v8 (OSX)
* csw: libv8_dev (Solaris)
To use a custom libv8, set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package ‘V8’
* removing ‘/usr/local/lib/R/site-library/V8’
Warning in install.packages :
installation of package ‘V8’ had non-zero exit status
* installing *source* package ‘jqr’ ...
** package ‘jqr’ successfully unpacked and MD5 sums checked
Using PKG_CFLAGS=
Using PKG_LIBS=-ljq
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libjq was not found.
On Ubuntu 14.04 or 16.04 you can use the PPA:
sudo add-apt-repository -y ppa:opencpu/jq
sudo apt-get update
sudo apt-get install libjq-dev
On other sytems try installing:
* deb: libjq-dev (Debian, Ubuntu 16.10 and up).
* rpm: jq-devel (Fedora, EPEL)
* csw: libjq_dev (Solaris)
* brew: jq (OSX)
If is already installed set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
ERROR: configuration failed for package ‘jqr’
* removing ‘/usr/local/lib/R/site-library/jqr’
Warning in install.packages :
installation of package ‘jqr’ had non-zero exit status
ERROR: dependencies ‘protolite’, ‘jqr’ are not available for package ‘geojson’
* removing ‘/usr/local/lib/R/site-library/geojson’
Warning in install.packages :
installation of package ‘geojson’ had non-zero exit status
ERROR: dependencies ‘V8’, ‘geojson’, ‘jqr’ are not available for package ‘geojsonio’
* removing ‘/usr/local/lib/R/site-library/geojsonio’
Warning in install.packages :
installation of package ‘geojsonio’ had non-zero exit status
After rebuilding the RStudio container, git config has to be done by hand:
git config --global user.email "[email protected]"
git config --global user.name "Your Name"
We need to pull in census demographic data to move toward understanding denominators and eventually producing post-stratified incidence estimates for different population groupings, instead of just the latent field estimate.
This will require additional data curation and data-model integration code.
Right now, saveModel can only output a single csv. This works for smoothing models, but probably will not work for latentField models, in which there are both smoothing outputs and latentField outputs. I'm going to try to keep everything in one file, as that makes provenance easier to track and later retrieval for manipulation in REACT easier, but I'm not sure that's possible.
That line definitely doesn't belong, but it doesn't appear to be affecting models that we're building right now, so I'm gonna leave it until after the May demo.
The geoid x time inla model objects can take up to 2GB on disk when saved as RDS. There are plenty of inefficiencies and redundancies in the model object right now, so we should eventually try to shrink it.
The naming convention I chose for indices on grouped random effects is very annoying to work with for linear combinations of latent fields in incidenceMapR::latentFieldModel.R. I've finished wrestling with it for now, but this needs a refactor everywhere it appears. The natural time to do it is when we hook into real data, as there is other column-naming stuff that needs reworking.
So far, I've been assuming that we'll only do neighbor smoothing (Besag/BYM2/CAR) at the GEOID level, and that higher levels just get iid random effects pooling. We probably want the option to explore higher scale smoothing. While it's possible the R stack will "just work" with non-GEOID shapefiles, I expect it won't and we'll have to clean it up.
Related: #4
If the underlying model packages did not update but new trained model binaries were produced, we need a good deployment procedure to deploy them easily. Ideally, it should be possible to do atomically from the end of a model training session, i.e.:
trained_model
.....
upload_new_model asadsasd34424234234234 model.rds
expandDB has a bug that's probably been there awhile when joining on timeBin: not all rows join! The whole time handling needs to be fixed anyway, per #8. This is critical to fix before moving forward.
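An illustration of the failure mode, not the actual expandDB code (column names here are hypothetical): base `merge()` defaults to an inner join, so observations whose timeBin has no match in the lookup table silently drop out unless `all.x = TRUE` is set (or the timeBin grid is expanded first):

```r
# merge() defaults to an inner join, so unmatched timeBins silently drop.
obs  <- data.frame(timeBin = c("2019-W01", "2019-W02", "2019-W03"),
                   n = c(5, 3, 7), stringsAsFactors = FALSE)
bins <- data.frame(timeBin = c("2019-W01", "2019-W02"),
                   week_start = as.Date(c("2018-12-31", "2019-01-07")),
                   stringsAsFactors = FALSE)

inner <- merge(obs, bins, by = "timeBin")                # drops 2019-W03
left  <- merge(obs, bins, by = "timeBin", all.x = TRUE)  # keeps all rows

nrow(inner)  # 2
nrow(left)   # 3
```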
For fields like race or (importantly) pathogen, multiple values are possible. The easy thing to do is to give array entries their own category, so that pathogen = {Flu_A_H1, RSV} is a third outcome category, but this won't always make sense for all analyses. Need to think through use cases...