global-policy-lab / gpl-covid Goto Github PK

Repo for code and small datasets related to Global Policy Lab's COVID-19 policy analysis. Read and share the acompanying article here:

Home Page: https://rdcu.be/b4Iyo

Jupyter Notebook 52.28% Python 16.39% R 12.49% Stata 18.20% Shell 0.51% Dockerfile 0.12%

covid-19 regression-models stata codeocean-capsule analysis estimate epidemiological-data intervention-study statistical-inference statistical-models

gpl-covid's People

Contributors

Stargazers

Watchers

Forkers

vrathi0 hdruckenmiller lucyxiaoluwang xuf12 hpan8 green-ear-rabbit logicalfellac yongjianzhu1995 yeli0823 shizelong1985 fergald minhngh mbader92

gpl-covid's Issues

decide how to treat intensity-weighted overlapping national/regional/local policies

Do we sum? Do we take the max? See convo on slack. From @estherrolf:

[I’m gonna use US specific words so like state=adm1] specifically,

(a) for non-popweighted variables, how do we aggregate state level contributions from county or city policies with differing policy_intensities? Like if a city has a policy with intensity 1 and a state level policy has intensity 0.2, is the state level (non-popweighted) intensity meant to be max(0.2,1) = 1 ?

(b) for popweighted values, are we counting state level contributions as min( sum_over_policies{percent_state_pop_effected * policy_intensity} , 1) ? If we do this, we would i think have to account for all the overlap in the percent_state_pop_effected and somewhoe also the intensities of the policies that overlap — seems prone to errors and likely that we’re not doing it the same way for each country.

basically I’m asking for us to come to a consensus on how to use policy_intensity (I know you guys have all thought more about this than me, but as the person trying to implement the merge there’s a lot of complications arising). It’s possible theres a straightforward way to use it, it’s also possible that it’s not worth the trouble to account for fractional intensities (which maybe avoids encoding our judments on what fraction to give?).

ED figures and SI tables need to be added to README.md and run script

The code to generate each of the following figures should be in the run script and referenced in the readme:

ED Fig 2 (@luna983 )
ED Figs 3-5 (@sannanphan )
ED Fig 6 (@jeanettelt )
SI Table 2 (@luna983 ) I feel like this might not be a code-generated table - If that's the case, just note that in the run and README files and we're good!
SI Tables 3-5 (@sannanphan or @jeanettelt ) - If these tables just pop out of another script, just note that in the run and README file and we're good!

ED Fig 2 source data will fail tests unless we ignore

We should pull from cutoff_dates to know when to stop ED fig 2, otherwise the figure and the source data will be different every day as new data is available.

We should also add a no-download flag so that in tests we can avoid downloading new data in tests in case new data alters previously-reported values that occur prior to the cutoff date.

Figure code mostly does not run now

Most of the Figure / Table code is not running right now. I think it just requires a few tweaks w/ variable names and (for fig A2), reading from the already downloaded JHU data rather than trying to access the now broken old JHU links.

I think Fig 4 is still working, but not the extra script that calculates data for the paper that's related to figure 4

Warnings from SIR model projections

Warning messages from running codes/models/run_all_CB_simulations.R

Warning messages:
1: In chol.default(mat, pivot = TRUE, tol = tol) :
  the matrix is either rank-deficient or indefinite
2: In sqrt(diag(z$STATS[[lhs]]$clustervcv)) : NaNs produced
3: In sqrt(diag(z$STATS[[lhs]]$robustvcv)) : NaNs produced
4: In chol.default(mat, pivot = TRUE, tol = tol) :
  the matrix is either rank-deficient or indefinite
5: In sqrt(diag(z$STATS[[lhs]]$clustervcv)) : NaNs produced
6: In sqrt(diag(z$STATS[[lhs]]$robustvcv)) : NaNs produced
7: In compute_bootstrap_replications(full_data = mydata, policy_variables_to_use = policy_variables_to_use,  :
  Negative eigenvalues set to zero in clustered variance matrix. See felm(...,psdef=FALSE)
Error in eigen(sigma, symmetric = TRUE) : 
  infinite or missing values in 'x'
Calls: source ... withVisible -> <Anonymous> -> map -> .f -> <Anonymous> -> eigen
In addition: Warning messages:
1: In chol.default(mat, pivot = TRUE, tol = tol) :
  the matrix is either rank-deficient or indefinite
2: In sqrt(diag(z$STATS[[lhs]]$clustervcv)) : NaNs produced
3: In chol.default(mat, pivot = TRUE, tol = tol) :
  the matrix is either rank-deficient or indefinite
4: In sqrt(diag(z$STATS[[lhs]]$clustervcv)) : NaNs produced
5: In compute_bootstrap_replications(full_data = mydata, policy_variables_to_use = policy_variables_to_use,  :
  Negative eigenvalues set to zero in clustered variance matrix. See felm(...,psdef=FALSE)
Execution halted

ensure region is not covered by an optional and a non-optional policy of the same category

A region subject to somepolicy should not also have the "extra treatment" of somepolicy_opt when other regions in the super-region get this treatment, since this optional policy doesn't actually effect any new policy treatment for the region that already had somepolicy. The same goes for popwt versions of somepolicy--we should make sure that any sub-regions (that compose the pop-weight of the region) that are encompassed in somepolicy_popwt are not also aggregated in somepolicy_opt_popwt during any of the days those policies are in place, and vice versa.

add FIPS for USA to adm data

@kendonB if I don't do this soon yell at me

data/codes/usa/merge_policy_and_cases breaking

I started to fix a few lines but I'm getting to the end and it looks like theres a bunch of national cases in there which are causing a mis-matched merge with population data and it's failing an assertion at the end. @estherrolf can you take a look?

Data and code updates

Using this issue to track TODO's for getting new data and/or creating tables/figures to respond to reviewers. When you submit a PR, add something like the following to your PR (e.g. "updates policy data for #21 "). When it is merged, I'll check off the appropriate checkbox in this issue. For all updates, make sure that any new scripts or manual downloads have been added to the README and these scripts have been added to run

Old code revisions

Fix #8 (@trinetta )

New data

Readme

Update README structure as needed (@bolliger32)

Policy data

We are now using the data_sources.gsheet file as the up-to-date version of all manually aggregated policy data. Each country's pipeline should be updated with the following steps:

make sure data_sources.gsheet is up to date with all manually collected data
save respective country's tab from the gsheet as data/raw/[country]/[country_code]_policy_data_sources.csv
for programmatic data collection and processing, have your scripts save the resulting file as data/interim/[country]/[country_code]_policy_data_sources_other.csv. Its possible but not necessary to include downloads to data/raw in these scripts, but the processed output, formatted to match data_sources.gsheet should go in interim.
Push this automated data to github if new (this will continue to happen as time passes, but lets make sure we have this updated to at least 3/24 for this version of the code)

Epi data

Make sure your code pulls the latest data, without filtering to cutoff date (this will be done in regression step). Push a PR with both any code updates AND data updates through at least 3/24

Merging epi+policy

This can either occur in the same script as the pulling epi data script or as a separate step. But make sure this pulls from data/raw/[country]/[country_code]_policy_data_sources.csv,
data/interim/[country]/[country_code]_policy_data_sources_other.csv (if necessary), any of the epi data, and any auxiliary data like population, and saves data/processed/[adm_level]/[country_code]_processed.csv. Make sure that the output has lat and lon columns.

Regression estimation

Make sure that this code filters data to cutoff dates before saving the reg_data.csv files. This could either be done with an easy-to-change set of country-specific variables at the top of the script or (preferably) a csv located at codes/data/regression_cutoff_dates.csv. (@jeanettelt )
Save all regression outputs to results/tables/[country_code]_results_table.rtf for Appendix table (@sannanphan)
Are we adding lags for any countries? or changing the regressions in any other way? (@jeanettelt and sol)

Projections

Make sure this replicates (@kendonB)

Figures

Remove hardcoding of cutoff dates in Fig 1 and instead refer to codes/data/cutoff_dates.csv (see #29) (@hdruckenmiller)

New products

County-level analysis in the US (@estherrolf and @kendonB and @ekrasovich )
Add table of full regression outputs to Appendix (see Regression estimation and reference appropriately in paper (@sannanphan)
Figure that is a version of Fig 3. separately for WA, CA, NY, Wuhan, Tehran, Lodi. This involves estimating the regression again separately for each of these admin units. (@jeanettelt)

Missing date column in China-processed

The date column is missing in data/processed/adm2/CHN_processed.csv

split run script into two children

"data downloads" vs. "analysis", to make it easier to re-run one of these parts.

French daily cases website changed :(

looks like something changed about the french website we were scraping for cases... So now it downloads a blank file with no datestamp. Will want to update the code to handle this (not sure if its temporary or a permanent change)

remove tigris

Fig 1 code downloads US state shapes at runtime, and this sometimes causes errors in the CI pipeline. We can instead pull from the adm1.shp file and drop our dependency on r-tigris.

projection code throws errors periodically

Is there a stochastic component of the CI bounds that we can set a seed for?

add optional CLI arg to the quality-checker code

update readme before next release

A reminder to change title, update tree, and any other changes.

Appendix Fig A1

Appendix Fig A1 , find out why this line *graph combine hist_usa qn_usa, rows(1) xsize(10) saving(figures/appendix/error_dist/error_usa.gph, replace) was commented out in codes/models/alt_growth_rates/USA_adm1.do, preventing combination of graphs to make Appendix Fig A1.

Update format_infected.do to account for more recent data

Right now format_infected.do will treat April like the Y2K bug, so we should fix that up before april comes around. Also, we should have a setting where it just "runs until the latest date thats downloaded". Right now its stops at the 18th and requires having the fr-sars-cov-2-YYYYMMDD data being before that date, with daily data downloaded for each date in between that bulk download and end_sample. After #17 is merged, I think the repo is fully cleaned up and we can tag that version as our "medarXiv" version, so that figures are replicable. And then we can start pulling in newer data across all the scripts.

@jeanettelt @sannanphan can one of you be in charge of getting format_infected.do ready for that? Thanks!

"present day" projections raise errors in CI

@kendonB after #311 was merged to master, it looks like master failed the tests. I'm guessing that's b/c it's getting projecting longer and longer each day? We should either skip the checking of this file or have a fixed end date that it projects to, which we can change manually.

harmonizing FRA policy variable definitions

Right now FRA is using slightly different structure/definitions for policy variables, so @jeanettelt @sannanphan @peiley and I need to put our heads together to sort this out.

USA facts download is breaking

We're getting one county with null imputed values, which is raising an error during one of @kendonB's checks (thanks for adding those!)

Browse[1]> usa_county_data_standardised[is.na(usa_county_data_standardised$cum_confirmed_cases_imputed),]
# A tibble: 3 x 13
  date       adm0_name adm1_name adm2_name population cum_confirmed_c… cum_confirmed_c… active_cases active_cases_im… cum_deaths
  <date>     <chr>     <chr>     <chr>          <dbl>            <dbl>            <dbl>        <dbl>            <dbl>      <dbl>
1 2020-04-01 USA       Wyoming   Weston          7208               NA               NA           NA               NA         NA
2 2020-04-02 USA       Wyoming   Weston          7208               NA               NA           NA               NA         NA
3 2020-04-03 USA       Wyoming   Weston          7208               NA               NA           NA               NA         NA
# … with 3 more variables: cum_deaths_imputed <dbl>, cum_recoveries <dbl>, cum_recoveries_imputed <dbl>