forecaster2's People

Contributors: andybega

forecaster2's Issues

How much do the forecasts vary over RNG seeds?

How much sensitivity is there between model runs due to different starting conditions?

To keep this reproducible, find a way to run the forecast models over some number of randomly picked RNG seeds and check the resulting variation in performance.
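A minimal sketch of one way to set this up, where run_forecast() is a hypothetical stand-in for the actual model-fitting pipeline (stubbed out here so the sketch runs):

```r
# Fix a meta-seed so the draw of seeds is itself reproducible
set.seed(452)
seeds <- sample.int(1e6, size = 10)

# Hypothetical stand-in for the real pipeline; it should fit the models
# and return a single performance number (e.g. AUC-PR on the test set)
run_forecast <- function() runif(1)

scores <- sapply(seeds, function(s) {
  set.seed(s)
  run_forecast()
})

# Variation in performance across RNG seeds
summary(scores)
sd(scores)
```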

Add past forecasts and assessments

I think there are coup forecasts for at least 2 past years. In 2017 (?) we had forecasts that we wrote up in WaPo Monkey Cage, and then I think I had some 2018 forecasts that I never did anything with.

Add those to the forecast repo and see what their accuracy was.

One caveat: the process used to generate those forecasts was slightly different, e.g. some of the data going into them was different.

Reduce spatial data size

The index page (index.html) is quite big. It started out at almost 6 MB when loaded in a browser, and by playing around with st_simplify() for the two map objects I can get it down to around 4.53 MB when loaded. This still doesn't work with the Twitter card validator (see #10), and I think reducing the number of points further at this point will visibly degrade the maps too much.

Another option might be to reduce the precision of the coordinates.

Right now index.html stores the coordinates like "-0.456945916095746", i.e. with 15 digits. That's probably way more than needed. (These are lat/long coordinates, so they range from -180 to 180 and -90 to 90.)

So try this instead: reduce the number of digits in the coordinates (and the probabilities, for that matter), and possibly revert the point reduction (st_simplify()) back to the earlier values.
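For scale: one degree of longitude is roughly 111 km at the equator, so each decimal place buys a factor of 10 in precision. For example:

```r
x <- -0.456945916095746

# 1 degree ~ 111 km at the equator, so k decimal places resolve to
# roughly 111e3 / 10^k meters: 5 digits ~ 1.1 m, 3 digits ~ 110 m.
# Either is far below what a country-level choropleth needs.
round(x, 3)
#> [1] -0.457

# Characters saved per coordinate value:
nchar("-0.456945916095746") - nchar("-0.457")
#> [1] 12
```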

In rleaflet, for sf objects, the code that converts sf to GeoJSON pulls directly from the sf coordinates, see https://github.com/rstudio/leaflet/blob/master/R/normalize-sf.R.

sf has a set of precision functions, but these only come into play when writing out data (see https://r-spatial.github.io/sf/reference/st_precision.html). rleaflet's direct access to the sf coordinates thus circumvents this. Maybe either:

  • try to override rleaflet::sf_coords()
  • write from sf to GeoJSON, manipulate the coordinates somehow, and then add it as a GeoJSON layer in leaflet.
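For the second option, a possible shortcut: the GDAL GeoJSON driver has a COORDINATE_PRECISION layer creation option, so sf::st_write() can do the rounding during export rather than manipulating coordinates by hand. A sketch, where shapes stands in for the actual map object:

```r
library(sf)

# `shapes` stands in for the sf object behind one of the map layers.
# COORDINATE_PRECISION is a GDAL GeoJSON layer creation option that
# rounds coordinates on write, sidestepping leaflet's direct access
# to the sf coordinate matrices.
st_write(shapes, "shapes.geojson",
         layer_options = "COORDINATE_PRECISION=3",
         delete_dsn = TRUE)

# The result can then go into the map as a raw GeoJSON layer,
# e.g. via leaflet::addGeoJSON() instead of addPolygons() on the sf object.
```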

Try out a skinny forest HP strategy for the RF models

Instead of a relatively small number of decision trees that themselves operate on a lot of data and are fairly deep, try out an alternative strategy using a large number of trees, but where each tree is relatively shallow and only operates on a relatively small data sample. A variation of this is to also consider stratified sampling with downsampling for negative cases.

mlr3 uses the following defaults for ranger():

learner = mlr3::lrn("classif.ranger")
learner$param_set$default
  • min.node.size: 1
  • mtry: no default; ranger's default is the rounded-down square root of the number of features
  • num.trees: 500
  • max.depth: no default
  • replace: TRUE
  • sample.fraction: no default, but ranger's default is 1 for sampling with replacement (the default) and 0.632 for sampling without replacement

The "sample.fraction" argument can be a vector giving the number of cases (relative to the total number of cases) to sample from each outcome factor class. See the bottom answer at https://stats.stackexchange.com/questions/171380/implementing-balanced-random-forest-brf-in-r-using-randomforests, and the linked ranger issues.

So something like sample.fraction = c(0.1, 0.9), for example, should give a resampled dataset with 10% positive cases and the same number of rows as the original data (the order of the fractions follows the factor levels).
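A sketch of what such a call could look like with ranger directly, using made-up imbalanced data (parameter values are illustrative, not tuned):

```r
library(ranger)

# Made-up imbalanced data: 10% positive cases
d <- data.frame(
  y  = factor(c(rep("pos", 100), rep("neg", 900)), levels = c("pos", "neg")),
  x1 = rnorm(1000),
  x2 = rnorm(1000)
)

fit <- ranger(
  y ~ ., data = d,
  num.trees       = 2000,           # many trees, but each sees little data
  replace         = TRUE,
  # Per-class fractions, relative to total n, in factor level order:
  # 10% of n drawn from "pos", 10% of n drawn from "neg"
  sample.fraction = c(0.1, 0.1),
  min.node.size   = 1,
  probability     = TRUE
)
fit$prediction.error
```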

Things to vary:

  • the number of trees
  • min.node.size
  • the total sample fraction, e.g. whether it should be 1 or some dramatically lower number
  • and the proportion of class samples, which together with the total sample fraction gives the sample.fraction vector
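One way to organize those dimensions is to parameterize the grid by total sample fraction and positive-class share, and derive the per-class fractions from them (all values below are illustrative):

```r
# Illustrative tuning grid; frac_pos and frac_neg together form the
# sample.fraction vector passed to ranger()
grid <- expand.grid(
  num.trees      = c(500, 2000, 5000),
  min.node.size  = c(1, 5, 20),
  total_fraction = c(0.05, 0.2, 1),   # total sample size relative to n
  pos_share      = c(0.1, 0.5)        # share of the sample that is positive
)
grid$frac_pos <- grid$total_fraction * grid$pos_share
grid$frac_neg <- grid$total_fraction * (1 - grid$pos_share)

nrow(grid)
#> [1] 54
head(grid)
```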

Chen, Liaw, and Breiman, in the balanced random forest paper, recommend drawing the same number of cases for both classes, i.e. a 1:1 proportion, or something like sample.fraction = c(0.5, 0.5). Maybe that's a good starting point.

So basically in total, three tuning strategies:

  1. Default RF with sample.fraction = 1, optimizing over mtry and min.node.size; this I already have
  2. Balanced RF with sample.fraction = c(0.5, 0.5)
  3. Skinny RF with a much larger number of trees but smaller sample fractions, e.g. c(0.1, 0.1)
