forecaster2's People

Contributors: andybega

forecaster2's Issues

How much do the forecasts vary over RNG seeds?

How much sensitivity is there between model runs due to different starting conditions?

To keep this reproducible, find a way to run the forecast models over some number of randomly picked RNG seeds and check the resulting variation in performance.
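A minimal sketch of one way to set this up, where run_forecast() is a hypothetical stand-in for the actual model-fitting pipeline (stubbed out here so the sketch runs):

```r
# Fix a meta-seed so the draw of seeds is itself reproducible
set.seed(452)
seeds <- sample.int(1e6, size = 10)

# Hypothetical stand-in for the real pipeline; it should fit the models
# and return a single performance number (e.g. AUC-PR on the test set)
run_forecast <- function() runif(1)

scores <- sapply(seeds, function(s) {
  set.seed(s)
  run_forecast()
})

# Variation in performance across RNG seeds
summary(scores)
sd(scores)
```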

Add past forecasts and assessments

I think there are coup forecasts for at least 2 past years. In 2017 (?) we had forecasts that we wrote up in WaPo Monkey Cage, and then I think I had some 2018 forecasts that I never did anything with.

Add those to the forecast repo and see what their accuracy was.

One caveat: the process used to generate those forecasts was slightly different, e.g. some of the data going into them was different.

Reduce spatial data size

The index page (index.html) is quite big. It started out at almost 6 MB when loaded in a browser, and by playing around with st_simplify() for the two map objects I can get it down to around 4.53 MB when loaded. This still doesn't work with the Twitter card validator (see #10), and I think reducing the number of points further at this point will visibly degrade the maps too much.

Another option might be to reduce the precision of the coordinates.

Right now index.html stores the coordinates like "-0.456945916095746", i.e. with 15 digits. That's probably way more than needed. (These are lat/long coordinates, so they range from -180 to 180 and -90 to 90.)

So try this instead: reduce the number of digits in the coordinates (and the probabilities, for that matter), and possibly revert the point reduction (st_simplify()) back to the earlier values.
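For scale: one degree of longitude is roughly 111 km at the equator, so each decimal place buys a factor of 10 in precision. For example:

```r
x <- -0.456945916095746

# 1 degree ~ 111 km at the equator, so k decimal places resolve to
# roughly 111e3 / 10^k meters: 5 digits ~ 1.1 m, 3 digits ~ 110 m.
# Either is far below what a country-level choropleth needs.
round(x, 3)
#> [1] -0.457

# Characters saved per coordinate value:
nchar("-0.456945916095746") - nchar("-0.457")
#> [1] 12
```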

In rleaflet, for sf objects, the code that converts sf to GeoJSON pulls directly from the sf coordinates, see https://github.com/rstudio/leaflet/blob/master/R/normalize-sf.R.

sf has a set of precision functions, but these only come into play when writing out data (see https://r-spatial.github.io/sf/reference/st_precision.html). rleaflet's direct access to the sf coordinates thus circumvents this. Maybe either:

  • try to override rleaflet::sf_coords()
  • write from sf to GeoJSON, manipulate the coordinates somehow, and then add it as a GeoJSON layer in leaflet.
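For the second option, a possible shortcut: the GDAL GeoJSON driver has a COORDINATE_PRECISION layer creation option, so sf::st_write() can do the rounding during export rather than manipulating coordinates by hand. A sketch, where shapes stands in for the actual map object:

```r
library(sf)

# `shapes` stands in for the sf object behind one of the map layers.
# COORDINATE_PRECISION is a GDAL GeoJSON layer creation option that
# rounds coordinates on write, sidestepping leaflet's direct access
# to the sf coordinate matrices.
st_write(shapes, "shapes.geojson",
         layer_options = "COORDINATE_PRECISION=3",
         delete_dsn = TRUE)

# The result can then go into the map as a raw GeoJSON layer,
# e.g. via leaflet::addGeoJSON() instead of addPolygons() on the sf object.
```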

Try out a skinny forest HP strategy for the RF models

Instead of a relatively small number of decision trees that themselves operate on a lot of data and are fairly deep, try out an alternative strategy using a large number of trees, but where each tree is relatively shallow and only operates on a relatively small data sample. A variation of this is to also consider stratified sampling with downsampling for negative cases.

mlr3 uses the following defaults for ranger():

learner = mlr3::lrn("classif.ranger")
learner$param_set$default
  • min.node.size: 1
  • mtry: no default; ranger's default is the rounded-down square root of the number of features
  • num.trees: 500
  • max.depth: no default
  • replace: TRUE
  • sample.fraction: no default, but ranger's default is 1 for sampling with replacement (the default) and 0.632 for sampling without replacement

The "sample.fraction" argument can be a vector giving the number of cases (relative to the total number of cases) to sample from each outcome factor class. See the bottom answer at https://stats.stackexchange.com/questions/171380/implementing-balanced-random-forest-brf-in-r-using-randomforests, and the linked ranger issues.

So something like sample.fraction = c(0.1, 0.9), for example, should give a resampled dataset with 10% positive cases and the same number of rows as the original data (the order of the fractions follows the factor levels).
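A sketch of what such a call could look like with ranger directly, using made-up imbalanced data (parameter values are illustrative, not tuned):

```r
library(ranger)

# Made-up imbalanced data: 10% positive cases
d <- data.frame(
  y  = factor(c(rep("pos", 100), rep("neg", 900)), levels = c("pos", "neg")),
  x1 = rnorm(1000),
  x2 = rnorm(1000)
)

fit <- ranger(
  y ~ ., data = d,
  num.trees       = 2000,           # many trees, but each sees little data
  replace         = TRUE,
  # Per-class fractions, relative to total n, in factor level order:
  # 10% of n drawn from "pos", 10% of n drawn from "neg"
  sample.fraction = c(0.1, 0.1),
  min.node.size   = 1,
  probability     = TRUE
)
fit$prediction.error
```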

Things to vary:

  • the number of trees
  • min.node.size
  • the total sample fraction, e.g. whether it should be 1 or some dramatically lower number
  • and the proportion of class samples, which together with the total sample fraction gives the sample.fraction vector
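One way to organize those dimensions is to parameterize the grid by total sample fraction and positive-class share, and derive the per-class fractions from them (all values below are illustrative):

```r
# Illustrative tuning grid; frac_pos and frac_neg together form the
# sample.fraction vector passed to ranger()
grid <- expand.grid(
  num.trees      = c(500, 2000, 5000),
  min.node.size  = c(1, 5, 20),
  total_fraction = c(0.05, 0.2, 1),   # total sample size relative to n
  pos_share      = c(0.1, 0.5)        # share of the sample that is positive
)
grid$frac_pos <- grid$total_fraction * grid$pos_share
grid$frac_neg <- grid$total_fraction * (1 - grid$pos_share)

nrow(grid)
#> [1] 54
head(grid)
```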

Chen, Liaw, and Breiman, in the balanced random forest paper, recommend drawing the same number of cases for both classes, i.e. a 1:1 proportion, or something like sample.fraction = c(0.5, 0.5). Maybe that's a good starting point.

So basically in total, three tuning strategies:

  1. Default RF with sample.fraction = 1, optimizing over mtry and min.node.size; this I already have
  2. Balanced RF with sample.fraction = c(0.5, 0.5)
  3. Skinny RF with a much larger number of trees but smaller sample fractions, e.g. c(0.1, 0.1)
