
us-potus-model's Introduction

State and national presidential election forecasting model

Last update on Thursday October 15, 2020 at 12:29 PM EDT

Code for a dynamic multilevel Bayesian model to predict US presidential elections. Written in R and Stan.

Improving on Pierre Kremp’s implementation of Drew Linzer’s dynamic linear model for election forecasting (Linzer 2013), we (1) add corrections for partisan non-response, survey mode and survey population; (2) use informative state-level priors that update throughout the election year; and (3) specify empirical state-level correlations from political and demographic variables.

You can see the model’s predictions for 2020 here and read how it works here.

File dictionary

In terms of useful files, you should pay attention to the three scripts for the 2008, 2012 and 2016 US presidential elections, located in the scripts/model directory. These R scripts import data, run the models and parse the results:

  • final_model_2008.R
  • final_model_2012.R
  • final_model_2016.R

And there are two Stan scripts that run different versions of our polling aggregate and election forecasting model:

  • poll_model_2020.stan - the final model we use for the 2020 presidential election
  • poll_model_2020_no_mode_adjustment.stan - a model that removes the correction for partisan non-response bias in the polls and the adjustments for the mode in which a survey is conducted (live phone, online, other) and its population (adult, likely voter, registered voter)

Model performance

Here is a graphical summary of the model’s performance in 2008, 2012 and 2016.

2008

Map

Final electoral college histogram

National and state polls and the electoral college over time

State vs national deltas over time

Model results vs polls vs the prior

Performance

outlet ev_wtd_brier unwtd_brier states_correct
economist (backtest) 0.0321964 0.0289902 49

## [1] 0.02318035
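The ev_wtd_brier and unwtd_brier columns above are Brier scores over the state-level win probabilities. A minimal sketch of how such scores are typically computed, assuming (this is an assumption about the table, not code from this repo) that the weighted version weights each state's squared error by its electoral votes:

```python
# Illustrative sketch (not code from this repo): computing unweighted and
# electoral-vote-weighted Brier scores. `probs` are predicted Democratic win
# probabilities per state, `won` the actual outcomes (1 = Democrat won),
# `ev` the electoral votes of each state.
def brier_scores(probs, won, ev):
    sq_err = [(p - w) ** 2 for p, w in zip(probs, won)]
    unweighted = sum(sq_err) / len(sq_err)
    ev_weighted = sum(e * s for e, s in zip(ev, sq_err)) / sum(ev)
    return ev_weighted, unweighted

# Toy example with three hypothetical states:
w, u = brier_scores(probs=[0.9, 0.6, 0.2], won=[1, 1, 0], ev=[10, 20, 5])
# w = 0.1, u = 0.07
```

Lower is better in both columns; the weighted score rewards getting large states right.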

Predictions for each state

state mean low high prob se
NC 0.502 0.466 0.538 0.545 0.021
MO 0.509 0.474 0.543 0.704 0.020
FL 0.514 0.479 0.548 0.788 0.020
IN 0.484 0.448 0.519 0.179 0.021
AR 0.477 0.439 0.516 0.124 0.023
OH 0.524 0.489 0.559 0.917 0.021
MT 0.475 0.437 0.512 0.094 0.022
VA 0.527 0.491 0.564 0.917 0.022
GA 0.473 0.437 0.511 0.083 0.022
WV 0.470 0.434 0.507 0.058 0.022
AZ 0.468 0.429 0.507 0.055 0.023
NV 0.532 0.494 0.568 0.952 0.022
CO 0.533 0.496 0.571 0.961 0.022
0.539 0.517 0.562 1.000 0.013
LA 0.460 0.421 0.500 0.025 0.024
MS 0.457 0.418 0.497 0.019 0.024
TX 0.454 0.413 0.495 0.015 0.024
SD 0.453 0.413 0.492 0.008 0.023
SC 0.450 0.412 0.488 0.003 0.023
NH 0.550 0.514 0.587 0.998 0.022
ND 0.449 0.409 0.487 0.006 0.023
PA 0.554 0.518 0.588 0.999 0.020
TN 0.445 0.406 0.484 0.003 0.023
WI 0.557 0.521 0.592 0.999 0.021
KY 0.442 0.405 0.478 0.001 0.021
MN 0.559 0.523 0.595 1.000 0.021
NM 0.561 0.522 0.599 0.998 0.023
IA 0.562 0.526 0.597 1.000 0.021
MI 0.565 0.530 0.598 1.000 0.020
OR 0.572 0.535 0.608 1.000 0.021
KS 0.424 0.387 0.460 0.000 0.022
AK 0.417 0.377 0.458 0.000 0.024
WA 0.584 0.548 0.620 1.000 0.021
ME 0.584 0.545 0.620 1.000 0.021
NJ 0.590 0.552 0.627 1.000 0.022
AL 0.405 0.368 0.444 0.000 0.023
NE 0.402 0.365 0.440 0.000 0.023
DE 0.616 0.577 0.652 1.000 0.022
CA 0.617 0.578 0.653 1.000 0.022
MD 0.618 0.572 0.664 1.000 0.027
CT 0.619 0.579 0.655 1.000 0.022
OK 0.378 0.340 0.417 0.000 0.023
IL 0.629 0.591 0.663 1.000 0.020
WY 0.366 0.331 0.404 0.000 0.022
MA 0.642 0.603 0.680 1.000 0.023
ID 0.354 0.318 0.392 0.000 0.022
NY 0.647 0.610 0.683 1.000 0.021
VT 0.654 0.615 0.693 1.000 0.023
UT 0.343 0.306 0.381 0.000 0.023
HI 0.663 0.622 0.703 1.000 0.024
RI 0.670 0.633 0.706 1.000 0.021
DC 0.933 0.917 0.946 1.000 0.008

2012

Map

Final electoral college histogram

National and state polls and the electoral college over time

State vs national deltas over time

Model results vs polls vs the prior

Performance

outlet ev_wtd_brier unwtd_brier states_correct
Linzer NA 0.0038000 NA
Wang/Ferguson NA 0.0076100 NA
Silver/538 NA 0.0091100 NA
Jackman/Pollster NA 0.0097100 NA
Desart/Holbrook NA 0.0160500 NA
economist (backtest) 0.0324297 0.0193188 50
Intrade NA 0.0281200 NA
Enten/Margin of Error NA 0.0507500 NA

## [1] 0.02247233

Predictions for each state

state mean low high prob se
VA 0.504 0.467 0.539 0.586 0.021
CO 0.505 0.471 0.541 0.617 0.021
FL 0.495 0.457 0.531 0.394 0.021
OH 0.510 0.474 0.545 0.704 0.021
0.510 0.489 0.533 0.773 0.014
NH 0.513 0.478 0.548 0.758 0.021
IA 0.514 0.480 0.550 0.785 0.021
NC 0.485 0.451 0.521 0.212 0.022
NV 0.516 0.478 0.554 0.797 0.022
WI 0.521 0.486 0.556 0.876 0.021
PA 0.528 0.492 0.563 0.937 0.021
MN 0.534 0.500 0.568 0.974 0.021
MI 0.537 0.502 0.572 0.979 0.021
OR 0.538 0.499 0.575 0.971 0.022
MO 0.460 0.424 0.496 0.015 0.022
NM 0.540 0.500 0.580 0.976 0.023
IN 0.455 0.420 0.492 0.006 0.022
MT 0.453 0.417 0.489 0.006 0.022
GA 0.452 0.413 0.492 0.009 0.024
AZ 0.452 0.414 0.491 0.009 0.023
NJ 0.556 0.518 0.593 0.999 0.022
ME 0.557 0.520 0.593 0.999 0.021
WA 0.561 0.526 0.596 1.000 0.021
SC 0.438 0.395 0.483 0.004 0.026
CT 0.567 0.529 0.603 1.000 0.022
SD 0.431 0.392 0.470 0.000 0.023
ND 0.422 0.385 0.461 0.000 0.023
MS 0.421 0.375 0.468 0.000 0.028
TN 0.419 0.383 0.458 0.000 0.023
WV 0.416 0.378 0.456 0.000 0.024
CA 0.585 0.549 0.622 1.000 0.022
MA 0.588 0.552 0.623 1.000 0.021
TX 0.409 0.372 0.448 0.000 0.023
NE 0.407 0.370 0.444 0.000 0.022
IL 0.599 0.562 0.635 1.000 0.021
LA 0.401 0.363 0.442 0.000 0.024
DE 0.602 0.555 0.646 1.000 0.026
KY 0.398 0.360 0.438 0.000 0.024
KS 0.395 0.355 0.435 0.000 0.024
MD 0.607 0.565 0.646 1.000 0.023
RI 0.616 0.575 0.654 1.000 0.023
AR 0.384 0.346 0.423 0.000 0.023
AL 0.383 0.342 0.423 0.000 0.024
NY 0.620 0.584 0.655 1.000 0.021
AK 0.363 0.320 0.407 0.000 0.026
VT 0.660 0.620 0.697 1.000 0.022
HI 0.661 0.620 0.699 1.000 0.023
ID 0.331 0.295 0.369 0.000 0.022
OK 0.330 0.293 0.369 0.000 0.023
WY 0.313 0.277 0.350 0.000 0.022
UT 0.291 0.257 0.326 0.000 0.021
DC 0.903 0.880 0.924 1.000 0.012

2016

Map

Final electoral college histogram

National and state polls and the electoral college over time

State vs national deltas over time

Model results vs polls vs the prior

Performance

outlet ev_wtd_brier unwtd_brier states_correct
economist (backtest) 0.0725679 0.0508319 48
538 polls-plus 0.0928000 0.0664000 46
538 polls-only 0.0936000 0.0672000 46
princeton 0.1169000 0.0744000 47
nyt upshot 0.1208000 0.0801000 46
kremp/slate 0.1210000 0.0766000 46
pollsavvy 0.1219000 0.0794000 46
predictwise markets 0.1272000 0.0767000 46
predictwise overall 0.1276000 0.0783000 46
desart and holbrook 0.1279000 0.0825000 44
daily kos 0.1439000 0.0864000 46
huffpost 0.1505000 0.0892000 46

## [1] 0.02724916

Predictions for each state

state mean low high prob se
FL 0.496 0.457 0.536 0.435 0.024
NV 0.508 0.467 0.548 0.652 0.024
NC 0.492 0.451 0.532 0.340 0.024
0.512 0.485 0.540 0.791 0.017
NH 0.514 0.475 0.554 0.736 0.024
PA 0.514 0.475 0.553 0.745 0.024
CO 0.516 0.476 0.555 0.774 0.023
OH 0.484 0.445 0.523 0.217 0.023
MI 0.518 0.479 0.558 0.816 0.024
WI 0.521 0.483 0.561 0.844 0.024
IA 0.478 0.439 0.517 0.148 0.023
VA 0.523 0.482 0.562 0.863 0.023
MN 0.527 0.489 0.568 0.909 0.024
AZ 0.471 0.430 0.510 0.077 0.024
GA 0.470 0.430 0.508 0.069 0.023
NM 0.532 0.489 0.575 0.926 0.025
ME 0.542 0.503 0.582 0.985 0.024
SC 0.452 0.411 0.493 0.011 0.024
OR 0.549 0.509 0.590 0.992 0.025
TX 0.443 0.403 0.483 0.002 0.024
MO 0.439 0.401 0.478 0.001 0.023
MS 0.436 0.397 0.476 0.001 0.024
CT 0.565 0.525 0.607 1.000 0.024
WA 0.567 0.528 0.607 1.000 0.024
DE 0.568 0.525 0.609 1.000 0.025
AK 0.425 0.384 0.468 0.000 0.026
NJ 0.578 0.537 0.619 1.000 0.024
IN 0.419 0.380 0.458 0.000 0.023
IL 0.583 0.543 0.623 0.999 0.024
LA 0.410 0.371 0.449 0.000 0.024
MT 0.406 0.368 0.446 0.000 0.024
RI 0.595 0.555 0.636 1.000 0.024
KS 0.403 0.368 0.441 0.000 0.022
TN 0.403 0.365 0.442 0.000 0.023
SD 0.398 0.361 0.438 0.000 0.023
NY 0.611 0.571 0.649 1.000 0.023
ND 0.389 0.352 0.429 0.000 0.024
NE 0.388 0.351 0.428 0.000 0.023
AL 0.384 0.346 0.423 0.000 0.023
AR 0.382 0.345 0.421 0.000 0.023
CA 0.620 0.581 0.658 1.000 0.023
UT 0.375 0.337 0.412 0.000 0.022
KY 0.373 0.336 0.411 0.000 0.023
MA 0.629 0.591 0.668 1.000 0.023
MD 0.639 0.598 0.677 1.000 0.023
WV 0.353 0.316 0.390 0.000 0.022
ID 0.349 0.313 0.386 0.000 0.022
OK 0.342 0.305 0.378 0.000 0.022
VT 0.658 0.619 0.696 1.000 0.023
HI 0.661 0.620 0.699 1.000 0.023
WY 0.289 0.255 0.324 0.000 0.021
DC 0.908 0.885 0.928 1.000 0.012

Cumulative charts

Probability calibration plot

Confidence interval coverage

Licence

This software is published by The Economist under the MIT licence. The data generated by The Economist are available under the Creative Commons Attribution 4.0 International License.

The licences include only the data and the software authored by The Economist, and do not cover any Economist content or third-party data or content made available using the software. More information about licensing, syndication and the copyright of Economist content can be found here.

us-potus-model's People

Contributors

martgnz, merlinheidemanns



us-potus-model's Issues

Thank you for open-sourcing the code, a quick question

Really grateful you guys shared this. Such an audacious attempt to forecast something so uncertain so far into the future.

Just to confirm: after reading the "How it works" document explaining the methodology, I am trying to locate the corresponding code that generates the beta prior for the popular vote, but could not find it.

Maybe it's down to my limited understanding of Stan, but I didn't see any elastic-net code either.

Best,
Paul

Unable to locate file

Hi, I keep receiving the following error on line 250 and am unsure how to resolve it, as I cannot locate the "historical_prior_simulations.rds" file in the data folder.

Error:
In gzfile(file, "rb") :
cannot open compressed file 'data/historical_prior_simulations.rds', probable reason 'No such file or directory'

Thank you!

Error running 2008_model.R

I am running the final_2008.R script (in RStudio on a Mac). I believe the model is using rstan as opposed to cmdstanr, so I commented out lines 561 to 571 and un-commented 555-559.

When I run the model I get the following error:

Error in mod$fit_ptr() : 
  Exception: variable does not exist; processing stage=data initialization; variable name=poll_mode_state; base type=int  (in 'modela29f20816b55_poll_model_2020' at line 14)

failed to create the sampler; sampling not done  

If I go to lines 519-522, where these variables have been commented out, and change them, I get the following error:

Error in eval(ei, envir) : object 'poll_mode_state' not found

I am clearly not using the model correctly. Can someone help? Apologies, I am sure this is a basic issue

Hi

Hello, thanks for making this available. Is there a .csv with the current year's polls in it? I see only 2016 and earlier.

Backtesting Results

Would it be possible to provide a file with the model output from the 2008-2016 backtests? I know I could just run the files myself but that would be time-consuming. It would be helpful if people could see the backtesting results to better compare this model to others.

2008 polls

Hi,

Just wanted to double-check: are the 2008 polls all from before the election, or do they cover the post-election period as well?

There are some polls that fall in December 2008. I just want to make sure this isn't a typo.

An example is below:

-- | CNN | 1019 | 12/15/08 | 12/17/08 | 43 | 47 | 7 | 3 | Live Phone | Registered Voters

EDIT: I just double checked RCP (https://www.realclearpolitics.com/epolls/2008/president/us/general_election_mccain_vs_obama-225.html#polls), looks like some 2008 polls were conducted in 2007, so in your data some of the national polls have the wrong year.

Thanks,
Zoe

What is the goal?

I love the fact that this repo is being shared and The Economist is taking an open source approach to this. BRAVO!!!

That said, publicizing the code and data is one thing; making it ready for the community to digest and improve is another. So before I post issues, I'd love it if you could clarify a few questions of mine:

1. What is the goal / purpose of this being open sourced?
2. What is the goal / purpose for this codebase?

Answering these questions would help me understand the pros/cons of this. For example, if the goal is to accurately predict an outcome given all available data, shouldn't this model include more recent data? Furthermore, if the goal of open-sourcing this is to get community involvement, this codebase could use better structure to separate "standard code" from "domain-specific" code and help focus efforts.

For example, please correct me if I'm wrong, but after diving in a little, it looks like the data feeding this model is all 2016 and before? (This would mean there is no way to correct for specific new candidates that have never influenced the data before).

I'd love any feedback you have on this as the importance of this codebase and the influence of The Economist warrant this being taken seriously.

Towards transparency! Thank you The Economist!

Based on Jackman's work?

I am really thankful that The Economist has brought onboard Andrew Gelman to model the probability of the likely outcome of the 2020 election. I am struck by the similarity between the Stan code and Simon Jackman's work on election prediction, namely his state-space model for the 2004 Australian federal election published in the Australian Journal of Political Science and, more poignantly for the American case, his modelling efforts for the 2012 presidential election. Linzer may have published the model implemented here in the Journal of the American Statistical Association in 2013, but Jackman implemented it in 2012 for the election and got 50 out of 50 states. Linzer's model, as far as I can tell, is Jackman's 2006 article with the hierarchical structure from his textbook. The intellectual origins are Jackman's: he published the state-space model in 2006 and spelt out, in two chapters of his textbook, state-space modelling of vote intention and exchangeability for state-level predictions with hierarchical models. Linzer was visiting Stanford in 2012-2013, yet only completed his PhD at UCLA in 2008 and had mostly published on latent class analysis until 2012; I wonder how much cross-fertilisation there was. Sure, Linzer got the publication while visiting (Jackman at) Stanford, but it seems to me like there was a little healthy competition at Stanford in 2012 worth acknowledging.

From my first inspection of the model, perhaps with the exception of the Cholesky factor decomposition that the Bayesian probabilistic language Stan has made far easier to implement than the preceding modelling language, JAGS, the model Gelman has used is basically the same as Jackman's and can be found in chapters 8 and 9 of his textbook, Bayesian Analysis for the Social Sciences (2009). I understand it's hard to improve on something simple and excellent, but Gelman has his own textbooks, most notably Bayesian Data Analysis (third edition, 2013), with lots of nifty tools and tricks of the trade that I would have loved to see expressed in this modelling effort. Gelman is the king of priors; I am pretty sure I'm not alone in wishing he had more artistic/scientific licence to attempt something a bit more novel. For example: incorporating estimates from the Multilevel Regression and Poststratification analyses he has worked hard to popularise, and which have produced some very cool and important findings about American public opinion (in particular, partisan non-response bias); employing a Gaussian process or spline to "exploit a sensitivity–stability trade-off", since "they stabilize estimates and predictions by making fitted models less sensitive to certain details of the data", doing a little shrinking in the hierarchical component of the model; or making the priors a little more creative, say with a horseshoe prior to handle sparsity. It would even have been awesome to see an ensemble of model specifications, averaging over the models and weighting by their performance. But all of this has a cost.

I'm willing to accept that the model had to be kept simple since it is being updated daily; incorporating components that explore high-dimensional space, while theoretically cool to a statistician or political scientist, would blow out the time it takes to analyse the data. The practicality of implementation and the questionable value added by a slightly more accurate model make the trade-off seem appropriate.

The website appears to calculate senate ties incorrectly - and therefore also the total win probability

At the time of this writing, the website reports that Biden has a 91% chance to win the presidency and Democrats a 76% chance to win the senate. Scrolling down and hovering over the histogram of senate simulations, the website reports the Democrats to have 74% chance of winning at least 51 seats. When hovering over the gray bar just to the right, the text is "There is a 10% chance the senate will be split evenly, with an 18% chance a Democrat breaks the tie."

I assume that last phrase refers to the vice president breaking ties in the senate. But then given Biden's lead, surely it should be the Democrats who have the advantage to win senate ties? My guess is that someone inverted that probability and the correct number should be 82% in favor of the Democrats, and therefore the Democrats should have a 74% + 10%*82% ≈ 82% chance to win the senate (instead of the reported 76%).
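The arithmetic in the paragraph above can be checked directly (all figures are the issue author's readings of the website, not model output):

```python
# Sketch of the proposed correction: Democrats win the senate either outright
# or via a 50-50 tie broken by the vice president.
p_outright = 0.74     # P(Democrats win at least 51 seats)
p_tie = 0.10          # P(50-50 senate)
p_break_tie = 0.82    # proposed corrected P(a Democrat breaks the tie)
p_win = p_outright + p_tie * p_break_tie
# p_win = 0.822, i.e. about 82%
```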

Or am I misunderstanding something?

Stan warnings in 2016 (and 2020)

Running the final_2016.R code, Stan throws warnings:

Warning messages:
1: There were 1 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: Examine the pairs() plot to diagnose sampling problems
 
3: The largest R-hat is NA, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat 
4: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess 
5: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess 

This is with n_iter <- 2500, 4 chains, and rstan::stan_version() tells me I'm running version 2.21.0 (I'm using R 4.0.2 on Windows 10, if that's germane; I'd be happy to supply any other versioning or hardware info).

If I run it with n_iter <- 1000 on 4 chains (as it is by default), I get the last three error messages (largest R-hat is NA, Bulk ESS too low, and Tail ESS too low).

I'd give you the resulting .rds file for the output, but it's several gigs in size.

Incorrect data in poll sheet

Hi there.

I noticed today that the downloadable sheet you have of current polling data on the economist website has incorrect values for the Biden vote share on the latest RMG poll. That poll has a 46% vote share for Biden, not the 48% in your data. I don't know how much that will adjust your model - I assume mostly a wiggle since it's one poll, but wanted to alert you to the error.

If this is a deliberate adjustment to weight a partisan poll or something, ignore this issue!

CSV extract of polls has malformed dates

The CSV extract of polls on https://projects.economist.com/us-2020-forecast/president contains malformed date values in start.date and end.date (specifically the year).

library(tidyverse)

polls <- read_csv("data/2020 US presidential election polls - all_polls.csv")

polls %>% 
  filter(start.date == "5/17/2002" | end.date %in% c("7/14/20220", "5/19/0220")) %>% 
  select(state, pollster, start.date, end.date)
# A tibble: 3 x 4
  state pollster      start.date end.date  
  <chr> <chr>         <chr>      <chr>     
1 FL    St Pete Polls 7/13/2020  7/14/20220
2 NY    Sinea College 5/17/2002  5/21/2020 
3 CA    SurveyUSA     5/18/2020  5/19/0220 
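A simple sanity check catches all three malformed values. This is a hypothetical validation sketch (not the site's code): flag m/d/Y strings whose year is not a plausible four-digit election-cycle year.

```python
import re

# Hypothetical check: accept only dates whose year falls in a plausible
# window, which rejects values like "7/14/20220", "5/17/2002" and "5/19/0220"
# shown in the tibble above.
def plausible_mdy(s, lo=2016, hi=2021):
    m = re.fullmatch(r"(\d{1,2})/(\d{1,2})/(\d+)", s)
    return bool(m) and lo <= int(m.group(3)) <= hi

plausible_mdy("7/13/2020")   # True
plausible_mdy("7/14/20220")  # False: five-digit year
plausible_mdy("5/19/0220")   # False: implausible year 220
```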

stringsAsFactors = TRUE prevents use of state polling data by stan models fit to 2008, 2012, and 2016

I'd like to report an easy-to-make but consequential bug. It seems the bug prevents use of all state poll data when fitting the Stan models to 2008, 2012, and 2016. I'm unsure of other effects.

Like many bugs in R scripts, this one stems from read.csv having the default option stringsAsFactors = TRUE.

Bug locations

All entries of df$index_s are reassigned value NA at

  • Lines 137-8 of final_2008.R
  • Lines 136-7 of final_2012.R
  • Lines 135-6 of final_2016.R

Bug appearance

All instances look like
index_s = as.numeric(factor(as.character(state), levels = c('--',state_abb_list)))

Reason for bug

Because read.csv defaults to stringsAsFactors=TRUE, state_abb_list has class factor, not the anticipated class character. As a
result c('--', state_abb_list) returns c('--', '1', '2', '3', ...), not the anticipated c('--', 'AK', 'AL', 'AR', 'AZ',...).

A small additional point: It looks like states are misordered in the comment directly above the bug in the R scripts.

Question about `state_correlation_error` variance estimation.

Hello again,

I'm a little confused about what's going on here in the find_sigma2_value function. Specifically this line

y <- MASS::mvrnorm(100000, rep(0.5,10), Sigma = cov_matrix(10, par^2, 1) )

why is the correlation set to 1 in cov_matrix? This results in the 10 columns of y being identical, and hence aggregations like apply(y, MARGIN = 2, mean) result in a constant vector, which seems to make the subsequent call to mean redundant. (EDIT - actually while we're on the subject, what's the significance of the choice of 10 here?)

I notice a few lines down when state_correlation_error is created a correlation of 0.9 is used, and similarly for state_correlation_mu_b_walk and state_correlation_mu_b_T. Is there a reason the variance is estimated using a different correlation than is used to create the covariance matrices themselves?

A related question, why is the mean of y set to 0.5? Applying inv.logit to standard normal samples would result in transformed samples with mean 0.5, but on the logit scale does mean 0 make more sense? (I don't think I understand what's going on here properly yet, so I might be wrong, the question is basically just motivated by symmetry 🙂)
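The observation about correlation 1 can be reproduced in a few lines. This is a NumPy sketch of the point being made (not the repo's find_sigma2_value): with an equicorrelation covariance whose off-diagonal correlation is 1, every draw has identical coordinates, so all 10 columns are the same and any per-column aggregation returns a constant vector.

```python
import numpy as np

# Covariance with variance sd^2 on the diagonal and correlation 1 everywhere:
# a rank-1 matrix, so every coordinate of each draw is identical.
n, d, sd = 1000, 10, 0.05
Sigma = (sd ** 2) * np.ones((d, d))
rng = np.random.default_rng(0)
y = rng.multivariate_normal(np.full(d, 0.5), Sigma, size=n)
col_means = y.mean(axis=0)   # constant vector, up to floating-point error
```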

Delegate Confidence Interval conflicts with estimated probability of win.

As of today, the model on the economist.com website estimates the win probability of candidate Biden / President Trump at 96% / 4% respectively. At the same time, the chart of estimated delegates shows the 95% confidence interval for both candidates crossing the 270-delegate line. From my understanding of statistics and the US election system, these seem to conflict.

Just a thank you ...

... to the Economist for sharing and to the authors.

Let's pretend this is an enhancement issue:

what about adding some interactive code to switch States?

Really stylish!

R

Release CSV with results of simulations

I would love to be able to download a file with the results of the simulations. I am curious to see how states are correlated with each other in your model. Thank You. I love your model. It's the best, most open model in existence. I hope more data journalists follow your lead.

How to run this?

I'm trying to run the scripts/model/poll_model_2020.stan. After compiling it with cmdstan, I keep getting the error:

$ ./poll_model_2020  method=sample  num_samples=1000 num_warmup=1000
method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 1000 (Default)
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
      gamma = 0.050000000000000003 (Default)
      delta = 0.80000000000000004 (Default)
      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 10 (Default)
        metric = diag_e (Default)
        metric_file =  (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 0 (Default)
    num_chains = 1 (Default)
id = 1 (Default)
data
  file =  (Default)
init = 2 (Default)
random
  seed = 4125478090 (Default)
output
  file = output.csv (Default)
  diagnostic_file =  (Default)
  refresh = 100 (Default)
  sig_figs = -1 (Default)
  profile_file = profile.csv (Default)
num_threads = 1 (Default)

Exception: variable does not exist; processing stage=data initialization; variable name=N_national_polls; base type=int (in '../us-potus-model/scripts/model/poll_model_2020.stan', line 2, column 2 to column 23)

My sense is that there is probably some data file that I'd have to feed into this, but, being a complete novice with stan, I'm not sure where to find it. If I use

$ ./poll_model_2020  method=sample  num_samples=1000 num_warmup=1000 data file=../../data/all_polls.csv

I still get the same error.

Any thoughts? Coming from https://statmodeling.stat.columbia.edu/2023/06/14/the-economist-is-hiring-a-political-data-scientist-to-do-election-modeling/, where Andrew mentioned that getting this model to run would be "trivial".

Extra levels in factor for population variable

I'm not quite sure what model you use to adjust for pollster and mode effects, but I noticed that the population variable in your data set has the following levels:
"a", "rv", "lv", "rv ", "lv ".
Meaning there are some polls where the population value has an extra trailing space. You can't see this directly, but it appears when you run levels(data$population), where data is the data frame of the polling data.
Here are the affected polls (population values quoted to show the trailing space):

  • 374: state --, YouGov (sponsor: Economist), 2020-08-30 to 2020-09-01, entered 9/2/2020 8:00, n = 1207, population "rv ", mode Online, Biden 51, Trump 40, margin 11, other 2, undecided 4, https://today.yougov.com, notes: 63 days
  • 448: state NH, Saint Anselm College, 2020-08-15 to 2020-08-17, entered 8/20/2020 19:21, n = 1042, population "rv ", mode Online, Biden 51, Trump 43, margin 8, https://htv-prod-media.s3.amazonaws.com/files/saintanselmpollaugust20-1597956431.pdf
  • 662: state --, Redfield & Wilton Strategies, 2020-07-09 to 2020-07-09, entered 7/13/2020 14:23, n = 1853, population "lv ", mode Online, Biden 48, Trump 40, margin 8, other 3, undecided 10, https://redfieldandwiltonstrategies.com/latest-usa-voting-intention-july-9/

Unused parameter in stan model

It looks like all references to the mu_b_T_model_estimation_error parameter are commented out in the poll_model_2020.stan model as well as poll_model_2020_no_mode_adjustment.stan. Presumably this variable should be removed from the parameters declaration. I'm not sure if this is the current Stan behavior, but there are at least some cases from past versions where unused parameters can negatively impact model behavior through unintended impacts on the posterior.

real mu_b_T_model_estimation_error;

Potential use of 2016 election results in 2016 election?

I'm not super confident I'm fully understanding the logic which is going into state_data / state_cov / state_correlation, however it seems like the lines:

state_data <- read.csv("data/potus_results_76_16.csv")
state_data <- state_data %>%
  select(year, state, dem) %>%
  group_by(state) %>%
  mutate(dem = dem) %>% # mutate(dem = dem - lag(dem)) %>%
  select(state, variable = year, value = dem) %>%
  ungroup() %>%
  na.omit() %>%
  filter(variable == 2016)

(lines 99-107) in final_2016.R

mean that we use the 2016 election results to create these correlations. I would assume we'd want to use the 2012 results as our prior?

Related to this, is there a file for recreating the 2020 results? Or should I modify final_2016 for 2020?

Missing files

This is great. Are there files missing though?

e.g. out <- read_rds(sprintf('models/backtest_2008/stan_model_%s.rds',RUN_DATE))

cannot open compressed file 'models/backtest_2008/stan_model_2008-11-03.rds', probable reason 'No such file or directory'Error in gzfile(file, "rb") : cannot open the connection
