torvaney / regista


An R package for soccer modelling

Home Page: https://torvaney.github.io/regista/

License: GNU General Public License v3.0

R 100.00%

soccer sports-analytics football rstats r

regista's Introduction

regista

Overview

regista is a package for performing some of the common modelling tasks in soccer analytics.

Installation

regista is not currently available on CRAN, but it can be installed from GitHub like so:

# install.packages("devtools")
devtools::install_github("torvaney/regista")

Examples

Dixon-Coles

The “Dixon-Coles model” is a modified Poisson model, specifically designed for estimating teams’ strengths and for predicting football matches.

Regista provides an implementation of this model:

library(regista)

fit <- dixoncoles(hgoal, agoal, home, away, data = premier_league_2010)

print(fit)
#> 
#> Dixon-Coles model with specification:
#> 
#> Home goals: hgoal ~ off(home) + def(away) + hfa + 0
#> Away goals: agoal ~ off(away) + def(home) + 0
#> Weights   : 1
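
For intuition, the modification to the Poisson model in Dixon and Coles's original formulation is a dependence term, usually written tau, that rescales the probabilities of the four lowest-scoring results. The sketch below is a standalone base-R illustration of that term, not regista's internal code; the names `dc_tau` and `dc_prob` and their arguments are hypothetical.

```r
# Illustrative low-score adjustment from the Dixon-Coles model.
# `dc_tau` is a hypothetical name, not part of regista's API.
dc_tau <- function(hg, ag, home_rate, away_rate, rho) {
  ifelse(hg == 0 & ag == 0, 1 - home_rate * away_rate * rho,
  ifelse(hg == 0 & ag == 1, 1 + home_rate * rho,
  ifelse(hg == 1 & ag == 0, 1 + away_rate * rho,
  ifelse(hg == 1 & ag == 1, 1 - rho, 1))))
}

# Probability of a scoreline: independent Poissons, rescaled by tau
dc_prob <- function(hg, ag, home_rate, away_rate, rho) {
  dc_tau(hg, ag, home_rate, away_rate, rho) *
    dpois(hg, home_rate) * dpois(ag, away_rate)
}

# Probability of a 0-0 draw for some made-up rates
dc_prob(0, 0, home_rate = 1.5, away_rate = 1.1, rho = -0.1)
```

The four adjustments cancel, so the rescaled probabilities still sum to one across all scorelines.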

A more flexible API is provided with dixoncoles_ext, which allows the base Dixon-Coles model to be extended arbitrarily.

vignette("using-dixon-coles") gives some simple examples for using the model. Additionally, there are more extensive examples and analyses using regista available at the following links:

Other options

The mezzala package provides similar functionality for Python.
The goalmodel R package contains an implementation of the Dixon-Coles model, along with some additional methods for modelling the number of goals scored in sports games.

regista's Issues

Dixon-Robinson fit

Next model to be implemented:

M. J. Dixon and M. E. Robinson, "A birth process model for association football matches", The Statistician, 47(3): 523-538, 1998.

broom model methods

For easy use with broom/tidyverse.

Methods:

tidy
augment

Can't create table of scoreline probabilities without dixoncoles class object

I'm a newbie at R and have been learning for only a week. augment.dixoncoles() creates a huge table of scoreline probabilities, which I can then pass to sample() to generate a result for the game. Is there a way to create this table from only the expected goals for each team? For example, in the 2018 World Cup, France was expected to score 2.644 and concede 0.681 against Australia. I have these numbers for all the WC games as a test. Is there a way to transform this data so I can use it in the way I described? Am I missing something? Sorry to send this here, I don't want to bother you on Twitter.
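
One possible approach that needs no fitted model is sketched here in base R. It assumes the two goal counts are independent Poisson variables, which drops the Dixon-Coles low-score adjustment, and it truncates the table at a maximum score. The function `scoreline_probs` and its `max_goals` argument are hypothetical names for illustration, not part of regista.

```r
# Scoreline probabilities from two expected-goal figures, assuming
# independent Poisson goal counts (a simplification; no Dixon-Coles term).
scoreline_probs <- function(home_xg, away_xg, max_goals = 10) {
  grid <- expand.grid(hgoal = 0:max_goals, agoal = 0:max_goals)
  grid$prob <- dpois(grid$hgoal, home_xg) * dpois(grid$agoal, away_xg)
  grid$prob <- grid$prob / sum(grid$prob)  # renormalise after truncation
  grid
}

# Figures quoted above: France 2.644, Australia 0.681
probs <- scoreline_probs(2.644, 0.681)

# Draw one simulated result, weighting rows by their probability
set.seed(1)
probs[sample(nrow(probs), size = 1, prob = probs$prob), c("hgoal", "agoal")]
```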

error with broom

Hi,

I've been playing around with regista and trying to get the Dixon-Coles model to work using shot data on the Finnish Veikkausliiga, but getting stuck because of an issue with broom - as can be seen from the below reprex, it gives the error: "Error: No tidy method for objects of class dixoncoles" - this seems to be true for all broom functions, as I was using broom::augment when I first noticed the issue. I replicated the issue using exact code from your blog post (http://www.statsandsnakeoil.com/2018/06/22/dixon-coles-and-xg-together-at-last/) so I don't think it should be an issue on my end.

Thanks,

Axel

library(tidyverse)

games <-
  read_csv("https://git.io/fNmRy") %>%
  filter(season == 2017) %>%
  nest(side, xg, .key = "shots")
#> Parsed with column specification:
#> cols(
#>   match_id = col_double(),
#>   date = col_datetime(format = ""),
#>   home = col_character(),
#>   away = col_character(),
#>   hgoals = col_double(),
#>   agoals = col_double(),
#>   side = col_character(),
#>   xg = col_double(),
#>   league = col_character(),
#>   season = col_double()
#> )
#> Warning: All elements of `...` must be named.
#> Did you want `shots = c(side, xg)`?

simulate_shots <- function(xgs) {
  tibble::tibble(goals = 0:length(xgs),
                 prob  = poisbinom::dpoisbinom(0:length(xgs), xgs))
}

simulate_game <- function(shot_xgs) {
  home_xgs <- shot_xgs %>% dplyr::filter(side == "h") %>% pull(xg)
  away_xgs <- shot_xgs %>% dplyr::filter(side == "a") %>% pull(xg)
  
  home_probs <- simulate_shots(home_xgs) %>% dplyr::rename_all(function(x) paste0("h", x))
  away_probs <- simulate_shots(away_xgs) %>% dplyr::rename_all(function(x) paste0("a", x))
  
  tidyr::crossing(home_probs, away_probs) %>%
    dplyr::mutate(prob = .data$hprob * .data$aprob)
}

simulated_games <-
  games %>%
  mutate(simulated_probabilities = map(shots, simulate_game)) %>%
  select(match_id, home, away, simulated_probabilities) %>%
  unnest(cols = c(simulated_probabilities)) %>%
  filter(prob > 0.001)  # Keep the number of rows vaguely reasonable

library(regista)

# Fit a "vanilla" Dixon-Coles model (on observed goals)
fit_vanilla <- dixoncoles(
  hgoal = hgoals,
  agoal = agoals,
  hteam = home,
  ateam = away,
  data  = factor_teams(games, c("home", "away"))
)
#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored
#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored

# Fit on the simulated data, weighted by probability
fit_simulated <- dixoncoles(
  hgoal   = hgoals,
  agoal   = agoals,
  hteam   = home,
  ateam   = away,
  weights = prob,
  data    = factor_teams(simulated_games, c("home", "away")))
#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored

estimates <-
  inner_join(
    broom::tidy(fit_vanilla),
    broom::tidy(fit_simulated),
    by = c("parameter", "team"),
    suffix = c("_vanilla", "_xg")
  ) %>%
  mutate(value_vanilla = exp(value_vanilla),
         value_xg      = exp(value_xg))
#> Error: No tidy method for objects of class dixoncoles

Created on 2020-08-20 by the reprex package (v0.3.0)

Add Ranked Probability Score (RPS) metric

As per Constantinou and Fenton (2012), this metric suits win/draw/loss predictions quite well. In fact, it was used by the challenge behind a (still to be published offline) special issue of the Machine Learning journal.
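
As a rough sketch of what such a metric could look like, the base-R function below computes the ranked probability score for a single match from the cumulative differences between predicted and observed outcomes. The name `rps` and its interface are assumptions for illustration; regista does not necessarily expose anything like it.

```r
# Ranked probability score for one match.
# probs:   predicted probabilities in order (home win, draw, away win)
# outcome: index of the observed result (1 = home win, 2 = draw, 3 = away win)
rps <- function(probs, outcome) {
  observed <- replace(numeric(length(probs)), outcome, 1)
  cum_diff <- cumsum(probs) - cumsum(observed)
  sum(cum_diff[-length(probs)]^2) / (length(probs) - 1)
}

rps(c(0.5, 0.3, 0.2), outcome = 1)
#> [1] 0.145
```

Lower is better: a perfect forecast scores 0, and putting all probability on the opposite end of the ordering scores 1.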

Unable to install with soccermatics

Original comment (from @Heydary)

I have R version 3.5. This package fails to install when I run devtools::install_github("jogall/soccermatics").
It does not load or install, and gives an error. How do I solve this problem?

Moved from #3.

Use rsample > modelr

In the tests, use rsample

Better optimisation method for Dixon-Coles model

Use an optimisation routine that enforces the constraints that make the Dixon-Coles model identifiable.

Ideally this would allow arbitrary constraints to be added to additional predictor variables specified in dixoncoles_ext.
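
One simple way to get an identifiable parameterisation, shown here as an assumption-laden sketch and not as the approach regista takes, is to optimise over n - 1 free parameters and derive the last one so that the full set sums to zero. This is a reparameterisation, not a constrained optimiser, and `sum_to_zero` is a hypothetical helper. The objective is a toy stand-in for a real likelihood.

```r
# Map n - 1 free parameters to n parameters that sum to zero.
sum_to_zero <- function(free) c(free, -sum(free))

# Toy objective: squared distance between constrained params and targets
target <- c(0.3, -0.1, -0.2)
objective <- function(free) sum((sum_to_zero(free) - target)^2)

# optim() only ever sees the unconstrained free parameters
res <- optim(par = c(0, 0), fn = objective, method = "BFGS")
params <- sum_to_zero(res$par)

sum(params)  # zero by construction, up to floating point
```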

Return tibbles

Methods should return tibble::tibbles > data.frames where appropriate. Extra dependency but it's small and we may as well lean into the tidyverse.

Add standard errors to parameter estimates

optim has an argument (hessian) that should hopefully make this pretty painless.
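
To make the idea concrete, here is a toy example that is independent of regista: a single Poisson rate is fitted on the log scale with optim(hessian = TRUE), and standard errors are taken from the inverse Hessian of the negative log-likelihood. The data are simulated and the variable names are illustrative only.

```r
# Toy example: Poisson rate on the log scale, fitted with optim().
set.seed(42)
goals <- rpois(380, lambda = 1.4)

neg_loglik <- function(par) -sum(dpois(goals, exp(par), log = TRUE))

res <- optim(par = 0, fn = neg_loglik, method = "BFGS", hessian = TRUE)

# The Hessian of the negative log-likelihood approximates the observed
# information; its inverse approximates the covariance of the estimates.
std_error <- sqrt(diag(solve(res$hessian)))

c(estimate = res$par, std_error = std_error)
```

For a weighted likelihood the inverse Hessian may not be a valid covariance estimate, so this should be treated as a starting point.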

Include example goal-times data

Find (and include?) example data containing goal times so that the Dixon-Robinson model can be fit

Summary of fitted dixoncoles model

With summary.dixoncoles and associated methods for summary object

Use tidyeval > lazyeval

Using tidyeval over lazyeval is preferred now (see here). Examples using lazyeval (dixoncoles) should use tidyeval instead (see programming with dplyr).
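
For readers unfamiliar with the pattern, a minimal tidyeval sketch using rlang might look like the following. `col_mean` is a hypothetical helper and the data frame is made up; this only illustrates capturing a bare column name with enquo() and evaluating it with eval_tidy(), not how dixoncoles itself is written.

```r
library(rlang)

# Hypothetical helper: capture a bare column name and evaluate it in `data`.
col_mean <- function(data, col) {
  col <- enquo(col)                  # capture the expression and its environment
  mean(eval_tidy(col, data = data))  # evaluate with `data` columns in scope
}

games <- data.frame(hgoal = c(2, 0, 1), agoal = c(1, 1, 3))
col_mean(games, hgoal)
#> [1] 1
```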

This also means existing documentation (including code examples in blog posts) should be touched up since this will probably be a breaking change. I guess I could deprecate the use of lazyeval, but in a package this young it seems like overkill.

Dixon-Robinson predict method

Predict in a manner consistent with predict.dixoncoles

predict.dixoncoles should work without a weighting column

Correct old blogs and documentation

Following latest versions' API changes

DixonColes error message

Hi,

I've been trying to replicate your "Dixon Coles and xG: together at last" post and keep hitting the error below after trying to run the simulated_games section

Error: Column name <tibble> must not be duplicated.
Use .name_repair to specify repair.

Any tips would be greatly appreciated

Broom model methods

tidy
augment
glance(?)

Warnings after fit

Hi Ben,

I was wondering whether these warnings should raise any concerns?

library(regista)
library(tidyverse)
library(lubridate)

ligue1_1819 <- read_csv("http://www.football-data.co.uk/mmz4281/1819/F1.csv") %>%
  select(Date, HomeTeam, AwayTeam, FTHG, FTAG) %>%
  mutate(days_ago = as.numeric(today() - dmy(Date)),
         weights = discount_exponential(days_ago, .006),
         HomeTeam = as.factor(HomeTeam),
         AwayTeam = as.factor(AwayTeam))

fit <- dixoncoles(FTHG, FTAG, HomeTeam, AwayTeam, weights = weights, data = ligue1_1819)
#> Warning in log(.tau(hg, ag, home_rates, away_rates, rho)): NaNs produced
#> Warning in log(.tau(hg, ag, home_rates, away_rates, rho)): NaNs produced
#> Warning in log(.tau(hg, ag, home_rates, away_rates, rho)): NaNs produced
#> Warning in log(.tau(hg, ag, home_rates, away_rates, rho)): NaNs produced
#> Warning in log(.tau(hg, ag, home_rates, away_rates, rho)): NaNs produced
#> Warning in log(.tau(hg, ag, home_rates, away_rates, rho)): NaNs produced
#> Warning in log(.tau(hg, ag, home_rates, away_rates, rho)): NaNs produced

Created on 2018-09-17 by the reprex package (v0.2.0).

Non-list contrasts argument ignored while modeling

I'm trying to model the Brasileirão 2023 league with the results directly out of the FBRef site, but I can't create the model for some reason I do not understand. I'm getting this message:
In model.matrix.default(~values - 1, model.frame(~values - 1), contrasts = FALSE) : non-list contrasts argument ignored
What could be happening? df is a dataframe with the columns home, hgoal, away, agoal for all the Brasileirão matches

df <-
  worldfootballR::fb_match_results(
    country = "BRA",
    gender = "M",
    season_end_year = 2023,
    tier = "1st") %>%
  janitor::clean_names() %>%
  regista::factor_teams(c("home", "away")) %>%
  rename(hgoal = home_goals,
         agoal = away_goals) %>%
  select(10:11, 13:14)

unplayed_games <- df %>% filter(is.na(hgoal) & is.na(agoal))
played_games <- df %>% filter(!is.na(hgoal) & !is.na(agoal))
model <- dixoncoles(hgoal, agoal, home, away, data = played_games)

It's also happening to the default premier_league_2010 data. Could it be some package conflict?

Unplayed games - factor issues

Hi @Torvaney ,

I've been playing around with the regista code and am looking to simulate a league season. However, I want to be able to provide the unplayed games via an Excel sheet rather than have the code find the unplayed games for me. When I attempt this, I get a 'New data must have the same factor levels as the data used to fit' error. What am I doing wrong? Is there a best way to alter the code to do such a thing?

Looking forward to hearing your response,

Thanks!
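
The error message suggests that the team columns in the new data need the same factor levels as the data used for fitting. A base-R sketch of one way to align them is below; the toy data frames and column names are assumptions, and whether this alone resolves the error in a given version of regista has not been verified here.

```r
# Teams seen when the model was fitted (toy data)
played <- data.frame(
  home = factor(c("Arsenal", "Chelsea", "Everton")),
  away = factor(c("Chelsea", "Everton", "Arsenal"))
)

# Fixtures read from a spreadsheet typically arrive as plain character columns
unplayed <- data.frame(home = c("Everton", "Arsenal"),
                       away = c("Arsenal", "Chelsea"),
                       stringsAsFactors = FALSE)

# Re-code the new fixtures using the levels of the fitting data
unplayed$home <- factor(unplayed$home, levels = levels(played$home))
unplayed$away <- factor(unplayed$away, levels = levels(played$away))

# A team missing from the fitting data would show up as NA here
stopifnot(!anyNA(unplayed$home), !anyNA(unplayed$away))
identical(levels(unplayed$home), levels(played$home))
```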

Allow initial parameter values to be passed to dixoncoles

Might be useful/faster when refitting values over time?

Something like:

fit <- dixoncoles_ext(hgoal ~ off(home) + def(away) + hfa + 0,
                      agoal ~ off(away) + def(home) + 0,
                      weight = ~ some_weighting(column_name),
                      data = games,
                      init = rep(0, 20))

refitting of the model with purrr::accumulate

Vignettes

So that people can see how to actually do things.

Some initial ideas:

Re-implementing the original paper
Hyperparameter (time-discount) tuning

Informative error message when predicting with different factor levels

A more informative error message when you try to predict.dixoncoles with new (team) factor levels would be nice:

library(tidyverse)
library(regista)

fit <- suppressWarnings(
  dixoncoles(hgoal, agoal, home, away, data = premier_league_2010)
)

newdata <- 
  premier_league_2010 %>% 
  filter(home != "Arsenal", away != "Arsenal") %>% 
  factor_teams(c("home", "away"))

predict(fit, newdata = newdata)
#> Error in rate_params %*% t(modeldata$mat1): non-conformable arguments

Speed up dixoncoles tests

Tests take a long time on account of repeated fitting of the model. This could probably be sped up in a couple of different ways.

Fit the models outside of test environment (or bind to a variable globally...) and use the objects for multiple tests
Create a dataframe containing a smaller subset of games (e.g. only 5 teams or so) to fit the model on.

Extend README

Now that the package is somewhat useable, a real README should be produced.

Support for arbitrary weighting

The original Dixon-Coles model comes with a time-discount hyperparameter to down-weight games that took place long ago, relative to recent games.

This should be added to the default dixoncoles model. Something like:

res <- dixoncoles(~hgoal, ~agoal, ~home, ~away, time_discount_function, premier_league_2010)

The generic dixoncoles_ext could take an additional argument to control the weighting of games. This would allow weighting not just by time, but also by things like friendly matches or anything else. For instance:

fit <- dixoncoles_ext(hgoal ~ off(home) + def(away) + hfa + 0,
                      agoal ~ off(away) + def(home) + 0,
                      weight = ~ some_weighting(column_name),
                      data = games)

With the evaluated weight argument being used to weight the log-likelihoods in the estimation step (as in the original paper).
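
To illustrate what weighting the log-likelihoods means, the sketch below uses an exponential time discount and a single-rate Poisson model in base R. `time_discount` and `weighted_loglik` are hypothetical names, the data are made up, and the decay rate 0.006 is just an example value; this is not regista's estimation code.

```r
# Exponential time discount: a game `days_ago` days old gets
# weight exp(-xi * days_ago).
time_discount <- function(days_ago, xi) exp(-xi * days_ago)

# Weighted Poisson log-likelihood for a single rate: each game's
# contribution is multiplied by its weight before summing.
weighted_loglik <- function(rate, goals, weights) {
  sum(weights * dpois(goals, rate, log = TRUE))
}

goals    <- c(3, 1, 0, 2)
days_ago <- c(400, 200, 30, 2)
w        <- time_discount(days_ago, xi = 0.006)

# Recent games dominate: the maximiser is the weighted mean of goals
optimize(weighted_loglik, interval = c(0.01, 10),
         goals = goals, weights = w, maximum = TRUE)$maximum
```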

library(tidyverse)
library(regista)

fit <- dixoncoles(hgoal, agoal, home, away,
                  data = premier_league_2010)

premier_league_2010 %>% 
  predict(fit, newdata = .)
#> # A tibble: 380 x 2
#>    home_rate away_rate
#>        <dbl>     <dbl>
#>  1      2.40     0.873
#>  2      2.35     0.685
#>  3      2.42     0.846
#>  4      3.23     1.04 
#>  5      2.30     0.938
#>  6      1.38     1.22 
#>  7      1.85     0.910
#>  8      1.79     0.892
#>  9      1.80     1.06 
#> 10      1.36     1.07 
#> # ... with 370 more rows

premier_league_2010 %>% 
  select(-hgoal, -agoal) %>%
  predict(fit, newdata = .)
#> Error in eval_tidy(f_lhs(f1), data): object 'hgoal' not found