regista's Introduction

regista

Lifecycle: experimental

Overview

regista is a package for performing some of the common modelling tasks in soccer analytics.

Installation

regista is not currently available on CRAN but can be downloaded from github like so:

# install.packages("devtools")
devtools::install_github("torvaney/regista")

Examples

Dixon-Coles

The “Dixon-Coles model” is a modified Poisson model designed for estimating teams’ strengths and predicting football matches.

regista provides an implementation of this model:

library(regista)

fit <- dixoncoles(hgoal, agoal, home, away, data = premier_league_2010)

print(fit)
#> 
#> Dixon-Coles model with specification:
#> 
#> Home goals: hgoal ~ off(home) + def(away) + hfa + 0
#> Away goals: agoal ~ off(away) + def(home) + 0
#> Weights   : 1
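
For readers unfamiliar with the model, the modification to the Poisson model can be sketched in base R. This is a minimal illustration of the low-score adjustment described in the Dixon-Coles paper, not regista's internal code; the function names `dc_tau` and `dc_prob` and the parameter values are assumptions made up for the example.

```r
# Low-score adjustment from the Dixon-Coles paper (illustrative sketch only).
# lambda, mu: expected home and away goals; rho: dependence parameter.
dc_tau <- function(hg, ag, lambda, mu, rho) {
  ifelse(hg == 0 & ag == 0, 1 - lambda * mu * rho,
  ifelse(hg == 0 & ag == 1, 1 + lambda * rho,
  ifelse(hg == 1 & ag == 0, 1 + mu * rho,
  ifelse(hg == 1 & ag == 1, 1 - rho, 1))))
}

# Scoreline probability: two independent Poisson terms times the adjustment.
dc_prob <- function(hg, ag, lambda, mu, rho) {
  dc_tau(hg, ag, lambda, mu, rho) * dpois(hg, lambda) * dpois(ag, mu)
}

# Only 0-0, 0-1, 1-0 and 1-1 are adjusted; every other scoreline is plain Poisson.
dc_prob(0, 0, lambda = 1.5, mu = 1.1, rho = -0.1)
dc_prob(3, 2, lambda = 1.5, mu = 1.1, rho = -0.1)
```

The four adjustments cancel out, so the scoreline probabilities still sum to one.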

A more flexible API is provided with dixoncoles_ext, which allows the base Dixon-Coles model to be extended arbitrarily.

vignette("using-dixon-coles") gives some simple examples for using the model. Additionally, there are more extensive examples and analyses using regista available at the following links:

Other options

  • The mezzala package provides similar functionality for Python.
  • The goalmodel R package contains an implementation of the Dixon-Coles model, along with some additional methods for modelling the number of goals scored in sports games.

regista's People

Contributors

torvaney


regista's Issues

Dixon-Robinson fit

Next model to be implemented:

M. J. Dixon and M. E. Robinson, "A birth process model for association football matches", The Statistician, 47(3): 523-538, 1998.

Can't create table of scoreline probabilities without dixoncoles class object

I'm a newbie at R, been learning for only 1 week now. augment.dixoncoles() creates a huge table of scoreline probabilities, which I can then use with sample() to generate a result for the game. Is there a way to create this table having only the expected goals for each team? For example: in the 2018 World Cup, France was expected to score 2.644 and concede 0.681 against Australia. I have these numbers for all the WC games as a test. Is there a way to transform this data so I can use it in the way I described? Am I missing something? Sorry to send this here, don't want to bother you on Twitter.
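
One possible approach, sketched in base R without a fitted model object: treat each team's goals as an independent Poisson variable and build the scoreline grid directly from the two expected-goals figures. The helper name `scoreline_table` and the `max_goals` cut-off are assumptions for illustration, and this sketch omits the Dixon-Coles low-score adjustment.

```r
# Build a table of scoreline probabilities from two expected-goals values,
# assuming independent Poisson goal counts (no Dixon-Coles adjustment).
scoreline_table <- function(home_xg, away_xg, max_goals = 10) {
  grid <- expand.grid(hgoal = 0:max_goals, agoal = 0:max_goals)
  grid$prob <- dpois(grid$hgoal, home_xg) * dpois(grid$agoal, away_xg)
  grid$prob <- grid$prob / sum(grid$prob)  # renormalise after truncating at max_goals
  grid
}

# France vs Australia, using the figures quoted in the question.
probs <- scoreline_table(2.644, 0.681)

# Draw one simulated result, weighting rows by their probability.
set.seed(1)
row <- sample(nrow(probs), size = 1, prob = probs$prob)
probs[row, c("hgoal", "agoal")]
```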

error with broom

Hi,

I've been playing around with regista and trying to get the Dixon-Coles model to work using shot data on the Finnish Veikkausliiga, but getting stuck because of an issue with broom - as can be seen from the below reprex, it gives the error: "Error: No tidy method for objects of class dixoncoles" - this seems to be true for all broom functions, as I was using broom::augment when I first noticed the issue. I replicated the issue using exact code from your blog post (http://www.statsandsnakeoil.com/2018/06/22/dixon-coles-and-xg-together-at-last/) so I don't think it should be an issue on my end.

Thanks,

Axel

library(tidyverse)

games <-
  read_csv("https://git.io/fNmRy") %>%
  filter(season == 2017) %>%
  nest(side, xg, .key = "shots")
#> Parsed with column specification:
#> cols(
#>   match_id = col_double(),
#>   date = col_datetime(format = ""),
#>   home = col_character(),
#>   away = col_character(),
#>   hgoals = col_double(),
#>   agoals = col_double(),
#>   side = col_character(),
#>   xg = col_double(),
#>   league = col_character(),
#>   season = col_double()
#> )
#> Warning: All elements of `...` must be named.
#> Did you want `shots = c(side, xg)`?

simulate_shots <- function(xgs) {
  tibble::tibble(goals = 0:length(xgs),
                 prob  = poisbinom::dpoisbinom(0:length(xgs), xgs))
}

simulate_game <- function(shot_xgs) {
  home_xgs <- shot_xgs %>% dplyr::filter(side == "h") %>% pull(xg)
  away_xgs <- shot_xgs %>% dplyr::filter(side == "a") %>% pull(xg)
  
  home_probs <- simulate_shots(home_xgs) %>% dplyr::rename_all(function(x) paste0("h", x))
  away_probs <- simulate_shots(away_xgs) %>% dplyr::rename_all(function(x) paste0("a", x))
  
  tidyr::crossing(home_probs, away_probs) %>%
    dplyr::mutate(prob = .data$hprob * .data$aprob)
}

simulated_games <-
  games %>%
  mutate(simulated_probabilities = map(shots, simulate_game)) %>%
  select(match_id, home, away, simulated_probabilities) %>%
  unnest(cols = c(simulated_probabilities)) %>%
  filter(prob > 0.001)  # Keep the number of rows vaguely reasonable

library(regista)

# Fit a "vanilla" Dixon-Coles model (on observed goals)
fit_vanilla <- dixoncoles(
  hgoal = hgoals,
  agoal = agoals,
  hteam = home,
  ateam = away,
  data  = factor_teams(games, c("home", "away"))
)
#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored
#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored

# Fit on the simulated data, weighted by probability
fit_simulated <- dixoncoles(
  hgoal   = hgoals,
  agoal   = agoals,
  hteam   = home,
  ateam   = away,
  weights = prob,
  data    = factor_teams(simulated_games, c("home", "away")))
#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1), contrasts
#> = FALSE): non-list contrasts argument ignored

estimates <-
  inner_join(
    broom::tidy(fit_vanilla),
    broom::tidy(fit_simulated),
    by = c("parameter", "team"),
    suffix = c("_vanilla", "_xg")
  ) %>%
  mutate(value_vanilla = exp(value_vanilla),
         value_xg      = exp(value_xg))
#> Error: No tidy method for objects of class dixoncoles

Created on 2020-08-20 by the reprex package (v0.3.0)

Better optimisation method for Dixon-Coles model

Use an optimisation routine that enforces the constraints that make the Dixon-Coles model identifiable.

Ideally this would allow arbitrary constraints to be added to additional predictor variables specified in dixoncoles_ext.
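
As a rough sketch of one way to enforce an identifiability constraint, the parameters can be reparameterised so that the constraint holds by construction. The toy example below (made-up data, hypothetical helper `sum_to_zero`) fits team attack strengths that are forced to sum to zero using base R's optim; it is not the method regista uses.

```r
# Expand n - 1 free parameters into n parameters that sum to zero.
sum_to_zero <- function(free) c(free, -sum(free))

# Toy data: goals scored by three teams with known attack strengths.
set.seed(42)
teams <- factor(rep(c("A", "B", "C"), each = 50))
true_off <- c(A = 0.3, B = 0.0, C = -0.3)
goals <- rpois(length(teams), exp(0.2 + true_off[as.character(teams)]))

# Negative Poisson log-likelihood; par = intercept plus (n - 1) free strengths.
neg_loglik <- function(par) {
  intercept <- par[1]
  off <- sum_to_zero(par[-1])
  rate <- exp(intercept + off[as.integer(teams)])
  -sum(dpois(goals, rate, log = TRUE))
}

fit <- optim(c(0, 0, 0), neg_loglik, method = "BFGS")
off_hat <- sum_to_zero(fit$par[-1])
names(off_hat) <- levels(teams)
off_hat
```

Because the last strength is computed from the others, the optimiser never visits a point that violates the constraint.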

Return tibbles

Methods should return tibbles (tibble::tibble) rather than data.frames where appropriate. It's an extra dependency, but it's small and we may as well lean into the tidyverse.

Use tidyeval > lazyeval

Using tidyeval over lazyeval is preferred now (see here). Examples using lazyeval (dixoncoles) should use tidyeval instead (see programming with dplyr).

This also means existing documentation (including code examples in blog posts) should be touched up since this will probably be a breaking change. I guess I could deprecate the use of lazyeval, but in a package this young it seems like overkill.
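
A minimal sketch of the tidyeval pattern, using rlang's enquo and eval_tidy. The helper `pull_column` and the example data are hypothetical and only illustrate how a bare column name can be captured and evaluated against a data frame.

```r
library(rlang)

# Hypothetical helper: evaluate a bare (unquoted) expression inside `data`.
pull_column <- function(data, column) {
  column <- enquo(column)         # capture the expression and its environment
  eval_tidy(column, data = data)  # evaluate it with the data frame's columns in scope
}

games <- data.frame(hgoal = c(2, 0, 1), agoal = c(1, 0, 3))

pull_column(games, hgoal)
pull_column(games, hgoal + agoal)
```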

DixonColes error message

Hi,

I've been trying to replicate your "Dixon Coles and xG: together at last" post and keep hitting the error below after trying to run the simulated_games section

Error: Column name <tibble> must not be duplicated.
Use .name_repair to specify repair.

Any tips would be greatly appreciated

Warnings after fit

Hi Ben,

I was wondering if those warnings should raise any concerns?

library(regista)
library(tidyverse)
library(lubridate)

ligue1_1819 <- read_csv("http://www.football-data.co.uk/mmz4281/1819/F1.csv") %>%
  select(Date, HomeTeam, AwayTeam, FTHG, FTAG) %>%
  mutate(days_ago = as.numeric(today() - dmy(Date)),
         weights = discount_exponential(days_ago, .006),
         HomeTeam = as.factor(HomeTeam),
         AwayTeam = as.factor(AwayTeam))

fit <- dixoncoles(FTHG, FTAG, HomeTeam, AwayTeam, weights = weights, data = ligue1_1819)
#> Warning in log(.tau(hg, ag, home_rates, away_rates, rho)): NaNs produced
#> Warning in log(.tau(hg, ag, home_rates, away_rates, rho)): NaNs produced
#> Warning in log(.tau(hg, ag, home_rates, away_rates, rho)): NaNs produced
#> Warning in log(.tau(hg, ag, home_rates, away_rates, rho)): NaNs produced
#> Warning in log(.tau(hg, ag, home_rates, away_rates, rho)): NaNs produced
#> Warning in log(.tau(hg, ag, home_rates, away_rates, rho)): NaNs produced
#> Warning in log(.tau(hg, ag, home_rates, away_rates, rho)): NaNs produced

Created on 2018-09-17 by the reprex package (v0.2.0).
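
One plausible source of these warnings, assuming regista's .tau follows the adjustment in the Dixon-Coles paper: while searching, the optimiser may try values of rho for which the adjustment becomes negative, and the logarithm of a negative number is NaN. The sketch below uses a hypothetical `tau_11` (the adjustment for a 1-1 scoreline) to show the effect; it does not confirm what happens inside regista.

```r
# Dixon-Coles adjustment for a 1-1 scoreline (assumed form: 1 - rho).
tau_11 <- function(rho) 1 - rho

# Candidate rho values an optimiser might try; the last makes tau negative.
rho_candidates <- c(-0.1, 0.5, 1.5)

# log() of a negative value yields NaN (with a warning, suppressed here).
log_tau <- suppressWarnings(log(tau_11(rho_candidates)))
log_tau
is.nan(log_tau)
```

If that is the cause, the warnings come from intermediate steps, and it is worth checking that the final estimate of rho gives positive adjustments.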

Non-list contrasts argument ignored while modeling

I'm trying to model the Brasileirão 2023 league with the results directly out of the FBRef site, but I can't create the model for some reason I do not understand. I'm getting this message:
In model.matrix.default(~values - 1, model.frame(~values - 1), contrasts = FALSE) : non-list contrasts argument ignored
What could be happening? df is a dataframe with the columns home, hgoal, away, agoal for all the Brasileirão matches

df <-
  worldfootballR::fb_match_results(
    country = "BRA",
    gender = "M",
    season_end_year = 2023,
    tier = "1st") %>%
  janitor::clean_names() %>%
  regista::factor_teams(c("home", "away")) %>%
  rename(hgoal = home_goals,
         agoal = away_goals) %>%
  select(10:11, 13:14)

unplayed_games <- df %>% filter(is.na(hgoal) & is.na(agoal))
played_games <- df %>% filter(!is.na(hgoal) & !is.na(agoal))
model <- dixoncoles(hgoal, agoal, home, away, data = played_games)

It's also happening to the default premier_league_2010 data. Could it be some package conflict?
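
The message text itself points at a model.matrix call that passes contrasts = FALSE. In recent versions of R, model.matrix warns when that argument is not a list and then ignores it. The base R sketch below (made-up data) reproduces the situation and suggests the resulting design matrix is unaffected; whether regista's fit is otherwise fine is not verified here.

```r
df <- data.frame(values = factor(c("a", "b", "a", "c")))

# Same style of call as in the warning; `contrasts` partially matches `contrasts.arg`.
with_arg <- suppressWarnings(
  model.matrix(~ values - 1, df, contrasts = FALSE)
)

# The call without the ignored argument.
without_arg <- model.matrix(~ values - 1, df)

# The two design matrices should match, since the non-list argument is ignored.
all.equal(with_arg, without_arg)
```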

Unplayed games - factor issues

Hi @Torvaney,

I've been playing around with the regista code and am looking to simulate a league season. However, I want to be able to provide the unplayed games via an Excel sheet rather than have the code find the unplayed games for me. When I try this, I get a 'New data must have the same factor levels as the data used to fit' error. What am I doing wrong? Is there a best way to alter the code to do such a thing?

Looking forward to hearing your response,

Thanks!
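
Going by the error message, the team columns in the new data need the same factor levels as the data the model was fitted on. A minimal base R sketch, with made-up team names and a hypothetical `fitted_levels` vector standing in for the levels of the training data:

```r
# Levels used when fitting, e.g. levels(training_data$home) (assumed here).
fitted_levels <- c("Arsenal", "Chelsea", "Everton")

# Unplayed games as they might arrive from a spreadsheet: plain character columns.
unplayed <- data.frame(
  home = c("Chelsea", "Everton"),
  away = c("Arsenal", "Chelsea"),
  stringsAsFactors = FALSE
)

# Re-create both columns as factors with exactly the fitted levels.
unplayed$home <- factor(unplayed$home, levels = fitted_levels)
unplayed$away <- factor(unplayed$away, levels = fitted_levels)

# Team names that do not appear in the fitted levels become NA, so check for them.
stopifnot(!anyNA(unplayed$home), !anyNA(unplayed$away))
levels(unplayed$home)
```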

Allow initial parameter values to be passed to dixoncoles

Might be useful/faster when refitting values over time?

Something like:

fit <- dixoncoles_ext(hgoal ~ off(home) + def(away) + hfa + 0,
                      agoal ~ off(away) + def(home) + 0,
                      weight = ~ some_weighting(column_name),
                      data = games,
                      init = rep(0, 20))
  • refitting of the model with purrr::accumulate
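
The warm-start idea can be sketched with base R's optim and Reduce, which plays the role purrr::accumulate would. Everything here is a toy assumption: `fit_rate` fits a single Poisson rate rather than a Dixon-Coles model, and the data are simulated.

```r
# Toy model: estimate one log goal rate, optionally starting from `init`.
fit_rate <- function(goals, init = 0) {
  optim(
    init,
    function(log_rate) -sum(dpois(goals, exp(log_rate), log = TRUE)),
    method = "BFGS"
  )
}

# Six "weeks" of simulated goal counts.
set.seed(1)
weeks <- split(rpois(60, 1.4), rep(1:6, each = 10))

# Refit on all data up to week k, starting from the previous week's estimate.
fits <- Reduce(
  function(prev, k) fit_rate(unlist(weeks[seq_len(k)]), init = prev$par),
  2:6,
  accumulate = TRUE,
  fit_rate(weeks[[1]])
)

# Estimated rate after each week.
sapply(fits, function(f) exp(f$par))
```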

Vignettes

So that people can see how to actually do things.

Some initial ideas:

  • Re-implementing the original paper
  • Hyperparameter (time-discount) tuning

Informative error message when predicting with different factor levels

A more informative error message when you try to predict.dixoncoles with new (team) factor levels would be nice:

library(tidyverse)
library(regista)

fit <- suppressWarnings(
  dixoncoles(hgoal, agoal, home, away, data = premier_league_2010)
)

newdata <- 
  premier_league_2010 %>% 
  filter(home != "Arsenal", away != "Arsenal") %>% 
  factor_teams(c("home", "away"))

predict(fit, newdata = newdata)
#> Error in rate_params %*% t(modeldata$mat1): non-conformable arguments
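
A sketch of what such a check could look like, run before any matrix algebra. The function `check_team_levels` and its arguments are hypothetical, not part of regista.

```r
# Hypothetical pre-prediction check: fail early with a readable message.
check_team_levels <- function(newdata, fitted_levels, cols = c("home", "away")) {
  for (col in cols) {
    new_levels <- levels(newdata[[col]])  # NULL if the column is not a factor
    if (!identical(new_levels, fitted_levels)) {
      missing_teams <- setdiff(fitted_levels, new_levels)
      stop(
        sprintf(
          "Column '%s' must be a factor with the same levels as the data used to fit. Missing levels: %s",
          col,
          if (length(missing_teams) > 0) paste(missing_teams, collapse = ", ") else "(none)"
        ),
        call. = FALSE
      )
    }
  }
  invisible(newdata)
}

fitted_levels <- c("Arsenal", "Chelsea", "Everton")
newdata <- data.frame(
  home = factor("Chelsea", levels = c("Chelsea", "Everton")),
  away = factor("Everton", levels = c("Chelsea", "Everton"))
)

# Capture the message instead of stopping, to show what the user would see.
result <- tryCatch(
  check_team_levels(newdata, fitted_levels),
  error = function(e) conditionMessage(e)
)
print(result)
```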

Speed up dixoncoles tests

Tests take a long time on account of repeated fitting of the model. This could probably be sped up in a couple of different ways.

  • Fit the models outside of test environment (or bind to a variable globally...) and use the objects for multiple tests
  • Create a dataframe containing a smaller subset of games (e.g. only 5 teams or so) to fit the model on.

Extend README

Now that the package is somewhat useable, a real README should be produced.

Support for arbitrary weighting

The original Dixon-Coles model comes with a time-discount hyperparameter to down-weight games that took place long ago, relative to recent games.

This should be added to the default dixoncoles model. Something like:

res <- dixoncoles(~hgoal, ~agoal, ~home, ~away, time_discount_function, premier_league_2010)

The generic dixoncoles_ext could take an additional argument to control the weighting of games. This would allow weighting not just to time, but also things like friendly matches or anything else. For instance:

fit <- dixoncoles_ext(hgoal ~ off(home) + def(away) + hfa + 0,
                      agoal ~ off(away) + def(home) + 0,
                      weight = ~ some_weighting(column_name),
                      data = games)

With the evaluated weight argument being used to weight the log-likelihoods in the estimation step (as in the original paper).

Time-discount functions

Following on from #3, it would be useful to have time-discount functions (like those in the original Dixon-Coles paper) available to users out of the box.
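
As an illustration, an exponential time-discount function and its use as a log-likelihood weight might look like the base R sketch below. The name `exp_discount`, the decay rate, and the toy data are assumptions; this is not necessarily how regista's own discount functions are defined.

```r
# Exponential down-weighting: games from `days_ago` days back get weight exp(-xi * days_ago).
exp_discount <- function(days_ago, xi) exp(-xi * days_ago)

days_ago <- c(0, 30, 365)
weights <- exp_discount(days_ago, xi = 0.006)
weights

# Weighted Poisson log-likelihood for a single assumed goal rate:
# each game's contribution is scaled by its weight.
goals <- c(2, 1, 0)
rate <- 1.4
sum(weights * dpois(goals, rate, log = TRUE))
```

Larger values of xi discount old games more aggressively; xi = 0 weights every game equally.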

Dixon-Coles model class

Return an actual model from the dixoncoles(_ext) function, so that predict, summary and all the other standard S3 methods can be implemented as well. Should play nicely with broom::tidy as well.

Probably not worth doing until #1 is completed.

Function to coerce team names into factors

...with levels from both home and away team.

dixoncoles requires team names as factors. If the user supplies teams as character vectors, the fitting will throw an (uninformative) error. This is annoying and hostile to users.
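
A minimal base R sketch of the idea, using a hypothetical `factor_teams_sketch` (the name and signature are made up so as not to imply anything about the package's eventual function): both team columns are converted to factors that share one set of levels taken from the union of home and away teams.

```r
# Convert team columns to factors with a shared set of levels.
factor_teams_sketch <- function(data, cols = c("home", "away")) {
  all_teams <- sort(unique(unlist(lapply(data[cols], as.character))))
  data[cols] <- lapply(data[cols], function(x) {
    factor(as.character(x), levels = all_teams)
  })
  data
}

games <- data.frame(
  home = c("Arsenal", "Chelsea"),
  away = c("Chelsea", "Everton"),
  stringsAsFactors = FALSE
)

games <- factor_teams_sketch(games)
levels(games$home)  # includes Everton even though Everton never plays at home here
```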

Dixoncoles really really slow

Hey,
I find it odd that your dixoncoles is really fast with the embedded premier league dataset, while a subset of my own data (same size as premier league 2010) is super super slow. Any tips? :)

predict.dixoncoles requires unnecessary home/away goals columns

Shouldn't need hgoal and agoal specified to predict:

library(tidyverse)
library(regista)

fit <- dixoncoles(hgoal, agoal, home, away,
                  data = premier_league_2010)

premier_league_2010 %>% 
  predict(fit, newdata = .)
#> # A tibble: 380 x 2
#>    home_rate away_rate
#>        <dbl>     <dbl>
#>  1      2.40     0.873
#>  2      2.35     0.685
#>  3      2.42     0.846
#>  4      3.23     1.04 
#>  5      2.30     0.938
#>  6      1.38     1.22 
#>  7      1.85     0.910
#>  8      1.79     0.892
#>  9      1.80     1.06 
#> 10      1.36     1.07 
#> # ... with 370 more rows

premier_league_2010 %>% 
  select(-hgoal, -agoal) %>%
  predict(fit, newdata = .)
#> Error in eval_tidy(f_lhs(f1), data): object 'hgoal' not found
