Unplayed games - factor issues about regista HOT 8 OPEN

Hughesy commented on May 28, 2024

Unplayed games - factor issues

from regista.

Comments (8)

Torvaney commented on May 28, 2024

Thanks @Hughesy

The error means that the data you're trying to predict has a different set of teams from those used to fit the model. R uses factors to handle categorical variables, that is, variables with a fixed and known set of possible values. To make predictions, the teams' factor levels have to be the same in the prediction data as in the training data.

So, to get around this, you need to make sure the home and away team columns in your unplayed games dataframe are both factors (as opposed to character vectors) and have all the teams from the training data represented in their levels.

One way to do this would be to use the factor_teams function on the whole dataset, before splitting into train and predictions subsets. Another way might be to use the factor function to turn your unplayed data's home and away team columns into factors with the right levels.
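As a minimal sketch of the second approach, using base R only: the team names below are made up for illustration, and nothing here calls regista itself. It shows why converting the new data on its own produces a different level set, and how passing the training levels keeps the two aligned.

```r
# Teams seen when fitting the model (illustrative names, not the real data)
train_home <- factor(c("Alpha", "Beta", "Gamma"))

# New fixtures read from a file arrive as plain character vectors
new_home <- c("Beta", "Alpha")

# Converting without explicit levels only keeps the teams that happen to appear
naive <- factor(new_home)
identical(levels(naive), levels(train_home))    # FALSE

# Supplying the training levels keeps the two level sets identical
aligned <- factor(new_home, levels = levels(train_home))
identical(levels(aligned), levels(train_home))  # TRUE
```

The same idea applies to both the home and away columns of the prediction data.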

If you can post a reprex I might be able to give more specific help for your use-case.


Hughesy commented on May 28, 2024

Apologies for the delay in response, I really appreciate your help!

For some reason, reprex is producing a bunch of errors, and I haven't had time this week to work out why, so unfortunately I can only provide a copy-paste of the code below.

When I look through the code below, I can see that after I read the Excel file into a data frame, the home and away columns are both characters. What would you suggest is the best approach? Also, I believe all the teams in the unplayed games also appear in the training data.

Thanks again!

> library(tidyverse)
-- Attaching packages ------------------------------- tidyverse 1.3.2 --
v ggplot2 3.4.1     v purrr   1.0.1
v tibble  3.1.8     v dplyr   1.1.0
v tidyr   1.3.0     v stringr 1.5.0
v readr   2.1.4     v forcats 1.0.0
-- Conflicts ---------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
> library(regista)
> library(readxl)
> data <-
+     read_xlsx("C:/Users/jake2/OneDrive/Documents/HockeyFixtures.xlsx") %>%
+     factor_teams(c("home", "away"))
> data
# A tibble: 72 x 106
   Div   Date                Time  home   away  hgoal agoal result HTHG 
   <chr> <dttm>              <lgl> <fct>  <fct> <dbl> <dbl> <chr>  <lgl>
 1 E0    2022-09-24 00:00:00 NA    Beest~ Nott~     2     2 D      NA   
 2 E0    2022-09-24 00:00:00 NA    East ~ Birm~     2     1 H      NA   
 3 E0    2022-09-24 00:00:00 NA    Hamps~ Buck~     5     0 H      NA   
 4 E0    2022-09-24 00:00:00 NA    Lough~ Surb~     0     5 A      NA   
 5 E0    2022-09-24 00:00:00 NA    Wimbl~ Read~     1     0 H      NA   
 6 E0    2022-09-24 00:00:00 NA    Clift~ Holc~     2     0 H      NA   
 7 E0    2022-10-01 00:00:00 NA    Readi~ Clif~     1     2 A      NA   
 8 E0    2022-10-01 00:00:00 NA    Holco~ Hamp~     3     3 D      NA   
 9 E0    2022-10-01 00:00:00 NA    Surbi~ East~     3     0 H      NA   
10 E0    2022-10-01 00:00:00 NA    Wimbl~ Loug~     2     2 D      NA   
# ... with 62 more rows, and 97 more variables: HTAG <lgl>, HTR <lgl>,
#   Referee <lgl>, HS <lgl>, AS <lgl>, HST <lgl>, AST <lgl>, HF <lgl>,
#   AF <lgl>, HC <lgl>, AC <lgl>, HY <lgl>, AY <lgl>, HR <lgl>,
#   AR <lgl>, B365H <lgl>, B365D <lgl>, B365A <lgl>, BWH <lgl>,
#   BWD <lgl>, BWA <lgl>, IWH <lgl>, IWD <lgl>, IWA <lgl>, PSH <lgl>,
#   PSD <lgl>, PSA <lgl>, WHH <lgl>, WHD <lgl>, WHA <lgl>, VCH <lgl>,
#   VCD <lgl>, VCA <lgl>, MaxH <lgl>, MaxD <lgl>, MaxA <lgl>, ...
# i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
> teams <- factor(levels(data$home), levels = levels(data$home))
> teams
 [1] Beeston W                 East Grinstead W         
 [3] Hampstead & Westminster W Loughborough W           
 [5] Wimbledon W               Clifton Robinsons W      
 [7] Reading W                 Holcombe W               
 [9] Surbiton W                Nottm Forest W           
[11] Buckingham W              Birmingham W             
12 Levels: Beeston W East Grinstead W ... Birmingham W
> unplayed_games <- read_xlsx("C:/Users/jake2/OneDrive/Documents/Unplayed_HockeyFixtures(Offline).xlsx")
> unplayed_games
# A tibble: 24 x 2
   home                      away               
   <chr>                     <chr>              
 1 Beeston W                 Clifton Robinsons W
 2 East Grinstead W          Wimbledon W        
 3 East Grinstead W          Clifton Robinsons W
 4 East Grinstead W          Surbiton W         
 5 Hampstead & Westminster W Beeston W          
 6 Hampstead & Westminster W East Grinstead W   
 7 Loughborough W            Reading W          
 8 Loughborough W            Holcombe W         
 9 Loughborough W            Birmingham W       
10 Wimbledon W               Beeston W          
# ... with 14 more rows
# i Use `print(n = ...)` to see more rows
> model <- dixoncoles(hgoal, agoal, home, away, data = data)
Warning messages:
1: In model.matrix.default(~values - 1, model.frame(~values - 1), contrasts = FALSE) :
  non-list contrasts argument ignored
2: In model.matrix.default(~values - 1, model.frame(~values - 1), contrasts = FALSE) :
  non-list contrasts argument ignored
3: In model.matrix.default(~values - 1, model.frame(~values - 1), contrasts = FALSE) :
  non-list contrasts argument ignored
4: In model.matrix.default(~values - 1, model.frame(~values - 1), contrasts = FALSE) :
  non-list contrasts argument ignored
> 
> model

Dixon-Coles model with specification:

Home goals: hgoal ~ off(home) + def(away) + hfa + 0
Away goals: agoal ~ off(away) + def(home) + 0
Weights   : 1

> team_parameters <-
+     tidy.dixoncoles(model) %>%
+     filter(parameter %in% c("off", "def")) %>%
+     mutate(value = exp(value)) %>%
+     spread(parameter, value)
> match_probabilities <-
+     regista::augment.dixoncoles(model, unplayed_games, type.predict = "outcomes") %>%
+     unnest() %>%
+     spread(outcome, prob) %>% 
+     mutate(data = paste(home, away))
Error in predict.dixoncoles(x, newdata, type = type.predict) : 
  New data must have the same factor levels as the data used to fit.
See ?factor_teams
In addition: Warning messages:
1: In model.matrix.default(~values - 1, model.frame(~values - 1), contrasts = FALSE) :
  non-list contrasts argument ignored
2: In model.matrix.default(~values - 1, model.frame(~values - 1), contrasts = FALSE) :
  non-list contrasts argument ignored
3: In model.matrix.default(~values - 1, model.frame(~values - 1), contrasts = FALSE) :
  non-list contrasts argument ignored
4: In model.matrix.default(~values - 1, model.frame(~values - 1), contrasts = FALSE) :
  non-list contrasts argument ignored


Torvaney commented on May 28, 2024

Thanks

So, you see how in unplayed_games, home and away are character fields (<chr>), and not factors (<fct>, as in your played games data)?

> unplayed_games
# A tibble: 24 x 2
   home                      away               
   <chr>                     <chr>      
 1 Beeston W                 Clifton Robinsons W
 2 East Grinstead W          Wimbledon W

You need to make these factors with the same levels as those in your played games (called data in your code).

For example,

team_levels <- levels(data$home)

unplayed_games <- read_xlsx("C:/Users/jake2/OneDrive/Documents/Unplayed_HockeyFixtures(Offline).xlsx") %>%
 mutate(home = factor(home, levels = team_levels),
        away = factor(away, levels = team_levels))

This might seem a little arcane, but there is a reason for doing things this way: the model uses the factor levels to match the teams to the parameter estimates.
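One thing worth checking after the conversion, sketched here in base R with made-up team names: factor() does not raise an error when a name is missing from the supplied levels, it quietly returns NA. How regista looks teams up internally is not shown here; this only illustrates how a factor stores each value as a position in its levels, and how to spot names that fail to match.

```r
# Hypothetical training levels and fixture list (not the real data)
team_levels <- c("Alpha", "Beta", "Gamma")

fixtures <- data.frame(
  home = c("Alpha", "Gama"),   # "Gama" is a deliberate misspelling
  away = c("Beta", "Alpha"),
  stringsAsFactors = FALSE
)

home_f <- factor(fixtures$home, levels = team_levels)

# Each value is stored as an index into the levels; unmatched names become NA
as.integer(home_f)   # 1 NA

# List any team names that do not appear in the training levels
unmatched <- setdiff(unique(c(fixtures$home, fixtures$away)), team_levels)
unmatched            # "Gama"
```

An empty result from the last step suggests that every team in the new fixtures was also present when the model was fitted.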


Hughesy commented on May 28, 2024

So I have implemented it as mentioned above, but I am now getting the error below when I run match_probabilities. Any ideas?

match_probabilities <-
    regista::augment.dixoncoles(model, unplayed_games, type.predict = "outcomes") %>%
    unnest() %>%
    spread(outcome, prob) %>%
    mutate(data = paste(home, away))
Error in fn(out, elt, ...) :
  number of rows of matrices must match (see arg 2)
In addition: Warning messages:
1: In model.matrix.default(~values - 1, model.frame(~values - 1), contrasts = FALSE) :
  non-list contrasts argument ignored
2: In model.matrix.default(~values - 1, model.frame(~values - 1), contrasts = FALSE) :
  non-list contrasts argument ignored


Torvaney commented on May 28, 2024

Hmmm - I'm struggling to replicate this error with another dataset, I'm afraid. Do you have a link to the specific files you're using?


Hughesy commented on May 28, 2024

@Torvaney see attached for both the played and unplayed fixtures. It's a really weird error. It will be interesting to see if you can replicate it with my files.

Thanks,

UnplayedGames.xlsx
HockeyFixtures.xlsx


Torvaney commented on May 28, 2024

So, I've tried to reproduce your error, but seem to be getting predictions. Can you try running in a fresh R session again and paste the output of devtools::session_info() if it fails?

library(tidyverse)
library(regista)
library(readxl)

data <-
  read_xlsx("~/Downloads/HockeyFixtures.xlsx") %>%
  factor_teams(c("home", "away"))

team_levels <- levels(data$home)

unplayed_games <- 
  read_xlsx("~/Downloads/UnplayedGames.xlsx") %>% 
  mutate(home = factor(home, levels = team_levels),
         away = factor(away, levels = team_levels))

model <- dixoncoles(hgoal, agoal, home, away, data = data)
#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1),
#> contrasts = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1),
#> contrasts = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1),
#> contrasts = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1),
#> contrasts = FALSE): non-list contrasts argument ignored
 
team_parameters <-
       tidy.dixoncoles(model) %>%
       filter(parameter %in% c("off", "def")) %>%
       mutate(value = exp(value)) %>%
       spread(parameter, value)

match_probabilities <-
       regista::augment.dixoncoles(model, unplayed_games, type.predict = "outcomes") %>%
       unnest(cols = c(.outcomes)) %>%
       spread(outcome, prob) %>% 
       mutate(data = paste(home, away))
#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1),
#> contrasts = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1),
#> contrasts = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1),
#> contrasts = FALSE): non-list contrasts argument ignored

#> Warning in model.matrix.default(~values - 1, model.frame(~values - 1),
#> contrasts = FALSE): non-list contrasts argument ignored

match_probabilities
#> # A tibble: 19 × 6
#>    home                      away                   away_…¹   draw home_…² data 
#>    <fct>                     <fct>                    <dbl>  <dbl>   <dbl> <chr>
#>  1 Beeston W                 Clifton Robinsons W     0.344  0.257   0.399  Bees…
#>  2 East Grinstead W          Wimbledon W             0.397  0.275   0.328  East…
#>  3 East Grinstead W          Clifton Robinsons W     0.139  0.167   0.694  East…
#>  4 East Grinstead W          Surbiton W              0.800  0.134   0.0660 East…
#>  5 Hampstead & Westminster W East Grinstead W        0.333  0.207   0.460  Hamp…
#>  6 Loughborough W            Holcombe W              0.330  0.228   0.441  Loug…
#>  7 Loughborough W            Birmingham W            0.268  0.251   0.482  Loug…
#>  8 Wimbledon W               Beeston W               0.129  0.238   0.633  Wimb…
#>  9 Wimbledon W               Hampstead & Westminst…  0.314  0.286   0.399  Wimb…
#> 10 Clifton Robinsons W       Surbiton W              0.907  0.0739  0.0196 Clif…
#> 11 Reading W                 Nottingham W            0.316  0.183   0.500  Read…
#> 12 Reading W                 Buckingham W            0.0908 0.152   0.757  Read…
#> 13 Holcombe W                Nottingham W            0.384  0.170   0.446  Holc…
#> 14 Holcombe W                Birmingham W            0.297  0.237   0.466  Holc…
#> 15 Surbiton W                Beeston W               0.0150 0.0599  0.925  Surb…
#> 16 Surbiton W                Hampstead & Westminst…  0.0657 0.142   0.792  Surb…
#> 17 Buckingham W              Loughborough W          0.711  0.169   0.119  Buck…
#> 18 Buckingham W              Nottingham W            0.784  0.109   0.107  Buck…
#> 19 Birmingham W              Reading W               0.451  0.263   0.286  Birm…
#> # … with abbreviated variable names ¹away_win, ²home_win

Created on 2023-03-01 with reprex v2.0.2

Also - Surbiton must be some team! 😄


Hughesy commented on May 28, 2024

Genius! I tried the above and it works. However, something funky is going on when I run calculate_table(single_simulation). There is one round of games left in the season, yet a number of simulated fixtures are not pulling through into the calculated table, as can be seen below when comparing the simulated table with the regular table.

I think this is a result of pulling in my own list of unplayed games, as I can't see anything else causing the issue. Have you seen anything like this before? The latest files are attached if you don't mind taking a look.

> calculate_table(data)
# A tibble: 12 x 10
   team                       w  d  l gp gf ga  gd points position
 1 Surbiton W                13  0  1 14 51  7  44     39        1
 2 Wimbledon W                9  5  1 15 25 12  13     32        2
 3 Hampstead & Westminster W  8  4  3 15 40 18  22     28        3
 4 East Grinstead W           7  4  3 14 39 22  17     25        4
 5 Beeston W                  6  3  6 15 22 31  -9     21        5
 6 Nottingham W               5  3  7 15 37 46  -9     18        6
 7 Loughborough W             4  5  6 15 22 30  -8     17        7
 8 Clifton Robinsons W        4  4  7 15 22 28  -6     16        8
 9 Birmingham W               4  4  7 15 19 29 -10     16        9
10 Reading W                  4  3  8 15 20 28  -8     15       10
11 Holcombe W                 2  5  8 15 22 37 -15     11       11
12 Buckingham W               3  0 12 15 15 46 -31      9       12

> calculate_table(single_simulation)
# A tibble: 12 x 10
   team                       w  d  l gp gf ga  gd points position
 1 Surbiton W                14  1  1 16 60  9  51     43        1
 2 Wimbledon W               10  5  1 16 28 13  15     35        2
 3 Hampstead & Westminster W  7  5  3 15 37 19  18     26        3
 4 East Grinstead W           7  4  5 16 40 32   8     25        4
 5 Beeston W                  7  3  6 16 29 31  -2     24        5
 6 Nottingham W               6  3  6 15 43 46  -3     21        6
 7 Loughborough W             4  5  6 15 22 31  -9     17        7
 8 Birmingham W               4  5  7 16 19 29 -10     17        8
 9 Clifton Robinsons W        4  4  7 15 21 30  -9     16        9
10 Reading W                  4  3  8 15 20 28  -8     15       10
11 Holcombe W                 2  4  9 15 23 39 -16     10       11
12 Buckingham W               3  0 13 16 17 52 -35      9       12
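One way to narrow this down, offered as a rough sketch rather than a diagnosis, is to tally how many fixtures each team appears in before the table is calculated, and to look for fixtures listed twice. The data frame and team names below are made up; only the home and away column names come from the code above, and nothing here calls regista.

```r
# Hypothetical fixture list standing in for played plus simulated games
games <- data.frame(
  home = c("Alpha", "Beta", "Gamma", "Alpha"),
  away = c("Beta", "Gamma", "Alpha", "Gamma"),
  stringsAsFactors = FALSE
)

# Games per team = appearances as home side + appearances as away side
games_per_team <- table(c(games$home, games$away))
games_per_team

# Fixtures that appear more than once with the same home and away teams
duplicate_fixtures <- games[duplicated(games[c("home", "away")]), ]
nrow(duplicate_fixtures)
```

Teams whose count falls short of the expected number of games point to the fixtures that were dropped or never simulated.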

HockeyFixtures.xlsx
UnplayedGames.xlsx
