Predicting J.League matches using particle swarm optimization and XGBoost
Inspired by the rating-feature learning method from the paper "Incorporating domain knowledge in machine learning for soccer outcome prediction" by Daniel Berrar, Philippe Lopes and Werner Dubitzky (Berrar 2017).
Our goal is to find the probabilities of a home team win, a draw, or an away team win. The three probabilities add up to 100%.
First we gather the information: we need match data containing the team names, the score, and the xG for both the home and away team. I get my data from Sporteria, but the earliest they started collecting xG data is 2019. I have data for all teams in the J1 and J2 League 🇯🇵 since the 2019 season. There are currently 63 teams in my model and over 4,000 matches as of the 2023 season. When teams are relegated from J2 to J3, their data is not deleted from the model, but since they no longer play matches in J1 or J2, their ratings/features stop updating. When teams are promoted from J3, their match data is added as the season progresses, so their ratings may be inaccurate for their first few games.
Table 1. Information required
Date | Home | Goals | xG | Away | Goals | xG |
---|---|---|---|---|---|---|
2/22/19 | Cerezo Osaka | 1 | 1.208 | Vissel Kobe | 0 | 1.299 |
2/23/19 | Sagan Tosu | 0 | 0.845 | Grampus | 4 | 1.777 |
2/23/19 | Vegalta Sendai | 0 | 0.479 | Urawa Reds | 0 | 0.922 |
First, we define four quantitative features that capture a team’s performance rating in terms of its ability to score goals and inability to prevent goals at both the home and away venues, respectively:
- Home attacking strength reflects a team’s ability to score goals at its home venue—the higher the value, the higher the strength.
- Home defensive weakness reflects a team’s inability to prevent goals by the opponent at its home venue—the higher the value, the higher the weakness.
- Away attacking strength reflects a team’s ability to score goals at the opponent’s venue— the higher the value, the higher the strength.
- Away defensive weakness reflects a team’s inability to prevent goals by the opponent at the opponent’s venue—the higher the value, the higher the weakness.
Based on these four performance rating features (per team), Eqs. 1 and 2 define a goal-prediction model that predicts the goals scored by the home and away team, respectively.
$$\text{Eq. 1: } \hat{g}_h(H_{hatt},A_{adef})= \frac{\alpha_h}{1+\exp(-\beta_h(H_{hatt}+A_{adef})-\gamma_h)}$$
$$\text{Eq. 2: } \hat{g}_a(A_{aatt},H_{hdef})= \frac{\alpha_a}{1+\exp(-\beta_a(A_{aatt}+H_{hdef})-\gamma_a)}$$
where $\hat{g}_h$ and $\hat{g}_a$ are the predicted goals for the home and away team, and $\alpha$, $\beta$ and $\gamma$ are shape parameters of the logistic curve ($\alpha$ caps the maximum number of predicted goals) that are found by the optimization described below.
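A minimal sketch of Eqs. 1 and 2 in Python. The parameter values below are illustrative placeholders, not the fitted ones; the rating inputs are taken from Table 2.

```python
import math

def predict_goals(att, opp_def, alpha, beta, gamma):
    """Bounded logistic goal-prediction model (Eqs. 1 and 2):
    predicted goals rise with the team's attacking rating plus the
    opponent's defensive weakness, and are capped at alpha."""
    return alpha / (1.0 + math.exp(-beta * (att + opp_def) - gamma))

# Illustrative placeholder parameters (the real values come from the
# optimization described below, not from this sketch):
alpha_h, beta_h, gamma_h = 4.0, 0.3, 0.0

# Cerezo Osaka's home attack vs. Vissel Kobe's away defensive weakness:
g_hat_home = predict_goals(0.8606843, -10.148069, alpha_h, beta_h, gamma_h)
print(round(g_hat_home, 3))
```

Because the logistic is bounded between 0 and $\alpha_h$, the prediction can never explode no matter how extreme the ratings get.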
After every match, each team's performance ratings are updated, depending on whether they were the home or away team, using Eqs. 3-6. Those updated ratings are then used with Eqs. 1 and 2 to predict the goals for the team's next scheduled match.
Table 2: Performance Rating
Team | H_hatt | H_hdef | A_aatt | A_adef |
---|---|---|---|---|
Cerezo Osaka | 0.8606843 | -7.532469 | -1.898231 | -10.148069 |
Sagan Tosu | -3.6531719 | -1.244026 | -6.971303 | -4.820117 |
Vegalta Sendai | -1.3261847 | -1.327007 | 4.644055 | -5.945538 |
Equations 1-6 require many parameters beyond the information given in Table 1. At the start of my model, every team's home/away attacking and defensive ratings were initialized to zero. To calculate the missing parameters we use the individual goal-prediction error of each match, i.e. the squared difference between predicted and observed goals, $e = (\hat{g}_h - g_h)^2 + (\hat{g}_a - g_a)^2$.
Using particle swarm optimization (PSO), we find the 11 missing parameters by adjusting them until the total goal-prediction error over all historical matches is minimized.
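Here is a toy illustration of how PSO searches a parameter vector. The objective below is a simple stand-in function with a known minimum; in the real model it would be the summed goal-prediction error over all historical matches.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_minimize(objective, dim, n_particles=30, iters=200,
                 w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0)):
    """Bare-bones particle swarm optimizer: each particle remembers its
    personal best position, and the swarm shares a global best."""
    lo, hi = bounds
    pos = rng.uniform(lo, hi, (n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pbest_val = np.array([objective(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # Inertia + pull toward personal best + pull toward global best.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Stand-in objective with an 11-dimensional parameter vector, matching
# the 11 missing model parameters; minimum at all-ones.
best, best_err = pso_minimize(lambda p: np.sum((p - 1.0) ** 2), dim=11)
print(best_err)
```

PSO needs no gradients, which is why it works here: the rating-update recursion makes the true objective awkward to differentiate.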
After running our PSO code over all of the historical matches, we have a table with each team's pre-match ratings for every match alongside the match outcome:
home_H_hatt_mix | home_H_hdef_mix | home_A_aatt_mix | home_A_adef_mix | away_H_hatt_mix | away_H_hdef_mix | away_A_aatt_mix | away_A_adef_mix |
---|---|---|---|---|---|---|---|
-0.51819662 | -0.2692562 | -0.2124204 | -0.59226160 | -0.51819662 | -0.2692562 | -0.2124204 | -0.59226160 |
-1.47981522 | 1.3623530 | 1.0747815 | -1.69132273 | -1.47981522 | 1.3623530 | 1.0747815 | -1.69132273 |
-2.17377591 | -0.8524936 | -0.6725455 | -2.48447006 | -2.17377591 | -0.8524936 | -0.6725455 | -2.48447006 |
0.00891013 | -0.4438178 | -0.3501348 | 0.01018364 | 0.00891013 | -0.4438178 | -0.3501348 | 0.01018364 |
-0.49386014 | -1.2284265 | -0.9691248 | -0.56444675 | -0.49386014 | -1.2284265 | -0.9691248 | -0.56444675 |
-0.43666572 | -0.3803888 | -0.3000946 | -0.49907762 | -0.43666572 | -0.3803888 | -0.3000946 | -0.49907762 |
These 8 home/away attack and defense ratings for each team will be our input features for XGBoost.
Our machine learning model will gradient boost and attempt to minimize the error when classifying results into home win, draw, or away win. XGBoost only reads and writes numerical vectors, so the 8 features are turned into a matrix and split randomly 80%/20% into training and test sets. The accompanying score lines used to update the home and away teams' attack/defense ratings are turned into a single column, with 0, 1, and 2 representing home win, draw, and away win. This column holds the target values that XGBoost will try to predict in the training and test sets.
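A sketch of that data preparation using only numpy. The synthetic ratings and the 0/1/2 label mapping are my assumptions for illustration; the real features come from the PSO ratings table.

```python
import numpy as np

rng = np.random.default_rng(42)

# Each row: the 8 pre-match ratings (4 home-team, 4 away-team features).
X = rng.normal(size=(4000, 8))

# Encode the full-time result as a single target column.
# Assumed mapping: 0 = home win, 1 = draw, 2 = away win.
home_goals = rng.integers(0, 5, 4000)
away_goals = rng.integers(0, 5, 4000)
y = np.where(home_goals > away_goals, 0,
             np.where(home_goals == away_goals, 1, 2))

# Random 80/20 train/test split.
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
train_idx, test_idx = idx[:cut], idx[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# These arrays can then be wrapped in xgboost.DMatrix objects,
# e.g. dtrain = xgb.DMatrix(X_train, label=y_train), for training.
print(X_train.shape, X_test.shape)
```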
I fine-tuned my XGBoost parameters and found:
- Feature subsampling = 1.0
- Learning rate = 0.025
- Maximum tree depth = 5
- Training set subsample = 0.67
I saw that with a max tree depth of 5, the error bottomed out at around 100 boosting rounds, so I set my total rounds to 150.
As you plot the error against a range of the parameter you are trying to fine-tune, you should see a parabolic or convex curve with a clear minimum; pick the value where the error is lowest. This is how you fine-tune your parameters.
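That tuning procedure amounts to a simple scan: evaluate the validation error at each candidate value and take the minimum. The error function here is a synthetic convex stand-in; a real run would train the model at each candidate setting instead.

```python
import numpy as np

# Candidate learning rates to scan.
candidates = np.array([0.005, 0.01, 0.025, 0.05, 0.1, 0.2])

# Stand-in for "train the model and measure validation error at this
# learning rate"; shaped to be convex on a log scale for illustration.
def validation_error(lr):
    return (np.log10(lr) - np.log10(0.025)) ** 2 + 0.5

errors = np.array([validation_error(lr) for lr in candidates])
best_lr = candidates[errors.argmin()]
print(best_lr)  # the value at the bottom of the convex error curve
```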
The error in our case is the Ranked Probability Score (RPS), which we minimize:

$$\text{RPS} = \frac{1}{N-1}\sum_{i=1}^{N-1}\left(\sum_{j=1}^{i}(p_j - o_j)\right)^2$$

Where:
- $N$ is the total number of outcomes.
- $p_{j}$ is the predicted probability for outcome $j$.
- $o_{j}$ is the observed outcome (1 if $j$ is the true outcome, 0 otherwise).
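The RPS can be computed directly from its definition; a sketch for a single match:

```python
def ranked_probability_score(probs, outcome):
    """RPS for one match: mean squared difference between the
    cumulative predicted and cumulative observed distributions.
    probs   -- predicted probabilities over the N ordered outcomes
               (here N = 3: home win, draw, away win)
    outcome -- index of the outcome that actually occurred
    """
    n = len(probs)
    cum_p, cum_o, total = 0.0, 0.0, 0.0
    for j in range(n - 1):  # the final cumulative term is always zero
        cum_p += probs[j]
        cum_o += 1.0 if j == outcome else 0.0
        total += (cum_p - cum_o) ** 2
    return total / (n - 1)

# A confident, correct forecast scores near 0; a flat forecast scores worse.
print(ranked_probability_score([0.8, 0.15, 0.05], 0))
print(ranked_probability_score([1/3, 1/3, 1/3], 0))
```

Because RPS compares cumulative distributions, it rewards forecasts that put probability near the true outcome in the home/draw/away ordering, not just on it.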
The code for this is in the Excel files.
There aren't many published soccer predictions beyond gambling odds. FiveThirtyEight used to publish them (they stopped in June 2023), so I have compared my predictions with theirs. They use a Poisson regression, which has been explained quite well by Opisthokonta and StatsandSnakeOil.
One downside of Poisson is choosing the time period over which you evaluate a team's performance. Without PSO you have to set the time weighting yourself or use the entire period of your data (a time-weight example). The PSO tries to optimize the weighting of historical performance itself, but I have seen that it is slower than Poisson even as I tried to maximize its speed.
But the main downside of Poisson is that draws are undervalued. Dixon-Coles attempts to fix this, but if you look at FiveThirtyEight's J.League predictions, the probability of a tie is always lower than in my model.
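To see why a plain Poisson model tends to undervalue draws, you can compute the draw probability it implies directly: under independent Poisson goal counts, a draw is the sum over all level scorelines. The expected-goals values below are illustrative.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k goals given expected goals lam."""
    return lam ** k * exp(-lam) / factorial(k)

def draw_probability(lam_home, lam_away, max_goals=10):
    """P(draw) under independent Poisson goal counts:
    sum of P(home = k) * P(away = k) over scorelines 0-0, 1-1, ..."""
    return sum(poisson_pmf(k, lam_home) * poisson_pmf(k, lam_away)
               for k in range(max_goals + 1))

# Illustrative expected goals for an evenly matched fixture.
p_draw = draw_probability(1.3, 1.3)
print(round(p_draw, 3))
```

Even for a perfectly even matchup this lands near one in four, which is why models that treat home and away goals as independent Poissons often quote lower draw probabilities than approaches that model the draw explicitly.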
I found the approach to predicting soccer outcomes from Berrar 2017 unique and interesting. I'll continue to update the predictions to compare the performance against a Poisson model. I also hope you watch more J.League soccer, as it's one of the more entertaining leagues to watch: more parity than the Big 5 leagues, and more tactical skill than MLS.