alan-turing-institute / airsenal

254.0 23.0 83.0 69.64 MB

Machine learning Fantasy Premier League team

License: MIT License

Python 2.52% Stan 0.01% Jupyter Notebook 97.47% Dockerfile 0.01%

hut23 hut23-222 hacktoberfest

airsenal's People

Contributors

Stargazers

Watchers

Forkers

barbourians mpcoombes tallamjr izabelo waseemyusuf stevenfranks radka-j atlonxp mridullpandey keshabb shashanksub42 danhounshell gauravdiwan89 chiefsan mosayed abdallahmetwalli aakashb95 robwhickman carloschavez rwilliams-r7 georgewhewell kaushalvivek pratyushpal rajeshtailor1 saba83rish senjudev abelarm lon3wo7f indrark nickhill97 biggins callistusndemo kreshnaa-raam goraniliev dazaman fandango96 richardthomasdev jpkfin tdarnell atafoa 9401adarsh sreyan-ghosh chahak13 shuyi1981 rohaniyerr arifpras kish-00 tobi-ace rinks2511 subash774 siddheshmhatre roarou nk173 gc-p agmadkafaar a1fus tungom nathan962 mjailani90 ryan205 basscoder2808 gitsim02 ssmichae elvisma0305 donald122 sergionahas dofften treebearded pooyaakm erfan55 thien eddableheath nivaassudhan iansealy rohanday3 brimell hglong16 hozeren foreztgump dovinhanhnguyen frameee17 6076paras

airsenal's Issues

Use exhaustive search for 2 transfer strategies

At the moment, the 2 transfer strategies are investigated using a random process. @nbarlowATI reckons that the search space is sufficiently small that we can use brute force instead.
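
A minimal sketch of what a brute-force search over 2-transfer moves could look like. The data layout (dicts mapping a player name to a (position, price, predicted points) tuple) and the function name are assumptions made for illustration, not the AIrsenal API, and the three-players-per-club limit is ignored to keep it short.

```python
from itertools import combinations


def best_two_transfers(squad, pool, bank=0.0):
    """Try every pair of outgoing players against every pair of incoming players.

    squad and pool map player name -> (position, price, predicted_points).
    Returns (best_gain, (out_pair, in_pair)), or (0.0, None) if nothing improves.
    """
    best_gain, best_swap = 0.0, None
    candidates = [name for name in pool if name not in squad]
    for out_pair in combinations(squad, 2):
        budget = bank + sum(squad[p][1] for p in out_pair)
        out_positions = sorted(squad[p][0] for p in out_pair)
        out_points = sum(squad[p][2] for p in out_pair)
        for in_pair in combinations(candidates, 2):
            # Incoming players must fill exactly the positions being vacated.
            if sorted(pool[p][0] for p in in_pair) != out_positions:
                continue
            # The pair must be affordable with the money freed up.
            if sum(pool[p][1] for p in in_pair) > budget:
                continue
            gain = sum(pool[p][2] for p in in_pair) - out_points
            if gain > best_gain:
                best_gain, best_swap = gain, (out_pair, in_pair)
    return best_gain, best_swap
```

With 15 squad members there are only 105 outgoing pairs, so the cost is dominated by the number of candidate pairs per position combination, which is what makes exhaustive search plausible here.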

Produce app where people can input their team and get transfer recommendations

A cool potential feature would be a webpage where people can input their teams and get recommendations for the next week's transfers.

Initial Stan model

First Stan model of attacking and defending (alphas and betas) per team.
Train using historical data.

Revisit use of "method" or "tag" to identify predictions and perform optimizations

This variable is intended to allow the optimizer to retrieve a consistent set of predicted points from the DB.
Currently a lot of places give this argument a default value, so there is some risk that we are not using the predictions we think we are in all places.

Suggested change is to have fill_predictedscore_table generate a UUID (or timestamp) to put into this field in the DB - same value for all rows in a given run of this script.
The fill_transfersuggestion_table will query the predicted_score table, and look at the last row, and then use this value for the optimization.
Remove all default values of this argument from function definitions, to flush out cases where we are calling a function without explicitly specifying it.
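
A rough sketch of the suggested tagging scheme, using an in-memory SQLite table. The table layout and the helper names fill_predictions and latest_tag are hypothetical stand-ins, not the real fill_predictedscore_table or fill_transfersuggestion_table code.

```python
import sqlite3
import uuid


def fill_predictions(conn, predictions):
    """Insert all rows from one prediction run under a single fresh tag."""
    tag = str(uuid.uuid4())
    conn.executemany(
        "INSERT INTO predicted_score (player_id, gameweek, points, tag) "
        "VALUES (?, ?, ?, ?)",
        [(player_id, gw, pts, tag) for player_id, gw, pts in predictions],
    )
    conn.commit()
    return tag


def latest_tag(conn):
    """Return the tag on the most recently inserted row, or None if empty."""
    row = conn.execute(
        "SELECT tag FROM predicted_score ORDER BY id DESC LIMIT 1"
    ).fetchone()
    return row[0] if row else None


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE predicted_score ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "player_id INTEGER, gameweek INTEGER, points REAL, tag TEXT)"
)
first = fill_predictions(conn, [(1, 2, 5.19)])
second = fill_predictions(conn, [(1, 2, 4.80), (2, 2, 3.10)])
```

Note that neither function takes a default tag, which is the point of the last suggestion: a caller that forgets to pass one fails loudly.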

Parallelize transfer optimization

Use multiprocess pool to run iterations over different transfer strategies in parallel. Need to think about how to recombine at the end to find overall maximum.
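
One possible shape for this, sketched with the standard library. The scoring function is a made-up stand-in (each transfer is assumed to gain 3 points and each transfer beyond the first in a week costs a 4-point hit); only the map-then-max pattern is the point. Recombining is simple if each worker returns its score together with its strategy.

```python
from multiprocessing import Pool


def evaluate_strategy(strategy):
    """Stand-in scorer: strategy is a tuple of transfers per gameweek."""
    gain = sum(3 * n for n in strategy)               # assumed gain per transfer
    hits = sum(4 * max(0, n - 1) for n in strategy)   # 4-point hit beyond the first
    return gain - hits, strategy


def pick_best(results):
    """Recombine worker outputs: keep the (score, strategy) with the top score."""
    return max(results, key=lambda item: item[0])


if __name__ == "__main__":
    strategies = [(1, 0, 2), (1, 1, 1), (2, 2, 2)]
    with Pool(processes=2) as pool:
        results = pool.map(evaluate_strategy, strategies)
    print(pick_best(results))
```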

Add remaining scaffolding for python module

We don't have a setup.py and so on.

FPL API URLs Have Changed

See branch: fix/fpl_api_urls_2019

framework/data_fetcher.py was giving me errors (empty data) as some of the FPL API URLs have changed. Plus trailing forward slash seems to be important, e.g. https://fantasy.premierleague.com/api/bootstrap-static/ returns data but https://fantasy.premierleague.com/api/bootstrap-static doesn't.

This set of URLs seems to work (where {} are filled by python .format calls, usually with a team, player or league id):

FPL_SUMMARY_API_URL = "https://fantasy.premierleague.com/api/bootstrap-static/"
FPL_DETAIL_URL = "https://fantasy.premierleague.com/api/element-summary/{}/"
FPL_HISTORY_URL = "https://fantasy.premierleague.com/api/entry/{}/history/"
FPL_TEAM_URL = "https://fantasy.premierleague.com/api/entry/{}/event/{}/picks/"
FPL_TEAM_TRANSFER_URL = "https://fantasy.premierleague.com/api/entry/{}/transfers/"
FPL_LEAGUE_URL = "https://fantasy.premierleague.com/api/leagues-classic/{}/standings/".format(self.FPL_LEAGUE_ID)
FPL_FIXTURE_URL = "https://fantasy.premierleague.com/api/fixtures/"

However, getting league standings now needs authentication - hint here: https://www.reddit.com/r/FantasyPL/comments/c64rrx/fpl_api_url_has_been_changed/ewj4ofd/

I've updated the URLs and implemented the authentication for league standings (means FPL login and password are now needed as env variables for that functionality).
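
Since the trailing slash turned out to matter, a small helper along these lines might guard against it being dropped. build_url is a hypothetical name, not something in data_fetcher.py; the two templates are copied from the list above.

```python
FPL_DETAIL_URL = "https://fantasy.premierleague.com/api/element-summary/{}/"
FPL_TEAM_URL = "https://fantasy.premierleague.com/api/entry/{}/event/{}/picks/"


def build_url(template, *ids):
    """Fill in the ids and make sure the result ends with a forward slash."""
    url = template.format(*ids)
    return url if url.endswith("/") else url + "/"
```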

Expected points for players that didn't play GW1

Players that didn't play in gameweek 1 (it seems - might be another reason) appear to be getting an expected points total for gameweek 2 that assumes they will play, followed by expected points of 0 for gameweeks 3 and 4.

For example Lacazette, who didn't play vs. Newcastle in gameweek 1:

Getting points prediction for player Alexandre Lacazette
gameweek: 2 vs BUR home? True
Expected points: 5.19
gameweek: 3 vs LIV home? False
Expected points: 0.00
gameweek: 4 vs TOT home? True
Expected points: 0.00

Make scripts into entry points

More entrypoints

In addition to setup_airsenal_db, we should have executables to:

update DB (new results+playerscores)
run predictions
run optimization
make plots of mini-league points etc.

Scripts to dump DB contents to CSV / JSON

It was more painful than it should have been to get the 18/19 season data ready for the start of the 19/20 season.
For next season we should have scripts ready that output the contents of the DB to the player_summary_YYYY.json, player_detail_YYYY.json and results_YYYY_with_gw.csv files, in the expected formats.
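
A minimal sketch of generic dump helpers using only the standard library. The table name, columns and the example row are invented for illustration; the real scripts would need to reproduce the exact column order and naming of the existing files.

```python
import csv
import io
import json
import sqlite3


def dump_table_csv(conn, table, out):
    """Write every row of a table to a file-like object as CSV with a header."""
    # The table name is interpolated directly, so it must come from trusted code.
    cursor = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)


def dump_table_json(conn, table):
    """Return every row of a table as a JSON list of column-name -> value dicts."""
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [col[0] for col in cursor.description]
    return json.dumps([dict(zip(columns, row)) for row in cursor])


# Hypothetical table and made-up row, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE result (home_team TEXT, away_team TEXT, "
    "home_score INTEGER, away_score INTEGER, gameweek INTEGER)"
)
conn.execute("INSERT INTO result VALUES ('ARS', 'BUR', 2, 1, 2)")
buffer = io.StringIO()
dump_table_csv(conn, "result", buffer)
```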

Parse 'information' string from FPL API to get estimated return dates for injured players

Factor result into expected minutes prediction.

Crash when printing team after free hit

In optimization, after playing free hit for next gw and making 14 transfers, get crash when printing the optimum team:

Traceback (most recent call last):
  File "/Users/nbarlow/anaconda3/bin/run_airsenal_optimization", line 10, in <module>
    sys.exit(main())
  File "/Users/nbarlow/anaconda3/lib/python3.6/site-packages/airsenal/scripts/fill_transfersuggestion_table.py", line 279, in main
    print_team_for_next_gw(best_strategy)
  File "/Users/nbarlow/anaconda3/lib/python3.6/site-packages/airsenal/scripts/fill_transfersuggestion_table.py", line 145, in print_team_for_next_gw
    expected_points = t.get_expected_points(next_gw,tag)
  File "/Users/nbarlow/anaconda3/lib/python3.6/site-packages/airsenal/framework/team.py", line 259, in get_expected_points
    raise RuntimeError("Team is incomplete")
RuntimeError: Team is incomplete

Why is the team incomplete?

Crash when running predictions possibly due to rescheduled fixtures

When running predictions, without remaking the database from scratch (i.e. running update_airsenal_db after the previous round of fixtures), get a crash when filling the dataframe used for generating predictions.

Traceback:

File "/Users/nbarlow/anaconda3/lib/python3.6/site-packages/airsenal/scripts/fill_predictedscore_table.py", line 39, in make_predictedscore_table
    prediction_dict = calc_all_predicted_points(gw_range, season, tag,  session)
  File "/Users/nbarlow/anaconda3/lib/python3.6/site-packages/airsenal/scripts/fill_predictedscore_table.py", line 23, in calc_all_predicted_points
    model_team, df_player = get_fitted_models(session)
  File "/Users/nbarlow/anaconda3/lib/python3.6/site-packages/airsenal/framework/prediction_utils.py", line 221, in get_fitted_models
    df_team = get_result_df(session)
  File "/Users/nbarlow/anaconda3/lib/python3.6/site-packages/airsenal/framework/bpl_interface.py", line 29, in get_result_df
    for s in session.query(Result).all()
  File "/Users/nbarlow/anaconda3/lib/python3.6/site-packages/airsenal/framework/bpl_interface.py", line 29, in <listcomp>
    for s in session.query(Result).all()
AttributeError: 'NoneType' object has no attribute 'date'

Essentially, after doing update_airsenal_database we are left with some "Result"s that do not have "Fixture"s.
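
One way to spot the orphaned rows before they reach the model code is a LEFT JOIN that keeps only results with no matching fixture. The two-table schema below is a simplified assumption, not the actual AIrsenal schema.

```python
import sqlite3

# Simplified, hypothetical schema: result 11 points at a fixture that is missing.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fixture (fixture_id INTEGER PRIMARY KEY, date TEXT);
    CREATE TABLE result (result_id INTEGER PRIMARY KEY, fixture_id INTEGER);
    INSERT INTO fixture VALUES (1, '2019-08-10');
    INSERT INTO result VALUES (10, 1);
    INSERT INTO result VALUES (11, 2);
    """
)


def orphan_results(conn):
    """Return ids of results whose fixture_id matches no row in fixture."""
    rows = conn.execute(
        "SELECT r.result_id FROM result r "
        "LEFT JOIN fixture f ON r.fixture_id = f.fixture_id "
        "WHERE f.fixture_id IS NULL ORDER BY r.result_id"
    )
    return [row[0] for row in rows]
```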

Use relationships/ foreign keys in database tables

i.e. effectively "joins" - avoid duplicating information, e.g. between matches and fixtures, and speed up the search for a player score - no need to explicitly search by match_id.

Nonsensical "players out" when printing optimization output with free hit

When running optimization, where best outcome was from playing free hit in the next gw, got the following output:

=========== Gameweek 37 ================

Cards played:  F

Players in:			Players out:
-----------			------------
Sergio Agüero			Asmir Begovic
Raúl Jiménez			Trent Alexander-Arnold
Paul Pogba			Marcos Alonso
Jan Bednarek			César Azpilicueta
Ryan Bertrand			Aaron Wan-Bissaka
Heung-Min Son			Heung-Min Son
Chris Smalling			Mohamed Salah
Lucas Rodrigues Moura da Silva		Lucas Rodrigues Moura da Silva
Aaron Wan-Bissaka			Lys Mousset
Jason Steele			Roberto Firmino
Andre Gray			Joshua King
Eden Hazard			Michel Vorm
Aymeric Laporte			David Silva
Ilkay Gündogan			Cesc Fàbregas
David de Gea			Martin Kelly

However, the "players out" are not the players in the current team! They must be from a previous iteration of the make_new_team random process.

Bokeh "dashboard" to quickly view team and player predictions and scores.

Could e.g. select a fixture from drop-down, and view predicted score probabilities, select one or more players and view predicted points over next few fixtures, and/or scores over previous fixtures.

Produce plots and validations of fitted team and player models.

For example, plot alpha vs beta for the team level model, as was done for the 2017/18 season, and p(score) vs p(assist) for FWD, MID, DEF.

Potentially produce a dashboard, where one can see history of a chosen player or team.

Scripts to perform sanity checks on input data

For start of 2019/20 season we had a hard-to-debug crash when initializing the player-level Stan model, which was traced back to an inconsistency in the player and match data for 2018/19 (for one playerscore row, nTeamGoals - numGoals - numAssists was -1) - this was eventually traced back to an incorrect gameweek in results_1819_with_gw.csv.

But we should be able to debug this much faster - we should check for things like this in the DB after every upload.
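
A sketch of the specific check described above: for each player score row, the number of team goals the player was not involved in must not be negative. The row format (plain dicts with these keys) is an assumption for illustration.

```python
def find_inconsistent_scores(rows):
    """Return (player, gameweek, uninvolved) for rows where a player has more
    goals plus assists than their team scored in that match."""
    problems = []
    for row in rows:
        uninvolved = row["team_goals"] - row["goals"] - row["assists"]
        if uninvolved < 0:
            problems.append((row["player"], row["gameweek"], uninvolved))
    return problems
```

Run after every upload, a non-empty return value would point straight at the offending player and gameweek rather than surfacing later as a Stan initialization failure.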

Enable predictions/ team selection to work on past seasons

From first look will involve:

add "season" column to "player", "fixture", "player_prediction", "transaction", and "transfer_suggestion" tables.
refactor more db-filling code out of scripts and into library code ("history_utils.py"?) with "season" as a potential argument (but default to "1819").
in db-filling code - if season is not "1819", use CSV rather than API inputs for player lists, fixtures, match results etc. - will need to use player-level data to know what matches were in what gameweeks.

Automate filling new data

Cron job or AWS lambda function should fill match and playerscore tables after each gameweek.

Flag to control verbosity of functions used in team optimization

e.g. by default turn off "Cannot afford player xxx" and "Best formation is yyy" messages.

add Transaction table, and fill with players bought and sold so far.

schema:
player_id, gameweek, bought_or_sold

we can then write a function that combines this information with the week-by-week player prices, and gives us an accurate budget, taking into account player price changes.

(If a player goes up in value while we own them, we only get half the price increase when we sell, so it's not sufficient just to use current player prices to get our budget).
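
The budget calculation could be sketched as follows. Prices are held as integers in tenths of a million to avoid float rounding, and the half-increase is assumed to round down; both function names are hypothetical.

```python
def selling_price(purchase_price, current_price):
    """Price received on sale, in tenths of a million.

    Only half of any rise since purchase is kept (assumed rounded down);
    a fall in price is passed on in full.
    """
    if current_price <= purchase_price:
        return current_price
    return purchase_price + (current_price - purchase_price) // 2


def available_budget(bank, squad):
    """bank plus the selling price of every (purchase_price, current_price) pair."""
    return bank + sum(selling_price(bought, now) for bought, now in squad)
```

For example, a player bought at 10.0 (100) and now worth 10.3 (103) would sell for 10.1 (101) under these assumptions.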

Parallelize prediction calculation

Can at the very least do GK, DEF, MID, FWD in different threads. Just need to make sure we have the same "tag" for all of them.

AWS - lambda to optimize transfer strategy

run after prediction-calculating lambda - run script to try different transfer strategies and update transfer_suggestions table.

This is the most compute-intensive part, may want to see if we can scale out.

Import bug

When I try to run the notebook sandbox/modelling_test.ipynb, I get an error at the line

from framework.utils import *

that is as follows:

NameError                                 Traceback (most recent call last)
<ipython-input-2-d0dbd8c32518> in <module>()
      9 import seaborn as sns
     10 
---> 11 from framework.utils import *
     12 
     13 np.random.seed(42)

~/Projects/AIrsenal/framework/utils.py in <module>()
     10 from .mappings import alternative_team_names, alternative_player_names
     11 
---> 12 from .data_fetcher import FPLDataFetcher, MatchDataFetcher
     13 from .schema import (
     14     Base,

~/Projects/AIrsenal/framework/data_fetcher.py in <module>()
     47 FPL_TEAM_URL = "https://fantasy.premierleague.com/drf/entry/{}/event/{}/picks"
     48 FPL_LEAGUE_URL = "https://fantasy.premierleague.com/drf/leagues-classic-standings/{}?phase=1&le-page=1&ls-page=1".format(
---> 49     LEAGUE_ID
     50 )
     51 DATA_DIR = "./data"

NameError: name 'LEAGUE_ID' is not defined

Implement player-level forecasts

The team-level model produces probabilities for the scoreline between different teams. To get expected defensive points, we just need to compute the probability of a clean sheet using the team-level model and then multiply this by the number of points a given player will receive for a clean sheet.
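
As a sketch of the defensive calculation, assume the team-level model's output is available as a dict mapping (home_goals, away_goals) to probability. That representation and the function names are assumptions; the default of 4 points is the clean-sheet award for goalkeepers and defenders.

```python
def clean_sheet_probability(scoreline_probs, home=True):
    """Sum the probability of every scoreline in which the opponent scores 0."""
    return sum(
        prob
        for (home_goals, away_goals), prob in scoreline_probs.items()
        if (away_goals if home else home_goals) == 0
    )


def expected_clean_sheet_points(scoreline_probs, home=True, points_per_clean_sheet=4):
    """Expected defensive points from clean sheets alone."""
    return points_per_clean_sheet * clean_sheet_probability(scoreline_probs, home)
```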

For attacking points, a simple approach can be as follows. We learn three numbers per player from historical data:

Pr(score) Pr(assist) and Pr(not involved)

where these are the probabilities of the three possible outcomes for an individual player given that their team has scored a goal. The distribution of n_score, n_assist and n_not_involved (for a given player) is then multinomial given the total number of goals scored by the team. We can use this to compute

Pr(attacking points | goals scored by team)

so that

Pr(attacking points) = sum_{goals scored by team} Pr(attacking points | goals scored by team) * Pr(goals scored by team)

where Pr(goals scored by team) is computed using the team-level model. Using this distribution, we can then compute the expected number of attacking points.
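
The sum above can be sketched directly by enumerating the multinomial outcomes for each possible team goal count. The function names and the input format (a dict of goals -> probability) are assumptions, and the default point values (4 per goal, 3 per assist) correspond to a forward.

```python
from math import factorial


def multinomial_pmf(n_score, n_assist, n_none, p_score, p_assist):
    """Probability of a given (scored, assisted, not involved) split of team goals."""
    p_none = 1.0 - p_score - p_assist
    coefficient = factorial(n_score + n_assist + n_none) // (
        factorial(n_score) * factorial(n_assist) * factorial(n_none)
    )
    return coefficient * p_score**n_score * p_assist**n_assist * p_none**n_none


def expected_attacking_points(p_score, p_assist, team_goal_probs,
                              pts_goal=4, pts_assist=3):
    """Sum over team goal counts and over every split of those goals."""
    total = 0.0
    for goals, p_goals in team_goal_probs.items():
        for n_score in range(goals + 1):
            for n_assist in range(goals - n_score + 1):
                n_none = goals - n_score - n_assist
                p_split = multinomial_pmf(n_score, n_assist, n_none, p_score, p_assist)
                total += p_goals * p_split * (pts_goal * n_score + pts_assist * n_assist)
    return total
```

Because the expectation is linear, the result should agree with E[team goals] * (Pr(score) * pts_goal + Pr(assist) * pts_assist), which is a handy cross-check; the explicit enumeration is still useful if the full points distribution is wanted.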

AWS - lambda to update database

Database is on sqlite file stored on S3 bucket.
Can run lambda on a cron-type daily schedule to see if any new matches have been played, and if so, update the db.

Improve estimation of defensive points

Currently we ignore details such as a player having been subbed off when a goal was scored, etc.

Write team optimization function

Given a starting team, should query the player_prediction table, and look N gameweeks ahead and choose best substitutions, with constraint of no more than one 4-point-hit per gameweek.

Unable to install on Linux

When attempting to install bpl the following error occurs (the output is captured by using pip install -vvv .. option)

    *** Error compiling '/tmp/pip-install-22y8ywmq/pystan/pystan/stan/lib/stan_math/lib/boost_1.69.0/status/boost_check_library.py'...
      File "/tmp/pip-install-22y8ywmq/pystan/pystan/stan/lib/stan_math/lib/boost_1.69.0/status/boost_check_library.py", line 166
        print ">>> cwd: %s"%(os.getcwd())
                          ^
    SyntaxError: invalid syntax

System Information:

Operating System: [GNU/Linux 3.10.0-693.5.2.el7.x86_64]
PyStan Version: [2.18.0.0]
GCC Version [7.1.0]

Does this occur for anyone else?

This seems to be an unresolved issue at https://github.com/stan-dev/pystan/issues/584 relating to PyStan

Create a "Teams" table in the DB.

As part of the effort to make AIrsenal easy to run on both past seasons and the current season, it would be good to have a simple table listing what teams were in the premier league for each season. I.e. just two columns:
name, season
"ARS", "1516"
"AVL", "1516"
...
"ARS","1920"
...

Alexa skill

If we get sqlite db onto an S3 bucket, lambda to auto-fill scores, lambda to run predictions, should be easy to get Alexa skill to report on latest status.
For some details about our team, need to find a way to authenticate with FPL API.

Get historical data

Checking if a player is unavailable

This feature doesn't seem to be working currently.

Compile and pickle player-level stan model on install

Estimating time spent on the pitch

This is hard. Some thoughts:

We obviously need to check if they are "red" on the FPL website. If so, we can check when they are likely to return (if players are suspended, there is metadata that says when their ban is over).
To forecast how long they will be on the pitch, we should consider their recent history - how long to consider? And how to use that information?

Open to ideas about this - we need a p(T) for the model to work (to get 60+ mins point, but also to estimate contribution to goals etc).

Lots more tests

Dummy match and player models with e.g. 100% prob of 1-0 win, 50% prob of assist | team-goal etc. etc. and ensure that attacking and defending points are calculated correctly.
Dummy predicted points for players for single gameweek (e.g. 11 players get points 1-11, 4 players get zero) - check that subs and captain optimization works as expected.
Dummy predicted points for a few gameweeks ahead - check that transfer strategy optimization works as expected.

Getting "UNIQUE constraint failed: player.player_id" while running setup_airsenal_database

sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: player.player_id
[SQL: INSERT INTO player (player_id, name) VALUES (?, ?)]
[parameters: ((1, 'Shkodran Mustafi'), (2, 'Héctor Bellerín'), (3, 'Sead Kolasinac'), (4, 'Ainsley Maitland-Niles'), (5, 'Sokratis Papastathopoulos'), (6, 'Nacho Monreal'), (7, 'Laurent Koscielny'), (8, 'Konstantinos Mavropanos') ... displaying 10 of 529 total bound parameter sets ... (487, 'Patrick Cutrone'), (528, 'Pedro Lomba Neto'))]
(Background on this error at: http://sqlalche.me/e/gkpj)

Purchase price of players

We don't have the exact value of players when we purchased them, just the price at the end of the gameweek when we bought them.

AWS - lambda to run predictions

Run daily - if we are less than 24 hours before a gameweek deadline, run the script to fill the player_predictions table in the database.
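
The deadline test itself is a one-liner; a sketch, assuming timezone-aware datetimes and a hypothetical function name:

```python
from datetime import datetime, timedelta, timezone


def should_run_predictions(now, deadline, window=timedelta(hours=24)):
    """True only when the deadline is still ahead and less than `window` away."""
    return timedelta(0) < deadline - now < window
```

With a daily schedule and a 24-hour window, the condition should hold on exactly one run per gameweek, as long as the runs are spaced evenly.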

Bonus points

Find way of predicting bonus points

Improve estimation of time spent on the pitch

Currently we treat the last N games as i.i.d. draws from the pdf P(T). This is ok, but it will not be able to pick up that players are likely to play fewer minutes per game over e.g. the Christmas period. @nbarlowATI has suggested using previous seasons' data to investigate this effect.
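
For reference, the current i.i.d. treatment amounts to something like the following empirical estimate. This is a simplified illustration with hypothetical names, not the actual implementation.

```python
def prob_at_least(minutes_history, threshold=60, n_games=5):
    """Fraction of the last n_games in which the player reached `threshold` minutes.

    Each recent game is treated as an independent draw from the same P(T).
    """
    recent = minutes_history[-n_games:]
    if not recent:
        return 0.0
    return sum(minutes >= threshold for minutes in recent) / len(recent)
```

Nothing in this estimate depends on the calendar, which is why it cannot see a rotation-heavy period coming.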

Occasional crash in optimization with allow_free_hit

In two separate threads (i.e. evaluation of two strategies: "F0000" and "F0001"), got crash:

File "/Users/nbarlow/anaconda3/lib/python3.6/site-packages/airsenal/framework/team.py", line 237, in apply_formation
    if index < formation[i]:
TypeError: 'NoneType' object is not subscriptable

Caching partial results of optimization.

Currently every "strategy" is evaluated independently (in order to parallelize).
However, this is very wasteful - all strategies that begin with 2 transfers in the next gw will independently calculate the best transfers (which for 0,1,2 transfers is deterministic).
Would be better to cache the results as we go along, so we don't repeat, and instead have more of a tree structure.
Could do this by checking for the presence of JSON files in the /tmp/ directory with a certain identifier unique to this optimization run, and reading the players in/out from there.
It is not clear how this would then work with greater parallelization, e.g. on AWS at a later date.
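
Within a single process the caching idea can be sketched with functools.lru_cache. best_transfers below is a placeholder for the deterministic 0/1/2-transfer optimum, and the strategy strings are illustrative. An in-memory cache like this is not shared between worker processes, which is why something file-based (or a shared store) would be needed for the parallel case.

```python
from functools import lru_cache

calls = {"count": 0}  # counts real evaluations, to show the cache working


@lru_cache(maxsize=None)
def best_transfers(squad, n_transfers):
    """Placeholder for the deterministic optimum; squad must be hashable."""
    calls["count"] += 1
    # Stand-in result: pretend the first n players alphabetically are sold.
    return tuple(sorted(squad)[:n_transfers])


squad = frozenset({"Salah", "Kane", "Alisson"})
# Three strategies that all begin with 2 transfers in the next gameweek.
for strategy in ("2000", "2001", "2010"):
    first_step = best_transfers(squad, int(strategy[0]))
```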

Write initial framework

Simple python code that can pick a team obeying all constraints (price, correct numbers of defenders, midfielders, forwards, no more than three players per team), e.g. optimizing using total expected points.
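
A sketch of the constraint check such a framework would need. It assumes the standard FPL squad shape (2 GK, 5 DEF, 5 MID, 3 FWD) and a budget of 100.0, held here as 1000 tenths of a million; the tuple layout and function name are made up for illustration.

```python
from collections import Counter

REQUIRED_POSITIONS = {"GK": 2, "DEF": 5, "MID": 5, "FWD": 3}


def is_valid_squad(players, budget=1000, max_per_club=3):
    """players is a list of (name, position, club, price_in_tenths) tuples."""
    positions = Counter(position for _, position, _, _ in players)
    clubs = Counter(club for _, _, club, _ in players)
    total_price = sum(price for _, _, _, price in players)
    return (
        dict(positions) == REQUIRED_POSITIONS
        and max(clubs.values(), default=0) <= max_per_club
        and total_price <= budget
    )
```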

Fix tests

Some tests appear to depend on the database existing. It is not created upon install, so they fail on Travis.

Investigate MC tree search for optimization

e.g. https://github.com/smallnamespace/pymcts

Wildcard strategy

Currently there isn't one. From our brief discussion about it, we could:

At each gameweek, consider the "wildcard strategy" where the entire team is changed. The steps would be something like:

Compute the best expected points 10 weeks into the future when the wildcard is played.
For the remaining gameweeks until the current wildcard expires (until Christmas, or until the end of the season if after Jan 1st [check this is the right date]), repeat the same calculation.
If playing the wildcard now gives the best expected points return, then play it.

This is obviously going to dominate the computational cost (need to forecast for many more weeks than before, plus running the optimizer many times where unlimited transfers are allowed). A simpler strategy could be a deterministic rule: the pre-xmas wildcard is played at the very last moment, and the post-xmas wildcard is played before the first double gameweek. The exception is when the number of injured / banned players in the squad exceeds some threshold, in which case the wildcard is played straight away.
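
The simpler deterministic rule might look like this. The caller is assumed to supply trigger_gw (the last gameweek in which the pre-xmas wildcard is valid, or the gameweek chosen ahead of the first double gameweek); the function name and the default threshold are placeholders.

```python
def should_play_wildcard(gameweek, trigger_gw, n_unavailable, threshold=4):
    """Play the wildcard once the trigger gameweek is reached, or earlier if
    more than `threshold` squad players are injured or banned."""
    if n_unavailable > threshold:
        return True
    return gameweek >= trigger_gw
```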

Add Travis

This repo is open, so Travis is free.