
predicting-march-madness's Introduction

Predicting March Madness

Kaggle’s March Madness prediction competition is an accessible introduction to machine learning. If you happen to like college basketball, you’ll appreciate that you can’t bust your bracket in this competition, since you submit a prediction for every possible game. Plus this year there’s a big prize pool, and luck plays a large enough role that you can be a legitimate contender fairly easily.

In 2016, my simple process using tidyverse functions in R placed in the top 10%. I refined it a bit for 2017 and finished in the top 25%.

I’m sharing my code and process here for others to use as a starting point. My approach is similar to that of the 2014 winners, Gregory Matthews and Michael Lopez. They published a paper about the role that luck plays in this competition, which puts their model in perspective. A takeaway: take my model, tweak it a bit to put some distance between yourself and the field, and you’re a legitimate contender to win!

What’s here

In the Kaggle competition, you estimate how likely it is that Team A beats Team B, for each of the 2,278 possible matchups in the tournament. My guide documents a set of scripts for each step of:

  • Deciding on possible input parameters
  • Scraping the input data with the rvest package (a minimal sketch follows this list)
  • Cleaning and joining data sources to get tidy, prediction-ready data
  • Training and evaluating machine learning models on the data
  • Making and submitting predictions
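
As a rough illustration of the scraping step above, here is a minimal rvest sketch. The URL, table selector, and object names are placeholders for illustration, not the repo's actual scripts:

library(rvest)
library(dplyr)

# Hypothetical sketch only -- the URL and selector are placeholders,
# not the repo's actual scraping targets.
page <- read_html("https://example.com/ratings/2017")
ratings_raw <- page %>%
  html_element("table") %>%   # grab the first HTML table on the page
  html_table()                # convert it to a data frame
glimpse(ratings_raw)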

Licensing/usage

This code is public; please reuse it. It’s under an MIT license. Please acknowledge its role in any write-up or discussion of work that relies on it. And if you win a cash prize from Kaggle using this, congratulations! I wouldn’t turn down a thank-you gift ;)

Thanks

Thanks to contributors @MHenderson and @BillPetti.

Contact me

Let me know what you think, either on Twitter (@samfirke) or with a friendly e-mail to samuel.firke AT gmail.

predicting-march-madness's People

Contributors

@billpetti, @emilelatour, @mhenderson, @sfirke


predicting-march-madness's Issues

Final Predictions in Script 05

I may be incorrect, but should the Pred = stage_1_preds part of this code be replaced with Pred = final_preds_1?

Current:
final_preds_to_send <- final_blank %>%
  dplyr::select(id) %>%
  mutate(Pred = stage_1_preds)

Should be:
final_preds_to_send <- final_blank %>%
  dplyr::select(id) %>%
  mutate(Pred = final_preds_1)

Permute which is the "lower" team

From my comment here: https://www.kaggle.com/c/mens-machine-learning-competition-2018/discussion/51230

I hadn't thought of testing my model on a case where two teams are equally strong. Seems like a great test for any model. When I predict a dummy game where the teams are identical, I get 0.51 when it should be 0.5, even after removing the intercept. Not good. I suspect this is due to failing to permute which team is the "lower" team. Perhaps teams with lower Kaggle numbers have historically done slightly better over the years of data I trained on and that leaked into my model.
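
A hedged sketch of that sanity check, assuming a fitted logistic regression named fit trained on difference-style features (the feature names below are placeholders, not the repo's actual variables):

# Hypothetical sanity check, assuming a fitted logistic regression `fit`
# built on difference-style features (feature names are placeholders).
dummy_game <- data.frame(
  adj_em_diff = 0,   # identical efficiency margins
  tempo_diff  = 0    # identical tempo
)
predict(fit, newdata = dummy_game, type = "response")
# Expect ~0.5. A consistent offset (like the 0.51 above) points to the
# lower/higher team ordering leaking into training, which randomly
# permuting which team is labeled "lower" should wash out.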

Create names-to-numbers crosswalk

See what names already match from Pomeroy to the Kaggle data, write that out to a .csv, edit it in Excel to complete the matches.
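
A hedged sketch of that workflow with dplyr joins; the object and column names (kenpom, kaggle_teams, team, team_name, team_id) are assumptions for illustration:

library(dplyr)
library(readr)

# Hypothetical names: `kenpom` has a `team` column, `kaggle_teams` has
# `team_name` and `team_id`.
crosswalk <- kenpom %>%
  distinct(team) %>%
  left_join(kaggle_teams, by = c("team" = "team_name"))

# Rows with a missing team_id are the names that still need manual matching.
write_csv(crosswalk, "names_to_numbers_crosswalk.csv")
# Fill in the blanks in Excel, then read the edited .csv back in and join
# it onto the Pomeroy data by team name.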

Slight change to parsing Ken Pom Data

Looks like readr::parse_number has changed a bit. It now expects a character vector as input before converting.

I would make a PR, but I forked into a repo of mine instead of a stand-alone fork. Thanks for providing these resources!

Made a slight change to your parsing function to handle that. The solution is as follows:

library(dplyr)
library(janitor)   # clean_names()
library(stringr)   # str_extract()
library(readr)     # parse_number()

process_ken_pom_sheet <- function(dat){
  dat <- dat %>%
    clean_names() %>%
    mutate(seed = str_extract(x2, "[0-9]+")) %>% # extract seed where applicable
    mutate(x2 = gsub(" [0-9]+", "", x2)) # remove seed from school name
  names(dat) <- c("rank", "team", "conf", "wins_losses",
                  "adj_EM", "adj_offensive_efficiency", "adj_offensive_efficiency_seed",
                  "adj_defensive_efficiency", "adj_defensive_efficiency_seed",
                  "adj_tempo", "adj_tempo_seed", "luck", "luck_seed", "sos_adj_em",
                  "sos_adj_em_seed", "opposing_offenses", "opposing_offenses_seed",
                  "opposing_defenses", "opposing_defenses_seed", "ncsos_adj_em",
                  "ncsos_adj_em_seed", "year", "seed")

  dat <- dat[-1, ] %>%                   # drop the first row
    select(rank, everything()) %>%
    filter(!is.na(rank), !rank %in% c("Rank", "Rk")) %>% # drop embedded header rows
    mutate(rank = as.numeric(rank)) %>%
    mutate_all(as.character) %>%         # ensure parse_number gets character input
    mutate_at(vars(adj_EM:year), parse_number) %>%
    mutate(seed = as.numeric(seed))
  dat
}
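
For context, a minimal illustration of the parse_number behavior the mutate_all(as.character) step works around (exact behavior depends on your readr version):

readr::parse_number("$1,234.5")   # 1234.5 -- character input parses fine
# readr::parse_number(1234.5)     # newer readr errors: input must be a character vector
# hence converting every column to character before calling parse_number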

Refactor home court advantage

It should help the home team, hurt the away team, and be 0 on neutral turf. With the three-level factor I currently have, the training model assigns some value to neutral, i.e., neutral turf slightly changes the odds of the lower team winning. That is nonsense.

Maybe switch to two binary variables, homecourt_adv_lower and homecourt_adv_higher? They would be 1/0 and 0/1, or 0/0 to represent a neutral site.

Since I treat all tournament games as "neutral", the tracking of homecourt in the training model is just to cancel out its effect on learning about the impact of ratings, etc.
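
A hedged sketch of that encoding with dplyr, assuming a hypothetical location column coded relative to the lower-ID team (column names are placeholders):

library(dplyr)

# Hypothetical: `games$location` is coded relative to the lower-ID team:
# "H" = lower team at home, "A" = lower team away, "N" = neutral.
games <- games %>%
  mutate(
    homecourt_adv_lower  = if_else(location == "H", 1L, 0L),
    homecourt_adv_higher = if_else(location == "A", 1L, 0L)
  )
# Neutral-site games (including all tournament games) become 0/0, so the
# home-court terms drop out of those predictions entirely.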
