
predicting-march-madness's Introduction

Predicting March Madness

Kaggle’s March Madness prediction competition is an accessible introduction to machine learning. If you happen to like college basketball, you’ll appreciate that you can’t bust your bracket in this competition, since you submit a prediction for every possible game. Plus this year there’s a big prize pool, and luck plays a large enough role that you can be a legitimate contender fairly easily.

In 2016, my simple process using tidyverse functions in R placed in the top 10%. I refined it a bit for 2017 and finished in the top 25%.

I’m sharing my code and process here for others to use as a starting point. My approach is similar to that of the 2014 winners, Gregory Matthews and Michael Lopez. They published a paper about the role that luck plays in this competition, which puts their model in perspective. A takeaway: take my model, tweak it a bit to put some distance between yourself and the field, and you’re a legitimate contender to win!

What’s here

In the Kaggle competition, you estimate how likely it is that Team A beats Team B, for each of the 2,278 possible matchups in the tournament. My guide documents a set of scripts for each step of:

  • Deciding on possible input parameters
  • Scraping the input data with the rvest package (a minimal sketch follows this list)
  • Cleaning and joining data sources to get tidy, prediction-ready data
  • Training and evaluating machine learning models on the data
  • Making and submitting predictions
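
As a rough illustration of the scraping step above, here is a minimal rvest sketch. The URL, table selector, and object names are placeholders for illustration, not the repo's actual scripts:

library(rvest)
library(dplyr)

# Hypothetical sketch only -- the URL and selector are placeholders,
# not the repo's actual scraping targets.
page <- read_html("https://example.com/ratings/2017")
ratings_raw <- page %>%
  html_element("table") %>%   # grab the first HTML table on the page
  html_table()                # convert it to a data frame
glimpse(ratings_raw)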

Licensing/usage

This code is public; please reuse it. It’s under an MIT license. Please acknowledge its role in any write-up or discussion of work that relies on it. And if you win a cash prize from Kaggle using this, congratulations! I wouldn’t turn down a thank-you gift ;)

Thanks

Thanks to contributors @MHenderson and @BillPetti.

Contact me

Let me know what you think, either on Twitter (@samfirke) or with a friendly e-mail to samuel.firke AT gmail.

predicting-march-madness's People

Contributors

@billpetti, @emilelatour, @mhenderson, @sfirke


predicting-march-madness's Issues

Final Predictions in Script 05

I may be incorrect, but should the Pred = stage_1_preds part of this code be replaced with Pred = final_preds_1?

Current:
final_preds_to_send <- final_blank %>%
  dplyr::select(id) %>%
  mutate(Pred = stage_1_preds)

Should be:
final_preds_to_send <- final_blank %>%
  dplyr::select(id) %>%
  mutate(Pred = final_preds_1)

Permute which is the "lower" team

From my comment here: https://www.kaggle.com/c/mens-machine-learning-competition-2018/discussion/51230

I hadn't thought of testing my model on a case where two teams are equally strong. Seems like a great test for any model. When I predict a dummy game where the teams are identical, I get 0.51 when it should be 0.5, even after removing the intercept. Not good. I suspect this is due to failing to permute which team is the "lower" team. Perhaps teams with lower Kaggle numbers have historically done slightly better over the years of data I trained on and that leaked into my model.
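
A hedged sketch of that sanity check, assuming a fitted logistic regression named fit trained on difference-style features (the feature names below are placeholders, not the repo's actual variables):

# Hypothetical sanity check, assuming a fitted logistic regression `fit`
# built on difference-style features (feature names are placeholders).
dummy_game <- data.frame(
  adj_em_diff = 0,   # identical efficiency margins
  tempo_diff  = 0    # identical tempo
)
predict(fit, newdata = dummy_game, type = "response")
# Expect ~0.5. A consistent offset (like the 0.51 above) points to the
# lower/higher team ordering leaking into training, which randomly
# permuting which team is labeled "lower" should wash out.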

Create names-to-numbers crosswalk

See what names already match from Pomeroy to the Kaggle data, write that out to a .csv, edit it in Excel to complete the matches.
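
A hedged sketch of that workflow with dplyr joins; the object and column names (kenpom, kaggle_teams, team, team_name, team_id) are assumptions for illustration:

library(dplyr)
library(readr)

# Hypothetical names: `kenpom` has a `team` column, `kaggle_teams` has
# `team_name` and `team_id`.
crosswalk <- kenpom %>%
  distinct(team) %>%
  left_join(kaggle_teams, by = c("team" = "team_name"))

# Rows with a missing team_id are the names that still need manual matching.
write_csv(crosswalk, "names_to_numbers_crosswalk.csv")
# Fill in the blanks in Excel, then read the edited .csv back in and join
# it onto the Pomeroy data by team name.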

Slight change to parsing Ken Pom Data

Looks like readr::parse_number has changed a bit. It now expects a character vector as input before converting.

I would make a PR, but I forked into a repo of mine instead of a stand-alone fork. Thanks for providing these resources!

Made a slight change to your parsing function to handle that. The solution is as follows:

library(dplyr)
library(janitor)   # clean_names()
library(stringr)   # str_extract()
library(readr)     # parse_number()

process_ken_pom_sheet <- function(dat){
  dat <- dat %>%
    clean_names() %>%
    mutate(seed = str_extract(x2, "[0-9]+")) %>% # extract seed where applicable
    mutate(x2 = gsub(" [0-9]+", "", x2)) # remove seed from school name
  names(dat) <- c("rank", "team", "conf", "wins_losses",
                  "adj_EM", "adj_offensive_efficiency", "adj_offensive_efficiency_seed",
                  "adj_defensive_efficiency", "adj_defensive_efficiency_seed",
                  "adj_tempo", "adj_tempo_seed", "luck", "luck_seed", "sos_adj_em",
                  "sos_adj_em_seed", "opposing_offenses", "opposing_offenses_seed",
                  "opposing_defenses", "opposing_defenses_seed", "ncsos_adj_em",
                  "ncsos_adj_em_seed", "year", "seed")

  dat <- dat[-1, ] %>%                   # drop the first row
    select(rank, everything()) %>%
    filter(!is.na(rank), !rank %in% c("Rank", "Rk")) %>% # drop embedded header rows
    mutate(rank = as.numeric(rank)) %>%
    mutate_all(as.character) %>%         # ensure parse_number gets character input
    mutate_at(vars(adj_EM:year), parse_number) %>%
    mutate(seed = as.numeric(seed))
  dat
}
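
For context, a minimal illustration of the parse_number behavior the mutate_all(as.character) step works around (exact behavior depends on your readr version):

readr::parse_number("$1,234.5")   # 1234.5 -- character input parses fine
# readr::parse_number(1234.5)     # newer readr errors: input must be a character vector
# hence converting every column to character before calling parse_number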

Refactor home court advantage

It should help the home team, hurt the away team, and be 0 on neutral turf. With the three-level factor I currently have, the training model assigns some value to neutral, i.e., neutral turf slightly changes the odds of the lower team winning. That is nonsense.

Maybe switch to two binary variables, homecourt_adv_lower and homecourt_adv_higher? They would be 1/0 and 0/1, or 0/0 to represent a neutral site.

Since I treat all tournament games as "neutral", the tracking of homecourt in the training model is just to cancel out its effect on learning about the impact of ratings, etc.
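
A hedged sketch of that encoding with dplyr, assuming a hypothetical location column coded relative to the lower-ID team (column names are placeholders):

library(dplyr)

# Hypothetical: `games$location` is coded relative to the lower-ID team:
# "H" = lower team at home, "A" = lower team away, "N" = neutral.
games <- games %>%
  mutate(
    homecourt_adv_lower  = if_else(location == "H", 1L, 0L),
    homecourt_adv_higher = if_else(location == "A", 1L, 0L)
  )
# Neutral-site games (including all tournament games) become 0/0, so the
# home-court terms drop out of those predictions entirely.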
