UFC/MMA Scrape README File

Objective

The purpose of this project is to scrape historical MMA data on fights and fighters, clean the data, and create new feature variables to make it as useful as possible. The project is written in R and uses the rvest package for scraping.
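As a rough illustration of the rvest workflow (not the project's actual code), the sketch below parses a results table into a data frame and splits a combined method column into Result and Subresult. The inline HTML, its column headers, and the `Method` column are made-up stand-ins for a Wikipedia event page; it assumes rvest 1.0 or later, where `html_element()` is available.

```r
# Minimal rvest sketch: parse an HTML results table into a data frame.
# The HTML below is a hypothetical stand-in for a Wikipedia event page.
library(rvest)

html <- paste0(
  "<table>",
  "<tr><th>Weight</th><th>FighterA</th><th>VS</th><th>FighterB</th>",
  "<th>Method</th><th>Round</th><th>Time</th></tr>",
  "<tr><td>Featherweight</td><td>Yair Rodriguez</td><td>def.</td>",
  "<td>B.J. Penn</td><td>TKO (front kick and punches)</td><td>2</td><td>0:24</td></tr>",
  "<tr><td>Lightweight</td><td>Joe Lauzon</td><td>def.</td>",
  "<td>Marcin Held</td><td>Decision (split)</td><td>3</td><td>5:00</td></tr>",
  "</table>"
)

page  <- read_html(html)                          # parse the literal HTML string
bouts <- html_table(html_element(page, "table"))  # first <table> as a tibble
bouts <- as.data.frame(bouts, stringsAsFactors = FALSE)

# Split "TKO (front kick and punches)" into a result and a sub-result.
bouts$Result <- sub("\\s*\\(.*$", "", bouts$Method)
bouts$Subresult <- ifelse(
  grepl("(", bouts$Method, fixed = TRUE),
  sub("^[^(]*\\((.*)\\)\\s*$", "\\1", bouts$Method),
  NA_character_                                   # no parenthesised detail
)

print(bouts[, c("FighterA", "FighterB", "Result", "Subresult", "Round", "Time")])
```

For a live page, `read_html()` would be given a URL instead of a string, and the table selector would need to match the page's real markup.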

Files

  1. 0-wiki_ufcbouts.R scrapes the results of every UFC fight from Wikipedia (or updates an existing fight DB).
  2. 1-wiki_ufcfighters.R scrapes the details of every fighter who has fought in the UFC and has a Wikipedia page. Fighter details include age, height, reach, wins/losses, nationality, team/camp association, etc.
  3. wiki_ufcbouts_functions.R houses all custom functions used in the ufcbouts scrape file.
  4. wiki_ufcfighters_functions.R houses all custom functions used in the ufcfighters scrape file.

Instructions

Save the folder mma_scrape to your current working directory. There are two scrape files, 0-wiki_ufcbouts.R and 1-wiki_ufcfighters.R; file 0 must be run before file 1. The two function files provide the functions used for scraping and are sourced at the top of their respective scrape files. Each scrape file outputs a dataframe saved as an .RData file in the folder mma_scrape. If the ufcbouts scrape file has been run before and its output .RData file exists in mma_scrape, running it again scrapes ONLY the fight records added to Wikipedia since the last run, appends them to the boutsdf dataframe, and saves the result back to the same .RData file.
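The append-and-save step of that update can be sketched in base R as below. This is an illustration under assumptions, not the script's real logic: `update_bouts()` is a hypothetical helper, the Event/FighterA/FighterB key used to spot new records is a guess, and the real script presumably decides what is new before requesting pages rather than filtering afterwards.

```r
# Hypothetical helper: append only unseen bouts to a saved boutsdf, then re-save.
update_bouts <- function(new_scrape, rdata_path) {
  if (file.exists(rdata_path)) {
    env <- new.env()
    load(rdata_path, envir = env)        # restores `boutsdf` into env
    old <- env$boutsdf
    # Assumed record key; the real script may identify new fights differently.
    key <- function(df) paste(df$Event, df$FighterA, df$FighterB, sep = "|")
    fresh <- new_scrape[!key(new_scrape) %in% key(old), , drop = FALSE]
    boutsdf <- rbind(fresh, old)         # newest records first
  } else {
    boutsdf <- new_scrape                # first run: keep everything
  }
  save(boutsdf, file = rdata_path)
  invisible(boutsdf)
}

# Usage with made-up records and a temporary file.
path  <- tempfile(fileext = ".RData")
first <- data.frame(Event = "UFC A", FighterA = "X", FighterB = "Y",
                    stringsAsFactors = FALSE)
later <- rbind(
  data.frame(Event = "UFC B", FighterA = "P", FighterB = "Q",
             stringsAsFactors = FALSE),
  first
)
update_bouts(first, path)
out <- update_bouts(later, path)
print(out)
```

On the second call only the "UFC B" row is treated as new, so the saved dataframe holds two rows rather than three.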

Notes

Most of the code performs text cleanup, text extraction, variable tidying, and feature creation. I plan to add more in the near future (scraping historical judging data for all MMA fights, merging datasets).

List of Variables Within the Output DF From Each Scrape File

Bout Results File 0-wiki_ufcbouts.R:

> str(boutsdf)
'data.frame':	4033 obs. of  28 variables:
 $ Weight          : chr  "Featherweight" "Lightweight" "Welterweight" "Flyweight" ...
 $ FighterA        : chr  "Yair Rodriguez" "Joe Lauzon" "Ben Saunders" "Sergio Pettis" ...
 $ VS              : chr  "def." "def." "def." "def." ...
 $ FighterB        : chr  "B.J. Penn" "Marcin Held" "Court McGee" "John Moraga" ...
 $ Result          : chr  "TKO" "Decision" "Decision" "Decision" ...
 $ Subresult       : chr  "front kick and punches" "split" "unanimous" "unanimous" ...
 $ Round           : num  2 3 3 3 3 3 1 3 3 2 ...
 $ Time            : chr  "0:24" "5:00" "5:00" "5:00" ...
 $ TotalSeconds    : num  324 900 900 900 900 900 177 900 819 461 ...
 $ Event           : chr  "UFC Fight Night: Rodriguez vs. Penn" "UFC Fight Night: Rodriguez vs. Penn" "UFC Fight Night: Rodriguez vs. Penn" "UFC Fight Night: Rodriguez vs. Penn" ...
 $ Date            : Date, format: "2017-01-15" "2017-01-15" "2017-01-15" ...
 $ Venue           : chr  "Talking Stick Resort Arena" "Talking Stick Resort Arena" "Talking Stick Resort Arena" "Talking Stick Resort Arena" ...
 $ City            : chr  "Phoenix" "Phoenix" "Phoenix" "Phoenix" ...
 $ State           : chr  "Arizona" "Arizona" "Arizona" "Arizona" ...
 $ Country         : chr  "U.S." "U.S." "U.S." "U.S." ...
 $ champPost       : chr  NA NA NA NA ...
 $ interimChampPost: chr  NA NA NA NA ...
 $ wikilink        : chr  "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Rodr%C3%ADguez_vs._Penn" "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Rodr%C3%ADguez_vs._Penn" "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Rodr%C3%ADguez_vs._Penn" "https://en.wikipedia.org/wiki/UFC_Fight_Night:_Rodr%C3%ADguez_vs._Penn" ...
 $ over1.5r        : num  0 1 1 1 1 1 0 1 1 1 ...
 $ over2.5r        : num  0 1 1 1 1 1 0 1 1 0 ...
 $ over3.5r        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ over4.5r        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ ITD             : num  1 0 0 0 0 0 1 0 1 1 ...
 $ r1Finish        : num  0 0 0 0 0 0 1 0 0 0 ...
 $ r2Finish        : num  1 0 0 0 0 0 0 0 0 1 ...
 $ r3Finish        : num  0 0 0 0 0 0 0 0 1 0 ...
 $ r4Finish        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ r5Finish        : num  0 0 0 0 0 0 0 0 0 0 ...
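
The derived columns in the output above (TotalSeconds, the over*.5r flags, ITD, and the r*Finish flags) appear to follow from Round, Time, and Result. The sketch below reproduces them for two of the sample rows. The formulas are inferred from the sample values, not taken from the scrape script; `derive_bout_features()` is a hypothetical name, five-minute rounds are assumed, and treating every non-"Decision" result as a finish is a simplification that would misclassify draws or no contests.

```r
# Hypothetical re-derivation of the feature columns from Round, Time and Result.
derive_bout_features <- function(df, round_seconds = 300) {
  # "m:ss" within the final round -> seconds
  parts   <- strsplit(df$Time, ":", fixed = TRUE)
  elapsed <- vapply(parts,
                    function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
                    numeric(1))
  df$TotalSeconds <- (df$Round - 1) * round_seconds + elapsed

  # over1.5r ... over4.5r: did the fight last longer than k and a half rounds?
  for (k in 1:4) {
    df[[paste0("over", k, ".5r")]] <-
      as.numeric(df$TotalSeconds > (k + 0.5) * round_seconds)
  }

  # ITD ("inside the distance"): assumed to mean any non-decision result.
  df$ITD <- as.numeric(df$Result != "Decision")

  # r1Finish ... r5Finish: finish occurred in round k.
  for (k in 1:5) {
    df[[paste0("r", k, "Finish")]] <- as.numeric(df$ITD == 1 & df$Round == k)
  }
  df
}

demo <- data.frame(
  Result = c("TKO", "Decision"),
  Round  = c(2, 3),
  Time   = c("0:24", "5:00"),
  stringsAsFactors = FALSE
)
out <- derive_bout_features(demo)
print(out)
```

For the first row this gives TotalSeconds of 324 with over1.5r equal to 0 and r2Finish equal to 1, matching the first bout in the listing.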

Fighter Details File 1-wiki_ufcfighters.R:

> str(fighters)
Classes 'tbl_df', 'tbl' and 'data.frame':	1235 obs. of  45 variables:
 $ Name                      : chr  "Aaron Brink" "Aaron Riley" "Aaron Rosa" "Aaron Simpson" ...
 $ Current Division          : chr  "Heavyweight" "Lightweight" "Light Heavyweight" "Welterweight" ...
 $ Total Fights              : num  52 45 23 17 22 30 11 17 89 36 ...
 $ Total Wins                : num  26 30 17 12 15 20 6 14 56 28 ...
 $ Wins By knockout          : num  21 6 6 6 5 6 3 8 7 13 ...
 $ Wins By submission        : num  5 13 4 1 4 6 3 1 38 3 ...
 $ Wins By decision          : num  0 11 7 5 5 8 0 5 8 12 ...
 $ Wins By disqualification  : num  0 0 0 0 1 0 0 0 0 0 ...
 $ Wins Unknown              : num  0 0 0 0 0 0 0 0 3 0 ...
 $ Total Losses              : num  25 14 6 5 6 9 5 2 29 8 ...
 $ Losses By knockout        : num  6 7 3 3 1 2 2 1 9 2 ...
 $ Losses By submission      : num  18 2 2 0 3 1 2 1 16 0 ...
 $ Losses By decision        : num  1 5 1 2 2 6 1 0 4 6 ...
 $ Losses By disqualification: num  0 0 0 0 0 0 0 0 0 0 ...
 $ Losses Unknown            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ No Contest                : num  1 0 0 0 1 0 0 0 0 0 ...
 $ Draw                      : num  0 1 0 0 0 1 0 1 4 0 ...
 $ Born                      : chr  "1974-11-12" "1980-12-09" "1983-05-28" "1974-07-20" ...
 $ Height in Inches          : num  75 69 75 72 68 71 74 74 68 70 ...
 $ Reach in Inches           : num  75 69 77 73 70 NA NA 76 NA 72 ...
 $ Team                      : chr  "The Arena" "Jackson's Submission FightingAmerican Top Team (formerly)" "Team Punishment" "Power MMA Team" ...
 $ Trainer                   : chr  NA NA NA NA ...
 $ Weight in Pounds          : num  203 155 204 170 155 ...
 $ Division                  : chr  "Light Heavyweight (formerly)Heavyweight (current)" "LightweightWelterweight" "Light Heavyweight (205 lb) Heavyweight (265 lb)" "WelterweightMiddleweight" ...
 $ Other names               : chr  NA NA "Big Red" "A-Train" ...
 $ Rank                      : chr  NA NA NA NA ...
 $ wikilink                  : chr  "https://en.wikipedia.org/wiki/Aaron_Brink" "https://en.wikipedia.org/wiki/Aaron_Riley" "https://en.wikipedia.org/wiki/Aaron_Rosa" "https://en.wikipedia.org/wiki/Aaron_Simpson_(fighter)" ...
 $ Years active              : chr  "1998-present" "1997-2013" "2005 - present" NA ...
 $ Fighting out of           : chr  "San Diego, California" "Albuquerque, New Mexico, United States" NA "Phoenix, Arizona, U.S." ...
 $ Notable relatives         : chr  NA NA NA NA ...
 $ Residence                 : chr  "Roseville, California" NA "San Antonio, Texas" NA ...
 $ Style                     : chr  NA "Boxing, Wrestling" NA "Wrestling" ...
 $ Nationality               : chr  "American" "American" "American" "American" ...
 $ Notable school(s)         : chr  NA NA NA "Antelope Union High School" ...
 $ Stance                    : chr  "Orthodox" "Southpaw" NA "Orthodox" ...
 $ University                : chr  NA NA NA "Arizona State University" ...
 $ Wrestling                 : chr  NA NA NA "NCAA Division I Wrestling" ...
 $ Website                   : chr  NA NA NA NA ...
 $ Teacher(s)                : chr  NA NA NA NA ...
 $ Ethnicity                 : chr  NA NA NA NA ...
 $ Notable students          : chr  NA NA NA NA ...
 $ Children                  : chr  NA NA NA NA ...
 $ Spouse                    : chr  NA NA NA NA ...
 $ Occupation                : chr  NA NA NA NA ...
 $ Died                      : chr  NA NA NA NA ...
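
Many of the fighter columns have spaces in their names, so they need backticks or `[[ ]]` when used. The sketch below rebuilds two of the sample rows, checks that the record totals add up, and derives an age from Born. The `record_ok` and `Age` columns are hypothetical additions, and the reference date (the date of the most recent bout in the listing above) is an arbitrary choice for the example.

```r
# Two rows copied from the sample output; check.names = FALSE keeps the spaces.
fighters_demo <- data.frame(
  Name           = c("Aaron Brink", "Aaron Riley"),
  `Total Fights` = c(52, 45),
  `Total Wins`   = c(26, 30),
  `Total Losses` = c(25, 14),
  `No Contest`   = c(1, 0),
  Draw           = c(0, 1),
  Born           = c("1974-11-12", "1980-12-09"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Consistency check: wins + losses + draws + no contests should equal the total.
fighters_demo$record_ok <- with(
  fighters_demo,
  `Total Fights` == `Total Wins` + `Total Losses` + Draw + `No Contest`
)

# Approximate age in whole years on an assumed reference date.
ref_date <- as.Date("2017-01-15")
born     <- as.Date(fighters_demo$Born)
fighters_demo$Age <- floor(as.numeric(ref_date - born) / 365.25)

print(fighters_demo[, c("Name", "record_ok", "Age")])
```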

Contributors

chrismuir
