AFC Stat Scraping Project 2019

We need to set up an environment for our scraping, starting on the Desktop

mkdir scraping00
cd scraping00

Create a virtual environment within the directory and then activate it

virtualenv -p python3 .
source bin/activate

Create a src directory to store the project

mkdir src
cd src

Install the dependencies for our project with pip

pip install requests
pip install beautifulsoup4
pip install jupyter
pip install pandas

Let's open up Jupyter Notebook, which opens a browser window listing our notebooks

jupyter notebook

Click New, then Python 3, which gives us a brand new notebook for our scraping project
Let's start off with our imports

import sys
import collections as co
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

Let's set the URL, request the page, and check the response status code

url = "https://www.pro-football-reference.com/years/2018/index.htm"

response = get(url)

print(response.status_code)
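
A status code of 200 means the request succeeded. As a minimal sketch (the helper names `describe_status` and `is_success` are hypothetical and not part of requests), the standard library's `http.HTTPStatus` can turn the number into a readable label before we go any further:

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a label such as '200 OK' for a numeric HTTP status code.

    Hypothetical helper for illustration; it is not part of requests.
    """
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Codes the standard library does not know about.
        phrase = "Unknown"
    return f"{status_code} {phrase}"


def is_success(status_code):
    """Return True for any 2xx status code."""
    return 200 <= status_code < 300


print(describe_status(200))
print(describe_status(404))
print(is_success(200), is_success(404))
```

In the notebook this would be called as describe_status(response.status_code). requests also provides response.raise_for_status(), which raises an exception for 4xx and 5xx responses.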

We will create a BeautifulSoup object and store it in a variable

nfl = BeautifulSoup(response.content, 'html.parser')

Now we search for the element that we are looking for

afc_table = nfl.find('div',{'class':'overthrow table_container'})
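
find() returns None when nothing matches, which would make every later step fail. The sketch below runs on a made-up HTML fragment that only imitates the shape of the page (it is an assumption, not the real markup) and shows the exact class-string lookup used above next to a CSS-selector alternative:

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment that only imitates the shape of the real page.
SAMPLE_HTML = """
<div class="overthrow table_container">
  <table><tbody><tr><th>Team A</th><td>1</td></tr></tbody></table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Passing the full string matches the class attribute exactly, order included.
exact = soup.find("div", {"class": "overthrow table_container"})

# A CSS selector matches both classes regardless of their order.
by_selector = soup.select_one("div.table_container.overthrow")

# find() returns None when nothing matches, so check before using the result.
missing = soup.find("div", {"class": "no_such_class"})

print(exact is not None, by_selector is exact, missing is None)
```

If afc_table turns out to be None in the notebook, the page layout has probably changed or the request did not return the expected HTML.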

We are going to grab the headers

table_head = afc_table.find('thead')
header = []
for th in table_head.findAll('th'):
    key = th.get_text()
    header.append(key)
print(header)
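
The same header loop can be written as a list comprehension. This is a sketch on a made-up table (the column names are placeholders, and the real page has more columns); get_text(strip=True) also trims surrounding whitespace:

```python
from bs4 import BeautifulSoup

# Made-up table; the real page has more columns.
SAMPLE_HTML = """
<table>
  <thead><tr><th> Tm </th><th>W</th><th>L</th></tr></thead>
  <tbody><tr><th>Team A</th><td>1</td><td>2</td></tr></tbody>
</table>
"""

table = BeautifulSoup(SAMPLE_HTML, "html.parser")
table_head = table.find("thead")

# One string per <th> inside <thead>; <th> cells in <tbody> are not included.
header = [th.get_text(strip=True) for th in table_head.find_all("th")]
print(header)
```

Searching from table_head rather than from the whole table keeps the row-label cells in the body out of the header list.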

We are going to count the rows

endrows = 0
for tbody in afc_table.findAll('tbody'):
    if tbody.findAll('th')[0].get_text() == '':
        endrows += 1

rows = len(afc_table.findAll('tr'))
rows -= endrows + 1

print(rows)
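
As a cross-check on the count, the data rows can also be counted directly. This is a different technique from the subtraction above, sketched on a made-up table: a CSS selector picks only the tr elements that sit inside tbody, so the header row is never counted.

```python
from bs4 import BeautifulSoup

# Made-up table with one header row and two data rows.
SAMPLE_HTML = """
<table>
  <thead><tr><th>Tm</th><th>W</th></tr></thead>
  <tbody>
    <tr><th>Team A</th><td>1</td></tr>
    <tr><th>Team B</th><td>2</td></tr>
  </tbody>
</table>
"""

table = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Every row in the table, header row included.
all_rows = table.find_all("tr")

# Only the rows nested inside <tbody>.
body_rows = table.select("tbody tr")

print(len(all_rows), len(body_rows))
```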

We are going to add it to a Pandas DataFrame

list_of_dicts = []
for row in range(rows):
    the_row = []
    try:
        table_row = afc_table.findAll('tr')[row]
        for cell in table_row:
            value = cell.get_text()
            the_row.append(value)
        od = co.OrderedDict(zip(header,the_row))
        list_of_dicts.append(od)
    except AttributeError:
        continue

df = pd.DataFrame(list_of_dicts)
df
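
Everything returned by get_text() is a string, so numeric columns arrive in the DataFrame as text. The sketch below uses placeholder team names and numbers (made up for illustration, not scraped results) to show the same zip-into-OrderedDict pattern followed by a conversion with pd.to_numeric:

```python
from collections import OrderedDict

import pandas as pd

# Placeholder values for illustration only; these are not real standings.
header = ["Tm", "W", "L"]
scraped_rows = [["Team A", "10", "6"], ["Team B", "7", "9"]]

# Pair each header with the cell in the same position, one dict per row.
list_of_dicts = [OrderedDict(zip(header, row)) for row in scraped_rows]
df = pd.DataFrame(list_of_dicts)

# Scraped cells are strings, so convert the numeric columns explicitly.
for column in ["W", "L"]:
    df[column] = pd.to_numeric(df[column])

print(df.dtypes)
```

After the conversion, operations such as df["W"].sum() work on numbers instead of concatenating strings.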