harryshomer / hockey-scraper
Python Package for scraping NHL Play-by-Play and Shift data
Home Page: http://hockey-data.harryshomer.com
License: GNU General Public License v3.0
Looks like the NHL changed some of the team tri-codes in the HTML pbp from last year.
For example, below is a comparison of two events for the Lightning in the 2020-2021 season and the 2021-2022 season. Notice that in the first picture they are referred to as "T.B" and in the second as "TBL".
This causes the function hockey_scraper.nhl.pbp.html_pbp.get_player_name to fail to find the name corresponding to the correct player when the home team's tri-code has changed.
For backwards compatibility I will probably create a mapping that converts the new tri-codes to the old ones, but I'm not sure yet.
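A minimal sketch of what such a shim could look like (the names and the exact set of mappings are hypothetical, not part of hockey_scraper):

```python
# Hypothetical backwards-compatibility shim: map new-style NHL tri-codes back
# to the abbreviations the older HTML pbp used. Only a few illustrative
# entries are shown; a real mapping would need to cover every changed team.
NEW_TO_OLD_TRICODE = {
    "TBL": "T.B",
    "NJD": "N.J",
    "SJS": "S.J",
    "LAK": "L.A",
}

def normalize_tricode(tricode):
    """Return the legacy tri-code if a new-style one is given, else pass through."""
    return NEW_TO_OLD_TRICODE.get(tricode, tricode)

print(normalize_tricode("TBL"))  # T.B
print(normalize_tricode("BOS"))  # BOS
```

Calling this on every team abbreviation before the player lookup would keep get_player_name working for both seasons.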
I noticed that pbp_columns is hard coded here: https://github.com/HarryShomer/Hockey-Scraper/blob/master/hockey_scraper/nhl/game_scraper.py#L21
Ideally the methods (game and range scrapers, etc.) would let us specify valid column names, or would accept a flag to return some of the game- and play-specific metadata, namely the eventIdx, eventId, and dateTime keys, as these are the fields I was looking to use.
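In the meantime, those fields can be pulled straight out of the live-feed JSON. A sketch, assuming the old statsapi layout where each play carries an "about" block (the helper name is made up):

```python
# Hypothetical helper: pull the eventIdx, eventId, and dateTime fields that
# the hard-coded pbp_columns drop. Assumes the statsapi live-feed structure
# where each play has an "about" block.
def extract_event_metadata(live_feed_json):
    plays = live_feed_json["liveData"]["plays"]["allPlays"]
    return [
        {
            "eventIdx": p["about"]["eventIdx"],
            "eventId": p["about"]["eventId"],
            "dateTime": p["about"]["dateTime"],
        }
        for p in plays
    ]

sample = {"liveData": {"plays": {"allPlays": [
    {"about": {"eventIdx": 0, "eventId": 1, "dateTime": "2016-10-12T23:10:26Z"}}
]}}}
print(extract_event_metadata(sample))
```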
Up until now we've used the following query to generate games in an interval:
https://statsapi.web.nhl.com/api/v1/schedule?startDate=YYYY-MM-DD&endDate=YYYY-MM-DD
For some reason, it seems to currently be missing some games. An example is the following query, which is missing all games from 10-14 to 10-20.
Not sure what's up on the NHL side and whether this is just a glitch that'll go away in a few days.
Regardless we can move over to just specifying the season as a query parameter for the endpoint. For example:
http://statsapi.web.nhl.com/api/v1/schedule?season=20212022
This should actually make the code in the nhl.json_schedule module simpler. I also don't think it should break backwards compatibility, so it should be fine. I'll likely take care of this over the weekend.
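Building the season-style URL is straightforward. A quick sketch (the helper name is illustrative; the endpoint and parameter come from the example above):

```python
# Sketch: construct the season-based schedule URL rather than a date range.
def season_schedule_url(start_year):
    """E.g. 2021 -> .../schedule?season=20212022"""
    season = f"{start_year}{start_year + 1}"
    return f"http://statsapi.web.nhl.com/api/v1/schedule?season={season}"

print(season_schedule_url(2021))
# http://statsapi.web.nhl.com/api/v1/schedule?season=20212022
```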
I just pip-installed into a clean python 3.7 environment created with conda, and got the following error on import:
ModuleNotFoundError: No module named 'hockey_scraper.nwhl'
Integration tests are failing as the ESPN API link no longer returns game data:
Tested with a current 2023 gameId and found the link was broken for this as well, so ESPN must've changed the URL.
Not sure if this is related to the new updates to the NHL's data or just because it's pre-season, but I noticed that the scraper is not populating the coordinates field for most goals so far this pre-season. I've attached a screenshot of the issue. The dataframe I'm running this on is the pbp data from each game on 9/28. It's normal for the other events (Stop, Pend, etc.) to have null coordinates, but that many goals missing proper coordinates seems like an issue. Just wanted to raise awareness of this.
When I run scrape_games, the pbp dataframe returned has NaN for all x and y coordinates for all plays. I think the issue is in nhl.pbp.json_pbp.get_pbp, which seems to get an empty dict back from shared.scrape_page, even though calling scrape_page directly on its own seems to work.
I'm running python 3.7.3, hockey_scraper version 1.32.4
Sample code:
import hockey_scraper as hs
import json
# Returns a pandas df where xC, yC are null
hs.scrape_games([2016020001], data_format = 'Pandas')
# Returns an empty dict and error message: Json pbp for game [2016020001] is either not there or can't be obtained
hs.nhl.pbp.json_pbp.get_pbp([2016020001])
# Returns a play with x and y coordinates which are not null
json.loads(hs.shared.scrape_page('http://statsapi.web.nhl.com/api/v1/game/2016020001/feed/live'))['liveData']['plays']['allPlays'][3]
Is anyone else having issues with the endpoint in json_schedule? It seems to be returning no data for me when I try to use hockey_scraper.scrape_games or hockey_scraper.scrape_seasons.
The code is pretty simple.
import hockey_scraper
hockey_scraper.scrape_games([2023020506], True)
It throws this error:
Traceback (most recent call last):
File "/home/hugo/dev/hockey/src/hockey-scraper/scrape_game.py", line 3, in <module>
hockey_scraper.scrape_games([2023020506], True)
File "/home/hugo/dev/hockey/src/hockey_venv/lib/python3.10/site-packages/hockey_scraper/nhl/scrape_functions.py", line 237, in scrape_games
games_list = json_schedule.get_dates(games)
File "/home/hugo/dev/hockey/src/hockey_venv/lib/python3.10/site-packages/hockey_scraper/nhl/json_schedule.py", line 90, in get_dates
schedule = scrape_schedule(date_from, date_to, preseason=True, not_over=True)
File "/home/hugo/dev/hockey/src/hockey_venv/lib/python3.10/site-packages/hockey_scraper/nhl/json_schedule.py", line 114, in scrape_schedule
schedule_json = chunk_schedule_calls(date_from, date_to)
File "/home/hugo/dev/hockey/src/hockey_venv/lib/python3.10/site-packages/hockey_scraper/nhl/json_schedule.py", line 56, in chunk_schedule_calls
chunk_sched = get_schedule(f_chunk, t_chunk)
File "/home/hugo/dev/hockey/src/hockey_venv/lib/python3.10/site-packages/hockey_scraper/nhl/json_schedule.py", line 31, in get_schedule
return json.loads(shared.get_file(page_info, force=True))
File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
I was pulling down data (using both scrape_seasons and scrape_games) and I noticed that ~30% of shots either had no xC/yC data, or had it listed as one of the bullet points:
Looking into it a bit, it looks like the "eventId" values in the JSON are no longer guaranteed to be in order (see the snippet of JSON below). In json_pbp.py, I removed the sorted_events logic and got data in the "right" order:
Not sorting seems to mostly work. Still need to investigate cases where the HTML event length != JSON event length. Sorting by seconds_elapsed doesn't work well for stoppages followed by faceoffs at the same time point.
I might have time to try to find a more elegant fix (and maybe add a test that grabs a couple of plays from a game to confirm it's being parsed correctly). But I wanted to write this down / make a note of it in case anyone else is looking at it.
"plays": [
{
"eventId": 102,
"periodDescriptor": {
"number": 1,
"periodType": "REG"
},
"timeInPeriod": "00:00",
"timeRemaining": "20:00",
"situationCode": "1551",
"homeTeamDefendingSide": "left",
"typeCode": 520,
"typeDescKey": "period-start",
"sortOrder": 8
},
{
"eventId": 101,
"periodDescriptor": {
"number": 1,
"periodType": "REG"
},
"timeInPeriod": "00:00",
"timeRemaining": "20:00",
"situationCode": "1551",
"homeTeamDefendingSide": "left",
"typeCode": 502,
"typeDescKey": "faceoff",
"sortOrder": 9,
"details": {
"eventOwnerTeamId": 18,
"losingPlayerId": 8478519,
"winningPlayerId": 8475158,
"xCoord": 0,
"yCoord": 0,
"zoneCode": "N"
}
},
{
"eventId": 8,
"periodDescriptor": {
"number": 1,
"periodType": "REG"
},
"timeInPeriod": "00:35",
"timeRemaining": "19:25",
"situationCode": "1551",
"homeTeamDefendingSide": "left",
"typeCode": 516,
"typeDescKey": "stoppage",
"sortOrder": 15,
"details": {
"reason": "icing"
}
},
{
"eventId": 103,
"periodDescriptor": {
"number": 1,
"periodType": "REG"
},
"timeInPeriod": "00:35",
"timeRemaining": "19:25",
"situationCode": "1551",
"homeTeamDefendingSide": "left",
"typeCode": 502,
"typeDescKey": "faceoff",
"sortOrder": 17,
"details": {
"eventOwnerTeamId": 14,
"losingPlayerId": 8476925,
"winningPlayerId": 8478519,
"xCoord": -69,
"yCoord": 22,
"zoneCode": "D"
}
},
{
"eventId": 9,
"periodDescriptor": {
"number": 1,
"periodType": "REG"
},
"timeInPeriod": "00:48",
"timeRemaining": "19:12",
"situationCode": "1551",
"homeTeamDefendingSide": "left",
"typeCode": 503,
"typeDescKey": "hit",
"sortOrder": 20,
"details": {
"xCoord": 64,
"yCoord": 42,
"zoneCode": "D",
"eventOwnerTeamId": 18,
"hittingPlayerId": 8474568,
"hitteePlayerId": 8476453
}
},
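One alternative to dropping the sort entirely: in the payload above, eventId is no longer monotonic, but each play carries a "sortOrder" key that is. A sketch of ordering on that instead (just an idea, not the actual fix):

```python
# Sketch: sort plays by the "sortOrder" key the new API provides, rather than
# by eventId, which is no longer guaranteed to be in order.
def order_plays(plays):
    return sorted(plays, key=lambda p: p["sortOrder"])

# Trimmed-down events from the snippet above
plays = [
    {"eventId": 102, "sortOrder": 8},
    {"eventId": 101, "sortOrder": 9},
    {"eventId": 8,   "sortOrder": 15},
    {"eventId": 103, "sortOrder": 17},
]
print([p["eventId"] for p in order_plays(plays)])  # [102, 101, 8, 103]
```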
The following error is thrown for these 4 games:
Warning: Problem combining Html Json pbp for game
Games:
2019020054
2019020084
2019020126
2019020141
@HarryShomer Thanks for putting this together. Any idea why I'm getting the below import error on Python 2.7 and Windows 10?
Thanks!
Event distance for shots could be parsed out from the pbp description into a separate column with a regex match. This column would only be populated for SHOT, MISS, and GOAL events.
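A minimal sketch of what that regex match could look like, assuming the HTML pbp descriptions end in a distance like ", 12 ft." (best-effort; the helper name is made up):

```python
import re

# Sketch: parse the shot distance out of an HTML pbp event description.
# Assumes descriptions for SHOT/MISS/GOAL events contain "<n> ft."
DIST_RE = re.compile(r"(\d+)\s*ft\.?")

def event_distance(description):
    """Return the distance in feet, or None if no distance is present."""
    m = DIST_RE.search(description)
    return int(m.group(1)) if m else None

print(event_distance("TOR #34 MATTHEWS, Wrist, Off. Zone, 12 ft."))  # 12
print(event_distance("STOP - ICING"))  # None
```

Non-shot events would simply get None, leaving the column populated only for SHOT, MISS, and GOAL rows.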
Hi,
I get this 'period' error whenever I try to scrape a game, e.g. running:
hockey_scraper.scrape_games([2023020002], True)
Scraping Game 2023020002 2023-10-10
Error: Error parsing Json pbp for game 2023020002 'period'
Warning: Using espn for pbp
Error: Espn pbp for game 2023-10-10 PIT CHI is either not there or can't be obtained
Thanks,
Because of the delay due to COVID-19 and the return-to-play protocol, the scraper doesn't detect games after July 1st of the season.
I had one thought about this and stumbled upon what I believe to be a bug (both in the same code section).
Here is your original code in nhl.json_schedule, in the get_dates() function:
# If the last game is part of the ongoing season then only request the schedule until that day
# We get strange errors if we don't do it like this
if year_to == shared.get_season(datetime.strftime(datetime.today(), "%Y-%m-%d")):
date_to = '-'.join([str(datetime.today().year), str(datetime.today().month), str(datetime.today().day)])
else:
date_to = '-'.join([str(int(year_to) + 1), '7', '1']) # Newest game in sample
My first thought was to just change the second date to August 30th -
[...]
else:
date_to = '-'.join([str(int(year_to) + 1), '8', '30']) # Newest game in sample
But then I noticed your first comment says that if the game is in the current season we should only request the schedule until that date. But that branch can never be taken, since you are comparing a string (year_to) with an integer (shared.get_season(datetime.strftime(datetime.today(), "%Y-%m-%d"))).
Let me know if you'd like me to submit a PR for this or how you want to go about fixing it up. Maybe add special logic if the year detected is 2020?
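A sketch of what the combined fix might look like, casting both sides to int so the current-season check can actually be true (hypothetical helper, not the actual patch; assumes shared.get_season returns the starting year of the season as an int):

```python
from datetime import datetime

# Hypothetical rewrite of the date_to logic from get_dates():
# compare like types, and push the off-season cutoff to August 30th.
def resolve_date_to(year_to, current_season):
    """year_to arrives as a string; current_season as an int (e.g. 2019)."""
    if int(year_to) == current_season:
        # Ongoing season: only request the schedule up to today
        today = datetime.today()
        return f"{today.year}-{today.month}-{today.day}"
    # Finished season: later cutoff for the COVID-19 return-to-play games
    return f"{int(year_to) + 1}-8-30"

print(resolve_date_to("2018", 2019))  # 2019-8-30
```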
Working in a clean environment with only hockey_scraper and its dependencies installed. Just using the README commands. I'm at work, so if you want to hold off until I can duplicate this at home, feel free.
Python 3.7.2 (default, Feb 12 2019, 08:15:36)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import hockey_scraper
>>> hockey_scraper.nwhl.scrape_games([14694271, 14814946, 14689491], docs_dir='/Users/mcbarlowe/')
Scraping NWHL Game 14694271
Json pbp for game 14694271 is either not there or can't be obtained
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mbarlowe/code/python/pyenvs/hockey_scrape/lib/python3.7/site-packages/hockey_scraper/nwhl/scrape_functions.py", line 83, in scrape_games
pbp_df = scrape_list_of_games(games)
File "/Users/mbarlowe/code/python/pyenvs/hockey_scrape/lib/python3.7/site-packages/hockey_scraper/nwhl/scrape_functions.py", line 52, in scrape_list_of_games
if not pbp_df.empty:
AttributeError: 'NoneType' object has no attribute 'empty'
Error:
File "<string>", line unknown
SelectorSyntaxError: Malformed class selector at position 2
line 1:
td.+.bborder
^
Scraping null values generates errors.
hockey_scraper.scrape_date_range('2014-04-09', '2014-04-10', False)
generates a ValueError. This is because data from that day somehow has empty or null values.
I added a line to html_pbp.py before the astype conversion to eliminate these values.
Now it works:
game_df = game_df.replace(r'^\s*$', '0.0', regex=True)
game_df.Seconds_Elapsed = game_df.Seconds_Elapsed.astype(float)
I noticed the merge failed for the EDM/NJD game (id=2019020049) in combine_html_json_pbp() in game_scraper.py.
I checked, and the reason was that the html_df.Seconds_Elapsed column was represented as dtype object and not float.
I think adding this line after the try would fix the problem (note the assignment — astype returns a copy, so the result must be assigned back):
html_df.Seconds_Elapsed = html_df.Seconds_Elapsed.astype(float)
I would make a pull request, but I don't know if it would cause other errors to go un-checked or something like that.
Heyo- first off thanks for the package!
I'm having issues with a relatively simple function call. I'm running Python 3.6 on Linux (Ubuntu 18.04). Any help would be great; let me know if you need any more info.
import hockey_scraper
# Scrapes all games between 2016-10-10 and 2016-10-20 without shifts and stores the data in a Csv file
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False)
Scraping Game 2016020001 2016-10-12
Traceback (most recent call last):
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-24-43f5db1c1b18>", line 4, in <module>
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/scrape_functions.py", line 179, in scrape_date_range
pbp_df, shifts_df = scrape_list_of_games(games, if_scrape_shifts)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/scrape_functions.py", line 128, in scrape_list_of_games
pbp_df, shifts_df = game_scraper.scrape_game(str(game["game_id"]), game["date"], if_scrape_shifts)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/game_scraper.py", line 351, in scrape_game
pbp_df = scrape_pbp(game_id, date, roster, game_json, players, teams)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/game_scraper.py", line 272, in scrape_pbp
html_df = html_pbp.scrape_game(game_id, players, teams)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/html_pbp.py", line 841, in scrape_game
return scrape_pbp(game_html, game_id, players, teams)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/html_pbp.py", line 802, in scrape_pbp
cleaned_html = clean_html_pbp(game_html)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/html_pbp.py", line 133, in clean_html_pbp
soup = get_soup(html)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/html_pbp.py", line 68, in get_soup
soup = soup.select('td.+.bborder')
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/bs4/element.py", line 1377, in select
return soupsieve.select(selector, self, namespaces, limit, **kwargs)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/soupsieve/__init__.py", line 108, in select
return compile(select, namespaces, flags).select(tag, limit)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/soupsieve/__init__.py", line 59, in compile
return cp._cached_css_compile(pattern, namespaces, flags)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/soupsieve/css_parser.py", line 178, in _cached_css_compile
CSSParser(pattern, flags).process_selectors(),
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/soupsieve/css_parser.py", line 810, in process_selectors
return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/soupsieve/css_parser.py", line 688, in parse_selectors
key, m = next(iselector)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/soupsieve/css_parser.py", line 797, in selector_iter
raise SyntaxError("Invlaid character {!r} at position {}".format(pattern[index], index))
File "<string>", line unknown
SyntaxError: Invlaid character '.' at position 2
Some people were discussing this topic on twitter today and I had this lying around so I figured I may as well push it. The idea here is to combine the pbp and shifts into one DataFrame for ease of use. As of right now this only lives on my dev branch.
Here is how it works:
import hockey_scraper as hs
data = hs.scrape_games([2019020001, 2019020003, 2019020003], True, data_format='pandas')
merged_data = hs.utils.merge(data['pbp'], data['shifts'])
I wrote this over a year ago and haven't really tested it too well, so use at your own risk. The code could likely be cleaned up as well. If you use it, let me know how it works. If you find any issues, please let me know (in this thread) or feel free to open a merge request.
scrape_games can only handle a list of game IDs; it fails when passed a single game ID, due to sorting.
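One possible workaround is normalizing the argument before sorting. A sketch (the helper is hypothetical, not part of hockey_scraper):

```python
# Hypothetical helper: accept either a single game id or a list of ids,
# and return a sorted list of ints so downstream sorting never sees a scalar.
def normalize_games(games):
    if isinstance(games, (int, str)):
        games = [games]
    return sorted(int(g) for g in games)

print(normalize_games(2019020001))                  # [2019020001]
print(normalize_games(["2019020003", 2019020001]))  # [2019020001, 2019020003]
```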