harryshomer / hockey-scraper
Python Package for scraping NHL Play-by-Play and Shift data
Home Page: http://hockey-data.harryshomer.com
License: GNU General Public License v3.0
Looks like the NHL changed some of the team tri-codes in the HTML pbp from last year.
For example, below is a comparison of two events for the Lightning in the 2020-2021 season and the 2021-2022 season. Notice that in the first picture they are referred to as "T.B" and in the second as "TBL".
This causes the function hockey_scraper.nhl.pbp.html_pbp.get_player_name to fail to find the name corresponding to the correct player when the home team's tri-code has changed.
For backwards compatibility I will probably create a mapping that converts the new tri-codes to the old ones, but I'm not sure yet.
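A minimal sketch of what such a shim could look like (the names and the exact set of mappings are hypothetical, not part of hockey_scraper):

```python
# Hypothetical backwards-compatibility shim: map new-style NHL tri-codes back
# to the abbreviations the older HTML pbp used. Only a few illustrative
# entries are shown; a real mapping would need to cover every changed team.
NEW_TO_OLD_TRICODE = {
    "TBL": "T.B",
    "NJD": "N.J",
    "SJS": "S.J",
    "LAK": "L.A",
}

def normalize_tricode(tricode):
    """Return the legacy tri-code if a new-style one is given, else pass through."""
    return NEW_TO_OLD_TRICODE.get(tricode, tricode)

print(normalize_tricode("TBL"))  # T.B
print(normalize_tricode("BOS"))  # BOS
```

Calling this on every team abbreviation before the player lookup would keep get_player_name working for both seasons.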
I noticed that pbp_columns is hard coded here: https://github.com/HarryShomer/Hockey-Scraper/blob/master/hockey_scraper/nhl/game_scraper.py#L21
Ideally the methods (game and range scrapers, etc.) would let us specify valid column names, or would accept a flag to return some of the game- and play-specific metadata, namely the eventIdx, eventId, and dateTime keys, as these are the fields I was looking to use.
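In the meantime, those fields can be pulled straight out of the live-feed JSON. A sketch, assuming the old statsapi layout where each play carries an "about" block (the helper name is made up):

```python
# Hypothetical helper: pull the eventIdx, eventId, and dateTime fields that
# the hard-coded pbp_columns drop. Assumes the statsapi live-feed structure
# where each play has an "about" block.
def extract_event_metadata(live_feed_json):
    plays = live_feed_json["liveData"]["plays"]["allPlays"]
    return [
        {
            "eventIdx": p["about"]["eventIdx"],
            "eventId": p["about"]["eventId"],
            "dateTime": p["about"]["dateTime"],
        }
        for p in plays
    ]

sample = {"liveData": {"plays": {"allPlays": [
    {"about": {"eventIdx": 0, "eventId": 1, "dateTime": "2016-10-12T23:10:26Z"}}
]}}}
print(extract_event_metadata(sample))
```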
Up until now we've used the following query to generate games in an interval:
https://statsapi.web.nhl.com/api/v1/schedule?startDate=YYYY-MM-DD&endDate=YYYY-MM-DD
For some reason, it seems to currently be missing some games. An example is the following query, which is missing all games from 10-14 to 10-20.
Not sure what's up on the NHL side and whether this is just a glitch that'll go away in a few days.
Regardless we can move over to just specifying the season as a query parameter for the endpoint. For example:
http://statsapi.web.nhl.com/api/v1/schedule?season=20212022
This should actually make the code in the nhl.json_schedule module simpler. I also don't think it should break backwards compatibility, so it should be fine. I'll likely take care of this over the weekend.
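Building the season-style URL is straightforward. A quick sketch (the helper name is illustrative; the endpoint and parameter come from the example above):

```python
# Sketch: construct the season-based schedule URL rather than a date range.
def season_schedule_url(start_year):
    """E.g. 2021 -> .../schedule?season=20212022"""
    season = f"{start_year}{start_year + 1}"
    return f"http://statsapi.web.nhl.com/api/v1/schedule?season={season}"

print(season_schedule_url(2021))
# http://statsapi.web.nhl.com/api/v1/schedule?season=20212022
```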
I just pip-installed into a clean python 3.7 environment created with conda, and got the following error on import:
ModuleNotFoundError: No module named 'hockey_scraper.nwhl'
Integration tests are failing as the ESPN API link no longer returns game data:
Tested with a current 2023 gameId and found the link was broken for this as well, so ESPN must've changed the URL.
Not sure if this is related to the new updates to the NHL's data or just because it's pre-season, but I noticed that the scraper is not populating the coordinates field for most goals so far this pre-season. I've attached a screenshot of the issue. The dataframe I'm running this on is the pbp data from each game on 9/28. It's normal for the other events (Stop, Pend, etc.) to have null coordinates, but that many goals missing proper coordinates seems like an issue. Just wanted to raise awareness of this.
When I run scrape_games, the pbp dataframe returned has NaN for all x and y coordinates for all plays. I think the issue is in nhl.pbp.json_pbp.get_pbp, which seems to get an empty dict back from shared.scrape_page, even though calling scrape_page directly on its own seems to work.
I'm running python 3.7.3, hockey_scraper version 1.32.4
Sample code:
import hockey_scraper as hs
import json
# Returns a pandas df where xC, yC are null
hs.scrape_games([2016020001], data_format = 'Pandas')
# Returns an empty dict and error message: Json pbp for game [2016020001] is either not there or can't be obtained
hs.nhl.pbp.json_pbp.get_pbp([2016020001])
# Returns a play with x and y coordinates which are not null
json.loads(hs.shared.scrape_page('http://statsapi.web.nhl.com/api/v1/game/2016020001/feed/live'))['liveData']['plays']['allPlays'][3]
Is anyone else having issues with the endpoint in json_schedule? It seems to be returning no data for me when I try to use hockey_scraper.scrape_games or hockey_scraper.scrape_seasons.
The code is pretty simple.
import hockey_scraper
hockey_scraper.scrape_games([2023020506], True)
It throws this error:
Traceback (most recent call last):
File "/home/hugo/dev/hockey/src/hockey-scraper/scrape_game.py", line 3, in <module>
hockey_scraper.scrape_games([2023020506], True)
File "/home/hugo/dev/hockey/src/hockey_venv/lib/python3.10/site-packages/hockey_scraper/nhl/scrape_functions.py", line 237, in scrape_games
games_list = json_schedule.get_dates(games)
File "/home/hugo/dev/hockey/src/hockey_venv/lib/python3.10/site-packages/hockey_scraper/nhl/json_schedule.py", line 90, in get_dates
schedule = scrape_schedule(date_from, date_to, preseason=True, not_over=True)
File "/home/hugo/dev/hockey/src/hockey_venv/lib/python3.10/site-packages/hockey_scraper/nhl/json_schedule.py", line 114, in scrape_schedule
schedule_json = chunk_schedule_calls(date_from, date_to)
File "/home/hugo/dev/hockey/src/hockey_venv/lib/python3.10/site-packages/hockey_scraper/nhl/json_schedule.py", line 56, in chunk_schedule_calls
chunk_sched = get_schedule(f_chunk, t_chunk)
File "/home/hugo/dev/hockey/src/hockey_venv/lib/python3.10/site-packages/hockey_scraper/nhl/json_schedule.py", line 31, in get_schedule
return json.loads(shared.get_file(page_info, force=True))
File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
I was pulling down data (using both scrape_seasons and scrape_games) and I noticed that ~30% of shots either had no xC/yC data, or had it listed as one of the bullet points:
Looking into it a bit, it looks like the "eventId" values in the JSON are no longer guaranteed to be in order (see the snippet of JSON below). In json_pbp.py, I removed the sorted_events logic and got data in the "right" order:
Not sorting seems to mostly work. Still need to investigate cases where the HTML event length != JSON event length. Sorting by seconds_elapsed doesn't work well for stoppages followed by faceoffs at the same time point.
I might have time to try to find a more elegant fix (and maybe add a test that grabs a couple of plays from a game to confirm it's being parsed correctly). But I wanted to write this down / make a note of it in case anyone else is looking at it.
"plays": [
{
"eventId": 102,
"periodDescriptor": {
"number": 1,
"periodType": "REG"
},
"timeInPeriod": "00:00",
"timeRemaining": "20:00",
"situationCode": "1551",
"homeTeamDefendingSide": "left",
"typeCode": 520,
"typeDescKey": "period-start",
"sortOrder": 8
},
{
"eventId": 101,
"periodDescriptor": {
"number": 1,
"periodType": "REG"
},
"timeInPeriod": "00:00",
"timeRemaining": "20:00",
"situationCode": "1551",
"homeTeamDefendingSide": "left",
"typeCode": 502,
"typeDescKey": "faceoff",
"sortOrder": 9,
"details": {
"eventOwnerTeamId": 18,
"losingPlayerId": 8478519,
"winningPlayerId": 8475158,
"xCoord": 0,
"yCoord": 0,
"zoneCode": "N"
}
},
{
"eventId": 8,
"periodDescriptor": {
"number": 1,
"periodType": "REG"
},
"timeInPeriod": "00:35",
"timeRemaining": "19:25",
"situationCode": "1551",
"homeTeamDefendingSide": "left",
"typeCode": 516,
"typeDescKey": "stoppage",
"sortOrder": 15,
"details": {
"reason": "icing"
}
},
{
"eventId": 103,
"periodDescriptor": {
"number": 1,
"periodType": "REG"
},
"timeInPeriod": "00:35",
"timeRemaining": "19:25",
"situationCode": "1551",
"homeTeamDefendingSide": "left",
"typeCode": 502,
"typeDescKey": "faceoff",
"sortOrder": 17,
"details": {
"eventOwnerTeamId": 14,
"losingPlayerId": 8476925,
"winningPlayerId": 8478519,
"xCoord": -69,
"yCoord": 22,
"zoneCode": "D"
}
},
{
"eventId": 9,
"periodDescriptor": {
"number": 1,
"periodType": "REG"
},
"timeInPeriod": "00:48",
"timeRemaining": "19:12",
"situationCode": "1551",
"homeTeamDefendingSide": "left",
"typeCode": 503,
"typeDescKey": "hit",
"sortOrder": 20,
"details": {
"xCoord": 64,
"yCoord": 42,
"zoneCode": "D",
"eventOwnerTeamId": 18,
"hittingPlayerId": 8474568,
"hitteePlayerId": 8476453
}
},
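One alternative to dropping the sort entirely: in the payload above, eventId is no longer monotonic, but each play carries a "sortOrder" key that is. A sketch of ordering on that instead (just an idea, not the actual fix):

```python
# Sketch: sort plays by the "sortOrder" key the new API provides, rather than
# by eventId, which is no longer guaranteed to be in order.
def order_plays(plays):
    return sorted(plays, key=lambda p: p["sortOrder"])

# Trimmed-down events from the snippet above
plays = [
    {"eventId": 102, "sortOrder": 8},
    {"eventId": 101, "sortOrder": 9},
    {"eventId": 8,   "sortOrder": 15},
    {"eventId": 103, "sortOrder": 17},
]
print([p["eventId"] for p in order_plays(plays)])  # [102, 101, 8, 103]
```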
The following error is thrown for these 4 games:
Warning: Problem combining Html Json pbp for game
Games:
2019020054
2019020084
2019020126
2019020141
@HarryShomer Thanks for putting this together. Any idea why I'm getting the below import error on Python 2.7 and Windows 10?
Thanks!
Event distance for shots could be parsed out from the pbp description into a separate column with a regex match. This column would only be populated for SHOT, MISS, and GOAL events.
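A minimal sketch of what that regex match could look like, assuming the HTML pbp descriptions end in a distance like ", 12 ft." (best-effort; the helper name is made up):

```python
import re

# Sketch: parse the shot distance out of an HTML pbp event description.
# Assumes descriptions for SHOT/MISS/GOAL events contain "<n> ft."
DIST_RE = re.compile(r"(\d+)\s*ft\.?")

def event_distance(description):
    """Return the distance in feet, or None if no distance is present."""
    m = DIST_RE.search(description)
    return int(m.group(1)) if m else None

print(event_distance("TOR #34 MATTHEWS, Wrist, Off. Zone, 12 ft."))  # 12
print(event_distance("STOP - ICING"))  # None
```

Non-shot events would simply get None, leaving the column populated only for SHOT, MISS, and GOAL rows.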
Hi,
I get this 'period' error whenever I try to scrape a game, e.g. running:
hockey_scraper.scrape_games([2023020002], True)
Scraping Game 2023020002 2023-10-10
Error: Error parsing Json pbp for game 2023020002 'period'
Warning: Using espn for pbp
Error: Espn pbp for game 2023-10-10 PIT CHI is either not there or can't be obtained
Thanks,
Because of the delay due to COVID-19 and the return-to-play protocol, the scraper doesn't detect games after July 1st of the season.
I had one thought about this and stumbled upon what I believe to be a bug (both in the same code section).
Here is your original code in nhl.json_schedule, in the get_dates() function:
# If the last game is part of the ongoing season then only request the schedule until that day
# We get strange errors if we don't do it like this
if year_to == shared.get_season(datetime.strftime(datetime.today(), "%Y-%m-%d")):
date_to = '-'.join([str(datetime.today().year), str(datetime.today().month), str(datetime.today().day)])
else:
date_to = '-'.join([str(int(year_to) + 1), '7', '1']) # Newest game in sample
My first thought was to just change the second date to August 30th -
[...]
else:
date_to = '-'.join([str(int(year_to) + 1), '8', '30']) # Newest game in sample
But then I noticed your first comment says that if the game is in the current season we should only request the schedule until that date. But that branch can never be taken, since you are comparing a string (year_to) with an integer (shared.get_season(datetime.strftime(datetime.today(), "%Y-%m-%d"))).
Let me know if you'd like me to submit a PR for this or how you want to go about fixing it up. Maybe add special logic if the year detected is 2020?
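A sketch of what the combined fix might look like, casting both sides to int so the current-season check can actually be true (hypothetical helper, not the actual patch; assumes shared.get_season returns the starting year of the season as an int):

```python
from datetime import datetime

# Hypothetical rewrite of the date_to logic from get_dates():
# compare like types, and push the off-season cutoff to August 30th.
def resolve_date_to(year_to, current_season):
    """year_to arrives as a string; current_season as an int (e.g. 2019)."""
    if int(year_to) == current_season:
        # Ongoing season: only request the schedule up to today
        today = datetime.today()
        return f"{today.year}-{today.month}-{today.day}"
    # Finished season: later cutoff for the COVID-19 return-to-play games
    return f"{int(year_to) + 1}-8-30"

print(resolve_date_to("2018", 2019))  # 2019-8-30
```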
Working in a clean environment with only hockey_scraper and its dependencies installed. Just using the README commands. I'm at work, so if you want to hold off until I can duplicate this at home, feel free.
Python 3.7.2 (default, Feb 12 2019, 08:15:36)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import hockey_scraper
>>> hockey_scraper.nwhl.scrape_games([14694271, 14814946, 14689491], docs_dir='/Users/mcbarlowe/')
Scraping NWHL Game 14694271
Json pbp for game 14694271 is either not there or can't be obtained
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mbarlowe/code/python/pyenvs/hockey_scrape/lib/python3.7/site-packages/hockey_scraper/nwhl/scrape_functions.py", line 83, in scrape_games
pbp_df = scrape_list_of_games(games)
File "/Users/mbarlowe/code/python/pyenvs/hockey_scrape/lib/python3.7/site-packages/hockey_scraper/nwhl/scrape_functions.py", line 52, in scrape_list_of_games
if not pbp_df.empty:
AttributeError: 'NoneType' object has no attribute 'empty'
Error:
File "<string>", line unknown
SelectorSyntaxError: Malformed class selector at position 2
line 1:
td.+.bborder
^
Scraping null values generates errors.
hockey_scraper.scrape_date_range('2014-04-09', '2014-04-10', False)
generates a ValueError. This is because data from that day somehow has empty or null values.
I added a line to html_pbp.py before the astype conversion to eliminate these values.
Now it works:
game_df = game_df.replace(r'^\s*$', '0.0', regex=True)
game_df.Seconds_Elapsed = game_df.Seconds_Elapsed.astype(float)
I noticed the merge failed for the EDM/NJD game (id=2019020049) in combine_html_json_pbp() in game_scraper.py.
I checked, and the reason was that the html_df.Seconds_Elapsed column was represented as dtype object and not float.
I think adding this line after the try would fix the problem (note the assignment — astype returns a copy, so the result must be assigned back):
html_df.Seconds_Elapsed = html_df.Seconds_Elapsed.astype(float)
I would make a pull request, but I don't know if it would cause other errors to go un-checked or something like that.
Heyo- first off thanks for the package!
I'm having issues with a relatively simple function call. I'm running Python 3.6 on Linux (Ubuntu 18.04). Any help would be great; let me know if you need any more info.
import hockey_scraper
# Scrapes all games between 2016-10-10 and 2016-10-20 without shifts and stores the data in a Csv file
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False)
Scraping Game 2016020001 2016-10-12
Traceback (most recent call last):
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-24-43f5db1c1b18>", line 4, in <module>
hockey_scraper.scrape_date_range('2016-10-10', '2016-10-20', False)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/scrape_functions.py", line 179, in scrape_date_range
pbp_df, shifts_df = scrape_list_of_games(games, if_scrape_shifts)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/scrape_functions.py", line 128, in scrape_list_of_games
pbp_df, shifts_df = game_scraper.scrape_game(str(game["game_id"]), game["date"], if_scrape_shifts)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/game_scraper.py", line 351, in scrape_game
pbp_df = scrape_pbp(game_id, date, roster, game_json, players, teams)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/game_scraper.py", line 272, in scrape_pbp
html_df = html_pbp.scrape_game(game_id, players, teams)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/html_pbp.py", line 841, in scrape_game
return scrape_pbp(game_html, game_id, players, teams)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/html_pbp.py", line 802, in scrape_pbp
cleaned_html = clean_html_pbp(game_html)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/html_pbp.py", line 133, in clean_html_pbp
soup = get_soup(html)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/hockey_scraper/html_pbp.py", line 68, in get_soup
soup = soup.select('td.+.bborder')
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/bs4/element.py", line 1377, in select
return soupsieve.select(selector, self, namespaces, limit, **kwargs)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/soupsieve/__init__.py", line 108, in select
return compile(select, namespaces, flags).select(tag, limit)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/soupsieve/__init__.py", line 59, in compile
return cp._cached_css_compile(pattern, namespaces, flags)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/soupsieve/css_parser.py", line 178, in _cached_css_compile
CSSParser(pattern, flags).process_selectors(),
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/soupsieve/css_parser.py", line 810, in process_selectors
return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/soupsieve/css_parser.py", line 688, in parse_selectors
key, m = next(iselector)
File "/home/jmoore/tensorflow/lib/python3.6/site-packages/soupsieve/css_parser.py", line 797, in selector_iter
raise SyntaxError("Invlaid character {!r} at position {}".format(pattern[index], index))
File "<string>", line unknown
SyntaxError: Invlaid character '.' at position 2
Some people were discussing this topic on twitter today and I had this lying around so I figured I may as well push it. The idea here is to combine the pbp and shifts into one DataFrame for ease of use. As of right now this only lives on my dev branch.
Here is how it works:
import hockey_scraper as hs
data = hs.scrape_games([2019020001, 2019020003, 2019020003], True, data_format='pandas')
merged_data = hs.utils.merge(data['pbp'], data['shifts'])
I wrote this over a year ago and haven't really tested it too well, so use at your own risk. The code could likely be cleaned up as well. If you use it, let me know how it works. If you find any issues, please let me know (in this thread) or feel free to open a merge request.
scrape_games can only handle a list of game IDs; it fails when passed a single game ID, due to sorting.
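One possible workaround is normalizing the argument before sorting. A sketch (the helper is hypothetical, not part of hockey_scraper):

```python
# Hypothetical helper: accept either a single game id or a list of ids,
# and return a sorted list of ints so downstream sorting never sees a scalar.
def normalize_games(games):
    if isinstance(games, (int, str)):
        games = [games]
    return sorted(int(g) for g in games)

print(normalize_games(2019020001))                  # [2019020001]
print(normalize_games(["2019020003", 2019020001]))  # [2019020001, 2019020003]
```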