
vigorish's Introduction


vigorish

vigorish is a hybrid Python/Node.js application that scrapes MLB data from mlb.com, brooksbaseball.net and baseball-reference.com.

My goal is to capture as much data as possible — ranging from PitchFX measurements at the most granular level to play-by-play data (play descriptions, substitutions, manager challenges, etc) and individual player pitch/bat stats at the highest level.

Requirements

  • Python 3.6+
  • Node.js 10+ (Tested with Node.js 11-13)
  • Xvfb
  • AWS account (optional but recommended, used to store scraped data in S3)

Project Documentation

For a step-by-step install guide and instructions for configuring/using vigorish, please visit the link below:

Vigorish: Hybrid Python/Node.js Web Scraper

Credits

vigorish relies on the projects listed below, either directly or as a dev dependency. It would not have been possible for me to create vigorish without these projects; thanks to all of the creators/maintainers for making them available (projects are listed alphabetically):

vigorish's People

Contributors

a-luna, maxbachmann, sourcery-ai[bot]


Forkers

chrisdlees

vigorish's Issues

replace brooksbaseball.net data with mlb api data

brooksbaseball.net has changed the format of the pitch log pages and hidden the XHR call for PitchFX data, so all scrapers for bb data are now useless.

Fortunately, the MLB Statcast API appears to be quite generous, and the pitch data includes a lot of information that is not available on bb.net.

This issue will be closed when an ad hoc solution is in place to convert MLB API data to the JSON file formats already created for bb.net.
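
As a rough sketch of what that ad hoc conversion could look like, pitch-level data can be pulled from the public statsapi.mlb.com live feed and reshaped into dicts resembling the bb.net pitchfx logs. The endpoint below is real, but the response field names should be verified against an actual response, and the output keys are illustrative rather than the existing bb.net JSON schema.

```python
# Rough sketch (not the final implementation): pull per-pitch data from the
# MLB Stats API live feed and reshape it into pitchfx-like dicts.
# The endpoint exists, but verify the field names against a real response.
import requests

MLB_LIVE_FEED = "https://statsapi.mlb.com/api/v1.1/game/{game_pk}/feed/live"

def get_pitchfx_like_data(game_pk):
    response = requests.get(MLB_LIVE_FEED.format(game_pk=game_pk), timeout=10)
    response.raise_for_status()
    feed = response.json()
    pitches = []
    for play in feed["liveData"]["plays"]["allPlays"]:
        for event in play.get("playEvents", []):
            if not event.get("isPitch"):
                continue
            pitch_data = event.get("pitchData", {})
            pitches.append(
                {
                    # output keys are illustrative, not the bb.net schema
                    "pitcher_id": play["matchup"]["pitcher"]["id"],
                    "batter_id": play["matchup"]["batter"]["id"],
                    "pitch_type": event.get("details", {}).get("type", {}).get("code"),
                    "start_speed": pitch_data.get("startSpeed"),
                    "px": pitch_data.get("coordinates", {}).get("pX"),
                    "pz": pitch_data.get("coordinates", {}).get("pZ"),
                }
            )
    return pitches
```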

create batter percentiles

possible categories: barrel rate, avg exit velo, line-drive rate, soft contact rate, o-swing rate, k rate, bb rate, contact rate
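
A minimal sketch of the percentile calculation, assuming the stat values for all qualified batters have already been collected in a list; scipy.stats.percentileofscore does the ranking, and the sample numbers are made up.

```python
# Minimal sketch: rank one batter's stat against the league sample.
from scipy.stats import percentileofscore

def calculate_percentile(batter_value, all_values):
    """Return the percentile rank (0-100) of one batter's stat vs. the league."""
    return percentileofscore(all_values, batter_value, kind="rank")

# Example: a batter with a 91.2 mph average exit velocity (sample values are made up)
league_avg_exit_velo = [85.3, 88.1, 90.4, 91.2, 92.7, 86.9, 89.5]
print(calculate_percentile(91.2, league_avg_exit_velo))
```

For categories where a lower value is better (k rate, soft contact rate, o-swing rate), the result would need to be inverted (100 minus the percentile) so that a higher number always means a better outcome.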

new task: get player bio for all orphaned player_ids and persist across db backups

new admin task menu item that performs the process below (a rough sketch follows the list):

  • find all player_id rows where db_player_id == NULL
  • send request to mlb player info api endpoint
  • check response json for player debut date, if below threshold for db, continue
  • else, parse player bio and add to database
  • assign created player.id to player_id.db_player_id

  • also, create a new column on the player table, add_to_db_backup, and set it to TRUE for all players that are added by this task
  • update the db backup task to create a new csv file with these player rows and include it in the zip
  • update the db restore task to add all players from the csv file
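
A rough sketch of the task, assuming the db is accessed through SQLAlchemy models here called PlayerId and Player (passed in as arguments only to keep the sketch self-contained); the column names, debut-date threshold, and people endpoint response fields are illustrative and would need to be checked against the real schema and API.

```python
# Rough sketch of the new admin task; model/column names are illustrative.
import requests

MLB_PEOPLE_API = "https://statsapi.mlb.com/api/v1/people/{mlb_id}"
DEBUT_DATE_THRESHOLD = "2017-01-01"  # assumed: earliest season tracked in the db

def get_player_bio(mlb_id):
    """Fetch the bio for a single mlb_id from the MLB player info endpoint."""
    resp = requests.get(MLB_PEOPLE_API.format(mlb_id=mlb_id), timeout=10)
    resp.raise_for_status()
    return resp.json()["people"][0]

def add_orphaned_player_bios(db_session, PlayerId, Player):
    """Find orphaned player_id rows, fetch bios, and link newly created player rows."""
    orphaned = db_session.query(PlayerId).filter(PlayerId.db_player_id.is_(None)).all()
    for pid in orphaned:
        bio = get_player_bio(pid.mlb_id)
        if bio.get("mlbDebutDate", "") < DEBUT_DATE_THRESHOLD:
            continue  # debut date is below the db threshold, skip this player
        player = Player(
            mlb_id=bio["id"],
            name=bio["fullName"],
            debut=bio.get("mlbDebutDate"),
            add_to_db_backup=True,  # new column so backup/restore can round-trip this row
        )
        db_session.add(player)
        db_session.flush()            # populate player.id before assigning it
        pid.db_player_id = player.id
    db_session.commit()
```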

Unable to scrape brooks_games_for_date for oak/sea 2019 “regular season” games in Japan

The brooksbaseball.com daily dashboard page contains all spring training games as well as the annoying oak/sea regular season games that were foisted upon us a week before every other regular season game. This should not be a blocking issue. However, both dates (3/20 and 3/21) contain a game where the bb_game_id indicates that it is the second game between these two teams on that day, but the first game is not listed on the dashboard. I assume this is because the park/facility where the first game occurred did not have PitchFX equipment/data, since that is what the dashboard displays for each pitching appearance in each game.

Since our current logic assumes the first game must exist, an IndexError is raised when we use the bb_game_id as a dict key to retrieve the first game and update the bbref_game_id. This works because the bb_game_id is different for all three possible scenarios: a regular non-doubleheader game, the first game of a doubleheader, and the second game of a doubleheader.

My solution is to modify the scrape_brooks_games_for_date function to require an additional input: the bbref_games_for_date data for the same date. The bbref daily scoreboard page does not contain any spring training games, and the bbref_games_for_date data contains a list of boxscore URLs which can be reduced to a list of game ids. Each bbref_game_id contains the home team id, which is enough info to match against the games parsed from the brooksbaseball.com dashboard page.

When iterating over the games parsed from the brooksbaseball.com dashboard, we can determine whether the current game is included in the list of games parsed from the bbref scoreboard page. If it is not included, it is not a regular season game, so we can skip it, avoid parsing any spring training data, and thus avoid the IndexError.
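
A sketch of that check, assuming bbref game ids have the form home-team-id + date + game-number (e.g. SEA201903200) and that each game parsed from the brooks dashboard exposes a home_team_id_bbref attribute; both names are illustrative, not the actual vigorish data model.

```python
# Sketch of the proposed filter; attribute and id formats are illustrative.
def get_bbref_home_team_ids(boxscore_urls):
    # e.g. ".../boxes/SEA/SEA201903200.shtml" -> game id "SEA201903200" -> "SEA"
    game_ids = [url.split("/")[-1].replace(".shtml", "") for url in boxscore_urls]
    return {game_id[:3] for game_id in game_ids}

def filter_out_spring_training_games(brooks_dashboard_games, boxscore_urls):
    home_team_ids = get_bbref_home_team_ids(boxscore_urls)
    regular_season_games = []
    for game in brooks_dashboard_games:
        if game.home_team_id_bbref not in home_team_ids:
            continue  # not on the bbref scoreboard -> spring training game, skip it
        regular_season_games.append(game)
    return regular_season_games
```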

Improve process used to parse bat stats, pitch stats and play-by-play events.

Currently, the scrape_bbref_boxscores_for_date function uses a huge list of xpath queries to parse individual stats/items from the team batting, team pitching, and play-by-play tables on a bbref box score page. This results in tons of repeated code: each xpath query is applied to the corresponding table and the parsed data is stored.

The same improvement can be applied to all three scenarios: the various xpath queries can be replaced by a single template string that parses a specific stat/item based on the value provided to the template as a parameter. This will eliminate the duplication and provide a more maintainable and flexible way to modify the data that is being parsed.
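
A sketch of the template-string approach, assuming the bbref tables tag each cell with a data-stat attribute; the stat names and the exact xpath are illustrative and would need to match the real page markup.

```python
# Sketch: one xpath template applied per stat name instead of one query per stat.
STAT_XPATH_TEMPLATE = './/td[@data-stat="{stat_name}"]/text()'

BAT_STAT_NAMES = ["AB", "R", "H", "RBI", "BB", "SO", "PA"]  # illustrative list

def parse_stats(table_element, stat_names):
    """Apply the same xpath template for every stat and collect the results."""
    return {
        stat_name: table_element.xpath(STAT_XPATH_TEMPLATE.format(stat_name=stat_name))
        for stat_name in stat_names
    }

# Usage (assuming an lxml element for the batting table has already been located):
# bat_stats = parse_stats(batting_table, BAT_STAT_NAMES)
```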

Dependabot can't evaluate your Python dependency files

Dependabot can't evaluate your Python dependency files.

As a result, Dependabot couldn't check whether any of your dependencies are out-of-date.

The error Dependabot encountered was:

InstallationError("Invalid requirement: 'black==21.5b1)' (from line 9 of /home/dependabot/dependabot-updater/dependabot_tmp_dir/requirements-dev.txt)")
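
The stray closing parenthesis in black==21.5b1) appears to be the cause; removing it from line 9 of requirements-dev.txt should let Dependabot parse the file again.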

View the update logs.

create util functions for brooks_pitchfx

  • app.main.util.json_decoders
    • decode_brooks_pitchfx_log
  • app.main.util.file_util
    • write_brooks_pitchfx_log_to_file
    • read_brooks_pitchfx_log_from_file
  • app.main.util.file_util.s3_helper
    • upload_brooks_pitchfx_log
    • get_brooks_pitchfx_log_from_s3
    • get_all_brooks_pitchfx_log_ids_scraped
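
A minimal sketch of two of the file_util helpers listed above, assuming the pitchfx log object exposes a pitch_app_id attribute and an as_dict() method; those names are illustrative, not the actual vigorish API.

```python
# Minimal sketch of write/read helpers; attribute names are illustrative.
import json
from pathlib import Path

def write_brooks_pitchfx_log_to_file(pitchfx_log, folder_path="."):
    file_path = Path(folder_path) / f"{pitchfx_log.pitch_app_id}.json"
    file_path.write_text(json.dumps(pitchfx_log.as_dict(), indent=2))
    return file_path

def read_brooks_pitchfx_log_from_file(file_path):
    json_dict = json.loads(Path(file_path).read_text())
    # decode_brooks_pitchfx_log is the json_decoders helper named in the list above
    return decode_brooks_pitchfx_log(json_dict)
```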

config settings file

need to centralize all config settings in a single file (a rough sketch follows the list below):

  • AWS access key, secret key
  • S3 bucket name
  • local folder to use to store downloads from S3 (default is app root)
  • DB URL for dev/prod and test
  • path to chrome and chromedriver binaries
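
One possible shape for this, sketched below: read every setting from an environment variable with a fallback default. The variable names and defaults are illustrative, not a final config format.

```python
# Sketch of a centralized settings module; env var names and defaults are illustrative.
import os

class Config:
    """All vigorish settings in one place, read from env vars with fallbacks."""
    AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID", "")
    AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY", "")
    S3_BUCKET = os.getenv("S3_BUCKET", "")
    LOCAL_DOWNLOAD_FOLDER = os.getenv("LOCAL_DOWNLOAD_FOLDER", ".")  # default: app root
    DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///vig_dev.db")
    TEST_DATABASE_URL = os.getenv("TEST_DATABASE_URL", "sqlite:///vig_test.db")
    CHROME_PATH = os.getenv("CHROME_PATH", "/usr/bin/google-chrome")
    CHROMEDRIVER_PATH = os.getenv("CHROMEDRIVER_PATH", "/usr/local/bin/chromedriver")
```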
