
vigorish's Introduction


vigorish

vigorish is a hybrid Python/Node.js application that scrapes MLB data from mlb.com, brooksbaseball.net and baseball-reference.com.

My goal is to capture as much data as possible — ranging from PitchFX measurements at the most granular level to play-by-play data (play descriptions, substitutions, manager challenges, etc) and individual player pitch/bat stats at the highest level.

Requirements

  • Python 3.6+
  • Node.js 10+ (Tested with Node.js 11-13)
  • Xvfb
  • AWS account (optional but recommended, used to store scraped data in S3)

Project Documentation

For a step-by-step install guide and instructions for configuring/using vigorish, please visit the link below:

Vigorish: Hybrid Python/Node.js Web Scraper

Credits

vigorish relies on the projects listed below, either directly or as a dev dependency. It would not have been possible for me to create vigorish without these projects; thanks to all of the creators/maintainers for making them available (projects are listed alphabetically):

vigorish's People

Contributors

a-luna, maxbachmann, sourcery-ai[bot]


Forkers

chrisdlees

vigorish's Issues

replace brooksbaseball.net data with mlb api data

brooksbaseball.net has changed the format of the pitch log pages and hidden the XHR call for PitchFX data, so all scrapers for bb data are now useless.

Fortunately, the MLB Statcast API appears to be quite generous, and the pitch data includes a lot of information that is not available on bb.net.

This issue will be closed when an ad hoc solution is in place to convert MLB API data to the JSON file formats already created for bb.net.
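
As a rough sketch of what that ad hoc conversion could look like, pitch-level data can be pulled from the public statsapi.mlb.com live feed and reshaped into dicts resembling the bb.net pitchfx logs. The endpoint below is real, but the response field names should be verified against an actual response, and the output keys are illustrative rather than the existing bb.net JSON schema.

```python
# Rough sketch (not the final implementation): pull per-pitch data from the
# MLB Stats API live feed and reshape it into pitchfx-like dicts.
# The endpoint exists, but verify the field names against a real response.
import requests

MLB_LIVE_FEED = "https://statsapi.mlb.com/api/v1.1/game/{game_pk}/feed/live"

def get_pitchfx_like_data(game_pk):
    response = requests.get(MLB_LIVE_FEED.format(game_pk=game_pk), timeout=10)
    response.raise_for_status()
    feed = response.json()
    pitches = []
    for play in feed["liveData"]["plays"]["allPlays"]:
        for event in play.get("playEvents", []):
            if not event.get("isPitch"):
                continue
            pitch_data = event.get("pitchData", {})
            pitches.append(
                {
                    # output keys are illustrative, not the bb.net schema
                    "pitcher_id": play["matchup"]["pitcher"]["id"],
                    "batter_id": play["matchup"]["batter"]["id"],
                    "pitch_type": event.get("details", {}).get("type", {}).get("code"),
                    "start_speed": pitch_data.get("startSpeed"),
                    "px": pitch_data.get("coordinates", {}).get("pX"),
                    "pz": pitch_data.get("coordinates", {}).get("pZ"),
                }
            )
    return pitches
```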

create batter percentiles

possible categories: barrel rate, avg exit velo, line-drive rate, soft contact rate, o-swing rate, k rate, bb rate, contact rate
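
A minimal sketch of the percentile calculation, assuming the stat values for all qualified batters have already been collected in a list; scipy.stats.percentileofscore does the ranking, and the sample numbers are made up.

```python
# Minimal sketch: rank one batter's stat against the league sample.
from scipy.stats import percentileofscore

def calculate_percentile(batter_value, all_values):
    """Return the percentile rank (0-100) of one batter's stat vs. the league."""
    return percentileofscore(all_values, batter_value, kind="rank")

# Example: a batter with a 91.2 mph average exit velocity (sample values are made up)
league_avg_exit_velo = [85.3, 88.1, 90.4, 91.2, 92.7, 86.9, 89.5]
print(calculate_percentile(91.2, league_avg_exit_velo))
```

For categories where a lower value is better (k rate, soft contact rate, o-swing rate), the result would need to be inverted (100 minus the percentile) so that a higher number always means a better outcome.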

new task: get player bio for all orphaned player_ids and persist across db backups

new admin task menu item that performs the process below (a rough sketch follows the list):

  • find all player_id rows where db_player_id == NULL
  • send request to mlb player info api endpoint
  • check response json for player debut date, if below threshold for db, continue
  • else, parse player bio and add to database
  • assign created player.id to player_id.db_player_id

  • also, create a new column on the player table, add_to_db_backup, and set it to TRUE for all players that are added by this task
  • update the db backup task to create a new csv file with these player rows and include it in the zip
  • update the db restore task to add all players from the csv file
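
A rough sketch of the task, assuming the db is accessed through SQLAlchemy models here called PlayerId and Player (passed in as arguments only to keep the sketch self-contained); the column names, debut-date threshold, and people endpoint response fields are illustrative and would need to be checked against the real schema and API.

```python
# Rough sketch of the new admin task; model/column names are illustrative.
import requests

MLB_PEOPLE_API = "https://statsapi.mlb.com/api/v1/people/{mlb_id}"
DEBUT_DATE_THRESHOLD = "2017-01-01"  # assumed: earliest season tracked in the db

def get_player_bio(mlb_id):
    """Fetch the bio for a single mlb_id from the MLB player info endpoint."""
    resp = requests.get(MLB_PEOPLE_API.format(mlb_id=mlb_id), timeout=10)
    resp.raise_for_status()
    return resp.json()["people"][0]

def add_orphaned_player_bios(db_session, PlayerId, Player):
    """Find orphaned player_id rows, fetch bios, and link newly created player rows."""
    orphaned = db_session.query(PlayerId).filter(PlayerId.db_player_id.is_(None)).all()
    for pid in orphaned:
        bio = get_player_bio(pid.mlb_id)
        if bio.get("mlbDebutDate", "") < DEBUT_DATE_THRESHOLD:
            continue  # debut date is below the db threshold, skip this player
        player = Player(
            mlb_id=bio["id"],
            name=bio["fullName"],
            debut=bio.get("mlbDebutDate"),
            add_to_db_backup=True,  # new column so backup/restore can round-trip this row
        )
        db_session.add(player)
        db_session.flush()            # populate player.id before assigning it
        pid.db_player_id = player.id
    db_session.commit()
```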

Unable to scrape brooks_games_for_date for oak/sea 2019 “regular season” games in Japan

The brooksbaseball.com daily dashboard page contains all spring training games as well as the annoying oak/sea regular season games that were foisted upon us a week before every other regular season game. This should not be a blocking issue. However, both dates (3/20 and 3/21) contain a game where the bb_game_id indicates that it is the second game between these two teams on that day, but the first game is not listed on the dashboard. I assume this is because the park/facility where the first game occurred did not have PitchFX equipment/data, since that is what the dashboard displays for each pitching appearance in each game.

Since our current logic assumes the first game must exist, an IndexError is raised when we use the bb_game_id as a dict key to retrieve the first game and update the bbref_game_id. This works because the bb_game_id is different for all three possible scenarios: a regular non-doubleheader game, the first game of a doubleheader, and the second game of a doubleheader.

My solution is to modify the scrape_brooks_games_for_date function to require an additional input: the bbref_games_for_date data for the same date. The bbref daily scoreboard page does not contain any spring training games, and the bbref_games_for_date data contains a list of boxscore URLs which can be reduced to a list of game ids. Each bbref_game_id contains the home team id, which is enough info to match against the games parsed from the brooksbaseball.com dashboard page.

When iterating over the games parsed from the brooksbaseball.com dashboard, we can determine whether the current game is included in the list of games parsed from the bbref scoreboard page. If it is not included, it is not a regular season game, so we can skip it, avoid parsing any spring training data, and thus avoid the IndexError.
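
A sketch of that check, assuming bbref game ids have the form home-team-id + date + game-number (e.g. SEA201903200) and that each game parsed from the brooks dashboard exposes a home_team_id_bbref attribute; both names are illustrative, not the actual vigorish data model.

```python
# Sketch of the proposed filter; attribute and id formats are illustrative.
def get_bbref_home_team_ids(boxscore_urls):
    # e.g. ".../boxes/SEA/SEA201903200.shtml" -> game id "SEA201903200" -> "SEA"
    game_ids = [url.split("/")[-1].replace(".shtml", "") for url in boxscore_urls]
    return {game_id[:3] for game_id in game_ids}

def filter_out_spring_training_games(brooks_dashboard_games, boxscore_urls):
    home_team_ids = get_bbref_home_team_ids(boxscore_urls)
    regular_season_games = []
    for game in brooks_dashboard_games:
        if game.home_team_id_bbref not in home_team_ids:
            continue  # not on the bbref scoreboard -> spring training game, skip it
        regular_season_games.append(game)
    return regular_season_games
```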

Improve process used to parse bat stats, pitch stats and play-by-play events.

Currently, the scrape_bbref_boxscores_for_date function uses a huge list of xpath queries to parse individual stats/items from the team batting, team pitching, and play-by-play tables on a bbref box score page. This results in tons of repeated code: each xpath query is applied to the corresponding table and the parsed data is stored.

The same improvement can be applied to all three scenarios: the various xpath queries can be replaced by a single template string that parses a specific stat/item based on the value provided to the template as a parameter. This will eliminate the duplication and provide a more maintainable and flexible way to modify the data that is being parsed.
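
A sketch of the template-string approach, assuming the bbref tables tag each cell with a data-stat attribute; the stat names and the exact xpath are illustrative and would need to match the real page markup.

```python
# Sketch: one xpath template applied per stat name instead of one query per stat.
STAT_XPATH_TEMPLATE = './/td[@data-stat="{stat_name}"]/text()'

BAT_STAT_NAMES = ["AB", "R", "H", "RBI", "BB", "SO", "PA"]  # illustrative list

def parse_stats(table_element, stat_names):
    """Apply the same xpath template for every stat and collect the results."""
    return {
        stat_name: table_element.xpath(STAT_XPATH_TEMPLATE.format(stat_name=stat_name))
        for stat_name in stat_names
    }

# Usage (assuming an lxml element for the batting table has already been located):
# bat_stats = parse_stats(batting_table, BAT_STAT_NAMES)
```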

Dependabot can't evaluate your Python dependency files

Dependabot can't evaluate your Python dependency files.

As a result, Dependabot couldn't check whether any of your dependencies are out-of-date.

The error Dependabot encountered was:

InstallationError("Invalid requirement: 'black==21.5b1)' (from line 9 of /home/dependabot/dependabot-updater/dependabot_tmp_dir/requirements-dev.txt)")
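
The stray closing parenthesis in black==21.5b1) appears to be the cause; removing it from line 9 of requirements-dev.txt should let Dependabot parse the file again.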

View the update logs.

create util functions for brooks_pitchfx

  • app.main.util.json_decoders
    • decode_brooks_pitchfx_log
  • app.main.util.file_util
    • write_brooks_pitchfx_log_to_file
    • read_brooks_pitchfx_log_from_file
  • app.main.util.file_util.s3_helper
    • upload_brooks_pitchfx_log
    • get_brooks_pitchfx_log_from_s3
    • get_all_brooks_pitchfx_log_ids_scraped
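
A minimal sketch of two of the file_util helpers listed above, assuming the pitchfx log object exposes a pitch_app_id attribute and an as_dict() method; those names are illustrative, not the actual vigorish API.

```python
# Minimal sketch of write/read helpers; attribute names are illustrative.
import json
from pathlib import Path

def write_brooks_pitchfx_log_to_file(pitchfx_log, folder_path="."):
    file_path = Path(folder_path) / f"{pitchfx_log.pitch_app_id}.json"
    file_path.write_text(json.dumps(pitchfx_log.as_dict(), indent=2))
    return file_path

def read_brooks_pitchfx_log_from_file(file_path):
    json_dict = json.loads(Path(file_path).read_text())
    # decode_brooks_pitchfx_log is the json_decoders helper named in the list above
    return decode_brooks_pitchfx_log(json_dict)
```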

config settings file

need to centralize all config settings in a single file (a rough sketch follows the list below):

  • AWS access key, secret key
  • S3 bucket name
  • local folder to use to store downloads from S3 (default is app root)
  • DB URL for dev/prod and test
  • path to chrome and chromedriver binaries
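
One possible shape for this, sketched below: read every setting from an environment variable with a fallback default. The variable names and defaults are illustrative, not a final config format.

```python
# Sketch of a centralized settings module; env var names and defaults are illustrative.
import os

class Config:
    """All vigorish settings in one place, read from env vars with fallbacks."""
    AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID", "")
    AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY", "")
    S3_BUCKET = os.getenv("S3_BUCKET", "")
    LOCAL_DOWNLOAD_FOLDER = os.getenv("LOCAL_DOWNLOAD_FOLDER", ".")  # default: app root
    DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///vig_dev.db")
    TEST_DATABASE_URL = os.getenv("TEST_DATABASE_URL", "sqlite:///vig_test.db")
    CHROME_PATH = os.getenv("CHROME_PATH", "/usr/bin/google-chrome")
    CHROMEDRIVER_PATH = os.getenv("CHROMEDRIVER_PATH", "/usr/local/bin/chromedriver")
```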
