anthon11786 / horse-race-betting

Jupyter Notebook 100.00%

horse-race-betting's Issues

Project Proposal

Summary: The goal of this project is to create an algorithm to maximize an individual's expected profit. It is going to use data that describes the features of each race and data about each horse specifically. By using this information, this project hopes to help the public make smarter bets.

Three things I like: I like that there is a good purpose behind this project, which you established by citing the increase in betting since the 2018 Supreme Court decision. I also like that you're using more than one dataset and taking into account many of the different things that can affect a race. Finally, I like the potential of this project: you state that the model could be used for other sports, which makes it very intriguing.

Three things that could improve: You could list each horse's racing record, as that would be a good indicator. Also, you could use the strategy of placing multiple bets on different horses to decrease the expected loss. Finally, specify what type of betting system this project is going to focus on. There are three main betting systems used in horse racing.

Project Proposal Peer Review

Summary
The project is about how to make better decisions on horse race betting. The project team will be using historical data from horse races held in Hong Kong. The team is looking to produce a payoff function to predict the expected payoff/dividends.

Things I liked

The problem is well established with a clearly defined outcome variable: the dividends. This is an objective rather than a subjective variable and can be explained relatively well by the given predictors.
The dataset includes information related to the race and the physiques and performance of each individual horse. This opens up opportunities to separate and identify the importance of different predictors.
The team mentioned USSC lifting the ban on sports betting in 2018. I love that this project is tackling a relatively new problem that may not have been very well studied in previous years.

Ideas and areas of improvement

The team did not provide the source and size of the dataset. I would love to know how many rows there are in the two csv files and the overall messiness of the data.
If the team has access to horse race data from regions other than Hong Kong, the model could become more robust and useful.
Some of the predictors you listed may not be very helpful, such as Jockey ID, which would create a huge number of columns under, for example, one-hot encoding. If possible, I would suggest using columns that describe the jockey better, such as jockey weight, height, skill level, and W-L record, and then leaving out the ID columns.
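
To make the Jockey ID point concrete, here is a minimal sketch of the alternative: collapsing the ID into one numeric feature instead of one column per jockey. The records, jockey IDs, and function names are hypothetical and are not taken from the project's dataset.

```python
from collections import defaultdict

# Hypothetical run records as (jockey_id, won) pairs; illustrative only.
runs = [
    ("J1", 1), ("J1", 0), ("J1", 0),
    ("J2", 0), ("J2", 1),
    ("J3", 0),
]

def one_hot_width(records):
    """Number of columns a one-hot encoding of jockey ID would create."""
    return len({jockey for jockey, _ in records})

def jockey_win_rates(records):
    """Collapse jockey ID into a single numeric feature: historical win rate."""
    starts = defaultdict(int)
    wins = defaultdict(int)
    for jockey, won in records:
        starts[jockey] += 1
        wins[jockey] += won
    return {jockey: wins[jockey] / starts[jockey] for jockey in starts}

rates = jockey_win_rates(runs)
print(one_hot_width(runs))  # one column per distinct jockey
print(rates)                # one number per jockey
```

In a real pipeline the win rate would need to be computed only from races that took place before the race being predicted, so that the outcome does not leak into the feature.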

midterm report review dmt228

This project uses racehorse data to predict the outcomes of the races the horses participated in. The dataset was created by joining two separate datasets, one about the races and one about the horses themselves. The authors use several features to try to predict whether or not a horse reaches 1st place in a race, and visualize said features. The authors then go into detail about the process of cleaning the dataset and the different types of features used. For the preliminary models, they used a least squares regression to predict the binary value of winning. For next steps, the authors hope to use other models to reduce model error and create a separate model to predict the winning dividends of a race.
In terms of strengths in the paper, the authors gave details on the process of cleaning and analyzing multiple messy datasets, provided helpful data visualization and clear explanations for decisions made. Each step, from multiple messy CSV files to a single dataset, was detailed in the process of cleaning it. This included listing important features, removing unimportant ones and merging datasets. This helps readers understand the process well. In addition, helpful graphs were provided including a histogram and a graph of distribution between winning and losing horses, showing an in-depth analysis of the different values of certain variables. There were clear explanations of logic for each of the decisions made in the data cleaning process.
For what can be improved in the paper: the results need other metrics, all numerical results should be represented with graphs, and there should be more details on the final dataset. The results discuss the mean squared error; however, if the goal is to accurately predict the winner of a horse race before the race even starts, the most important metric should be accuracy (which is not listed at this time). Thus, including metrics such as accuracy would certainly strengthen the results. In addition, more graphs should be included in the results, perhaps ones that detail the mean squared error over training time, or a histogram with different training and testing metrics. On a different note, while there was quite a bit of detail on the process of data cleaning, the information about the resulting dataset is unclear. Nowhere does it mention how many data points there are, and while it mentions that 3 features of the 23 columns were kept, it does not mention which ones.
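
As an illustration of the accuracy metric suggested above, the sketch below scores a model by how often its highest-rated horse in each race actually won, alongside the mean squared error. The rows, race IDs, and scores are made up for the example; this is one possible definition of accuracy, not necessarily the one the authors would choose.

```python
from collections import defaultdict

# Hypothetical model outputs: (race_id, horse_id, predicted_score, won).
predictions = [
    ("R1", "A", 0.40, 1), ("R1", "B", 0.35, 0), ("R1", "C", 0.10, 0),
    ("R2", "A", 0.20, 0), ("R2", "D", 0.30, 0), ("R2", "E", 0.25, 1),
]

def mean_squared_error(rows):
    """Average squared gap between the predicted score and the 0/1 outcome."""
    return sum((score - won) ** 2 for _, _, score, won in rows) / len(rows)

def top_pick_accuracy(rows):
    """Fraction of races in which the highest-scored horse actually won."""
    by_race = defaultdict(list)
    for race, _, score, won in rows:
        by_race[race].append((score, won))
    hits = 0
    for entries in by_race.values():
        _, won = max(entries, key=lambda entry: entry[0])
        hits += won
    return hits / len(by_race)

print(mean_squared_error(predictions))
print(top_pick_accuracy(predictions))
```

Here the top pick wins race R1 but not race R2, so the per-race accuracy is 0.5 even though the MSE looks small, which is the gap between the two metrics that the comment above is pointing at.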

Final Review :)

This project aims to predict optimal horse betting to maximize profit. The team utilizes a dataset from the Hong Kong Jockey Club, with over 6000 races from 1997 to 2005. The dataset includes 14 horses and 37 features per horse. In order to maximize payoffs of a horse, they create two different models, one for the probability that a horse wins and one to predict expected payoff. They detail the cleaning of their data and feature extraction as well as error analysis between multiple different models, and conclude that random forest regression creates the best predictions for both models.

I think your project provided great background information for people who don’t understand the rules and regulations of horse betting. In addition, there was detailed consideration put into deciding what values to model and what types of models to make, i.e., creating two models for win probability and expected payoff as well as comparing errors across four different models. Furthermore, the exploratory data analysis was very well explained and the team was very thoughtful in choosing encodings and dealing with unnecessary/outlier data.

Some suggestions moving forward: the dataset is over a decade old, so it would be valuable to try to find a more recent dataset to see whether newer features and statistics affect your predictions. It would also be interesting to think about further applications of these models in other fields of betting. In addition, a lot of features were removed that could have been helpful to your predictions, for example the columns for 2nd, 3rd, and 4th place winners. These columns provide a lot of information about horses that were close to winning the race as well. Lastly, the error analysis includes a lot of graphs, and I think it would help clarify your findings to explain the graphs specifically and describe how you plotted them.

Overall, great job!

Midterm Peer Review (SRL84)

3 Things I liked

I thought the report was really well written and supported the decisions you made very well.
The results were pretty significant. I think especially because it's about betting, being able to produce something with that low an error is pretty impressive.
The next steps seem like a promising turn. I think the additional look at dividends will also help create a well-rounded project that takes into account all aspects of suggesting a bet.

3 Things I think can be improved

I think I would've liked a little bit more explanation about why you chose a 0/1 loss. There's a lot within the betting world about being able to determine place not just overall winner. Expanding the view on loss could be helpful.
There wasn't a lot of explanation about overfitting beyond saying that you don't think it's going to happen. Maybe just include what you would do in the case of overfitting and underfitting.
Finally, I think you could've explained a bit more about what you took away from the first part in order to transition into what you used for the midterm report (i.e., how the findings from the data analysis led you to believe a least squares model would be effective).
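
For the point about determining place and not just the overall winner, one small sketch of what a broader target could look like is below. The finishing positions and the choice of three paid places are assumptions for illustration; the number of paid places in practice depends on the bet type and the field.

```python
# Hypothetical finishing positions for five runners in one race.
finish_positions = [1, 4, 2, 7, 3]

def win_label(position):
    """Binary target for a win-only model: 1 if the horse finished first."""
    return int(position == 1)

def place_label(position, paid_places=3):
    """Broader target: 1 if the horse finished within the paid places."""
    return int(position <= paid_places)

win_targets = [win_label(p) for p in finish_positions]
place_targets = [place_label(p) for p in finish_positions]
print(win_targets)    # [1, 0, 0, 0, 0]
print(place_targets)  # [1, 0, 1, 0, 1]
```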

good work!

Midterm Review (jka45)

Aspects I like:

Overall presentation and flow of midterm was easy to follow
Excited to see new model to predict winning dividends of a race
Report describes some of the features in great detail
Great how you showed that your prediction about horses with lighter loads is not true

Possible Improvements:

An explanation of more of the 73 features used would be nice and would involve the reader more in the analysis
How will the report/group go about the validation step to test data?

Final Report Review

This team is using data from the Hong Kong Jockey Club to attempt to predict which bets on horse races would produce the largest expected return. To accomplish this, the team fit a few different models on the data after cleaning and feature transformation. To predict the win dividend, the team fit a few models including least squares, ridge regression, lasso with l2 regularizer, and random forest models. Then the team regressed win probability using least squares, ridge regression, logistic regression, and random forest models. Models were all evaluated using MSE, MAE, and Brier scores, and then the team simulated betting on the 100 highest return bets from a random sample of the testing data, with promising results.
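
The evaluation and simulation described above can be sketched roughly as follows. This is not the team's code: the rows are invented, and the dividend is assumed to be the total return per unit staked (so a losing bet returns nothing and a winning bet returns the dividend).

```python
# Hypothetical test-set rows: (win_probability, dividend_per_unit_stake, won).
bets = [
    (0.50, 2.5, 1),
    (0.10, 8.0, 0),
    (0.30, 4.0, 1),
    (0.20, 3.0, 0),
]

def brier_score(rows):
    """Mean squared gap between predicted win probability and the 0/1 outcome."""
    return sum((p - won) ** 2 for p, _, won in rows) / len(rows)

def expected_profit(p, dividend, stake=1.0):
    """Expected profit of one bet: win with probability p, lose the stake otherwise."""
    return stake * (p * dividend - 1.0)

def simulate_top_k(rows, k, stake=1.0):
    """Place one stake on each of the k bets with the highest expected profit
    and return the realized total profit."""
    ranked = sorted(rows, key=lambda r: expected_profit(r[0], r[1]), reverse=True)
    profit = 0.0
    for _, dividend, won in ranked[:k]:
        profit += stake * dividend * won - stake
    return profit

print(brier_score(bets))
print(simulate_top_k(bets, k=2))
```

For binary outcomes the Brier score is the mean squared error of the predicted probabilities, so reporting both mainly matters when the MSE is computed on something other than probabilities, such as the dividend regression.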

This project was supremely interesting, and I really enjoyed reading the whole report, which was well-organized, clear, and interesting. The tables were easy to read and informative, clearly pointing the reader to the conclusion the report draws in its explanation. My favorite part of the report was the simulation at the end, with the random sample of the testing data. The results were clear and fascinating, contributing to a larger movement in betting and other games like poker towards a more data-driven approach to winning.

I wish the figures throughout the report had been a little larger, as their size made it difficult to read the labels or discern what they were trying to show (other than Figure 1, which was the right size). My only other constructive comment concerns a few grammatical and typing errors, which would have been caught with a simple read-through. Overall, though, I really enjoyed this report and found it incredibly interesting while being simple enough to understand without more knowledge or Googling. Good job!

Final Project Review [ky276]

The objective of this project is to determine whether the team could find a positive expected value and generate a profit by predicting whether a given horse wins its race. The team used a wide range of regression models such as least squares, ridge, lasso, and random forest. The models were evaluated using MSE, MAE, and Brier scores.

Things I Liked

I thought it was a good idea to preserve sparsity during the data processing phase. Even with minimal processing, the team already found an interesting insight: jockeys and horse trainers stay in the industry for a significant time frame.
I liked how the team discussed why least squares served as a baseline default to compare future models with. This is a good idea and your reasoning for doing so was communicated well. Nice job choosing a diverse set of regression models and regularizers.
Really nice job contextualizing the project by explaining what horse racing is, how it works, and all the important terms related to it.

Things That Could be Improved

There were a couple small typos towards the end of the paper, but it was still very well written for the most part.
I think it would have been nice if the graphs were distributed throughout the paper a little more (i.e., the discussion of which models you used and why could have been a nice place to incorporate some of the graphs from the end of the paper).

Overall, I really enjoyed reading this paper! It was well structured and easy to understand for those with no prior knowledge on horse racing. Great work.

Final Peer Review

The HorseRaceBetting project aims to build a model that accurately predicts the horse bets that maximize profit. They used a dataset with over 6000 races from 1997 to 2005; the dataset included 14 horses and 37 features per horse. The group creates two models - one for predicting which horse would win and one for predicting the payoff for a bet.

There were many things that were well done and well explained in this report. To begin with, I really liked how they used so many data visualizations to showcase how different models (varying regularizers and loss functions) led to differing predictions. I really liked how they used varying models - they used OLS, Ridge, Logistic, and Random Forest models and evaluated each one in depth. Lastly, I really liked how they used two different metrics - MAE and MSE - to judge their model’s performance, which demonstrated a lot of thoughtfulness.

However, there were some areas for improvement. I feel that they could have kept some of the columns that they removed from the original dataset. This would have provided the team with more features to work with. Secondly, the group could have tried to find more relevant data. The data was from 1997 to 2005, and I feel that because the data is old, the resulting model may not be very useful or applicable in today’s world, due to inflation and other factors. Lastly, it would have been helpful to describe the applicability of the model in more depth.

Final Report Review

This report tries to apply a logical analysis to horse betting in Hong Kong, in the hope of building a predictive model that generates the maximum positive payoff from horse betting. To achieve this, they first removed irrelevant fields and then performed data encoding. The report tried to generate maximum payoff by estimating the win dividend and the winning probability separately.

Pros:

The report's topic is very interesting, and it generates an actual positive payoff stream with its algorithm, which is rarely seen in betting with the house.
The report treats the topic comprehensively: it adds a balanced error to reduce the impact of unbalanced data, which would otherwise generate biased results.
The presentation of the report is intuitive, and the graphs are clear and easy to understand.

Potential Improvement:

As noted in your conclusion, betting more capital could yield more promising results. Do you think this is due to the model's predictive power, or due to the law of large numbers?
As for the features in your input, I am wondering whether they are all known before a race, or whether some are only known during a race. In addition, I am wondering whether, when calculating the payoff, you accounted for the odds changing before the race as people place their bets, i.e., whether the strategy can actually be implemented in real life.
As one area for improvement, in my mind this problem could be more of a classification problem (buy or not buy as a binary output) than a regression problem. Although the expected payoff is calculated as probability*x, in real-life betting the decision is only buy or not buy; to the best of my knowledge there is no such thing as a 40% buy.

Peer Review

Peer Review_Horse Race Betting.docx

Midterm Peer Review

Things I like:

I think the report is generally very well written and easy to follow.
I like that the report describes how features are selected in great detail and visualized the empirical distributions for some data.
I like how you transformed different types of data into learning-ready forms and the way you interpret & present the OLS results (i.e., the prediction value - win percentage plot).

Areas for improvement:

The report did not mention how large the dataset is. I'd be curious about whether your linear regression model has enough data to avoid collinearity issues.
The report also did not mention what the validation/testing data would be.
I think the report may not have provided enough information about how to avoid overfitting - according to the plot, if you can already achieve 100% accuracy using the OLS model on the entire dataset, then what is the point of trying other more complex models?
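
To illustrate the validation/testing and overfitting points above, here is a minimal hold-out sketch: fit on one part of the data, then compare the error on the rows the model never saw. The data is synthetic and the single-feature least squares fit is only a stand-in for the report's OLS model.

```python
import random

def fit_line(xs, ys):
    """Closed-form least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given rows."""
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data for illustration only: a noisy linear signal.
rng = random.Random(0)
xs = [rng.uniform(0, 1) for _ in range(200)]
ys = [0.1 + 0.3 * x + rng.gauss(0, 0.05) for x in xs]

# Hold out the last 25% of rows; the model never sees them during fitting.
split = int(0.75 * len(xs))
model = fit_line(xs[:split], ys[:split])
train_mse = mse(model, xs[:split], ys[:split])
test_mse = mse(model, xs[split:], ys[split:])
print(train_mse, test_mse)
```

A test error far above the training error would suggest overfitting, while two similar errors that are both high would suggest underfitting. For race data it would probably make sense to split by race or by date, so that horses from the same race do not end up on both sides of the split.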

Proposal Review

Background
The project is very interesting. It is about building a prediction model for horse racing results. Since sports gambling has become more and more popular in the US, the data science team would like to make gambling more data-driven and provide data-driven recommendations for gambling players.

Data
Two data files from the Hong Kong horse racing community would be used. The first one is about race information, including date, venue, surface, distance, etc. The other one is about horse performance, including horse age, horse type, result, etc.

Objective
The project team wants to build a model to maximize the payoff function by predicting the results of horse races. And the established model would be used as a sample for other sports betting prediction models.

Three Things I like

The model background is related to our daily life and it really brings practical meaning.
It can provide scientific recommendations for gambling players that can reduce their loss.
The horse performance analysis can provide recommendations on horse training.

Three Areas for Improvement

The data sources are from Hong Kong instead of from the US. The project should take the difference between these two into consideration.
It would be great if you could make the prediction objective more concrete. I am not quite sure what the payoff function consists of.
The business implications of the model should be stated explicitly. Beyond the apparent benefit, it would be better if you could talk a little more about what business benefit this model can bring.

Midterm Peer Review

Your project is focussed on predicting the winner of horse races and thereby improving betting strategies for the races. You use two csv files, combine them and do a feature selection over the 73 features.

Things I liked:

Really well-detailed description of the dataset and its features. It allows a reader completely new to horse racing to understand some of the factors that bettors might include in their decision. I further like how you provide a hypothesis and seek out visualizations to confirm/counter your hypothesis.
I like how you have described your process for feature selection and imputation. Your analysis of relevant features allowed you to not waste time imputing unnecessary features.
I like the plot that describes the correlation of your prediction float with the result of the race. Shows you're on the right track and the plots are simple and easily interpretable.

Areas of improvement:

While you have performed feature selection methodically, I would suggest maybe trying out PCA to further check whether your selection is valid or to add features that your initial selection couldn't include.
Maybe instead of analyzing win/loss you can try predicting position in the race, as there might be financial gain from 2nd and 3rd places as well.
Maybe think about how/why your models might underfit/overfit and state that.

Peer Review

This project will analyze horse racing and betting data from Hong Kong and other sources to best calculate strategies for horse betting in the United States. The team will create a payoff function from past data to maximize the expected profit for current betting strategies. In the future, this model can also be expanded and applied to other opportunities in the professional sporting industry.

The data included in the proposal is well thought out, and the proposal details the specific features used for both the race and runs data. Also, the report includes an overview of the calculations that will be used to determine optimal horse betting strategies, specifically using a probability distribution and a payoff function. Additionally, the proposal details how the model can be modularized and scaled to include other industries.

One concern is that the primary source of data is from Hong Kong, which might have different regulations for its horse racing practices and could be misleading for predictions made in the United States. Also, it would be helpful to detail what kind of visualizations or machine learning models would be used to support this analysis, and how to test the accuracy of these models. Additionally, it would be beneficial to elaborate on why sports betting is influential on a broader scale (i.e., how this research impacts the individual consumer and the overall sports industry/economy).

Midterm Review

The project “Horse Race Betting” aims to identify the relationship between historical racing conditions and horse race performance. In particular, the project leverages two key datasets, “Race Data,” which contains information on specific races, and “Run Data,” which contains information on specific horse performances within a race. Utilizing these data sets, the project team identified several relevant features to predict race performance including “Going,” “Draw,” “Distance,” and “Horse Type”.

One aspect of this midterm report I liked was the initial use of least squares regression, as it disproportionately penalizes large outliers and resulted in a test MSE that was close to the train MSE, so there was no overfitting. In addition, I like how the team graphed win percentage against weight carried by horse. Although one would intuitively believe that lighter racers yield better performances, their graph showed that this may not be a significant factor in predicting race outcomes. I also like how the team used one-hot encodings to preprocess the categorical data.

One potential area of improvement could be expanding the output space beyond a simple win/loss Boolean, as it only provides discernible information for horses that win and is rather ambiguous for determining the performance of horses that lose. Another area of improvement is having a larger feature space to make predictions in order to avoid underfitting. In the midterm report the group mentions that there were 23 columns with at least one missing entry, which resulted in them deleting most of these columns. Depending on the null value rate of each of the columns, I believe that it may be useful to impute the missing values for some of the features in order to get a more comprehensive feature space. Lastly, their graph of how “Going” affects “Number of Races” is difficult to understand, as the legend is not ordered to match the sequence of the bars.
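
The imputation idea above could be sketched along these lines: measure each column's null rate, drop only the columns that are mostly missing, and fill the rest. The column names, values, and the 0.5 threshold are hypothetical, and the sketch assumes every column that survives the threshold is numeric.

```python
from statistics import median

# Hypothetical columns with missing entries marked as None.
columns = {
    "declared_weight": [1020.0, None, 1100.0, 1065.0],
    "gear": [None, None, None, "B"],
}

def null_rate(values):
    """Fraction of entries in a column that are missing."""
    return sum(v is None for v in values) / len(values)

def impute_median(values):
    """Replace missing entries with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def clean(cols, max_null_rate=0.5):
    """Drop columns that are mostly missing; median-impute the rest."""
    kept = {}
    for name, values in cols.items():
        if null_rate(values) > max_null_rate:
            continue
        kept[name] = impute_median(values)
    return kept

print(clean(columns))
```

Median imputation is only one option; which fill value is appropriate would depend on the feature and on why its entries are missing.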

Project Proposal Peer Review

This team is trying to predict which horses to place bets on to get the greatest payoff and then offering the recommendation to users. The data is made up of two parts: race information and horse performance information. The goal is to maximize payoff using the predicted earnings at a placing and the probability distribution of that placing in a two-step process.

I like the clear explanation of the motivation behind the project including reasons such as publicly-available data that will make it feasible. The datasets chosen look good, and I am glad you are combining them. The brief description in the last paragraph of how the final calculations will be completed adds to the team’s trustworthiness and knowledgeability in what they will be undertaking.

Additionally, you can consider data from races held outside of Hong Kong for training or for testing to check your model. There might be differences due to location. In the actual report, elaborate more on the ending calculation with figures or equations to prevent confusion. Maximizing a payoff function is interesting but separate from the data analysis and modeling part that the class focuses on. It might be worth searching and reading previous work done in horse race betting for inspiration or for knowing where improvements can be made.
