
Introduction

This code estimates the difficulty of California cross country (XC) courses. The goal is to provide a single number that models each course's difficulty and can be used to adjust expected race times across courses. For example, course A might be 1.1 times harder than course B, and thus the expected times for course A will be 10% higher. The final results are here. You can click on a column header to sort the table by that column.

These results are normalized to times from Crystal Springs, which therefore has a difficulty of 1.0. Given a runner's time at Crystal, you can predict their time at another course by multiplying by that course's difficulty factor (e.g., 1.2). In other words, a difficulty factor of 1.2 means that on average a runner's time will be 20% longer than their time at Crystal. If you have their time at another course, you first divide that time by the course's difficulty factor, giving an estimate of their Crystal time, and then multiply by the difficulty of the course you care about.
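
As a concrete sketch of this arithmetic, the conversion might look like the following in Python (the function name and the example times and difficulties are illustrative, not part of this codebase):

```python
# A minimal sketch of the time-conversion arithmetic described above.
# The difficulty values and example time are illustrative only.
def predict_time(known_time_s, known_difficulty, target_difficulty):
    """Convert a race time from one course to another via Crystal Springs.

    Dividing by the known course's difficulty estimates the runner's
    Crystal Springs time; multiplying by the target course's difficulty
    predicts their time there.
    """
    crystal_time_s = known_time_s / known_difficulty
    return crystal_time_s * target_difficulty

# Example: an 18:00 (1080 s) race on a course with difficulty 0.95,
# predicted for a course with difficulty 1.2.
print(predict_time(1080.0, 0.95, 1.2))  # ~1364 s, about 22:44
```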

Using race results from XCStats over multiple years, this code builds a model that takes into account these different factors:

  • Course difficulty
  • Runner's innate ability
  • Average runner's month-to-month improvement over the season
  • Average runner's year-to-year improvement over their career

The model estimates a single number (ability or difficulty) for each runner and each course, while the month-to-month and year-to-year parameters are averages that apply to all runners.

To be more specific, given each runner's race times, the model fits parameters to a mathematical model that looks like this:

race_time = (average_race_time - race_month*month_slope - student_year*year_slope)
            * runner_ability * course_difficulty

Here, the race times are in seconds. The race_month is the numerical month, starting with September which is 0. The student_year is the high-school year of the student, where freshman is 0. Thus the slopes are in terms of seconds per month or year, to make the results easier to interpret.

There is an individual runner_ability parameter for each runner, as well as an individual course_difficulty for each course. Both are multiplicative factors that adjust the expected race time during the season, but in different ways: higher (>1) course difficulties represent harder courses, while lower (<1) runner abilities represent faster runners. In both cases, higher numbers translate to longer finish times.
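
A small Python sketch of this forward model may help make the roles of the factors concrete (the parameter names follow the formula above; the numbers are made up):

```python
def predicted_time(average_race_time, month_slope, year_slope,
                   race_month, student_year,
                   runner_ability, course_difficulty):
    """Deterministic core of the model: a seasonal baseline scaled by a
    per-runner and a per-course multiplicative factor."""
    baseline = (average_race_time
                - race_month * month_slope
                - student_year * year_slope)
    return baseline * runner_ability * course_difficulty

# Illustrative values: a junior (student_year=2) racing in October
# (race_month=1) on a course 5% harder than baseline.
print(predicted_time(1100.0, 10.0, 15.0, 1, 2, 0.97, 1.05))  # ~1080 s
```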

We use a Bayesian framework to derive a probabilistic model to explain the observed data (runner's race times). In a Bayesian model all the parameters of the model are random variables. We don't know Kent's true ability, so it is a random variable. Likewise, the difficulty of Crystal changes with the weather and other variables we do not have control over. Our goal is to find probability distributions that are as narrow as possible to explain the observed data.

By way of contrast, a deterministic model such as linear regression finds the model parameters that produce the smallest possible total error (in the mean-squared sense) when predicting the observed times. Instead, here we use a Bayesian model so we can model and describe the uncertainties in our predictions.

An important part of Bayesian modeling is providing information about expectations of the parameters. In this case, we wish the course and runner parameters to be approximately 1.0. This kind of constraint is added to the Bayesian model in the form of prior distributions.

We have 4 years of high school race results from the subscribers to XCStats.com. This includes 70k boys' results and 63k girls' results. For the analysis presented here, we use the runners with times in the top 25% of each race, hypothesizing that these are likely to be the more serious runners and will show less variance in their performance. This left us with 22k boys' results and 19k girls' results. Our boys' model was trained with 3919 runners running 443 courses, while the girls' model was trained with 3229 runners running 432 courses.
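
The top-25% filter is straightforward to express with pandas. Here is a minimal sketch, assuming a results DataFrame with hypothetical columns race_id and time_seconds (the real pipeline's column names may differ):

```python
import pandas as pd

def keep_fastest_quartile(results: pd.DataFrame) -> pd.DataFrame:
    """Keep only the results in the fastest 25% of each race."""
    cutoff = results.groupby("race_id")["time_seconds"].transform(
        lambda t: t.quantile(0.25))
    return results[results["time_seconds"] <= cutoff]
```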

Given our race results, we find the probability distribution for the parameters that best explains the data using a Python package called PyMC.
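
For illustration, a minimal PyMC version of such a model might look like the sketch below. The variable names mirror the formula above, and the slope, ability, and difficulty priors follow the priors table later in this document, but the remaining hyperparameters (and the code itself) are assumptions rather than this repository's actual implementation:

```python
import pymc as pm

# Inputs are one entry per race result:
#   runner_idx, course_idx : integer codes for the runner / course
#   race_month             : 0 = September, 1 = October, ...
#   student_year           : 0 = freshman, 1 = sophomore, ...
#   observed_time          : finish time in seconds
def build_model(runner_idx, course_idx, race_month, student_year,
                observed_time, n_runners, n_courses):
    with pm.Model() as model:
        # Season-wide baseline time; this prior is an assumption.
        average_race_time = pm.Normal("average_race_time", mu=1100.0, sigma=200.0)
        month_slope = pm.Normal("month_slope", mu=10.0, sigma=10.0)
        year_slope = pm.Normal("year_slope", mu=10.0, sigma=10.0)

        # Multiplicative factors expected to sit near 1.0.
        runner_ability = pm.Normal("runner_ability", mu=1.0, sigma=0.25,
                                   shape=n_runners)
        course_difficulty = pm.Normal("course_difficulty", mu=1.0, sigma=1.0,
                                      shape=n_courses)

        baseline = (average_race_time
                    - race_month * month_slope
                    - student_year * year_slope)
        mu = baseline * runner_ability[runner_idx] * course_difficulty[course_idx]

        # Observation noise (assumed scale).
        sigma = pm.HalfNormal("sigma", sigma=60.0)
        pm.Normal("race_time", mu=mu, sigma=sigma, observed=observed_time)
    return model

# trace = pm.sample(2000, model=build_model(...))
```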

Results

The goal of this exercise is to estimate the relative difficulty of each course in our dataset. Here are the results for a number of courses run by local high school teams (Palo Alto, Los Altos, Archbishop Mitty, and Lynbrook). The name of each course is followed by its distance in miles.

| Index | Course Name | Boys Difficulty | Girls Difficulty | # Boys | # Girls |
|---|---|---|---|---|---|
| 365 | RL Stevenson HS (1.6) | 0.523 | 0.510 | 82 | 73 |
| 43 | Woodward Park (2.0) | 0.646 | 0.645 | 53 | 51 |
| 20 | Lynbrook HS (2.1) | 0.650 | 0.643 | 330 | 336 |
| 311 | Fremont HS (2.05) | 0.650 | 0.646 | 23 | 23 |
| 81 | Hidden Valley Park (2.0) | 0.659 | 0.661 | 624 | 624 |
| 255 | Bol Park (2.18) | 0.678 | 0.667 | 131 | 104 |
| 203 | Prospect HS (2.15) | 0.681 | 0.687 | 19 | 20 |
| 273 | Lagoon Valley Park (2.0) | 0.698 | 0.702 | 100 | 79 |
| 333 | North Monterey County HS (2.4) | 0.727 | 0.721 | 14 | 13 |
| 182 | Westmoor HS (2.33) | 0.740 | 0.741 | 109 | 83 |
| 339 | Central Park (2.3) | 0.741 | 0.740 | 271 | 265 |
| 76 | Westmoor HS '18 (2.4) | 0.782 | 0.759 | 91 | 50 |
| 271 | Westmoor HS (2.4) | 0.782 | 0.779 | 243 | 190 |
| 220 | Golden Gate Park (2.82) | 0.913 | 0.903 | 212 | 174 |
| 217 | Golden Gate Park (2.93) | 0.947 | 0.944 | 761 | 635 |
| 276 | Newhall Park (2.95) | 0.961 | 0.953 | 139 | 115 |
| 117 | North Monterey County HS (3.0) | 0.964 | 0.964 | 54 | 58 |
| 231 | GGP - WCAL pre 2022 (3.0) | 0.984 | 0.980 | 156 | 135 |
| 247 | Kualoa Ranch (3.0) | 0.992 | 0.979 | 68 | 69 |
| 17 | Haggin Oaks Golf Course (3.1) | 0.993 | 0.984 | 758 | 683 |
| 305 | Stanford Golf Course (3.1) | 0.993 | 0.988 | 1310 | 1106 |
| 248 | Kualoa Ranch (3.1) | 0.995 | 0.976 | 29 | 24 |
| 49 | Elkhorn Country Club (3.1) | 0.996 | 0.986 | 773 | 674 |
| 332 | Newhall Park (3.0) | 0.997 | 0.991 | 877 | 787 |
| 155 | Crystal Springs (2.95) | 1.000 | 1.000 | 2728 | 2552 |
| 253 | Mt. Sac (2.93) | 1.001 | 1.005 | 3407 | 3080 |
| 126 | Woodward Park (3.1) | 1.012 | 1.006 | 5473 | 5264 |
| 224 | Lagoon Valley Park (3.0) | 1.017 | 1.015 | 508 | 403 |
| 8 | Hidden Valley Park (3.0) | 1.018 | 1.013 | 362 | 313 |
| 272 | Baylands Park (3.1) | 1.019 | 1.016 | 685 | 684 |
| 396 | Toro Park (3.0) | 1.023 | 1.023 | 704 | 665 |
| 123 | Glendoveer Golf Course, OR (3.1) | 1.052 | 1.041 | 54 | 58 |
| 179 | Mt. Sac (3.1) | 1.053 | 1.017 | 21 | 10 |

Our model has three scalar parameters (average_race_time, monthly_slope, yearly_slope) and two vector parameters (runner_ability, course_difficulty). The runner and course parameters have one value for each runner and each course, respectively. Runners typically improve month over month during the season, and year over year for each year they participate in XC. The model predicts the following improvements:

|  | Monthly Improvement | Yearly Improvement |
|---|---|---|
| Boys | 10.5s | 15.2s |
| Girls | 16.4s | 9.6s |

[I don't know why the slopes are so different between boys and girls. Perhaps it is a function of the girls' earlier maturity?]

The distribution of course difficulties is shown in the next figure below. Most courses are about 3 miles long, and they form the bulk of the difficulties around 1.0. But some are much shorter (2 miles is common) and one, not shown, is much longer.

Course Difficulty Distribution

We don't have the same amount of data for each runner. The histogram below shows the distribution of races per runner.

Varsity Boys Course Frequency

This code makes predictions of each runner's time on each course. We can plot the training error to get a sense of the model's accuracy.
This is shown below for the full model.

Varsity Boys Prediction Error

We build our model using a combination of normal (Gaussian) and gamma prior distributions.
Here are the priors for each model:

|  | Distribution mean | Distribution sigma |
|---|---|---|
| Monthly Slope | 10.0 | 10.0 |
| Yearly Slope | 10.0 | 10.0 |
| Course Difficulty | 1.0 | 1.0 |
| Runner Ability | 1.0 | 0.25 |

These prior distributions lead to the prediction errors shown below. Most importantly, the gamma model produces fewer MCMC divergences, suggesting it is better behaved.

|  | Normal | Gamma |
|---|---|---|
| VB Prediction Error (%) | 1.88 | 1.88 |
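
For reference, both prior families can be written with the same mean and sigma in PyMC; the gamma keeps the factor strictly positive. The snippet below is a sketch, not this repository's code:

```python
import pymc as pm

with pm.Model():
    # Same mean and sigma, expressed as a Normal or a Gamma prior.
    # PyMC's Gamma accepts the (mu, sigma) parameterization directly.
    difficulty_normal = pm.Normal("course_difficulty_normal", mu=1.0, sigma=1.0)
    difficulty_gamma = pm.Gamma("course_difficulty_gamma", mu=1.0, sigma=1.0)
```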

Different models have different errors. We get better prediction errors if we include the monthly and yearly slope features. This is a histogram of the varsity boys model errors for the gamma prior.

|  | Model without slopes | Full model |
|---|---|---|
| Average Prediction error (%) | 2.05 | 1.88 |

Ill-Posed

Note that the raw outputs from this model are unnormalized and should be considered relative results. While both the ability and difficulty numbers tend to be close to 1, their baselines are arbitrary: an average course_difficulty of 0.5 paired with an average runner_ability of 2 produces the same overall race-time predictions as the reverse.

This makes the model ill-posed, which affects our analysis. We have a multiplicative model: in effect we are multiplying A x B x C to predict D. While we have constraints on the expected values of A, B, and C (via prior distributions), a larger value of A can be matched with lower values of B and C to predict the same observed race time. This affects the model's slope parameters, but since we are interested in the relative course difficulties, we normalize all results to the difficulty of the Crystal Springs course, and the ill-posed nature washes out.

MCMC predicts the distribution of each model parameter by finding values that result in a high likelihood when scoring the observed data. Each time MCMC is run it produces a "trace" of all the model parameters that explain the data. The values of the trace are an empirical description of the probability distribution for that model parameter. Most importantly, since the trace is random, some traces might assume a lower value of A, and thus higher values of B or C, all to explain the same observed data.

The randomness of each trace makes it a little harder to draw conclusions. For the results presented here, we computed 36 traces, with 2000 samples of each model parameter (this took about 3 hours for each gender's data). The final course difficulty numbers are based on averaging all 72,000 trace samples for each course and then dividing by the average difficulty of the Crystal Springs course, our baseline course.
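
The normalization step itself is simple. Here is a sketch, assuming the pooled posterior draws are available as a NumPy array of shape (n_samples, n_courses) and that crystal_idx (a hypothetical name) is the Crystal Springs column:

```python
import numpy as np

def normalized_difficulties(difficulty_samples: np.ndarray, crystal_idx: int):
    """Average the pooled trace samples per course, then rescale so that
    Crystal Springs has difficulty exactly 1.0."""
    mean_per_course = difficulty_samples.mean(axis=0)
    return mean_per_course / mean_per_course[crystal_idx]
```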

Slope Results

The distribution of the monthly slope for the varsity boys is shown below for each of the eight traces.

Varsity Boys Monthly Slope (s/month)

We can do the same plot for the improvement of the varsity boys year over year.

Varsity Boys Yearly Slope (s/year)

The next figure shows the tradeoff between course difficulties and slopes.

Slopes vs. Course Difficulty Tradeoff

First consider the blue x's. When the monthly slope is high (the group of blue points to the right), the corresponding course difficulty is relatively low (toward the bottom). Conversely, where the slopes are low (the blue group on the left), the course difficulties are higher (upper left). This suggests that the model balances low (or high) slopes with high (or low) difficulties. We see the same behavior for the yearly slopes.

For the slopes in the table above we took the average value over all traces. For the course difficulties, which are the primary output of this model, we report the result after normalizing each trace's results to the Crystal Springs difficulty, so this tradeoff does not apply.

You can see all these difficulty factors in the interactive comparison viewer that is linked here. You can zoom and move around to explore the data. Course Difficulty Comparison Viewer
