For any given set of x and y values, we can build a linear regression model using the least-squares method to find a best-fit line. Just because this model exists does not necessarily mean it is a particularly appropriate or useful way to describe the relationship between the two variables.
In this lesson, you'll learn about techniques to evaluate a regression model. First, we'll use the F-test to test whether the model overall is statistically significant. Then we'll use the coefficient of determination (also known as "R-Squared") to measure the goodness of fit.
You will be able to:
- Describe the concept of an intercept-only model
- Perform an overall F-test for significance
- Calculate the coefficient of determination using self-constructed functions
- Use the coefficient of determination to describe model performance
There is no universal, objective standard for evaluating models, because every set of x and y values is different. Therefore we measure the fit of our models against a baseline model known as the intercept-only model.
An intercept-only model does not make use of any independent variables to predict the value of dependent variable
In the image below, this model is given by the straight orange line.
You can see that, in this plot, the intercept-only model (orange) always predicts the mean of
In order to evaluate the regression model, we ask: is our fitted regression line better than an intercept-only model?
We will determine this answer in terms of both statistical significance and goodness-of-fit by comparing the the errors made by our model (red vertical arrow) to the errors made by the intercept-only model (orange vertical arrow) and determining the difference between them (gray vertical arrow).
Let's return to our toy dataset and fitted regression model:
# import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
# Initialize arrays X and Y with given values
# X = Independent Variable
X = np.array([1,2,3,4,5,6,8,8,9,10], dtype=np.float64)
# Y = Dependent Variable
Y = np.array([7,7,8,9,9,10,10,11,11,12], dtype=np.float64)
plt.scatter(X, Y);
def calc_slope(x_vals, y_vals):
# setting up components of formula
rho = np.corrcoef(x_vals, y_vals)[0][1]
s_y = y_vals.std()
s_x = x_vals.std()
# calculating slope estimate
m = rho * s_y / s_x
return m
def calc_intercept(m, x_vals, y_vals):
# setting up components of formula
y_mean = y_vals.mean()
x_mean = x_vals.mean()
# calculating intercept estimate
c = y_mean - m * x_mean
return c
fig, ax = plt.subplots()
ax.scatter(X, Y, color='#003F72', label="Data points")
m = calc_slope(X, Y)
c = calc_intercept(m, X, Y)
print(f"Our model is: y = {round(m, 3)}x + {round(c, 4)}")
regression_line = m * X + c
ax.plot(X, regression_line, label="Regression Line")
intercept_only_line = np.array([Y.mean() for x in X])
ax.plot(X, intercept_only_line, label="Intercept-Only Line")
ax.legend();
Our model is: y = 0.539x + 6.3796
So, is our overall model statistically significant? Let's frame this in terms of a null and alternative hypothesis:
-
$H_0$ (null hypothesis): the intercept-only model fits the data just as well as (or better than) our model -
$H_a$ (alternative hypothesis): our model fits the data better than the intercept-only model
Then we'll follow the typical hypothesis test structure of computing a test statistic, calculating the p-value, then determining whether we can reject the null hypothesis.
For simple linear regression, the F-statistic is defined as the mean squared error of the model divided by the mean squared error of the residuals. We'll break down the formulas for this below, but the important part to take away is that this statistic is related to how well our model performs in the context of this dataset and the intercept-only model. This concept is more important than digging into the specific formulas, since we will use libraries to calculate these values for us in the future.
-
$MSE_{model}$ : the mean squared error of the model -
$MSE_{resid}$ : the mean squared error of the residuals
Breaking that down further:
-
$ESS$ : the explained sum of squares. In other words, the sum of the squared differences between$\hat{y}$ (the predictions made by our model) and$\bar{y}$ (the mean, i.e. the prediction made by the intercept-only model) -
$df(model)$ : the model degrees of freedom. -
$RSS$ : the residual sum of squares. In other words, the sum of the squared differences between$\hat{y}$ (the predictions made by our model) and$y$ (the actualY
values) -
$df(resid)$ : the residual degrees of freedom. In this case, 8, because we have 10 observations minus 2, the number of parameters (1 coefficient + 1 intercept)
To find the mean squared error of the model, we need to compute the explained sum of squares (ESS) and then divide it by the model degrees of freedom.
We will be comparing regression_line
(representing intercept_only_line
(representing
regression_line - intercept_only_line
array([-2.48101852, -1.94166667, -1.40231481, -0.86296296, -0.32361111,
0.21574074, 1.29444444, 1.29444444, 1.8337963 , 2.37314815])
Then we can square those differences:
(regression_line - intercept_only_line) ** 2
array([6.15545289, 3.77006944, 1.96648684, 0.74470508, 0.10472415,
0.04654407, 1.67558642, 1.67558642, 3.36280886, 5.63183213])
Then finally take the sum:
ess = ((regression_line - intercept_only_line) ** 2).sum()
ess
25.133796296296296
Model MSE is ESS divided by the degrees of freedom of the model. The degrees of freedom of the model is just the number of independent variables, which is 1 for a simple linear regression. Therefore the mean is just the same as the total, because we are only summarizing one independent variable.
model_df = 1
mse_model = ess / model_df
mse_model
25.133796296296296
To find the mean squared error of the residuals, we need to calculate the sum of squared residuals (RSS) and then divide it by the residual degrees of freedom.
Calculating RSS is very similar to calculating ESS, except that it is comparing regression_line
) to Y
). In other words, we're comparing our model's predicted values to the true values, rather than comparing them to the intercept-only values.
We'll follow the same process of finding the differences, then squaring the differences, then finding the sum.
rss = ((regression_line - Y) ** 2).sum()
rss
1.266203703703704
Residual MSE is the mean value for this total. To find the mean, we divide by the residual degrees of freedom, which is the number of observations minus the number of parameters.
In this case we have 2 parameters (1 for the coefficient and 1 for the intercept).
residual_df = len(Y) - 2
mse_residual = rss / residual_df
mse_residual
0.158275462962963
Putting that all together:
f = mse_model / mse_residual
f
158.7978062157221
We have an F-statistic of about 158.8, but what does that mean?
Let's use the survival function for the F-distribution from SciPy to find the p-value:
import scipy.stats
scipy.stats.f.sf(f, model_df, residual_df)
1.4755820678821615e-06
That p-value is well below the typical alpha of 0.05. We can reject the null hypothesis and say that our model is statistically significant!
Now that we know that our model is statistically significant, we can go one step further and quantify how much of the variation in the dependent variable is explained by our model. This measure is called the
R-squared is closely related to the F-statistic calculated above. It is defined as:
We have already calculated ESS (the explained sum of squared error) above. This represents the squared difference between
The only thing left to calculate is TSS (the total sum of squared error). This is the squared difference between
tss = ((Y - intercept_only_line) ** 2).sum()
tss
26.4
r_squared = ess / tss
r_squared
0.9520377384960719
We have a value of about 0.95 as our coefficient of determination (R-squared). What does this mean?
This means that the proportion of the variance explained by our model out of the total variance is about 0.95, or 95%. You can also say, 95% of the variation in $y$ is explained by our model.
For a least-squares regression model, R-Squared can take a value between 0 and 1 where values closer to 0 represent a poor fit and values closer to 1 represent an (almost) perfect fit
An R-Squared of 0 would mean that we are explaining 0% of the variation in
An R-Squared of 1 means that you are explaining 100% of the variation. This is very unlikely to see with linear regression on real-world data, so you probably want to double-check that you have set up your variables properly.
The image below shows what it might look like if you had an R-Squared of 1:
Like many other things in data science, the answer is "it depends"!
If you are doing inferential modeling, i.e. the part of the model you are most interested in is the value of the parameters, then as long as the model is statistically significant overall, you don't need to worry too much about R-Squared.
If you are doing predictive modeling, i.e. you are most interested in making predictions about unknown values of
Either way, R-Squared provides important context for how you communicate about your model to others. If your model is only explaining a tiny amount of the variation in
A good template to use is:
(R-Squared * 100)% of the variation in the dependent variable $y$ is explained by our model.
The formulas and variables described here also appear with slightly different variations in other resources. You don't need to memorize these, but just be aware that the same value might appear with a different name in a different context.
ESS is the explained sum of squared errors, also known as
RSS is the residual sum of squared errors, also known as
TSS is the total sum of squared errors, also known as
TSS is also equal to the sum of RSS + ESS. In other words, the total error is the sum of the explained error plus the residual (unexplained) error.
In the equation above, we said that R-Squared was equal to
If you use the
In this lesson you looked at two ways to evaluate a linear regression model. Both of them involve comparing the regression model to an "intercept-only" baseline model that always returns the mean of
First, the F-test can be used to determine whether the model is statistically significant overall. Second, the coefficient of determination (R-Squared) can be used to describe the amount of variation in
We also discussed how to communicate the results of these assessments, as well as some alternative naming conventions and formulas that you might see.