In this lab, you'll explore interactions in the Boston Housing data set.
You will be able to:
- Understand what interactions are
- Understand how to account for interactions in regression
You'll use a couple of built-in functions, which we imported for you below.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Import the Boston data set using load_boston(). We won't bother to preprocess the data in this lab. If you still want to build a model in the end, you can do that, but this lab will just focus on finding meaningful insights in interactions and how they can improve your regression model.
regression = LinearRegression()
boston = load_boston()
Create a baseline model which includes all the variables in the Boston housing data set to predict the house prices. Then use 10-fold cross-validation and report the mean R^2.
## code here
baseline
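One possible sketch of this step is below. Because load_boston was removed in scikit-learn 1.2, this standalone version substitutes a small synthetic frame for the Boston predictors (df) and target (y); in the lab itself you would build df from boston.data and boston.feature_names and use boston.target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

# Synthetic stand-in so the sketch runs anywhere (load_boston was removed in
# scikit-learn 1.2); in the lab, df = pd.DataFrame(boston.data, columns=boston.feature_names)
# and y = boston.target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["RM", "LSTAT", "DIS"])
y = 3 * df["RM"] - 2 * df["LSTAT"] + rng.normal(size=100)

regression = LinearRegression()
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)

# the mean R^2 over the 10 folds is our baseline
baseline = np.mean(cross_val_score(regression, df, y, scoring="r2", cv=crossvalidation))
print(baseline)
```
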
Next, create all possible combinations of interactions, loop over them and add them to the baseline model one by one to see how they affect the R^2. We'll look at the 3 interactions which have the biggest effect on our R^2, so print out the top 3 combinations.
You will create a for loop to loop through all the combinations of 2 predictors. You can use combinations from itertools to create a list of all the pairwise combinations. To find more info on how this is done, have a look at the itertools documentation.
from itertools import combinations
combos = list(combinations(boston.feature_names, 2))  # a new name avoids shadowing the imported function
## code to find top 3 interactions by R^2 value here
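The search could be sketched as follows. This again uses a synthetic stand-in for the Boston data (load_boston was removed in scikit-learn 1.2); the stand-in's true relationship deliberately contains an RM * LSTAT interaction, so that pair should come out on top.

```python
import numpy as np
import pandas as pd
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

# Synthetic stand-in for the Boston data (load_boston was removed in scikit-learn 1.2);
# the true relationship includes an RM * LSTAT interaction term.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["RM", "LSTAT", "DIS", "TAX"])
y = 3 * df["RM"] - 2 * df["LSTAT"] + df["RM"] * df["LSTAT"] + rng.normal(size=200)

regression = LinearRegression()
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)

results = []
for a, b in combinations(df.columns, 2):
    data = df.copy()
    data["interaction"] = data[a] * data[b]  # add one pairwise interaction term
    score = np.mean(cross_val_score(regression, data, y, scoring="r2", cv=crossvalidation))
    results.append((score, a, b))

# sort by R^2 (descending) and print the 3 strongest interactions
for score, a, b in sorted(results, reverse=True)[:3]:
    print(f"{a} * {b}: R^2 = {score:.3f}")
```
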
The top three interactions all seem to involve "RM", the number of rooms, as a confounding variable. Let's have a look at interaction plots for all three of them. This exercise involves:
- Splitting the data into 3 groups: one for houses with a low number of rooms, one for houses with a "medium" number of rooms, and one for houses with a high number of rooms.
- Creating a function build_interaction_rm. This function takes an argument varname (which can be set equal to the column name as a string) and a description (which describes the variable or varname, to be included on the x-axis of the plot). The function outputs a plot that uses "RM" as a confounding factor.
We split the data set into high, medium, and low numbers of rooms for you.
rm = df["RM"].values  # the RM column as a flat 1-D array
high_rm = all_data[rm > np.percentile(rm, 67)]
med_rm = all_data[(rm > np.percentile(rm, 33)) & (rm <= np.percentile(rm, 67))]
low_rm = all_data[rm <= np.percentile(rm, 33)]
Create build_interaction_rm.
def build_interaction_rm(varname, description):
pass
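A possible implementation is sketched below: fit a separate regression of the target on varname within each RM group and overlay the three fitted lines; clearly different slopes suggest an interaction with RM. The data here is a synthetic stand-in (load_boston was removed in scikit-learn 1.2), and the assumption that all_data carries a "MEDV" target column mirrors the Boston set.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for all_data (predictors plus the "MEDV" target column)
rng = np.random.default_rng(2)
all_data = pd.DataFrame(rng.normal(size=(150, 2)), columns=["RM", "LSTAT"])
all_data["MEDV"] = 3 * all_data["RM"] - 2 * all_data["LSTAT"] + rng.normal(size=150)

rm = all_data["RM"].values
high_rm = all_data[rm > np.percentile(rm, 67)]
med_rm = all_data[(rm > np.percentile(rm, 33)) & (rm <= np.percentile(rm, 67))]
low_rm = all_data[rm <= np.percentile(rm, 33)]

def build_interaction_rm(varname, description):
    """Plot a fitted MEDV-vs-varname line per RM group; diverging slopes
    hint at an interaction between varname and RM."""
    regression = LinearRegression()
    for group, label in [(low_rm, "low RM"), (med_rm, "medium RM"), (high_rm, "high RM")]:
        X = group[varname].values.reshape(-1, 1)
        regression.fit(X, group["MEDV"])
        xs = np.sort(X, axis=0)  # sorted x-grid so each line is drawn left to right
        plt.plot(xs, regression.predict(xs), label=label)
    plt.xlabel(description)
    plt.ylabel("MEDV")
    plt.legend()
    plt.show()

build_interaction_rm("LSTAT", "% lower status of the population")
```
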
Next, use build_interaction_rm with the three variables that came out with the highest effect on R^2.
# first plot
# second plot
# third plot
Build a final model that includes the three interactions you found. Use 10-fold cross-validation.
# code here
# code here
Our R^2 should have increased compared to the baseline model.
# code here
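The final step could be sketched as follows, again on a synthetic stand-in for the Boston data (load_boston was removed in scikit-learn 1.2), with top_3 hard-coded as a hypothetical result of the earlier search; in the lab, use the pairs your own loop actually found.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold

# Synthetic stand-in for the Boston data (load_boston was removed in scikit-learn 1.2)
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["RM", "LSTAT", "DIS", "TAX"])
y = 3 * df["RM"] - 2 * df["LSTAT"] + df["RM"] * df["LSTAT"] + rng.normal(size=200)

# hypothetical top-3 pairs; substitute the pairs found in your own search
top_3 = [("RM", "LSTAT"), ("RM", "DIS"), ("RM", "TAX")]

final_data = df.copy()
for a, b in top_3:
    final_data[f"{a}_x_{b}"] = final_data[a] * final_data[b]  # add interaction column

regression = LinearRegression()
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)
final_r2 = np.mean(cross_val_score(regression, final_data, y, scoring="r2", cv=crossvalidation))
print(final_r2)
```
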
What is your conclusion here?
# formulate your conclusion
You now understand how to include interaction effects in your model!