In this lab, you'll practice fitting a multiple linear regression model on the Boston Housing dataset!
You will be able to:
- Run a linear regression on the Boston Housing dataset with all the predictors
- Interpret the parameters of the multiple linear regression model
We pre-processed the Boston Housing data again. This time, however, we did things slightly differently:
- We dropped "ZN" and "NOX" completely
- We categorized "RAD" in 2 bins and "TAX" in 3 bins
- We used min-max scaling on "B", "CRIM" and "DIS" (log-transforming "CRIM" and "DIS" first, but not "B")
- We standardized "AGE", "INDUS", "LSTAT" and "PTRATIO" (log-transforming all of them first, except for "AGE")
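For reference, the two scalings named in the list can be written as follows. The symbols are notation introduced here, not part of the lab: $x$ is the (possibly log-transformed) column, $\bar{x}$ its mean, and $\sigma_x$ its population standard deviation, which is what `np.sqrt(np.var(x))` computes by default.

```latex
x_{\text{minmax}} = \frac{x - \min(x)}{\max(x) - \min(x)},
\qquad
x_{\text{std}} = \frac{x - \bar{x}}{\sigma_x}
```

Min-max scaling maps the column onto the interval $[0, 1]$, while standardization gives it mean 0 and standard deviation 1.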
import pandas as pd
import numpy as np
# Note: load_boston was removed in scikit-learn 1.2; this import needs an older version
from sklearn.datasets import load_boston
boston = load_boston()
boston_features = pd.DataFrame(boston.data, columns=boston.feature_names)
boston_features = boston_features.drop(["NOX", "ZN"], axis=1)
# Create bins for RAD based on the observed values: 3 bin edges result in 2 bins
bins = [0, 6, 24]
bins_rad = pd.cut(boston_features['RAD'], bins)
bins_rad = bins_rad.cat.as_unordered()
# Create bins for TAX based on the observed values: 4 bin edges result in 3 bins
bins = [0, 270, 360, 712]
bins_tax = pd.cut(boston_features['TAX'], bins)
bins_tax = bins_tax.cat.as_unordered()
tax_dummy = pd.get_dummies(bins_tax, prefix="TAX")
rad_dummy = pd.get_dummies(bins_rad, prefix="RAD")
boston_features = boston_features.drop(["RAD","TAX"], axis=1)
boston_features = pd.concat([boston_features, rad_dummy, tax_dummy], axis=1)
age = boston_features["AGE"]
b = boston_features["B"]
logcrim = np.log(boston_features["CRIM"])
logdis = np.log(boston_features["DIS"])
logindus = np.log(boston_features["INDUS"])
loglstat = np.log(boston_features["LSTAT"])
logptratio = np.log(boston_features["PTRATIO"])
# min-max scaling
boston_features["B"] = (b-min(b))/(max(b)-min(b))
boston_features["CRIM"] = (logcrim-min(logcrim))/(max(logcrim)-min(logcrim))
boston_features["DIS"] = (logdis-min(logdis))/(max(logdis)-min(logdis))
# standardization
boston_features["AGE"] = (age-np.mean(age))/np.sqrt(np.var(age))
boston_features["INDUS"] = (logindus-np.mean(logindus))/np.sqrt(np.var(logindus))
boston_features["LSTAT"] = (loglstat-np.mean(loglstat))/np.sqrt(np.var(loglstat))
boston_features["PTRATIO"] = (logptratio-np.mean(logptratio))/(np.sqrt(np.var(logptratio)))
boston_features.head()
| | CRIM | INDUS | CHAS | RM | AGE | DIS | PTRATIO | B | LSTAT | RAD_(0, 6] | RAD_(6, 24] | TAX_(0, 270] | TAX_(270, 360] | TAX_(360, 712] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | -1.704344 | 0.0 | 6.575 | -0.120013 | 0.542096 | -1.443977 | 1.000000 | -1.275260 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0.153211 | -0.263239 | 0.0 | 6.421 | 0.367166 | 0.623954 | -0.230278 | 1.000000 | -0.263711 | 1 | 0 | 1 | 0 | 0 |
| 2 | 0.153134 | -0.263239 | 0.0 | 7.185 | -0.265812 | 0.623954 | -0.230278 | 0.989737 | -1.627858 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0.171005 | -1.778965 | 0.0 | 6.998 | -0.809889 | 0.707895 | 0.165279 | 0.994276 | -2.153192 | 1 | 0 | 1 | 0 | 0 |
| 4 | 0.250315 | -1.778965 | 0.0 | 7.147 | -0.511180 | 0.707895 | 0.165279 | 1.000000 | -1.162114 | 1 | 0 | 1 | 0 | 0 |
Remove the necessary variables so that scikit-learn and statsmodels produce the same coefficients
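One reading of this step, offered as an assumption rather than the lab's official solution: when a model has an intercept and also contains every dummy column of a binned variable, those dummy columns sum to the intercept column. The design matrix is then rank-deficient, the least-squares coefficients are not unique, and two libraries may report different values. Dropping one dummy per group restores a unique solution. The sketch below uses a small made-up frame (`toy`) and a hypothetical helper (`design_matrix`), not the Boston data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the RAD and TAX columns
toy = pd.DataFrame({
    "RAD": [1, 4, 8, 24, 5, 24, 3, 7],
    "TAX": [200, 300, 400, 666, 250, 350, 280, 500],
})
rad_bins = pd.cut(toy["RAD"], [0, 6, 24])
tax_bins = pd.cut(toy["TAX"], [0, 270, 360, 712])

def design_matrix(drop_first):
    """Intercept column plus dummy columns for both binned variables."""
    rad = pd.get_dummies(rad_bins, prefix="RAD", drop_first=drop_first)
    tax = pd.get_dummies(tax_bins, prefix="TAX", drop_first=drop_first)
    X = pd.concat([rad, tax], axis=1).astype(float)
    X.insert(0, "const", 1.0)
    return X

X_full = design_matrix(drop_first=False)    # intercept + 2 RAD + 3 TAX dummies
X_reduced = design_matrix(drop_first=True)  # intercept + 1 RAD + 2 TAX dummies

# Compare the number of columns with the rank of each design matrix
rank_full = np.linalg.matrix_rank(X_full.to_numpy())
rank_reduced = np.linalg.matrix_rank(X_reduced.to_numpy())
print(X_full.shape[1], rank_full)
print(X_reduced.shape[1], rank_reduced)
```

In this toy example the full matrix has more columns than its rank, while the reduced matrix has full column rank, so only the reduced model has a single well-defined set of coefficients.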
- CRIM: per capita crime rate by town
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
The values below are given on the original, untransformed scale. Make sure to transform your variables as needed before using them with the model!
- CRIM: 0.15
- INDUS: 6.07
- CHAS: 1
- RM: 6.1
- AGE: 33.2
- DIS: 7.6
- PTRATIO: 17
- B: 383
- LSTAT: 10.87
- RAD: 8
- TAX: 284
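Because the model is fitted on transformed features, a new observation has to go through the same steps using the statistics of the training columns (minimum, maximum, mean, and standard deviation, computed before the columns were overwritten), not statistics of the single new row. A minimal sketch, assuming the untransformed training columns are still available; the helper names (`minmax_like`, `standardize_like`, `bin_label`) and the short `train_crim` and `train_lstat` lists are hypothetical stand-ins, not Boston values.

```python
import numpy as np
import pandas as pd

def minmax_like(train, value, log=False):
    """Min-max scale `value` with the min and max of the training column."""
    train = np.asarray(train, dtype=float)
    if log:
        train, value = np.log(train), np.log(value)
    return (value - train.min()) / (train.max() - train.min())

def standardize_like(train, value, log=False):
    """Standardize `value` with the training mean and population std."""
    train = np.asarray(train, dtype=float)
    if log:
        train, value = np.log(train), np.log(value)
    return (value - train.mean()) / train.std()

def bin_label(value, edges):
    """Return the right-closed interval that `value` falls into."""
    return pd.cut([value], edges)[0]

# Made-up training columns, for illustration only
train_crim = [0.01, 0.1, 1.0, 10.0]
train_lstat = [2.0, 5.0, 10.0, 20.0]

crim_new = minmax_like(train_crim, 0.15, log=True)          # log, then min-max
lstat_new = standardize_like(train_lstat, 10.87, log=True)  # log, then standardize
rad_bin = bin_label(8, [0, 6, 24])             # RAD = 8 falls in (6, 24]
tax_bin = bin_label(284, [0, 270, 360, 712])   # TAX = 284 falls in (270, 360]
print(crim_new, lstat_new, rad_bin, tax_bin)
```

The bin labels determine which dummy columns are set to 1 for the new row; every other predictor is replaced by its scaled value before the row is passed to the fitted model.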
Congratulations! You've fitted your first multiple linear regression model on the Boston Housing Data.