Interaction Terms


Another Interpretation of Indicator Variable Slopes

In Section 08-07, we discussed how to put an indicator variable slope into words. Let's now examine another interpretation that we can make of an indicator variable slope.

Suppose for a moment that we want to predict price with just:

  • accommodates
  • room type.

The best fit linear regression model using these explanatory variables is found to be the following.
$\hat{price}=\hat{\beta}_0+\hat{\beta}_1 accommodates+\hat{\beta}_2 room_type_{Private_room}$
$\hat{price}=6.95+41.61accommodates-6.30room_type_{Private_room}$

new_model = LinearRegression()
new_model.fit(X_train_dummies[['accommodates', 'room_type_Private room']], y_train)
pd.DataFrame([new_model.intercept_]+list(new_model.coef_.T),
index=['intercept']+['accommodates', 'room_type_Private room'])
0
intercept 6.951819
accommodates 41.605659
room_type_Private room -6.303447

By plugging in both possible values for the $room_type_{Private_room}$ indicator variable (1/0), we can see that this linear regression model can be used to represent two separate simple linear regression models.

Entire Home/Apartment Simple Linear Regression Model

By plugging in $room_type_{Private_room}=0$ into our original linear regression model we get the following simple linear regression model that predicts the price of Entire Home/Apartment listings given the number of people the listing accommodates.
$\hat{price}=6.95+41.61accommodates-6.30room_type_{Private_room}=6.95+41.61accommodates-6.30(0) =6.95+41.61accommodates$

Thus, we can interpret the intercept $\hat{\beta}_0=6.95$ as the intercept for the entire home/apartment listing simple linear regression curve.

Private Room Simple Linear Regression Model

By plugging in $room_type_{Private_room}=1$ into our original linear regression model we get the following simple linear regression model that predicts the price of private room listings given the number of people the listing accommodates.
$\hat{price}=6.95+41.61accommodates-6.30room_type_{Private_room}=6.95+41.61accommodates-6.30(1) =0.65+41.61accommodates$

In this scenario, we can interpret the room type indicator slope $\hat{\beta}_2=-6.30$ as telling us that the intercept $\hat{\beta}_0=6.95$ for the entire home/apartment simple linear regression curve is reduced by 6.30 to get the resulting intercept of 0.65 for the private room simple linear regression curve.
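To make this arithmetic concrete, we can reproduce both intercepts directly from the printed coefficients (the numbers below are simply copied from the regression output above):

```python
# Coefficients as printed in the regression output above
b0 = 6.951819   # intercept
b1 = 41.605659  # accommodates slope
b2 = -6.303447  # room_type_Private room indicator slope

# Plug in the indicator value: 0 for entire home/apt, 1 for private room
entire_home_intercept = b0 + b2 * 0
private_room_intercept = b0 + b2 * 1
print(round(entire_home_intercept, 2), round(private_room_intercept, 2))  # 6.95 0.65
```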

Common Slopes of Both Models

So we can see that by including an indicator variable, we allowed these two simple linear regression curves to have two distinct intercepts. However, notice that the slopes for the accommodates variable in these two curves are the same: $\hat{\beta}_1=41.61$.

Recall that in Section 08-03, we fit two separate best fit lines for the relationship between accommodates and price, one for each room type, and we ended up with two different slopes. What we see in the plot below suggests that there is what we call an interaction between accommodates and room_type when it comes to predicting price.

sns.lmplot(x='accommodates', y='price', hue='room_type', ci=False, data=df)
plt.show()

Introducing Interaction Terms

Allowing for Differing Slopes in the Multiple Models

In order to allow for different room type slopes in our main model, we can include what we call an interaction term between room_type and accommodates. This interaction term is represented as the product of these two variables $accommodates\cdot room_type_{Private_room}$.

We then add this interaction term with a new slope to our best fit linear regression model that we are trying to fit.
$\hat{price}=\hat{\beta}_0+\hat{\beta}_1 accommodates+\hat{\beta}_2 room_type_{Private_room} + \hat{\beta}_3 accommodates\cdot room_type_{Private_room}$

Let's first create this interaction term in our training features matrix.

X_train_dummies['room_type*accommodates'] = X_train_dummies['room_type_Private room']*X_train_dummies['accommodates']
X_train_dummies.head()
accommodates bedrooms beds neighborhood_Logan Square neighborhood_Near North Side neighborhood_Near West Side neighborhood_West Town room_type_Private room room_type*accommodates
1193 5 2.0 2.0 1 0 0 0 0 0
1392 8 4.0 4.0 1 0 0 0 0 0
338 12 4.0 5.0 0 0 0 0 0 0
1844 5 2.0 2.0 0 0 0 1 0 0
359 4 2.0 2.0 0 0 0 0 0 0

Then fit the resulting model as follows.

$\hat{price}=11.65+40.68accommodates-62.48 room_type_{Private_room} + 20.21 accommodates*room_type_{Private_room}$

new_model = LinearRegression()
new_model.fit(X_train_dummies[['accommodates', 'room_type_Private room', 'room_type*accommodates']], y_train)
pd.DataFrame([new_model.intercept_]+list(new_model.coef_.T),
index=['intercept']+['accommodates', 'room_type_Private room', 'room_type*accommodates'])
0
intercept 11.646318
accommodates 40.680589
room_type_Private room -62.481063
room_type*accommodates 20.210276

Similarly, by plugging in both possible values for the $room_type_{Private_room}$ indicator variable (1/0), we can again see that this linear regression model can be used to represent two separate simple linear regression models.

Entire Home/Apartment Simple Linear Regression Model

By plugging in $room_type_{Private_room}=0$ into our new linear regression model we get the following simple linear regression model that predicts the price of Entire Home/Apartment listings given the number of people the listing accommodates.
$\hat{price}=11.65+40.68accommodates-62.48 (0) + 20.21 accommodates*(0) = 11.65+40.68accommodates$

Thus, we can interpret:

  • the main model intercept $\hat{\beta}_0=11.65$ as the intercept for the entire home/apartment listing simple linear regression curve, and
  • the main model slope $\hat{\beta}_1=40.68$ as the accommodates slope for the entire home/apartment listing simple linear regression curve

Private Room Simple Linear Regression Model

By plugging in $room_type_{Private_room}=1$ into our new linear regression model we get the following simple linear regression model that predicts the price of private room listings given the number of people the listing accommodates.
$\hat{price}=11.65+40.68accommodates-62.48 (1) + 20.21 accommodates*(1) = -50.83+60.89accommodates$

Thus, we can interpret:

  • the indicator variable slope $\hat{\beta}_2=-62.48$ as telling us that the intercept $\hat{\beta}_0=11.65$ for the entire home/apartment listing is reduced by 62.48 to yield the intercept of -50.83 for the private room linear regression model, and
  • the interaction term slope $\hat{\beta}_3=20.21$ as the amount that the accommodates slope $\hat{\beta}_1=40.68$ for the entire home/apartment listing is increased by to yield the slope of 60.89 for the private room linear regression model.
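Again, we can verify this arithmetic directly from the printed coefficients (values copied from the regression output above):

```python
# Coefficients as printed in the regression output above
b0 = 11.646318   # intercept
b1 = 40.680589   # accommodates slope
b2 = -62.481063  # room_type_Private room indicator slope
b3 = 20.210276   # interaction term slope

# Private room sub-model: plug in 1 for the indicator
private_intercept = b0 + b2  # intercept shifts by the indicator slope
private_slope = b1 + b3      # slope shifts by the interaction slope
print(round(private_intercept, 2), round(private_slope, 2))  # -50.83 60.89
```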

When to Use Interaction Terms

In general, if you observe different slopes in the relationship between a given numerical explanatory variable and the response variable across the levels of one of your categorical explanatory variables, you might consider adding interaction terms between this numerical explanatory variable and the indicator variables that correspond to this categorical explanatory variable.
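One quick numeric version of this visual check is to fit a simple line per category and compare the slopes. Here is a sketch on made-up stand-in data; the real comparison would of course use our df:

```python
import numpy as np
import pandas as pd

# Toy stand-in data; in practice these columns come from the listings dataset
toy = pd.DataFrame({'bedrooms':  [1, 2, 3, 1, 2, 3],
                    'price':     [100, 200, 300, 80, 120, 160],
                    'room_type': ['Entire home/apt'] * 3 + ['Private room'] * 3})

# Fit a simple line per room type and keep each slope
slopes = {room: np.polyfit(g['bedrooms'], g['price'], 1)[0]
          for room, g in toy.groupby('room_type')}
print(slopes)  # noticeably different slopes hint that an interaction term may help
```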

sns.lmplot(x='bedrooms', y='price', hue='room_type', ci=False, data=df)
plt.show()

For instance, based on the different slopes that we see in the plot above, we might also consider adding an interaction term between bedrooms and room_type to the final model that we selected in Section 08-08.

X_train_dummies['room_type*bedrooms'] = X_train_dummies['room_type_Private room']*X_train_dummies['bedrooms']
X_train_dummies.head()
accommodates bedrooms beds neighborhood_Logan Square neighborhood_Near North Side neighborhood_Near West Side neighborhood_West Town room_type_Private room room_type*accommodates room_type*bedrooms
1193 5 2.0 2.0 1 0 0 0 0 0 0.0
1392 8 4.0 4.0 1 0 0 0 0 0 0.0
338 12 4.0 5.0 0 0 0 0 0 0 0.0
1844 5 2.0 2.0 0 0 0 1 0 0 0.0
359 4 2.0 2.0 0 0 0 0 0 0 0.0
test_model = LinearRegression()
test_model.fit(X_train_dummies, y_train)
pd.DataFrame([test_model.intercept_]+list(test_model.coef_.T),
index=['intercept']+list(X_train_dummies.columns))
0
intercept -73.558230
accommodates 8.180998
bedrooms 85.158404
beds 13.134943
neighborhood_Logan Square -11.623274
neighborhood_Near North Side 101.909502
neighborhood_Near West Side 47.773886
neighborhood_West Town 37.538000
room_type_Private room -31.691479
room_type*accommodates 17.216143
room_type*bedrooms -1.706942
X_test_dummies['room_type*accommodates'] = X_test_dummies['room_type_Private room']*X_test_dummies['accommodates']
X_test_dummies['room_type*bedrooms'] = X_test_dummies['room_type_Private room']*X_test_dummies['bedrooms']
X_test_dummies.head()
accommodates bedrooms beds neighborhood_Logan Square neighborhood_Near North Side neighborhood_Near West Side neighborhood_West Town room_type_Private room room_type*accommodates room_type*bedrooms
2592 2 1.0 1.0 0 0 1 0 0 0 0.0
139 1 1.0 1.0 0 0 0 0 1 1 1.0
471 2 1.0 1.0 0 0 0 0 0 0 0.0
2015 5 2.0 2.0 0 0 0 1 0 0 0.0
3324 6 2.0 3.0 0 1 0 0 0 0 0.0
y_pred_test = test_model.predict(X_test_dummies)
r2_score(y_test, y_pred_test)
0.32862111733497645

However, notice that the test dataset $R^2$ of this new model with the two interaction terms (0.3286) is slightly lower than that of the best model we selected in Section 08-08 ($R^2=0.3293$), also shown below.

Final/Best Model So Far
$\hat{price}=-85.59+13.36accommodates+94.59bedrooms-11.30neighborhood_{Logan_Square}+105.15neighborhood_{Near_North_Side}+48.06neighborhood_{Near_West_Side}+36.07neighborhood_{West_Town}+17.16room_type_{Private_room}$

final_model = LinearRegression()
final_model.fit(X_train_dummies.drop(['beds','room_type*accommodates', 'room_type*bedrooms'], axis=1), y_train)
pd.DataFrame([final_model.intercept_]+list(final_model.coef_.T),
index=['intercept']+list(X_train_dummies.columns.drop(['beds','room_type*accommodates', 'room_type*bedrooms'])))
0
intercept -85.589216
accommodates 13.363291
bedrooms 94.593238
neighborhood_Logan Square -11.292819
neighborhood_Near North Side 105.149914
neighborhood_Near West Side 48.056851
neighborhood_West Town 36.070250
room_type_Private room 17.156893
y_pred_test = final_model.predict(X_test_dummies.drop(['beds','room_type*accommodates', 'room_type*bedrooms'], axis=1))
r2_score(y_test, y_pred_test)
0.3293011996166699
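To see that interaction terms do pay off when the underlying relationship genuinely contains one, here is a sketch on synthetic data (all names and numbers below are invented): on this data the model with the product term fits better, whereas on our Airbnb test set it did not, which is why the simpler model remains our best so far.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data whose TRUE relationship contains an interaction term
rng = np.random.default_rng(0)
n = 200
accommodates = rng.integers(1, 9, n).astype(float)
private = rng.integers(0, 2, n).astype(float)
price = (10 + 40 * accommodates - 60 * private
         + 20 * accommodates * private + rng.normal(0, 5, n))

# Main-effects-only features vs. features with the product column added
X_main = np.column_stack([accommodates, private])
X_int = np.column_stack([accommodates, private, accommodates * private])

r2_main = LinearRegression().fit(X_main, price).score(X_main, price)
r2_int = LinearRegression().fit(X_int, price).score(X_int, price)
print(r2_int > r2_main)  # the interaction model fits this data better
```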