Interaction Terms


Another Interpretation of Indicator Variable Slopes

In Section 08-07, we discussed how to put an indicator variable slope into words. Let's now examine another interpretation that we can make of an indicator variable slope.

Suppose for a moment that we want to predict price with just:

  • accommodates
  • room type.

The best fit linear regression model using these explanatory variables is found to be the following.
$\hat{price}=\hat{\beta}_0+\hat{\beta}_1 accommodates+\hat{\beta}_2 room_type_{Private_room}$
$\hat{price}=6.95+41.61accommodates-6.30room_type_{Private_room}$

new_model = LinearRegression()
new_model.fit(X_train_dummies[['accommodates', 'room_type_Private room']], y_train)
pd.DataFrame([new_model.intercept_]+list(new_model.coef_.T),
index=['intercept']+['accommodates', 'room_type_Private room'])
0
intercept 6.951819
accommodates 41.605659
room_type_Private room -6.303447

By plugging in both possible values for the $room_type_{Private_room}$ indicator variable (1/0), we can see that this linear regression model can be used to represent two separate simple linear regression models.

Entire Home/Apartment Simple Linear Regression Model

By plugging in $room_type_{Private_room}=0$ into our original linear regression model we get the following simple linear regression model that predicts the price of Entire Home/Apartment listings given the number of people the listing accommodates.
$\hat{price}=6.95+41.61accommodates-6.30room_type_{Private_room}=6.95+41.61accommodates-6.30(0) =6.95+41.61accommodates$

Thus, we can interpret the intercept $\hat{\beta}_0=6.95$ as the intercept for the entire home/apartment listing simple linear regression curve.

Private Room Simple Linear Regression Model

By plugging in $room_type_{Private_room}=1$ into our original linear regression model we get the following simple linear regression model that predicts the price of private room listings given the number of people the listing accommodates.
$\hat{price}=6.95+41.61accommodates-6.30room_type_{Private_room}=6.95+41.61accommodates-6.30(1) =0.65+41.61accommodates$

In this scenario, we can interpret the room type indicator slope $\hat{\beta}_2=-6.30$ as telling us that the intercept $\hat{\beta}_0=6.95$ for the entire home/apartment simple linear regression curve is reduced by 6.30 to get the resulting intercept of 0.65 for the private room simple linear regression curve.
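To make this arithmetic concrete, we can reproduce both intercepts directly from the printed coefficients (the numbers below are simply copied from the regression output above):

```python
# Coefficients as printed in the regression output above
b0 = 6.951819   # intercept
b1 = 41.605659  # accommodates slope
b2 = -6.303447  # room_type_Private room indicator slope

# Plug in the indicator value: 0 for entire home/apt, 1 for private room
entire_home_intercept = b0 + b2 * 0
private_room_intercept = b0 + b2 * 1
print(round(entire_home_intercept, 2), round(private_room_intercept, 2))  # 6.95 0.65
```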

Common Slopes of Both Models

So we can see that by including an indicator variable, we allowed these two simple linear regression curves to have two distinct intercepts. However, notice that the slopes for the accommodates variable in these two curves are the same: $\hat{\beta}_1=41.61$.

Recall that in Section 08-03, we fit two separate best fit lines for the relationship between accommodates and price, one for each room type, and we ended up with two different slopes. What we see in the plot below suggests that there is what we call an interaction between accommodates and room_type when it comes to predicting price.

sns.lmplot(x='accommodates', y='price', hue='room_type', ci=False, data=df)
plt.show()

Introducing Interaction Terms

Allowing for Differing Slopes in the Multiple Models

In order to allow for different room type slopes in our main model, we can include what we call an interaction term between room_type and accommodates. This interaction term is represented as the product of these two variables $accommodates\cdot room_type_{Private_room}$.

We then add this interaction term with a new slope to our best fit linear regression model that we are trying to fit.
$\hat{price}=\hat{\beta}_0+\hat{\beta}_1 accommodates+\hat{\beta}_2 room_type_{Private_room} + \hat{\beta}_3 accommodates\cdot room_type_{Private_room}$

Let's first create this interaction term in our training features matrix.

X_train_dummies['room_type*accommodates'] = X_train_dummies['room_type_Private room']*X_train_dummies['accommodates']
X_train_dummies.head()
accommodates bedrooms beds neighborhood_Logan Square neighborhood_Near North Side neighborhood_Near West Side neighborhood_West Town room_type_Private room room_type*accommodates
1193 5 2.0 2.0 1 0 0 0 0 0
1392 8 4.0 4.0 1 0 0 0 0 0
338 12 4.0 5.0 0 0 0 0 0 0
1844 5 2.0 2.0 0 0 0 1 0 0
359 4 2.0 2.0 0 0 0 0 0 0

Then fit the resulting model as follows.

$\hat{price}=11.65+40.68accommodates-62.48 room_type_{Private_room} + 20.21 accommodates*room_type_{Private_room}$

new_model = LinearRegression()
new_model.fit(X_train_dummies[['accommodates', 'room_type_Private room', 'room_type*accommodates']], y_train)
pd.DataFrame([new_model.intercept_]+list(new_model.coef_.T),
index=['intercept']+['accommodates', 'room_type_Private room', 'room_type*accommodates'])
0
intercept 11.646318
accommodates 40.680589
room_type_Private room -62.481063
room_type*accommodates 20.210276

Similarly, by plugging in both possible values for the $room_type_{Private_room}$ indicator variable (1/0), we can again see that this linear regression model can be used to represent two separate simple linear regression models.

Entire Home/Apartment Simple Linear Regression Model

By plugging in $room_type_{Private_room}=0$ into our new linear regression model we get the following simple linear regression model that predicts the price of Entire Home/Apartment listings given the number of people the listing accommodates.
$\hat{price}=11.65+40.68accommodates-62.48 (0) + 20.21 accommodates*(0) = 11.65+40.68accommodates$

Thus, we can interpret:

  • the main model intercept $\hat{\beta}_0=11.65$ as the intercept for the entire home/apartment listing simple linear regression curve, and
  • the main model slope $\hat{\beta}_1=40.68$ as the accommodates slope for the entire home/apartment listing simple linear regression curve

Private Room Simple Linear Regression Model

By plugging in $room_type_{Private_room}=1$ into our new linear regression model we get the following simple linear regression model that predicts the price of private room listings given the number of people the listing accommodates.
$\hat{price}=11.65+40.68accommodates-62.48 (1) + 20.21 accommodates*(1) = -50.83+60.89accommodates$

Thus, we can interpret:

  • the indicator variable slope $\hat{\beta}_2=-62.48$ as telling us that the intercept $\hat{\beta}_0=11.65$ for the entire home/apartment listing is reduced by 62.48 to yield the intercept of -50.83 for the private room linear regression model, and
  • the interaction term slope $\hat{\beta}_3=20.21$ as the amount that the accommodates slope $\hat{\beta}_1=40.68$ for the entire home/apartment listing is increased by to yield the slope of 60.89 for the private room linear regression model.
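Again, we can verify this arithmetic directly from the printed coefficients (values copied from the regression output above):

```python
# Coefficients as printed in the regression output above
b0 = 11.646318   # intercept
b1 = 40.680589   # accommodates slope
b2 = -62.481063  # room_type_Private room indicator slope
b3 = 20.210276   # interaction term slope

# Private room sub-model: plug in 1 for the indicator
private_intercept = b0 + b2  # intercept shifts by the indicator slope
private_slope = b1 + b3      # slope shifts by the interaction slope
print(round(private_intercept, 2), round(private_slope, 2))  # -50.83 60.89
```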

When to Use Interaction Terms

In general, if you observe different slopes in the relationship between a given numerical explanatory variable and the response variable across the levels of one of your categorical explanatory variables, you might consider adding interaction terms between this numerical explanatory variable and the indicator variables that correspond to this categorical explanatory variable.
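One quick numeric version of this visual check is to fit a simple line per category and compare the slopes. Here is a sketch on made-up stand-in data; the real comparison would of course use our df:

```python
import numpy as np
import pandas as pd

# Toy stand-in data; in practice these columns come from the listings dataset
toy = pd.DataFrame({'bedrooms':  [1, 2, 3, 1, 2, 3],
                    'price':     [100, 200, 300, 80, 120, 160],
                    'room_type': ['Entire home/apt'] * 3 + ['Private room'] * 3})

# Fit a simple line per room type and keep each slope
slopes = {room: np.polyfit(g['bedrooms'], g['price'], 1)[0]
          for room, g in toy.groupby('room_type')}
print(slopes)  # noticeably different slopes hint that an interaction term may help
```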

sns.lmplot(x='bedrooms', y='price', hue='room_type', ci=False, data=df)
plt.show()

For instance, based on the different slopes that we see in the plot above, we might also consider adding an interaction term between bedrooms and room_type to the final model that we selected in Section 08-08.

X_train_dummies['room_type*bedrooms'] = X_train_dummies['room_type_Private room']*X_train_dummies['bedrooms']
X_train_dummies.head()
accommodates bedrooms beds neighborhood_Logan Square neighborhood_Near North Side neighborhood_Near West Side neighborhood_West Town room_type_Private room room_type*accommodates room_type*bedrooms
1193 5 2.0 2.0 1 0 0 0 0 0 0.0
1392 8 4.0 4.0 1 0 0 0 0 0 0.0
338 12 4.0 5.0 0 0 0 0 0 0 0.0
1844 5 2.0 2.0 0 0 0 1 0 0 0.0
359 4 2.0 2.0 0 0 0 0 0 0 0.0
test_model = LinearRegression()
test_model.fit(X_train_dummies, y_train)
pd.DataFrame([test_model.intercept_]+list(test_model.coef_.T),
index=['intercept']+list(X_train_dummies.columns))
0
intercept -73.558230
accommodates 8.180998
bedrooms 85.158404
beds 13.134943
neighborhood_Logan Square -11.623274
neighborhood_Near North Side 101.909502
neighborhood_Near West Side 47.773886
neighborhood_West Town 37.538000
room_type_Private room -31.691479
room_type*accommodates 17.216143
room_type*bedrooms -1.706942
X_test_dummies['room_type*accommodates'] = X_test_dummies['room_type_Private room']*X_test_dummies['accommodates']
X_test_dummies['room_type*bedrooms'] = X_test_dummies['room_type_Private room']*X_test_dummies['bedrooms']
X_test_dummies.head()
accommodates bedrooms beds neighborhood_Logan Square neighborhood_Near North Side neighborhood_Near West Side neighborhood_West Town room_type_Private room room_type*accommodates room_type*bedrooms
2592 2 1.0 1.0 0 0 1 0 0 0 0.0
139 1 1.0 1.0 0 0 0 0 1 1 1.0
471 2 1.0 1.0 0 0 0 0 0 0 0.0
2015 5 2.0 2.0 0 0 0 1 0 0 0.0
3324 6 2.0 3.0 0 1 0 0 0 0 0.0
y_pred_test = test_model.predict(X_test_dummies)
r2_score(y_test, y_pred_test)
0.32862111733497645

However, notice that the test dataset $R^2$ of this new model with the two interaction terms (0.3286) is slightly lower than that of the best model we selected in Section 08-08 ($R^2=0.3293$), also shown below.

Final/Best Model So Far
$\hat{price}=-85.59+13.36accommodates+94.59bedrooms-11.30neighborhood_{Logan_Square}+105.15neighborhood_{Near_North_Side}+48.06neighborhood_{Near_West_Side}+36.07neighborhood_{West_Town}+17.16room_type_{Private_room}$

final_model = LinearRegression()
final_model.fit(X_train_dummies.drop(['beds','room_type*accommodates', 'room_type*bedrooms'], axis=1), y_train)
pd.DataFrame([final_model.intercept_]+list(final_model.coef_.T),
index=['intercept']+list(X_train_dummies.columns.drop(['beds','room_type*accommodates', 'room_type*bedrooms'])))
0
intercept -85.589216
accommodates 13.363291
bedrooms 94.593238
neighborhood_Logan Square -11.292819
neighborhood_Near North Side 105.149914
neighborhood_Near West Side 48.056851
neighborhood_West Town 36.070250
room_type_Private room 17.156893
y_pred_test = final_model.predict(X_test_dummies.drop(['beds','room_type*accommodates', 'room_type*bedrooms'], axis=1))
r2_score(y_test, y_pred_test)
0.3293011996166699
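To see that interaction terms do pay off when the underlying relationship genuinely contains one, here is a sketch on synthetic data (all names and numbers below are invented): on this data the model with the product term fits better, whereas on our Airbnb test set it did not, which is why the simpler model remains our best so far.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data whose TRUE relationship contains an interaction term
rng = np.random.default_rng(0)
n = 200
accommodates = rng.integers(1, 9, n).astype(float)
private = rng.integers(0, 2, n).astype(float)
price = (10 + 40 * accommodates - 60 * private
         + 20 * accommodates * private + rng.normal(0, 5, n))

# Main-effects-only features vs. features with the product column added
X_main = np.column_stack([accommodates, private])
X_int = np.column_stack([accommodates, private, accommodates * private])

r2_main = LinearRegression().fit(X_main, price).score(X_main, price)
r2_int = LinearRegression().fit(X_int, price).score(X_int, price)
print(r2_int > r2_main)  # the interaction model fits this data better
```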