How to Incorporate Categorical Explanatory Variables


Let's return to our original research goal of predicting the price of new Chicago Airbnb listings using the following 5 explanatory variables.

Neighborhood: (filtered for just the 5 most popular neighborhoods: Near North Side, West Town, Lake View, Near West Side, Logan Square)
Room type: (filtered for entire home/apt and private room)
How many people the listing accommodates
How many bedrooms the listing has
How many beds the listing has

Notice that the first two explanatory variables listed are categorical. In Section 10 we'll discuss why the response variable in a linear regression model should be numerical. Explanatory variables are less restricted: we can incorporate categorical explanatory variables into a linear regression model in the following way.

Below we use the pd.get_dummies() function to create a set of what we call 0/1 indicator variables (or dummy variables) for each of our categorical explanatory variables. We do this for both our training features matrix and our test features matrix.

X_train_dummies=pd.get_dummies(X_train,drop_first=True)
X_train_dummies.head()

	accommodates	bedrooms	beds	neighborhood_Logan Square	neighborhood_West Town
1193	5	2.0	2.0	1	0
1392	8	4.0	4.0	1	0
338	12	4.0	5.0	0	0
1844	5	2.0	2.0	0	1
359	4	2.0	2.0	0	0

X_test_dummies=pd.get_dummies(X_test,drop_first=True)
X_test_dummies.head()

	accommodates	bedrooms	beds	neighborhood_Near North Side	neighborhood_Near West Side	neighborhood_West Town	room_type_Private room
2592	2	1.0	1.0	0	1	0	0
139	1	1.0	1.0	0	0	0	1
471	2	1.0	1.0	0	0	0	0
2015	5	2.0	2.0	0	0	1	0
3324	6	2.0	3.0	1	0	0	0
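When the training and test matrices are encoded separately, it is worth checking that both end up with the same indicator columns. If a level happened to be missing from one split, pd.get_dummies() would create different columns, and with drop_first=True it could even drop a different level. The sketch below shows one possible safeguard, assuming small hypothetical matrices (X_train_toy, X_test_toy) and a hypothetical helper encode(); it is not taken from the analysis above. It fixes the list of levels from the training data before encoding.

```python
import pandas as pd

# Hypothetical toy feature matrices standing in for X_train and X_test.
X_train_toy = pd.DataFrame({
    "beds": [1.0, 2.0, 3.0],
    "neighborhood": ["Lake View", "Logan Square", "West Town"],
})
X_test_toy = pd.DataFrame({
    "beds": [2.0, 1.0],
    "neighborhood": ["Logan Square", "West Town"],  # no Lake View listings
})

# Fix the set of levels using the training data only.
levels = sorted(X_train_toy["neighborhood"].unique())

def encode(df):
    """Hypothetical helper: encode with a fixed list of levels."""
    df = df.copy()
    df["neighborhood"] = pd.Categorical(df["neighborhood"], categories=levels)
    # With a categorical column, get_dummies makes one column per listed
    # category (even unobserved ones) before dropping the first.
    return pd.get_dummies(df, drop_first=True, dtype=int)

train_toy_dummies = encode(X_train_toy)
test_toy_dummies = encode(X_test_toy)

print(list(train_toy_dummies.columns))
print(list(test_toy_dummies.columns))
```

Both printed lists should match, and "Lake View" is the dropped level in each matrix, so a fitted model sees the test columns in the same meaning as the training columns.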

Representing the Neighborhood Variable

By setting the parameter drop_first=True in the pd.get_dummies() function, we end up with the following situation in which there are only four indicator variables…

neighborhood_Logan Square
neighborhood_Near North Side
neighborhood_Near West Side
neighborhood_West Town

… that correspond to the five levels of our original neighborhood_cleansed variable

Logan Square
Near North Side
Near West Side
West Town
Lake View

How do we go about interpreting these four indicator variables?

If neighborhood_Logan Square=1, then that means the corresponding listing is in Logan Square.
Otherwise, if neighborhood_Logan Square=0, then this means that the corresponding listing is not in Logan Square.

You interpret the remaining three indicator variables similarly, for instance…

If neighborhood_Near North Side=1, then that means the corresponding listing is in Near North Side.
Otherwise, if neighborhood_Near North Side=0, then this means that the corresponding listing is not in Near North Side.

Notice how the Lake View neighborhood was not assigned an indicator variable. However, this is ok. If we want to represent the single "left out" level like Lake View, then all we need to do is set…

neighborhood_Logan Square=0
neighborhood_Near North Side=0
neighborhood_Near West Side=0
neighborhood_West Town=0

By process of elimination, the listing MUST then belong to Lake View.
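The encoding described above can be checked on a tiny example. The sketch below uses a hypothetical toy DataFrame with one listing per neighborhood (not the real Airbnb data). It assumes that, for a plain text column, pd.get_dummies() orders the levels alphabetically, so drop_first=True drops "Lake View".

```python
import pandas as pd

# Hypothetical toy data: one listing per neighborhood.
toy = pd.DataFrame({
    "neighborhood": ["Lake View", "Logan Square", "Near North Side",
                     "Near West Side", "West Town"]
})

# drop_first=True drops the indicator for the first level in sorted order,
# which here is "Lake View".
toy_dummies = pd.get_dummies(toy, drop_first=True, dtype=int)
print(toy_dummies)

# The Lake View listing (row 0) has a 0 in every indicator column,
# while every other listing has exactly one indicator equal to 1.
row_sums = toy_dummies.sum(axis=1)
print(row_sums.tolist())
```

The dtype=int argument is only there so the indicators display as 0/1; depending on the pandas version, the default display may be True/False instead.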

Representing the room_type Variable

Notice how we also used only one indicator variable…

room_type_Private room

… that corresponds to the two levels of our original room_type variable

Private room
Entire home/apt

Similarly we interpret this one variable as follows.

If room_type_Private room=1, then that means the corresponding listing is a private room.
Otherwise, if room_type_Private room=0, then this means that the corresponding listing is not a private room (and thus by process of elimination it must be an entire home/apt).

Why do we do it this way?

You may be asking: "why could we not have used five indicator variables that corresponded to each of the five levels of the neighborhood variable, for instance?" Or in other words, why can we not represent our linear regression equation like this?

$\hat{price}=\hat{\beta}_0 +\hat{\beta}_1 accommodates+\hat{\beta}_2 bedrooms+\hat{\beta}_3 beds+\hat{\beta}_4 neighborhood_{Logan\_Square}$
$\qquad +\hat{\beta}_5 neighborhood_{Near\_North\_Side}+\hat{\beta}_6 neighborhood_{Near\_West\_Side}+\hat{\beta}_7 neighborhood_{West\_Town}$
$\qquad +\mathbf{\hat{\beta}_8neighborhood_{Lake\_View}}+\hat{\beta}_9 roomtype_{private\_room}+\mathbf{\hat{\beta}_{10} roomtype_{entire\_home/apt}}$

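Unfortunately, if we set up our linear regression model like this, with indicator variables that are not needed, the model contains redundant information: for every listing, the five neighborhood indicators sum to 1, which is exactly the value that multiplies the intercept $\hat{\beta}_0$ (and the same holds for the two room type indicators). As a result, when we use our Calculus techniques to solve for the optimal values of $\hat{\beta}_0,\hat{\beta}_1,...,\hat{\beta}_{10}$, there is no single best answer: infinitely many combinations of coefficients fit the training data equally well, and we run into "multiple solution errors".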
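One way to see the redundancy is to compare the number of columns in the features matrix (including a column of 1s for the intercept) with its rank, that is, the number of columns that carry independent information. The sketch below uses hypothetical toy data with three levels rather than the five real neighborhoods; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical toy data: six listings across three neighborhood levels (0, 1, 2).
levels = np.array([0, 1, 2, 0, 1, 2])

intercept = np.ones((6, 1))            # column of 1s for the intercept
all_dummies = np.eye(3)[levels]        # one indicator per level (p = 3)
reduced_dummies = all_dummies[:, 1:]   # drop the first level (p - 1 = 2)

X_full = np.hstack([intercept, all_dummies])
X_reduced = np.hstack([intercept, reduced_dummies])

# With all p indicators, the rank is smaller than the number of columns,
# so the least squares solution is not unique.
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)
print(X_reduced.shape[1], rank_reduced)
```

With all three indicators the matrix has four columns but only rank three, while dropping one indicator leaves three columns of rank three, so each coefficient can be pinned down.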

Therefore, whenever you represent a categorical explanatory variable with p levels in a regression model that includes an intercept, you should use exactly p-1 indicator variables.

Therefore, the model that we are going to try to fit is shown below.

$\hat{price}=\hat{\beta}_0+\hat{\beta}_1accommodates+\hat{\beta}_2 bedrooms+\hat{\beta}_3beds$
$\qquad +\hat{\beta}_4neighborhood_{Logan\_Square}+\hat{\beta}_5neighborhood_{Near\_North\_Side}$
$\qquad +\hat{\beta}_6 neighborhood_{Near\_West\_Side}+\hat{\beta}_7 neighborhood_{West\_Town}$
$\qquad +\hat{\beta}_8 roomtype_{private\_room}$

Reference Level

For a given categorical explanatory variable, the level that is not assigned an indicator variable is called the reference level.

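Also note that the choice of reference level does not affect the model's predictions. Your fitted model will give the same predicted values regardless of which level you leave out. What does change is the intercept and the indicator slopes, because each indicator slope is measured relative to the reference level.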
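This can be checked numerically. The sketch below fits the same least squares model twice on hypothetical simulated data (not the Airbnb dataset), leaving out a different level each time. It uses np.linalg.lstsq rather than any particular modeling library, and fit_predict is a hypothetical helper.

```python
import numpy as np

# Hypothetical simulated data: 8 listings, 3 neighborhood levels, one numerical variable.
rng = np.random.default_rng(0)
levels = np.array([0, 1, 2, 0, 1, 2, 0, 1])
beds = np.array([1.0, 2.0, 3.0, 2.0, 1.0, 4.0, 3.0, 2.0])
level_effects = np.array([0.0, 20.0, -10.0])
price = 50 + 30 * beds + level_effects[levels] + rng.normal(0, 5, size=8)

dummies = np.eye(3)[levels]   # one indicator column per level
ones = np.ones((8, 1))        # intercept column

def fit_predict(reference):
    """Fit by least squares, leaving out the indicator for `reference`."""
    keep = [j for j in range(3) if j != reference]
    X = np.hstack([ones, beds[:, None], dummies[:, keep]])
    coef, *_ = np.linalg.lstsq(X, price, rcond=None)
    return coef, X @ coef

coef_a, pred_a = fit_predict(reference=0)
coef_b, pred_b = fit_predict(reference=2)

print(np.allclose(pred_a, pred_b))  # do the predictions agree?
print(coef_a)                       # intercept, beds slope, two indicator slopes
print(coef_b)
```

The predicted prices and the slope for beds agree across the two fits, while the intercept and indicator slopes differ because they are expressed relative to different reference levels.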
