How to Incorporate Categorical Explanatory Variables


Let's return to our original research goal of predicting the price of new Chicago Airbnb listings using the following 5 explanatory variables.

  1. Neighborhood: (filtered for just the 5 most popular neighborhoods: Near North Side, West Town, Lake View, Near West Side, Logan Square)
  2. Room type: (filtered for entire home/apt and private room)
  3. How many people the listing accommodates
  4. How many bedrooms the listing has
  5. How many beds the listing has

Notice how the first two explanatory variables listed are actually categorical variables. In Section 10 we'll discuss how only numerical variables should be used as the response variable. However, we are actually able to incorporate categorical explanatory variables into a linear regression model in the following way.

Below, we use the pd.get_dummies() function to create a set of what we call 0/1 indicator variables (or dummy variables) for each of our categorical explanatory variables. We do this for both our training features matrix and our test features matrix.

X_train_dummies = pd.get_dummies(X_train, drop_first=True)
X_train_dummies.head()

      accommodates  bedrooms  beds  neighborhood_Logan Square  neighborhood_Near North Side  neighborhood_Near West Side  neighborhood_West Town  room_type_Private room
1193             5       2.0   2.0                          1                             0                            0                       0                       0
1392             8       4.0   4.0                          1                             0                            0                       0                       0
 338            12       4.0   5.0                          0                             0                            0                       0                       0
1844             5       2.0   2.0                          0                             0                            0                       1                       0
 359             4       2.0   2.0                          0                             0                            0                       0                       0

X_test_dummies = pd.get_dummies(X_test, drop_first=True)
X_test_dummies.head()

      accommodates  bedrooms  beds  neighborhood_Logan Square  neighborhood_Near North Side  neighborhood_Near West Side  neighborhood_West Town  room_type_Private room
2592             2       1.0   1.0                          0                             0                            1                       0                       0
 139             1       1.0   1.0                          0                             0                            0                       0                       1
 471             2       1.0   1.0                          0                             0                            0                       0                       0
2015             5       2.0   2.0                          0                             0                            0                       1                       0
3324             6       2.0   3.0                          0                             1                            0                       0                       0
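
To see what pd.get_dummies() is doing on a dataset small enough to check by eye, here is a minimal sketch. The `toy` dataframe and its values are made up for illustration and are not part of the Airbnb data; `dtype=int` is passed only so that the indicators print as 0/1, since recent pandas versions return True/False by default.

```python
import pandas as pd

# Hypothetical toy data (NOT the Airbnb dataset), just to illustrate the encoding
toy = pd.DataFrame({
    "accommodates": [2, 4, 3],
    "neighborhood": ["Lake View", "West Town", "Logan Square"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
})

# drop_first=True drops the first level (in sorted order) of each categorical column
toy_dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

print(toy_dummies.columns.tolist())
print(toy_dummies)
```

In this toy example "Lake View" and "Entire home/apt" come first alphabetically, so they are the levels that do not receive an indicator column. The first row (a Lake View entire home) therefore has a 0 in every indicator column.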

Representing the Neighborhood Variable

By setting the parameter drop_first=True in the pd.get_dummies() function, we end up with only four indicator variables…

  • neighborhood_Logan Square
  • neighborhood_Near North Side
  • neighborhood_Near West Side
  • neighborhood_West Town

… that correspond to the five levels of our original neighborhood_cleansed variable

  • Logan Square
  • Near North Side
  • Near West Side
  • West Town
  • Lake View

How do we go about interpreting these four indicator variables?

  • If neighborhood_Logan Square=1, then that means the corresponding listing is in Logan Square.
  • Otherwise, if neighborhood_Logan Square=0, then this means that the corresponding listing is not in Logan Square.

You go about interpreting the remaining three indicator variables similarly, for instance…

  • If neighborhood_Near North Side=1, then that means the corresponding listing is in Near North Side.
  • Otherwise, if neighborhood_Near North Side=0, then this means that the corresponding listing is not in Near North Side.

Notice how the Lake View neighborhood was not assigned an indicator variable. This is fine. To represent a listing in the single "left out" level, Lake View, all we need to do is set…

  • neighborhood_Logan Square=0
  • neighborhood_Near North Side=0
  • neighborhood_Near West Side=0
  • neighborhood_West Town=0

By process of elimination, the listing MUST then belong to Lake View.
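
The process-of-elimination rule above can be sketched in code as follows. The helper `decode_neighborhood` and the three example rows are hypothetical and written only for illustration; the column names are the ones produced by pd.get_dummies() earlier in this section.

```python
import pandas as pd

indicator_cols = [
    "neighborhood_Logan Square",
    "neighborhood_Near North Side",
    "neighborhood_Near West Side",
    "neighborhood_West Town",
]

def decode_neighborhood(row, reference="Lake View"):
    """Recover the neighborhood implied by one row of 0/1 indicators."""
    for col in indicator_cols:
        if row[col] == 1:
            # Strip the prefix that pd.get_dummies() added to the level name
            return col.replace("neighborhood_", "", 1)
    # All four indicators are 0, so the listing must be in the left-out level
    return reference

# Three made-up rows: Logan Square, West Town, and all zeros
rows = pd.DataFrame(
    [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]],
    columns=indicator_cols,
)
decoded = rows.apply(decode_neighborhood, axis=1).tolist()
print(decoded)  # ['Logan Square', 'West Town', 'Lake View']
```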

Representing the room_type Variable

Notice how we also used only one indicator variable…

  • room_type_Private room

… that corresponds to the two levels of our original room_type variable

  • Private room
  • Entire home/apt

Similarly we interpret this one variable as follows.

  • If room_type_Private room=1, then that means the corresponding listing is a Private room.
  • Otherwise, if room_type_Private room=0, then this means that the corresponding listing is not a Private room (and thus by process of elimination it must be an Entire home/apt).

Why do we do it this way?

You may be asking: "why could we not have used five indicator variables that corresponded to each of the five levels of the neighborhood variable, for instance?" Or in other words, why can we not represent our linear regression equation like this?

$\hat{price}=\hat{\beta}_0 +\hat{\beta}_1 accommodates+\hat{\beta}_2 bedrooms+\hat{\beta}_3 beds+\hat{\beta}_4 neighborhood_{Logan\_Square}$
$\qquad +\hat{\beta}_5 neighborhood_{Near\_North\_Side}+\hat{\beta}_6 neighborhood_{Near\_West\_Side}+\hat{\beta}_7 neighborhood_{West\_Town}$
$\qquad +\mathbf{\hat{\beta}_8neighborhood_{Lake\_View}}+\hat{\beta}_9 roomtype_{private\_room}+\mathbf{\hat{\beta}_{10} roomtype_{entire\_home/apt}}$

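Unfortunately, if we were to set up our linear regression model like this, using indicator variables that are not technically needed, then when we use our Calculus techniques to solve for the optimal values of $\hat{\beta}_0,\hat{\beta}_1,...,\hat{\beta}_{10}$ we end up running into "multiple solution errors". The reason is that, for every listing, the five neighborhood indicators always add up to 1 (and so do the two room type indicators), which duplicates the information already carried by the intercept $\hat{\beta}_0$. As a result, many different combinations of coefficient values give exactly the same predictions, and there is no single best solution to solve for.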
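
The redundancy behind this problem can be checked numerically. The sketch below uses a small made-up dataset (not the Airbnb data) with one numerical variable and a five-level categorical variable, and compares the rank of the features matrix when all five indicators are kept against the rank when one is dropped. A matrix whose rank is smaller than its number of columns does not have a unique least-squares solution.

```python
import numpy as np

# Hypothetical data: 10 listings, neighborhoods coded 0..4 (each level appears twice)
neighborhood = np.array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4])
accommodates = np.array([2.0, 4.0, 3.0, 5.0, 6.0, 8.0, 1.0, 2.0, 7.0, 4.0])

all_five = np.eye(5)[neighborhood]      # one 0/1 indicator column per level
intercept = np.ones((10, 1))            # the column that multiplies beta_0

# The five indicators sum to 1 in every row, exactly matching the intercept column
print(np.allclose(all_five.sum(axis=1), 1.0))

# Features matrix with all five indicators (7 columns) vs. with one dropped (6 columns)
X_full = np.column_stack([intercept, accommodates, all_five])
X_drop = np.column_stack([intercept, accommodates, all_five[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_drop = np.linalg.matrix_rank(X_drop)
print(X_full.shape[1], rank_full)   # more columns than rank: no unique solution
print(X_drop.shape[1], rank_drop)   # columns equal rank: unique solution
```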

Therefore, when representing a categorical explanatory variable with p levels in a regression model that includes an intercept, you should use exactly p-1 indicator variables.

Therefore, the model that we are going to fit is shown below.

$\hat{price}=\hat{\beta}_0+\hat{\beta}_1accommodates+\hat{\beta}_2 bedrooms+\hat{\beta}_3beds$
$\qquad +\hat{\beta}_4neighborhood_{Logan\_Square}+\hat{\beta}_5neighborhood_{Near\_North\_Side}$
$\qquad +\hat{\beta}_6 neighborhood_{Near\_West\_Side}+\hat{\beta}_7 neighborhood_{West\_Town}$
$\qquad +\hat{\beta}_8 roomtype_{private\_room}$
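
As a quick worked illustration of how this equation behaves, consider two kinds of listings. For an Entire home/apt in Lake View, all five indicator variables equal 0, so the right-hand side of the equation reduces to

$\hat{\beta}_0+\hat{\beta}_1accommodates+\hat{\beta}_2 bedrooms+\hat{\beta}_3beds$

For a Private room in West Town, the West Town and private room indicators equal 1 and the rest equal 0, so the right-hand side becomes

$(\hat{\beta}_0+\hat{\beta}_7+\hat{\beta}_8)+\hat{\beta}_1accommodates+\hat{\beta}_2 bedrooms+\hat{\beta}_3beds$

In other words, each indicator variable that equals 1 simply shifts the intercept by the amount of its coefficient, while the slopes on the numerical variables stay the same.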

Reference Level

For a given categorical explanatory variable, the level that is not assigned an indicator variable is called the reference level.

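Also note that the level chosen as the reference level is not important. Your fitted model will end up giving you the same predicted values, regardless of your choice. What does change is the individual coefficient values, because the intercept and each indicator variable's slope are interpreted relative to the reference level.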
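
One way to convince yourself of this is to fit the same model twice with different reference levels and compare the predictions. The sketch below does this on simulated data (not the Airbnb data) with a single three-level categorical variable, using np.linalg.lstsq as a stand-in for whatever regression routine you prefer. The helper `fit_predict` and all of the simulated values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40

# Simulated data: one numerical variable and a 3-level categorical variable
level = np.arange(n) % 3                       # levels 0, 1, 2 in rotation
x = rng.normal(size=n)
level_effect = np.array([0.0, 20.0, -5.0])[level]
y = 50 + 10 * x + level_effect + rng.normal(size=n)

indicators = np.eye(3)[level]                  # one 0/1 column per level

def fit_predict(drop):
    """Fit least squares with level `drop` as the reference level."""
    keep = [j for j in range(3) if j != drop]
    X = np.column_stack([np.ones(n), x, indicators[:, keep]])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, X @ beta

beta_a, pred_a = fit_predict(drop=0)           # level 0 is the reference
beta_b, pred_b = fit_predict(drop=2)           # level 2 is the reference

same_predictions = np.allclose(pred_a, pred_b)
print(same_predictions)                        # predictions agree
print(beta_a[0], beta_b[0])                    # intercepts differ
```

The two fits produce matching predictions, but the intercepts differ because each intercept describes a different baseline level.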