Backwards Elimination Algorithm


One type of heuristic feature selection technique that is often used is what we call a backwards elimination algorithm.

We can use this type of algorithm to help us select a good combination of explanatory variables to include in a model, one that optimizes some metric that we are interested in. For instance, we use the algorithm below in an attempt to find the model that maximizes the adjusted R^2. However, you can use this same algorithm template to select a model that optimizes some other metric as well.

Backwards Elimination - Attempting to Maximize the Adjusted R^2

Goal: Attempt to find the model with the highest Adjusted R^2

Steps:

  1. Fit a “current model” and find the adjusted R^2 of this model.

    In the beginning, your “current model” should include all possible explanatory variables you are considering.

  2. For each explanatory variable in your “current model” do the following:

    • Fit a “test model”. Your “test model” should include every explanatory variable in the “current model” except for the explanatory variable you are considering.
    • Find the adjusted R^2 of this “test model”.
  3. If NONE of the “test models” from step (2) had an adjusted R^2 that was higher than the adjusted R^2 of the “current model”, then STOP THE ALGORITHM, and return the “current model” as your “final model”.

  4. Otherwise, choose the “test model” from step (2) that had the highest adjusted R^2, and set your new “current model” to be this “test model”. Then go back to step (2).
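
These four steps translate directly into code. Below is a minimal sketch in Python using statsmodels' formula API; it assumes the data lives in a pandas DataFrame, and the function name backwards_elimination (along with its arguments) is our own illustration rather than a library routine.

import statsmodels.formula.api as smf

def backwards_elimination(df, response, candidates):
    """Greedy backwards elimination that tries to maximize adjusted R^2."""
    current_vars = list(candidates)
    # Step 1: the starting "current model" uses all candidate variables
    current_model = smf.ols(
        formula=f"{response}~{'+'.join(current_vars)}", data=df).fit()
    while True:
        best_adj_r2, best_vars = current_model.rsquared_adj, None
        # Step 2: fit one "test model" per variable, dropping that variable
        for var in current_vars:
            test_vars = [v for v in current_vars if v != var]
            formula = f"{response}~{'+'.join(test_vars) if test_vars else '1'}"
            test_model = smf.ols(formula=formula, data=df).fit()
            if test_model.rsquared_adj > best_adj_r2:
                best_adj_r2, best_vars = test_model.rsquared_adj, test_vars
        # Step 3: stop when no test model beats the current model
        if best_vars is None:
            return current_model, current_vars
        # Step 4: the best test model becomes the new current model
        current_vars = best_vars
        current_model = smf.ols(
            formula=f"{response}~{'+'.join(current_vars)}", data=df).fit()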

Using a Backwards Elimination Algorithm

Let's use a backwards elimination algorithm to try to find the linear regression model (not including interaction terms) that has the highest adjusted R^2 when predicting Airbnb price using some subset of the following 5 explanatory variables.

  1. neighborhood
  2. room_type
  3. accommodates
  4. bedrooms
  5. beds
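
Before stepping through the iterations by hand, note that the entire search could be run in one call with the backwards_elimination helper sketched above (again assuming df_train is a pandas DataFrame containing price and these 5 columns):

# Run the greedy search end-to-end with the helper sketched earlier
final_model, final_vars = backwards_elimination(
    df_train, 'price',
    ['neighborhood', 'room_type', 'accommodates', 'bedrooms', 'beds'])
print(final_vars, final_model.rsquared_adj)

The manual walkthrough below shows exactly what this call does internally.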

Iteration 1

1.1 Fit the Current Model

Let's fit the current model using all 5 possible explanatory variables and find the adjusted R^2.

# Fit the current model using all 5 candidate explanatory variables
current_model = smf.ols(formula='price~neighborhood+room_type+accommodates+bedrooms+beds', data=df_train).fit()
current_model.rsquared_adj

0.4457419680875391

1.2 Try Out 5 Test Models

Next, let's remove each of the 5 explanatory variables from the current model, one at a time, and find the adjusted R^2 value of each of the 5 resulting test models.

# Removes neighborhood
test_model = smf.ols(formula='price~room_type+accommodates+bedrooms+beds', data=df_train).fit()
test_model.rsquared_adj

0.4119667314108315

# Removes room_type
test_model = smf.ols(formula='price~neighborhood+accommodates+bedrooms+beds', data=df_train).fit()
test_model.rsquared_adj

0.4456106185105141

# Removes accommodates
test_model = smf.ols(formula='price~neighborhood+room_type+bedrooms+beds', data=df_train).fit()
test_model.rsquared_adj

0.44285244711769367

# Removes bedrooms
test_model = smf.ols(formula='price~neighborhood+room_type+accommodates+beds', data=df_train).fit()
test_model.rsquared_adj

0.4018417742349216

# Removes beds
test_model = smf.ols(formula='price~neighborhood+room_type+accommodates+bedrooms', data=df_train).fit()
test_model.rsquared_adj

0.4439519996127528
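
Writing out each test model by hand gets tedious as the number of variables grows. As a sketch (assuming the same smf import and df_train as above), a short loop fits all 5 test models at once:

# Drop each variable in turn and report the test model's adjusted R^2
candidates = ['neighborhood', 'room_type', 'accommodates', 'bedrooms', 'beds']
for dropped in candidates:
    kept = [v for v in candidates if v != dropped]
    test_model = smf.ols(formula='price~' + '+'.join(kept), data=df_train).fit()
    print(f'Removed {dropped}: adjusted R^2 = {test_model.rsquared_adj:.4f}')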

1.3 Stopping the Algorithm

Interestingly enough, this is where our algorithm stops! We don't need to go through any more iterations. Because none of the test models' adjusted R^2 values are better (i.e., higher) than that of the current model (0.4457), the algorithm stops and returns the current model, which includes ALL 5 possible explanatory variables.

# Refit the final model: the full model from iteration 1
final_model = smf.ols(formula='price~neighborhood+room_type+accommodates+bedrooms+beds', data=df_train).fit()
final_model.rsquared_adj

0.4457419680875391

Algorithm Conclusion

Hence, the model with the highest adjusted R^2 found by the backwards elimination algorithm is the full model. Remember, however, that this does not guarantee that the final model yields the highest adjusted R^2 out of all 2^5 = 32 possible linear regression models (one for each subset of the 5 candidate explanatory variables).
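
With only 5 candidate variables, it is actually feasible to verify this by brute force. Here is a sketch (again assuming the same smf and df_train) that fits all 31 non-empty subsets and tracks the best adjusted R^2:

from itertools import combinations

# Fit every non-empty subset of candidates and keep the best adjusted R^2
candidates = ['neighborhood', 'room_type', 'accommodates', 'bedrooms', 'beds']
best_adj_r2, best_subset = float('-inf'), None
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        model = smf.ols(formula='price~' + '+'.join(subset), data=df_train).fit()
        if model.rsquared_adj > best_adj_r2:
            best_adj_r2, best_subset = model.rsquared_adj, subset
print(best_subset, best_adj_r2)

With 5 variables this is only 31 model fits, but the count doubles with each additional variable, which is why greedy heuristics like backwards elimination are used in the first place.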