Forward Selection Algorithm


Another type of heuristic feature selection technique that is often used is what we call a forward selection algorithm.

We can similarly use this type of algorithm to help us select a good combination of explanatory variables to include in a model, one that optimizes some metric we are interested in. For instance, we use the algorithm below in an attempt to find the model that maximizes the adjusted R^2. However, you can use this same algorithm template to select a model that optimizes some other metric as well.

Forward Selection Algorithm - Attempting to Maximize the Adjusted R^2

Goal: Attempt to find the model with the highest Adjusted R^2

Steps:

  1. Fit a “current model” and find the adjusted R^2 of this model.

    In the beginning, your “current model” should include NONE OF the possible explanatory variables you are considering.

  2. For each explanatory variable that is NOT in your “current model”, do the following:

    • Fit a “test model”. Your “test model” should include every explanatory variable in the “current model”, plus the explanatory variable you are considering.
    • Find the adjusted R^2 of this “test model”.
  3. If NONE of the “test models” from step (2) had an adjusted R^2 that was higher than the adjusted R^2 of the “current model”, then STOP THE ALGORITHM and return the “current model” as your “final model”.

  4. Otherwise, choose the “test model” from step (2) that had the highest adjusted R^2, and set your new “current model” to be this “test model”. Then go back to step (2).
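
These steps can also be written as a short loop. Below is a minimal sketch of a forward_selection helper (the function name and structure are our own illustration, not part of statsmodels), assuming a pandas DataFrame such as df_train, the name of the response variable, and a list of candidate explanatory variable names:

import statsmodels.formula.api as smf

def forward_selection(df, response, candidates):
    # Step 1: the initial "current model" is intercept-only
    selected = []
    current_model = smf.ols(formula=response + '~1', data=df).fit()
    current_adj_r2 = current_model.rsquared_adj

    while True:
        best_var, best_adj_r2, best_model = None, current_adj_r2, current_model
        # Step 2: fit a "test model" for each candidate not yet in the current model
        for var in candidates:
            if var in selected:
                continue
            formula = response + '~' + '+'.join(selected + [var])
            test_model = smf.ols(formula=formula, data=df).fit()
            if test_model.rsquared_adj > best_adj_r2:
                best_var, best_adj_r2, best_model = var, test_model.rsquared_adj, test_model
        # Step 3: if no test model improved the adjusted R^2, stop and return the current model
        if best_var is None:
            return current_model, selected
        # Step 4: otherwise, the best test model becomes the new current model; repeat
        selected.append(best_var)
        current_model, current_adj_r2 = best_model, best_adj_r2

Each pass of the while loop corresponds to one iteration of the algorithm; the walkthrough below traces these same iterations by hand.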

Using a Forward Selection Algorithm

Now let's use a forward selection algorithm to try to find the linear regression model (not including interaction terms) that has the highest adjusted R^2 when it comes to predicting Airbnb price using some subset of the following 5 explanatory variables.

  1. neighborhood
  2. room_type
  3. accommodates
  4. bedrooms
  5. beds

Is it possible that this forward selection algorithm will find a better result than the backwards elimination algorithm did? Let's see!

Iteration 1

1.1 Fit the Current Model

Let's fit the current model using none of the possible explanatory variables. That is, we will fit a model that has just an intercept. We can represent this in the .ols() function by putting a '1' where the explanatory variable names usually go. We then calculate the adjusted R^2 of this current model.

current_model = smf.ols(formula='price~1', data=df_train).fit()
current_model.rsquared_adj

0.0

1.2 Try out 5 Test Models

Next, one at a time, let's add each of our 5 possible explanatory variables and find the adjusted R^2 values of each of the 5 resulting test models.

#Adds neighborhood
test_model = smf.ols(formula='price~neighborhood', data=df_train).fit()
test_model.rsquared_adj

0.013688191780071524

#Adds room_type
test_model = smf.ols(formula='price~room_type', data=df_train).fit()
test_model.rsquared_adj

0.026634999791649516

#Adds accommodates
test_model = smf.ols(formula='price~accommodates', data=df_train).fit()
test_model.rsquared_adj

0.35361688843327155

#Adds bedrooms
test_model = smf.ols(formula='price~bedrooms', data=df_train).fit()
test_model.rsquared_adj

0.39846091969795683

#Adds beds
test_model = smf.ols(formula='price~beds', data=df_train).fit()
test_model.rsquared_adj

0.3563750646908159
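
Rather than fitting each test model by hand, we could also loop over the candidate variables; a minimal sketch, assuming the same df_train and the 5 candidates listed above:

# Fit one test model per candidate and report its adjusted R^2
for var in ['neighborhood', 'room_type', 'accommodates', 'bedrooms', 'beds']:
    test_model = smf.ols(formula='price~' + var, data=df_train).fit()
    print(var, test_model.rsquared_adj)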

1.3. Continuing the Algorithm

Out of our test models, the one that added bedrooms had the highest adjusted R^2 value (0.398). Furthermore, because this test model's adjusted R^2 is higher than that of the current model, we permanently add bedrooms to the current model. We then continue on to the next iteration.

Iteration 2

2.1 Fit the Current Model

current_model = smf.ols(formula='price~bedrooms', data=df_train).fit()
current_model.rsquared_adj

0.39846091969795683

2.2 Try out 4 Test Models

Next, one at a time, let's add each of our 4 remaining possible explanatory variables and find the adjusted R^2 values of each of the 4 resulting test models.

#Adds neighborhood
test_model = smf.ols(formula='price~bedrooms+neighborhood', data=df_train).fit()
test_model.rsquared_adj

0.4346350063083634

#Adds room_type
test_model = smf.ols(formula='price~bedrooms+room_type', data=df_train).fit()
test_model.rsquared_adj

0.39809916410905866

#Adds accommodates
test_model = smf.ols(formula='price~bedrooms+accommodates', data=df_train).fit()
test_model.rsquared_adj

0.40942092917535045

#Adds beds
test_model = smf.ols(formula='price~bedrooms+beds', data=df_train).fit()
test_model.rsquared_adj

0.4092784403701947

2.3. Continuing the Algorithm

Out of our test models, the one that added neighborhood had the highest adjusted R^2 value (0.435). Furthermore, because this test model's adjusted R^2 is higher than that of the current model, we permanently add neighborhood to the current model. We then continue on to the next iteration.

Iteration 3

3.1 Fit the Current Model

current_model = smf.ols(formula='price~bedrooms+neighborhood', data=df_train).fit()
current_model.rsquared_adj

0.4346350063083634

3.2 Try out 3 Test Models

Next, one at a time, let's add each of our 3 remaining possible explanatory variables and find the adjusted R^2 values of each of the 3 resulting test models.

#Adds room_type
test_model = smf.ols(formula='price~bedrooms+neighborhood+room_type', data=df_train).fit()
test_model.rsquared_adj

0.43463809000567954

#Adds accommodates
test_model = smf.ols(formula='price~bedrooms+neighborhood+accommodates', data=df_train).fit()
test_model.rsquared_adj

0.4436410368879983

#Adds beds
test_model = smf.ols(formula='price~bedrooms+neighborhood+beds', data=df_train).fit()
test_model.rsquared_adj

0.4429395769247103

3.3. Continuing the Algorithm

Out of our test models, the one that added accommodates had the highest adjusted R^2 value (0.444). Furthermore, because this test model's adjusted R^2 is higher than that of the current model, we permanently add accommodates to the current model. We then continue on to the next iteration.

Iteration 4

4.1 Fit the Current Model

current_model = smf.ols(formula='price~bedrooms+neighborhood+accommodates', data=df_train).fit()
current_model.rsquared_adj

0.4436410368879983

4.2 Try out 2 Test Models

Next, one at a time, let's add each of our 2 remaining possible explanatory variables and find the adjusted R^2 values of each of the 2 resulting test models.

#Adds room_type
test_model = smf.ols(formula='price~bedrooms+neighborhood+accommodates+room_type', data=df_train).fit()
test_model.rsquared_adj

0.44395199961275267

#Adds beds
test_model = smf.ols(formula='price~bedrooms+neighborhood+accommodates+beds', data=df_train).fit()
test_model.rsquared_adj

0.445610618510514

4.3. Continuing the Algorithm

Out of our test models, the one that added beds had the highest adjusted R^2 value (0.446). Furthermore, because this test model's adjusted R^2 is higher than that of the current model, we permanently add beds to the current model. We then continue on to the next iteration.

Iteration 5

5.1 Fit the Current Model

current_model = smf.ols(formula='price~bedrooms+neighborhood+accommodates+beds', data=df_train).fit()
current_model.rsquared_adj

0.445610618510514

5.2 Try out 1 Test Model

Next, let's add our 1 remaining possible explanatory variable and find the adjusted R^2 value of this 1 resulting test model.

#Adds room_type
test_model = smf.ols(formula='price~bedrooms+neighborhood+accommodates+beds+room_type', data=df_train).fit()
test_model.rsquared_adj

0.4457419680875391

5.3. Stopping the Algorithm

Because this test model that adds room_type has an adjusted R^2 higher than that of the current model, we permanently add room_type to the current model. We would then continue on to the next iteration, but because there are no more explanatory variables to consider, we stop with the current model, which now includes all 5 possible explanatory variables.

final_model = smf.ols(formula='price~bedrooms+neighborhood+accommodates+beds+room_type', data=df_train).fit()
final_model.rsquared_adj

0.4457419680875391

Algorithm Conclusion and Additional Insights

Hence, the result of our forward selection algorithm ended up agreeing with the result of the backwards elimination algorithm. This is not guaranteed to always be the case.

In addition, by going through this forward selection algorithm we also gained additional insight into how important each of these explanatory variables is considered to be with respect to model parsimoniousness when added iteratively to the model.

  1. Because bedrooms was added first, we might say that bedrooms contributes the most overall to model parsimoniousness.
  2. Then neighborhood
  3. Then accommodates
  4. Then beds
  5. Then finally room_type, which is the least important to building model parsimoniousness given the other variables that were iteratively added.
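
This order of addition is exactly what the selected list returned by the forward_selection sketch from earlier would record (again, forward_selection is our own illustrative helper, not a statsmodels function):

# Order of addition, per the walkthrough above
final_model, selected = forward_selection(
    df_train, 'price',
    ['neighborhood', 'room_type', 'accommodates', 'bedrooms', 'beds'])
selected
# Per the iterations traced above: ['bedrooms', 'neighborhood', 'accommodates', 'beds', 'room_type']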