Backwards Elimination Algorithm
One type of heuristic feature selection technique that is often used is what we call a backwards elimination algorithm.
We can use this type of algorithm to help us select a good combination of explanatory variables to include in a model, one that optimizes some metric that we are interested in. For instance, we use the algorithm below in an attempt to find the model that maximizes the adjusted R^2. However, you can use this same algorithm template to select a model that optimizes some other metric as well.
Backwards Elimination - Attempting to Maximize the Adjusted R^2
Goal: Attempt to find the model with the highest adjusted R^2
Steps:
1. Fit a “current model” and find the adjusted R^2 of this model. In the beginning, your “current model” should include all possible explanatory variables you are considering.
2. For each explanatory variable in your “current model”, do the following:
   - Fit a “test model”. Your “test model” should include every explanatory variable in the “current model” except for the explanatory variable you are considering.
   - Find the adjusted R^2 of this “test model”.
3. If NONE of the “test models” from step (2) had an adjusted R^2 that was higher than the adjusted R^2 for the “current model”, then STOP THE ALGORITHM and return the “current model” as your “final model”.
4. Otherwise, choose the “test model” from step (2) that had the highest adjusted R^2, and set your new “current model” to be this “test model”. Then go back to step (2).
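To make this template concrete, here is a minimal sketch of the loop in Python using statsmodels, assuming the training data lives in a pandas DataFrame and the metric is the adjusted R^2 (the function name and arguments are placeholders for illustration, not part of any library):

import statsmodels.formula.api as smf

def backwards_elimination(df, response, candidates):
    #Step 1: the initial "current model" uses all candidate variables
    current_vars = list(candidates)
    current_model = smf.ols(formula=response + '~' + '+'.join(current_vars), data=df).fit()
    while True:
        best_adj_r2, best_model, best_vars = current_model.rsquared_adj, None, None
        #Step 2: fit one "test model" per variable, dropping just that variable
        for var in current_vars:
            test_vars = [v for v in current_vars if v != var]
            formula = response + '~' + ('+'.join(test_vars) if test_vars else '1')
            test_model = smf.ols(formula=formula, data=df).fit()
            if test_model.rsquared_adj > best_adj_r2:
                best_adj_r2, best_model, best_vars = test_model.rsquared_adj, test_model, test_vars
        #Step 3: if no test model beat the current model, stop
        if best_model is None:
            return current_model
        #Step 4: otherwise, the best test model becomes the new current model
        current_model, current_vars = best_model, best_vars

For the Airbnb example below, this could be called as backwards_elimination(df_train, 'price', ['neighborhood', 'room_type', 'accommodates', 'bedrooms', 'beds']).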
Using a Backwards Elimination Algorithm
So let's use a backwards elimination algorithm to try to find the linear regression model (not including interaction terms) that has the highest adjusted R^2 for predicting Airbnb price, using some subset of the following 5 explanatory variables:
neighborhood
room_type
accommodates
bedrooms
beds
Iteration 1
1.1 Fit the Current Model
Let's fit the current model using all 5 possible explanatory variables and find the adjusted R^2.
import statsmodels.formula.api as smf

#Includes all 5 explanatory variables
current_model = smf.ols(formula='price~neighborhood+room_type+accommodates+bedrooms+beds', data=df_train).fit()
current_model.rsquared_adj
0.4457419680875391
1.2 Try out 5 Test Models
Next, let's remove each of our 5 explanatory variables from the current model, one at a time, and find the adjusted R^2 of each of the 5 resulting test models. (A loop that automates these five fits is sketched after the results.)
#Removes neighborhood
test_model = smf.ols(formula='price~room_type+accommodates+bedrooms+beds', data=df_train).fit()
test_model.rsquared_adj
0.4119667314108315
#Removes room_type
test_model = smf.ols(formula='price~neighborhood+accommodates+bedrooms+beds', data=df_train).fit()
test_model.rsquared_adj
0.4456106185105141
#Removes accommodates
test_model = smf.ols(formula='price~neighborhood+room_type+bedrooms+beds', data=df_train).fit()
test_model.rsquared_adj
0.44285244711769367
#Removes bedrooms
test_model = smf.ols(formula='price~neighborhood+room_type+accommodates+beds', data=df_train).fit()
test_model.rsquared_adj
0.4018417742349216
#Removes beds
test_model = smf.ols(formula='price~neighborhood+room_type+accommodates+bedrooms', data=df_train).fit()
test_model.rsquared_adj
0.4439519996127528
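As an aside, rather than copying the cell five times, we could fit all 5 test models with a loop; a quick sketch, assuming the same df_train and smf as above:

all_vars = ['neighborhood', 'room_type', 'accommodates', 'bedrooms', 'beds']
#Fits one test model per removed variable and prints its adjusted R^2
for removed in all_vars:
    kept = [v for v in all_vars if v != removed]
    test_model = smf.ols(formula='price~' + '+'.join(kept), data=df_train).fit()
    print(removed, test_model.rsquared_adj)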
1.3 Stopping the Algorithm
Interestingly enough, this is where our algorithm stops! We don't need to go through any more iterations. Because none of the test model adjusted R^2 values are better (i.e., higher) than that of the current model (0.4457), the algorithm stops and returns the current model, which includes ALL 5 possible explanatory variables.
final_model = smf.ols(formula='price~neighborhood+room_type+accommodates+bedrooms+beds', data=df_train).fit()
final_model.rsquared_adj
0.4457419680875391
Algorithm Conclusion
Hence, the model with the highest adjusted R^2 found by the backwards elimination algorithm is the full model. Remember, however, that this does not guarantee that this final model yields the highest adjusted R^2 out of all possible $2^5$ linear regression models; the algorithm is greedy, only ever considering the removal of one variable at a time, so it can miss better models that differ from the current model by two or more variables.
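With only 5 candidate variables, you could in principle verify this by brute force, fitting all $2^5 = 32$ possible models. A sketch of that exhaustive search (again assuming df_train and smf as above, and treating the empty subset as the intercept-only model):

from itertools import combinations

all_vars = ['neighborhood', 'room_type', 'accommodates', 'bedrooms', 'beds']
best_adj_r2, best_subset = float('-inf'), None
#Tries every subset of the candidate variables, from size 0 up to size 5
for k in range(len(all_vars) + 1):
    for subset in combinations(all_vars, k):
        formula = 'price~' + ('+'.join(subset) if subset else '1')
        adj_r2 = smf.ols(formula=formula, data=df_train).fit().rsquared_adj
        if adj_r2 > best_adj_r2:
            best_adj_r2, best_subset = adj_r2, subset
best_subset, best_adj_r2

Keep in mind that the number of models doubles with each additional candidate variable, which is exactly why heuristics like backwards elimination are used in the first place.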