Finding a Parsimonious Model


A Parsimonious Model

Thus, if our goal is to build a model that yields good predictions for new datasets, then ideally we want to neither underfit nor overfit the model. So our goal is to find what we call a parsimonious model, which aims to strike a balance between overfitting and underfitting the model to the training dataset.

In the context of feature selection decisions, a parsimonious model tries to strike the ideal balance between the following:

  1. having a low enough number of explanatory variables to avoid overfitting while
  2. having a high enough model fit to avoid underfitting.

Adjusted R^2: Measuring Model Parsimoniousness

Thus, if our goal is to find a parsimonious model that has been trained on a given training dataset, we may ask ourselves the following question: how might we quantify the parsimoniousness of a given model? Or in other words, to what extent does the model strike this ideal balance between overfitting and underfitting to the training dataset?

One of the most common ways to measure the parsimoniousness of a linear regression model in particular is to use what we call the adjusted R^2 of the linear regression model. It's called the adjusted R^2 because we take the equation for the $R^2$ of a given model and slightly adjust it.

R^2

Recall that the equation for calculating the R^2 of a linear regression model is the following.
$$R^2=\frac{SSR}{SST}=\frac{SST-SSE}{SST}=1-\frac{SSE}{SST}$$

Adjusted R^2

The adjusted R^2 modifies this expression $1-\frac{SSE}{SST}$ with what we might call a predictor penalty term $(\frac{n-1}{n-p-1})$ as follows.

$$R^2_{adj}=1-\frac{SSE}{SST}\cdot(\frac{n-1}{n-p-1})$$

  • $n$ = the sample size
  • $p$ = the number of slopes in the regression model
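
For concreteness, here is a minimal sketch of this calculation as a plain Python function, assuming we already know the model's SSE, SST, sample size $n$, and number of slopes $p$ (the function name is just for illustration).

def adjusted_r2(sse, sst, n, p):
    """Adjusted R^2 = 1 - (SSE/SST) * (n - 1) / (n - p - 1)."""
    return 1 - (sse / sst) * (n - 1) / (n - p - 1)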

Calculating the Adjusted R^2

So how does this slightly modified version of the $R^2$ work? Let's return to our best final model from Module 8 for predicting new Chicago Airbnb prices and calculate the adjusted R^2 for the model.

Best Model without Interaction Terms from Module 8

\[ \begin{aligned} \hat{price} &= -85.59 \\ &\quad + 13.36 \cdot \text{accommodates} \\ &\quad + 94.59 \cdot \text{bedrooms} \\ &\quad - 11.29 \cdot \text{neighborhood}_{\text{Logan_Square}} \\ &\quad + 105.15 \cdot \text{neighborhood}_{\text{Near_North_Side}} \\ &\quad + 48.06 \cdot \text{neighborhood}_{\text{Near_West_Side}} \\ &\quad + 36.07 \cdot \text{neighborhood}_{\text{West_Town}} \\ &\quad + 17.16 \cdot \text{roomtype}_{\text{private_room}} \end{aligned} \]
import pandas as pd

df = pd.read_csv('chicago_airbnb_listings_cleaned_for_regression.csv')
df.head()

price neighborhood room_type accommodates bedrooms beds
0 379 Lake View Entire home/apt 7 3.0 5.0
1 479 Lake View Entire home/apt 9 4.0 5.0
2 479 Lake View Entire home/apt 9 4.0 5.0
3 75 Lake View Entire home/apt 2 1.0 1.0
4 75 Lake View Private room 2 1.0 1.0

Note: We'll use the same random seed to create our training and test datasets as well.

from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf

df_train, df_test = train_test_split(df, test_size=0.2, random_state=101)
current_best_model = smf.ols(formula='price~neighborhood+room_type+bedrooms+accommodates', data=df_train).fit()
current_best_model.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.446
Model: OLS Adj. R-squared: 0.444
Method: Least Squares F-statistic: 188.9
Date: Fri, 11 Aug 2023 Prob (F-statistic): 2.32e-205
Time: 19:48:52 Log-Likelihood: -10765.
No. Observations: 1648 AIC: 2.155e+04
Df Residuals: 1640 BIC: 2.159e+04
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -85.5892 12.597 -6.794 0.000 -110.297 -60.881
neighborhood[T.Logan Square] -11.2928 13.648 -0.827 0.408 -38.063 15.477
neighborhood[T.Near North Side] 105.1499 12.545 8.382 0.000 80.544 129.756
neighborhood[T.Near West Side] 48.0569 14.390 3.340 0.001 19.833 76.281
neighborhood[T.West Town] 36.0702 12.596 2.864 0.004 11.364 60.776
room_type[T.Private room] 17.1569 12.389 1.385 0.166 -7.144 41.457
bedrooms 94.5932 6.753 14.007 0.000 81.348 107.839
accommodates 13.3633 2.504 5.337 0.000 8.452 18.274
Omnibus: 2315.322 Durbin-Watson: 1.974
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1234847.233
Skew: 7.662 Prob(JB): 0.00
Kurtosis: 136.223 Cond. No. 35.0


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We can actually look up the adjusted R^2 value for a given model and training dataset in our .summary() output table or extract it using the .rsquared_adj attribute.

current_best_model.rsquared_adj

0.44395199961275267
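
As a quick check of the formula, we can also reproduce this number by hand from the fitted model's attributes (this sketch assumes statsmodels' usual attribute names, where .ssr is the SSE, .centered_tss is the SST, and .df_model is the number of slopes p).

# Reproduce the adjusted R^2 by hand from the fitted model's attributes
sse = current_best_model.ssr            # SSE
sst = current_best_model.centered_tss   # SST
n = current_best_model.nobs             # 1648 observations
p = current_best_model.df_model         # 7 slopes
1 - (sse / sst) * (n - 1) / (n - p - 1) # approximately 0.4440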

The Adjusted R^2 "Balancing Act"

Recall from Module 8 that we did not see a strong association between neighborhood and price in the training dataset. Could this be a sign that our 4 neighborhood indicator variables do not bring enough predictive power to the model? Let's test this with our adjusted R^2 metric.

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x='neighborhood', y='price', data=df_train)
plt.ylim([0, 1000])
plt.show()
[Figure: side-by-side boxplots of price by neighborhood for the training dataset]

Comparing Two Models

Let's see what the adjusted R^2 suggests by comparing two models. Our new model drops these 4 neighborhood indicator variables.

Old Model

\[ \begin{aligned} \hat{price} &= \hat{\beta}_0 \\ &\quad + \hat{\beta}_1 \cdot \text{accommodates} \\ &\quad + \hat{\beta}_2 \cdot \text{bedrooms} \\ &\quad + \hat{\beta}_3 \cdot \text{neighborhood}_{\text{Logan_Square}} \\ &\quad + \hat{\beta}_4 \cdot \text{neighborhood}_{\text{Near_North_Side}} \\ &\quad + \hat{\beta}_5 \cdot \text{neighborhood}_{\text{Near_West_Side}} \\ &\quad + \hat{\beta}_6 \cdot \text{neighborhood}_{\text{West_Town}} \\ &\quad + \hat{\beta}_7 \cdot \text{roomtype}_{\text{private_room}} \end{aligned} \]

New Model

\[ \begin{aligned} \hat{price} &= \hat{\beta}_0 \\ &\quad + \hat{\beta}_1 \cdot \text{accommodates} \\ &\quad + \hat{\beta}_2 \cdot \text{bedrooms} \\ &\quad + \hat{\beta}_3 \cdot \text{roomtype}_{\text{private_room}} \end{aligned} \]

Two Competing Effects

Before even fitting the new model, we can already see that two "competing" effects will act on this adjusted R^2 equation when we drop the number of slopes from p=7 to p=3.

$$R^2_{adj}=1-\frac{SSE}{SST}\cdot(\frac{n-1}{n-p-1})$$

  • Effect 1: The number of slopes p decreases, which encourages $R^2_{adj}$ to increase.

    • Because p decreases, $n-p-1$ increases.
    • Thus, the penalty term $(\frac{n-1}{n-p-1})$ decreases.
    • Thus, $R^2_{adj}$ will be encouraged to increase.
  • Effect 2: The new model error (SSE) most likely increases, which encourages $R^2_{adj}$ to decrease.

    • With every new explanatory variable or indicator variable, say $x_r$, that you add to a model, the model will use this variable as an opportunity to minimize the sum of squared residuals (i.e., the SSE), even if ever so slightly. In the unlikely scenario that this new variable $x_r$ provides no benefit at all, the model will simply set its corresponding slope $\hat{\beta}_r=0$, and you will, for all intents and purposes, end up with the same model that you had before: your remaining slopes would be set to the same values, and your SSE would stay the same.
    • Alternatively, by decreasing the number of explanatory or indicator variables in your model, you are giving your model fewer opportunities to decrease the SSE. So at best, your SSE will stay the same, but most likely it will get worse (i.e., the SSE will increase); a small synthetic sketch of this appears after this list.
    • Thus, the $\frac{SSE}{SST}$ term will stay the same at best, but will most likely increase.
    • Thus, $R^2_{adj}$ will most likely be encouraged to decrease.
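
Here is the small synthetic sketch of Effect 2 mentioned above (simulated numbers, not the Airbnb data): adding a pure-noise column to a model can only leave the training SSE the same or make it smaller, so dropping columns can only leave the SSE the same or make it larger.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': rng.normal(size=200)})
toy['y'] = 2 * toy['x'] + rng.normal(size=200)
toy['noise'] = rng.normal(size=200)       # unrelated to y by construction

small = smf.ols(formula='y ~ x', data=toy).fit()
large = smf.ols(formula='y ~ x + noise', data=toy).fit()
small.ssr, large.ssr                      # large.ssr will be <= small.ssr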

Interpreting the Adjusted R^2

So which one of these competing effects will "win"? This depends on how much of an increase in error (SSE) the new model will reflect by dropping the 4 neighborhood variables (or equivalently, how much of a decrease in fit $R^2=1-\frac{SSE}{SST}$ the new model will have).

Which Competing Effect "Wins"

$$R^2_{adj}=1-\frac{SSE}{SST}\cdot(\frac{n-1}{n-p-1})$$

Scenario #1: R^2 Fit Drop is Not Too Large $\rightarrow$ Adjusted R^2 Increases

In scenario #1, the increase in the error (SSE) of the new model (i.e., the decrease in the $R^2$ fit) is SMALL compared to the increase in the adjusted R^2 that decreasing $p$ would bring. Thus, the $R^2_{adj}$ will INCREASE overall. Because the SSE of the new model comparatively did not increase that much, we make the following interpretations.

  1. The slopes that we dropped do not bring enough predictive power to the model.
  2. Thus, if we were to have left them in the model, we may be overfitting.

Scenario #2: R^2 Fit Drop is Large Enough $\rightarrow$ Adjusted R^2 Decreases

In scenario #2, the increase in the error (SSE) of the new model (i.e., the decrease in the $R^2$ fit) is LARGE compared to the increase in the adjusted R^2 that decreasing $p$ would bring. Thus, the $R^2_{adj}$ will DECREASE overall. Because the SSE of the new model comparatively increased quite a bit, we make the following interpretations.

  1. The slopes that we dropped were bringing enough predictive power to the model.
  2. Thus, if we were to have left them out of the model, we may be underfitting.

Interpreting the Adjusted R^2

Thus, the higher the adjusted $R^2$ of a model, the more parsimonious we say that the model is, and therefore, the less likely the model is to be overfit to the training dataset.

Range of Adjusted R^2

Due to the modification, the adjusted R^2 no longer represents a percent. It can actually range from $(-\infty,1]$.
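
As a hypothetical numeric illustration (not taken from our Airbnb models), a weak fit combined with many slopes relative to the sample size can push the adjusted R^2 below zero.

# Hypothetical values: a weak fit (R^2 = 0.05) with p = 10 slopes and n = 20 observations
n, p = 20, 10
r2 = 0.05                                # i.e., SSE/SST = 0.95
1 - (1 - r2) * (n - 1) / (n - p - 1)     # about -1.006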

Warning about Interpreting Adjusted R^2!

Due to the modification, the adjusted R^2 no longer represents "the percent of response variable variability that is explained by the model"; that is what the regular $R^2$ represents. The adjusted $R^2$ is often not put into words. It simply represents a way to measure the parsimoniousness of a linear regression model.

Is the Neighborhood Variable Leading to Overfitting?

Finally, let's use the adjusted R^2 metric to evaluate whether we may be overfitting our linear regression model with the neighborhood variable.

Old Model

\[ \begin{aligned} \hat{price} &= \hat{\beta}_0 \\ &\quad + \hat{\beta}_1 \cdot \text{accommodates} \\ &\quad + \hat{\beta}_2 \cdot \text{bedrooms} \\ &\quad + \hat{\beta}_3 \cdot \text{neighborhood}_{\text{Logan_Square}} \\ &\quad + \hat{\beta}_4 \cdot \text{neighborhood}_{\text{Near_North_Side}} \\ &\quad + \hat{\beta}_5 \cdot \text{neighborhood}_{\text{Near_West_Side}} \\ &\quad + \hat{\beta}_6 \cdot \text{neighborhood}_{\text{West_Town}} \\ &\quad + \hat{\beta}_7 \cdot \text{roomtype}_{\text{private_room}} \end{aligned} \]

We first calculate the adjusted R^2=0.444 of our old model that includes the neighborhood slopes.

old_model = smf.ols(formula='price~neighborhood+room_type+bedrooms+accommodates', data=df_train).fit()
old_model.rsquared_adj

0.44395199961275267

Candidate Model

\[ \begin{aligned} \hat{price} &= \hat{\beta}_0 \\ &\quad + \hat{\beta}_1 \cdot \text{accommodates} \\ &\quad + \hat{\beta}_2 \cdot \text{bedrooms} \\ &\quad + \hat{\beta}_3 \cdot \text{roomtype}_{\text{private_room}} \end{aligned} \]

We then fit our new model which has dropped the 4 neighborhood indicator variables. Then we calculate the adjusted R^2=0.409 of our new model.

new_model = smf.ols(formula='price~room_type+bedrooms+accommodates', data=df_train).fit()
new_model.rsquared_adj

0.40915551112305926

Conclusion

We see that the adjusted R^2 of our candidate model decreased. Thus, the adjusted R^2 is suggesting that the old model that included the neighborhood indicator variables is more parsimonious than the candidate model that dropped them.

 


                 Adjusted R^2
Old Model               0.444
Candidate Model         0.409

Therefore, despite the weak association that we observed between the neighborhood variable and the price variable in our side-by-side boxplots visualization, the adjusted R^2 metric is suggesting that the neighborhood indicator variables bring enough predictive power to the model, in the context of the other explanatory variables already included. Thus, it suggests that we are not likely to be overfitting to the training dataset by including these neighborhood indicator variables.

Thus, if our goal is to build a model that yields good predictions for new datasets, then the adjusted R^2 is suggesting that we should stick with our original model and keep the neighborhood variables.

Testing our Conclusion with Another Metric

Remember, however, that we know of another technique that can also help us assess whether a model will overfit to the training dataset. Because we set aside a random sample of observations as our test dataset, we can also evaluate overfitting by inspecting the fit (i.e., R^2) of each model on the test dataset.

Remember:
  • The R^2 evaluates the fit of a given model with any given dataset.
  • The adjusted R^2 evaluates the parsimoniousness of a given model and the dataset that trained it.

So we calculate the test dataset R^2=0.329 of the old model.

from sklearn.metrics import r2_score

# Test dataset target array
y_test = df_test['price']

# Test R^2 of old model
y_pred_test = old_model.predict(df_test)
r2_old = r2_score(y_test, y_pred_test)
r2_old

0.3293011996166699

And we calculate the test dataset R^2=0.295 of the candidate model.

# Test R^2 of new model
y_pred_test = new_model.predict(df_test)
r2_new = r2_score(y_test, y_pred_test)
r2_new

0.2947119105362944

Corroborating Conclusion

Great! The test dataset R^2 model fit was higher (i.e., better) for the old model than for the candidate model. This also suggests that we are, in fact, not overfitting to the training dataset by including the neighborhood indicator variables. Because even the test dataset fit improved when we included the neighborhood indicator variables, these variables appear to bring enough predictive power to the model.

 


                 Adjusted R^2    Test Data R^2
Old Model               0.444            0.329
Candidate Model         0.409            0.295
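
If it's helpful, the table above can be reproduced with a short summary DataFrame that reuses the old_model, new_model, and df_test objects defined earlier (this is just one possible way to collect the two metrics side by side).

import pandas as pd
from sklearn.metrics import r2_score

comparison = pd.DataFrame({
    'Adjusted R^2': [old_model.rsquared_adj, new_model.rsquared_adj],
    'Test Data R^2': [r2_score(df_test['price'], old_model.predict(df_test)),
                      r2_score(df_test['price'], new_model.predict(df_test))]
}, index=['Old Model', 'Candidate Model'])
comparison.round(3)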

Training Dataset Adjusted R^2 vs. Test Dataset R^2

What we just observed were two techniques that corroborated the same suggestion. These two techniques may not always agree about whether a given model is overfitting. However, when they do agree, this builds more confidence that the model we have selected will in fact yield better predictions for new datasets.

In addition, calculating and interpreting the adjusted R^2 of a model and the dataset that was used to train it can be particularly useful when you don't want to sacrifice a portion of your original dataset for a test dataset that is used only for evaluation.
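
For example, here is a minimal sketch of how one might rank a handful of candidate formulas using only the training dataset's adjusted R^2, with no test dataset set aside (the candidate formulas listed here are illustrative, not an exhaustive search).

import statsmodels.formula.api as smf

# Compare a few candidate models by their training-set adjusted R^2
candidates = [
    'price~bedrooms',
    'price~bedrooms+accommodates',
    'price~room_type+bedrooms+accommodates',
    'price~neighborhood+room_type+bedrooms+accommodates',
]
for formula in candidates:
    fit = smf.ols(formula=formula, data=df_train).fit()
    print(round(fit.rsquared_adj, 3), formula)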