Overfitting vs. Underfitting to a Dataset
Before exploring more linear regression candidate models for our Airbnb dataset, let's examine the artificial dataset below to learn more about the consequences of overfitting and underfitting a model to a given training dataset.
df_temp = pd.read_csv('artificial_dataset.csv')
df_temp
| | x | y |
---|---|---|
0 | -1.507185 | -1.700640 |
1 | 0.691179 | 0.351877 |
2 | -0.490616 | -1.218427 |
3 | -1.602506 | -1.843221 |
4 | -0.275597 | -0.354455 |
5 | -1.117979 | -1.769453 |
6 | 0.379086 | 0.257637 |
7 | -0.945600 | -1.293416 |
8 | 0.309061 | 0.031458 |
9 | -0.133834 | 0.050468 |
10 | 0.905778 | 1.267771 |
11 | -0.969701 | -0.810798 |
12 | -0.781417 | -0.997631 |
13 | 0.269303 | -0.026614 |
14 | 0.180662 | 0.348375 |
15 | 0.745693 | 0.620228 |
16 | 0.812167 | 0.544890 |
17 | 0.664805 | 0.660800 |
18 | 1.066191 | 1.140320 |
19 | -1.756251 | -1.521596 |
Suppose our goal is to build a model with this artificial dataset that yields good predictions of y from the explanatory variable x for new datasets. So again, let's randomly split this dataset into a training dataset and a test dataset.
Note: Holding out as much as 50% of the observations for the test dataset is a bit unusual, but we do this here for learning purposes.
df_temp_train, df_temp_test = train_test_split(df_temp, test_size=0.5, random_state=207)
df_temp_train
| | x | y |
---|---|---|
7 | -0.945600 | -1.293416 |
17 | 0.664805 | 0.660800 |
19 | -1.756251 | -1.521596 |
3 | -1.602506 | -1.843221 |
11 | -0.969701 | -0.810798 |
1 | 0.691179 | 0.351877 |
14 | 0.180662 | 0.348375 |
0 | -1.507185 | -1.700640 |
18 | 1.066191 | 1.140320 |
6 | 0.379086 | 0.257637 |
df_temp_test
| | x | y |
---|---|---|
8 | 0.309061 | 0.031458 |
4 | -0.275597 | -0.354455 |
16 | 0.812167 | 0.544890 |
2 | -0.490616 | -1.218427 |
12 | -0.781417 | -0.997631 |
10 | 0.905778 | 1.267771 |
15 | 0.745693 | 0.620228 |
13 | 0.269303 | -0.026614 |
9 | -0.133834 | 0.050468 |
5 | -1.117979 | -1.769453 |
Candidate Model 1: Linear Regression
We might first fit a linear regression model to our training dataset.
linear_model = smf.ols(formula='y~x', data=df_temp_train).fit()
linear_model.summary().tables[1]
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | -0.0599 | 0.078 | -0.773 | 0.462 | -0.239 | 0.119 |
x | 1.0032 | 0.071 | 14.175 | 0.000 | 0.840 | 1.166 |
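The fitted equation is therefore approximately y = -0.06 + 1.003x; in the plotting code below we hard-code these rounded coefficients. As a side note, the unrounded values could also be pulled from the fitted model directly; a minimal sketch:
#Pull the unrounded intercept and slope from the fitted statsmodels results
intercept, slope = linear_model.params['Intercept'], linear_model.params['x']
print(intercept, slope)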
We then plot this linear regression curve with our training dataset below.
#This will help us plot our linear regression curve
x_fit = np.linspace(-2, 2, 100)
y_fit = -0.06 + 1.003 * x_fit
#Plotting the training data
sns.scatterplot(x='x', y='y', data=df_temp_train)
#Plotting the linear regression curve
plt.plot(x_fit, y_fit, label='y = -0.06 + 1.003x')
plt.title('Training Dataset')
plt.legend()
plt.show()

From the plot above, we can see that the residuals between the training dataset and this linear regression curve don't look too large. We can also quantify this with the training dataset RMSE, which is quite low at 0.206.
#Get the training response variable
y_temp_train=df_temp_train['y']
#Training RMSE
y_pred_train = linear_model.predict(df_temp_train)
rmse_train_linear = mean_squared_error(y_temp_train, y_pred_train, squared=False)
rmse_train_linear
0.20559412894531362
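As a quick sanity check, the same number can also be computed directly from the definition of RMSE, the square root of the mean squared residual:
#RMSE from its definition: square root of the mean squared residual
np.sqrt(np.mean((y_temp_train - y_pred_train)**2))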
However, let's now plot this same linear regression curve with our test dataset. We can see that the residuals for the test dataset and this linear regression model look slightly worse. This is to be expected as the test dataset observations were not considered when building this model.
#Plotting the test data
sns.scatterplot(x='x', y='y', data=df_temp_test, ci=False)
#Plotting the linear regression model curve(fit from the training data)
plt.plot(x_fit, y_fit, label='y = -0.06 + 1.003x')
plt.title('Test Dataset')
plt.legend()
plt.show()

We can quantify this worse fit with the higher test dataset RMSE of 0.347.
#Get the test response variable
y_temp_test=df_temp_test['y']
#Test RMSE
y_pred_test = linear_model.predict(df_temp_test)
rmse_test_linear = mean_squared_error(y_temp_test, y_pred_test, squared=False)
rmse_test_linear
0.34665753045827585
Candidate Model 2: Nonlinear Regression
As another candidate model, we might decide to fit a nonlinear regression model to our training dataset below, using scikit-learn's PolynomialFeatures().
You can learn more about fitting a nonlinear regression curve in other classes. For now, we just plot one for demonstration purposes.
poly = PolynomialFeatures(degree=8)
x_poly_train=poly.fit_transform(df_temp_train[['x']])
nonlinear_model = LinearRegression()
nonlinear_model.fit(x_poly_train, y_temp_train)
LinearRegression()
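For intuition, PolynomialFeatures(degree=8) expands the single column x into nine columns: 1, $x$, $x^2$, ..., $x^8$, and LinearRegression() then fits one coefficient per column. We can confirm this from the shape of the expanded training matrix:
#The expanded training design matrix has one column per power of x (including x^0 = 1)
x_poly_train.shape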
We plot this "best fit" nonlinear regression curve with our training dataset below. We can see that this nonlinear regression curve fits the training dataset almost perfectly; the residuals are very small.
#This will help us plot our nonlinear regression curve
x_fit = np.linspace(-1.8, 1.25, 100).reshape(-1, 1)
y_fit = nonlinear_model.predict(poly.transform(x_fit))
#Plotting the training data
sns.scatterplot(x='x', y='y', data=df_temp_train)
#Plotting the nonlinear regression model curve (fit from the training data)
plt.plot(x_fit, y_fit, label='Nonlinear Regression Curve with Training Data')
plt.title('Training Dataset')
plt.legend()
plt.show()

We can quantify this good training dataset fit by observing the very low training dataset RMSE of 0.131.
#Training RMSE
y_pred_train = nonlinear_model.predict(poly.transform(df_temp_train[['x']]))
rmse_train_nonlinear = mean_squared_error(y_temp_train, y_pred_train, squared=False)
rmse_train_nonlinear
0.13142999484720425
However, if we plot this same nonlinear regression curve with the test dataset, we can see that the fit of this nonlinear regression model looks pretty far off!
#Plotting test data
sns.scatterplot(x='x', y='y', data=df_temp_test)
#Plotting the nonlinear regression model curve (fit from the training data)
plt.plot(x_fit, y_fit, label='Nonlinear Regression Curve with Training Data')
plt.title('Test Dataset')
plt.legend()
plt.show()

We can quantify this worse fit with the much higher test dataset RMSE of 1.03.
#Test RMSE
y_pred_test = nonlinear_model.predict(poly.transform(df_temp_test[['x']]))
rmse_test_nonlinear = mean_squared_error(y_temp_test, y_pred_test, squared=False)
rmse_test_nonlinear
1.0344276211503267
Summary: Example of Overfitting
What we just saw here is an example of what we call overfitting a model to a given training dataset. That is, our nonlinear model was fit too closely to the training dataset, such that its performance on the test dataset (or any other dataset assumed to have been randomly drawn from the same population) is dramatically worse. While the training dataset RMSE of the nonlinear regression model was great (0.13), its test dataset RMSE was very poor (1.03).
| | Linear Regression Model | Nonlinear Regression Model |
---|---|---|
Training Data RMSE | 0.206 | 0.131 |
Test Data RMSE | 0.347 | 1.03 |
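As a side note, this comparison table can be assembled directly from the four RMSE values we computed above; for example:
#Collect the training and test RMSEs of the two candidate models into one table
pd.DataFrame({'Linear Regression Model': [rmse_train_linear, rmse_test_linear],
              'Nonlinear Regression Model': [rmse_train_nonlinear, rmse_test_nonlinear]},
             index=['Training Data RMSE', 'Test Data RMSE']).round(3)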
If we visualize both our training and test datasets, we can see that the relationship that actually exists in this full dataset is a relatively strong linear relationship. Most likely, any deviation from this linear relationship is due to noise or randomness. Thus, our nonlinear regression model was too complex in this case, attempting to fit "every nook and cranny" of noise in the training dataset on top of this otherwise linear relationship.
On the other hand, while the training dataset RMSE of the linear regression model (0.206) is slightly worse than that of the nonlinear regression model (0.131), the linear regression model was not trying to fit "every nook and cranny" of noise in the training dataset, so its test dataset RMSE (0.347) was not nearly as bad as that of the nonlinear regression model (1.03).
sns.scatterplot(x='x', y='y', data=df_temp_train, label='Training Dataset')
sns.scatterplot(x='x', y='y', data=df_temp_test, label='Test Dataset')
plt.plot(x_fit, y_fit, label='Nonlinear Regression Curve (fit from the training data)')
plt.legend()
plt.show()

Some Ways to Overfit a Linear Regression Model
In the example above, we saw that our linear regression model was not the one that overfit the data.
However, there are still many ways that a linear regression model can be overfit to a particular training dataset. One such way is to include too many explanatory variables that do not "bring enough predictive power" to the regression model. Two common scenarios are described below, with a small simulated sketch after the list.
- Irrelevant Explanatory Variables

  One way that an explanatory variable might not bring enough predictive power to the model is that its addition to the model only increases the fit (i.e., the $R^2$) ever so slightly, because it does not independently have a strong relationship with the response variable.

  An example of this might be trying to predict the height of a person based on the length of their right foot. If we were to then add the person's favorite ice cream flavor (vanilla/chocolate) as an additional explanatory variable, we might be overfitting: we would expect the $R^2$ of the new model to increase only ever so slightly, because (most likely) ice cream preference is not associated with a person's height. The model would nonetheless still try to use this favorite ice cream flavor variable to achieve a slightly better fit to what are mostly random fluctuations.

- Collinear Explanatory Variables

  Another way that an explanatory variable might not bring enough predictive power to the model is that its addition to the model only increases the $R^2$ ever so slightly, because it is collinear with another explanatory variable already in the model.

  An example of this might be the following. Suppose we were trying to predict the height of a person based on the size of their right foot. If we were to then add the person's left foot size as an additional explanatory variable, we might be overfitting as well. Because the correlation between the left foot and right foot sizes of a person is likely very high, the relationship between left foot size and height probably strongly mimics the relationship between right foot size and height. Thus, the bulk of the contribution to the model's fit would already be handled by the right foot explanatory variable, and including the left foot variable may instead work to minimize residuals that are mostly due to noise, hence overfitting the model.
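To make these two scenarios concrete, here is a small simulated sketch. The data and variable names (right_foot, left_foot, flavor, height) are hypothetical and made up for illustration; the point is simply that adding an irrelevant or a collinear explanatory variable barely moves the training $R^2$.
#Hypothetical simulated data: height (cm) and foot lengths (cm) for 100 people
np.random.seed(0)
n = 100
right_foot = np.random.normal(26, 2, n)
left_foot = right_foot + np.random.normal(0, 0.2, n)    #nearly collinear with right_foot
flavor = np.random.choice(['vanilla', 'chocolate'], n)  #irrelevant to height
height = 100 + 2.7*right_foot + np.random.normal(0, 5, n)
df_feet = pd.DataFrame({'height': height, 'right_foot': right_foot,
                        'left_foot': left_foot, 'flavor': flavor})

#Training R^2 barely improves when the irrelevant or collinear variable is added
print(smf.ols('height ~ right_foot', data=df_feet).fit().rsquared)
print(smf.ols('height ~ right_foot + flavor', data=df_feet).fit().rsquared)
print(smf.ols('height ~ right_foot + left_foot', data=df_feet).fit().rsquared)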
Some Ways to Underfit a Linear Regression Model
On the other hand, if we choose to leave out an important explanatory variable, one that does have the ability to increase the $R^2$ of the model enough given the explanatory variables already in the model, then we might be underfitting the model. In situations like this, the fit to both the training dataset and the test dataset is likely to be worse.
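As a quick sketch of underfitting with our artificial dataset, consider an intercept-only model that leaves out the clearly useful explanatory variable x; both its training and test RMSE should come out noticeably worse than those of the simple y ~ x model above.
#Underfit model: ignore x and predict every observation with the intercept alone
underfit_model = smf.ols(formula='y ~ 1', data=df_temp_train).fit()

#Compare these with the y ~ x training and test RMSEs computed earlier
rmse_train_underfit = mean_squared_error(y_temp_train, underfit_model.predict(df_temp_train), squared=False)
rmse_test_underfit = mean_squared_error(y_temp_test, underfit_model.predict(df_temp_test), squared=False)
print(rmse_train_underfit, rmse_test_underfit)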