Overfitting vs. Underfitting to a Dataset
Before exploring more linear regression candidate models for our Airbnb dataset, let's examine the artificial dataset below to learn more about the consequences of overfitting and underfitting a model to a given training dataset.
df_temp = pd.read_csv('artificial_dataset.csv')
df_temp
| | x | y |
---|---|---|
0 | -1.507185 | -1.700640 |
1 | 0.691179 | 0.351877 |
2 | -0.490616 | -1.218427 |
3 | -1.602506 | -1.843221 |
4 | -0.275597 | -0.354455 |
5 | -1.117979 | -1.769453 |
6 | 0.379086 | 0.257637 |
7 | -0.945600 | -1.293416 |
8 | 0.309061 | 0.031458 |
9 | -0.133834 | 0.050468 |
10 | 0.905778 | 1.267771 |
11 | -0.969701 | -0.810798 |
12 | -0.781417 | -0.997631 |
13 | 0.269303 | -0.026614 |
14 | 0.180662 | 0.348375 |
15 | 0.745693 | 0.620228 |
16 | 0.812167 | 0.544890 |
17 | 0.664805 | 0.660800 |
18 | 1.066191 | 1.140320 |
19 | -1.756251 | -1.521596 |
Suppose our goal is to build a model with this artificial dataset that yields good predictions of y from the explanatory variable x for new datasets. So again, let's randomly split this dataset into a training dataset and a test dataset.
Note: Holding out as much as 50% of the observations for the test dataset is a bit unusual, but we do this here for learning purposes.
df_temp_train, df_temp_test = train_test_split(df_temp, test_size=0.5, random_state=207)
df_temp_train
| | x | y |
---|---|---|
7 | -0.945600 | -1.293416 |
17 | 0.664805 | 0.660800 |
19 | -1.756251 | -1.521596 |
3 | -1.602506 | -1.843221 |
11 | -0.969701 | -0.810798 |
1 | 0.691179 | 0.351877 |
14 | 0.180662 | 0.348375 |
0 | -1.507185 | -1.700640 |
18 | 1.066191 | 1.140320 |
6 | 0.379086 | 0.257637 |
df_temp_test
| | x | y |
---|---|---|
8 | 0.309061 | 0.031458 |
4 | -0.275597 | -0.354455 |
16 | 0.812167 | 0.544890 |
2 | -0.490616 | -1.218427 |
12 | -0.781417 | -0.997631 |
10 | 0.905778 | 1.267771 |
15 | 0.745693 | 0.620228 |
13 | 0.269303 | -0.026614 |
9 | -0.133834 | 0.050468 |
5 | -1.117979 | -1.769453 |
Candidate Model 1: Linear Regression
We might first fit a linear regression model to our training dataset.
linear_model = smf.ols(formula='y~x', data=df_temp_train).fit()
linear_model.summary().tables[1]
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | -0.0599 | 0.078 | -0.773 | 0.462 | -0.239 | 0.119 |
x | 1.0032 | 0.071 | 14.175 | 0.000 | 0.840 | 1.166 |
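The fitted equation is therefore approximately y = -0.06 + 1.003x; in the plotting code below we hard-code these rounded coefficients. As a side note, the unrounded values could also be pulled from the fitted model directly; a minimal sketch:
#Pull the unrounded intercept and slope from the fitted statsmodels results
intercept, slope = linear_model.params['Intercept'], linear_model.params['x']
print(intercept, slope)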
We then plot this linear regression curve with our training dataset below.
#This will help us plot our linear regression curve
x_fit = np.linspace(-2, 2, 100)
y_fit = -0.06 + 1.003 * x_fit
#Plotting the training data
sns.scatterplot(x='x', y='y', data=df_temp_train)
#Plotting the linear regression curve
plt.plot(x_fit, y_fit, label='y = -0.06 + 1.003x')
plt.title('Training Dataset')
plt.legend()
plt.show()

From the plot above, we can see that the residuals between the training dataset and this linear regression curve don't look too large. We can also quantify this with the training dataset RMSE, which is quite low at 0.206.
#Get the training response variable
y_temp_train=df_temp_train['y']
#Training RMSE
y_pred_train = linear_model.predict(df_temp_train)
rmse_train_linear = mean_squared_error(y_temp_train, y_pred_train, squared=False)
rmse_train_linear
0.20559412894531362
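As a quick sanity check, the same number can also be computed directly from the definition of RMSE, the square root of the mean squared residual:
#RMSE from its definition: square root of the mean squared residual
np.sqrt(np.mean((y_temp_train - y_pred_train)**2))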
However, let's now plot this same linear regression curve with our test dataset. We can see that the residuals for the test dataset and this linear regression model look slightly worse. This is to be expected as the test dataset observations were not considered when building this model.
#Plotting the test data
sns.scatterplot(x='x', y='y', data=df_temp_test, ci=False)
#Plotting the linear regression model curve(fit from the training data)
plt.plot(x_fit, y_fit, label='y = -0.06 + 1.003x')
plt.title('Test Dataset')
plt.legend()
plt.show()

We can quantify this worse fit with the higher test dataset RMSE of 0.347.
#Get the test response variable
y_temp_test=df_temp_test['y']
#Test RMSE
y_pred_test = linear_model.predict(df_temp_test)
rmse_test_linear = mean_squared_error(y_temp_test, y_pred_test, squared=False)
rmse_test_linear
0.34665753045827585
Candidate Model 2: Nonlinear Regression
As another candidate model, we might decide to fit a nonlinear regression model to our training dataset below, using scikit-learn's PolynomialFeatures().
You can learn more about fitting a nonlinear regression curve in other classes. For now, we just plot one for demonstration purposes.
poly = PolynomialFeatures(degree=8)
x_poly_train=poly.fit_transform(df_temp_train[['x']])
nonlinear_model = LinearRegression()
nonlinear_model.fit(x_poly_train, y_temp_train)
LinearRegression()
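For intuition, PolynomialFeatures(degree=8) expands the single column x into nine columns: 1, $x$, $x^2$, ..., $x^8$, and LinearRegression() then fits one coefficient per column. We can confirm this from the shape of the expanded training matrix:
#The expanded training design matrix has one column per power of x (including x^0 = 1)
x_poly_train.shape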
We plot this "best fit" nonlinear regression curve with our training dataset below. We can see that this nonlinear regression curve fits the training dataset almost perfectly; the residuals are very small.
#This will help us plot our nonlinear regression curve
x_fit = np.linspace(-1.8, 1.25, 100).reshape(-1, 1)
y_fit = nonlinear_model.predict(poly.transform(x_fit))
#Plotting the training data
sns.scatterplot(x='x', y='y', data=df_temp_train)
#Plotting the nonlinear regression model curve (fit from the training data)
plt.plot(x_fit, y_fit, label='Nonlinear Regression Curve with Training Data')
plt.title('Training Dataset')
plt.legend()
plt.show()

We can quantify this good training dataset fit by observing the very low training dataset RMSE of 0.131.
#Training RMSE
y_pred_train = nonlinear_model.predict(poly.transform(df_temp_train[['x']]))
rmse_train_nonlinear = mean_squared_error(y_temp_train, y_pred_train, squared=False)
rmse_train_nonlinear
0.13142999484720425
However, if we plot this same nonlinear regression curve with the test dataset, we can see that the fit of this nonlinear regression model looks pretty far off!
#Plotting test data
sns.scatterplot(x='x', y='y', data=df_temp_test)
#Plotting the nonlinear regression model curve (fit from the training data)
plt.plot(x_fit, y_fit, label='Nonlinear Regression Curve with Training Data')
plt.title('Test Dataset')
plt.legend()
plt.show()

We can quantify this worse fit with the much higher test dataset RMSE of 1.03.
#Test RMSE
y_pred_test = nonlinear_model.predict(poly.transform(df_temp_test[['x']]))
rmse_test_nonlinear = mean_squared_error(y_temp_test, y_pred_test, squared=False)
rmse_test_nonlinear
1.0344276211503267
Summary: Example of Overfitting
What we just saw here is an example of what we call overfitting a model to a given training dataset. That is, our nonlinear model was fit too closely to the training dataset, such that its performance on the test dataset (or any other dataset assumed to have been randomly drawn from the same population) is dramatically worse. While the training dataset RMSE of the nonlinear regression model was great (0.13), its test dataset RMSE was very poor (1.03).
| | Linear Regression Model | Nonlinear Regression Model |
---|---|---|
Training Data RMSE | 0.206 | 0.131 |
Test Data RMSE | 0.347 | 1.03 |
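As a side note, this comparison table can be assembled directly from the four RMSE values we computed above; for example:
#Collect the training and test RMSEs of the two candidate models into one table
pd.DataFrame({'Linear Regression Model': [rmse_train_linear, rmse_test_linear],
              'Nonlinear Regression Model': [rmse_train_nonlinear, rmse_test_nonlinear]},
             index=['Training Data RMSE', 'Test Data RMSE']).round(3)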
If we visualize both our training and test datasets, we can see that the relationship that actually exists in this full dataset is a relatively strong linear relationship. Most likely, any deviation from this linear relationship is due to noise or randomness. Thus, our nonlinear regression model was too complex in this case, attempting to fit "every nook and cranny" of noise in the training dataset on top of this otherwise linear relationship.
On the other hand, while the training dataset RMSE of the linear regression model (0.206) is slightly worse than that of the nonlinear regression model (0.131), the linear regression model was not trying to fit "every nook and cranny" of noise in the training dataset, so its test dataset RMSE (0.347) was not nearly as bad as that of the nonlinear regression model (1.03).
sns.scatterplot(x='x', y='y', data=df_temp_train, label='Training Dataset')
sns.scatterplot(x='x', y='y', data=df_temp_test, label='Test Dataset')
plt.plot(x_fit, y_fit, label='Nonlinear Regression Curve (fit from the training data)')
plt.legend()
plt.show()

Some Ways to Overfit a Linear Regression Model
In the example above, we saw that our linear regression model was not the one that overfit the data.
However, there are still many ways that a linear regression model can be overfit to a particular training dataset. One such way is to include too many explanatory variables that do not "bring enough predictive power" to the regression model. Two common scenarios are described below, with a small simulated sketch after the list.
- Irrelevant Explanatory Variables

  One way that an explanatory variable might not bring enough predictive power to the model is that its addition to the model only increases the fit (i.e., the $R^2$) ever so slightly, because it does not independently have a strong relationship with the response variable.

  An example of this might be trying to predict the height of a person based on the length of their right foot. If we were to then add the person's favorite ice cream flavor (vanilla/chocolate) as an additional explanatory variable, we might be overfitting: we would expect the $R^2$ of the new model to increase only ever so slightly, because (most likely) ice cream preference is not associated with a person's height. The model would nonetheless still try to use this favorite ice cream flavor variable to achieve a slightly better fit to what are mostly random fluctuations.

- Collinear Explanatory Variables

  Another way that an explanatory variable might not bring enough predictive power to the model is that its addition to the model only increases the $R^2$ ever so slightly, because it is collinear with another explanatory variable already in the model.

  An example of this might be the following. Suppose we were trying to predict the height of a person based on the size of their right foot. If we were to then add the person's left foot size as an additional explanatory variable, we might be overfitting as well. Because the correlation between the left foot and right foot sizes of a person is likely very high, the relationship between left foot size and height probably strongly mimics the relationship between right foot size and height. Thus, the bulk of the contribution to the model's fit would already be handled by the right foot explanatory variable, and including the left foot variable may instead work to minimize residuals that are mostly due to noise, hence overfitting the model.
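To make these two scenarios concrete, here is a small simulated sketch. The data and variable names (right_foot, left_foot, flavor, height) are hypothetical and made up for illustration; the point is simply that adding an irrelevant or a collinear explanatory variable barely moves the training $R^2$.
#Hypothetical simulated data: height (cm) and foot lengths (cm) for 100 people
np.random.seed(0)
n = 100
right_foot = np.random.normal(26, 2, n)
left_foot = right_foot + np.random.normal(0, 0.2, n)    #nearly collinear with right_foot
flavor = np.random.choice(['vanilla', 'chocolate'], n)  #irrelevant to height
height = 100 + 2.7*right_foot + np.random.normal(0, 5, n)
df_feet = pd.DataFrame({'height': height, 'right_foot': right_foot,
                        'left_foot': left_foot, 'flavor': flavor})

#Training R^2 barely improves when the irrelevant or collinear variable is added
print(smf.ols('height ~ right_foot', data=df_feet).fit().rsquared)
print(smf.ols('height ~ right_foot + flavor', data=df_feet).fit().rsquared)
print(smf.ols('height ~ right_foot + left_foot', data=df_feet).fit().rsquared)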
Some Ways to Underfit a Linear Regression Model
On the other hand, if we choose to leave out an important explanatory variable, one that does have the ability to increase the $R^2$ of the model enough given the explanatory variables already in the model, then we might be underfitting the model. In situations like this, the fit to both the training dataset and the test dataset is likely to be worse.
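As a quick sketch of underfitting with our artificial dataset, consider an intercept-only model that leaves out the clearly useful explanatory variable x; both its training and test RMSE should come out noticeably worse than those of the simple y ~ x model above.
#Underfit model: ignore x and predict every observation with the intercept alone
underfit_model = smf.ols(formula='y ~ 1', data=df_temp_train).fit()

#Compare these with the y ~ x training and test RMSEs computed earlier
rmse_train_underfit = mean_squared_error(y_temp_train, underfit_model.predict(df_temp_train), squared=False)
rmse_test_underfit = mean_squared_error(y_temp_test, underfit_model.predict(df_temp_test), squared=False)
print(rmse_train_underfit, rmse_test_underfit)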