Cross-Validation Techniques


Downsides of Selecting a Single Training and Test Dataset

1. Differing Results

As we just observed, depending on how a dataset is randomly split into a single training and test dataset, we may get varying results when it comes to:

  1. the type of models that are built based on the training data, and
  2. the type of model that was deemed to be best and thus selected based on test data performance.

2. Not Using the Full Dataset

In addition, there will inevitably be some observations that are never used to train the model, because they were placed in the test dataset. Similarly, there will be some observations that are never used to test the model, because they were placed in the training dataset. Thus, we are not giving each observation the chance to influence both our model building and our model selection.

Cross-Validation Techniques

As a means of making up for these shortcomings of selecting just a single training and test dataset, we introduce what we call cross-validation techniques. These techniques involve creating multiple pairs of training and test datasets from the full dataset. By using these techniques, every observation will:

  • appear in at least one training dataset, and
  • appear in at least one test dataset.

Two of the most common cross-validation techniques include:

  1. Leave-One-Out Cross-Validation (LOOCV)
  2. k-Fold Cross-Validation

Leave-One-Out Cross-Validation (LOOCV)

The leave-one-out cross-validation method has every observation appear in a test dataset exactly once, and every observation appear in a training dataset $n-1$ times. Below, we describe how to use the LOOCV method to evaluate a particular model that we are considering.

Choosing a Model with LOOCV

For a dataset with $n$ observations and a given model that we are considering (ex: one with a certain subset of explanatory variables), we do the following.

  1. Create $n$ Training and Test Data Pairs: For each of the $n$ observations, we do the following, creating $n$ pairs of training and test datasets.

    • Create a test dataset with just the one observation.
    • Create a training dataset with every other remaining observation in the dataset.
  2. Train $n$ Models: Then we create $n$ models with each of the $n$ training datasets.

  3. Test the $n$ Models: We then test each of the $n$ models with its corresponding test dataset. For instance, for each single test observation we might calculate its residual under the corresponding trained model.

  4. Average Model Test Performance: Then we calculate the average test data performance (over all $n$ test datasets). For instance, we might calculate the average test data residual.

  5. Compare Average Model Performance: Compare different model performances (ex: ones with different subsets of explanatory variables) based on this average model performance. For instance, we may select the model with the lowest average test model residual.
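
For reference, here is a minimal sketch of these steps carried out with scikit-learn's LeaveOneOut splitter and the cross_val_score() helper. The features matrix X, target array y, and the use of a plain linear regression model are assumptions for illustration. Note that $R^2$ is not defined on a test dataset with a single observation, so we score each left-out observation by its squared error instead.

#Minimal LOOCV sketch (X, y, and the candidate model below are assumed for illustration)
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression

loocv = LeaveOneOut()                 #creates the n training/test dataset pairs (steps 1-2)
candidate_mod = LinearRegression()    #the candidate model we are evaluating

#Train and test the n models, scoring each single test observation by its squared error (steps 3-4)
scores = cross_val_score(candidate_mod, X, y, cv=loocv, scoring='neg_mean_squared_error')

#Average test data performance for this candidate model (step 5), to be compared across models (step 6)
print('Average LOOCV test MSE:', -scores.mean())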

Benefits of LOOCV

The LOOCV method has the following benefits.

  • More Accurate Test Data Performance than Train-Test-Split Method: The average test data performance is a more accurate estimate than the one obtained from just a single training and test dataset.

    Why?

    • Recall how our training dataset predictions tend to be better than our test dataset predictions. This is because the training dataset observations were actually used to build the model. So the model was explicitly trying to minimize the error of their predictions.

    • In LOOCV, each of our training datasets is almost the full dataset, except for the one observation that was left out for the corresponding test dataset. Thus, we might expect each of our trained models to look quite similar to the model we would have gotten if we had used the full dataset to train it. Thus, the corresponding test dataset predictions that we make with each of these trained models will similarly reflect the higher accuracy that the full model might have achieved.

  • No Randomness: There is no random nature to the way in which the $n$ training and test dataset pairs are created. Thus, your model results and decisions that you make will not fluctuate based on the random seed that was selected.

  • Low Model Variability: Because each of the $n$ training datasets contains mostly the same set of observations, your $n$ trained models should not have a high degree of variability in their slopes and intercepts.

Drawbacks of LOOCV

The LOOCV method has the following drawbacks.

  • Computationally Expensive: This can be a very computationally expensive method, especially for large datasets, because $n$ models need to be trained.

  • More Variable Test Data Predictions: Because each test dataset contains just a single observation, your individual test dataset prediction errors may be highly variable.

  • Inflation of Model Performance: The average test dataset performance may also be an over-inflated estimate of how well a given model will perform on new datasets.

    Why?

    • Recall how the average error (like RMSE) of our training dataset tends to be better than the average error (like RMSE) of our test dataset in the train-test-split method.
    • Because each of our training datasets of size $n-1$ in the LOOCV method are almost the same and almost the full dataset, we might expect each of our $n$ trained models to look very similar to the model that would have been trained on the full dataset.
    • Thus, when we calculate the error of each of our $n$ single test observations and average them, we might expect this error to look very similar to the average full dataset error of the model trained with the full dataset.
    • Thus, this average test data RMSE is likely to be lower than what it would be for a potential new dataset.

k-Fold Cross-Validation

The k-fold cross-validation method has every observation appear in a test dataset exactly once, and every observation appear in a training dataset $k-1$ times. Below, we describe how to use the k-fold cross-validation method to evaluate a particular model that we are considering.

Choosing a Model with k-Fold Cross-Validation

For a dataset with $n$ observations and a given model that we are considering (ex: one with a certain subset of explanatory variables), we do the following.

  1. Create $k$ Folds: Split the dataset into $k$ equally sized folds (i.e., observation subsets). (You may choose to randomly shuffle the rows in the full dataset first in order to have your $k$ folds be randomly selected.)

  2. Create $k$ Training and Test Data Pairs: For each of the $k$ folds, we do the following, creating $k$ pairs of training and test datasets.
    a. Create a test dataset with just the one fold.
    b. Create a training dataset comprised of every other remaining fold in the dataset.

  3. Train $k$ Models: Then we create $k$ models, one with each of the $k$ training datasets.

  4. Test the $k$ Models: We then test each of the $k$ models with its corresponding test dataset. For instance, we might calculate the test data $R^2$ of each model.

  5. Average Model Test Performance: Then we calculate the average test data performance (over all $k$ test datasets). For instance, we might calculate the average test data R^2.

  6. Compare Average Model Performance: Compare different model performances (ex: ones with different subsets of explanatory variables) based on this average model performance. For instance, we may select the model with the highest average test R^2.
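
For reference, here is a minimal sketch of these steps written out by hand with scikit-learn's KFold splitter. The features matrix X (a pandas DataFrame), target array y, and the use of a plain linear regression model are assumptions for illustration; later in this section we will let the cross_val_score() helper perform this loop for us.

#Minimal manual k-fold sketch (X, y, and the candidate model below are assumed for illustration)
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

kf = KFold(n_splits=5, shuffle=True, random_state=207)      #Step 1: create k=5 folds
fold_r2 = []

for train_idx, test_idx in kf.split(X):                      #Step 2: k training/test dataset pairs
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    mod = LinearRegression().fit(X_train, y_train)           #Step 3: train a model on the k-1 folds
    fold_r2.append(r2_score(y_test, mod.predict(X_test)))    #Step 4: test it on the held-out fold

print('Average test fold R^2:', np.mean(fold_r2))            #Step 5: average, then compare across models (step 6)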

Benefits of k-Fold Cross-Validation

The k-fold cross-validation method has the following benefits.

  • Less Computationally Complex than LOOCV: This method is less computationally complex than LOOCV, as now only $k$ models need to be trained.

  • More Accurate Test Data Performance than Train-Test-Split Method: The average test data performance is also more accurate than the estimate from just a single training and test dataset. This is because we are averaging out some of the test data performance variability that we would have observed by selecting just a single test dataset.

  • Less Inflation of Model Performance than LOOCV: Compared to LOOCV, the average test dataset performance is less likely to be an over-inflated estimate of how well a given model will perform on new datasets.
    Why?
    - Unlike in LOOCV, each of your training datasets in k-fold cross-validation contains a lower percentage of the observations from the full dataset. Thus, we might expect each of the $k$ models trained in k-fold cross-validation to look more different from the model trained with the full dataset.
    - Therefore, our test data predictions will look less similar to the inflated accuracy that we might expect to see if we were predicting an observation from the full dataset that trained the model.

Drawbacks of k-Fold Cross-Validation

The k-fold cross-validation method has the following drawbacks.

  • More Computationally Complex than Train-Test-Split Method: This method is still more computationally complex than the train-test-split method as $k$ models need to be trained, instead of just one.
  • Less Accurate Test Data Performance than LOOCV: The average test data performance is less accurate than the LOOCV method.
  • Randomness: Unlike LOOCV, this method may have a random nature in the way your $k$ folds were created (if you chose to randomly shuffle the full dataset row order first). Thus, your model results and decisions that you make may fluctuate based on the random seed that was selected.
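
As a small illustration of this last point, the sketch below scores the same candidate model with two different fold-shuffling seeds; the two average test fold $R^2$ values will generally differ somewhat. The features matrix X, target array y, and the linear regression model are again assumptions for illustration.

#Sketch: same data and same model, but two different fold-shuffling seeds
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

mod = LinearRegression()    #hypothetical candidate model; X and y are assumed to already exist
for seed in [1, 2]:
    folds = KFold(n_splits=5, shuffle=True, random_state=seed)
    r2_vals = cross_val_score(mod, X, y, cv=folds, scoring='r2')
    print('Seed', seed, 'mean test fold R^2:', round(r2_vals.mean(), 3))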

Performing 5-Fold Cross-Validation on the Breast Tumor Linear Regression Model

In an attempt to attain a more confident, stable sense of how well a given linear regression model might perform when it comes to predicting breast tumor size on new datasets, let's perform k-fold cross-validation on a few candidate models that we tried in section 8. Let's use $k=5$ folds, as this is a commonly used number in cross-validation analyses in the literature. That means we will create $k=5$ pairs of training and test datasets.

Full Features Matrix and Full Target Array

First, because we are no longer using just a single training and test dataset in k-fold cross-validation, let's create a features matrix $X$ and target array $y$ from the full dataset. Because we would also like to interpret the magnitudes of our resulting regression slopes, let's also scale this full features matrix X by converting each column value to its column z-score.

X=df.drop(['size'], axis=1)
X.head()

X159 X960 X980 X986 X1023 X1028 X1064 X1092 X1103 X1109 ... X1563 X1574 X1595 X1597 X1609 X1616 X1637 X1656 X1657 X1683
0 0.739512 1.971036 -1.660842 2.777183 3.299062 -1.954834 2.784970 0.411848 3.974931 -0.979751 ... 3.221117 -1.735267 -2.036134 3.114202 -1.567229 -2.100264 -2.067843 2.116261 -2.057195 -0.868534
1 2.037903 2.197854 -1.263034 4.082346 5.426886 -1.732520 3.085890 0.688056 4.503384 -1.185032 ... 2.927229 -1.646363 0.127756 2.772590 -1.451107 0.267480 -1.526069 2.643856 -1.625604 -1.415037
2 2.218338 3.471559 -1.789433 2.829994 4.746466 -2.222392 2.977280 0.944858 4.021099 -1.825502 ... 3.565945 -2.296393 -2.347923 3.577213 -2.175087 -2.084889 -2.106915 2.738768 -1.387816 -0.780555
3 0.972344 2.638734 -2.010999 3.913935 4.744161 -2.496426 3.139577 0.155651 4.632121 -1.671513 ... 3.815160 -1.706846 -2.216318 3.168707 -1.844349 -2.010999 -1.996352 2.797407 -1.743066 -1.010999
4 2.412235 4.033491 -1.536501 4.239650 4.304348 -1.991067 3.700095 0.878536 4.295705 -2.141092 ... 4.264107 -2.424026 -2.448274 3.717911 -2.286523 -2.045515 -1.776328 2.813104 -2.353637 -1.687061

5 rows × 50 columns

#Create a StandardScaler() object and use it to scale the full features matrix X
scaler_full = StandardScaler()
scaled_expl_vars = scaler_full.fit_transform(X)

#Put this numpy array scaled output back into a dataframe with the same columns
X = pd.DataFrame(scaled_expl_vars, columns=X.columns)
X.head()

X159 X960 X980 X986 X1023 X1028 X1064 X1092 X1103 X1109 ... X1563 X1574 X1595 X1597 X1609 X1616 X1637 X1656 X1657 X1683
0 -0.845493 1.526262 -2.218126 1.641149 1.210038 -1.801174 1.991829 -1.266751 1.803601 -1.871297 ... 1.864684 -1.849315 -2.016888 1.556149 -1.735853 -2.134520 -2.417362 1.599595 -2.097080 -1.964144
1 0.878424 1.692429 -2.046898 2.694920 2.512264 -1.506456 2.183692 -1.074956 2.121859 -1.969695 ... 1.638766 -1.804299 -0.930737 1.168764 -1.642269 -0.908221 -1.878902 2.048030 -1.720365 -2.294808
2 1.117994 2.625543 -2.273475 1.683788 2.095848 -2.155872 2.114443 -0.896636 1.831405 -2.276697 ... 2.129761 -2.133432 -2.173389 2.081198 -2.225724 -2.126557 -2.456196 2.128701 -1.512812 -1.910912
3 -0.536355 2.015416 -2.368843 2.558947 2.094437 -2.519153 2.217922 -1.444650 2.199390 -2.202884 ... 2.321338 -1.834924 -2.107330 1.617957 -1.959183 -2.088288 -2.346309 2.178543 -1.822892 -2.050343
4 1.375436 3.037213 -2.164606 2.821925 1.825272 -1.849208 2.575300 -0.942689 1.996785 -2.427971 ... 2.666453 -2.198057 -2.223759 2.240747 -2.315530 -2.106164 -2.127631 2.191884 -2.355830 -2.459397

5 rows × 50 columns

y=df['size']
y.head()

0 13.0
1 15.0
2 15.0
3 20.0
4 10.0
Name: size, dtype: float64

Setting Up 5 models

For now, let's just stick to testing and comparing the following 5 models with k-fold cross-validation.

  1. Non-regularized linear regression
  2. Elastic net linear regression with $\lambda = 0.5$
  3. Elastic net linear regression with $\lambda = 1$
  4. Elastic net linear regression with $\lambda = 2.5$
  5. Elastic net linear regression with $\lambda = 5$

Rather than instantiating and then fitting each of these models (using the .fit() function like we normally do), let's just instantiate the 5 model objects below. Note that the alpha parameter in sklearn's ElasticNet() corresponds to what we have been calling $\lambda$.

lin_mod = LinearRegression()
en_mod1 = ElasticNet(alpha=0.5, max_iter=1000)
en_mod2 = ElasticNet(alpha=1, max_iter=1000)
en_mod3 = ElasticNet(alpha=2.5, max_iter=1000)
en_mod4 = ElasticNet(alpha=5, max_iter=1000)

Setting up the k=5 Folds

Then let's create a KFold() object which sets up our intended k=5 folds for us.

  • We indicate that we would like n_splits=5 folds in our cross-validation.
  • Furthermore, we indicate that we would like to shuffle the rows in our dataset first before splitting the dataset up into 5 equally sized folds.
  • Due to the random shuffling that we stipulated, let's also fix a random_state.
from sklearn.model_selection import KFold
cross_val = KFold(n_splits=5, shuffle=True, random_state=207)
cross_val

KFold(n_splits=5, random_state=207, shuffle=True)

Running k-Fold Cross-Validation on the 5 Candidate Models

Finally, let's use the cross_val_score() function to perform k-Fold cross validation on each of our 5 candidate models.

Nonregularized Linear Regression

For instance, below we:

  • use our k=5 training and test dataset pairs created in the cv=cross_val object
  • to train and test our instantiated nonregularized linear regression model lin_mod
  • with the full features matrix $X$ and
  • target array $y$.

Furthermore,

  • the scoring parameter indicates what model fit metric we'd like to use on each of the $k=5$ test datasets. We choose to use the $R^2$ in this case.
from sklearn.model_selection import cross_val_score

test_fold_r2=cross_val_score(lin_mod, X, y, cv=cross_val, scoring="r2")
print('Test Fold R^2 Values:', test_fold_r2)
print('Mean Test Fold R^2:', test_fold_r2.mean())
print('Std Test Fold R^2:', test_fold_r2.std())

Test Fold R^2 Values: [-0.33509836 -1.51307962 -0.69764378 -0.86681282 -0.53544239]
Mean Test Fold R^2: -0.7896153923433595
Std Test Fold R^2: 0.40224578435797564

We can see that the 5 test fold $R^2$ values are all very poor fits (negative $R^2$ values, meaning each model predicted its test fold worse than simply predicting the mean tumor size would have!) and vary quite a bit. The average test fold $R^2$ value was -0.79.

Elastic Net Linear Regression

Now let's try cross-validation on each of our 4 elastic net models that we instantiated above:

  • en_mod1
  • en_mod2
  • en_mod3
  • en_mod4
test_fold_r2=cross_val_score(en_mod1, X, y, cv=cross_val, scoring="r2")
print('Test Fold R^2 Values:',test_fold_r2)
print('Mean Test Fold R^2:', test_fold_r2.mean())
print('Std Test Fold R^2:', test_fold_r2.std())

Test Fold R^2 Values: [ 0.04845339 0.03693266 0.19174454 -0.11044484 -0.04329324]
Mean Test Fold R^2: 0.024678501901048477
Std Test Fold R^2: 0.10149379008193646

test_fold_r2=cross_val_score(en_mod2, X, y, cv=cross_val, scoring="r2")
print('Test Fold R^2 Values:',test_fold_r2)
print('Mean Test Fold R^2:', test_fold_r2.mean())
print('Std Test Fold R^2:', test_fold_r2.std())

Test Fold R^2 Values: [ 0.0591827 0.04978462 0.18600926 -0.07266975 -0.01308511]
Mean Test Fold R^2: 0.04184434439893314
Std Test Fold R^2: 0.08634571524492464

test_fold_r2=cross_val_score(en_mod3, X, y, cv=cross_val, scoring="r2")
print('Test Fold R^2 Values:',test_fold_r2)
print('Mean Test Fold R^2:', test_fold_r2.mean())
print('Std Test Fold R^2:', test_fold_r2.std())

Test Fold R^2 Values: [ 0.04479992 0.05737101 0.12485681 -0.05534877 -0.01414004]
Mean Test Fold R^2: 0.03150778576590822
Std Test Fold R^2: 0.06194741791000637

test_fold_r2=cross_val_score(en_mod4, X, y, cv=cross_val, scoring="r2")
print('Test Fold R^2 Values:',test_fold_r2)
print('Mean Test Fold R^2:', test_fold_r2.mean())
print('Std Test Fold R^2:', test_fold_r2.std())

Test Fold R^2 Values: [ 0.00413185 0.01118841 0.00930435 -0.06664161 -0.07967792]
Mean Test Fold R^2: -0.02433898188445487
Std Test Fold R^2: 0.04014117221866386

Conclusion

We can see that each of our elastic net models performed much better than the nonregularized linear regression model when it came to predicting the tumor size of each of the 5 test folds. The elastic net model that had the highest average test fold $R^2$ (0.04) was elastic net model 2, which used $\lambda=1$. Thus, if our goal is to build a linear regression model that we would expect to have the best predictions for new breast tumor datasets, then we might consider selecting this elastic net model 2 with $\lambda=1$.

To build and learn more about the actual model with these parameters, we'd need to fit the model one more time, this time (if we'd like) using the full dataset.

en_mod2.fit(X,y)
ElasticNet(alpha=1)

Notice how most of the model slopes in this elastic net model 2 have been zeroed out. Taking these zeroed out slopes as an indication of which gene explanatory variables should be left out, we can write out our equivalent, reduced representation of this linear regression model below.

Linear Regression Model Chosen by Elastic Net Model 2

\begin{align*}
\hat{size} &= 22.96 \\
&\quad - 0.364X1023 \\
&\quad + 0.779X1028 \\
&\quad - 0.337X1136 \\
&\quad - 0.191X1141 \\
&\quad - 0.295X1219 \\
&\quad - 0.201X1351 \\
&\quad - 0.211X1416 \\
&\quad + 0.008X1430 \\
&\quad - 0.155X1529 \\
&\quad - 0.046X1597 \\
&\quad - 0.117X1656
\end{align*}
df_slopes = pd.DataFrame(en_mod2.coef_.T, columns = ['en_mod2'], index=X.columns)
print('Model Intercept:', en_mod2.intercept_)

print('Model Slopes')
df_slopes

Model Intercept: 22.96
Model Slopes

en_mod2
X159 0.000000
X960 0.000000
X980 0.000000
X986 -0.000000
X1023 -0.364144
X1028 0.778571
X1064 0.000000
X1092 0.000000
X1103 -0.000000
X1109 -0.000000
X1124 0.000000
X1136 -0.337205
X1141 -0.191111
X1144 -0.000000
X1169 -0.000000
X1173 -0.000000
X1179 -0.000000
X1193 -0.000000
X1203 -0.000000
X1206 -0.000000
X1208 0.000000
X1219 -0.295481
X1232 0.000000
X1264 0.000000
X1272 -0.000000
X1292 0.000000
X1297 -0.000000
X1329 -0.000000
X1351 -0.200529
X1362 -0.000000
X1416 -0.211375
X1417 0.000000
X1418 -0.000000
X1430 0.008050
X1444 0.000000
X1470 -0.000000
X1506 -0.000000
X1514 -0.000000
X1529 -0.154875
X1553 0.000000
X1563 -0.000000
X1574 0.000000
X1595 0.000000
X1597 -0.046148
X1609 0.000000
X1616 0.000000
X1637 0.000000
X1656 -0.116647
X1657 -0.000000
X1683 -0.000000
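
If we would rather pull out just the surviving explanatory variables programmatically instead of scanning the full table, a quick filter on the df_slopes dataframe built above does the job. This is only a sketch, assuming the elastic net penalty set the dropped slopes to exactly zero (which is how the L1 part of the penalty behaves).

#Keep only the explanatory variables whose elastic net slope was not zeroed out
nonzero_slopes = df_slopes[df_slopes['en_mod2'] != 0]
nonzero_slopes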

Discussion and Drawbacks

1. More Complete Analysis: Take note that we only tried out a nonregularized linear regression model and a few elastic net models in search of a model that would give us the highest average test fold R^2 value. A more complete analysis might have tried out:

  • More than just one $\alpha$ value in the elastic net models.
  • LASSO and Ridge Regression Models with multiple $\lambda$ parameters as well.

You can perform cross-validation on any given LASSO or ridge regression model using a similar structure.

#Cross-validation on a LASSO model
test_lasso_mod = Lasso(alpha=0.5, max_iter=1000)
print('Test Fold R^2 for a LASSO Model')
cross_val_score(test_lasso_mod, X, y, cv=cross_val, scoring="r2")

Test Fold R^2 for a LASSO Model

array([-0.01443285, 0.0191285 , 0.18322466, -0.12150093, -0.05739521])

#Cross-validation on a ridge regression model
test_ridge_mod = Ridge(alpha=0.5, max_iter=1000)
print('Test Fold R^2 for a Ridge Regression Model')
cross_val_score(test_ridge_mod, X, y, cv=cross_val, scoring="r2")

Test Fold R^2 for a Ridge Regression Model

array([-0.21130639, -1.04884562, -0.33237866, -0.69903916, -0.43160445])
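
Similarly, a more complete elastic net search might loop over a grid of $\lambda$ values (ElasticNet's alpha parameter) and $\alpha$ values (ElasticNet's l1_ratio parameter) and record the average test fold R^2 of each candidate. The sketch below reuses the cross_val folds object from above; the grid values themselves are arbitrary choices for illustration, not the ones from section 8.

#Sketch: compare a small grid of elastic net candidates by their average test fold R^2
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import cross_val_score

results = {}
for lam in [0.5, 1, 2.5, 5]:             #candidate lambda values (ElasticNet's alpha parameter)
    for a in [0.25, 0.5, 0.75]:          #candidate alpha values (ElasticNet's l1_ratio parameter)
        mod = ElasticNet(alpha=lam, l1_ratio=a, max_iter=1000)
        results[(lam, a)] = cross_val_score(mod, X, y, cv=cross_val, scoring='r2').mean()

#The candidate with the highest average test fold R^2
best_lam, best_a = max(results, key=results.get)
print('Best lambda:', best_lam, ' Best alpha (l1_ratio):', best_a,
      ' Mean test fold R^2:', results[(best_lam, best_a)])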

2. Not Great Fits: Also take note that our best selected model above only yielded an average test fold R^2 of 0.04. While better than the other models, this is still quite low, indicating that on average, even our best model will not explain much of the tumor size variance when it comes to new datasets.

3. Test Fold R^2 Variability: Also take note that the standard deviation of the five test fold R^2 values for this best elastic net model 2 is 0.086. Thus, our chosen model's R^2 value for a new dataset might be much higher or much lower than this average R^2=0.04 value.

test_fold_r2=cross_val_score(en_mod2, X, y, cv=cross_val, scoring="r2")
print('Test Fold R^2 Values:',test_fold_r2)
print('Mean Test Fold R^2:', test_fold_r2.mean())
print('Std Test Fold R^2:', test_fold_r2.std())

Test Fold R^2 Values: [ 0.0591827 0.04978462 0.18600926 -0.07266975 -0.01308511]
Mean Test Fold R^2: 0.04184434439893314
Std Test Fold R^2: 0.08634571524492464