Regularization Techniques


Like the backwards elimination and forward selection algorithms, regularization techniques are another way to determine which explanatory variables may lead to overfitting and/or do not bring enough "predictive power" to the model, and should therefore be left out.

Introducing a Penalty Term into a Linear Regression

Similarity to Adjusted R^2 Metric

Like the adjusted R^2 metric, which we tried to maximize in the backwards elimination and forward selection algorithms, these regularization techniques also introduce a type of penalty term that penalizes models with too many explanatory variables that do not bring enough predictive power.

In our adjusted R^2 equation, the penalty term was $(\frac{n-1}{n-p-1})$

$$Adjusted \ R^2=1-(\frac{SSE}{SST})\cdot(\frac{n-1}{n-p-1})$$
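To make this penalty concrete, below is a minimal sketch (the function and the numbers are our own, purely illustrative) showing how the adjusted R^2 from the formula above drops when the number of slopes $p$ grows while the SSE and SST stay the same.

def adjusted_r_squared(sse, sst, n, p):
    # Adjusted R^2 = 1 - (SSE/SST) * ((n - 1) / (n - p - 1))
    return 1 - (sse / sst) * ((n - 1) / (n - p - 1))

# Hypothetical values: same SSE and SST, but more slopes p in the second call.
print(adjusted_r_squared(sse=40.0, sst=100.0, n=50, p=5))   # approximately 0.555
print(adjusted_r_squared(sse=40.0, sst=100.0, n=50, p=10))  # approximately 0.497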

Adding More Slopes

Likewise, increasing the number of slopes $p$ increases this penalty term, which pushes the adjusted R^2 down. However, if we were to add a slope and the resulting decrease in the new model's error (SSE) were "large enough", then the adjusted R^2 would actually increase overall, and thus:

  1. this resulting model would be deemed "more parsimonious",
  2. this new slope would be considered bringing enough "predictive power" to the model, and
  3. this new slope would not be considered overfitting.

Unfortunately, searching for the linear regression model with the highest adjusted R^2 via the backwards elimination and forward selection algorithms involved fitting many candidate models, each time checking whether the adjusted R^2 of the test model got any better.

Objective Function

Recall that the task of fitting an ordinary least squares regression involves finding the optimal intercept and slopes ( $\hat{\beta}_0^*,\hat{\beta}_1^*,...,\hat{\beta}_{50}^*$ in this problem) that minimize the sum of squared errors (SSE) of the model.
Basic (Unregularized) Linear Regression Optimization Problem
$$min \ SSE = min \left[ \sum_{i=1}^n(y_i-(\hat{\beta}_0+\hat{\beta}_1x_{i,1}+...+\hat{\beta}_{50}x_{i,50}))^2 \right]$$
In general, when it comes to the task of trying to find the best variable values ($\hat{\beta}_0,\hat{\beta}_1,...,\hat{\beta}_{50}$ in our case) that optimize (maximize or minimize) some function, we call this function the **objective function**. Hence $\sum_{i=1}^n[y_i-(\hat{\beta}_0+\hat{\beta}_1x_{i,1}+...+\hat{\beta}_{50}x_{i,50})]^2$ is our objective function when it comes to fitting a *basic* linear regression model.
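To make the idea of an objective function concrete, here is a minimal sketch (the naming is our own, not part of the original analysis) of the basic linear regression objective written as a Python function, where beta_hat holds the intercept followed by the 50 slopes.

import numpy as np

def sse_objective(beta_hat, X, y):
    # Basic (unregularized) objective: the sum of squared errors.
    # beta_hat[0] is the intercept; beta_hat[1:] are the slopes.
    y_hat = beta_hat[0] + X @ beta_hat[1:]
    return np.sum((y - y_hat) ** 2)

Fitting the model then amounts to searching for the beta_hat values that make this function as small as possible.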

Modifying the Objective Function with a Penalty Term

So, rather than having to fit multiple basic linear regression models (and thus having to optimize multiple basic linear regression objective functions in search of the model that gives you the highest adjusted R^2), wouldn't it be great if we could just incorporate a penalty term directly into the linear regression objective function?

This is the idea behind how a regularized linear regression model works. In a regularized linear regression model, we take the objective function from our basic (ie. nonregularized) linear regression model and add a type of penalty term.

Regularized Linear Regression Optimization Problem
\begin{align*} &\min \left[ \sum_{i=1}^n \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{i,1} + \dots + \hat{\beta}_{50} x_{i,50}) \right)^2 \right] + (\text{penalty term}) \\ = \ &\min \ \text{SSE} + (\text{penalty term}) \end{align*}

Goal of the Penalty Term

So what specifically should we set this penalty term to be? Remember, the overall goal of our feature selection techniques is to determine which explanatory variables may cause our model to overfit to the training dataset, and therefore to determine which explanatory variables we should leave out of the model.

Goal 1: Interpretation of what to leave out

Question: Using just our regularized linear regression optimization problem above, what kind of optimal slope results could we get back that would signal to us that the corresponding explanatory variable (or indicator variable) should be left out of the model?

Answer: If a returned optimal slope value is $\hat{\beta}^*_i=0$, this serves as a pretty clear signal that the model thinks the corresponding explanatory variable (or indicator variable) $x_i$ should be left out!

Goal 2: Leave out variables that lead to overfitting

Questions: Now that we know how our fitted regularized linear regression model will indicate to us which variables should be left out, how should we set up this penalty term to encourage the model to only indicate leaving out the variables that lead to "too much" overfitting? Furthermore, how do we define "too much" overfitting?

Answers: In sections 8.2, 8.3, and 8.4 we'll talk about three of the most popular penalty term functions that attempt to satisfy these two goals.

  • 8.2. L1 Penalty Term (ie. LASSO Regression)
  • 8.3. L2 Penalty Term (ie. Ridge Regression)
  • 8.4. Combination of L1 and L2 Penalty Terms (ie. Elastic Net Regression)

LASSO Regression (ie. L1 Penalty Term)

In the regularized LASSO linear regression optimization problem shown below, the penalty term that is added is

$\lambda\cdot\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]$.

We call this an L1 penalty term because the L1 norm of a given list (ie. vector) of values is the sum of the absolute values of the values in the list. (LASSO stands for Least Absolute Shrinkage and Selection Operator.)

Also in this penalty term, we introduce a parameter $\lambda\geq 0$ that we need to choose ahead of time; any value is allowed as long as it is not negative.

LASSO Linear Regression Optimization Problem
$$min \left[ \sum_{i=1}^n(y_i-(\hat{\beta}_0+\hat{\beta}_1x_{i,1}+...+\hat{\beta}_{50}x_{i,50}))^2 \right] +\lambda\cdot\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]$$
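Written as code, the LASSO objective is just the SSE objective plus the weighted L1 penalty; below is a minimal sketch (our own naming), in which the intercept is not penalized, matching the formula above.

import numpy as np

def lasso_objective(beta_hat, X, y, lam):
    # SSE plus the L1 penalty: lam * (|b_1| + ... + |b_50|).
    # beta_hat[0] is the (unpenalized) intercept; beta_hat[1:] are the slopes.
    y_hat = beta_hat[0] + X @ beta_hat[1:]
    sse = np.sum((y - y_hat) ** 2)
    return sse + lam * np.sum(np.abs(beta_hat[1:]))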

The Ideal Scenario

To help us see why adding this penalty term would help to reduce overfitting, let's think about what the absolute best intercept and slope solutions to this optimization problem would be. Notice that the full objective function is now the sum of two terms:

  • $SSE=\left[ \sum_{i=1}^n(y_i-(\hat{\beta}_0+\hat{\beta}_1x_{i,1}+...+\hat{\beta}_{50}x_{i,50}))^2 \right]$
  • $\lambda\cdot\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]$
Notice that *neither* of these terms can be negative. Because we are trying to minimize the sum of these two terms, the best possible scenario is that both of them are equal to 0.

The Penalty Term Balancing Act

However, note that this "ideal scenario" we just described is unrealistic. It's unlikely that we'll be able to fit a model where ALL the slopes $\hat{\beta}_j=0$ AND there is absolutely NO model error (ie. $SSE=0$). What we actually see is that these two terms end up competing with one another: choosing an intercept and slopes that decrease one term often increases the other term, and vice versa.

The Impact of $\lambda$

Now let's think about the impact that our preselected $\lambda$ parameter has on this balancing act.

$\lambda$ is Set Low $\rightarrow \left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]$ Is Higher and SSE tends to be Lower (ie. Fit is Better)

If we were to set $\lambda$ as low as possible ($\lambda=0$), this penalty term would have no impact at all on the linear regression intercept and slopes that are chosen. With $\lambda=0$, our objective function simply becomes the same as our nonregularized linear regression objective function, so in this case we are fitting just a basic linear regression model. The slope values face no penalty for being as large as they need to be in order to minimize the SSE.

$min \left[ \sum_{i=1}^n(y_i-(\hat{\beta}_0+\hat{\beta}_1x_{i,1}+...+\hat{\beta}_{50}x_{i,50}))^2 \right] +(0)\cdot\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]$
$=min \left[ \sum_{i=1}^n(y_i-(\hat{\beta}_0+\hat{\beta}_1x_{i,1}+...+\hat{\beta}_{50}x_{i,50}))^2 \right]$
Notice however, if $\lambda>0$ but is still set to be very low, the optimal slopes still do not face a very large penalty for being as large as they need to be in order to keep SSE low. Because $\lambda$ is low, $\lambda \cdot\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]$ will also still be relatively low, even if some of the slope magnitudes end up being quite large.
$\lambda$ is Set High $\rightarrow \left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]$ Is Lower and SSE Tends to be Higher (ie. Fit is Worse)



Conversely, if $\lambda>0$ but is set to be very high, the optimal slopes face a very large penalty for being as large as they need to be in order to keep SSE low. Because $\lambda$ is high, $\lambda \cdot\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]$ will also be high, even if many of our slope magnitudes are not all that large. Thus, in this scenario, the optimization problem is going to work harder to keep the slope magnitudes low. This most likely will come at the expense of the fit of the model, ie. the SSE will end up being larger.

Which slope magnitudes get to stay large?

So we can see that the higher we set $\lambda$, the more encouraged the optimal intercept and slope solutions will be to keep the overall term $\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]$ low, which can come at the expense of the model fit (ie. SSE will be higher).

But does this mean that all of the $|\hat{\beta}_j|$ terms need to decrease equally? Or is it possible that the optimization problem will cleverly keep some slopes high that correspond to the variables that contribute the most predictive power to the model thereby leading to much larger reductions to the SSE? And could it also keep other slopes low that correspond to the variables that do not contribute a lot independently to the predictive power of the model (or in other words, may be causing the model to overfit)? Luckily, in practice we see that the answer to these questions is yes!

Thus the resulting optimal slopes that are set low are the ones that bring the least predictive power to the model, and their corresponding variables are the ones most likely to cause the model to overfit.

Clearer Slope Interpretation with LASSO Regression

So what we learned is that the less predictive power a given variable independently brings to a regression model, the more likely its corresponding slope is to be set low. But then we might ask: how low does a resulting slope value need to be to warrant leaving the corresponding variable out of the model for not bringing enough predictive power?

One of the benefits of using the L1 penalty term in regression regularization (as opposed to the other two penalty terms that we'll discuss in 8.3 and 8.4) is how the absolute value function interacts with the calculus techniques used to solve for the optimal intercept and slope solutions. In general, if the optimization problem encourages a slope to be comparatively small, the optimal solution can set that slope exactly equal to 0.

Having these small slopes set exactly equal to 0 gets rid of a lot of the interpretation ambiguity. If a slope is set to 0, then the LASSO regression model is suggesting that the corresponding variable can be left out of the model because it is not bringing enough predictive power, based on the $\lambda$ value that you chose.

Nonregularized Linear Regression for Predicting Tumor Size - for Comparison

Before we fit a LASSO linear regression model with our scaled training features matrix and target array, let's also fit a basic nonregularized linear regression model for comparison purposes. To do so, let's use the LinearRegression() function that we used in Data Science Discovery.

Recall that we first instantiate an object with this LinearRegression() function. And then we use the .fit() function with our features matrix ($X_{train}$) and target array ($y_{train}$) to actually fit the model.

from sklearn.linear_model import LinearRegression

lin_reg_mod = LinearRegression()
lin_reg_mod.fit(X_train, y_train)
LinearRegression()

And let's display the slopes for this fitted linear regression model below by using the .coef_ attribute. Because we scaled our gene explanatory variables, we can interpret the slope magnitudes as an indication of how important each corresponding variable is for predicting tumor size in this particular model, in the context of the other variables in the model.

import pandas as pd

df_slopes = pd.DataFrame(lin_reg_mod.coef_.T, columns = ['lin_reg_mod'], index=X_train.columns)
df_slopes.sort_values(by=['lin_reg_mod'])

lin_reg_mod
X1595 -23.334004
X980 -10.996647
X1169 -10.944499
X1470 -10.656927
X1193 -6.094789
X1141 -5.867328
X1563 -4.857389
X1203 -4.274063
X1023 -3.991874
X1506 -3.580033
X986 -3.413999
X1264 -3.258595
X1514 -2.879311
X1416 -1.948134
X1656 -1.881983
X1637 -1.837732
X1219 -1.244601
X1657 -1.060942
X1417 -0.923660
X1103 -0.739345
X1418 -0.726373
X1529 -0.556923
X1136 -0.547525
X1232 -0.493502
X1092 -0.408996
X1109 0.121617
X1553 0.126864
X1124 0.488547
X1144 0.657140
X1444 0.971105
X1179 1.212503
X1597 1.487872
X1683 1.749397
X1206 1.790304
X1064 2.006906
X159 2.093903
X1609 2.157559
X1351 2.802728
X1028 3.797146
X1362 3.868178
X1208 4.107494
X1292 4.623105
X1272 6.584962
X1574 7.214710
X1173 7.801992
X1297 8.719397
X960 9.995326
X1329 11.802248
X1430 12.056850
X1616 13.586829

Issue #1: Biased Slopes due to Multicollinearity

However, remember that our features matrix showed many pairs of variables that were collinear. Thus, it is possible that our slope interpretations could still be biased due to this multicollinearity.

Issue #2: Could we be Overfitting?

Remember that one of our other goals is to use our predictive model to attain good predictions for new datasets. Let's assess the test dataset $R^2$ of this linear regression model before moving on to our LASSO linear regression model.

We can use the .score() function along with our test features matrix $X_{test}$ and target array $y_{test}$ to calculate this test R^2 of the model.

lin_reg_mod.score(X_test, y_test)

-0.3071595344334954

Negative R^2?

Interestingly enough, we got an $R^2$ value that may not be what we were expecting: the R^2 of the test dataset is negative for this model! If the R^2 value were between 0 and 1, it would represent the percent of tumor size variability that is explained by the model.

However, when the R^2 value is negative, we lose that interpretation. Another huge downside is that a negative R^2 implies that this model is not a good fit for the test dataset (it predicts worse than simply predicting the mean tumor size of the test dataset). In general, the lower the R^2 value, the worse the fit.
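To see where a negative test R^2 comes from, here is a minimal sketch (using the objects already defined above, and assuming y_test is a one-dimensional array or Series) that computes the test R^2 by hand as 1 - SSE/SST; when the model's SSE on the test data exceeds the test data's SST, this quantity is negative.

import numpy as np

y_pred = lin_reg_mod.predict(X_test)
sse = np.sum((y_test - y_pred) ** 2)             # model's squared error on the test data
sst = np.sum((y_test - np.mean(y_test)) ** 2)    # squared error of always predicting the mean
print(1 - sse / sst)                             # should match lin_reg_mod.score(X_test, y_test)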

LASSO Linear Regression for Predicting Tumor Size

Now, let's actually apply our LASSO Linear regression model to this dataset.

Recall that our LASSO linear regression objective function requires us to preselect a value for $\lambda$. Let's try out a few different values, $\lambda\in [0.025, 0.05, 0.1, 0.5]$, and see how this impacts our results and research goals.

$min \left[ \sum_{i=1}^n(y_i-(\hat{\beta}_0+\hat{\beta}_1x_{i,1}+...+\hat{\beta}_{50}x_{i,50}))^2 \right] +\lambda\cdot\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]$

We can use the Lasso() function to fit a linear LASSO regression model. This Lasso() function operates in a similar way to the LinearRegression() function. First, we create a Lasso() object with specified parameters. Then we take this object and .fit() it with the features matrix and the target array.

  • The alpha parameter represents the $\lambda$ parameter.
  • The algorithm used to search for the optimal intercept and slopes of this LASSO regression objective function may not always arrive at an optimal solution in a fixed number of steps. Therefore, in the interest of time, we can set the max_iter parameter to the maximum number of iterations that this algorithm should take before stopping early and returning the best solution that it currently has.
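As an aside, rather than fitting each of the four models in its own cell below, they could also be fit in a single loop; a rough sketch (the variable names are our own) might look like this.

from sklearn.linear_model import Lasso

lasso_models = {}
for lam in [0.025, 0.05, 0.1, 0.5]:
    mod = Lasso(alpha=lam, max_iter=1000)   # alpha plays the role of lambda
    mod.fit(X_train, y_train)
    lasso_models[lam] = mod
    print(lam, mod.score(X_test, y_test))   # test dataset R^2 for each lambda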

Fitting Four LASSO Models

Model 1: $\lambda=0.025$

from sklearn.linear_model import Lasso
lasso_mod_025 = Lasso(alpha=0.025, max_iter=1000)
lasso_mod_025.fit(X_train, y_train)
Lasso(alpha=0.025)

Model 2: $\lambda=.05$

lasso_mod_05 = Lasso(alpha=.05, max_iter=1000)
lasso_mod_05.fit(X_train, y_train)
Lasso(alpha=0.05)

Model 3: $\lambda=.1$

lasso_mod_1 = Lasso(alpha=.1, max_iter=1000)
lasso_mod_1.fit(X_train, y_train)
Lasso(alpha=0.1)

Model 4: $\lambda=.5$

lasso_mod_5 = Lasso(alpha=.5, max_iter=1000)
lasso_mod_5.fit(X_train, y_train)
Lasso(alpha=0.5)

Inspecting the Slopes

Next, let's inspect and compare the slopes of the models that we've created so far. We can similarly use the .coef_ attribute to extract the slopes of each of our 4 models.

By comparing the slopes of the four LASSO linear regression models, we can see what we would expect: as we increase $\lambda$, the number of non-zero slopes decreases. In fact, when $\lambda=0.5$, there were only seven genes with non-zero slopes (a quick count of the non-zero slopes in each model is sketched after the table below). This result may suggest that these genes are the most important when it comes to predicting breast tumor size.

df_slopes = pd.DataFrame({'lin_reg_mod': lin_reg_mod.coef_.T, 
'lasso_mod_025': lasso_mod_025.coef_.T,
'lasso_mod_05': lasso_mod_05.coef_.T,
'lasso_mod_1': lasso_mod_1.coef_.T,
'lasso_mod_5': lasso_mod_5.coef_.T}, index=X_train.columns)
df_slopes

lin_reg_mod lasso_mod_025 lasso_mod_05 lasso_mod_1 lasso_mod_5
X159 2.093903 1.452509 1.098574 0.593176 0.000000
X960 9.995326 8.302795 6.596317 4.266726 0.000000
X980 -10.996647 -7.199944 -4.237265 -0.704563 -0.000000
X986 -3.413999 -2.592459 -1.775948 -0.000000 -0.000000
X1023 -3.991874 -3.340229 -2.766260 -2.092078 -0.147790
X1028 3.797146 3.196267 2.871440 2.605532 1.912208
X1064 2.006906 2.341590 2.206635 1.691021 0.000000
X1092 -0.408996 -0.338999 -0.259970 -0.234224 0.000000
X1103 -0.739345 -0.000000 -0.000000 0.000000 -0.000000
X1109 0.121617 -0.789248 -2.023169 -2.986337 -0.000000
X1124 0.488547 0.000000 -0.000000 -0.000000 0.000000
X1136 -0.547525 -0.206065 -0.000000 -0.000000 -0.673779
X1141 -5.867328 -5.052785 -4.416822 -2.981458 -0.481311
X1144 0.657140 0.126505 0.000000 -0.000000 0.000000
X1169 -10.944499 -5.637134 -3.492767 -0.986729 -0.000000
X1173 7.801992 5.701120 3.886316 1.669699 -0.000000
X1179 1.212503 0.711820 0.473037 0.412424 -0.000000
X1193 -6.094789 -4.625321 -2.528730 -0.000000 -0.000000
X1203 -4.274063 -1.917869 -0.418720 -0.000000 -0.000000
X1206 1.790304 1.148111 0.786031 0.216690 0.000000
X1208 4.107494 3.441126 2.072339 0.342848 0.000000
X1219 -1.244601 -0.852432 -0.968064 -0.850370 -0.000000
X1232 -0.493502 -0.000000 -0.000000 -0.000000 0.000000
X1264 -3.258595 -2.175231 -1.305576 -0.045646 -0.000000
X1272 6.584962 2.679780 0.528721 0.000000 -0.000000
X1292 4.623105 0.675658 0.000000 -0.000000 -0.000000
X1297 8.719397 1.117108 0.000000 0.000000 -0.000000
X1329 11.802248 8.539188 5.536811 0.918378 -0.000000
X1351 2.802728 1.551829 0.466324 -0.000000 -0.009818
X1362 3.868178 2.878842 2.239372 0.875101 -0.000000
X1416 -1.948134 -1.688431 -1.259215 -0.933760 -0.284509
X1417 -0.923660 -0.825183 -0.432391 -0.479412 -0.000000
X1418 -0.726373 -0.371522 -0.000000 -0.000000 -0.000000
X1430 12.056850 9.071478 7.372312 5.274174 0.000000
X1444 0.971105 0.894966 0.609778 0.000000 0.000000
X1470 -10.656927 -7.210662 -5.213523 -2.155941 -0.000000
X1506 -3.580033 -1.160591 -0.255748 -0.000000 -0.000000
X1514 -2.879311 -2.347902 -2.106235 -2.149132 -0.730586
X1529 -0.556923 -0.321777 -0.328990 -0.380970 -0.000000
X1553 0.126864 -0.000000 -0.000000 -0.000000 0.000000
X1563 -4.857389 -3.100188 -2.107724 -1.007217 -0.000000
X1574 7.214710 0.000000 -0.000000 -0.000000 0.000000
X1595 -23.334004 -1.343896 -1.080912 -0.000000 0.000000
X1597 1.487872 0.420259 -0.000000 -0.126035 -0.000000
X1609 2.157559 1.912054 1.368465 0.826883 0.000000
X1616 13.586829 -0.000000 -0.000000 -0.571179 -0.000000
X1637 -1.837732 -0.810217 -0.103097 -0.000000 -0.000000
X1656 -1.881983 -1.962349 -1.742366 -1.070309 -0.000000
X1657 -1.060942 -0.122512 -0.000000 -0.000000 -0.000000
X1683 1.749397 1.144410 0.469739 0.000000 -0.000000
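As referenced above, here is a quick sketch that counts the non-zero slopes in each column of df_slopes; the lin_reg_mod column should contain all 50 non-zero slopes, while the LASSO columns should shrink as $\lambda$ grows.

# Number of non-zero slopes in each fitted model
(df_slopes != 0).sum()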

Which model is best?

But could it be possible that some of these LASSO models have left out too many explanatory variables and thus may be underfitting?

Given that one of our goals is to build a model that will yield good predictions for new breast tumors, let's inspect the test dataset R^2 of each of these 4 LASSO models.

Nonregularized Linear Regression Model: Test Dataset R^2

lin_reg_mod.score(X_test, y_test)

-0.3071595344334954

LASSO Model 1: $\lambda=0.025$ Test Dataset R^2

lasso_mod_025.score(X_test, y_test)

0.1539346136658848

LASSO Model 2: $\lambda=0.05$ Test Dataset R^2

lasso_mod_05.score(X_test, y_test)

0.2786886369663709

LASSO Model 3: $\lambda=0.1$ Test Dataset R^2

lasso_mod_1.score(X_test, y_test)

0.2490838650434536

LASSO Model 4: $\lambda=0.5$ Test Dataset R^2

lasso_mod_5.score(X_test, y_test)

0.1445914387347873

We can see that the linear regression model that achieves the best test dataset R^2=0.279 is the LASSO model that used $\lambda=0.05$. Ideally, a more complete analysis would try out many other values of $\lambda$ to see if we can find a test dataset R^2 that was even higher.

Can we trust the non-zero slope interpretations in our chosen LASSO model?

Ideally, we'd be able to trust our interpretations of our non-zero slopes in this resulting LASSO model that we chose as well. Recall that many of our 50 gene explanatory variables in the nonregularized linear regression model were collinear with each other, which should make us more skeptical about how we interpret the resulting slopes.

If we were to remove the explanatory variables that had zero slopes in our chosen LASSO model, would the remaining explanatory variables still be collinear?

Below is a list of the 37 gene names that had non-zero slopes in our best LASSO model.

import numpy as np

nonzero_slope_genes = df_slopes[np.abs(df_slopes['lasso_mod_05'])>0.00001].index
nonzero_slope_genes

Index(['X159', 'X960', 'X980', 'X986', 'X1023', 'X1028', 'X1064', 'X1092',
'X1109', 'X1141', 'X1169', 'X1173', 'X1179', 'X1193', 'X1203', 'X1206',
'X1208', 'X1219', 'X1264', 'X1272', 'X1329', 'X1351', 'X1362', 'X1416',
'X1417', 'X1430', 'X1444', 'X1470', 'X1506', 'X1514', 'X1529', 'X1563',
'X1595', 'X1609', 'X1637', 'X1656', 'X1683'],
dtype='object')

And below is the correlation matrix of each of these 37 remaining genes.

X_train_remaining = X_train[nonzero_slope_genes]
X_train_remaining.corr()

X159 X960 X980 X986 X1023 X1028 X1064 X1092 X1109 X1141 ... X1470 X1506 X1514 X1529 X1563 X1595 X1609 X1637 X1656 X1683
X159 1.000000 0.060265 -0.065454 0.022752 0.002266 -0.226498 -0.035437 -0.160238 -0.110140 0.011729 ... 0.043124 -0.031407 -0.086193 0.262708 0.146743 -0.010273 -0.141372 -0.072578 0.134411 -0.151282
X960 0.060265 1.000000 -0.916792 0.919474 0.864897 -0.656544 0.934777 -0.444383 -0.882282 0.871462 ... 0.927974 -0.927604 -0.820769 0.807609 0.888604 -0.851930 -0.841161 -0.825226 0.793374 -0.838848
X980 -0.065454 -0.916792 1.000000 -0.863944 -0.775288 0.762141 -0.913493 0.513205 0.948234 -0.864145 ... -0.941770 0.977642 0.820674 -0.866144 -0.883796 0.874294 0.873856 0.895619 -0.860707 0.893986
X986 0.022752 0.919474 -0.863944 1.000000 0.810531 -0.650212 0.909173 -0.449640 -0.805998 0.829365 ... 0.906335 -0.886268 -0.755153 0.736371 0.855282 -0.826968 -0.794439 -0.786007 0.716178 -0.804646
X1023 0.002266 0.864897 -0.775288 0.810531 1.000000 -0.618030 0.789146 -0.379213 -0.712516 0.756040 ... 0.785812 -0.788302 -0.685370 0.682320 0.762159 -0.787609 -0.708756 -0.682828 0.638509 -0.662923
X1028 -0.226498 -0.656544 0.762141 -0.650212 -0.618030 1.000000 -0.645414 0.484050 0.722887 -0.651875 ... -0.707760 0.743441 0.683551 -0.696159 -0.706024 0.762429 0.725366 0.729037 -0.659633 0.699566
X1064 -0.035437 0.934777 -0.913493 0.909173 0.789146 -0.645414 1.000000 -0.485517 -0.874262 0.861079 ... 0.926588 -0.928034 -0.837623 0.760780 0.860687 -0.843057 -0.823614 -0.845297 0.776042 -0.844329
X1092 -0.160238 -0.444383 0.513205 -0.449640 -0.379213 0.484050 -0.485517 1.000000 0.400143 -0.469561 ... -0.455920 0.496425 0.470159 -0.429428 -0.516239 0.562968 0.522040 0.410890 -0.441094 0.445237
X1109 -0.110140 -0.882282 0.948234 -0.805998 -0.712516 0.722887 -0.874262 0.400143 1.000000 -0.812283 ... -0.900745 0.935573 0.798052 -0.857029 -0.822570 0.798974 0.822577 0.910524 -0.848353 0.877479
X1141 0.011729 0.871462 -0.864145 0.829365 0.756040 -0.651875 0.861079 -0.469561 -0.812283 1.000000 ... 0.868621 -0.869839 -0.736111 0.745264 0.843947 -0.799305 -0.782430 -0.712398 0.799659 -0.801396
X1169 -0.004202 0.925896 -0.926571 0.864454 0.789092 -0.699165 0.909802 -0.447707 -0.883932 0.924337 ... 0.923992 -0.930818 -0.786308 0.784502 0.867721 -0.863700 -0.846288 -0.823758 0.820538 -0.840337
X1173 0.058787 0.903325 -0.884817 0.870304 0.776981 -0.693137 0.877916 -0.554142 -0.794434 0.909999 ... 0.880804 -0.885603 -0.791922 0.767329 0.863425 -0.850215 -0.833425 -0.736912 0.749456 -0.804784
X1179 0.062648 0.737455 -0.736313 0.654872 0.641327 -0.605075 0.656605 -0.343754 -0.711082 0.775219 ... 0.737398 -0.739159 -0.511046 0.685649 0.740930 -0.685372 -0.635948 -0.620235 0.763833 -0.608884
X1193 0.091274 0.933136 -0.930017 0.873882 0.803894 -0.673898 0.908863 -0.465221 -0.881124 0.893868 ... 0.907854 -0.917627 -0.771814 0.857625 0.887649 -0.828686 -0.830956 -0.804668 0.809600 -0.834385
X1203 0.079657 0.916031 -0.904998 0.899109 0.779479 -0.687560 0.899677 -0.485905 -0.861692 0.861878 ... 0.938010 -0.916234 -0.724234 0.806315 0.910898 -0.810954 -0.803194 -0.828396 0.828596 -0.794665
X1206 0.174463 0.760769 -0.742724 0.796077 0.670605 -0.639950 0.746090 -0.500799 -0.687160 0.741375 ... 0.771183 -0.754391 -0.642960 0.724521 0.802564 -0.695499 -0.656429 -0.651239 0.666202 -0.673853
X1208 0.044885 0.908029 -0.926673 0.884039 0.765225 -0.671927 0.926455 -0.450485 -0.905937 0.877521 ... 0.930512 -0.932100 -0.767078 0.814237 0.879817 -0.815426 -0.795445 -0.824524 0.834101 -0.844577
X1219 0.180275 0.769047 -0.807458 0.769531 0.633759 -0.633706 0.733670 -0.411896 -0.761748 0.752468 ... 0.851443 -0.817103 -0.559037 0.710085 0.830423 -0.711176 -0.705878 -0.733160 0.800655 -0.708285
X1264 -0.051055 -0.781417 0.802760 -0.782776 -0.703112 0.658323 -0.804180 0.397005 0.814639 -0.648905 ... -0.757512 0.795819 0.771927 -0.703418 -0.681377 0.758941 0.786272 0.803112 -0.603683 0.774020
X1272 -0.040265 -0.911553 0.947765 -0.873484 -0.767387 0.751029 -0.915850 0.494310 0.891946 -0.866641 ... -0.937044 0.965954 0.849028 -0.781977 -0.865252 0.881443 0.847869 0.859305 -0.817937 0.893305
X1329 0.009951 0.929842 -0.932724 0.886701 0.789396 -0.695484 0.909363 -0.441399 -0.895399 0.940292 ... 0.954865 -0.940787 -0.778585 0.792883 0.904664 -0.839982 -0.830809 -0.815080 0.867191 -0.863806
X1351 0.074195 0.811867 -0.847300 0.750453 0.680455 -0.660163 0.795969 -0.450868 -0.825490 0.831995 ... 0.843567 -0.838437 -0.692781 0.810342 0.861870 -0.751341 -0.743854 -0.730542 0.850602 -0.750289
X1362 0.020027 0.918716 -0.923671 0.875123 0.802120 -0.676309 0.902487 -0.419660 -0.897589 0.869691 ... 0.923741 -0.927871 -0.724029 0.826481 0.887244 -0.806308 -0.808091 -0.843427 0.835359 -0.804713
X1416 0.180052 0.383643 -0.502941 0.356281 0.214178 -0.423820 0.400211 -0.473305 -0.477921 0.480017 ... 0.436741 -0.450760 -0.351886 0.565908 0.502524 -0.425664 -0.416625 -0.375468 0.593593 -0.429165
X1417 -0.074155 -0.891227 0.949137 -0.835970 -0.775966 0.776325 -0.879697 0.439637 0.910556 -0.832863 ... -0.897464 0.943863 0.839946 -0.832228 -0.830148 0.847958 0.872998 0.868016 -0.804726 0.887542
X1430 -0.077804 -0.891555 0.900095 -0.830997 -0.774981 0.675972 -0.858329 0.479091 0.877924 -0.803209 ... -0.826189 0.885129 0.839241 -0.830799 -0.788063 0.832096 0.837266 0.829237 -0.717934 0.837409
X1444 -0.151062 -0.822917 0.910982 -0.774846 -0.698487 0.763655 -0.821939 0.516456 0.868480 -0.774262 ... -0.848531 0.902977 0.790241 -0.795023 -0.779184 0.826656 0.814365 0.812694 -0.760813 0.851153
X1470 0.043124 0.927974 -0.941770 0.906335 0.785812 -0.707760 0.926588 -0.455920 -0.900745 0.868621 ... 1.000000 -0.952881 -0.771150 0.801664 0.923467 -0.832105 -0.836461 -0.856976 0.863743 -0.854665
X1506 -0.031407 -0.927604 0.977642 -0.886268 -0.788302 0.743441 -0.928034 0.496425 0.935573 -0.869839 ... -0.952881 1.000000 0.828279 -0.830340 -0.892210 0.876923 0.846958 0.901064 -0.843021 0.897225
X1514 -0.086193 -0.820769 0.820674 -0.755153 -0.685370 0.683551 -0.837623 0.470159 0.798052 -0.736111 ... -0.771150 0.828279 1.000000 -0.696167 -0.724666 0.795717 0.783708 0.771904 -0.647816 0.851241
X1529 0.262708 0.807609 -0.866144 0.736371 0.682320 -0.696159 0.760780 -0.429428 -0.857029 0.745264 ... 0.801664 -0.830340 -0.696167 1.000000 0.808604 -0.721919 -0.781370 -0.784322 0.794576 -0.767723
X1563 0.146743 0.888604 -0.883796 0.855282 0.762159 -0.706024 0.860687 -0.516239 -0.822570 0.843947 ... 0.923467 -0.892210 -0.724666 0.808604 1.000000 -0.793890 -0.774063 -0.767569 0.851589 -0.779229
X1595 -0.010273 -0.851930 0.874294 -0.826968 -0.787609 0.762429 -0.843057 0.562968 0.798974 -0.799305 ... -0.832105 0.876923 0.795717 -0.721919 -0.793890 1.000000 0.829484 0.793389 -0.702696 0.779285
X1609 -0.141372 -0.841161 0.873856 -0.794439 -0.708756 0.725366 -0.823614 0.522040 0.822577 -0.782430 ... -0.836461 0.846958 0.783708 -0.781370 -0.774063 0.829484 1.000000 0.828017 -0.722580 0.811844
X1637 -0.072578 -0.825226 0.895619 -0.786007 -0.682828 0.729037 -0.845297 0.410890 0.910524 -0.712398 ... -0.856976 0.901064 0.771904 -0.784322 -0.767569 0.793389 0.828017 1.000000 -0.739133 0.851905
X1656 0.134411 0.793374 -0.860707 0.716178 0.638509 -0.659633 0.776042 -0.441094 -0.848353 0.799659 ... 0.863743 -0.843021 -0.647816 0.794576 0.851589 -0.702696 -0.722580 -0.739133 1.000000 -0.752078
X1683 -0.151282 -0.838848 0.893986 -0.804646 -0.662923 0.699566 -0.844329 0.445237 0.877479 -0.801396 ... -0.854665 0.897225 0.851241 -0.767723 -0.779229 0.779285 0.811844 0.851905 -0.752078 1.000000

37 rows × 37 columns

import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(X_train_remaining.corr(), vmin=-1, vmax=1, cmap='RdBu')
plt.show()
[Figure: heatmap of the correlation matrix for the 37 genes with non-zero slopes in the chosen LASSO model]

Unfortunately, there is still a high degree of multicollinearity in the remaining explanatory variables. Therefore, we should also be cautious when interpreting the slopes of our chosen LASSO model as well.

Ridge Regression (ie. L2 Penalty Term)

In the regularized linear ridge regression optimization problem shown below, the penalty term that is added is $\lambda\cdot\left[\hat{\beta}_1^2+\hat{\beta}_2^2+...+\hat{\beta}_{50}^2\right]$. We call this an L2 penalty term because the L2 norm of a given list (ie. vector) of values is the square root of the sum of the squared values in the list; the penalty shown here is the square of that norm.

Similar to LASSO regression, we introduce a parameter $\lambda\geq 0$ that we must preselect.

Linear Ridge Regression Optimization Problem
$$min \left[ \sum_{i=1}^n(y_i-(\hat{\beta}_0+\hat{\beta}_1x_{i,1}+...+\hat{\beta}_{50}x_{i,50}))^2 \right] +\lambda\cdot\left[\hat{\beta}_1^2+\hat{\beta}_2^2+...+\hat{\beta}_{50}^2\right]$$
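As with the LASSO objective, the ridge objective can be transcribed directly into code; below is a minimal sketch (our own naming), where again only the slopes, not the intercept, are penalized.

import numpy as np

def ridge_objective(beta_hat, X, y, lam):
    # SSE plus the L2 penalty: lam * (b_1^2 + ... + b_50^2).
    # beta_hat[0] is the (unpenalized) intercept; beta_hat[1:] are the slopes.
    y_hat = beta_hat[0] + X @ beta_hat[1:]
    sse = np.sum((y - y_hat) ** 2)
    return sse + lam * np.sum(beta_hat[1:] ** 2)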

The Ideal Scenario

Similar to the LASSO linear regression, the absolute best intercept and slope solutions to this optimization would be ones in which:

  • $SSE=\left[ \sum_{i=1}^n(y_i-(\hat{\beta}_0+\hat{\beta}_1x_{i,1}+...+\hat{\beta}_{50}x_{i,50}))^2 \right]=0$
  • $\lambda\cdot\left[\hat{\beta}_1^2+\hat{\beta}_2^2+...+\hat{\beta}_{50}^2\right]=0$

The Penalty Term Balancing Act

However, similarly, this "ideal scenario" that we just described is unrealistic. We also see that choosing intercepts and slopes that lead to a decrease of one term, often leads to an increase in the other term, and vice versa.

The Impact of $\lambda$

For similar reasons described when we used $\lambda$ in the LASSO linear regression, we see the following impact of $\lambda$ on our results.

$\lambda$ is Set Low

  • $\left[\hat{\beta}_1^2+\hat{\beta}_2^2+...+\hat{\beta}_{50}^2\right]$ Is Higher
  • SSE tends to be lower (ie. the fit is better)

$\lambda$ is Set High

  • $\left[\hat{\beta}_1^2+\hat{\beta}_2^2+...+\hat{\beta}_{50}^2\right]$ Is Lower
  • SSE tends to be higher (ie. the fit is worse)

Which slope magnitudes get to stay large?

Similar to LASSO linear regression, the resulting optimal slopes that are set "low" correspond to the variables that bring the least predictive power to the model, and these variables are the ones most likely to cause the model to overfit.

Less Clear Slope Interpretation with Ridge Regression

Unfortunately, one of the downsides of using the L2 penalty in ridge regression is that, unlike with the L1 penalty, slopes that are "comparatively small" will NOT be set exactly equal to 0. Therefore, the resulting slopes found with ridge regression provide a much less clear indication of which explanatory variables should be left out of the model because they don't bring enough predictive power.

Without more careful inspection, it is not always clear if a ridge regression slope is small because it does not bring enough predictive power to the model, or if it is small for other reasons. This also begs the question: "what does it mean for a slope magnitude to be considered 'small'"?

Benefits of Ridge Regression

However, despite being harder to interpret, ridge regression does have some benefits over LASSO regression.

  1. In the presence of multicollinearity, ridge regression slopes can be trusted more than those that would have been returned by a nonregularized linear regression model.

    For instance, suppose again we were trying to predict the height of a person $y$, using their left foot size $x_1$ and right foot size $x_2$. Again, it's likely that $x_1$ and $x_2$ are highly collinear. Thus, it's possible that your basic linear regression model ends up with slopes like $\hat{\beta}_1=1000$ and $\hat{\beta}_2=-999$ that are extremely large in magnitude with opposing signs, the negative slope $\hat{\beta}_2=-999$ being particularly misleading given that there is most likely a positive relationship between right foot size and height.

    Why would the model do this? Because $x_1$ and $x_2$ are so highly correlated, these two large slopes with opposing signs can easily cancel each other's "largeness" out. Or in other words, their joint effect in...
    $$\hat{\beta}_1x_1+\hat{\beta}_2x_2 = 1000x_1-999\,(value\ very\ close\ to\ x_1)\approx 1x_1$$
    ...may actually end up producing a much more realistic, modest predicted impact on $y$.

    Because ridge regression discourages slopes with large magnitudes, this type of scenario is less likely to happen. Furthermore, because ridge regression does not "zero out" slopes like LASSO does, the predicted impact of these collinear variables on the response variable is more likely to be distributed evenly across their slopes. (A small simulation sketch of this behavior follows this list.)

  2. Ridge regression can be less computationally complex for larger datasets
    Furthermore, due to some mathematical properties that the L2 norm has but the L1 norm does not, finding the optimal slopes and intercept for a ridge regression model can take less time than it does for a LASSO model, especially for larger datasets.
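Below is the small simulation sketch referenced in benefit 1 (entirely made-up numbers, for illustration only): two nearly identical "foot size" variables are generated, and the ordinary least squares slopes are compared with ridge slopes.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(25, 2, size=200)                  # left foot size
x2 = x1 + rng.normal(0, 0.01, size=200)           # right foot size: nearly a copy of x1
y = 100 + 2.5 * x1 + rng.normal(0, 1, size=200)   # height driven by foot size
X = np.column_stack([x1, x2])

print(LinearRegression().fit(X, y).coef_)  # slopes can be huge with opposing signs
print(Ridge(alpha=1.0).fit(X, y).coef_)    # slopes stay modest and similar to each other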

Linear Ridge Regression for Predicting Tumor Size

Now, let's actually apply our linear ridge regression model to this dataset.

Recall that our linear ridge regression objective function requires us to preselect a value for $\lambda$. Let's try out the following values $\lambda\in [1,2.5,5,10]$ and see how this impacts our results and research goals.

$$min \left[ \sum_{i=1}^n(y_i-(\hat{\beta}_0+\hat{\beta}_1x_{i,1}+...+\hat{\beta}_{50}x_{i,50}))^2 \right] +\lambda\cdot\left[\hat{\beta}_1^2+\hat{\beta}_2^2+...+\hat{\beta}_{50}^2\right]$$

We can similarly use the Ridge() function to fit a linear ridge regression model. This Ridge() function operates in a similar way to the Lasso() function.

Fitting Four Linear Ridge Regression Models

Model 1: $\lambda=1$

from sklearn.linear_model import Ridge
ridge_mod_1 = Ridge(alpha=1, max_iter=1000)
ridge_mod_1.fit(X_train, y_train)
Ridge(alpha=1, max_iter=1000)

Model 2: $\lambda=2.5$

ridge_mod_25 = Ridge(alpha=2.5, max_iter=1000)
ridge_mod_25.fit(X_train, y_train)
Ridge(alpha=2.5, max_iter=1000)

Model 3: $\lambda=5$

ridge_mod_5 = Ridge(alpha=5, max_iter=1000)
ridge_mod_5.fit(X_train, y_train)
Ridge(alpha=5, max_iter=1000)

Model 4: $\lambda=10$

ridge_mod_10 = Ridge(alpha=10, max_iter=1000)
ridge_mod_10.fit(X_train, y_train)
Ridge(alpha=10, max_iter=1000)

Inspecting the Slopes

Next, let's inspect and compare the slopes of the models that we've created so far.

df_slopes = pd.DataFrame({'lin_reg_mod': lin_reg_mod.coef_.T, 
'ridge_mod_1': ridge_mod_1.coef_.T,
'ridge_mod_25': ridge_mod_25.coef_.T,
'ridge_mod_5': ridge_mod_5.coef_.T,
'ridge_mod_10': ridge_mod_10.coef_.T}, index=X_train.columns)
df_slopes

lin_reg_mod ridge_mod_1 ridge_mod_25 ridge_mod_5 ridge_mod_10
X159 2.093903 1.564987 1.361952 1.177710 0.956538
X960 9.995326 7.115780 5.287849 3.822538 2.556407
X980 -10.996647 -5.623053 -3.329081 -2.023085 -1.158402
X986 -3.413999 -2.415057 -1.772567 -1.264054 -0.810833
X1023 -3.991874 -3.265249 -2.816867 -2.374631 -1.887797
X1028 3.797146 3.342845 3.143286 2.952002 2.656488
X1064 2.006906 2.591662 2.569644 2.250205 1.793449
X1092 -0.408996 -0.413064 -0.321912 -0.210659 -0.096273
X1103 -0.739345 -0.424530 -0.162644 -0.022618 0.032141
X1109 0.121617 -1.820703 -2.187465 -2.056208 -1.645878
X1124 0.488547 -0.215853 -0.254652 -0.097914 0.128196
X1136 -0.547525 -0.352572 -0.249713 -0.254968 -0.337248
X1141 -5.867328 -4.680306 -3.724521 -2.921222 -2.156535
X1144 0.657140 0.540347 0.250310 0.149831 0.149776
X1169 -10.944499 -5.565174 -3.599658 -2.426202 -1.571005
X1173 7.801992 5.396328 3.627020 2.385242 1.403658
X1179 1.212503 1.042993 1.043248 1.000648 0.853299
X1193 -6.094789 -4.094394 -2.641905 -1.641212 -0.907783
X1203 -4.274063 -1.935215 -0.947397 -0.435549 -0.149324
X1206 1.790304 1.324069 1.099690 0.946614 0.779058
X1208 4.107494 2.824609 1.849947 1.267326 0.899024
X1219 -1.244601 -0.981037 -1.096062 -1.149374 -1.086146
X1232 -0.493502 -0.118033 0.071458 0.188137 0.266373
X1264 -3.258595 -1.813517 -1.139158 -0.714440 -0.397089
X1272 6.584962 3.120894 1.564486 0.646001 0.094680
X1292 4.623105 2.082749 0.871776 0.200222 -0.143317
X1297 8.719397 1.837983 0.544894 0.087360 -0.077011
X1329 11.802248 6.911412 4.272329 2.525056 1.286927
X1351 2.802728 1.310916 0.494898 -0.049321 -0.408125
X1362 3.868178 2.854623 2.005154 1.355878 0.820093
X1416 -1.948134 -1.568316 -1.262320 -1.053509 -0.886972
X1417 -0.923660 -1.578682 -1.436202 -1.147429 -0.782046
X1418 -0.726373 -0.494504 -0.367910 -0.252362 -0.149841
X1430 12.056850 8.120432 6.068246 4.454804 3.008742
X1444 0.971105 0.955881 0.841838 0.679247 0.485901
X1470 -10.656927 -5.500100 -3.180474 -1.798032 -0.912554
X1506 -3.580033 -2.245236 -1.575135 -1.150218 -0.828351
X1514 -2.879311 -2.235236 -1.961917 -1.744741 -1.487527
X1529 -0.556923 -0.601374 -0.772179 -0.891623 -0.900789
X1553 0.126864 -1.143535 -0.636767 -0.393462 -0.234314
X1563 -4.857389 -2.987602 -2.075233 -1.440987 -0.953017
X1574 7.214710 2.018541 0.803107 0.305086 0.080143
X1595 -23.334004 -1.610346 -0.818146 -0.494847 -0.291165
X1597 1.487872 0.382882 -0.088330 -0.303635 -0.389779
X1609 2.157559 1.858897 1.627585 1.390239 1.120723
X1616 13.586829 -0.779628 -0.678448 -0.516199 -0.355038
X1637 -1.837732 -0.688241 -0.373500 -0.272567 -0.254245
X1656 -1.881983 -1.729769 -1.602954 -1.364397 -1.064114
X1657 -1.060942 -0.239239 -0.014737 0.077494 0.094871
X1683 1.749397 1.124721 0.829526 0.615599 0.391815

By comparing the slopes of the four linear ridge regression models, we can see what we would expect: as we increase $\lambda$, the sum of the squared slopes decreases more and more.

(df_slopes**2).sum()

lin_reg_mod 1989.465059
ridge_mod_1 454.498812
ridge_mod_25 226.579597
ridge_mod_5 120.894160
ridge_mod_10 61.376527
dtype: float64

We can see that most slopes individually decreased in magnitude as we increased $\lambda$; however, there were some, like X1109, whose magnitude actually increased.

Also notice that, unfortunately, none of our slopes were zeroed out the way they were in the LASSO models. Thus, the task of choosing which genes may be overfitting the model (and thus should be deleted) is much more difficult.

plt.figure(figsize=(14,5))
plt.plot(df_slopes['lin_reg_mod'], label='lin_reg_mod')
plt.plot(df_slopes['ridge_mod_1'], label='ridge_mod_1')
plt.plot(df_slopes['ridge_mod_25'], label='ridge_mod_25')
plt.plot(df_slopes['ridge_mod_5'], label='ridge_mod_5')
plt.plot(df_slopes['ridge_mod_10'], label='ridge_mod_10')
plt.axhline(0, color='black', linestyle='--')
plt.ylabel('Slope')
plt.xlabel('Gene Explanatory Variable')
plt.xticks(rotation=90)
plt.legend(bbox_to_anchor=(1,1))
plt.show()
[Figure: line plot of the slopes for each gene explanatory variable in the nonregularized model and the four ridge regression models]

We might further inspect the ridge regression slopes by dividing each slope by its original linear regression model slope to measure the factor by which it changed.

df_slopes_normalized = df_slopes.div(df_slopes['lin_reg_mod'], axis=0)
df_slopes_normalized

lin_reg_mod ridge_mod_1 ridge_mod_25 ridge_mod_5 ridge_mod_10
X159 1.0 0.747402 0.650437 0.562447 0.456821
X960 1.0 0.711911 0.529032 0.382433 0.255760
X980 1.0 0.511343 0.302736 0.183973 0.105341
X986 1.0 0.707398 0.519205 0.370256 0.237502
X1023 1.0 0.817974 0.705650 0.594866 0.472910
X1028 1.0 0.880357 0.827802 0.777427 0.699601
X1064 1.0 1.291372 1.280401 1.121231 0.893639
X1092 1.0 1.009945 0.787079 0.515063 0.235389
X1103 1.0 0.574198 0.219984 0.030592 -0.043472
X1109 1.0 -14.970821 -17.986539 -16.907278 -13.533319
X1124 1.0 -0.441828 -0.521245 -0.200419 0.262402
X1136 1.0 0.643937 0.456077 0.465673 0.615949
X1141 1.0 0.797689 0.634790 0.497879 0.367550
X1144 1.0 0.822271 0.380908 0.228005 0.227921
X1169 1.0 0.508491 0.328901 0.221682 0.143543
X1173 1.0 0.691660 0.464884 0.305722 0.179910
X1179 1.0 0.860198 0.860409 0.825274 0.703750
X1193 1.0 0.671786 0.433470 0.269281 0.148944
X1203 1.0 0.452781 0.221662 0.101905 0.034937
X1206 1.0 0.739578 0.614248 0.528745 0.435154
X1208 1.0 0.687672 0.450383 0.308540 0.218874
X1219 1.0 0.788234 0.880653 0.923488 0.872686
X1232 1.0 0.239174 -0.144797 -0.381228 -0.539760
X1264 1.0 0.556533 0.349586 0.219248 0.121859
X1272 1.0 0.473943 0.237585 0.098102 0.014378
X1292 1.0 0.450509 0.188569 0.043309 -0.031000
X1297 1.0 0.210792 0.062492 0.010019 -0.008832
X1329 1.0 0.585601 0.361993 0.213947 0.109041
X1351 1.0 0.467729 0.176577 -0.017598 -0.145617
X1362 1.0 0.737976 0.518372 0.350521 0.212010
X1416 1.0 0.805035 0.647963 0.540778 0.455293
X1417 1.0 1.709159 1.554903 1.242263 0.846682
X1418 1.0 0.680785 0.506503 0.347428 0.206286
X1430 1.0 0.673512 0.503303 0.369483 0.249546
X1444 1.0 0.984324 0.866887 0.699458 0.500359
X1470 1.0 0.516106 0.298442 0.168720 0.085630
X1506 1.0 0.627155 0.439978 0.321287 0.231381
X1514 1.0 0.776309 0.681384 0.605958 0.516626
X1529 1.0 1.079816 1.386509 1.600980 1.617440
X1553 1.0 -9.013841 -5.019273 -3.101436 -1.846966
X1563 1.0 0.615063 0.427232 0.296659 0.196200
X1574 1.0 0.279781 0.111315 0.042287 0.011108
X1595 1.0 0.069013 0.035062 0.021207 0.012478
X1597 1.0 0.257335 -0.059367 -0.204073 -0.261971
X1609 1.0 0.861574 0.754364 0.644358 0.519440
X1616 1.0 -0.057381 -0.049934 -0.037993 -0.026131
X1637 1.0 0.374506 0.203240 0.148317 0.138347
X1656 1.0 0.919121 0.851737 0.724979 0.565422
X1657 1.0 0.225496 0.013891 -0.073043 -0.089422
X1683 1.0 0.642919 0.474178 0.351892 0.223972

We see that as the value of $\lambda$ is increased, some slopes (like X980) dramatically decline in magnitude relative to their original magnitude in the linear regression model, while other slopes (like X1109) dramatically increase in magnitude relative to their original linear regression slope.

plt.figure(figsize=(14,5))
plt.plot(df_slopes_normalized['ridge_mod_1'], label='ridge_mod_1')
plt.plot(df_slopes_normalized['ridge_mod_25'], label='ridge_mod_25')
plt.plot(df_slopes_normalized['ridge_mod_5'], label='ridge_mod_5')
plt.plot(df_slopes_normalized['ridge_mod_10'], label='ridge_mod_10')
plt.axhline(0, color='black', linestyle='--')
plt.ylabel('Percent of Linear Regression Slope')
plt.xlabel('Gene Explanatory Variable')
plt.xticks(rotation=90)
plt.ylim([-1,2])
plt.legend(bbox_to_anchor=(1,1))
plt.show()
[Figure: line plot of each gene's ridge regression slope expressed as a fraction of its original linear regression slope]

Which model is best?

Finally, let's determine which ridge regression model is best for the purpose of predicting the size of new breast tumors, by again comparing the test dataset R^2 of each model.

Given that one of our goals is to build a model that will yield good predictions for new breast tumors, let's inspect the test dataset R^2 of each of these 4 ridge regression models.

Nonregularized Linear Regression Model: Test Dataset R^2

lin_reg_mod.score(X_test, y_test)

-0.3071595344334954

Ridge Regression Model 1: $\lambda=1$ Test Dataset R^2

ridge_mod_1.score(X_test, y_test)

0.12825147516877422

Ridge Regression Model 2: $\lambda=2.5$ Test Dataset R^2

ridge_mod_25.score(X_test, y_test)

0.24810734542753

Ridge Regression Model 3: $\lambda=5$ Test Dataset R^2

ridge_mod_5.score(X_test, y_test)

0.27410328541870965

Ridge Regression Model 4: $\lambda=10$ Test Dataset R^2

ridge_mod_10.score(X_test, y_test)

0.2607143483417401

We can see that the linear ridge regression model that achieves the best test dataset R^2=0.274 is the model that used $\lambda=5$. Ideally, a more complete analysis would try out many other values of $\lambda$ to see if we could find a test dataset R^2 that was even higher. Notice that this best ridge regression test R^2 was not quite as high as that of the best LASSO model we found (test dataset R^2=0.279).

 


Best Test R^2
Nonregularized Linear Regression -0.307
LASSO Linear Regression 0.279
Linear Ridge Regression 0.274

Can we trust the non-zero slope interpretations in our chosen ridge regression model?

Ideally, we'd be able to trust our interpretations of our non-zero slopes in this resulting ridge regression model that we chose as well. Because ridge regression is more effective at downplaying the impacts of multicollinearity (which we know this dataset has), we may be able to trust the interpretive value of the slopes in our ridge regression more.

Notice how the magnitude of some slopes dramatically changed in the two models below.

df_slopes = pd.DataFrame({'lin_reg_mod': lin_reg_mod.coef_.T, 
'ridge_mod_5': ridge_mod_5.coef_.T}, index=X_train.columns)
df_slopes

lin_reg_mod ridge_mod_5
X159 2.093903 1.177710
X960 9.995326 3.822538
X980 -10.996647 -2.023085
X986 -3.413999 -1.264054
X1023 -3.991874 -2.374631
X1028 3.797146 2.952002
X1064 2.006906 2.250205
X1092 -0.408996 -0.210659
X1103 -0.739345 -0.022618
X1109 0.121617 -2.056208
X1124 0.488547 -0.097914
X1136 -0.547525 -0.254968
X1141 -5.867328 -2.921222
X1144 0.657140 0.149831
X1169 -10.944499 -2.426202
X1173 7.801992 2.385242
X1179 1.212503 1.000648
X1193 -6.094789 -1.641212
X1203 -4.274063 -0.435549
X1206 1.790304 0.946614
X1208 4.107494 1.267326
X1219 -1.244601 -1.149374
X1232 -0.493502 0.188137
X1264 -3.258595 -0.714440
X1272 6.584962 0.646001
X1292 4.623105 0.200222
X1297 8.719397 0.087360
X1329 11.802248 2.525056
X1351 2.802728 -0.049321
X1362 3.868178 1.355878
X1416 -1.948134 -1.053509
X1417 -0.923660 -1.147429
X1418 -0.726373 -0.252362
X1430 12.056850 4.454804
X1444 0.971105 0.679247
X1470 -10.656927 -1.798032
X1506 -3.580033 -1.150218
X1514 -2.879311 -1.744741
X1529 -0.556923 -0.891623
X1553 0.126864 -0.393462
X1563 -4.857389 -1.440987
X1574 7.214710 0.305086
X1595 -23.334004 -0.494847
X1597 1.487872 -0.303635
X1609 2.157559 1.390239
X1616 13.586829 -0.516199
X1637 -1.837732 -0.272567
X1656 -1.881983 -1.364397
X1657 -1.060942 0.077494
X1683 1.749397 0.615599

Elastic Net Regression (ie. L1 and L2 Penalty Term)

In the regularized elastic net linear regression optimization problem shown below, the penalty term that is added is $\lambda\cdot\left[\alpha\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]+ \frac{1-\alpha}{2} \left[\hat{\beta}_1^2+\hat{\beta}_2^2+...+\hat{\beta}_{50}^2\right]\right]$.

We can see that this is a combination of the:

  • L1 norm $\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]$ from LASSO regression
  • (squared) L2 norm $\left[\hat{\beta}_1^2+\hat{\beta}_2^2+...+\hat{\beta}_{50}^2\right]$ from ridge regression.

Similar to LASSO and ridge regression, we introduce a parameter $\lambda\geq 0$ that we must preselect. We now also have a parameter $0\leq \alpha\leq 1$ that we must preselect.

Elastic Net Linear Regression Optimization Problem
$$min \left[ \sum_{i=1}^n(y_i-(\hat{\beta}_0+\hat{\beta}_1x_{i,1}+...+\hat{\beta}_{50}x_{i,50}))^2 \right] +\lambda\cdot\left[\alpha\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]+ \frac{1-\alpha}{2} \left[\hat{\beta}_1^2+\hat{\beta}_2^2+...+\hat{\beta}_{50}^2\right]\right]$$
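Written as code, the elastic net objective combines the two penalties exactly as in the formula above; a minimal sketch (our own naming) is shown below, where $\alpha=1$ recovers the LASSO penalty and $\alpha=0$ recovers (half of) the ridge penalty.

import numpy as np

def elastic_net_objective(beta_hat, X, y, lam, alpha):
    # SSE plus lam * ( alpha * L1 penalty + (1 - alpha)/2 * L2 penalty ).
    # beta_hat[0] is the (unpenalized) intercept; beta_hat[1:] are the slopes.
    y_hat = beta_hat[0] + X @ beta_hat[1:]
    sse = np.sum((y - y_hat) ** 2)
    l1 = np.sum(np.abs(beta_hat[1:]))
    l2 = np.sum(beta_hat[1:] ** 2)
    return sse + lam * (alpha * l1 + (1 - alpha) / 2 * l2)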

The Ideal Scenario

Similar to the LASSO and ridge linear regression, the absolute best intercept and slope solutions to this optimization would be ones in which:

  • $SSE=\left[ \sum_{i=1}^n(y_i-(\hat{\beta}_0+\hat{\beta}_1x_{i,1}+...+\hat{\beta}_{50}x_{i,50}))^2 \right]=0$
  • $\lambda\cdot\left[\alpha\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]+ \frac{1-\alpha}{2} \left[\hat{\beta}_1^2+\hat{\beta}_2^2+...+\hat{\beta}_{50}^2\right]\right]=0$

The Penalty Term Balancing Act

However, similarly, this "ideal scenario" that we just described is unrealistic. We also see that choosing intercepts and slopes that lead to a decrease of one term, often leads to an increase in the other term, and vice versa.

The Impact of $\lambda$

For similar reasons described when we used $\lambda$ in the LASSO linear regression, we see the following impact of $\lambda$ on our results.

$\lambda$ is Set Low

  • $\left[\alpha\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]+ \frac{1-\alpha}{2} \left[\hat{\beta}_1^2+\hat{\beta}_2^2+...+\hat{\beta}_{50}^2\right]\right]$ Is Higher
  • SSE tends to be lower (ie. the fit is better)

$\lambda$ is Set High

  • $\left[\alpha\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]+ \frac{1-\alpha}{2} \left[\hat{\beta}_1^2+\hat{\beta}_2^2+...+\hat{\beta}_{50}^2\right]\right]$ Is Lower
  • SSE tends to be higher (ie. the fit is worse)

Which slope magnitudes get to stay large?

Similar to LASSO and ridge regression, the resulting optimal slopes that are set "low" correspond to the variables that bring the least predictive power to the model, and these variables are the ones most likely to cause the model to overfit.

The Impact of the $\alpha$ Parameter

Now, let's explore the impact that $\alpha$ has on our resulting slopes.

$$min \left[ \sum_{i=1}^n(y_i-(\hat{\beta}_0+\hat{\beta}_1x_{i,1}+...+\hat{\beta}_{50}x_{i,50}))^2 \right] +\lambda\cdot\left[\alpha\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]+ \frac{1-\alpha}{2} \left[\hat{\beta}_1^2+\hat{\beta}_2^2+...+\hat{\beta}_{50}^2\right]\right]$$
  • $\alpha$ is Set High

    When $\alpha$ is set higher, the penalty coefficient in front of the L1 term $\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]$ becomes higher, so the optimization problem works harder to keep this term in particular low. Thus, your resulting slopes will tend to look more like those returned by LASSO regression (ie. some slopes zeroed out).
  • $\alpha$ is Set Low

    Alternatively, when $\alpha$ is set lower, the penalty coefficient in front of the L2 term $\left[\hat{\beta}_1^2+\hat{\beta}_2^2+...+\hat{\beta}_{50}^2\right]$ becomes higher, so the optimization problem works harder to keep this term in particular low. Thus, your resulting slopes will tend to look more like those returned by ridge regression (ie. fewer slopes zeroed out and more focus on balancing the weight across collinear slopes).

Less Clear Slope Interpretation Depending on $\alpha$

The smaller the value of $\alpha$, ie. the more the returned slopes tend to resemble a ridge regression solution, the less likely the small slopes are to be set exactly equal to 0. Thus, the less interpretive value the slopes have for deciding which variables should be removed from the model because they do not bring enough predictive power.

Benefits of Elastic Net Regression

The benefit of elastic net regression is that it can provide a "middle ground" between the types of results that you might get with LASSO and ridge regression. By toggling the $\alpha$ parameter, you can have your results resemble LASSO solutions or ridge solutions to varying degrees.

Elastic Net Linear Regression for Predicting Tumor Size

Finally, let's actually apply our elastic net linear regression model to this dataset.

Recall that our elastic net linear regression objective function requires us to preselect a value for $\lambda$. Let's try out a series of different $\lambda$ values to see which model yields the best test dataset performance.

Also recall that we need to preselect our $\alpha$ value as well. A more complete analysis would try out many different combinations of $\alpha$ and $\lambda$ values to assess which model gives us the best test data results (a small grid sketch of this idea appears after the parameter notes below). However, for now we will just try $\alpha=0.7$. Because this $\alpha$ is higher and closer to 1, we might expect our resulting slopes to resemble more what we might get with a LASSO model.

$$min \left[ \sum_{i=1}^n(y_i-(\hat{\beta}_0+\hat{\beta}_1x_{i,1}+...+\hat{\beta}_{50}x_{i,50}))^2 \right] +\lambda\cdot\left[\alpha\left[|\hat{\beta}_1|+|\hat{\beta}_2|+...+|\hat{\beta}_{50}|\right]+ \frac{1-\alpha}{2} \left[\hat{\beta}_1^2+\hat{\beta}_2^2+...+\hat{\beta}_{50}^2\right]\right]$$

We can similarly use the ElasticNet() function to fit an elastic net linear regression model. This ElasticNet() function operates in a similar way to the Lasso() function. Let's be careful about the parameters here.

  • the alpha parameter in the ElasticNet() function below represents the $\lambda$ parameter in the objective function above
  • the l1_ratio parameter in the ElasticNet() function below represents the $\alpha$ parameter in the objective function above.
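Here is the small grid sketch mentioned above (the particular $\lambda$ and $\alpha$ values are our own, purely illustrative): it loops over a few (λ, α) combinations, fits an ElasticNet() model for each, and records the test dataset R^2.

from sklearn.linear_model import ElasticNet

results = {}
for lam in [0.01, 0.025, 0.05, 0.1]:        # candidate lambda values (alpha= in sklearn)
    for a in [0.3, 0.5, 0.7]:               # candidate alpha values (l1_ratio= in sklearn)
        mod = ElasticNet(alpha=lam, l1_ratio=a)
        mod.fit(X_train, y_train)
        results[(lam, a)] = mod.score(X_test, y_test)
results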

Fitting Four Elastic Net Linear Regression Models

Model 1: $\lambda=0.01$

from sklearn.linear_model import ElasticNet
en_mod_1 = ElasticNet(alpha=.01, l1_ratio=0.7)
en_mod_1.fit(X_train, y_train)
en_mod_1.score(X_test, y_test)

0.06547371156125181

Model 2: $\lambda=0.025$

en_mod_2 = ElasticNet(alpha=.025, l1_ratio=0.7)
en_mod_2.fit(X_train, y_train)
en_mod_2.score(X_test, y_test)

0.23638055648260659

Model 3: $\lambda=0.05$

en_mod_3 = ElasticNet(alpha=.05, l1_ratio=0.7)
en_mod_3.fit(X_train, y_train)
en_mod_3.score(X_test, y_test)

0.28513850610195246

Model 4: $\lambda=0.1$

en_mod_4 = ElasticNet(alpha=.1, l1_ratio=0.7)
en_mod_4.fit(X_train, y_train)
en_mod_4.score(X_test, y_test)

0.2639254089412595

We can see that the elastic net model that used $\lambda=0.05$ and $\alpha=0.7$ achieved the highest test dataset R^2=0.285. Also note that this is the highest breast tumor test dataset R^2 that we have achieved so far out of all explored models!

 


| Model | Best Test R^2 |
| --- | --- |
| Nonregularized Linear Regression | -0.307 |
| LASSO Linear Regression | 0.279 |
| Linear Ridge Regression | 0.274 |
| Elastic Net Regression | 0.285 |
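
Rather than fitting each candidate model by hand as we did above, we could also sweep a small grid of $\lambda$ and $\alpha$ values in a loop and keep the combination with the best test data R^2. The sketch below is only illustrative: the candidate grids are arbitrary choices, and it reuses the X_train, y_train, X_test, and y_test objects created earlier in this analysis.

from sklearn.linear_model import ElasticNet

# Sweep a small (arbitrary) grid of lambda (sklearn's alpha) and alpha (sklearn's l1_ratio)
best = None
for lam in [0.01, 0.025, 0.05, 0.1]:
    for mix in [0.3, 0.5, 0.7, 0.9]:
        en = ElasticNet(alpha=lam, l1_ratio=mix)
        en.fit(X_train, y_train)
        r2 = en.score(X_test, y_test)
        if best is None or r2 > best[0]:
            best = (r2, lam, mix)
print('Best (test R^2, lambda, alpha):', best)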

Inspecting the Slopes

Next, let's inspect and compare the slopes of the models that we've created so far. Notice that we do see some zeroed-out slopes in our results, regardless of which $\lambda$ value we chose. This makes sense because our $\alpha$ was set relatively large, so its results look more similar to what we would get with a LASSO model.

Also note that, just as we observed with the LASSO models, the higher the value of $\lambda$, the more slopes were zeroed out.
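
As a quick check of this pattern, the short sketch below simply counts the zeroed-out slopes in each of the four models fit above (it assumes numpy is available, as elsewhere in this analysis).

import numpy as np

# Count how many of the 50 slopes each elastic net model zeroed out exactly
for name, mod in [('en_mod_1', en_mod_1), ('en_mod_2', en_mod_2),
                  ('en_mod_3', en_mod_3), ('en_mod_4', en_mod_4)]:
    print(name, 'zeroed slopes:', int(np.sum(mod.coef_ == 0)))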

df_slopes = pd.DataFrame({'lin_reg_mod': lin_reg_mod.coef_.T,
                          'en_mod_1': en_mod_1.coef_.T,
                          'en_mod_2': en_mod_2.coef_.T,
                          'en_mod_3': en_mod_3.coef_.T,
                          'en_mod_4': en_mod_4.coef_.T}, index=X_train.columns)
df_slopes

lin_reg_mod en_mod_1 en_mod_2 en_mod_3 en_mod_4
X159 2.093903 1.615975 1.358475 1.090770 0.746052
X960 9.995326 8.060144 6.416783 4.691085 3.088208
X980 -10.996647 -6.965351 -4.424210 -2.278063 -0.790220
X986 -3.413999 -2.768931 -2.033785 -1.337220 -0.347230
X1023 -3.991874 -3.436708 -3.000938 -2.449317 -1.895899
X1028 3.797146 3.408037 3.099803 2.878624 2.596566
X1064 2.006906 2.342968 2.535539 2.235952 1.875274
X1092 -0.408996 -0.397558 -0.324774 -0.185855 -0.058357
X1103 -0.739345 -0.455390 -0.000000 -0.000000 0.000000
X1109 0.121617 -1.150149 -1.963366 -2.365472 -2.082780
X1124 0.488547 -0.000000 -0.027279 -0.000000 0.000000
X1136 -0.547525 -0.444500 -0.169867 -0.000000 -0.000000
X1141 -5.867328 -5.154343 -4.223917 -3.427463 -2.432639
X1144 0.657140 0.561424 0.000000 0.000000 0.000000
X1169 -10.944499 -6.676555 -4.148376 -2.577002 -1.177291
X1173 7.801992 6.206625 4.338025 2.735242 1.235253
X1179 1.212503 0.968695 0.854449 0.736727 0.616764
X1193 -6.094789 -4.775227 -3.214174 -1.639569 -0.323363
X1203 -4.274063 -2.400150 -1.116151 -0.137988 -0.000000
X1206 1.790304 1.392953 1.051212 0.816235 0.499708
X1208 4.107494 3.316898 2.327103 1.280175 0.465704
X1219 -1.244601 -0.911206 -1.015590 -1.178560 -1.090303
X1232 -0.493502 -0.138169 0.000000 0.021508 0.080950
X1264 -3.258595 -2.242385 -1.352558 -0.714238 -0.103740
X1272 6.584962 3.741728 1.846926 0.287312 -0.000000
X1292 4.623105 2.029130 0.645845 0.000000 -0.000000
X1297 8.719397 2.840941 0.280463 0.000000 0.000000
X1329 11.802248 8.548955 5.633816 3.022931 0.706893
X1351 2.802728 1.701263 0.793606 0.000000 -0.081493
X1362 3.868178 3.190280 2.289450 1.506066 0.582151
X1416 -1.948134 -1.697054 -1.417397 -1.054065 -0.862737
X1417 -0.923660 -1.350689 -1.247298 -0.855488 -0.535160
X1418 -0.726373 -0.470412 -0.305559 -0.031378 -0.031960
X1430 12.056850 9.097448 7.165911 5.381811 3.670749
X1444 0.971105 0.933859 0.851519 0.579380 0.000000
X1470 -10.656927 -7.013554 -4.483356 -2.694715 -0.960104
X1506 -3.580033 -2.234497 -1.437787 -0.749208 -0.397086
X1514 -2.879311 -2.371968 -2.073026 -1.878219 -1.737383
X1529 -0.556923 -0.524987 -0.509621 -0.670207 -0.730044
X1553 0.126864 -0.952337 -0.286741 -0.199374 -0.069218
X1563 -4.857389 -3.388552 -2.341469 -1.586088 -0.925848
X1574 7.214710 1.894890 0.000000 -0.000000 -0.000000
X1595 -23.334004 -2.439731 -0.791844 -0.426136 -0.141519
X1597 1.487872 0.648474 0.000000 -0.213069 -0.250520
X1609 2.157559 1.953701 1.663401 1.253879 0.854298
X1616 13.586829 -0.048237 -0.309621 -0.442962 -0.382420
X1637 -1.837732 -0.930020 -0.362243 -0.000000 -0.000000
X1656 -1.881983 -1.835605 -1.779952 -1.448525 -0.985379
X1657 -1.060942 -0.306691 -0.000000 -0.000000 -0.000000
X1683 1.749397 1.299685 0.837095 0.398304 0.000000

Can we trust the non-zero slope interpretations in our chosen elastic net model?

Ideally, we'd be able to trust the interpretations of the non-zero slopes in the elastic net model that we chose, which also happened to have the best test R^2.

Because we set $\alpha$ to be high, our results more closely resembled a LASSO model solution, so we'd also want to inspect whether the remaining explanatory variables (ie. those with non-zero slopes) are collinear. Unfortunately, the heatmap below shows that quite a few of these remaining explanatory variables are still collinear, so we should be more cautious about interpreting the slopes in this model.

# Select the explanatory variables whose en_mod_3 slopes were not zeroed out
nonzero_slope_genes = df_slopes[np.abs(df_slopes['en_mod_3'])>0.00001].index
X_train_remaining = X_train[nonzero_slope_genes]

# Visualize the pairwise correlations of these remaining explanatory variables
sns.heatmap(X_train_remaining.corr(), vmin=-1, vmax=1, cmap='RdBu')
plt.show()
[Figure: heatmap of the pairwise correlations (from -1 to 1) among the explanatory variables with non-zero en_mod_3 slopes]

On the other hand, if we had set our $\alpha$ lower, our results would more closely resemble a ridge regression solution, which inherently tends to mitigate unreliable slope interpretations caused by multicollinearity.
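
To check this directly, a quick sketch (reusing X_train and y_train, with the same $\lambda=0.05$ but a much more ridge-like mixing weight) would be to refit and count the zeroed-out slopes; we would expect far fewer exact zeros than in en_mod_3.

import numpy as np
from sklearn.linear_model import ElasticNet

# Same lambda as en_mod_3, but a much lower l1_ratio (more ridge-like mixing)
en_low_mix = ElasticNet(alpha=0.05, l1_ratio=0.1)
en_low_mix.fit(X_train, y_train)
print('Zeroed slopes:', int(np.sum(en_low_mix.coef_ == 0)))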

Conclusion and Potential Shortcomings

Goal Assessment

Goal 1: Best Predictions for New Breast Tumors

Thus, if our first and foremost goal was to select the model that we think would perform best at predicting new breast tumor sizes, then we would select the model that achieved the highest test data R^2=0.285: our elastic net model with $\lambda=0.05$ and $\alpha=0.7$. However, as we mentioned above, we cannot quite trust our slope interpretations, given the extent to which this elastic net model did not deal with the multicollinearity.

Goal 2: More Trust in Slope Interpretations

However, if our goal was to achieve a model whose slope interpretations we can trust more (in the unfortunate presence of highly multicollinear explanatory variables), then we might consider selecting one of our ridge regression models. In addition, the test data R^2=0.274 of our best ridge regression model was not much lower than that of our best elastic net model.

Analysis Shortcomings

In our analysis above, we created a single training and test dataset by randomly splitting the dataset using a random state of 102.

Let's perform this same analysis again, but this time using a different random state (207) when making the split.

# Creating a new set of training and test datasets
df_train_other, df_test_other = train_test_split(df, test_size=0.1, random_state=207)
X_train_other = df_train_other.drop(['size'], axis=1)
X_test_other = df_test_other.drop(['size'], axis=1)
y_train_other = df_train_other['size']
y_test_other = df_test_other['size']

# Fitting the linear regression model
lin_reg_mod = LinearRegression()
lin_reg_mod.fit(X_train_other, y_train_other)
print('Linear Regression Test R^2:', lin_reg_mod.score(X_test_other, y_test_other))

Linear Regression Test R^2: -0.2213718065663819

Unfortunately, we see that just by choosing a different random state to split up the training and test dataset, our analysis shows us different results.

  1. All of our elastic net test data R^2 values are negative, indicating that all of our elastic net models trained on this split were poor fits for the test dataset.
  2. The best elastic net model in this case is the one with $\lambda=.1, \alpha=0.7$, because it had the highest test data R^2. Notice how this is a different $\lambda$ value than the one chosen in our first analysis.

# Fitting 4 elastic net models with the same parameters as before
en_mod = ElasticNet(alpha=.01, l1_ratio=0.7)
en_mod.fit(X_train_other, y_train_other)
print('Elastic Net - alpha=0.01 Test R^2:',en_mod.score(X_test_other, y_test_other))

en_mod = ElasticNet(alpha=.025, l1_ratio=0.7)
en_mod.fit(X_train_other, y_train_other)
print('Elastic Net - alpha=0.025 Test R^2:',en_mod.score(X_test_other, y_test_other))

en_mod = ElasticNet(alpha=.05, l1_ratio=0.7)
en_mod.fit(X_train_other, y_train_other)
print('Elastic Net - alpha=0.05 Test R^2:',en_mod.score(X_test_other, y_test_other))

en_mod = ElasticNet(alpha=.1, l1_ratio=0.7)
en_mod.fit(X_train_other, y_train_other)
print('Elastic Net - alpha=0.1 Test R^2:',en_mod.score(X_test_other, y_test_other))

Elastic Net - alpha=0.01 Test R^2: -0.14750397218825695
Elastic Net - alpha=0.025 Test R^2: -0.11125207416411964
Elastic Net - alpha=0.05 Test R^2: -0.0856548712042815
Elastic Net - alpha=0.1 Test R^2: -0.04788700566442472

This highlights the fact that selecting a single training and test dataset based on a single random state, and basing your model selection decisions off of it, can have consequences: doing so introduces an element of randomness into your model selections. Notice how we get two different sets of slopes below; in particular, different sets of slopes are set to 0.

Based on the differing results that we observed above, you should be more skeptical that your final selected elastic net model from 8.4 will indeed be the best when it comes to predicting new breast tumor sizes.

# en_mod currently holds the last model fit above (lambda=0.1), the best one on the new split
df_slopes = pd.DataFrame({'best_en_mod_old_rs': en_mod_3.coef_.T,
                          'best_en_mod_new_rs': en_mod.coef_.T}, index=X_train.columns)
df_slopes

best_en_mod_old_rs best_en_mod_new_rs
X159 1.090770 0.896141
X960 4.691085 3.026627
X980 -2.278063 -0.695596
X986 -1.337220 -0.654839
X1023 -2.449317 -1.772806
X1028 2.878624 3.673288
X1064 2.235952 2.793303
X1092 -0.185855 0.180124
X1103 -0.000000 0.079758
X1109 -2.365472 -1.569070
X1124 -0.000000 -0.845917
X1136 -0.000000 -0.252004
X1141 -3.427463 -2.542786
X1144 0.000000 -0.149432
X1169 -2.577002 -0.711118
X1173 2.735242 1.602873
X1179 0.736727 0.231466
X1193 -1.639569 -0.143467
X1203 -0.137988 -0.003595
X1206 0.816235 0.000000
X1208 1.280175 0.469925
X1219 -1.178560 -0.860819
X1232 0.021508 -0.000000
X1264 -0.714238 -0.000000
X1272 0.287312 -0.000000
X1292 0.000000 0.447804
X1297 0.000000 -0.698379
X1329 3.022931 1.575060
X1351 0.000000 -0.411561
X1362 1.506066 0.897085
X1416 -1.054065 -0.224134
X1417 -0.855488 -0.690652
X1418 -0.031378 -0.000000
X1430 5.381811 2.728960
X1444 0.579380 0.588187
X1470 -2.694715 -1.334511
X1506 -0.749208 -0.641218
X1514 -1.878219 -0.146984
X1529 -0.670207 -0.210893
X1553 -0.199374 -0.434005
X1563 -1.586088 -0.447769
X1574 -0.000000 -0.000000
X1595 -0.426136 -0.000000
X1597 -0.213069 0.696042
X1609 1.253879 1.034474
X1616 -0.442962 -0.000000
X1637 -0.000000 0.000000
X1656 -1.448525 -1.876078
X1657 -0.000000 0.863375
X1683 0.398304 -0.270280
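
One way to probe this sensitivity further is to simply repeat the split/fit/score steps over several random states and watch how much the test data R^2 of our chosen $\lambda=0.05$, $\alpha=0.7$ elastic net model moves around. The sketch below uses the two random states from above plus a few arbitrary additional ones.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet

# Rescore the lambda=0.05, alpha=0.7 elastic net under several train/test splits
# (102 and 207 are the random states used above; the others are arbitrary)
for rs in [102, 207, 311, 415]:
    df_tr, df_te = train_test_split(df, test_size=0.1, random_state=rs)
    X_tr, y_tr = df_tr.drop(['size'], axis=1), df_tr['size']
    X_te, y_te = df_te.drop(['size'], axis=1), df_te['size']
    en = ElasticNet(alpha=0.05, l1_ratio=0.7).fit(X_tr, y_tr)
    print('random_state =', rs, ' test R^2 =', round(en.score(X_te, y_te), 3))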