Feature Selection for Logistic Regression
Introduction
Multiple Candidate Models
Just like with a linear regression model, if we have a pool of $p$ potential explanatory variables that we could use in a logistic regression model, then we can create $2^p$ possible logistic regression models based on a subset of these $p$ explanatory variables, not including interaction terms.
Feature Selection Types
Just like with a linear regression model, we can employ feature selection methods that help us do one of the following.
- Efficiently determine which explanatory variables should be included in (or left out of) the model. Like with linear regression models, these include the following types of techniques.
- Backwards Elimination Algorithms
- Forward Selection Algorithms
- Model Regularization
- Determine how to better represent the full features matrix $X$ (ie. all of the possible $p$ explanatory variables) in a different way that still preserves important aspects of $X$. Like with a linear regression model, this includes the following type of technique:
- Principal Component Regression
Why use feature selection?
Just like with linear regression models, these techniques can be useful in reducing the likelihood of model overfitting. They can also help deal with multicollinearity issues.
Measuring Parsimoniousness
Linear Regression Parsimoniousness
When exploring the issue of overfitting with a linear regression model, we introduced the adjusted R^2 metric to help us measure the parsimoniousness of the linear regression model.
$
\text{Adjusted } R^2 = 1 - \frac{SSE}{SST}\cdot\frac{n-1}{n-k-1}
$
This adjusted R^2 measured the extent to which the linear regression was "striking a good balance" between:
- having a good enough training data model fit (ie. low SSE) while
- having a low enough number of $k$ explanatory or indicator variables (ie. $k$ slopes) in the model.
By selecting the linear regression model with the best (ie. highest) adjusted $R^2$, which strikes this ideal balance, we might infer that this model will have the best predictive performance with new/test datasets.
Note that this will not always be the case! In practice, we may find that another one of our $2^p$ potential models had better predictive performance with new datasets or a test dataset. However, metrics that measure parsimoniousness can provide a useful heuristic tool that can help point us in the direction of a model that is less likely to overfit than others. In addition, by using a parsimoniousness metric we can get a good idea about which models are less likely to overfit to the data that trained the model, without needing to split your given dataset into training and test datasets.
Logistic Regression Parsimoniousness
Unfortunately, we discussed earlier in this module that the $SSE$ is less meaningful when it comes to measuring error in a logistic regression model. Thus, the adjusted $R^2$ is not going to be as useful of a metric for measuring the parsimoniousness of a logistic regression model.
Thus, in section 11.2 below we'll discuss two metrics, the AIC and the BIC, which can measure the parsimoniousness of a logistic regression model.
Measuring Logistic Regression Parsimoniousness
Below, we introduce two metrics AIC and BIC which can measure the parsimoniousness of a logistic regression model.
AIC
$
\text{AIC} = -2\cdot LLF + 2k
$
In the equation above, $k$ is the number of slopes (ie. explanatory and indicator variables) in the model and $LLF$ is the optimal log-likelihood function value of the model.
We interpret the AIC score of a given model as follows.
The lower the AIC score of a model is, the more parsimonious the model is considered to be.
To see why this AIC metric measures model parsimoniousness, let's first consider the "ideal" model that is "perfectly parsimonious".
Ideal Parsimonious Model
This will be a model in which the following are BOTH the case.
- The lowest number of explanatory (and indicator) variables is used (ie. $k=0$). In this case the second term in the AIC equation would be $2k=0$.
- The model has PERFECT training data fit (ie. the LLF, which is always $\leq0$, is EXACTLY $=0$). In this case, the first term in the AIC equation would be $-2LLF=0$.
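Putting these two pieces together, such an ideal model would have $\text{AIC} = -2(0) + 2(0) = 0$, which (given that $LLF\leq 0$ and $k\geq 0$) is the lowest AIC a logistic regression model could achieve.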
Comparing Tradeoffs
However, this ideal parsimonious model most likely does not exist. Pursuing one of these goals comes at the expense of the other.
- By decreasing the number of slopes $k$ (thus decreasing the second term $2k$), the model fit will most likely get worse (it will never get better). Thus, the $LLF$ will become more negative and thus the first term $-2LLF$ will increase.
- By increasing the number of slopes $k$ (thus increasing the second term $2k$), the model fit will most likely get better (it will never get worse). Thus, the $LLF$ will become less negative and thus the first term $-2LLF$ will decrease.
Which effect "wins" in the AIC equation?
Given that decreasing the number of slopes in a model most likely worsens (and never improves) the fit (ie. the $LLF$ becomes more negative), what the AIC equation ultimately measures is how much worse the fit gets when we remove a given set of explanatory/indicator variables. Or in other words, is this worsening of the fit "acceptable" in comparison to the number of slopes that were deleted (which helps decrease the likelihood of overfitting)?
When removing a set of explanatory/indicator variables from the model we're faced with two potential scenarios.
Scenario #1: Removed Variables Brought "Enough" Predictive Power
If removing these variables dramatically worsened the training data fit, then the $LLF$ would become much more negative, and the $-2LLF$ term would become much larger and positive: large enough to counterbalance the decrease in the $2k$ term. Thus, the AIC would increase overall. What this suggests is that this set of explanatory variables:
- did bring "enough" predictive power to the model,
- they were (all together) not leading to overfitting,
- and that by removing them we may get worse test/new dataset performance.
Scenario #2: Removed Variables Didn't Bring "Enough" Predictive Power
If removing these variables only slightly worsened the training data fit, then the $LLF$ would become slightly more negative, and the $-2LLF$ term would become slightly larger and positive: but NOT large enough to counterbalance the decrease in the $2k$ term. Thus, the AIC would decrease overall. What this suggests is that this set of explanatory variables:
- did NOT bring "enough" predictive power to the model,
- they were (all together) leading to overfitting,
- and that by removing them we may get better test/new dataset performance.
Comparing AIC Scores
For instance, let's compare the AIC scores of two of our models of interest.
Full Model
Predicts account_type
with:
has_a_profile_pic
number_of_words_in_name
num_characters_in_bio
number_of_posts
number_of_followers
number_of_follows
Reduced Model 2
Predicts account_type
with:
has_a_profile_pic
number_of_words_in_name
num_characters_in_bio
number_of_followers
number_of_follows
We'll refit both of these models with our new training dataset.
log_mod_full = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio+number_of_posts+number_of_followers+number_of_follows', data=df_train).fit()
log_mod_red2 = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio+number_of_followers+number_of_follows', data=df_train).fit()
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.076935
Iterations: 35
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.077070
Iterations: 35
We can calculate the training data AIC for each of these models by using the .aic attribute.
print('Full Model AIC:', log_mod_full.aic)
print('Reduced Model 2 AIC:', log_mod_red2.aic)
Full Model AIC: 26.92512366178778
Reduced Model 2 AIC: 24.947753200240037
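If you'd like to see where this number comes from, here is a quick sketch (not from the original notes) that reproduces the .aic attribute by hand from the fitted model's optimal log-likelihood. Note that statsmodels counts the intercept along with the $k$ slopes in the penalty term, which is why the code below uses df_model + 1.
#Number of estimated parameters: the k slopes plus the intercept
num_params = log_mod_full.df_model + 1
#AIC = -2*LLF + 2*(number of estimated parameters)
manual_aic = -2*log_mod_full.llf + 2*num_params
print('Manual AIC:   ', manual_aic)
print('Attribute AIC:', log_mod_full.aic)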
Because the AIC score of reduced model 2 was lower than the AIC of the full model, the AIC score is suggesting that the reduced model 2 is more parsimonious than the full model.
Reduced model 2 deleted number_of_posts from the full model. Because the model is suggested to be more parsimonious by deleting number_of_posts, what this suggests is that number_of_posts does NOT bring enough predictive power to the model. Therefore, this suggests that we may be overfitting and that the reduced model 2 may yield better predictions when it comes to new/test datasets.
AIC Suggestion Corroboration
Because we created a test dataset that acts as a representative for all new datasets, we can test this assumption on our test dataset.
Using our new training and test dataset that we recreated at the very end of section 10, let's recompare the following two ROC curves.
Test Data ROC Curve for the Full Model
#FULL MODEL test data predictive probabilities
df_test['predictive_prob'] = log_mod_full.predict(df_test)
#FULL MODEL TEST DATA ROC Curve
fprs, tprs, thresholds = roc_curve(y_true=df_test['y'],
y_score=df_test['predictive_prob'])
auc = roc_auc_score(y_true=df_test['y'],
y_score=df_test['predictive_prob'])
plot_roc(fprs, tprs, auc)

Test Data ROC Curve for Reduced Model 2
#REDUCED MODEL 2 test data predictive probabilities
df_test['predictive_prob'] = log_mod_red2.predict(df_test)
#REDUCED MODEL 2 test data ROC Curve
fprs, tprs, thresholds = roc_curve(y_true=df_test['y'],
y_score=df_test['predictive_prob'])
auc = roc_auc_score(y_true=df_test['y'],
y_score=df_test['predictive_prob'])
plot_roc(fprs, tprs, auc)

Notice how when we use these new recreated training and test datasets, our test dataset ROC curve is now slightly better for reduced model 2 than it is for the full model. The AUC of reduced model 2 (0.926) is slightly higher than it was for the full model (0.917). This suggests that reduced model 2 will create a classification that gets slightly closer to the ideal scenario of $(fpr,tpr)=(0,1)$.
This showcases two things to us.
- Our test datasets corroborate what the AIC score was suggesting. Because the AIC score of reduced model 2 was lower than that of the full model, the AIC was suggesting that reduced model 2 may perform better on test/new datasets. At the very least, we see that this was the case for our test dataset.
- Different training and test datasets can produce different analysis results. Notice how in 10.9.4, when we used a different training and test dataset, the full model actually had a better test data AUC than reduced model 2. It's important to keep this model variability in mind when using a single training and test dataset. Creating multiple sets of training and test datasets in cross-validation can help mitigate the variability of these varying analysis results.
BIC
Rather than using the AIC to measure the parsimoniousness of a logistic regression model, we can instead use a similar metric known as the BIC.
$
\text{BIC} = -2\cdot LLF + \ln(n)\cdot k
$
In the equation above, $k$ is the number of slopes (ie. explanatory and indicator variables) in the model, $LLF$ is the optimal log-likelihood function value of the model, and $n$ is the number of observations in the training dataset.
We similarly interpret the BIC score of a given model as follows.
The lower the BIC score of a model is, the more parsimonious the model is considered to be.
Comparing BIC Scores
Now let's compare the BIC scores of two of our models of interest.
Full Model
Predicts account_type
with:
has_a_profile_pic
number_of_words_in_name
num_characters_in_bio
number_of_posts
number_of_followers
number_of_follows
Reduced Model 1
Predicts account_type
with:
has_a_profile_pic
number_of_words_in_name
num_characters_in_bio
number_of_posts
number_of_followers
We can similarly calculate the BIC score of a given model by using the .bic attribute.
log_mod_full = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio+number_of_posts+number_of_followers+number_of_follows', data=df_train).fit()
log_mod_red1 = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio+number_of_posts+number_of_followers', data=df_train).fit()
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.076935
Iterations: 35
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.170894
Iterations: 35
print('Full Model BIC:', log_mod_full.bic)
print('Reduced Model 1 BIC:', log_mod_red1.bic)
Full Model BIC: 43.94084125369097
Reduced Model 1 BIC: 55.2951380306892
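As with the AIC, here is a quick sketch (again, not from the original notes, and assuming numpy is imported as np as it is later in this section) that reproduces the .bic attribute by hand; statsmodels again counts the intercept along with the $k$ slopes.
import numpy as np
#BIC = -2*LLF + ln(n)*(number of estimated parameters)
n = log_mod_full.nobs
num_params = log_mod_full.df_model + 1
manual_bic = -2*log_mod_full.llf + np.log(n)*num_params
print('Manual BIC:   ', manual_bic)
print('Attribute BIC:', log_mod_full.bic)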
Because the BIC score of full model was lower than the BIC of reduced model 1, the BIC score is suggesting that the full model is more parsimonious than reduced model 1.
Reduced model 1 deleted number_of_follows from the full model. Because the model is suggested to be more parsimonious by including number_of_follows, what this suggests is that number_of_follows DOES bring enough predictive power to the model. Therefore, this suggests that we may NOT be overfitting by including number_of_follows and that the reduced model 1 may yield worse predictions when it comes to new/test datasets.
BIC Suggestion Corroboration
Similarly, because we created a test dataset that acts as a representative for all new datasets, we can test this assumption on our test dataset.
Using our new training and test dataset that we recreated at the very end of section 10, let's compare the following two ROC curves.
Test Data ROC Curve for Full Model
#FULL MODEL test data predictive probabilities
df_test['predictive_prob'] = log_mod_full.predict(df_test)
#FULL MODEL TEST DATA ROC Curve
fprs, tprs, thresholds = roc_curve(y_true=df_test['y'],
y_score=df_test['predictive_prob'])
auc = roc_auc_score(y_true=df_test['y'],
y_score=df_test['predictive_prob'])
plot_roc(fprs, tprs, auc)

Test Data ROC Curve for Reduced Model 1
#REDUCED MODEL 1 test data predictive probabilities
df_test['predictive_prob'] = log_mod_red1.predict(df_test)
#REDUCED MODEL 1 test data ROC Curve
fprs, tprs, thresholds = roc_curve(y_true=df_test['y'],
y_score=df_test['predictive_prob'])
auc = roc_auc_score(y_true=df_test['y'],
y_score=df_test['predictive_prob'])
plot_roc(fprs, tprs, auc)

Notice how the test data ROC curve for the full model is better than it is for reduced model 1. The AUC of reduced model 1 (0.898) is lower than it was for the full model (0.917). This suggests that the full model will create a classification that gets slightly closer to the ideal scenario of $(fpr,tpr)=(0,1)$.
Thus, our test datasets corroborate what the BIC score was suggesting. Because the BIC score of reduced model 1 was higher than the full model, what the BIC was suggesting was that the reduced model 1 may perform worse on the test/new datasets. At the very least, we see that this was the case for the test dataset.
Note: This nice corroboration will not always be the case for AIC or BIC suggestions!
AIC vs. BIC
The equations that calculate the AIC and the BIC of a given model are very similar to each other.
The only difference is the "penalty term" in the equation: "2" for AIC vs. "$ln(n)$" for BIC. This penalty term penalizes a given model for having a high amount of slopes.
Let's consider when it's the case that the penalty term for BIC is larger than it is for AIC. Since $\ln(n) > 2$ exactly when $n > e^2 \approx 7.39$, the BIC penalty is larger whenever $n\geq 8$.
What this means is that when your training dataset size is at least 8, the BIC score will penalize the number of slopes in a model more than the AIC score will. Therefore, the BIC score is going to encourage you to choose models with fewer slopes than the AIC score will, if you are looking for a model that is deemed the "most parsimonious".
Downsides of BIC
Unfortunately, because the BIC score will encourage you more to select models that have a smaller number of slopes, this may come at the expense of training dataset fit, causing you to select a model that has a worse training data fit (ie. an LLF that is more negative).
Downsides of AIC
On the other hand, because the AIC score doesn't penalize a high number of slopes as much as the BIC score does, the AIC can sometimes be unhelpful for the purpose of model selection. Specifically, in some modeling situations, no matter how many explanatory variables you add to the model, the AIC score will keep decreasing. This runs counter to the concept of parsimoniousness: in theory, there should be some "maximum" number of explanatory variables that you can add to a model before the parsimoniousness of the model starts to get worse (ie. the AIC starts increasing). Or in other words, if you keep adding explanatory variables, eventually you will start to overfit.
If you observe a situation like this, you might consider using the BIC score instead, which is less likely to exhibit this pattern.
Backwards Elimination Algorithm
As we discussed in Module 9, one type of heuristic feature selection technique that is often used is what we call a backwards elimination algorithm.
We can use this type of algorithm to help us select a good combination of explanatory variables to include in a model that optimizes some metric that we are interested in.
For instance, we use the algorithm below in an attempt to find the logistic regression model that minimizes the AIC. However, you can use this same algorithm template to select a model that optimizes some other metric as well.
Backwards Elimination - Attempting to Minimize the AIC
Goal: Attempt to find the model with the lowest AIC
Steps:
1. Fit a “current model” and find the AIC of this model. In the beginning, your “current model” should include all possible explanatory variables you are considering.
2. For each explanatory variable in your “current model” do the following:
   - Fit a “test model”. Your “test model” should include every explanatory variable in the “current model” except for the explanatory variable you are considering.
   - Find the AIC of this “test model”.
3. If NONE of the “test models” from step (2) had an AIC that was lower than the AIC for the “current model”, then STOP THE ALGORITHM, and return the “current model” as your “final model”.
4. Otherwise, choose the “test model” from step (2) that had the lowest AIC, and set your new “current model” to be this “test model”. Then go back to step (2).
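To make these steps concrete, here is a minimal sketch (not part of the original analysis; the helper name backwards_elimination_aic is ours) of how the loop above could be automated with statsmodels, assuming smf and df_train are available as they are throughout this section.
def backwards_elimination_aic(data, response, candidates):
    #Step 1: start with all candidate explanatory variables in the current model
    current_vars = list(candidates)
    current_aic = smf.logit(response + '~' + '+'.join(current_vars), data=data).fit(disp=False).aic
    while True:
        best_aic, best_drop = current_aic, None
        #Step 2: try removing each variable from the current model, one at a time
        for var in current_vars:
            test_vars = [v for v in current_vars if v != var]
            formula = response + '~' + ('+'.join(test_vars) if test_vars else '1')
            test_aic = smf.logit(formula, data=data).fit(disp=False).aic
            if test_aic < best_aic:
                best_aic, best_drop = test_aic, var
        #Step 3: stop if no test model lowered the AIC
        if best_drop is None:
            return current_vars, current_aic
        #Step 4: otherwise permanently remove the best variable to drop and repeat
        current_vars.remove(best_drop)
        current_aic = best_aic
For example, calling backwards_elimination_aic(df_train, 'y', ['has_a_profile_pic', 'number_of_words_in_name', 'num_characters_in_bio', 'number_of_posts', 'number_of_followers', 'number_of_follows']) should walk through the same iterations that we carry out by hand below.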
Using a Backwards Elimination Algorithm
So let's use a backwards elimination algorithm to try to find the logistic regression model (not including interaction terms) that has the lowest training data AIC when it comes to predicting the probability that account_type=real with:
has_a_profile_pic
number_of_words_in_name
num_characters_in_bio
number_of_posts
number_of_followers
number_of_follows
Iteration 1
1.1 Fit the Current Model
Let's fit the current model using all 6 possible explanatory variables and find the AIC.
current_mod = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio+number_of_posts+number_of_followers+number_of_follows', data=df_train).fit(disp=False)
current_mod.aic
26.92512366178778
1.2 Try out 6 Test Models
Next, one at a time, let's remove each of our 6 explanatory variables that exist in the current model and find the AIC scores of each of the 6 resulting test models.
#Removes has_a_profile_pic
test_mod = smf.logit(formula='y~number_of_words_in_name+num_characters_in_bio+number_of_posts+number_of_followers+number_of_follows', data=df_train).fit(disp=False)
test_mod.aic
35.25836637577091
#Removes number_of_words_in_name
test_mod = smf.logit(formula='y~has_a_profile_pic+num_characters_in_bio+number_of_posts+number_of_followers+number_of_follows', data=df_train).fit(disp=False)
test_mod.aic
31.293684776329446
#Removes num_characters_in_bio
test_mod = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+number_of_posts+number_of_followers+number_of_follows', data=df_train).fit(disp=False)
test_mod.aic
34.5677437513574
#Removes number_of_posts
test_mod = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio+number_of_followers+number_of_follows', data=df_train).fit(disp=False)
test_mod.aic
24.947753200240037
#Removes number_of_followers
test_mod = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio+number_of_posts+number_of_follows', data=df_train).fit(disp=False)
test_mod.aic
39.8885104499709
#Removes number_of_follows
test_mod = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio+number_of_posts+number_of_followers', data=df_train).fit(disp=False)
test_mod.aic
40.71023723762932
1.3 Continuing the Algorithm
Because there was a test model with an AIC (24.95) that was lower than the AIC of the current model (26.93), then we will continue the algorithm.
The test model that deleted number_of_posts produced the lowest AIC, thus we will permanently remove number_of_posts from the current model and begin the next iteration.
Iteration 2
2.1 Fit the Current Model
Let's fit the current model, which now excludes number_of_posts.
current_mod = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio+number_of_followers+number_of_follows', data=df_train).fit(disp=False)
current_mod.aic
24.947753200240037
2.2 Try out 5 Test Models
Next, one at a time, let's remove each of our 5 remaining explanatory variables that exist in the current model and find the AIC scores of each of the 5 resulting test models.
#Removes has_a_profile_pic
test_mod = smf.logit(formula='y~number_of_words_in_name+num_characters_in_bio+number_of_followers+number_of_follows', data=df_train).fit(disp=False)
test_mod.aic
35.25836637577091
#Removes number_of_words_in_name
test_mod = smf.logit(formula='y~has_a_profile_pic+num_characters_in_bio+number_of_followers+number_of_follows', data=df_train).fit(disp=False)
test_mod.aic
29.349736921080027
#Removes num_characters_in_bio
test_mod = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+number_of_followers+number_of_follows', data=df_train).fit(disp=False)
test_mod.aic
37.68326753071482
#Removes number_of_followers
test_mod = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio+number_of_follows', data=df_train).fit(disp=False)
test_mod.aic
37.93165900662025
#Removes number_of_follows
test_mod = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio+number_of_followers', data=df_train).fit(disp=False)
test_mod.aic
38.71983336630902
2.3 Stopping the Algorithm
Because there WAS NOT a test model with an AIC that was lower than the AIC of the current model (24.95), we will stop the algorithm and stick with the current model, which just deleted number_of_posts.
Thus, our final model selected by this algorithm is shown below.
Backwards Elimination Final Model
$
\hat{p} = \frac{1}{1 + \exp\left(-\begin{aligned}
&-43.0899 \\
&+ 34.8449\,(\text{has a profile pic}[T.yes]) \\
&+ 3.4254\,(\text{number of words in name}) \\
&+ 0.1212\,(\text{num characters in bio}) \\
&+ 0.0246\,(\text{number of followers}) \\
&- 0.0065\,(\text{number of follows})
\end{aligned}\right)}
$
final_mod = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio+number_of_followers+number_of_follows', data=df_train).fit(disp=False)
final_mod.summary()
Dep. Variable: | y | No. Observations: | 84 |
---|---|---|---|
Model: | Logit | Df Residuals: | 78 |
Method: | MLE | Df Model: | 5 |
Date: | Fri, 28 Jul 2023 | Pseudo R-squ.: | 0.8881 |
Time: | 19:14:45 | Log-Likelihood: | -6.4739 |
converged: | False | LL-Null: | -57.843 |
Covariance Type: | nonrobust | LLR p-value: | 1.399e-20 |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -43.0899 | 3.06e+05 | -0.000 | 1.000 | -6e+05 | 6e+05 |
has_a_profile_pic[T.yes] | 34.8449 | 3.06e+05 | 0.000 | 1.000 | -6e+05 | 6e+05 |
number_of_words_in_name | 3.4254 | 1.647 | 2.080 | 0.038 | 0.197 | 6.653 |
num_characters_in_bio | 0.1212 | 0.063 | 1.923 | 0.055 | -0.002 | 0.245 |
number_of_followers | 0.0246 | 0.010 | 2.547 | 0.011 | 0.006 | 0.043 |
number_of_follows | -0.0065 | 0.003 | -2.141 | 0.032 | -0.012 | -0.001 |
Possibly complete quasi-separation: A fraction 0.50 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
So what this algorithm suggests is that the number_of_posts explanatory variable may not bring enough predictive power to the model and thus may make the model more likely to overfit, because the AIC decreased by removing it. Thus, we may get better test/new dataset results by removing it.
However, because the AIC score did not decrease any further in any of our test models after removing number_of_posts, this may suggest that all of the remaining 5 explanatory variables bring enough predictive power to the model and are not leading to overfitting. Thus, we may NOT get better test/new dataset results by removing any of these variables.
Pros and Cons of Backwards Elimination Algorithms
However, we do not know this for sure! As with any metric we try to optimize with a backwards elimination algorithm, we want to remember that this algorithm is not guaranteed to find the model with the ABSOLUTE lowest AIC score. So it is possible there exists another logistic regression model, removing two or more explanatory variables, that actually has an even lower AIC score.
However, this algorithm is considered efficient and tends to find a model with a "low enough" AIC score.
Forward Selection Algorithm
Also, as we discussed in Module 9, another type of heuristic feature selection technique that is often used is what we call a forward selection algorithm.
For instance, we use the forward selection algorithm below in an attempt to find the logistic regression model that minimizes the AIC. However, you can use this same algorithm template to select a model that optimizes some other metric as well.
Forward Selection - Attempting to Minimize the AIC
Goal: Attempt to find the model with the lowest AIC
Steps:
1. Fit a “current model” and find the AIC of this model. In the beginning, your “current model” should include NONE OF the possible explanatory variables you are considering.
2. For each explanatory variable NOT YET in your “current model” do the following:
   - Fit a “test model”. Your “test model” should include every explanatory variable in the “current model” in addition to the explanatory variable you are considering.
   - Find the AIC of this “test model”.
3. If NONE of the “test models” from step (2) had an AIC that was lower than the AIC for the “current model”, then STOP THE ALGORITHM, and return the “current model” as your “final model”.
4. Otherwise, choose the “test model” from step (2) that had the lowest AIC, and set your new “current model” to be this “test model”. Then go back to step (2).
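Analogously, here is a minimal sketch of the forward selection loop above (again, the helper name is ours and this is not part of the original analysis), assuming smf and df_train as before.
def forward_selection_aic(data, response, candidates):
    #Step 1: start with the intercept-only model
    current_vars = []
    current_aic = smf.logit(response + '~1', data=data).fit(disp=False).aic
    remaining = list(candidates)
    while remaining:
        best_aic, best_add = current_aic, None
        #Step 2: try adding each remaining variable, one at a time
        for var in remaining:
            formula = response + '~' + '+'.join(current_vars + [var])
            test_aic = smf.logit(formula, data=data).fit(disp=False).aic
            if test_aic < best_aic:
                best_aic, best_add = test_aic, var
        #Step 3: stop if no addition lowered the AIC
        if best_add is None:
            break
        #Step 4: otherwise permanently add the best variable and repeat
        current_vars.append(best_add)
        remaining.remove(best_add)
        current_aic = best_aic
    return current_vars, current_aic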
Regularized Logistic Regression
Similar to the regularization techniques that we used for linear regression models, we can also use regularization with logistic regression models to help us identify which explanatory variables may lead to overfitting and/or don't bring enough "predictive power" to the model, and should therefore be left out. We can also use these techniques to deal with multicollinearity among our explanatory variables.
Modifying the Objective Function with a Penalty Term
Similar to how regularization is utilized in a linear regression model, the general goal of model regularization is to find the optimal intercept and slope values that minimize the following type of objective function.
$
f(\text{original model objective function}) + \lambda\cdot(\text{penalty term})
$
By $f(\text{original model objective function})$, we mean some function of the original model objective function.
Linear Regression Regularization
For instance, when introducing linear regression regularization, we recall that the original (non-regularized) model objective function that we are trying to minimize is the sum of squared errors $SSE=\sum_{i=1}^n(y_i-\hat{y}_i)^2$.
When it comes to linear regression regularization, we can actually use the original objective function SSE directly. Or in other words, $f(SSE)=SSE$. Below is the type of optimization problem we aim to solve in a linear regression regularization model.
$
\min\;\; SSE + \lambda\cdot(\text{penalty term})
$
Furthermore, when we introduced linear regression regularization in Module 9, we introduced three types of penalty terms that can be used.
- L1 Norm:$|\hat{\beta}_1|+...+|\hat{\beta}_p|$(ie. LASSO Regression)
- L2 Norm: $\hat{\beta}_1^2+...+\hat{\beta}_p^2$(ie. Ridge Regression)
- Combination of the L1 and L2 Norm: $\left[\alpha\left[|\hat{\beta}_1|+...+|\hat{\beta}_{p}|\right]+ \frac{1-\alpha}{2} \left[\hat{\beta}_1^2+...+\hat{\beta}_{p}^2\right]\right]$(ie. Elastic Net Regression)
Regularized Logistic Regression Optimization Problem
When we fit a nonregularized logistic regression model, the original objective function that we maximize is the log-likelihood function $LLF$.
However, the fact that we are maximizing the objective function of a logistic regression model introduces an additional complication. Take note of how the general model regularization optimization problem, restated below, specifically needs to be a minimization.
$
\min\;\; f(\text{original model objective function}) + \lambda\cdot(\text{penalty term})
$
Thus, in order for the general model regularization optimization problem to work for logistic regression, we need to create an $f(LLF)$ that finds the same optimal solutions that maximizing $LLF$ does, but by minimizing.
We get around this by selecting $f(LLF)=-2\cdot LLF$. Or in other words, we use the following optimization problem, which is equivalent.
$
\max\;\; LLF \quad\Longleftrightarrow\quad \min\;\; -2\cdot LLF
$
Or in other words, the optimal intercept and slopes that are chosen for these two formulations of the non-regularized logistic regression optimization problem will be the same.
Therefore, we formulate our logistic regression regularization as follows.
$
\min\;\; -2\cdot LLF + \lambda\cdot(\text{penalty term})
$
Impact of the Different Penalty Terms and $\lambda$ Value
Penalty Terms
We can similarly utilize the same three penalty terms when it comes to logistic regression regularization.
- L1 Norm:$|\hat{\beta}_1|+...+|\hat{\beta}_p|$(ie. LASSO Regression)
- L2 Norm: $\hat{\beta}_1^2+...+\hat{\beta}_p^2$(ie. Ridge Regression)
- Combination of the L1 and L2 Norm: $\left[\alpha\left[|\hat{\beta}_1|+...+|\hat{\beta}_{p}|\right]+ \frac{1-\alpha}{2} \left[\hat{\beta}_1^2+...+\hat{\beta}_{p}^2\right]\right]$(ie. Elastic Net Regression)
Impact of $\lambda$
Just like in a regularized linear regression model, each of these penalty terms listed above, in their own way, discourages the logistic regression model from having high magnitude slope values at the expense of the fit of the model.
Similarly, the higher we set this $\lambda$ value, the more we expect the following two things to happen.
- The penalty term will be encouraged to be smaller.
- The training dataset fit will get worse (ie. LLF will become more negative).
Which slopes are allowed to be high?
Given that increasing the $\lambda$ encourages the overall penalty term to be small, this does not necessarily mean that all of the slopes are uniformly encouraged to be small. Remember that the model is also simultaneously attempting to maintain a relatively good training dataset fit (ie. an LLF that is not too negative).
Because of this, only the slopes that correspond to explanatory/indicator variables that bring a "high enough" amount of predictive power to the model will be allowed to be high. Whereas the explanatory/indicator variables that do not bring a "high enough" amount of predictive power to the model will be encouraged to be "small".
Penalty Term Behavior
Also just like in a linear regression model, the way in which the slopes in a regularization model are encouraged to be "small" is dependent on the type of penalty term that you are using.
L1 Penalty Term
In a LASSO logistic regression model, these "small" slopes that correspond to explanatory/indicator variables that do not bring enough predictive power to the model will often be set EXACTLY equal to 0.
Thus, interpreting the feature selection suggestions in a LASSO model becomes easy. If an explanatory/indicator variable has a slope of 0, then the model output is suggesting that the variable should be deleted because it does not bring enough predictive power to the model and you may run the risk of overfitting by including this variable, at least according to the $\lambda$ value that you are using.
In the presence of two or more collinear explanatory variables, a LASSO model is more likely to allow for one of the slopes to be non-zero, while the other slopes are set equal to 0.
L2 Penalty Term
In a ridge logistic regression model, these "small" slopes that correspond to explanatory/indicator variables that do not bring enough predictive power to the model will NOT be set EXACTLY equal to 0.
Thus, interpreting the feature selection suggestions in a ridge regression model is not as easy. When interpreting the slopes of this model output, it's not as clear if a slope is set "small" a.) because it does not bring enough predictive power to the model, or b.) because there is just naturally a small slope relationship between this variable and the response variable in the model.
In the presence of two or more explanatory variables that are collinear with each other, a ridge regression model is more likely to give a similar slope magnitude to all of the variables that are collinear with each other. Thus, the slope interpretability issue that multicollinearity brings to a non-regularized regression model is not as prominent in the output of a ridge regression model.
L1 and L2 Combination Penalty Term
Finally, remember that an elastic net regression model penalty term is a combination of the L1 (ie. LASSO) penalty term and the L2 (ie. ridge regression) penalty term.
Thus, the resulting slopes of an elastic net model will resemble a combination of what you might expect to see from a LASSO model and a ridge regression model.
If $0\leq \alpha\leq 1$ is set higher, the model cares more about keeping the L1 penalty small. So your slopes will more so resemble what you might see in a LASSO model.
If $\alpha$ is set lower, then the model cares more about keeping the L2 penalty small. So your slopes will more so resemble what you might see in a ridge regression model.
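To see the two extremes concretely, plugging $\alpha=1$ and $\alpha=0$ into the elastic net penalty term above recovers the two individual penalties:
$
\alpha=1:\;\; |\hat{\beta}_1|+...+|\hat{\beta}_p| \qquad\qquad \alpha=0:\;\; \tfrac{1}{2}\left[\hat{\beta}_1^2+...+\hat{\beta}_p^2\right]
$
That is, $\alpha=1$ gives exactly the LASSO penalty and $\alpha=0$ gives half the ridge penalty.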
Nonregularized Logistic Regression
Let's use each of these three types of regularized logistic regression models to try to help us meet our main research goal. In order to implement logistic regression regularization, we need to use the LogisticRegression() function from Scikit-Learn.
Features Matrices and Target Arrays
Similar to the LinearRegression() function, the LogisticRegression() requires that our training and test datasets be split into their respective features matrices and target arrays.
X_train = df_train.drop(['account_type','y'], axis=1)
X_train.head()
has_a_profile_pic | number_of_words_in_name | num_characters_in_bio | number_of_posts | number_of_followers | number_of_follows | |
---|---|---|---|---|---|---|
97 | yes | 1 | 0 | 81 | 75 | 55 |
84 | no | 1 | 0 | 0 | 21 | 31 |
107 | no | 1 | 112 | 4 | 415 | 1445 |
90 | yes | 2 | 0 | 1 | 62 | 126 |
63 | no | 1 | 0 | 0 | 69 | 694 |
X_test = df_test.drop(['account_type','y', 'predictive_prob'], axis=1)
X_test.head()
has_a_profile_pic | number_of_words_in_name | num_characters_in_bio | number_of_posts | number_of_followers | number_of_follows | |
---|---|---|---|---|---|---|
39 | yes | 2 | 73 | 330 | 1572 | 3504 |
89 | no | 0 | 0 | 1 | 17 | 33 |
88 | yes | 5 | 0 | 1 | 39 | 324 |
10 | yes | 1 | 0 | 53 | 229 | 492 |
50 | yes | 3 | 0 | 9 | 62 | 47 |
Furthermore, we also need to convert any categorical explanatory variables into the required amount of 0/1 indicator variables.
X_train=pd.get_dummies(X_train, drop_first=True)
X_train.head()
number_of_words_in_name | num_characters_in_bio | number_of_posts | number_of_followers | number_of_follows | has_a_profile_pic_yes | |
---|---|---|---|---|---|---|
97 | 1 | 0 | 81 | 75 | 55 | 1 |
84 | 1 | 0 | 0 | 21 | 31 | 0 |
107 | 1 | 112 | 4 | 415 | 1445 | 0 |
90 | 2 | 0 | 1 | 62 | 126 | 1 |
63 | 1 | 0 | 0 | 69 | 694 | 0 |
X_test=pd.get_dummies(X_test, drop_first=True)
X_test.head()
number_of_words_in_name | num_characters_in_bio | number_of_posts | number_of_followers | number_of_follows | has_a_profile_pic_yes | |
---|---|---|---|---|---|---|
39 | 2 | 73 | 330 | 1572 | 3504 | 1 |
89 | 0 | 0 | 1 | 17 | 33 | 0 |
88 | 5 | 0 | 1 | 39 | 324 | 1 |
10 | 1 | 0 | 53 | 229 | 492 | 1 |
50 | 3 | 0 | 9 | 62 | 47 | 1 |
y_train=df_train['y']
y_train.head()
97 0
84 0
107 0
90 0
63 0
Name: y, dtype: int64
y_test=df_test['y']
y_test.head()
39 1
89 0
88 0
10 1
50 1
Name: y, dtype: int64
Training the Model
Before fitting any regularized models, let's first train and test a non-regularized logistic regression model for comparison purposes.
We do so by using the LogisticRegression() function with the following parameters.
- penalty: By selecting 'none' for the penalty, we are indicating that we are just using a basic (non-regularized) logistic regression model.
- solver: We will use the 'newton-cg' solver. The newton-cg algorithm is a numerical optimization algorithm that goes about finding an optimal solution to a given objective function.
- max_iter: This algorithm stops after 1000 iterations or when the algorithm has converged.
Additional Information: The 'newton-cg' solver only works for: basic logistic regression and ridge regression.
After creating our log_mod
class, we will fit it with our training features matrix and target array.
from sklearn.linear_model import LogisticRegression
log_mod = LogisticRegression(penalty='none', solver='newton-cg',
max_iter=1000)
log_mod.fit(X_train,y_train)
LogisticRegression(max_iter=1000, penalty='none', solver='newton-cg')
Slopes and Intercept
When using the LogisticRegression() function to fit a model, we similarly extract the model intercept and slopes by using the .intercept_ and .coef_ attributes.
dfcoef = pd.DataFrame(log_mod.coef_.T,
columns=['Non_Regularized'], index=X_train.columns)
dfcoef.loc['Intercept'] = log_mod.intercept_[0]
dfcoef
Non_Regularized | |
---|---|
number_of_words_in_name | 3.417944 |
num_characters_in_bio | 0.126837 |
number_of_posts | -0.004232 |
number_of_followers | 0.024701 |
number_of_follows | -0.006485 |
has_a_profile_pic_yes | 19.459786 |
Intercept | -27.693121 |
Making Predictions
To predict the predictive probabilities for a given dataset using the LogisticRegression() class, you need to use the .predict_proba() function.
log_mod.predict_proba(X_test)
array([[1.57310831e-10, 1.00000000e+00],
[1.00000000e+00, 1.14980298e-12],
[4.46129387e-04, 9.99553871e-01],
[9.29143883e-01, 7.08561168e-02],
[3.88249490e-02, 9.61175051e-01],
[2.05629106e-01, 7.94370894e-01],
[1.25955308e-01, 8.74044692e-01],
[9.84413046e-01, 1.55869540e-02],
[9.59627372e-01, 4.03726279e-02],
[7.24940860e-06, 9.99992751e-01],
[1.00000000e+00, 3.77363135e-11],
[3.99752211e-01, 6.00247789e-01],
[2.26974059e-04, 9.99773026e-01],
[1.04494191e-12, 1.00000000e+00],
[1.00000000e+00, 2.55112595e-11],
[5.13164844e-01, 4.86835156e-01],
[8.73986268e-01, 1.26013732e-01],
[9.94112056e-01, 5.88794426e-03],
[1.95417428e-04, 9.99804583e-01],
[1.09728957e-04, 9.99890271e-01],
[9.28562219e-01, 7.14377810e-02]])
The output of the .predict_proba() function actually gives us a set of two predicted probabilities for each observation in the test dataset:
- $1-\hat{p}_i$ and
- $\hat{p}_i$
We typically only want the predictive probability values $\hat{p}_i$, so we'll extract just the second column from this output.
predictive_probs_test = log_mod.predict_proba(X_test)[:,1]
predictive_probs_test
array([1.00000000e+00, 1.14980298e-12, 9.99553871e-01, 7.08561168e-02,
9.61175051e-01, 7.94370894e-01, 8.74044692e-01, 1.55869540e-02,
4.03726279e-02, 9.99992751e-01, 3.77363135e-11, 6.00247789e-01,
9.99773026e-01, 1.00000000e+00, 2.55112595e-11, 4.86835156e-01,
1.26013732e-01, 5.88794426e-03, 9.99804583e-01, 9.99890271e-01,
7.14377810e-02])
Evaluating the Model
Finally, to evaluate the non-regularized logistic regression model's classification capabilities with the test dataset, let's calculate its corresponding test ROC curve AUC.
#Full model TEST DATA ROC Curve
fprs, tprs, thresholds = roc_curve(y_true=y_test,
y_score=predictive_probs_test)
auc = roc_auc_score(y_true=y_test,
y_score=predictive_probs_test)
plot_roc(fprs, tprs, auc)

LASSO Logistic Regression
Next, let's fit a LASSO logistic regression model for a few different values of $\lambda$.
We also use the LogisticRegression() function to fit regularized logistic regression models such as LASSO as well. In this case, we'll use the following parameters.
- penalty: By selecting 'l1' for the penalty, we are fitting a LASSO (L1 penalty) logistic regression.
- solver: We will use the 'liblinear' solver. liblinear is an optimization library for fitting penalized linear models, such as L1- and L2-penalized logistic regression.
- max_iter: This algorithm stops after 1000 iterations or when the algorithm has converged.
- C: This value is set to be $\frac{1}{\lambda}$. Thus if we want our $\lambda=3$ in this model, we need to set $C=\frac{1}{\lambda}=\frac{1}{3}$.
Additional Information: The 'liblinear' solver only works for: LASSO logistic regression and logistic ridge regression.
Fitting Three LASSO Models
First, let's fit three logistic regression LASSO models using three values of $\lambda$: 0.1, 1.5, and 3.
lasso_mod_1 = LogisticRegression('l1', solver='liblinear',
max_iter=1000, C=1/.1)
lasso_mod_1.fit(X_train,y_train)
LogisticRegression(C=10.0, max_iter=1000, penalty='l1', solver='liblinear')
lasso_mod_2 = LogisticRegression('l1', solver='liblinear',
max_iter=1000, C=1/1.5)
lasso_mod_2.fit(X_train,y_train)
LogisticRegression(C=0.6666666666666666, max_iter=1000, penalty='l1', solver='liblinear')
lasso_mod_3 = LogisticRegression('l1', solver='liblinear',
max_iter=1000, C=1/3)
lasso_mod_3.fit(X_train,y_train)
LogisticRegression(C=0.3333333333333333, max_iter=1000, penalty='l1', solver='liblinear')
Inspecting the Slopes
Next, let's inspect the slopes and intercepts of these three LASSO models.
dfcoef = pd.DataFrame(
np.concatenate((log_mod.coef_.T,
lasso_mod_1.coef_.T,
lasso_mod_2.coef_.T,
lasso_mod_3.coef_.T),
axis=1),
columns=['Non_Regularized', 'LASSO_1','LASSO_2','LASSO_3'], index=X_train.columns)
dfcoef
Non_Regularized | LASSO_1 | LASSO_2 | LASSO_3 | |
---|---|---|---|---|
number_of_words_in_name | 3.417944 | 2.754981 | 0.333608 | 0.000000 |
num_characters_in_bio | 0.126837 | 0.070990 | 0.042912 | 0.043476 |
number_of_posts | -0.004232 | 0.007689 | 0.014326 | 0.013440 |
number_of_followers | 0.024701 | 0.022007 | 0.018941 | 0.016726 |
number_of_follows | -0.006485 | -0.006301 | -0.006669 | -0.006438 |
has_a_profile_pic_yes | 19.459786 | 5.661241 | 0.000000 | 0.000000 |
These slope results reflect what we might expect to see from a LASSO regression model as we increase $\lambda$. When $\lambda$ was set at the lowest value of 0.1, none of the model slopes were zeroed out. As we increased $\lambda$ to 1.5, the has_a_profile_pic_yes slope was zeroed out. Finally, by increasing $\lambda$ to 3, number_of_words_in_name was additionally zeroed out.
What these three LASSO models suggest is that has_a_profile_pic_yes brings the least amount of predictive power to the model, followed by number_of_words_in_name. Therefore, if the model were overfitting, it is more likely to be overfitting via the inclusion of has_a_profile_pic_yes, and then number_of_words_in_name, in comparison to the other variables.
Notice how this LASSO model suggestion runs contrary to what our backwards elimination algorithm was suggesting about the same training dataset. Because the backwards elimination algorithm first removed number_of_posts, due to its removal producing the lowest AIC, that would suggest number_of_posts is the variable most likely to not bring enough predictive power to the model and to lead to overfitting.
Testing the Actual Fitted LASSO Models
LASSO models 2 and 3 not only give two informed suggestions about which explanatory variables to try deleting for better test data performance, but they also provide models in and of themselves.
Let's first test and compare our three LASSO models' test ROC and AUC.
#Non-regularized regression model
predictive_probs_test = log_mod.predict_proba(X_test)[:,1]
auc_nonreg = roc_auc_score(y_true=y_test,
y_score=predictive_probs_test)
print('Non-Regularized Model Test AUC:', auc_nonreg)
#Lasso model 1
predictive_probs_test = lasso_mod_1.predict_proba(X_test)[:,1]
auc_lasso1 = roc_auc_score(y_true=y_test,
y_score=predictive_probs_test)
print('Lasso Model 1 Test AUC:', auc_lasso1)
#Lasso model 2
predictive_probs_test = lasso_mod_2.predict_proba(X_test)[:,1]
auc_lasso2 = roc_auc_score(y_true=y_test,
y_score=predictive_probs_test)
print('Lasso Model 2 Test AUC:', auc_lasso2)
#Lasso model 3
predictive_probs_test = lasso_mod_3.predict_proba(X_test)[:,1]
auc_lasso3 = roc_auc_score(y_true=y_test,
y_score=predictive_probs_test)
print('Lasso Model 3 Test AUC:', auc_lasso3)
Non-Regularized Model Test AUC: 0.9166666666666667
Lasso Model 1 Test AUC: 0.9074074074074074
Lasso Model 2 Test AUC: 0.9722222222222222
Lasso Model 3 Test AUC: 0.9537037037037037
Best Performance so Far
Terrific! Our LASSO model 2 had the best (ie. highest) test data AUC (0.972) that we've seen so far when using this corresponding training dataset to build our model. This LASSO model 2 "zeroed out" just a single variable, has_a_profile_pic_yes, from the model.
LASSO Model 2
$
\hat{p} = \frac{1}{1 + \exp\left(-\begin{aligned}
&-2.824735 \\
&+ 0.333562\,(\text{number of words in name}) \\
&+ 0.042911\,(\text{num characters in bio}) \\
&+ 0.014327\,(\text{number of posts}) \\
&+ 0.018940\,(\text{number of followers}) \\
&- 0.006669\,(\text{number of follows}) \\
&+ 0\,(\text{has a profile pic}[T.yes])
\end{aligned}\right)}
$
Remember that our backwards elimination algorithm (which used the AIC to measure parsimoniousness) suggested a different final model, one that instead just deleted number_of_posts. However, when we calculated the test AUC of that model, it was only 0.926.
Backwards Elimination Final Model
$
\hat{p} = \frac{1}{1 + \exp\left(-\begin{aligned}
&-43.0899 \\
&+ 34.8449\,(\text{has a profile pic}[T.yes]) \\
&+ 3.4254\,(\text{number of words in name}) \\
&+ 0.1212\,(\text{num characters in bio}) \\
&+ 0.0246\,(\text{number of followers}) \\
&- 0.0065\,(\text{number of follows})
\end{aligned}\right)}
$
Thus, if our only goal was to simply find the logistic regression model that produced the highest test data AUC for this particular test dataset, then we can see that our LASSO model technique was superior.
However, remember that our actual goal is to build a classifier that will yield the best predictions for new datasets. Can we fully trust that this one test dataset will act as a good representative of ALL new datasets? Again, this is a concern that cross-validation techniques can help with. But if you are unable to conduct cross-validation, given the variable nature of a single training and test dataset, it is useful to consult multiple feature selection methods and take their suggestions into account.
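As a rough sketch of what such a cross-validation check could look like here (this is an illustration, not part of the original analysis; it assumes scikit-learn's cross_val_score and reuses the X_train and y_train objects built earlier in this section), we could estimate an average test ROC AUC for, say, the $\lambda=1.5$ LASSO specification across five folds.
from sklearn.model_selection import cross_val_score
#5-fold cross-validated test ROC AUC for the lambda=1.5 (C=1/1.5) LASSO specification;
#each fold takes a turn as the held-out test set
cv_aucs = cross_val_score(
    LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000, C=1/1.5),
    X_train, y_train, cv=5, scoring='roc_auc')
print('Fold AUCs:', cv_aucs)
print('Mean AUC: ', cv_aucs.mean())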
dfcoef = pd.DataFrame(
np.concatenate((log_mod.coef_.T,
lasso_mod_1.coef_.T,
lasso_mod_2.coef_.T,
lasso_mod_3.coef_.T),
axis=1),
columns=['Non_Regularized', 'LASSO_1','LASSO_2','LASSO_3'], index=X_train.columns)
dfcoef
Non_Regularized | LASSO_1 | LASSO_2 | LASSO_3 | |
---|---|---|---|---|
number_of_words_in_name | 3.417944 | 2.754981 | 0.333608 | 0.000000 |
num_characters_in_bio | 0.126837 | 0.070990 | 0.042912 | 0.043476 |
number_of_posts | -0.004232 | 0.007689 | 0.014326 | 0.013440 |
number_of_followers | 0.024701 | 0.022007 | 0.018941 | 0.016726 |
number_of_follows | -0.006485 | -0.006301 | -0.006669 | -0.006438 |
has_a_profile_pic_yes | 19.459786 | 5.661241 | 0.000000 | 0.000000 |
Actually Deleting the Suggested Variables
Given that LASSO model 2 produced the best (highest) test data AUC (0.972), this may suggest that we would be overfitting by including has_a_profile_pic_yes.
However, given that the test data AUC of LASSO model 3 starts to decrease (0.954), this may suggest that the additionally zeroed out explanatory variable number_of_words_in_name is not leading to overfitting, and that we should keep it in the model.
Let's keep this LASSO model 2 feature selection suggestion in mind and actually fit another non-regularized logistic regression model that does not include has_a_profile_pic_yes in the features matrix.
log_mod_no_profile = LogisticRegression(penalty='none', solver='newton-cg',
max_iter=1000)
log_mod_no_profile.fit(X_train.drop(['has_a_profile_pic_yes'], axis=1),
y_train)
LogisticRegression(max_iter=1000, penalty='none', solver='newton-cg')
dfcoef = pd.DataFrame(log_mod_no_profile.coef_.T,
columns=['Non_Regularized'], index=X_train.drop(['has_a_profile_pic_yes'], axis=1).columns)
dfcoef.loc['Intercept'] = log_mod_no_profile.intercept_[0]
dfcoef
Non_Regularized | |
---|---|
number_of_words_in_name | 2.617109 |
num_characters_in_bio | 0.047733 |
number_of_posts | 0.024807 |
number_of_followers | 0.027004 |
number_of_follows | -0.007986 |
Intercept | -7.629245 |
And now let's calculate the test data AUC of this model.
#Non-regularized model (with no has_a_profile_pic_yes)
predictive_probs_test = log_mod_no_profile.predict_proba(X_test.drop(['has_a_profile_pic_yes'], axis=1))[:,1]
fprs, tprs, thresholds = roc_curve(y_true=y_test,
y_score=predictive_probs_test)
auc_no_profile = roc_auc_score(y_true=y_test,
y_score=predictive_probs_test)
print('Non-Regularized Model with no has_a_profile_pic_yes \n Test AUC:', auc_no_profile)
Non-Regularized Model with no has_a_profile_pic_yes 
 Test AUC: 0.8981481481481481
print('Lasso Model 2 \n Test AUC:', auc_lasso2)
Lasso Model 2
Test AUC: 0.9722222222222222
Interestingly enough, our LASSO model 2, which zeroed out the has_a_profile_pic_yes variable, had a much better test data AUC (0.972) than our non-regularized regression model that actually did not include the has_a_profile_pic_yes variable at all (0.898).
LASSO Model 2
$
\hat{p} = \frac{1}{1 + \exp\left(-\begin{aligned}
&-2.824735 \\
&+ 0.333562\,(\text{number of words in name}) \\
&+ 0.042911\,(\text{num characters in bio}) \\
&+ 0.014327\,(\text{number of posts}) \\
&+ 0.018940\,(\text{number of followers}) \\
&- 0.006669\,(\text{number of follows}) \\
&+ 0\,(\text{has a profile pic}[T.yes])
\end{aligned}\right)}
$
Non-Regularized Regression Model without has_a_profile_pic_yes
$
\hat{p} = \frac{1}{1 + \exp\left(-\begin{aligned}
&-7.622512 \\
&+ 2.614405\,(\text{number of words in name}) \\
&+ 0.047701\,(\text{num characters in bio}) \\
&+ 0.024782\,(\text{number of posts}) \\
&+ 0.026986\,(\text{number of followers}) \\
&- 0.007981\,(\text{number of follows})
\end{aligned}\right)}
$
However, notice how it was not just the has_a_profile_pic_yes variable that differed between the two models above; the other slope values changed as well. Take particular note of the dramatic change in the slope of the number_of_words_in_name variable.
So we can see that the changed slopes in this model also contributed to producing a model that had better test dataset performance.
Ridge Logistic Regression
Next, let's fit a ridge logistic regression model for three values of $\lambda$: 0.1, 1.5, and 10.
We also use the LogisticRegression() function to fit regularized logistic regression models such as ridge regression as well. In this case, we'll use the following parameters.
- penalty: By selecting 'l2' for the penalty, we are fitting a logistic ridge regression (L2 penalty) model.
- solver: We will use the 'liblinear' solver. liblinear is an optimization library for fitting penalized linear models, such as L1- and L2-penalized logistic regression.
- max_iter: This algorithm stops after 1000 iterations or when the algorithm has converged.
- C: This value is set to be $\frac{1}{\lambda}$. Thus if we want our $\lambda=3$ in this model, we need to set $C=\frac{1}{\lambda}=\frac{1}{3}$.
Additional Information: The 'liblinear' solver only works for: LASSO logistic regression and logistic ridge regression.
Fitting Three Ridge Regression Models
ridge_mod1 = LogisticRegression('l2', solver='liblinear',
max_iter=1000, C=1/0.1)
ridge_mod1.fit(X_train,y_train)
LogisticRegression(C=10.0, max_iter=1000, solver='liblinear')
ridge_mod2 = LogisticRegression('l2', solver='liblinear',
max_iter=1000, C=1/1.5)
ridge_mod2.fit(X_train,y_train)
LogisticRegression(C=0.6666666666666666, max_iter=1000, solver='liblinear')
ridge_mod3 = LogisticRegression('l2', solver='liblinear',
max_iter=1000, C=1/10)
ridge_mod3.fit(X_train,y_train)
LogisticRegression(C=0.1, max_iter=1000, solver='liblinear')
Inspecting the Slopes
Next, let's inspect the slopes and intercepts of these three ridge regression models. Notice how, as we might expect, none of the slopes have been zeroed out.
dfcoef = pd.DataFrame(
np.concatenate((log_mod.coef_.T,
ridge_mod1.coef_.T,
ridge_mod2.coef_.T,
ridge_mod3.coef_.T),
axis=1),
columns=['Non_Regularized', 'Ridge_1','Ridge_2','Ridge_3'], index=X_train.columns)
dfcoef
Non_Regularized | Ridge_1 | Ridge_2 | Ridge_3 | |
---|---|---|---|---|
number_of_words_in_name | 3.417944 | 1.579619 | 0.080976 | -0.265475 |
num_characters_in_bio | 0.126837 | 0.048380 | 0.043408 | 0.044468 |
number_of_posts | -0.004232 | 0.010184 | 0.012730 | 0.013329 |
number_of_followers | 0.024701 | 0.018378 | 0.015872 | 0.013302 |
number_of_follows | -0.006485 | -0.006004 | -0.006247 | -0.006247 |
has_a_profile_pic_yes | 19.459786 | 2.116854 | 0.279057 | -0.035240 |
Notice, for instance, how the has_a_profile_pic_yes and number_of_words_in_name slopes experienced more dramatic declines in magnitude than the other variables as we increase $\lambda$ in the ridge regression models. This may suggest that these variables bring less predictive power to the model, but this suggestion is much less clear than the suggestions that we got from the LASSO models.
Testing the Actual Fitted Ridge Regression Models
Let's also test and compare our three ridge regression models' test ROC and AUC.
#Non-regularized regression model
predictive_probs_test = log_mod.predict_proba(X_test)[:,1]
auc_nonreg = roc_auc_score(y_true=y_test,
y_score=predictive_probs_test)
print('Non-Regularized Model Test AUC:', auc_nonreg)
#Ridge regression model 1
predictive_probs_test = ridge_mod1.predict_proba(X_test)[:,1]
auc_ridge1 = roc_auc_score(y_true=y_test,
y_score=predictive_probs_test)
print('Ridge regression Model 1 Test AUC:', auc_ridge1)
#Ridge regression model 2
predictive_probs_test = ridge_mod2.predict_proba(X_test)[:,1]
auc_ridge2 = roc_auc_score(y_true=y_test,
y_score=predictive_probs_test)
print('Ridge regression Model 2 Test AUC:', auc_ridge2)
#Ridge regression model 3
predictive_probs_test = ridge_mod3.predict_proba(X_test)[:,1]
auc_ridge3 = roc_auc_score(y_true=y_test,
y_score=predictive_probs_test)
print('Ridge regression Model 3 Test AUC:', auc_ridge3)
Non-Regularized Model Test AUC: 0.9166666666666667
Ridge regression Model 1 Test AUC: 0.9074074074074074
Ridge regression Model 2 Test AUC: 0.9444444444444445
Ridge regression Model 3 Test AUC: 0.9074074074074074
Evaluating Performance
It looks like we do get better test data AUC performance with ridge regression model 2 (0.944) than with our non-regularized model (0.917). However, an AUC of 0.944 is still not the highest we've seen in this module; our LASSO model 2 test AUC was higher.
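Rather than hand-picking three $\lambda$ values and comparing their test AUCs, we could also let cross-validation on the training data choose the penalty for us. The sketch below uses sklearn's LogisticRegressionCV for this; the grid of candidate $\lambda$ values and the choice of 5 folds are illustrative assumptions, not part of the original analysis.
# A sketch: choose the ridge penalty by cross-validation on the training data.
# Assumes X_train, y_train, X_test, y_test from earlier in this module.
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score

lambdas = [0.01, 0.1, 0.5, 1.5, 5, 10]             # illustrative candidate values
ridge_cv = LogisticRegressionCV(Cs=[1/lam for lam in lambdas],
                                penalty='l2', solver='liblinear',
                                scoring='roc_auc', cv=5, max_iter=1000)
ridge_cv.fit(X_train, y_train)

print('Cross-validated choice of lambda:', 1/ridge_cv.C_[0])
print('Corresponding Test AUC:',
      roc_auc_score(y_test, ridge_cv.predict_proba(X_test)[:, 1]))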
Elastic Net Logistic Regression
Finally, let's fit elastic net logistic regression models for three values of $\lambda$ and a given value of $\alpha$.
Suppose that in this situation we would like our fitted models to look slightly more like what a ridge regression model might return than what a LASSO model might return. Thus, let's select a value of $\alpha=0.4$ that is slightly closer to 0 than it is to 1.
We can also use the LogisticRegression() function to fit an elastic net logistic regression model. In this case, we'll use the following parameters.
- penalty: By selecting 'elasticnet' for the penalty, we are fitting an elastic net (L1 and L2 penalty) model.
- solver: We will use the 'saga' solver. Saga is a numerical optimization method that only works for specific types of objective functions.
- max_iter: The algorithm stops after 1000 iterations or once it has converged, whichever comes first.
- C: This value is set to be $\frac{1}{\lambda}$. Thus if we want $\lambda=3$ in this model, we need to set $C=\frac{1}{\lambda}=\frac{1}{3}$.
- l1_ratio: The $\alpha$ in sklearn is represented by the "l1_ratio" parameter. With $\alpha=$ l1_ratio $=0.4$, this particular elastic net model will favor solutions that more closely resemble ridge regression results than LASSO results (see the penalty sketched just after this list).
Additional Information: In sklearn, the 'elasticnet' penalty is only supported by the 'saga' solver.
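As a rough reminder of how $\lambda$ and $\alpha$ interact, one common way to write the elastic net penalty that gets added to the model's (negative) log-likelihood is shown below, where $b_1, \dots, b_k$ denote the $k$ slopes. Conventions differ slightly from source to source (for example, sklearn places a factor of $\frac{1}{2}$ on the L2 term), so treat this as a sketch rather than the exact objective that sklearn minimizes.
$
\text{Elastic Net Penalty} = \lambda\left[\alpha\sum_{j=1}^{k}|b_j| + (1-\alpha)\sum_{j=1}^{k}b_j^2\right]
$
With $\alpha=0.4$, the L1 (LASSO) term gets weight 0.4 and the L2 (ridge) term gets weight 0.6, which is why we expect these fits to lean slightly toward ridge-like behavior.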
Fitting Three Elastic Net Regression Models
en_mod1 = LogisticRegression(penalty='elasticnet', solver='saga',
                             max_iter=1000, l1_ratio=0.4, C=1/0.1)
en_mod1.fit(X_train, y_train)
LogisticRegression(C=10.0, l1_ratio=0.4, max_iter=1000, penalty='elasticnet',
solver='saga')
en_mod2 = LogisticRegression(penalty='elasticnet', solver='saga',
                             max_iter=1000, l1_ratio=0.4, C=1/1.5)
en_mod2.fit(X_train, y_train)
LogisticRegression(C=0.6666666666666666, l1_ratio=0.4, max_iter=1000,
penalty='elasticnet', solver='saga')
en_mod3 = LogisticRegression(penalty='elasticnet', solver='saga',
                             max_iter=1000, l1_ratio=0.4, C=1/5)
en_mod3.fit(X_train, y_train)
LogisticRegression(C=0.2, l1_ratio=0.4, max_iter=1000, penalty='elasticnet',
solver='saga')
Inspecting the Slopes
Next, let's inspect the slopes of these three elastic net regression models alongside the non-regularized slopes. Notice how, unlike in the ridge regression models, some of the slopes have been shrunk essentially to zero, as we might expect from a penalty that includes an L1 (LASSO) component.
dfcoef = pd.DataFrame(
np.concatenate((log_mod.coef_.T,
en_mod1.coef_.T,
en_mod2.coef_.T,
en_mod3.coef_.T),
axis=1),
columns=['Non_Regularized', 'EN_1','EN_2','EN_3'], index=X_train.columns)
dfcoef
| | Non_Regularized | EN_1 | EN_2 | EN_3 |
|---|---|---|---|---|
| number_of_words_in_name | 3.417944 | -0.000968 | -0.000909 | -0.000763 |
| num_characters_in_bio | 0.126837 | 0.017061 | 0.017025 | 0.016928 |
| number_of_posts | -0.004232 | 0.011596 | 0.011579 | 0.011510 |
| number_of_followers | 0.024701 | 0.009041 | 0.009035 | 0.009022 |
| number_of_follows | -0.006485 | -0.004902 | -0.004897 | -0.004884 |
| has_a_profile_pic_yes | 19.459786 | -0.000234 | -0.000182 | -0.000046 |
Notice, for instance, how the has_a_profile_pic_yes slope has been effectively zeroed out. The number_of_words_in_name slope, on the other hand, experienced a dramatic decline in magnitude but was never actually zeroed out.
This is what we might expect from an elastic net model, which combines elements of both a LASSO model and a ridge regression model.
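One way to see this "in between" behavior more directly is to hold $\lambda$ fixed and vary $\alpha$ (ie. l1_ratio) from 0 up to 1, watching the slopes move from ridge-like shrinkage toward LASSO-like zeroing. The sketch below does this, assuming X_train and y_train from earlier; the fixed $\lambda=1.5$ and the grid of $\alpha$ values are illustrative choices.
# A sketch: hold lambda fixed and vary l1_ratio (alpha) to watch the fitted
# slopes move from ridge-like to LASSO-like. Assumes X_train, y_train.
import pandas as pd
from sklearn.linear_model import LogisticRegression

lam = 1.5                                   # illustrative fixed lambda
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]        # 0 = pure ridge, 1 = pure LASSO
slopes = {}
for a in alphas:
    mod = LogisticRegression(penalty='elasticnet', solver='saga',
                             max_iter=5000, C=1/lam, l1_ratio=a)
    mod.fit(X_train, y_train)
    slopes['alpha=' + str(a)] = mod.coef_.ravel()

print(pd.DataFrame(slopes, index=X_train.columns).round(4))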
Testing the Actual Fitted Elastic Net Models
Let's also compute and compare the test ROC AUC for our three elastic net regression models, along with the non-regularized model.
#Non-regularized regression model
predictive_probs_test = log_mod.predict_proba(X_test)[:,1]
auc_nonreg = roc_auc_score(y_true=y_test,
y_score=predictive_probs_test)
print('Non-Regularized Model Test AUC:', auc_nonreg)
#Elastic Net Regression model 1
predictive_probs_test = en_mod1.predict_proba(X_test)[:,1]
auc_en1 = roc_auc_score(y_true=y_test,
y_score=predictive_probs_test)
print('Elastic Net Regression Model 1 Test AUC:', auc_en1)
#Elastic Net Regression model 2
predictive_probs_test = en_mod2.predict_proba(X_test)[:,1]
auc_en2 = roc_auc_score(y_true=y_test,
y_score=predictive_probs_test)
print('Elastic Net Regression Model 2 Test AUC:', auc_en2)
#Elastic Net Regression model 3
predictive_probs_test = en_mod3.predict_proba(X_test)[:,1]
auc_en3 = roc_auc_score(y_true=y_test,
y_score=predictive_probs_test)
print('Elastic Net Regression Model 3 Test AUC:', auc_en3)
Non-Regularized Model Test AUC: 0.9166666666666667
Elastic Net Regression Model 1 Test AUC: 0.8796296296296297
Elastic Net Regression Model 2 Test AUC: 0.8796296296296297
Elastic Net Regression Model 3 Test AUC: 0.8796296296296297
Evaluating Performance
Unfortunately, it looks like none of our elastic net models yielded superior test performance to our non-regularized model. This may be a bit puzzling. But remember that this is the first time the LogisticRegression() function has used solver='saga' to attempt to find the optimal intercept and slopes. Perhaps the fact that we are using the saga solver points to the reason for these worse results.
Testing Suspicion
Let's explore this suspicion by fitting one more elastic net model with $\alpha$ (ie. l1_ratio) set to exactly 1. In theory, this model should produce the same results as a LASSO model with the same $\lambda$ value. We'll use the $\lambda=1.5$ value that created our LASSO model 2, which had the best test data AUC we've seen so far (0.977).
en_mod4 = LogisticRegression(penalty='elasticnet', solver='saga',
                             max_iter=1000, l1_ratio=1, C=1/1.5)
en_mod4.fit(X_train, y_train)
LogisticRegression(C=0.6666666666666666, l1_ratio=1, max_iter=1000,
penalty='elasticnet', solver='saga')
#Elastic Net Regression model 4
predictive_probs_test = en_mod4.predict_proba(X_test)[:,1]
auc_en4 = roc_auc_score(y_true=y_test,
y_score=predictive_probs_test)
print('Elastic Net Regression Model 4 Test AUC:', auc_en4)
Elastic Net Regression Model 4 Test AUC: 0.8796296296296297
The fact that this elastic net model, which was theoretically designed to be exactly the same as our LASSO model 2, had a dramatically lower test data AUC (0.8796 vs. 0.977) may suggest that it is the saga solver that is producing inferior results in the elastic net models. We might want to be suspicious of the elastic net models' suggestions, at least when it comes to this particular dataset.
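One plausible (but unverified) explanation is that the saga solver simply has not converged on these unscaled features within 1000 iterations, since saga is known to be sensitive to feature scaling. The sketch below shows how we might check this by inspecting the fitted model's n_iter_ attribute and refitting the same $\alpha=1$ model with standardized features. The use of StandardScaler and the larger max_iter are illustrative choices, not part of the original analysis, and note that standardizing the features changes what a given $\lambda$ means.
# A diagnostic sketch: did saga actually converge, and does standardizing the
# features change the picture? Assumes en_mod4, X_train, y_train, X_test, y_test.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# If this equals max_iter (1000), saga most likely stopped before converging.
print('Iterations used by en_mod4:', en_mod4.n_iter_)

scaled_en = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='elasticnet', solver='saga',
                       max_iter=10000, l1_ratio=1, C=1/1.5))
scaled_en.fit(X_train, y_train)
print('Standardized-features elastic net Test AUC:',
      roc_auc_score(y_test, scaled_en.predict_proba(X_test)[:, 1]))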
Principal Component Logistic Regression
We can use the same principal component regression techniques when fitting a logistic regression model as well. That is, we can use principal component analysis to represent our $X_{train}$ matrix with the first $k$ principal components of its corresponding principal component matrix $X_{pctrain}$ instead.
By employing this technique we can resolve any multicollinearity issues, as ALL principal component columns of any $X_{pctrain}$ will always have pairwise correlations of exactly 0! Similarly, using the first $k$ columns of $X_{pctrain}$ instead of $X_{train}$ as our features matrix in the logistic regression model may also help reduce the chance of overfitting and yield better performance on test/new datasets.
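If we wanted to verify the zero-correlation claim on our own data, a quick illustrative check is to compute the principal component scores of $X_{train}$ and look at their correlation matrix; every off-diagonal entry should be 0 up to floating point error. The standardization step below is an assumption about how the principal components are constructed, not something taken from the original analysis.
# A quick illustrative check that principal component columns are pairwise
# uncorrelated. Assumes X_train from earlier in this module.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_pctrain = PCA().fit_transform(StandardScaler().fit_transform(X_train))
print(np.round(np.corrcoef(X_pctrain, rowvar=False), 6))   # off-diagonals ~ 0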
When testing, we likewise represent the $X_{test}$ matrix with the first $k$ principal components of its corresponding principal component matrix $X_{pctest}$. But don't forget to use the special technique for creating $X_{pctest}$ that we referenced in Module 9, which involves using the $X_{pctrain}$ loading vectors.
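A minimal sketch of what principal component logistic regression might look like for this dataset is given below, assuming the X_train, y_train, X_test, and y_test objects from earlier. A Pipeline is one convenient way to make sure the test data is transformed with the training data's loading vectors; the choice of $k=3$ components and the StandardScaler step are illustrative assumptions.
# A sketch of principal component logistic regression. The Pipeline fits PCA on
# the training data and reuses those training loading vectors to transform the
# test data. Assumes X_train, y_train, X_test, y_test from earlier.
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

k = 3                                      # illustrative; in practice compare several k
pcr_mod = make_pipeline(StandardScaler(),
                        PCA(n_components=k),
                        LogisticRegression(max_iter=1000))
pcr_mod.fit(X_train, y_train)

print('Principal Component Logistic Regression Test AUC:',
      roc_auc_score(y_test, pcr_mod.predict_proba(X_test)[:, 1]))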