Introducing Logistic Regression


Categorical Response Variable with Two Levels

Let's scale back our analysis for now and suppose that we would like to predict our response variable account_type using just number_of_followers.

As we've discussed before, we usually visualize the relationship between a numerical variable and a categorical variable with side-by-side boxplots.

sns.boxplot(x='number_of_followers', y='account_type', data=df_train)
plt.title('Training Data')
plt.show()
[Figure: side-by-side boxplots of number_of_followers by account_type for the training data]

Attempting to use a Linear Regression Model

But suppose, for a moment, that we wanted to try to predict account_type from number_of_followers using a linear regression model. Given that the response variable account_type is categorical, how might we attempt to do this?

0/1 Response Variable Conversion

One thing that we could try is creating a new binary numerical variable in our dataframes in which:

  • 1 = real accounts
  • 0 = fake accounts.

Which account type we choose to represent with a 1 vs. a 0 is somewhat important.

  • Usually, the value that is represented with a 1 is called the success level. It usually represents the particular level that we are interested in. For instance, our research goal might state "we are interested in predicting whether an account is real".
  • In addition, the value that is represented with a 0 is called the failure level. Technically, we know, for instance, that if our model predicts that an observation is not real (i.e. not 1), then we know that it is fake (i.e. 0).

Based on this relationship, switching the particular 0/1 value assignments yields the same core model results. However, if you are more interested in predicting whether an account is real, setting 1 = real will be better for interpretation purposes (we'll see why in section ???).

Creating the 0/1 Response Variable

Let's call this new variable y and create it for both our training and test datasets. We can create this variable by using the .replace() function on the pandas Series (column) that we would like to create a modified version of. We supply the .replace() function with a dictionary that maps the value conversion that we would like to happen.

  • the keys represent the old values in the old dataframe column
  • the values represent the new values that we would like to replace them with.

Notice that assigning the output of .replace() to df_train['y'] creates a new column called 'y' and keeps the 'account_type' column as well.

df_train['y'] = df_train['account_type'].replace({'real':1, 'fake':0})
df_train.head()

|    | has_a_profile_pic | number_of_words_in_name | num_characters_in_bio | number_of_posts | number_of_followers | number_of_follows | account_type | y |
|----|-------------------|-------------------------|-----------------------|-----------------|---------------------|-------------------|--------------|---|
| 47 | yes | 2 | 0  | 0  | 87  | 40   | real | 1 |
| 51 | yes | 2 | 81 | 25 | 341 | 274  | real | 1 |
| 75 | no  | 1 | 0  | 1  | 24  | 2    | fake | 0 |
| 93 | yes | 0 | 0  | 15 | 772 | 3239 | fake | 0 |
| 76 | no  | 2 | 0  | 0  | 13  | 22   | fake | 0 |
df_test['y'] = df_test['account_type'].replace({'real':1, 'fake':0})
df_test.head()

|    | has_a_profile_pic | number_of_words_in_name | num_characters_in_bio | number_of_posts | number_of_followers | number_of_follows | account_type | y |
|----|-------------------|-------------------------|-----------------------|-----------------|---------------------|-------------------|--------------|---|
| 71 | no  | 1 | 0   | 0   | 16  | 2   | fake | 0 |
| 6  | yes | 1 | 132 | 9   | 213 | 254 | real | 1 |
| 16 | yes | 2 | 86  | 25  | 96  | 499 | real | 1 |
| 84 | no  | 1 | 0   | 0   | 21  | 31  | fake | 0 |
| 55 | yes | 7 | 24  | 465 | 654 | 535 | real | 1 |
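
As a quick sanity check (a small sketch, not part of the original analysis), we could cross-tabulate the old and new columns to confirm that every real account was coded as a 1 and every fake account as a 0.

import pandas as pd

#Sanity check: every 'real' row should have y=1 and every 'fake' row should have y=0
pd.crosstab(df_train['account_type'], df_train['y'])

#Note: using {'real':0, 'fake':1} instead would simply produce the complement 1 - y,
#so either coding carries the same information; 1 = real just matches our research goal.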

Why not linear regression?

Attempting a Linear Regression Curve

With this 0/1 'numerical' representation of account_type (y), we now have y and number_of_followers as two numerical variables which can be represented in a scatterplot.

sns.lmplot(x='number_of_followers', y='y', data=df_train, ci=False)
plt.show()
[Figure: scatterplot of y (0/1 account type) vs. number_of_followers with the best fit line]

And thus, we could now technically fit a simple linear regression model to predict y, as we do below.
$$\hat{y} = 0.2198 + 0.0009\cdot\text{number\_of\_followers}$$

lin_mod = smf.ols(formula='y~number_of_followers', data=df_train).fit()
lin_mod.summary().tables[1]
|                     | coef   | std err | t     | P>\|t\| | [0.025 | 0.975] |
|---------------------|--------|---------|-------|---------|--------|--------|
| Intercept           | 0.2198 | 0.063   | 3.478 | 0.001   | 0.094  | 0.346  |
| number_of_followers | 0.0009 | 0.000   | 5.275 | 0.000   | 0.001  | 0.001  |

...
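
To see what this fitted line actually gives us for a particular account, we can plug a value into the equation or, equivalently, use the model's .predict() method. The snippet below is a small sketch that assumes the lin_mod object fit above; the follower count of 500 is just an illustrative value, and the hand calculation uses the rounded coefficients, so the two results will differ slightly.

import pandas as pd

#Prediction for a hypothetical account with 500 followers
by_hand = 0.2198 + 0.0009*500                                             #rounded equation: about 0.67
by_model = lin_mod.predict(pd.DataFrame({'number_of_followers': [500]}))  #full-precision fitted model
print(by_hand)
print(by_model)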

But let's try to think of some drawbacks that we might encounter when fitting this linear regression curve to this 0/1 response variable.

Not a Suitable Model

First, we can see that there is clearly not a linear relationship between these two numerical variables. Because this is a simple linear regression model, we can see this simply by looking at a scatterplot of the explanatory variable and the response variable.

sns.lmplot(x='number_of_followers', y='y', data=df_train, ci=False)
plt.show()
[Figure: scatterplot of y vs. number_of_followers with the best fit line]

Or we could use our more sophisticated technique that also works for multiple linear regression. In the fitted values vs. residuals plot for this model, there is at least one small-width x-axis window (look toward the right of the plot) that does not have a roughly even split of positive and negative residuals. This tells us that there is not a linear relationship between the explanatory variable(s) and the response variable, and thus that a linear regression model is not suitable because the data do not meet this linearity assumption.

plt.scatter(lin_mod.fittedvalues, lin_mod.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted (Predicted) values')
plt.ylabel('Residuals')
plt.title('Fitted values vs. Residuals Plot')
plt.show()
[Figure: fitted values vs. residuals plot for the linear regression model]

Bad Predictions

Sometimes, even if a linear model is not the most suitable, we may still use it if it happens to give us the best predictions and good predictions are the main thing that we care about.

However, by looking at the best fit linear regression line over the data, as well as the low $R^2$ value, we can see that the predictions made with this model will be way off. Because our main goal for this dataset is to build a model that yields good predictions, this is not ideal.

lin_mod.rsquared

0.2533483509117226

sns.lmplot(x='number_of_followers', y='y', data=df_train, ci=False)
plt.show()
[Figure: scatterplot of y vs. number_of_followers with the best fit line]

Confusing Prediction Interpretations

Finally, let's consider how our linear regression model might make predictions that are hard to interpret. If, for a given account, our model yields a prediction of $\hat{y}=1$, then we could predict that this account is real, and if $\hat{y}=0$, then we could predict that this account is fake.

$\hat{y} = 0.2198 + 0.0009\cdot\text{number\_of\_followers}$

However, notice in the best fit line and scatterplot above that most likely none of our predictions will be exactly $\hat{y}=1$ or $\hat{y}=0$. This raises the following two complicating questions.

What does it mean for $\hat{y}>1$ or $\hat{y}<0$?

One of the main drawbacks of a linear regression model is that it is allowed to "overshoot" our only easily interpretable predicted values of $\hat{y}=1$ and $\hat{y}=0$. A prediction with $\hat{y}>1$ or $\hat{y}<0$ is unfortunately confusing and nonsensical.
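
For example, here is a quick calculation using the rounded coefficients from above (the follower count of 2000 is just an illustrative value):

#Using the rounded fitted equation y-hat = 0.2198 + 0.0009*number_of_followers
print(0.2198 + 0.0009*2000)    #about 2.02: greater than 1 and hard to interpret

#The line crosses y-hat = 1 at roughly (1 - 0.2198)/0.0009 followers,
#so any account with more followers than this gets a prediction above 1.
print((1 - 0.2198)/0.0009)     #about 867 followers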

What does it mean for $0<\hat{y}<1$?

Our model above is also allowed to make predictions with $0<\hat{y}<1$. We might be tempted to say that predictions like this are also nonsensical. However, as we'll see in section 3.3 when we fit our logistic regression model, predicted values like this actually have a very interesting and useful meaning!

Looking for a better model

Given these drawbacks that we observed when fitting a linear regression model, it'd be ideal if we could find another curve $f(\text{number\_of\_followers})$ (not a line) that is bounded to stay between $\hat{y}=0$ and $\hat{y}=1$.

Sigmoid Function

It turns out there is a function $f(x)$ that has this property, known as a sigmoid function.

Sigmoid Function and its Properties

Sigmoid Function

A basic sigmoid function is defined by either of the two equivalent expressions below.

$S(x)=\frac{1}{1+e^{-x}}=\frac{e^x}{e^x+1}$
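
As a quick numerical check (a small sketch, not part of the original text), we can verify that the two expressions agree:

import numpy as np

#Evaluate both forms of the sigmoid on the same grid of x-values
x = np.linspace(-10, 10, 100)
form1 = 1/(1 + np.exp(-x))
form2 = np.exp(x)/(np.exp(x) + 1)

#Prints True: the two forms match (up to floating-point error) at every x
print(np.allclose(form1, form2))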

Properties

The sigmoid function creates an "S"-shaped curve that is bounded between two asymptotes as shown below:

  • one at $S(x)=1$
  • one at $S(x)=0$.

#Defining some x-values
x = np.linspace(-10, 10, 100)

#Basic sigmoid function
p = 1 / (1 + np.exp(-x))

# Plotting
plt.plot(x, p)
plt.xlabel('x')
plt.ylabel('S(x)')
plt.show()

...

[Figure: plot of the basic sigmoid function S(x)]

If we plot the basic sigmoid curve $S(x)=\frac{1}{1+e^{-x}}$ along with our training dataset below, we see that it is actually not a good fit. In particular, the model's predictions for the fake accounts (i.e. those with y=0) will be terrible.

But if we could horizontally shift as well as horizontally stretch/compress this basic sigmoid function $S(x)$, we might be able to get a better fit.

#Data
sns.scatterplot(x='number_of_followers', y='y',data=df_train)

#Basic sigmoid function curve
x = np.linspace(-1600, 1600, 100)
p = 1 / (1 + np.exp(-x))
plt.plot(x, p, color='orange', label='y=1/(1+e^(-x))')
plt.legend()
plt.show()

C:\Users\vellison\AppData\Local\Temp\ipykernel_13180\789729522.py:6: RuntimeWarning: overflow encountered in exp
p = 1 / (1 + np.exp(-x))

[Figure: training data with the basic sigmoid curve overlaid]

Best Fit Sigmoid Function

Let's remember from our algebra classes how we can modify the shape of any given function $S(x)$ as follows.

  • **Multiplying** the variable $x$ on the inside of the function by some coefficient $\hat{\beta}_1$ **horizontally stretches/compresses** the original function shape.
  • **Adding** some constant $\hat{\beta}_0$ on the inside of the function **horizontally shifts** the original function shape.

$S(x)=\frac{1}{1+e^{-x}}$

$S(\hat{\beta}_0+\hat{\beta}_1x)=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}$

So, for instance, suppose we let our stretch parameter be $\hat{\beta}_1=0.006$ and our shift parameter be $\hat{\beta}_0=-1.58$. Then this new stretched and shifted sigmoid curve $S(-1.58+0.006x)=\frac{1}{1+e^{-(-1.58+0.006x)}}$ actually achieves a better overall fit of our data, as we can see below.

#Data
sns.scatterplot(x='number_of_followers', y='y',data=df_train)

#Two sigmoid curves
p = 1 / (1 + np.exp(-x))
p_new= 1 / (1 + np.exp(-(-1.58+0.006*x)))

#Horizontally shifted and stretched sigmoid curve
plt.plot(x, p,color='orange', label='y=1/(1+e^(-x))')
plt.plot(x, p_new, color='green', label='y=1/(1+e^(-(-1.58+0.006x)))')
plt.legend(bbox_to_anchor=(1,1))
plt.show()

C:\Users\vellison\AppData\Local\Temp\ipykernel_13180\1263333318.py:5: RuntimeWarning: overflow encountered in exp
p = 1 / (1 + np.exp(-x))

[Figure: training data with the basic and the shifted/stretched sigmoid curves overlaid]
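
One way to see what this particular shift and stretch are doing (a small side calculation using the illustrative values $\hat{\beta}_0=-1.58$ and $\hat{\beta}_1=0.006$ from above): the curve crosses $\hat{y}=0.5$ exactly where the inside of the sigmoid equals zero, i.e. at $x=-\hat{\beta}_0/\hat{\beta}_1$.

#Midpoint of the shifted/stretched sigmoid: S(beta0 + beta1*x) = 0.5 when beta0 + beta1*x = 0
beta0, beta1 = -1.58, 0.006
print(-beta0/beta1)    #about 263 followers; accounts above this get predictions closer to 1 (real)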

We can automatically overlay a "best fit sigmoid curve" on our dataset by using the sns.regplot() function, in the same way that we used the sns.lmplot() function to plot a "best fit linear curve". The only additional parameter we need to specify in sns.regplot() is logistic=True.

sns.regplot(x='number_of_followers', y='y',
            data=df_train, ci=False,
            logistic=True)
plt.show()
[Figure: best fit sigmoid curve from sns.regplot overlaid on the training data]

Predictive Probability and Logistic Regression

Notice that even with our "best fit sigmoid" function, we still end up with predicted values that are $0<\hat{y}<1$.

$\hat{y}=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}$

So we have yet to answer the question: how should we interpret a predicted value that is between 0 and 1?

More formally, we actually formulate and represent this "best fit sigmoid function" as the equation shown below, with the variable $\hat{p}$ on the left-hand side. We call this a simple logistic regression curve. Much like simple linear regression, this is a simple logistic regression because it has only one explanatory variable $x$.

Simple Logistic Regression Model

$P(\hat{Y}=1)=\hat{p}=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}$

We call this $\hat{p}$ the predictive probability, which represents the probability that the response variable $y$ corresponding to the given explanatory variable(s) is equal to 1 (or in other words, $P(\hat{Y}=1)$).
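
For example (a sketch that reuses the illustrative curve $\hat{p}=\frac{1}{1+e^{-(-1.58+0.006x)}}$ from earlier, rather than coefficients reported in this section), an account with 500 followers would get the following predictive probability.

import numpy as np

#Predictive probability for a hypothetical account with 500 followers
x = 500
p_hat = 1/(1 + np.exp(-(-1.58 + 0.006*x)))
print(p_hat)    #about 0.81: we'd estimate roughly an 81% chance this account is real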

Logistic Regression Model Representations

With some algebraic manipulation, there are actually many ways that we can choose to represent a logistic regression model. Below are three of the most common ways.

Simple Logistic Regression Model
  • $\hat{p}=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}$
  • $\frac{\hat{p}}{1-\hat{p}}=e^{\hat{\beta}_0+\hat{\beta}_1x}$
  • $log(\frac{\hat{p}}{1-\hat{p}})=\hat{\beta}_0+\hat{\beta}_1x$

Note: Traditionally in statistics, when the $log()$ function is referenced, unless stated otherwise we are usually talking about the natural log $ln()$.
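
As a quick numerical check (a small sketch, not part of the original text), we can verify that the three representations describe the same relationship for any choice of coefficients and $x$:

import numpy as np

#Arbitrary illustrative values
beta0, beta1, x = -1.58, 0.006, 500

p_hat = 1/(1 + np.exp(-(beta0 + beta1*x)))    #representation 1
odds = p_hat/(1 - p_hat)                      #left-hand side of representation 2
log_odds = np.log(odds)                       #left-hand side of representation 3

#Both print True: the right-hand sides match (up to floating-point error)
print(np.isclose(odds, np.exp(beta0 + beta1*x)))
print(np.isclose(log_odds, beta0 + beta1*x))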

The third representation above is great because the right hand side now looks exactly like the right hand side of a linear regression curve.

But now this begs the question: what exactly does $ln(\frac{\hat{p}}{1-\hat{p}})$ or $\frac{\hat{p}}{1-\hat{p}}$ represent?