Introducing Logistic Regression
Categorical Response Variable with Two Levels
Let's scale back our analysis for now and suppose that we would like to predict our response variable account_type with just number_of_followers.
As we've discussed before, we usually visualize the relationship between a numerical variable and a categorical variable with side-by-side boxplots.
sns.boxplot(x='number_of_followers', y='account_type', data=df_train)
plt.title('Training Data')
plt.show()

Attempting to use a Linear Regression Model
But suppose for a moment that we wanted to try to predict account_type given the number_of_followers using a linear regression model. Given that the response variable account_type is categorical, how might we attempt to do this?
0/1 Response Variable Conversion
One thing that we could try is creating a new binary numerical variable in our dataframes in which:
- 1 = real accounts
- 0 = fake accounts.
Which account type we choose to represent a 1 vs. a 0 is somewhat important.
- Usually the value that is represented with a 1 is called the success level. It usually represents the particular level that we are interested in. For instance, our research goal might state "we are interested in predicting whether an account is real".
- In addition, the value that is represented with a 0 is called the failure level. Technically, we know, for instance, that if our model predicts that an observation is not real (i.e. not 1), then we know that it is fake (i.e. 0).
Based on this relationship, switching up the particular 0/1 value assignments yields the same core model results. However, if you are more interested in predicting whether the account is real, setting 1=real will be better for interpretation purposes (we'll see why in section ???).
Creating the 0/1 Response Variable
Let's call this new variable y and let's create it for both our training and test datasets. We can create this variable by calling the .replace() function on the pandas Series (column) that we would like to create a modified version of. We supply the .replace() function with a dictionary which maps the value conversion that we would like to happen:
- the keys represent the old values in the old dataframe column
- the values represent the new values that we would like to replace them with.
Notice that assigning the result to df_train['y'] creates a new column called 'y' while keeping the original 'account_type' column as well.
df_train['y'] = df_train['account_type'].replace({'real':1, 'fake':0})
df_train.head()
| | has_a_profile_pic | number_of_words_in_name | num_characters_in_bio | number_of_posts | number_of_followers | number_of_follows | account_type | y |
|---|---|---|---|---|---|---|---|---|
| 47 | yes | 2 | 0 | 0 | 87 | 40 | real | 1 |
| 51 | yes | 2 | 81 | 25 | 341 | 274 | real | 1 |
| 75 | no | 1 | 0 | 1 | 24 | 2 | fake | 0 |
| 93 | yes | 0 | 0 | 15 | 772 | 3239 | fake | 0 |
| 76 | no | 2 | 0 | 0 | 13 | 22 | fake | 0 |
df_test['y'] = df_test['account_type'].replace({'real':1, 'fake':0})
df_test.head()
| | has_a_profile_pic | number_of_words_in_name | num_characters_in_bio | number_of_posts | number_of_followers | number_of_follows | account_type | y |
|---|---|---|---|---|---|---|---|---|
| 71 | no | 1 | 0 | 0 | 16 | 2 | fake | 0 |
| 6 | yes | 1 | 132 | 9 | 213 | 254 | real | 1 |
| 16 | yes | 2 | 86 | 25 | 96 | 499 | real | 1 |
| 84 | no | 1 | 0 | 0 | 21 | 31 | fake | 0 |
| 55 | yes | 7 | 24 | 465 | 654 | 535 | real | 1 |
Why not linear regression?
Attempting a Linear Regression Curve
With this 0/1 'numerical' representation of account_type (y), we now have y and number_of_followers as two numerical variables which can be represented in a scatterplot.
sns.lmplot(x='number_of_followers', y='y', data=df_train, ci=False)
plt.show()

Thus we could now technically fit a simple linear regression model to predict y, as we do below.
$$\hat{y}=0.2198+0.0009\cdot\text{number\_of\_followers}$$
lin_mod = smf.ols(formula='y~number_of_followers', data=df_train).fit()
lin_mod.summary().tables[1]
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.2198 | 0.063 | 3.478 | 0.001 | 0.094 | 0.346 |
| number_of_followers | 0.0009 | 0.000 | 5.275 | 0.000 | 0.001 | 0.001 |
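For example, plugging a hypothetical account with 300 followers (a value chosen just for illustration) into this fitted line gives
$$\hat{y}=0.2198+0.0009\cdot 300\approx 0.49$$
a prediction that is neither exactly 0 nor exactly 1.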
But let's try to think of some drawbacks that we might encounter when fitting this linear regression curve to this 0/1 response variable.
Not a Suitable Model
First, we can see that there is clearly not a linear relationship between these two numerical variables. Because this is a simple linear regression model, we can see this simply by looking at a scatterplot of the explanatory variable and the response variable.
sns.lmplot(x='number_of_followers', y='y', data=df_train, ci=False)
plt.show()

Or we could use our more sophisticated technique, which also works for multiple linear regression. There is at least one small-width x-axis window in the fitted values vs. residuals plot for this model that does not have an even balance of positive and negative residuals (take a look at the right side of the plot). This tells us that there is not a linear relationship between the explanatory variable(s) and the response variable, and thus that a linear regression model is not suitable because the data does not meet the linearity assumption.
plt.scatter(lin_mod.fittedvalues, lin_mod.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted (Predicted) values')
plt.ylabel('Residuals')
plt.title('Fitted values vs. Residuals Plot')
plt.show()

Bad Predictions
Sometimes, even if a linear model is not the most suitable, we may still use it if it happens to give us the best predictions, and good predictions are the main thing that we care about.
However, by looking at the best fit linear regression curve of the data as well as the low $R^2$ value below, we can see that predictions made with this model will be way off. Because our main goal for this dataset is to build a model that yields good predictions, this is not ideal.
lin_mod.rsquared
0.2533483509117226
sns.lmplot(x='number_of_followers', y='y', data=df_train, ci=False)
plt.show()

Confusing Prediction Interpretations
Finally, let's consider how our linear regression model might make hard-to-interpret predictions. If for a given account our model yields a prediction of $\hat{y}=1$, then we could predict that this account is real, and if $\hat{y}=0$, then we could predict that this account is fake.
$\hat{y}=0.2198+0.0009\cdot\text{number\_of\_followers}$
However, notice in the best fit line and scatterplot above that most likely none of our predictions will be exactly $\hat{y}=1$ or $\hat{y}=0$. This raises the following two complicating questions.
What does it mean for $\hat{y}>1$ or $\hat{y}<0$?
One of the main drawbacks of a linear regression model is that the model is allowed to "overshoot" our only easily interpretable predicted values of $\hat{y}=1$ and $\hat{y}=0$. A prediction with $\hat{y}>1$ or $\hat{y}<0$ is unfortunately confusing and nonsensical.
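We can see this happen with our own fitted model. Below is a minimal sketch that predicts $\hat{y}$ for a hypothetical account with 2000 followers (a made-up value used only for illustration); based on the rounded coefficients in the table above, the result is roughly $0.2198+0.0009\cdot 2000\approx 2.02$, which is greater than 1.
import pandas as pd

# A hypothetical account with 2000 followers (value chosen only for illustration)
new_account = pd.DataFrame({'number_of_followers': [2000]})
# The linear model "overshoots" 1 for this account
lin_mod.predict(new_account)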
What does it mean for $0<\hat{y}<1$?
Our model above is also allowed to make predictions with $0<\hat{y}<1$. We might be tempted to say that predictions like these are also nonsensical. However, as we'll see in section 3.3 when we fit our logistic regression model, predicted values like these actually have a very interesting and useful meaning!
Looking for a better model
Given these drawbacks that we observed by fitting a linear regression model, it'd be ideal if we could find another curve $f(\text{number\_of\_followers})$ (not a line) that could be bounded to stay between $\hat{y}=0$ and $\hat{y}=1$.
Sigmoid Function
It turns out there is a function $f(x)$ with this property, known as a sigmoid function.
Sigmoid Function and its Properties
Sigmoid Function
A basic sigmoid function is defined by either of the following two equivalent expressions.
$S(x)=\frac{1}{1+e^{-x}}=\frac{e^x}{e^x+1}$
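To see why these two forms are equal, multiply the numerator and denominator of the first expression by $e^x$:
$$\frac{1}{1+e^{-x}}\cdot\frac{e^x}{e^x}=\frac{e^x}{e^x+e^x e^{-x}}=\frac{e^x}{e^x+1}$$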
Properties
The sigmoid function creates an "S"-shaped curve that is bounded between two asymptotes as shown below:
- one at $S(x)=1$
- one at $S(x)=0$.
#Defining some x-values
x = np.linspace(-10, 10, 100)
#Basic sigmoid function
p = 1 / (1 + np.exp(-x))
# Plotting
plt.plot(x, p)
plt.xlabel('x')
plt.ylabel('S(x)')
plt.show()

If we plot a basic sigmoid curve $S(x)=\frac{1}{1+e^{-x}}$ along with our training dataset below, we see that this is actually not a good fit. In particular, the model predictions for the fake accounts (i.e. those with y=0) will be terrible.
But if we could horizontally shift as well as horizontally stretch/compress this basic sigmoid function $S(x)$, we might be able to get a better fit.
# Data
sns.scatterplot(x='number_of_followers', y='y', data=df_train)
# Basic sigmoid function curve
# (np.exp overflows for the very negative x-values here, which triggers a
#  harmless RuntimeWarning; the corresponding sigmoid values are simply 0)
x = np.linspace(-1600, 1600, 100)
p = 1 / (1 + np.exp(-x))
plt.plot(x, p, color='orange', label='y=1/(1+e^(-x))')
plt.legend()
plt.show()

Best Fit Sigmoid Function
Let's remember from our algebra classes how we can modify the shape of any given function $S(x)$ as follows.
- **Multiplying** the variable $x$ on the inside of the function by some coefficient $\hat{\beta}_1$ **horizontally stretches/compresses** the original function shape.
- **Adding** some constant $\hat{\beta}_0$ on the inside of the function **horizontally shifts** the original function shape.
$S(\hat{\beta}_0+\hat{\beta}_1x)=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}$
So for instance, suppose we let our stretch parameter be $\hat{\beta}_1=0.006$ and our shift parameter be $\hat{\beta}_0=-1.58$. Then this new stretched and shifted sigmoid curve $S(-1.58+0.006x)=\frac{1}{1+e^{-(-1.58+0.006x)}}$ is actually going to achieve a better overall fit of our data as we can see below.
# Data
sns.scatterplot(x='number_of_followers', y='y', data=df_train)
# Basic sigmoid curve
p = 1 / (1 + np.exp(-x))
plt.plot(x, p, color='orange', label='y=1/(1+e^(-x))')
# Horizontally shifted and stretched sigmoid curve
p_new = 1 / (1 + np.exp(-(-1.58 + 0.006*x)))
plt.plot(x, p_new, color='green', label='y=1/(1+e^(-(-1.58+0.006x)))')
plt.legend(bbox_to_anchor=(1, 1))
plt.show()

We can automatically overlay a "best fit sigmoid curve" onto our dataset by using the sns.regplot() function in the same way that we use the sns.lmplot() function to plot a "best fit linear curve". The only additional parameter we need to stipulate in sns.regplot() is logistic=True.
sns.regplot(x='number_of_followers', y='y',
            data=df_train, ci=False,
            logistic=True)
plt.show()

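Behind the scenes, sns.regplot() with logistic=True uses statsmodels to fit a logistic regression model to produce this curve. As a rough preview of the model fitting we'll do ourselves in section 3.3, here is a minimal sketch of one way to obtain the intercept and slope of such a best fit curve with the logit() function (assuming statsmodels.formula.api is imported as smf, as in the earlier ols() call).
# Preview: fit the best fit sigmoid (logistic regression) curve ourselves
log_mod = smf.logit(formula='y~number_of_followers', data=df_train).fit()
# The fitted intercept and slope of the sigmoid curve
log_mod.params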
Predictive Probability and Logistic Regression
Notice that even with our "best fit sigmoid" function we still end up with predicted values with $0<\hat{y}<1$.
$\hat{y}=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}$
So we have yet to answer the question: how should we interpret a predicted value that is between 0 and 1?
More formally, we actually formulate and represent this "best fit sigmoid function" as the equation shown below with the variable $\hat{p}$ on the left hand side. We call this a simple logistic regression curve. Much like a simple linear regression, this is a simple logistic regression because it only has one explanatory variable $x$.
$P(\hat{Y}=1)=\hat{p}=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}$
We call this $\hat{p}$ the predictive probability: it represents the probability that the response variable $y$ corresponding to the given explanatory variable(s) is equal to 1 (or in other words, $P(\hat{Y}=1)$).
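For instance, using the illustrative (not best fit) values $\hat{\beta}_0=-1.58$ and $\hat{\beta}_1=0.006$ from earlier, a hypothetical account with 500 followers would get a predictive probability of
$$\hat{p}=\frac{1}{1+e^{-(-1.58+0.006\cdot 500)}}=\frac{1}{1+e^{-1.42}}\approx 0.81$$
that is, we would estimate roughly an 81% chance that this account is real.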
Logistic Regression Model Representations
With some algebraic manipulation, there are actually many ways that we can choose to represent a logistic regression model. Below are three of the most common ways, followed by a short derivation showing how the second and third follow from the first.
- $\hat{p}=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}$
- $\frac{\hat{p}}{1-\hat{p}}=e^{\hat{\beta}_0+\hat{\beta}_1x}$
- $log(\frac{\hat{p}}{1-\hat{p}})=\hat{\beta}_0+\hat{\beta}_1x$
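Starting from the first representation, note that
$$1-\hat{p}=1-\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}=\frac{e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}$$
so dividing $\hat{p}$ by $1-\hat{p}$ gives the second representation
$$\frac{\hat{p}}{1-\hat{p}}=\frac{1}{e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}=e^{\hat{\beta}_0+\hat{\beta}_1x}$$
and taking the (natural) log of both sides gives the third representation $log(\frac{\hat{p}}{1-\hat{p}})=\hat{\beta}_0+\hat{\beta}_1x$.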
Note: Traditionally in statistics, when the $log()$ function is referenced, unless stated otherwise we are usually talking about the natural log $ln()$.
The third representation above is great because the right hand side now looks exactly like the right hand side of a linear regression curve.
But now this begs the question: what exactly does $ln(\frac{\hat{p}}{1-\hat{p}})$ or $\frac{\hat{p}}{1-\hat{p}}$ represent?