Multiple Logistic Regression


Similar to linear regression, we can include multiple explanatory variables in our logistic regression model, creating a multiple logistic regression model.

Multiple Logistic Regression Model (Equivalent Formulations)
  • $\hat{p}=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x_1+...+\hat{\beta}_px_p)}}$
  • $\hat{odds}=\frac{\hat{p}}{1-\hat{p}}=e^{\hat{\beta}_0+\hat{\beta}_1x_1+...+\hat{\beta}_px_p}$
  • $log(\hat{odds})=log(\frac{\hat{p}}{1-\hat{p}})=\hat{\beta}_0+\hat{\beta}_1x_1+...+\hat{\beta}_px_p$
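To see that these formulations really are interchangeable, we can compute each one from the same linear predictor. Below is a minimal sketch in Python; the coefficient and x values are made up purely for illustration:

import numpy as np

# Illustrative (made-up) fitted coefficients: beta_0, beta_1, beta_2
beta = np.array([-1.0, 0.8, 0.05])
# One observation's values: 1 for the intercept, then x_1 and x_2
x = np.array([1.0, 2.0, 30.0])

log_odds = beta @ x                    # beta_0 + beta_1*x_1 + beta_2*x_2
odds = np.exp(log_odds)                # odds = e^(log-odds)
p_hat = 1 / (1 + np.exp(-log_odds))    # logistic function gives the probability

# The same probability is recovered from the odds: p = odds / (1 + odds)
assert np.isclose(p_hat, odds / (1 + odds))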

Numerical and Categorical Explanatory Variables Allowed

We can incorporate both numerical and categorical explanatory variables into the model. Similar to linear regression, if we are dealing with a categorical explanatory variable with $w$ levels, then we create $w-1$ 0/1 indicator variables that correspond to $w-1$ of these levels.
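The formula interface in statsmodels handles this encoding for us automatically, but the idea can be sketched with pandas. The small data frame below is hypothetical; with $w = 2$ levels ('no' and 'yes'), we get $w-1 = 1$ indicator variable, and the dropped level becomes the reference level:

import pandas as pd

# Hypothetical categorical column with w = 2 levels: 'no' and 'yes'
demo = pd.DataFrame({'has_a_profile_pic': ['yes', 'no', 'yes', 'yes']})

# Create w - 1 = 1 indicator (0/1) variable; drop_first drops the reference level ('no')
indicators = pd.get_dummies(demo['has_a_profile_pic'],
                            prefix='has_a_profile_pic', drop_first=True, dtype=int)
print(indicators)
#    has_a_profile_pic_yes
# 0                      1
# 1                      0
# 2                      1
# 3                      1

In the model output below, statsmodels labels this same kind of indicator has_a_profile_pic[T.yes], with 'no' as the reference level.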

Full Model

Circling back to our Instagram classifier research goal, let's fit what we'll call our full model with the training dataset. That is, let's fit a logistic regression model that predicts the probability that an account is real using all 6 of our available explanatory variables:

  • has_a_profile_pic
  • number_of_words_in_name
  • num_characters_in_bio
  • number_of_posts
  • number_of_followers
  • number_of_follows

Like with the smf.ols() linear regression function, we can include multiple explanatory variables in the smf.logit() logistic regression function by joining the respective column names with + on the right-hand side of the formula string.

log_mod_full = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio+number_of_posts+number_of_followers+number_of_follows', data=df_train).fit()
log_mod_full.summary()

Warning: Maximum number of iterations has been exceeded.
Current function value: 0.123632
Iterations: 35

                           Logit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                   84
Model:                          Logit   Df Residuals:                       77
Method:                           MLE   Df Model:                            6
Date:                Fri, 28 Jul 2023   Pseudo R-squ.:                  0.8198
Time:                        19:14:29   Log-Likelihood:                -10.385
converged:                      False   LL-Null:                       -57.628
Covariance Type:            nonrobust   LLR p-value:                 3.538e-18
============================================================================================
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept                  -37.4658   4.38e+04     -0.001      0.999   -8.58e+04    8.57e+04
has_a_profile_pic[T.yes]    30.7281   4.38e+04      0.001      0.999   -8.57e+04    8.58e+04
number_of_words_in_name      2.5983      1.203      2.161      0.031       0.241       4.955
num_characters_in_bio        0.0874      0.053      1.646      0.100      -0.017       0.192
number_of_posts             -0.0060      0.014     -0.426      0.670      -0.033       0.021
number_of_followers          0.0252      0.009      2.779      0.005       0.007       0.043
number_of_follows           -0.0046      0.002     -2.142      0.032      -0.009      -0.000
============================================================================================


Possibly complete quasi-separation: A fraction 0.44 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

Thus, our full logistic regression model fit with the training dataset is as follows, written in each of the three equivalent formulations.

Multiple Logistic Regression Model

Predicting Probability the Account is Real

\begin{align*} \hat{p} = \frac{1}{1 + \exp\left[-\left(\begin{aligned} &-37.47 \\ &+ 30.73(\,\text{has a profile pic}[T.yes]) \\ &+ 2.60(\,\text{number of words in name}) \\ &+ 0.087(\,\text{num characters in bio}) \\ &- 0.0060(\,\text{number of posts}) \\ &+ 0.025(\,\text{number of followers}) \\ &- 0.0046(\,\text{number of follows}) \end{aligned}\right)\right]} \end{align*}



Predicting Odds the Account is Real

\begin{align*} \hat{odds} = \frac{\hat{p}}{1-\hat{p}} =\exp\left(\begin{aligned} &-37.47 \\ &+ 30.73(\,\text{has a profile pic}[T.yes]) \\ &+ 2.60(\,\text{number of words in name}) \\ &+ 0.087(\,\text{num characters in bio}) \\ &- 0.0060(\,\text{number of posts}) \\ &+ 0.025(\,\text{number of followers}) \\ &- 0.0046(\,\text{number of follows}) \end{aligned}\right) \end{align*}



Predicting Log-Odds the Account is Real

\begin{align*} log(\hat{odds})=log\left(\frac{\hat{p}}{1-\hat{p}}\right) =\left(\begin{aligned} &-37.47 \\ &+ 30.73(\,\text{has a profile pic}[T.yes]) \\ &+ 2.60(\,\text{number of words in name}) \\ &+ 0.087(\,\text{num characters in bio}) \\ &- 0.0060(\,\text{number of posts}) \\ &+ 0.025(\,\text{number of followers}) \\ &- 0.0046(\,\text{number of follows}) \end{aligned}\right) \end{align*}
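Because the second formulation writes the estimated odds as $e^{\hat{\beta}_0+\hat{\beta}_1x_1+...+\hat{\beta}_px_p}$, exponentiating the fitted coefficients converts them from the log-odds scale into multiplicative effects on the odds. A short sketch using the fitted results object from above:

import numpy as np

# Coefficients live on the log-odds scale; e^coef is the multiplicative change
# in the estimated odds for a one-unit increase in that explanatory variable,
# holding the other variables fixed
odds_multipliers = np.exp(log_mod_full.params)
print(odds_multipliers)

# For example, one additional follower multiplies the estimated odds of being
# real by about e^0.0252, which is roughly 1.026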

By inspecting our one indicator variable, has_a_profile_pic[T.yes], we can see that the model chose 'no' as the reference level, since there is no indicator variable for the 'no' level.

So we can see that: $\text{has\_a\_profile\_pic}_{yes} = \begin{cases} 1 & \text{if } yes \\ 0 & \text{if } no \end{cases}$
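Finally, once the model is fit, the results object can produce predicted probabilities directly from a data frame that uses these same column names; the formula interface applies the yes/no indicator encoding for us. The account values below are hypothetical:

import pandas as pd

# A hypothetical new account; columns must match the training data
new_account = pd.DataFrame({
    'has_a_profile_pic':       ['yes'],
    'number_of_words_in_name': [2],
    'num_characters_in_bio':   [80],
    'number_of_posts':         [150],
    'number_of_followers':     [400],
    'number_of_follows':       [350],
})

# predict() returns p-hat, the estimated probability the account is real
p_hat = log_mod_full.predict(new_account)
print(p_hat)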