Multiple Logistic Regression
As with linear regression, we can include multiple explanatory variables in our logistic regression model, creating a multiple logistic regression model. This model can be written in three equivalent ways:
- $\hat{p}=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x_1+\cdots+\hat{\beta}_px_p)}}$
- $\widehat{\text{odds}}=\frac{\hat{p}}{1-\hat{p}}=e^{\hat{\beta}_0+\hat{\beta}_1x_1+\cdots+\hat{\beta}_px_p}$
- $\log(\widehat{\text{odds}})=\log\left(\frac{\hat{p}}{1-\hat{p}}\right)=\hat{\beta}_0+\hat{\beta}_1x_1+\cdots+\hat{\beta}_px_p$
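To see concretely that these three formulations agree, here is a minimal sketch with arbitrary made-up coefficients and predictor values (none of these numbers come from a fitted model):

```python
import numpy as np

# Arbitrary illustrative values: beta_0, beta_1, beta_2 and x_1, x_2
beta_hat = np.array([-1.0, 0.5, 2.0])
x = np.array([1.0, 3.0, -0.4])  # leading 1 multiplies the intercept

linear_predictor = beta_hat @ x  # beta_0 + beta_1*x_1 + beta_2*x_2

p_hat = 1 / (1 + np.exp(-linear_predictor))  # probability formulation
odds_hat = p_hat / (1 - p_hat)               # odds formulation
log_odds_hat = np.log(odds_hat)              # log-odds formulation

# All three are consistent: the odds equal exp(linear predictor),
# and the log-odds recover the linear predictor itself
assert np.isclose(odds_hat, np.exp(linear_predictor))
assert np.isclose(log_odds_hat, linear_predictor)
```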
Numerical and Categorical Explanatory Variables Allowed
We can incorporate both numerical and categorical explanatory variables into the model. As with linear regression, if we are dealing with a categorical explanatory variable with $w$ levels, then we create $w-1$ 0/1 indicator variables, one for each level other than the reference level (see the sketch below).
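For example, here is a minimal sketch of this encoding using pandas; the account_type column and its three levels are hypothetical, not variables from our Instagram dataset:

```python
import pandas as pd

# Hypothetical categorical variable with w = 3 levels
df = pd.DataFrame({'account_type': ['personal', 'business', 'creator', 'personal']})

# drop_first=True keeps w - 1 = 2 indicator columns; the dropped level
# ('business', the first alphabetically) serves as the reference level
indicators = pd.get_dummies(df['account_type'], prefix='account_type', drop_first=True)
print(indicators.astype(int))  # columns: account_type_creator, account_type_personal
```

When we fit with a formula string (as with smf.logit() below), this indicator coding happens automatically, which is why the summary output will show a term like has_a_profile_pic[T.yes] rather than the raw column.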
Full Model
Circling back to our Instagram classifier research goal, let's fit what we'll call our full model with the training dataset. That is, let's fit a logistic regression model that predicts the probability that an account is real using all 6 of our available explanatory variables:
- has_a_profile_pic
- number_of_words_in_name
- num_characters_in_bio
- number_of_posts
- number_of_followers
- number_of_follows
Like with the smf.ols() linear regression function, we can add multiple explanatory variables to the smf.logit() logistic regression function by joining the respective column names with + on the right-hand side of the formula string.

```python
import statsmodels.formula.api as smf

log_mod_full = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio+number_of_posts+number_of_followers+number_of_follows', data=df_train).fit()
log_mod_full.summary()
```
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.123632
Iterations: 35
| Dep. Variable: | y | No. Observations: | 84 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 77 |
| Method: | MLE | Df Model: | 6 |
| Date: | Fri, 28 Jul 2023 | Pseudo R-squ.: | 0.8198 |
| Time: | 19:14:29 | Log-Likelihood: | -10.385 |
| converged: | False | LL-Null: | -57.628 |
| Covariance Type: | nonrobust | LLR p-value: | 3.538e-18 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -37.4658 | 4.38e+04 | -0.001 | 0.999 | -8.58e+04 | 8.57e+04 |
| has_a_profile_pic[T.yes] | 30.7281 | 4.38e+04 | 0.001 | 0.999 | -8.57e+04 | 8.58e+04 |
| number_of_words_in_name | 2.5983 | 1.203 | 2.161 | 0.031 | 0.241 | 4.955 |
| num_characters_in_bio | 0.0874 | 0.053 | 1.646 | 0.100 | -0.017 | 0.192 |
| number_of_posts | -0.0060 | 0.014 | -0.426 | 0.670 | -0.033 | 0.021 |
| number_of_followers | 0.0252 | 0.009 | 2.779 | 0.005 | 0.007 | 0.043 |
| number_of_follows | -0.0046 | 0.002 | -2.142 | 0.032 | -0.009 | -0.000 |
Possibly complete quasi-separation: A fraction 0.44 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
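The converged: False line and the quasi-separation note tell us the optimizer never settled: some combination of the explanatory variables perfectly predicts a portion of the training accounts, so certain coefficients (note the enormous standard errors on the intercept and has_a_profile_pic[T.yes]) are not well identified, and simply raising maxiter in .fit() will not fix this. If stabler estimates were needed, one common mitigation (not something pursued in this analysis) is a penalized fit. Below is a minimal sketch assuming the same df_train; the alpha penalty weight is an arbitrary illustrative choice:

```python
# A hedged sketch, not part of the original analysis: an L1-penalized fit
# can keep coefficients finite under quasi-separation. The penalty weight
# alpha=1.0 is an arbitrary illustrative choice.
import statsmodels.formula.api as smf

log_mod_l1 = smf.logit(
    formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio'
            '+number_of_posts+number_of_followers+number_of_follows',
    data=df_train
).fit_regularized(method='l1', alpha=1.0)
```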
Thus our full logistic regression model trained with the training dataset is as follows, using the three equivalent formulations (plugging the fitted coefficients from the summary output above into the general equations from the start of this section).

Multiple Logistic Regression Model

Predicting Probability the Account is Real

$\hat{p}=\frac{1}{1+e^{-(-37.4658+30.7281\,\text{has\_a\_profile\_pic}_{yes}+2.5983\,\text{number\_of\_words\_in\_name}+0.0874\,\text{num\_characters\_in\_bio}-0.0060\,\text{number\_of\_posts}+0.0252\,\text{number\_of\_followers}-0.0046\,\text{number\_of\_follows})}}$

Predicting Odds the Account is Real

$\widehat{\text{odds}}=e^{-37.4658+30.7281\,\text{has\_a\_profile\_pic}_{yes}+2.5983\,\text{number\_of\_words\_in\_name}+0.0874\,\text{num\_characters\_in\_bio}-0.0060\,\text{number\_of\_posts}+0.0252\,\text{number\_of\_followers}-0.0046\,\text{number\_of\_follows}}$

Predicting Log-Odds the Account is Real

$\log(\widehat{\text{odds}})=-37.4658+30.7281\,\text{has\_a\_profile\_pic}_{yes}+2.5983\,\text{number\_of\_words\_in\_name}+0.0874\,\text{num\_characters\_in\_bio}-0.0060\,\text{number\_of\_posts}+0.0252\,\text{number\_of\_followers}-0.0046\,\text{number\_of\_follows}$
By inspecting our one indicator variable, has_a_profile_pic_{yes} (shown as has_a_profile_pic[T.yes] in the summary output), we can see that the model chose 'no' as the reference level, since there is no indicator variable for the 'no' level.
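Finally, we can use the fitted model object to generate predictions. A minimal sketch, assuming the log_mod_full and df_train objects from above (the .predict() method of a fitted statsmodels logit model returns predicted probabilities $\hat{p}$):

```python
import numpy as np

# Predicted probabilities p-hat that each training account is real
p_hat_train = log_mod_full.predict(df_train)

# The equivalent odds and log-odds formulations, computed from p-hat
odds_hat_train = p_hat_train / (1 - p_hat_train)
log_odds_hat_train = np.log(odds_hat_train)

# The fitted beta-hat coefficients, matching the summary table above
print(log_mod_full.params)
```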