Multiple Logistic Regression
As with linear regression, we can include multiple explanatory variables in our logistic regression model, creating a multiple logistic regression model. This model can be written in three equivalent ways:
- $\hat{p}=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x_1+\cdots+\hat{\beta}_px_p)}}$
- $\widehat{\text{odds}}=\frac{\hat{p}}{1-\hat{p}}=e^{\hat{\beta}_0+\hat{\beta}_1x_1+\cdots+\hat{\beta}_px_p}$
- $\log(\widehat{\text{odds}})=\log\left(\frac{\hat{p}}{1-\hat{p}}\right)=\hat{\beta}_0+\hat{\beta}_1x_1+\cdots+\hat{\beta}_px_p$
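To see concretely that these three formulations agree, here is a minimal sketch with arbitrary made-up coefficients and predictor values (none of these numbers come from a fitted model):

```python
import numpy as np

# Arbitrary illustrative values: beta_0, beta_1, beta_2 and x_1, x_2
beta_hat = np.array([-1.0, 0.5, 2.0])
x = np.array([1.0, 3.0, -0.4])  # leading 1 multiplies the intercept

linear_predictor = beta_hat @ x  # beta_0 + beta_1*x_1 + beta_2*x_2

p_hat = 1 / (1 + np.exp(-linear_predictor))  # probability formulation
odds_hat = p_hat / (1 - p_hat)               # odds formulation
log_odds_hat = np.log(odds_hat)              # log-odds formulation

# All three are consistent: the odds equal exp(linear predictor),
# and the log-odds recover the linear predictor itself
assert np.isclose(odds_hat, np.exp(linear_predictor))
assert np.isclose(log_odds_hat, linear_predictor)
```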
Numerical and Categorical Explanatory Variables Allowed
We can incorporate both numerical and categorical explanatory variables into the model. As with linear regression, if we are dealing with a categorical explanatory variable with $w$ levels, then we create $w-1$ 0/1 indicator variables, one for each level other than the reference level (see the sketch below).
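For example, here is a minimal sketch of this encoding using pandas; the account_type column and its three levels are hypothetical, not variables from our Instagram dataset:

```python
import pandas as pd

# Hypothetical categorical variable with w = 3 levels
df = pd.DataFrame({'account_type': ['personal', 'business', 'creator', 'personal']})

# drop_first=True keeps w - 1 = 2 indicator columns; the dropped level
# ('business', the first alphabetically) serves as the reference level
indicators = pd.get_dummies(df['account_type'], prefix='account_type', drop_first=True)
print(indicators.astype(int))  # columns: account_type_creator, account_type_personal
```

When we fit with a formula string (as with smf.logit() below), this indicator coding happens automatically, which is why the summary output will show a term like has_a_profile_pic[T.yes] rather than the raw column.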
Full Model
Circling back to our Instagram classifier research goal, let's fit what we'll call our full model with the training dataset. That is, let's fit a logistic regression model that predicts the probability that an account is real using all 6 of our available explanatory variables:
- has_a_profile_pic
- number_of_words_in_name
- num_characters_in_bio
- number_of_posts
- number_of_followers
- number_of_follows
Like with the smf.ols() linear regression function, we can add multiple explanatory variables to the smf.logit() logistic regression function by joining the respective column names with + on the right-hand side of the formula string.

```python
import statsmodels.formula.api as smf

log_mod_full = smf.logit(formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio+number_of_posts+number_of_followers+number_of_follows', data=df_train).fit()
log_mod_full.summary()
```
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.123632
Iterations: 35
| Dep. Variable: | y | No. Observations: | 84 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 77 |
| Method: | MLE | Df Model: | 6 |
| Date: | Fri, 28 Jul 2023 | Pseudo R-squ.: | 0.8198 |
| Time: | 19:14:29 | Log-Likelihood: | -10.385 |
| converged: | False | LL-Null: | -57.628 |
| Covariance Type: | nonrobust | LLR p-value: | 3.538e-18 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -37.4658 | 4.38e+04 | -0.001 | 0.999 | -8.58e+04 | 8.57e+04 |
| has_a_profile_pic[T.yes] | 30.7281 | 4.38e+04 | 0.001 | 0.999 | -8.57e+04 | 8.58e+04 |
| number_of_words_in_name | 2.5983 | 1.203 | 2.161 | 0.031 | 0.241 | 4.955 |
| num_characters_in_bio | 0.0874 | 0.053 | 1.646 | 0.100 | -0.017 | 0.192 |
| number_of_posts | -0.0060 | 0.014 | -0.426 | 0.670 | -0.033 | 0.021 |
| number_of_followers | 0.0252 | 0.009 | 2.779 | 0.005 | 0.007 | 0.043 |
| number_of_follows | -0.0046 | 0.002 | -2.142 | 0.032 | -0.009 | -0.000 |
Possibly complete quasi-separation: A fraction 0.44 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
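The converged: False line and the quasi-separation note tell us the optimizer never settled: some combination of the explanatory variables perfectly predicts a portion of the training accounts, so certain coefficients (note the enormous standard errors on the intercept and has_a_profile_pic[T.yes]) are not well identified, and simply raising maxiter in .fit() will not fix this. If stabler estimates were needed, one common mitigation (not something pursued in this analysis) is a penalized fit. Below is a minimal sketch assuming the same df_train; the alpha penalty weight is an arbitrary illustrative choice:

```python
# A hedged sketch, not part of the original analysis: an L1-penalized fit
# can keep coefficients finite under quasi-separation. The penalty weight
# alpha=1.0 is an arbitrary illustrative choice.
import statsmodels.formula.api as smf

log_mod_l1 = smf.logit(
    formula='y~has_a_profile_pic+number_of_words_in_name+num_characters_in_bio'
            '+number_of_posts+number_of_followers+number_of_follows',
    data=df_train
).fit_regularized(method='l1', alpha=1.0)
```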
Thus our full logistic regression model trained with the training dataset is as follows, using the three equivalent formulations (plugging the fitted coefficients from the summary output above into the general equations from the start of this section).

Multiple Logistic Regression Model

Predicting Probability the Account is Real

$\hat{p}=\frac{1}{1+e^{-(-37.4658+30.7281\,\text{has\_a\_profile\_pic}_{yes}+2.5983\,\text{number\_of\_words\_in\_name}+0.0874\,\text{num\_characters\_in\_bio}-0.0060\,\text{number\_of\_posts}+0.0252\,\text{number\_of\_followers}-0.0046\,\text{number\_of\_follows})}}$

Predicting Odds the Account is Real

$\widehat{\text{odds}}=e^{-37.4658+30.7281\,\text{has\_a\_profile\_pic}_{yes}+2.5983\,\text{number\_of\_words\_in\_name}+0.0874\,\text{num\_characters\_in\_bio}-0.0060\,\text{number\_of\_posts}+0.0252\,\text{number\_of\_followers}-0.0046\,\text{number\_of\_follows}}$

Predicting Log-Odds the Account is Real

$\log(\widehat{\text{odds}})=-37.4658+30.7281\,\text{has\_a\_profile\_pic}_{yes}+2.5983\,\text{number\_of\_words\_in\_name}+0.0874\,\text{num\_characters\_in\_bio}-0.0060\,\text{number\_of\_posts}+0.0252\,\text{number\_of\_followers}-0.0046\,\text{number\_of\_follows}$
By inspecting our one indicator variable, has_a_profile_pic_{yes} (shown as has_a_profile_pic[T.yes] in the summary output), we can see that the model chose 'no' as the reference level, since there is no indicator variable for the 'no' level.
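Finally, we can use the fitted model object to generate predictions. A minimal sketch, assuming the log_mod_full and df_train objects from above (the .predict() method of a fitted statsmodels logit model returns predicted probabilities $\hat{p}$):

```python
import numpy as np

# Predicted probabilities p-hat that each training account is real
p_hat_train = log_mod_full.predict(df_train)

# The equivalent odds and log-odds formulations, computed from p-hat
odds_hat_train = p_hat_train / (1 - p_hat_train)
log_odds_hat_train = np.log(odds_hat_train)

# The fitted beta-hat coefficients, matching the summary table above
print(log_mod_full.params)
```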