Fitting a Logistic Regression Model


Now that we understand all of the components of a logistic regression model, let's next discuss how we go about finding the optimal intercept $\hat{\beta}_0$ and slope $\hat{\beta}_1$ values that give us the "best fit" of our training dataset.

Simple Logistic Regression Model

$log(\hat{odds})=log(\frac{\hat{p}}{1-\hat{p}})=\hat{\beta}_0+\hat{\beta}_1x$

What objective function to optimize?

When it comes to picking the $\hat{\beta}_0$ and $\hat{\beta}_1$ that give us the "best fit" of the data, we first need to decide which objective function $f(\hat{\beta}_0,\hat{\beta}_1)$ we want to optimize. In other words, what function should measure the "fit" of the model?

Why not use the Sum of Squared Errors (SSE)?

Recall that if we were fitting a linear regression model $\hat{y}=\hat{\beta}_0+\hat{\beta}_1x$, then we would seek to find the optimal $\hat{\beta}_0^*$ and $\hat{\beta}_1^*$ values that minimize the sum of squared errors.
$SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \left(y_i - (\hat{\beta}^*_0 + \hat{\beta}^*_1 x_{1,i})\right)^2$

However, remember that in a logistic regression model $\hat{p}=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}$, we are not predicting a response variable value itself $\hat{y}$. Instead, we are predicting $\hat{p}$, the probability that our response variable value $y$ is equal to 1.

We could *in theory* use $\hat{p}_i$ instead of $\hat{y}_i$ in the SSE equation: $SSE=\sum_{i=1}^n(y_i - \hat{p}_i)^2=\sum_{i=1}^n\left(y_i - \frac{1}{1+e^{-(\hat{\beta}^*_0+\hat{\beta}^*_1x_{1,i})}}\right)^2$.

However, measuring the distance between two different types of quantities (a predicted probability and a 0/1 response variable value) in $(y_i - \hat{p}_i)^2$ is not as meaningful or easy to interpret.
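To make this concrete, here is a minimal sketch of what such a probability-based SSE would look like; the candidate coefficients and the tiny 0/1 dataset below are made up purely for illustration.

import numpy as np

# Made-up candidate coefficients and a made-up tiny 0/1 dataset (illustration only)
b0_hat, b1_hat = -1, 2
x = np.array([0.5, 1.0, 2.0])
y = np.array([0, 1, 1])

# Predicted probabilities from the candidate logistic regression model
p_hat = 1 / (1 + np.exp(-(b0_hat + b1_hat * x)))

# "SSE" computed with probabilities in place of predicted response values
np.sum((y - p_hat) ** 2)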

Maximum Likelihood Estimation

Rather than trying to minimize the SSE, we instead use a technique known as maximum likelihood estimation to help us select the optimal $\hat{\beta}_0$ and $\hat{\beta}_1$ values in our logistic regression model.

Breaking this name down, we seek to find the optimal estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ that maximize the likelihood (i.e., the probability) of observing the response variable values in the training dataset under a certain set of assumptions.

Probability of a Single Response Variable Value $y_i$

Maximum likelihood estimation specifically involves calculating the following conditional probability that corresponds to each observation $(x_i, y_i)$ in the training dataset:

$P(Y=y_i|\beta_0, \beta_1, x_i)$

What does this single probability mean?

Suppose we randomly select an observation $(x,y)$ from our population. We assume that the $x$ and $y$ values of this population follow this logistic regression model relationship $\hat{p}=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}$ which has a specific intercept and slope.

Under this model assumption, and assuming that the explanatory variable value that we drew was $x_i$ (i.e., the GIVENS of the conditional probability), we calculate the probability that the response variable value is actually our observed $y_i$ value (rather than its opposite).

How to calculate this single probability?

Case 1: $y_i$ is actually 1

Recall that a given logistic regression model predicts $\hat{p}=P(Y=1|\beta_0, \beta_1, x)$.

So if a given $y_i$ observation is actually 1, then

$P(Y=y_i|\beta_0, \beta_1, x_i) = P(Y=1|\beta_0, \beta_1, x_i)=\hat{p}_i=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x_i)}}$

Case 2: $y_i$ is actually 0

Similarly, if a given $y_i$ observation is actually 0, then

$P(Y=y_i|\beta_0, \beta_1, x_i) = P(Y=0|\beta_0, \beta_1, x_i) = 1- P(Y=1|\beta_0, \beta_1, x_i)=1-\hat{p}_i=1-\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x_i)}}$

Example: Suppose we are considering the following potential logistic regression model $\hat{p}=\frac{1}{1+e^{-(2+3x)}}$
and our training dataset is comprised of only two observations:

  • $(x_1, y_1) = (0.5,1)$ and
  • $(x_2, y_2) = (0.25,0)$.

Then the probability that our first randomly selected response variable would indeed be equal to $y_1=1$ is:
$P(Y=1|\beta_0=2, \beta_1=3, x_1=0.5) =\frac{1}{1+e^{-(2+3(0.5))}} = 0.97$
This is quite a high probability! This indicates that this hypothetical model would fit the first data point pretty well! We are likely to have observed the actual response variable value that we did with $y_1=1$.

import numpy as np

p_1 = 1/(1+np.exp(-(2+3*0.5)))  # predicted probability that y_1 = 1
p_1

0.9706877692486436

Furthermore, the probability that our second randomly selected response variable would indeed be equal to $y_2=0$ is:
$P(Y=0|\beta_0=2, \beta_1=3, x_2=0.25) =1-\frac{1}{1+e^{-(2+3(0.25))}} = 1-0.94 =0.06$
This is quite a low probability! This indicates that this hypothetical model would not fit the second data point well! We are unlikely to have observed the actual response variable value that we did with $y_2=0$.

p_2 = 1/(1+np.exp(-(2+3*0.25)))  # predicted probability that y_2 = 1, so 1 - p_2 is the probability that y_2 = 0
p_2

0.9399133498259924

Representing this Probability in a Compact Way

So to summarize, we calculate the individual probability that a randomly selected response variable value is equal to $y_i$ (given $\beta_0, \beta_1, x_i$) as follows.

$
P(Y=y_i|\beta_0, \beta_1, x_i) = \begin{cases}
\hat{p}_i & \text{if } y_i=1 \\
1-\hat{p}_i & \text{if } y_i=0
\end{cases}
$

Using a piecewise-defined function is going to be mathematically messy and unhelpful in this analysis. So let's try to find a way to represent the piecewise-defined function above as an equivalent single expression.

We can actually do this with the following equation.

$P(Y=y_i|\beta_0, \beta_1, x_i) = \hat{p}_i^{y_i}(1-\hat{p}_i)^{(1-y_i)}$

A quick check of both scenarios for $y_i=0,1$ can prove to us that:

  • $P(Y=1|\beta_0, \beta_1, x_i) = \hat{p}_i^{1}(1-\hat{p}_i)^{(1-1)}=\hat{p}_i$
  • $P(Y=0|\beta_0, \beta_1, x_i) = \hat{p}_i^{0}(1-\hat{p}_i)^{(1-0)}=1-\hat{p}_i$
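As a quick sanity check, here is a minimal sketch of this compact formula in Python; the helper function name is made up, and it simply reuses the coefficients and data points from the example above.

import numpy as np

def single_obs_probability(y_i, x_i, b0, b1):
    # P(Y = y_i | b0, b1, x_i) written compactly as p^y * (1 - p)^(1 - y)
    p_i = 1 / (1 + np.exp(-(b0 + b1 * x_i)))
    return p_i**y_i * (1 - p_i)**(1 - y_i)

single_obs_probability(y_i=1, x_i=0.5, b0=2, b1=3)   # roughly 0.97, matching the example above
single_obs_probability(y_i=0, x_i=0.25, b0=2, b1=3)  # roughly 0.06, matching the example above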

Probability of All Response Variable Values $y_1,y_2,...,y_n$

We can also calculate the probability of randomly selecting ALL of our observed response variable values $y_1,y_2,...,y_n$ in the training dataset, given our explanatory variable values $x_1,x_2,...,x_n$ and a population that follows a given logistic regression model relationship $\hat{p}=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}$ between $x$ and $y$.

We can represent this probability as follows.

$LF(\beta_0, \beta_1)=P(Y_1=y_1\ AND\ Y_2=y_2\ ...AND\ Y_n=y_n |\beta_0, \beta_1, x_1,x_2,...,x_n)$

However, in a logistic regression model, we always assume that each of our $n$ sampled response variable outcomes $y_1,y_2,...,y_n$ is independent of the others. Using this fact, we can represent the probability above as follows.

Likelihood Function
\begin{align*} LF(\beta_0, \beta_1)&=P(Y=y_1|\beta_0, \beta_1, x_1)\cdot P(Y=y_2|\beta_0, \beta_1, x_2)\cdot...\cdot P(Y=y_n|\beta_0, \beta_1, x_n) \\ &=[\hat{p}_1^{y_1}(1-\hat{p}_1)^{(1-y_1)}]\cdot [\hat{p}_2^{y_2}(1-\hat{p}_2)^{(1-y_2)}]\cdot...\cdot [\hat{p}_n^{y_n}(1-\hat{p}_n)^{(1-y_n)}] \end{align*}
We call this the **likelihood function** of the logistic regression model $\hat{p}=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}$ and the training dataset $(x_1,y_1),...,(x_n,y_n)$.
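Here is a minimal sketch of this likelihood function in Python, assuming the training data are held in NumPy arrays x and y; the function name is made up for illustration.

import numpy as np

def likelihood(b0, b1, x, y):
    # LF(b0, b1): the product of p_i^y_i * (1 - p_i)^(1 - y_i) over all observations
    p = 1 / (1 + np.exp(-(b0 + b1 * x)))
    return np.prod(p**y * (1 - p)**(1 - y))

# For the two-observation training dataset in the example below:
# likelihood(2, 3, np.array([0.5, 0.25]), np.array([1, 0])) is roughly 0.058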

Example: Let's refer back to our logistic regression model $\hat{p}=\frac{1}{1+e^{-(2+3x)}}$
and our training dataset from before

  • $(x_1, y_1) = (0.5,1)$ and
  • $(x_2, y_2) = (0.25,0)$,

The likelihood function value for this model and this dataset is:

\begin{align*} P(Y_1=1|\beta_0=2, \beta_1=3, x_1=0.5)\cdot P(Y_2=0|\beta_0=2, \beta_1=3, x_2=0.25) &= (0.97)\cdot (0.06) \\ &= 0.058. \end{align*}

This is quite a low probability! This indicates that this hypothetical model would not fit the whole training dataset well in general! We are unlikely to have observed ALL of the actual response variable values that we did with $y_1=1$ and $y_2=0$.

p_1*(1-p_2)  # likelihood function value for this model and training dataset

0.058325376419031065

Maximizing the Likelihood Function

Thus, technically when fitting a logistic regression model, we seek to find the optimal values of $\hat{\beta}_0,\hat{\beta}_1$ that maximize this likelihood function probability.

Maximizing the Likelihood Function
\begin{align*} \max_{\hat{\beta}_0,\hat{\beta}_1} LF(\beta_0, \beta_1) &= \max_{\hat{\beta}_0,\hat{\beta}_1} P(Y=y_1|\beta_0, \beta_1, x_1)\cdot\ldots\cdot P(Y=y_n|\beta_0, \beta_1, x_n) \\ &= \max_{\hat{\beta}_0,\hat{\beta}_1} [\hat{p}_1^{y_1}(1-\hat{p}_1)^{(1-y_1)}]\cdot [\hat{p}_2^{y_2}(1-\hat{p}_2)^{(1-y_2)}]\cdot...\cdot [\hat{p}_n^{y_n}(1-\hat{p}_n)^{(1-y_n)}] \end{align*}

We call the task that we just set up above an optimization problem. An optimization problem always comes with:

  1. an objective function that we would like to optimize ($LF(\beta_0, \beta_1)$ in our case),
  2. a stipulation as to whether we intend to maximize or minimize this objective function (maximize in our case), and
  3. a set of what we call decision variables, which are the variables we are trying to select to optimize our objective function ($\hat{\beta}_0,\hat{\beta}_1$ in our case).

Like when we fit our linear regression model, we would theoretically use multivariate calculus techniques to find these optimal values of $\hat{\beta}_0,\hat{\beta}_1$. However, because our likelihood function is a product of so many terms, the calculus task (i.e., taking derivatives) of finding these optimal $\hat{\beta}_0,\hat{\beta}_1$ values becomes computationally difficult.

Maximizing the Log-Likelihood Function

So what we do instead is solve another optimization problem that is equivalent to maximizing our likelihood function $LF(\beta_0, \beta_1)$. When two optimization problems are equivalent, their optimal decision variable values ($\beta_0, \beta_1$ in our case) will be the same.

Rather than maximizing the likelihood function $LF(\beta_0, \beta_1)$, we instead maximize the log of the likelihood function which we call the log-likelihood function $LLF(\beta_0, \beta_1)$.

Maximizing the Log-Likelihood Function
\begin{align*} \max_{\hat{\beta}_0,\hat{\beta}_1} LLF(\beta_0, \beta_1) &= \max_{\hat{\beta}_0,\hat{\beta}_1} log\left( [\hat{p}_1^{y_1}(1-\hat{p}_1)^{(1-y_1)}]\cdot [\hat{p}_2^{y_2}(1-\hat{p}_2)^{(1-y_2)}]\cdot...\cdot [\hat{p}_n^{y_n}(1-\hat{p}_n)^{(1-y_n)}] \right)\\ &= \max_{\hat{\beta}_0,\hat{\beta}_1} \left[y_1log(\hat{p}_1)+(1-y_1)log(1-\hat{p}_1)\right]+ \left[y_2log(\hat{p}_2)+(1-y_2)log(1-\hat{p}_2)\right]+...+ \left[y_nlog(\hat{p}_n)+(1-y_n)log(1-\hat{p}_n)\right] \end{align*}

By using some of the log properties that we learned in algebra, we obtain a more mathematically convenient equivalent expression of $LLF(\beta_0, \beta_1)$ in the second line of the equation above, which is a sum of simple terms.

General Logarithm Properties

  • $log(a\cdot b) = log(a)+log(b)$
  • $log(a^c) = c\cdot log(a)$

Thus, we can more easily use calculus-based techniques to find the values of $\beta_0, \beta_1$ that maximize $LLF(\beta_0, \beta_1)$. And we then automatically know that these optimal $\beta_0, \beta_1$ values maximize $LF(\beta_0, \beta_1)$ as well.
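To see this idea in action, here is a minimal sketch that maximizes the log-likelihood numerically with scipy.optimize.minimize on a small synthetic dataset. The data-generating coefficients, sample size, and clipping constant below are assumptions made purely for illustration, and this is not how statsmodels is implemented internally.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic data generated from a "true" model with beta_0 = -1 and beta_1 = 2 (illustrative values)
x = rng.uniform(-3, 3, size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(-1 + 2 * x))))

def negative_llf(betas):
    # Negative log-likelihood; minimizing this maximizes LLF(beta_0, beta_1)
    b0, b1 = betas
    p_hat = 1 / (1 + np.exp(-(b0 + b1 * x)))
    p_hat = np.clip(p_hat, 1e-12, 1 - 1e-12)  # avoid log(0) for numerical stability
    return -np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

result = minimize(negative_llf, x0=[0.0, 0.0])
result.x  # estimated (beta_0, beta_1), close to the values used to generate the data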

Fitting a Logistic Regression Model

We can now find these optimal $\beta_0, \beta_1$ values for our logistic regression model $\hat{p}=\frac{1}{1+e^{-(\hat{\beta}_0+\hat{\beta}_1x)}}$ using the smf.logit() function. The parameters and structure of this function are very similar to those that we used when we fit a linear regression model using the smf.ols() function.

  • We place our 0/1 response variable column name on the left hand side of our formula string parameter.
  • We place (add) our explanatory variable column name(s) on the right hand side of our formula string parameter.
  • We also supply the dataframe name.
  • We also need to .fit() this smf.logit() object.

Using the code below, we create a logistic regression object that we call log_mod, which will similarly hold all relevant information we'd like to know about our fitted logistic regression model.

Warning: Make sure that your response variable column is a 0/1 variable otherwise this function will not run!

import statsmodels.formula.api as smf  # needed for smf.logit()
log_mod = smf.logit(formula='y~number_of_followers', data=df_train).fit()

Optimization terminated successfully.
Current function value: 0.524442
Iterations 6

Similarly, we can use the .summary() function that corresponds to this log_mod object to display relevant logistic regression model information.

log_mod.summary()
                           Logit Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  84
Model:                          Logit   Df Residuals:                      82
Method:                           MLE   Df Model:                           1
Date:                Fri, 28 Jul 2023   Pseudo R-squ.:                 0.2356
Time:                        19:14:29   Log-Likelihood:                -44.053
converged:                       True   LL-Null:                       -57.628
Covariance Type:            nonrobust   LLR p-value:                 1.883e-07
=======================================================================================
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept              -1.5810      0.394     -4.012      0.000      -2.353      -0.809
number_of_followers     0.0060      0.002      3.854      0.000       0.003       0.009
=======================================================================================

We can similarly look up our intercept and coefficient(s) for the model in the coef column. Thus our "best fit" logistic regression model for the training dataset is the following (represented in three equivalent ways).

Simple Logistic Regression Model

  • $\hat{p}=\frac{1}{1+e^{-(-1.58+0.006\cdot\text{number\_of\_followers})}}$

  • $\hat{odds}=\frac{\hat{p}}{1-\hat{p}}=e^{(-1.58+0.006\cdot\text{number\_of\_followers})}$

  • $log(\hat{odds})=log(\frac{\hat{p}}{1-\hat{p}})=-1.58+0.006\cdot\text{number\_of\_followers}$
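As a quick check of these fitted equations, we can compare a hand computation against the fitted model's own prediction. The follower count of 200 below is just a hypothetical value, and we assume numpy and pandas are imported as np and pd.

import numpy as np
import pandas as pd

# Hand computation with the (rounded) fitted intercept and slope for a hypothetical 200 followers
p_manual = 1 / (1 + np.exp(-(-1.5810 + 0.0060 * 200)))

# Prediction from the fitted log_mod object for the same hypothetical value
p_model = log_mod.predict(pd.DataFrame({'number_of_followers': [200]}))

p_manual, p_model  # should agree up to rounding of the displayed coefficients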