Inference for Logistic Regression


When our response variable is quantitative, we can use linear regression to estimate the mean value for the response variable. However, when our response variable is categorical, we can use logistic regression to estimate the log odds of the response variable.

Just as we were able to adapt many of the analyses from linear regression to logistic regression (with some modifications), we can do the same for inference.

Recall that the inference procedures that could be described for linear regression include:

  • individual coefficients
  • extension to a difference of population means
  • significance of regression test
  • testing a subset of slopes

Each of these inference procedures can be modified for logistic regression.

In the last section, we focused only on inference for individual coefficients. On this page, we will do the same for a logistic regression coefficient.

One of the biggest differences to note is that we use different distributions and test statistics for these procedures. This relates to the fact that a categorical variable is described by only one parameter, $p$, while a quantitative variable often relies on two parameters: $\mu$ and $\sigma$. Because we only need to estimate one parameter, we can use the Normal distribution instead of the heavier-tailed $t$ distribution.

Individual Coefficients for Logistic Regression

We can assess whether individual coefficients appear to be significant; that is, we can assess whether a single variable appears to provide helpful information for predicting the log odds of our response variable. We can continue to perform hypothesis tests and calculate confidence intervals for single coefficients, using the properties that we have relied on earlier. Recall that the default hypotheses provided for us are:

$H_0: \beta_i = 0$ vs. $H_a: \beta_i \neq 0$

For example, suppose that we would like to predict whether the host of a given Airbnb unit is local (in Chicago, Illinois) based on characteristics of the listing and host, including whether the host is a superhost, the price of the listing, how many people the listing accommodates, and how long the host has been with Airbnb. For this data, we will aim to answer whether how many people the listing accommodates (the size of the listing) is associated with whether the host of a given Airbnb unit is local, after controlling for whether the host is a superhost, the price of the listing, and how long the host has been with Airbnb.

import statsmodels.formula.api as smf

# create a binary (0/1) response indicating whether the host is local to Chicago
df['local_host'] = (df['host_location'] == 'Chicago, IL')
df['local_host'] = df['local_host'].replace({True: 1, False: 0})

# fit the logistic regression model and display the summary table
results = smf.logit('local_host ~ host_is_superhost + price + accommodates + host_since', data = df).fit()
results.summary()
Optimization terminated successfully.
         Current function value: 0.542406
         Iterations 6

                           Logit Regression Results
==============================================================================
Dep. Variable:             local_host   No. Observations:                  700
Model:                          Logit   Df Residuals:                      695
Method:                           MLE   Df Model:                            4
Date:                Wed, 29 Nov 2023   Pseudo R-squ.:                  0.1611
Time:                        21:37:38   Log-Likelihood:                -379.68
converged:                       True   LL-Null:                       -452.59
Covariance Type:            nonrobust   LLR p-value:                 1.607e-30
=============================================================================================
                                coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept                    -1.5411      0.250     -6.153      0.000      -2.032      -1.050
host_is_superhost[T.True]     1.3195      0.202      6.522      0.000       0.923       1.716
price                        -0.0008      0.001     -1.329      0.184      -0.002       0.000
accommodates                  0.0852      0.035      2.412      0.016       0.016       0.154
host_since                    0.0007   8.92e-05      8.183      0.000       0.001       0.001
=============================================================================================

The regression coefficients table from statsmodels for a logistic regression model.

By looking at the row associated with our accommodates variable, we can identify the p-value as 0.016, which is fairly small. In other words, if there really were no relationship, we would anticipate seeing a relationship between accommodates and the log odds of the host being a local host as large (in magnitude) as the one we observed about 1.6 times out of every 100 samples. While the exact threshold needed to convince our skeptic has not been defined, this may be enough to decide that there is a relationship between the number of people a listing accommodates and whether that host is a local host.
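
If we would rather not read these values off of the summary table, we can also pull the test statistic and p-value for accommodates directly from the fitted results object; a minimal sketch, assuming the results object fitted above:

# z test statistic and two-sided p-value for the accommodates coefficient
# (statsmodels stores the z statistics for a logit model in results.tvalues)
print(results.tvalues['accommodates'])
print(results.pvalues['accommodates'])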

Also, note that we could adjust our hypotheses to be written in terms of the odds multiplier. In this case, we have an equivalent set of hypotheses and would test using the same test statistic and p-value as above.

$H_0: e^{\beta_i} = 1$ vs. $H_a: e^{\beta_i} \neq 1$.

Similarly, if we wanted to generate a confidence interval for the odds multiplier, we can calculate the corresponding confidence interval for $\beta_i$. Then, we can adjust each of our endpoints to the corresponding odds multiplier by exponentiating them. This allows us to use the theory that we know for the coefficients along with a simple transformation to answer questions about a different version of that same statistic.

import numpy as np
# exponentiate the confidence interval endpoints for the accommodates slope
# (taken from the summary table above) to get an interval for the odds multiplier
print(np.exp(0.016), ", ", np.exp(0.154))
1.016128685406095 ,  1.1664908867784396

95% confidence interval for the odds multiplier for the slope for accommodates.

Here, we see that we are 95% confident that the interval (1.016, 1.166) contains the true odds multiplier for whether a host is local, associated with each one-unit increase in the number of people that a listing accommodates, after controlling for whether the host is a superhost, the price of the listing, and how long the host has been with Airbnb. In other words, as the number of people a listing accommodates increases, it becomes more likely that the host is local, after controlling for the three variables mentioned above.
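
We could also let statsmodels compute the interval rather than typing the endpoints by hand; a minimal sketch, again assuming the results object fitted above (any small differences from the interval above are just rounding in the summary table):

import numpy as np
# pull the 95% confidence interval for the accommodates slope and exponentiate both endpoints
print(np.exp(results.conf_int().loc['accommodates']))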

Checking Assumptions

We know that certain conditions need to be met for the sampling distributions and corresponding inference to be valid. For logistic regression, those conditions are:

  • the response variable needs to be a categorical variable with two possible outcomes
  • the relationship between the log odds of success and the combination of Xs should be linear
  • the observations need to be independent
  • no multicollinearity between the X variables
  • the sample size is large enough to support the normal approximation
  • no strong outliers or influential points

Response Variable

The first assumption is that the response variable is appropriate for a logistic regression model. In essence, the response variable must be a categorical variable that has two possible outcomes. In this way, we are able to calculate the log odds for the response variable and can fit a logistic regression model to the data.
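
For our example, we can quickly confirm that the response we created takes only two values; a minimal sketch:

# confirm that the response variable has exactly two possible outcomes
print(df['local_host'].value_counts())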

Linearity of Predictors with Log Odds of Response

We also will assume that there is in fact a linear relationship between our $X$ variables and the log odds of the $y$ variable. This indicates that our underlying model is appropriate.

There are methods to assess this with hypothesis testing, but those are beyond the scope of our course. An approximate way to check this condition is to graph each $X$ variable with the $y$ variable and check that the fitted curve has an "S" shape, supporting that a logistic regression model is appropriate for modeling the variation in $y$.
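
Another rough check for a quantitative predictor is to bin it, compute the observed log odds of success within each bin, and look for a roughly linear pattern. A minimal sketch for the accommodates variable (the choice of 5 bins is arbitrary, and this assumes no bin has an observed proportion of exactly 0 or 1):

import numpy as np
import pandas as pd

# bin accommodates, then compute the observed log odds of local_host within each bin
bins = pd.qcut(df['accommodates'], q = 5, duplicates = 'drop')
p_hat = df.groupby(bins)['local_host'].mean()
print(np.log(p_hat / (1 - p_hat)))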

Independence

The observations should be independent. Independence is met if the observations are sampled from the population randomly with replacement. Since we generally sample randomly without replacement, we can treat the observations as approximately independent if:

  • the sample size is less than 20% of the population size and
  • the sample is randomly generated

No Strong Multicollinearity

Again, we need our $X$ variables to not display strong multicollinearity. This ensures that each variable is providing unique information and that the estimates corresponding to our variables are robust.
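
One common way to screen for strong multicollinearity is to compute a variance inflation factor (VIF) for each predictor, where large values (often taken to be above 5 or 10) signal a problem. A minimal sketch using patsy and statsmodels, assuming the same df and model formula as above:

from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

# rebuild the design matrix used by the model, then compute a VIF for each predictor
y, X = dmatrices('local_host ~ host_is_superhost + price + accommodates + host_since',
                 data = df, return_type = 'dataframe')
for i, name in enumerate(X.columns):
    if name != 'Intercept':
        print(name, variance_inflation_factor(X.values, i))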

Large Sample Size

Recall that for a theory-based approach to inference for proportions, we needed the sample size to be large enough. Again, we need a sample size condition to hold to ensure that the Normal distribution approximates the relevant sampling distribution well. This condition is now that we need at least 10 observations for each coefficient in the model for each $y$ group. To check this, we need to ensure that:

$n \ge \frac{10 \times p}{\hat{\pi}} \text{ and } n \ge \frac{10 \times p}{1-\hat{\pi}}$

where $p$ represents the number of coefficients.

This is similar to our previous conditions where $n\pi$ and $n(1-\pi)$ both needed to be at least 10, except that now we also incorporate the model complexity in this condition.
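
For the Airbnb model above, we could verify this condition directly; a minimal sketch, taking $p$ to be the 4 coefficients (the Df Model value in the summary) and using the observed proportion of local hosts as $\hat{\pi}$:

# sample size condition: n >= 10*p / pi_hat and n >= 10*p / (1 - pi_hat)
n = len(df)
p = 4                              # number of coefficients in the model
pi_hat = df['local_host'].mean()   # observed proportion of local hosts
print(n >= 10 * p / pi_hat, n >= 10 * p / (1 - pi_hat))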

No Strong Outliers or Influential Points

The data should not include strong outliers or influential points. Formal methods to check this condition are beyond the scope of our course. Generally, we hope to see that we do not have observations with extreme values in $x$ or extremely uncommon values of $y$ for a given $x$. Using visualizations is a reasonable approach for checking this condition.
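
As one simple visual screen, we might plot the quantitative predictors and scan for extreme values; a minimal sketch using matplotlib boxplots for price and accommodates:

import matplotlib.pyplot as plt

# boxplots of the quantitative predictors to scan for extreme values
df[['price', 'accommodates']].plot(kind = 'box', subplots = True, figsize = (8, 3))
plt.show()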