Inference for Regression


We demonstrated how we could use simulation-based inference for simple linear regression. Some regression situations are more complex, making it more challenging to define a simulation-based approach to inference. In this section, we will define theory-based forms of inference specific to linear and logistic regression. These include procedures that allow us to test:

  • individual coefficients for linear regression
  • all slopes simultaneously for linear regression
  • subsets of slopes for linear regression
  • similar hypotheses for logistic regression
  • special cases in which regression serves as a shortcut or an alternate test for other scenarios

We can also use functions within Python to perform the calculations for us. Although we could continue to use our simulation-based approaches, we'll rely on the underlying theory for convenience. Specifically, we'll use the statsmodels package, since many of the calculations we need are printed automatically in its output.

Let's get started!

Individual Coefficients for Linear Regression

For simple linear regression, we saw how we could use simulation-based resampling to approximate a sampling distribution for our coefficients. When we have multiple linear regression, the simulation approach starts to involve more variables and can get more complicated.

Instead, we'll rely on the theory-based approach for inference for regression. Python will do most of the challenging calculations for us, although we do still need to interpret the output appropriately.

In each row of the coefficients table in the default output, we can see the estimate for the coefficient, the standard error of that estimate, the t-test statistic, the two-sided p-value, and the bounds of the 95% confidence interval for that coefficient.

import statsmodels.formula.api as smf

# Fit a simple linear regression of price on beds
# (df holds our Chicago Airbnb listings from earlier)
ols = smf.ols('price ~ beds', data = df)
ols_result = ols.fit()

# Display only the coefficients table from the full summary output
ols_result.summary().tables[1]
              coef    std err          t      P>|t|      [0.025      0.975]
Intercept  64.5344      9.182      7.028      0.000      46.506      82.563
beds       44.6672      3.204     13.942      0.000      38.377      50.958

The regression coefficients table from statsmodels.

For example, we can see that the fitted slope for beds used to predict the price of a listing is 44.6672, with an estimated standard error of 3.204.
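Putting the intercept and slope estimates from the table together, the fitted regression equation is:

$\widehat{\text{price}} = 64.5344 + 44.6672 \cdot \text{beds}$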

The t-test statistic and the p-value correspond to the default pair of hypotheses:

$H_0: \beta_i = 0$ vs. $H_a: \beta_i \neq 0$

For this particular model, we can see that the number of beds that a listing has is a significant linear predictor for the mean price of the listing, at least for our Chicago Airbnbs, as the p-value is very small.
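If we'd rather work with these quantities directly than read them off the printed table, the fitted results object exposes each of them as an attribute. A minimal sketch, reusing the ols_result object fitted above:

# Pieces of the default t-test for each coefficient
ols_result.params     # coefficient estimates (Intercept, beds)
ols_result.bse        # standard errors of those estimates
ols_result.tvalues    # t-test statistics: estimate / standard error
ols_result.pvalues    # two-sided p-values for H_0: beta_i = 0

For the beds slope, the reported statistic is just the estimate divided by its standard error: 44.6672 / 3.204 ≈ 13.94, matching the t column above up to rounding.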

The output contains enough information to repeat or adjust the confidence interval or hypothesis test calculations for any alternate hypotheses or confidence levels. We'll skip these adjustments for now, but we could incorporate our procedures from 13-05 in this process.
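For instance, a different confidence level only requires passing a different alpha to conf_int, and a test against a nonzero null value can be rebuilt by hand from the estimate, its standard error, and the t distribution. A rough sketch (using scipy, and with the null value of 40 for the beds slope chosen purely for illustration):

from scipy import stats

# 90% confidence intervals instead of the default 95%
ols_result.conf_int(alpha = 0.10)

# Test H_0: beta_beds = 40 vs. H_a: beta_beds != 40 (40 is an illustrative null value)
null_value = 40
t_stat = (ols_result.params['beds'] - null_value) / ols_result.bse['beds']
p_value = 2 * stats.t.sf(abs(t_stat), df = ols_result.df_resid)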

Linear Regression Assumptions

Remember that for our inference results to be valid (and our theoretical sampling distribution to be correct), our LINE + no multicollinearity conditions must be met.

As a reminder, the full set of assumptions for linear models are:

  • Linearity ~ the relationship between the Xs and the Y variable should be linear in form
  • Independence ~ the true errors are independent
  • Normality ~ the true errors are normally distributed
  • Equal Variance ~ the variance of Y at each combination of X is equal
  • No Multicollinearity ~ there is no strong multicollinearity among the X variables

Note that the Linearity condition is partially assessed by some of these inference procedures, but it should still be checked as appropriate for the form of the model.
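Many of these conditions are usually checked graphically with the residuals from the fitted model, and multicollinearity is often assessed with variance inflation factors. A possible sketch, again reusing the single-predictor model fitted above (the VIF calculation becomes more informative once the model includes several X variables):

import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Residuals vs. fitted values: look for curvature (Linearity) and fanning (Equal Variance)
plt.scatter(ols_result.fittedvalues, ols_result.resid)
plt.axhline(0)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()

# QQ plot of the residuals to assess Normality
sm.qqplot(ols_result.resid, line = 's')
plt.show()

# Variance inflation factors for the columns of the design matrix
# (here just the intercept column and beds)
X = ols.exog
vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]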