Inference for Regression


We demonstrated how we could use simulation-based inference for simple linear regression. Some regression situations are more complex, making it more challenging to define a simulation-based approach to inference. In this section, we will define theory-based forms of inference specific to linear and logistic regression. These include procedures that allow us to test:

  • individual coefficients for linear regression
  • all slopes simultaneously for linear regression
  • subsets of slopes for linear regression
  • similar hypotheses for logistic regression
  • special cases in which regression serves as a shortcut or an alternate test for other scenarios

We can also use functions within Python to perform the calculations for us. Although we could continue to use our simulation-based approaches, we'll rely on the underlying theory for convenience. Specifically, we'll use the statsmodels package, since many of the calculations we need are printed automatically in its output.

Let's get started!

Individual Coefficients for Linear Regression

For simple linear regression, we saw how we could use simulation-based resampling to approximate a sampling distribution for our coefficients. When we have multiple linear regression, the simulation approach starts to involve more variables and can get more complicated.

Instead, we'll rely on the theory-based approach for inference for regression. Python will do most of the challenging calculations for us, although we do still need to interpret the output appropriately.

In each row of the coefficients table in the default output, we can see the estimate for the coefficient, the standard error of that estimate, the t-test statistic, the two-sided p-value, and the bounds of the 95% confidence interval for that coefficient.

import statsmodels.formula.api as smf

# Fit a simple linear regression of price on beds
# (df holds our Chicago Airbnb listings from earlier)
ols = smf.ols('price ~ beds', data = df)
ols_result = ols.fit()

# Display only the coefficients table from the full summary output
ols_result.summary().tables[1]
              coef    std err          t      P>|t|      [0.025      0.975]
Intercept  64.5344      9.182      7.028      0.000      46.506      82.563
beds       44.6672      3.204     13.942      0.000      38.377      50.958

The regression coefficients table from statsmodels.

For example, we can see that the fitted slope for beds used to predict the price of a listing is 44.6672, with an estimated standard error of 3.204.
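Putting the intercept and slope estimates from the table together, the fitted regression equation is:

$\widehat{\text{price}} = 64.5344 + 44.6672 \cdot \text{beds}$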

The t-test statistic and the p-value correspond to the default pair of hypotheses:

$H_0: \beta_i = 0$ vs. $H_a: \beta_i \neq 0$

For this particular model, we can see that the number of beds that a listing has is a significant linear predictor for the mean price of the listing, at least for our Chicago Airbnbs, as the p-value is very small.
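If we'd rather work with these quantities directly than read them off the printed table, the fitted results object exposes each of them as an attribute. A minimal sketch, reusing the ols_result object fitted above:

# Pieces of the default t-test for each coefficient
ols_result.params     # coefficient estimates (Intercept, beds)
ols_result.bse        # standard errors of those estimates
ols_result.tvalues    # t-test statistics: estimate / standard error
ols_result.pvalues    # two-sided p-values for H_0: beta_i = 0

For the beds slope, the reported statistic is just the estimate divided by its standard error: 44.6672 / 3.204 ≈ 13.94, matching the t column above up to rounding.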

The output contains enough information to repeat or adjust the confidence interval or hypothesis test calculations for any alternate hypotheses or confidence levels. We'll skip these adjustments for now, but we could incorporate our procedures from 13-05 in this process.
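For instance, a different confidence level only requires passing a different alpha to conf_int, and a test against a nonzero null value can be rebuilt by hand from the estimate, its standard error, and the t distribution. A rough sketch (using scipy, and with the null value of 40 for the beds slope chosen purely for illustration):

from scipy import stats

# 90% confidence intervals instead of the default 95%
ols_result.conf_int(alpha = 0.10)

# Test H_0: beta_beds = 40 vs. H_a: beta_beds != 40 (40 is an illustrative null value)
null_value = 40
t_stat = (ols_result.params['beds'] - null_value) / ols_result.bse['beds']
p_value = 2 * stats.t.sf(abs(t_stat), df = ols_result.df_resid)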

Linear Regression Assumptions

Remember that for our inference results to be valid (and our theoretical sampling distribution to be correct), our LINE + no multicollinearity conditions must be met.

As a reminder, the full set of assumptions for linear models are:

  • Linearity ~ the relationship between the Xs and the Y variable should be linear in form
  • Independence ~ the true errors are independent
  • Normality ~ the true errors are normally distributed
  • Equal Variance ~ the variance of Y at each combination of X is equal
  • No Multicollinearity ~ there is no strong multicollinearity among the X variables

Note that the Linearity condition is partially assessed by some of these inference procedures, but it should still be checked as appropriate for the form of the model.
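Many of these conditions are usually checked graphically with the residuals from the fitted model, and multicollinearity is often assessed with variance inflation factors. A possible sketch, again reusing the single-predictor model fitted above (the VIF calculation becomes more informative once the model includes several X variables):

import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Residuals vs. fitted values: look for curvature (Linearity) and fanning (Equal Variance)
plt.scatter(ols_result.fittedvalues, ols_result.resid)
plt.axhline(0)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()

# QQ plot of the residuals to assess Normality
sm.qqplot(ols_result.resid, line = 's')
plt.show()

# Variance inflation factors for the columns of the design matrix
# (here just the intercept column and beds)
X = ols.exog
vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]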