Sampling Distributions for Regression


As we saw in the last section, we can create sampling distributions when we have two populations, not just one. Can we do more when our data have been collected with additional structure?

For example, what about when we have data used for regression? Or when the observations are not independent in some other way?

In this section, we’ll explore what we can do when we have data collected for regression purposes.

Simulations for Regression Data

Let’s think about simple linear regression. As a reminder, this is when we have one predictor variable ($x$) and one response variable ($y$).

For this example, we'll consider predicting $y = \text{price}$ from $x = \text{beds}$.

If we wanted to generate a sampling distribution for the sample slope, one attempt might be to recognize that we have two different columns and to use a resampling scheme similar to the one we used for comparing two groups in the last section. In this case, our simulation might look as follows:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(df_popn[['beds']], df_popn['price'])
true_slope = model.coef_[0]
print(true_slope)
57.106388887951276
sample_slopes = []
for i in range(1000):
    # (incorrectly) resample the x column and the y column separately
    x = df_popn[['beds']].sample(50, replace = True)
    y = df_popn['price'].sample(50, replace = True)
    model.fit(x, y)
    sample_slopes.append(model.coef_[0])
sample_slopes = pd.DataFrame({'x': sample_slopes})
sample_slopes['x'].hist()
plt.xlabel('Possible Sample Slopes Predicting Price from Beds')
plt.ylabel('Frequency')
plt.title('Incorrect Sampling Distribution of Sample Slopes for Chicago Airbnb Data')
plt.show()

Does this seem correct?

The answer is no, the above approach is incorrect! There are a couple of reasons this simulation fails, but the biggest one stems from the fact that our data have additional structure: each $x$ value is recorded for an Airbnb listing (our observational unit) that also has a $y$ value. Thinking back to the assumptions for comparing two groups, we required the two groups to be independent of each other, and that assumption is violated here, since price and beds are recorded on the same listings. So, what can we do?

Instead of creating separate samples for the two different variables, we’ll create a single sample made up of the observational units. We can then pull the entire row from the original dataset as an observation in a sample.

Using this scheme, our sampling distribution is as follows:

sample_slopes = []
for i in range(1000):
    # resample entire rows (listings), keeping each x value paired with its y value
    df_sample = df_popn.sample(50, replace = True)
    model.fit(df_sample[['beds']], df_sample['price'])
    sample_slopes.append(model.coef_[0])
sample_slopes = pd.DataFrame({'x': sample_slopes})
sample_slopes['x'].hist()
plt.xlabel('Possible Sample Slopes Predicting Price from Beds')
plt.ylabel('Frequency')
plt.title('Correct Sampling Distribution of Sample Slopes for Chicago Airbnb Data')
plt.show()

Notice that we are able to sample each individual observation rather than sampling each of the variables separately. This allows us to retain the original structure in the data for the simulation (that is, we still have the structure that a specific $x$ value is associated with a $y$ value by being recorded from the same observation).

We can see here that the first sampling distribution (the incorrect one) is centered at a slope of 0, while the second sampling distribution (the correct one) is centered near the population slope of about \$57/night that we calculated from the full data.

The nice part is that we can now extend this simulation to more complex situations, including multiple linear regression: we sample (or resample) observations from our data, fit a new model to each sample, and record the coefficients of that model.
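
As a rough illustration, the same resampling loop with two predictors might look like the sketch below. Here, bedrooms is a hypothetical second column used only for illustration; any other numeric predictor in the data would work the same way.

sample_coefs = []
for i in range(1000):
    # resample whole rows (listings), so each listing keeps all of its recorded values
    df_sample = df_popn.sample(50, replace = True)
    model.fit(df_sample[['beds', 'bedrooms']], df_sample['price'])
    sample_coefs.append(model.coef_)                  # record both fitted slopes at once
sample_coefs = pd.DataFrame(sample_coefs, columns = ['beds', 'bedrooms'])
sample_coefs.hist()                                   # one simulated sampling distribution per coefficient
plt.show()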

Theory for Linear Regression Slopes

Regression is one of the fundamental techniques in statistics and data science, so quite a bit is known about regression.

Because of this, we know theoretical properties of the sampling distribution of a sample regression coefficient, both for simple and multiple linear regression. More information about these theoretical properties can be found in the Deeper Dive for Sampling Distributions.

For the purposes of our course, the important characteristics for the sampling distribution for sample coefficients are that:

  • the shape of the distribution is Normal
  • the estimates are unbiased, which means that the center is the corresponding population coefficient
  • an estimate for the standard error of the sample coefficient can be found from statsmodels (one standard way to write this result for simple linear regression is shown below)
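
For simple linear regression, that result can be written as

$$\hat{\beta}_1 \sim N\!\left(\beta_1,\ \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right), \qquad \widehat{\text{SE}}(\hat{\beta}_1) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}, \qquad \hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n} e_i^2,$$

where the $e_i$ are the residuals from the fitted line and $\sigma^2$ is the variance of the true errors. The standard error that statsmodels reports is $\widehat{\text{SE}}(\hat{\beta}_1)$, and the result holds when the assumptions discussed later in this section are met.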

While we could calculate the estimated standard error using linear algebra and continue using the sklearn package, we can also use the statsmodels package, which will provide an estimated standard error for our coefficient.

In this instance, we'll switch to estimating the sampling distribution for the sample slope, now based on a sample of size 692 (rather than of 50, like we did before).

import statsmodels.api as sm
import statsmodels.formula.api as smf
ols = smf.ols('price ~ beds', data = df)
ols_result = ols.fit()
ols_result.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.220
Model: OLS Adj. R-squared: 0.219
Method: Least Squares F-statistic: 194.4
Date: Tue, 27 Jun 2023 Prob (F-statistic): 4.21e-39
Time: 14:20:04 Log-Likelihood: -4424.9
No. Observations: 692 AIC: 8854.
Df Residuals: 690 BIC: 8863.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 64.5344 9.182 7.028 0.000 46.506 82.563
beds 44.6672 3.204 13.942 0.000 38.377 50.958
Omnibus: 707.443 Durbin-Watson: 2.015
Prob(Omnibus): 0.000 Jarque-Bera (JB): 35874.174
Skew: 4.705 Prob(JB): 0.00
Kurtosis: 36.995 Cond. No. 5.16


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Output for the regression model using statsmodels.
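
As mentioned above, we could also compute this standard error directly with linear algebra. Here is a minimal sketch of that route, assuming df holds our sample of 692 listings with columns beds and price; the printed value should match the std err of 3.204 reported by statsmodels for beds.

import numpy as np
X = np.column_stack([np.ones(len(df)), df['beds']])   # design matrix with an intercept column
y = df['price'].to_numpy()
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)          # least squares coefficient estimates
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (len(df) - X.shape[1])   # estimated error variance (n - 2 in the denominator)
var_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)    # estimated covariance matrix of the coefficients
print(var_beta_hat[1, 1] ** 0.5)                      # estimated standard error of the slope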

We can assess the validity of the theoretical sampling distribution by overlaying it on a simulated sampling distribution, like the one we simulated above. If the shape of the simulated sampling distribution closely matches the overlaid theoretical distribution, then the theoretical and simulated sampling distributions support the same conclusions.

import numpy as np
from scipy.stats import norm

sample_slopes = []
for i in range(1000):
    df_sample = df_popn.sample(692, replace = True)
    model.fit(df_sample[['beds']], df_sample['price'])
    sample_slopes.append(model.coef_[0])
sample_slopes = pd.DataFrame({'x': sample_slopes})
sample_slopes['x'].describe()
    count    1000.000000
    mean       57.481029
    std        11.382801
    min        32.185216
    25%        49.694991
    50%        55.287380
    75%        63.028159
    max       119.786748
    Name: x, dtype: float64
# overlay the theoretical Normal distribution suggested by the regression theory
mu = true_slope
var_beta_hat = ols_result.cov_params().values   # estimated covariance matrix of the coefficients from statsmodels
std = var_beta_hat[1,1]**(1/2)                  # estimated standard error of the slope
sample_slopes['x'].hist(density = True, bins = 30)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
plt.xlabel('Possible Sample Slopes Predicting Price from Beds')
plt.ylabel('Density')
plt.title('Correct Sampling Distribution of Sample Slopes for Chicago Airbnb Data')
plt.show()

Sampling distribution for the possible sample slopes.

Assumptions for Linear Regression Sampling Distributions

If you notice, the theoretical sampling distribution doesn't seem to match with our simulated sampling distribution. This mismatch should be a little concerning, as we do typically expect the simulated sampling distribution to match the theoretical one. What is happening here?

Our simulated sampling distribution (the one represented by the histogram) is accurate, since it is generated according to the definition of a sampling distribution. Why isn't the theoretical distribution matching?

There are three assumptions needed for the sampling distribution of a regression coefficient to follow the theoretical sampling distribution, beyond the two assumptions that are needed for the linear regression line itself to be appropriate.

Recall that the two linear regression assumptions already covered in 8-08 are:

  • Linearity: the true relationship between each of the $X$ variables and the $Y$ variable is linear in nature and
  • No Multicollinearity: there is no (strong) relationship between any set of $X$ variables.

For the sampling distribution to be valid, we also need the following three assumptions to be met:

  • Independence: the true errors (the true values for the residuals from the population model) are independent of each other,
  • Normality: the distribution of the true errors is normally distributed, and
  • Equal Variance: the variance of $y$ at each combination of $x$ values is equal, or alternately, the true errors have equal variance $\sigma^2$ at every combination of $x$ values.

The acronym LINE can be used to remember four of these assumptions, with an M added for the multicollinearity condition.

How do we check for each of these assumptions?

Checking the Linearity Assumption

The linearity assumption can be assessed using the fitted vs. residuals plot, as described in 8-08. We can continue to follow this same process here.

plt.scatter(ols_result.predict(df), ols_result.resid)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Fitted vs. Residuals Plot, Price ~ Beds')
plt.show()

Fitted vs. residual plot of predicting price from the number of beds of Airbnb listings.

Checking the Independence Assumption

We will not discuss a method to assess the independence assumption using a plot or other data analysis procedure. One way to ensure independence of our observations would be to randomly sample from the population with replacement. Since most samples aren't taken with replacement, we can at least determine if there are any obvious concerns with the independence assumption. To do this, we can ensure that we have:

  • randomly sampled observations from the population
  • sampled fewer than 10% of the observations in the population (i.e., $10n < N$, where $n$ is the sample size and $N$ is the population size); a quick check of this condition is sketched below.
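
As a quick check of the second condition, we can compare the two sizes directly. This is just a sketch, assuming df_popn still holds the full population of Chicago listings and df holds our sample.

n = len(df)          # sample size
N = len(df_popn)     # population size
print(10 * n < N)    # True when the sample is less than 10% of the population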

Checking the Normality Assumption

Our assumption is that the true errors are Normally distributed. However, we don't have access to the true errors, since we are missing the true (population) linear model for the data. Instead, we use the residuals calculated from our data and our sample regression line as an estimate for the true errors.
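
Concretely, the true error for observation $i$ is $\varepsilon_i = y_i - (\beta_0 + \beta_1 x_i)$, which we cannot compute, so we estimate it with the residual $e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$ from the fitted line.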

We can then check that the residuals appear to be reasonably normally distributed. We can do so by making a histogram or a QQ plot to check for obvious violations of a normal distribution.

First, we can generate a histogram of our residuals and observe the shape of the histogram. We should be able to identify any strong violations of normality from the histogram.

ols_result.resid.hist(bins = 20)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Histogram of Residuals, Predicting Price of Chicago Airbnbs from Beds')
plt.show()

Histogram of the residuals for the linear regression model.

This histogram may show some evidence of non-normality, but it is hard to tell definitively. When I observe this distribution, I can tell that the shape appears unimodal. I struggle to tell whether the shape is especially non-symmetric, or whether the specific locations of the bins are giving the appearance of strong non-normality. I can also tell that there are a few more upper outliers than I might otherwise expect.

In an instance like this one, where the normality of the residuals is unclear, I may instead turn to a QQ plot. QQ plots are helpful because you only need to check whether the points follow a straight line, and it is easier to judge by eye whether points follow a line than whether they follow a specific curve, like a normal distribution.

To generate a QQ plot, we can use the statsmodels package.

sm.qqplot(ols_result.resid)
plt.show()

QQ plot of the residuals for the linear regression model predicting price from beds.

Because the points on the QQ plot above do not follow a straight line, the distribution of the residuals does not appear to come from a population of errors that follows a Normal distribution.

Therefore, it appears that the normality assumption is not met for this data.

Checking the Equal Variance Assumption

The equal variance assumption can actually be checked with a plot that we've already made: the fitted vs. residuals plot.

For the linearity assumption to be satisfied, the points should be centered around the $e=0$ line of the plot at every fitted value. For the equal variance assumption, the points should additionally be spread equally around the $e=0$ line at every fitted value. Another way to think about this is that you should be able to draw a line that encompasses all (or almost all) of the positive residuals and another line that encompasses all (or almost all) of the negative residuals, and these two lines should be roughly parallel across all fitted values.

For this distribution, the following two lines might be added to the fitted vs. residuals graph.

Fitted vs. residual plot of predicting price from the number of beds of Airbnb listings, with two guide lines added to assess equal variance.
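
One way such guide lines could be overlaid in code is sketched below; the endpoints here are hypothetical, eyeballed values chosen only to illustrate the idea, not computed from the data.

plt.scatter(ols_result.predict(df), ols_result.resid)
plt.plot([50, 450], [30, 350], color = 'red')      # rough upper envelope (hypothetical endpoints)
plt.plot([50, 450], [-30, -350], color = 'red')    # rough lower envelope (hypothetical endpoints)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Fitted vs. Residuals Plot with Guide Lines, Price ~ Beds')
plt.show()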

In this case, I would conclude that the equal variance assumption is not met, as the two lines added to encompass the observations do not appear to be parallel.

One option when the equal variance assumption is not met is to try transforming the data. For some datasets, a transformation can adjust the data appropriately so that the fitted model no longer violates the equal variance assumption.
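
For example, one common attempt is to model the logarithm of the response and then re-check the fitted vs. residuals plot. This is only a sketch; whether it actually resolves the violation depends on the data, and it requires the response to be strictly positive.

log_result = smf.ols('np.log(price) ~ beds', data = df).fit()   # assumes every price is positive
plt.scatter(log_result.predict(df), log_result.resid)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Fitted vs. Residuals Plot, log(Price) ~ Beds')
plt.show()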

Checking the Multicollinearity Assumption

The no multicollinearity assumption could be assessed using a pairplot, sometimes also called a scatterplot matrix, as described in 8-08. This check is only needed when there is more than one predictor, so there is nothing to assess for our simple linear regression model.

A Second (Good) Example

Let's consider a new dataset. In this example, we have information about body dimensions of healthy US adults. We'd like to learn about how height, weight, age, pelvic breadth, chest depth, and wrist diameter affect the waist girth of all healthy US adults. We'll then compare this to how age and pelvic breadth alone affect the waist girth of all healthy US adults.

We'll start by fitting the first model and checking the corresponding assumptions.

Fitting the Full Model

df = pd.read_csv('bdims.csv')
results = smf.ols('waist_girth ~ age + height + weight + wrist_diameter + chest_depth + pelvic_breadth', data = df).fit()
results.summary()
OLS Regression Results
Dep. Variable: waist_girth R-squared: 0.876
Model: OLS Adj. R-squared: 0.874
Method: Least Squares F-statistic: 564.4
Date: Mon, 06 Nov 2023 Prob (F-statistic): 7.71e-214
Time: 23:18:11 Log-Likelihood: -1356.9
No. Observations: 487 AIC: 2728.
Df Residuals: 480 BIC: 2757.
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 45.7400 4.759 9.610 0.000 36.388 55.092
age 0.1853 0.021 9.013 0.000 0.145 0.226
height -0.1889 0.029 -6.519 0.000 -0.246 -0.132
weight 0.7341 0.032 22.817 0.000 0.671 0.797
wrist_diameter -0.0005 0.322 -0.001 0.999 -0.633 0.632
chest_depth 0.6363 0.124 5.112 0.000 0.392 0.881
pelvic_breadth -0.1807 0.099 -1.833 0.067 -0.374 0.013
Omnibus: 4.252 Durbin-Watson: 1.831
Prob(Omnibus): 0.119 Jarque-Bera (JB): 4.042
Skew: 0.193 Prob(JB): 0.133
Kurtosis: 3.226 Cond. No. 5.06e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.06e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Output for the full regression model about body dimensions.

Checking the Assumptions for the Full Model

Recall that since we are using a multiple linear regression model, we have 5 main assumptions to check.

For the linearity assumption, we can use the fitted vs. residuals plot to assess whether there is a true linear relationship between each of our X variables and our y variable. Because the points in the following fitted vs. residuals plot seem randomly scattered around the e = 0 line for all fitted values, this assumption seems reasonable.

plt.scatter(results.predict(df), results.resid)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Fitted vs. Residuals Plot, Full Model for Waist Girth')
plt.axhline(y = 0, color = 'r', linestyle = '--')
plt.show()

Fitted vs. residual plot of full model for predicting waist girth.

For the independence assumption, we will rely on the fact that we have a random sample of US adults (assumed) that are less than 10% of the population size (487 is reasonably less than 10% of the population of all US adults).

To assess the normality of true errors, we can use the QQ plot of the residuals. Because the QQ plot follows a very straight line, it appears that this assumption is reasonably met.

sm.qqplot(results.resid)
plt.show()

QQ plot of the full model for predicting waist girth.

To assess the equality of variances of the true errors for each combination of X values, we can again use the fitted vs. residuals plot from above. Because we saw that the spread of our scatter around the e = 0 line does not seem to change as we move from left to right, this assumption seems to be reasonably met.

Finally, for our multicollinearity assumption, we can create a pairplot. Here, we see that many of our predictor variables do seem to have a linear relationship with waist_girth, our y variable. This is encouraging for the fitting of a linear model. However, when we examine the relationship between our X variables, we see substantial relationships between predictors, especially between height or weight and all of the remaining predictors.

import seaborn as sns
sns.pairplot(df[['waist_girth', 'age', 'pelvic_breadth', 'height', 'weight', 'wrist_diameter', 'chest_depth']])
plt.show()

Scatterplot matrix of the variables used in the full model for predicting waist girth.
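
A minimal numeric complement to the pairplot is the matrix of pairwise correlations among the predictors; large absolute correlations, especially those involving height and weight here, point to the same multicollinearity concern.

predictors = ['age', 'pelvic_breadth', 'height', 'weight', 'wrist_diameter', 'chest_depth']
print(df[predictors].corr().round(2))   # pairwise correlations among the predictor variables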

Why is multicollinearity concerning?

Intuitively, we know that multicollinearity can be concerning, because the slope estimates can be unreliable. We now have the tools to explore and to demonstrate why the slope estimates are unreliable.

For our given data, we know that the true slope for the pelvic_breadth should be -0.1807. Based on statistical theory outside the scope of our course, we can also identify that the standard error of this estimate should be 0.099. This value is reported in the output from statsmodels in Python.

We can now use our simulation approach for generating a sampling distribution for regression in order to compare with our theoretical distribution suggested by statistical theory.

sample_slopes = []
for i in range(1000):
    df_sample = df.sample(487, replace = True)
    results = smf.ols('waist_girth ~ age + pelvic_breadth + weight + height + wrist_diameter + chest_depth', data = df_sample).fit()
    sample_slopes.append(results.params['pelvic_breadth'])
sample_slopes = pd.DataFrame({'x': sample_slopes})
sample_slopes['x'].hist(density = True, bins = 30)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, -0.1807, 0.099)   # theoretical Normal distribution from the statsmodels output above
plt.plot(x, p, 'k', linewidth=2)
plt.xlabel('Possible Sample Slopes Predicting Waist Girth from Pelvic Breadth in our Full Model')
plt.ylabel('Density')
plt.title('Simulated Sampling Distribution of Sample Slopes for Body Dimensions Data')
plt.show()

Simulated sampling distribution of pelvic breadth slopes in the presence of multicollinearity.

Notice here that the theoretical sampling distribution (black line) does not match up with the simulated sampling distribution that we just created. This is concerning and shows that our slope estimates are not particularly reliable: across repeated samples of the same size, the fitted coefficient for pelvic_breadth can land far from the value we know to be true for the population model, can vary widely, and can even come out positive.
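
One quick way to put numbers on this mismatch (a sketch; the exact values will vary from run to run) is to compare the spread of the simulated slopes to the theoretical standard error and to check how often the resampled slope comes out positive.

print(sample_slopes['x'].std())          # simulated standard error; the theory above suggested roughly 0.099
print((sample_slopes['x'] > 0).mean())   # proportion of resampled slopes that are positive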

Is this variability entirely due to the presence of multicollinearity, or might it simply be present because we have a multiple linear regression? So far, we have also only seen situations where the simulated sampling distribution does not match what is suggested by the underlying theory. Is it possible that the theory is wrong? To provide evidence that the multicollinearity violation is driving the mismatch between theory and simulation, let's finally examine an example where all of the assumptions are met.

Fitting the Reduced Model

We'll now focus on our final model: predicting waist girth using pelvic breadth and age. Can we understand how these two variables are related to the waist girth of US adults?

We'll start by fitting our model.

results = smf.ols('waist_girth ~ age + pelvic_breadth', data = df).fit()
results.summary()
OLS Regression Results
Dep. Variable: waist_girth R-squared: 0.266
Model: OLS Adj. R-squared: 0.263
Method: Least Squares F-statistic: 87.68
Date: Mon, 06 Nov 2023 Prob (F-statistic): 3.19e-33
Time: 23:18:13 Log-Likelihood: -1789.7
No. Observations: 487 AIC: 3585.
Df Residuals: 484 BIC: 3598.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 16.2297 5.504 2.949 0.003 5.415 27.044
age 0.3329 0.046 7.164 0.000 0.242 0.424
pelvic_breadth 1.8161 0.204 8.891 0.000 1.415 2.217
Omnibus: 3.686 Durbin-Watson: 0.977
Prob(Omnibus): 0.158 Jarque-Bera (JB): 3.637
Skew: 0.175 Prob(JB): 0.162
Kurtosis: 2.761 Cond. No. 533.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Output for the reduced regression model about body dimensions.

When we look at the fitted vs. residuals plot for this model, we see that the points are centered at the e = 0 line for all fitted values, suggesting the linearity assumption is reasonable. We do see some changes in variability across the fitted values, so the equal variance assumption might be mildly violated, but these violations do not seem too concerning.

plt.scatter(results.predict(df), results.resid)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Fitted vs. Residuals Plot, Reduced Model for Waist Girth')
plt.axhline(y = 0, color = 'r', linestyle = '--')
plt.show()

Fitted vs. residual plot for the reduced model for waist girth.

Then, the normality assumption seems to be reasonably met based on the straight line identified in the QQ plot of the corresponding residuals.

sm.qqplot(results.resid)
plt.show()

QQ plot of the residuals for the reduced model for waist girth.

We have already generated the pairplot before. If we focus on the relationship between our two predictor variables (age and pelvic breadth), we can see that there is no strong relationship. Therefore, the multicollinearity assumption is met as there does not appear to be any strong multicollinearity between our two predictor variables.

Finally, our independence assumption is checked in the same way that the independence assumption was checked above. Therefore, we anticipate that the independence assumption is met.

We now have a situation where all of our assumptions, including those specific to the sampling distribution, appear to be met or have only minor violations. How does our simulated sampling distribution compare to the theoretical one?

sample_slopes = []
for i in range(1000):
    df_sample = df.sample(487, replace = True)
    results = smf.ols('waist_girth ~ age + pelvic_breadth', data = df_sample).fit()
    sample_slopes.append(results.params['pelvic_breadth'])
sample_slopes = pd.DataFrame({'x': sample_slopes})
sample_slopes['x'].hist(density = True, bins = 30)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, 1.8161, 0.204)   # theoretical Normal distribution from the statsmodels output above
plt.plot(x, p, 'k', linewidth=2)
plt.xlabel('Possible Sample Slopes Predicting Waist Girth from Pelvic Breadth in our Reduced Model')
plt.ylabel('Density')
plt.title('Simulated Sampling Distribution of Sample Slopes for Body Dimensions Data')
plt.show()

The simulated and theoretical sampling distributions for the slope of pelvic breadth to predict waist girth in the reduced model.

We finally see an example where our theoretical distribution matches the simulated distribution! The two distributions line up closely, a fact that we will take advantage of later in Module 13.
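
As one quick numeric check (a sketch; the exact values will vary from run to run), the mean and standard deviation of the simulated slopes should land near the values suggested by the theory.

print(sample_slopes['x'].mean())   # should be near the population slope of about 1.8161
print(sample_slopes['x'].std())    # should be near the theoretical standard error of about 0.204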

Conclusion

We've now introduced the additional assumptions needed for our sample slopes to have sampling distributions that follow Normal distributions and for our slope estimates to be reliable. The linearity and no multicollinearity assumptions are necessary for the model itself to be appropriate and to produce trustworthy estimates. The independence assumption helps ensure that our data were collected in a way that supports this kind of inference. Finally, the normality and equal variance assumptions help confirm that our sampling distribution follows a Normal distribution with the expected variance.

Checking these assumptions is crucial to ensure that our sampling distribution will behave as expected, especially if we anticipate using our theoretical results as a shortcut to simulation.