Deeper Dive into Underlying Theory


This page should be considered optional for a first reading of this material.

In this section, we look at some of the mathematical properties that explain the calculations for the means and standard errors of sampling distributions.

General Properties for Expectations and Standard Deviations

Let $a$ and $b$ represent any general constants. In other words, $a$ and $b$ are both placeholders for numbers.

Let $X$ and $Y$ represent any random variables.

We have the following important properties for expectations, variances, and standard deviations.

  • $\mathbb{E}(aX + bY) = a\mathbb{E}(X) + b \mathbb{E}(Y)$
  • $\text{sd}(X) = \sqrt{\text{Var}(X)}$, or equivalently $\text{Var}(X) = \text{sd}(X)^2$
  • $\text{Var}(aX + bY) = a^2\text{Var}(X) + b^2\text{Var}(Y)$ when $X$ and $Y$ are independent
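As an optional check, here is a short simulation sketch of these properties; it is written in Python with NumPy purely as an illustration (not a tool used elsewhere on this page), and all of the specific numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent random variables with known means and variances
X = rng.normal(loc=5, scale=2, size=1_000_000)   # E(X) = 5,  Var(X) = 4
Y = rng.normal(loc=-1, scale=3, size=1_000_000)  # E(Y) = -1, Var(Y) = 9

a, b = 2, 3
combo = a * X + b * Y

# E(aX + bY) = a E(X) + b E(Y) = 2(5) + 3(-1) = 7
print(combo.mean())           # close to 7
# Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) = 4(4) + 9(9) = 97 (X and Y are independent)
print(combo.var())            # close to 97
print(np.sqrt(combo.var()))   # sd(aX + bY) is about sqrt(97), roughly 9.85
```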

Calculations for Sample Means

A sample mean is $\bar{X} = \frac{1}{n}\sum_{i = 1}^{n}X_i = \frac{1}{n}(X_1 + X_2 + \ldots + X_n)$.

Generally, when we calculate a sample mean, we assume that the individual observations $X_1, X_2, \ldots, X_n$ are independent observations from the same population, where that population has $\mathbb{E}(X_i) = \mu$ and $\text{Var}(X_i) = \sigma^2$.

We can calculate the expected value of all possible sample means using the properties and definitions above:

$\mathbb{E}(\bar{X}) = \mathbb{E}(\frac{1}{n}(X_1 + X_2 + \ldots + X_n)) = \frac{1}{n}\mathbb{E}(X_1 + X_2 + \ldots + X_n) = \frac{1}{n}(\mathbb{E}(X_1) + \mathbb{E}(X_2) + \ldots + \mathbb{E}(X_n)) = \frac{1}{n}(\mu + \mu + \ldots + \mu) = \frac{1}{n}(n\mu) = \mu$

$\text{Var}(\bar{X}) = \text{Var}(\frac{1}{n}(X_1 + X_2 + \ldots + X_n)) = \frac{1}{n^2}\text{Var}(X_1 + X_2 + \ldots + X_n) = \frac{1}{n^2}(\text{Var}(X_1) + \text{Var}(X_2) + \ldots + \text{Var}(X_n)) = \frac{1}{n^2}(\sigma^2 + \sigma^2 + \ldots + \sigma^2) = \frac{1}{n^2}(n\sigma^2) = \frac{\sigma^2}{n}$

Based on this, $\text{se}(\bar{X}) = \sqrt{\text{Var}(\bar{X})} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}$.

Recall that we often call the standard deviation of a statistic – in this case of $\bar{X}$ – the standard error of that statistic. This can also be thought of as the standard deviation of the sampling distribution for the sample mean.
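If you would like to see these results numerically, the following sketch (again Python with NumPy, with made-up population values) draws many samples of size $n$ and compares the simulated mean and standard deviation of the sample means to $\mu$ and $\frac{\sigma}{\sqrt{n}}$.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 50, 10, 25          # made-up population mean, population sd, and sample size

# Draw 100,000 samples of size n and compute each sample's mean
sample_means = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)

print(sample_means.mean())         # close to mu = 50
print(sample_means.std())          # close to sigma / sqrt(n) = 10 / 5 = 2
```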

For Proportions

We can similarly use math to support the mean and the standard error calculations for the sampling distribution of sample proportions.

Our sample proportion is $\hat{p} = \frac{\text{# with characteristic}}{\text{total #}}$.

We described earlier that we could define $X_i$ as

\begin{equation}
X_i =
\begin{cases}
1 & \text{if observation } i \text{ has the desired characteristic}\\
0 & \text{if observation } i \text{ does not have the desired characteristic}
\end{cases}
\end{equation}

and $\hat{p} = \frac{1}{n}\sum_{i = 1}^n X_i = \frac{\text{# with characteristic}}{\text{total #}}$. We previously demonstrated that $\hat{p}$ is a mean, since it has the same equation and format as the sample mean.
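As a tiny illustration with made-up data, the mean of a vector of 0/1 indicators is exactly the count of 1s divided by the total count:

```python
import numpy as np

# Made-up data: 1 = has the characteristic, 0 = does not
x = np.array([1, 0, 1, 1, 0, 1, 0, 1])

p_hat_from_counts = x.sum() / len(x)      # (# with characteristic) / (total #)
p_hat_as_mean = x.mean()                  # the same value, computed as a sample mean

print(p_hat_from_counts, p_hat_as_mean)   # both 0.625
```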

Each of the $X_i$ is then distributed according to the Bernoulli distribution. In other words, each $X_i$ has probability $p$ of having the desired characteristic, where $p$ is the population proportion with the desired characteristic. The additional assumption is that the $X_i$ are independent.

In essence, this means that each observation is sampled with replacement. As described earlier, this assumption can be relaxed if the sample is randomly selected and makes up only a small portion (less than 10%) of the population.

For Bernoulli distributions, $\mathbb{E}(X) = p$ and $\text{Var}(X) = p(1-p)$.
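For completeness, both facts follow directly from the definition of expectation for a variable that can only take the values 0 and 1:

$\mathbb{E}(X) = 1 \cdot p + 0 \cdot (1-p) = p$

$\text{Var}(X) = \mathbb{E}(X^2) - \mathbb{E}(X)^2 = (1^2 \cdot p + 0^2 \cdot (1-p)) - p^2 = p - p^2 = p(1-p)$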

We now have all of the components needed to demonstrate the expected value and variance of our sample proportion.

$\mathbb{E}(\hat{p}) = \mathbb{E}(\frac{1}{n}\sum_{i = 1}^n X_i) = \mathbb{E}(X_i) = p$.

$\text{Var}(\hat{p}) = \text{Var}(\frac{1}{n}\sum_{i = 1}^n X_i) = \frac{\text{Var}(X_i)}{n} = \frac{p(1-p)}{n}$.

Based on this, the standard error for $\hat{p}$ is $\text{se}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$
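A quick simulation sketch (with a made-up $p$ and $n$) confirms both the mean and the standard error of the sampling distribution of $\hat{p}$:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 0.3, 100                     # made-up population proportion and sample size

# Each row is one sample of n Bernoulli(p) observations; each row mean is one p-hat
p_hats = rng.binomial(1, p, size=(100_000, n)).mean(axis=1)

print(p_hats.mean())                # close to p = 0.3
print(p_hats.std())                 # close to sqrt(p * (1 - p) / n), about 0.046
```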

Adding and Subtracting Normal Distributions

We previously reported the theoretical sampling distribution for difference statistics, including the difference in two sample means and the difference in two sample proportions for two different populations. We also demonstrated using simulations that these distributions are accurate.

Now, we'll shift our focus to demonstrating these properties are accurate based on our understanding of means and standard deviations.

We do need to add one more property to address how to combine Normal distributions. When we add or subtract independent Normal random variables, the resulting distribution will remain Normal.

This means that if we have two Normal distributions:
N(mean = 3, sd = 4) and
N(mean = 7, sd = 3)

Then if we add the two distributions, our resulting distribution will be Normal. The mean will be the sum of the two means (3 + 7 = 10) and the standard deviation will be the square root of the sum of the two variances (var = $4^2 + 3^2$ = 16 + 9 = 25; sd = $\sqrt{25}$ = 5).

If instead we choose to subtract the two distributions (we’ll subtract the second distribution from the first), then the resulting distribution will still be Normal. This time, the mean will be the difference between the two means ($\mathbb{E}(X - Y) = \mathbb{E}(1X + (-1)Y) = \mathbb{E}(X) + (-1) \mathbb{E}(Y) = \mathbb{E}(X) - \mathbb{E}(Y)$, or 3 - 7 = -4 for these two distributions). The standard deviation will still be 5 ($\text{Var}(X - Y) = \text{Var}(1X + (-1)Y) = 1^2\text{Var}(X) + (-1)^2 \text{Var}(Y) = \text{Var}(X) + \text{Var}(Y) = 4^2 + 3^2 = 16 + 9 = 25$; $\text{sd}(X-Y) = \sqrt{\text{Var}(X-Y)} = \sqrt{25} = 5$).

Therefore, the sum of these two distributions would be N(mean = 10, sd = 5). The difference of these two distributions (distribution 1 - distribution 2) would be N(mean = -4, sd = 5).
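A short simulation sketch (using the same two distributions) confirms both results:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(3, 4, size=1_000_000)   # N(mean = 3, sd = 4)
y = rng.normal(7, 3, size=1_000_000)   # N(mean = 7, sd = 3)

print((x + y).mean(), (x + y).std())   # close to 10 and 5
print((x - y).mean(), (x - y).std())   # close to -4 and 5
```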

Central Limit Theorem for Difference Statistics

The Central Limit Theorem can then be applied to the difference in sample means or sample proportions as long as the conditions for the Central Limit Theorem apply to each individual population.

The Central Limit Theorem says that $\bar{X}_1 - \bar{X}_2 \sim N(\mu_1 - \mu_2, \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}})$.

The Central Limit Theorem also says that $\hat{p}_1 - \hat{p}_2 \sim N(p_1 - p_2, \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}})$.
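The sketch below checks the first of these two results by simulation, using made-up values for the two population means, standard deviations, and sample sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
mu1, sigma1, n1 = 60, 12, 40       # made-up values for population 1
mu2, sigma2, n2 = 55, 8, 30        # made-up values for population 2

xbar1 = rng.normal(mu1, sigma1, size=(100_000, n1)).mean(axis=1)
xbar2 = rng.normal(mu2, sigma2, size=(100_000, n2)).mean(axis=1)
diffs = xbar1 - xbar2

print(diffs.mean())                                  # close to mu1 - mu2 = 5
print(diffs.std())                                   # close to the CLT standard error below
print(np.sqrt(sigma1**2 / n1 + sigma2**2 / n2))      # sqrt(144/40 + 64/30), about 2.39
```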

Theory for Linear Regression Slopes

Regression is one of the fundamental techniques in statistics and data science, so quite a bit is known about it. In particular, theoretical properties are known for the sampling distribution of sample slopes for simple linear regression, multiple linear regression, and regression from other families, including logistic regression.

We'll provide the sampling distribution for simple linear regression and multiple linear regression, but we won't provide derivations to support them and we won't use the equations below for our course. We won't provide the theoretical sampling distributions for other regression families; more information can be found online for those who would like to learn more.

For simple linear regression, the sampling distribution of possible sample slopes $\hat{\beta}_1$ will be Normally distributed with a mean of the true population slope $\beta_1$ and a standard error of $\frac{\sigma}{\sigma_x\sqrt{n}}$, where $\sigma$ is the standard deviation of the errors of regression, $\sigma_x$ is the standard deviation for the $x$ observations, and $n$ is the number of observations.

A more general approach that applies to both simple and multiple linear regression is:

$\hat{\beta} \sim N(\beta, \sigma \sqrt{(X^TX)^{-1}})$ where $\sigma$ is the standard deviation of the errors of regression and $X$ is the design matrix for regression.
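As a final illustration, the sketch below simulates the sampling distribution of the simple linear regression slope using made-up values for the true slope, the error standard deviation, and the spread of $x$; np.polyfit is used here simply as one convenient way to compute the least-squares slope.

```python
import numpy as np

rng = np.random.default_rng(5)
beta0, beta1 = 2.0, 1.5            # made-up true intercept and slope
sigma, sigma_x, n = 3.0, 2.0, 50   # sd of the errors, sd of x, and sample size

slopes = []
for _ in range(20_000):
    x = rng.normal(0, sigma_x, size=n)
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=n)
    slopes.append(np.polyfit(x, y, 1)[0])       # least-squares slope estimate

slopes = np.array(slopes)
print(slopes.mean())               # close to beta1 = 1.5
print(slopes.std())                # roughly sigma / (sigma_x * sqrt(n)), about 0.21
```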