Sampling Distribution Properties


In the last section, we focused on generating a sampling distribution for a sample statistic through simulations, using either the population data or our sample data. Now that we know how to simulate a sampling distribution, let’s focus on the properties of sampling distributions.

On this page, we will start by exploring these properties using simulations. Then, we will review statistical theory and motivate this theory with some additional mathematical support that formalizes the properties that we first observed.

Properties of Sampling Distributions

What are some of the features that we can control about a sampling distribution? Let’s review the code from our simulations to see which features we can control.

The features that we control are:

  • The number of repeated samples to take
  • The size of the sample
  • The statistic to calculate for each sample

We’re going to focus on the sample mean price as the statistic for this example and for most of this page; at the end, we’ll discuss two additional statistics. Let’s explore how each of the other two features – the number of repeated samples and the size of the sample – affects the sampling distribution. Specifically, consider the following features of the distributions:

  • The shape of the distribution
  • The center of the distribution
  • The variability (spread) of the distribution

Function for a Sampling Distribution

Since we'll be repeating the process of generating a sampling distribution many times, let's write a function to automate the process of simulating the sampling distribution.

import pandas as pd

def sampling_distribution(reps = 1000, n = 40, data = df_popn, withreplace = True, var = 'price'):
    # INPUT
    # reps, number of samples to collect, set to 1000 by default
    # n, sample size for each sample, set to 40 by default
    # data, the data frame to use as the population, set to df_popn by default
    # withreplace, whether to sample with replacement, set to True by default
    # var, variable of interest, set to 'price' by default
    simulated_statistics = []
    for i in range(reps):
        # For each repetition, collect a random sample of size n,
        # calculate the sample mean, and add it to simulated_statistics
        df_sample = data.sample(n, replace = withreplace)
        simulated_statistics.append(df_sample[var].mean())
    simulated_statistics = pd.DataFrame({'x': simulated_statistics})
    # return our sampling distribution as a data frame
    return simulated_statistics

The sampling_distribution function takes five arguments as inputs. You can supply it with your data, the variable of interest, the sample size, whether you want to sample with replacement, and the number of repetitions to collect. It will then return a data frame with one variable (x) that contains a simulated sampling distribution for a sample mean.

Under the hood, the function performs a for loop. That means it repeats the same process as many times as we ask it to, given by the number of repetitions. For each repetition, it takes a sample from the data of interest, calculates the sample mean, and records it before continuing through the for loop.

Because we use df_popn, we are simulating the true sampling distribution from the population of interest. We are not resampling from our example sample data.

Number of Repeated Samples

For the number of repeated samples, let’s consider taking 100, 1000, and 10000 repeated samples to generate the sampling distribution. We’ll set the sample size to 40 for each of these simulations.

sample_means_100reps = sampling_distribution(reps = 100)
sample_means_1000reps = sampling_distribution()
sample_means_10000reps = sampling_distribution(reps = 10000)
plt.subplot(1,3,1)
sample_means_100reps['x'].hist()
plt.ylabel('Frequency')
plt.title('100 repetitions')
plt.subplot(1,3,2)
sample_means_1000reps['x'].hist()
plt.title('1000 repetitions')
plt.subplot(1,3,3)
sample_means_10000reps['x'].hist()
plt.title('10000 repetitions')
plt.suptitle('Histograms of Sample Mean Chicago Airbnb Prices Per Night')
plt.show()

The number of repetitions increases from 100 to 10,000 to observe how the repetitions affect the sampling distribution.

What do you notice? Well, for these three distributions, we can tell that the distribution is more filled out as the number of samples increases. That is, there are more observations in the last distribution than in the first. But the shape, center, and spread all remain relatively consistent between the three distributions. That means that the number of repetitions does not meaningfully affect the sampling distribution itself; more repetitions simply give us a more complete picture of it.

Sample Size

For the size of the sample, let’s consider taking random samples of size 4, 16, 100, and 1000. We’ll set the number of repeated samples to 1000.

sample_means_n4 = sampling_distribution(n = 4)
sample_means_n16 = sampling_distribution(n = 16)
sample_means_n100 = sampling_distribution(n = 100)
sample_means_n1000 = sampling_distribution(n = 1000)
plt.subplot(1,4,1)
sample_means_n4['x'].hist()
plt.ylabel('Frequency')
plt.title('n = 4')
plt.subplot(1,4,2)
sample_means_n16['x'].hist()
plt.title('n = 16')
plt.subplot(1,4,3)
sample_means_n100['x'].hist()
plt.title('n = 100')
plt.subplot(1,4,4)
sample_means_n1000['x'].hist()
plt.title('n = 1000')
plt.suptitle('Histograms of Sample Mean Chicago Airbnb Prices Per Night')
plt.show()

The sample size increases from 4 to 1,000 to observe how the sample size affects the sampling distribution.

What do you notice from these four graphs?

For these four distributions, the shape becomes more normal (bell shaped) as the sample size increases. The center stays in roughly the same location across the four distributions. The variability of the sampling distributions decreases as the sample size increases; that is, the sample means generally are closer to the center as the sample size is larger.

Why does this happen? We have some statistical theory that explains this phenomenon!

The Central Limit Theorem

While we can always use the resampling scheme for any sample statistic in any setting, we can sometimes use statistical theory to take a shortcut. You were previously introduced to this concept. We’ll continue this discussion through the rest of this page.

The Central Limit Theorem states that the sampling distribution for the sample mean will be

  • approximately Normally distributed with a mean of $\mu$ and a standard deviation of $\frac{\sigma}{\sqrt{n}}$ as long as $n$ is large enough (typically defined as $n \ge 25$ or $30$, regardless of the shape of the population distribution)
  • Normally distributed with a mean of $\mu$ and a standard deviation of $\frac{\sigma}{\sqrt{n}}$ regardless of the sample size if the population is Normally distributed
  • when the sample is taken with replacement
  • where $\mu$ is the mean of the population distribution, $\sigma$ is the standard deviation of the population distribution, and $n$ is the sample size for each sample used to calculate a single sample mean
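To see the theorem in action, we can overlay the Normal density it predicts on one of our simulated sampling distributions. Below is a minimal sketch, assuming the df_popn data frame and the sample_means_n100 simulation from earlier on this page are still available; it compares the simulation with the $N(\mu, \frac{\sigma}{\sqrt{n}})$ curve for n = 100.

import numpy as np
from scipy.stats import norm

# CLT parameters for the sampling distribution of the mean when n = 100
mu = df_popn['price'].mean()                # population mean
se = df_popn['price'].std() / np.sqrt(100)  # sigma / sqrt(n)

# Plot the simulated sampling distribution as a density histogram
sample_means_n100['x'].hist(density = True)

# Overlay the Normal density predicted by the Central Limit Theorem
grid = np.linspace(mu - 4 * se, mu + 4 * se, 200)
plt.plot(grid, norm.pdf(grid, loc = mu, scale = se))
plt.title('Simulated vs. Theoretical Sampling Distribution (n = 100)')
plt.show()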

One limitation of the Central Limit Theorem is that it only applies to the sampling distribution of a sample mean and not to the sampling distribution for other sample statistics. We will see one special case towards the end of this page.

If you'd like to see a mathematical explanation for the mean and standard deviation of the sampling distribution, take a look at the deeper dive at the end of this section.

Properties Based on the Central Limit Theorem

Through our simulations, we saw that the center (average) of a sampling distribution for a sample mean stays fairly constant for all of our simulations. We saw that the variability (standard deviation) of a sampling distribution for a sample mean decreases as the sample size increases. We often call the standard deviation of a statistic – in this case of $\bar{X}$ – the standard error of that statistic. This can also be thought of as the standard deviation of the sampling distribution for the sample mean. We saw that the shape became more normal as the sample size increased from very small (n = 4) to medium (like n = 16 or 100), but then the shape (normality) didn’t change much after that.

We can formalize many of these features by using the Central Limit Theorem above.

In general, we find that the center of the sampling distribution of the sample mean will always be approximately the population mean. What does this mean? This means that we generally anticipate that the sample mean should be an appropriate estimate for the population mean. Does this indicate that the sample mean will be exactly equal to the population mean? Or even that the sample mean (if all parts of the sampling are performed appropriately) will always be close to it?

samp_size = ['Population', 'n=4', 'n=16', 'n=100', 'n=1000']
mean_of_dist = [df_popn['price'].mean(), sample_means_n4['x'].mean(), sample_means_n16['x'].mean(), sample_means_n100['x'].mean(), sample_means_n1000['x'].mean()]
std_of_dist = [df_popn['price'].std(), sample_means_n4['x'].std(), sample_means_n16['x'].std(), sample_means_n100['x'].std(), sample_means_n1000['x'].std()]
SampDist = pd.DataFrame({'Sample': samp_size,
                         'Mean': mean_of_dist,
                         'Standard Deviation': std_of_dist})
SampDist
Sample Mean Standard Deviation
0 Population 170.174997 216.351752
1 n=4 168.443000 107.766482
2 n=16 170.601937 55.198696
3 n=100 170.599250 22.682464
4 n=1000 169.930734 7.106900

We can actually see that neither of these statements is true. We can see from the sampling distributions that common sample means vary from about 100 to about 500 for n = 4. This indicates that there is still uncertainty for a single sample mean – it could be exactly at the population mean, or it could vary widely.

But, we can also see that the variability decreases dramatically as the sample size increases. In fact, the Central Limit Theorem indicates that the standard deviation of a sample mean decreases by a factor of $\frac{1}{\sqrt{n}}$. As our sample size increases, we generally expect our sample mean to be closer to the population mean.
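We can verify this scaling with a quick sketch that places the theoretical standard errors, $\frac{\sigma}{\sqrt{n}}$, next to the simulated standard deviations from the table above; it assumes the simulated distributions from earlier on this page are still available.

# Theoretical standard error sigma / sqrt(n) for each sample size
sigma = df_popn['price'].std()
sizes = [4, 16, 100, 1000]
theoretical_se = [sigma / (n ** 0.5) for n in sizes]

# Simulated standard errors from our earlier sampling distributions
simulated_se = [sample_means_n4['x'].std(), sample_means_n16['x'].std(),
                sample_means_n100['x'].std(), sample_means_n1000['x'].std()]
pd.DataFrame({'n': sizes,
              'Theoretical SE': theoretical_se,
              'Simulated SE': simulated_se})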

This is not a deterministic statement over all possible samples; in other words, this is not necessarily true for two specific samples of different sample sizes. For example, we could have a sample with a small sample size where the sample mean is exactly equal to the population mean and another sample with a larger sample size where the sample mean is not exactly equal to the population mean. In this instance, the smaller sample would be closer to the true value than the larger sample. However, across all possible samples, we would anticipate that the sample mean from the larger sample size would be closer to the population mean than the sample mean from a smaller sample size, at least in general.

In this way, our sampling distributions allow us to define uncertainty that is associated with sample statistics. We can understand reasonable values for sample statistics based on characteristics of the sample — primarily, the sample size.

Loosening the Assumptions and Extending the Theory

The biggest assumption associated with the Central Limit Theorem and sampling distributions is that the observations in the sample are independent of each other. When observations in a sample are generated with replacement, the observations are independent. When observations in a sample are generated without replacement, the observations are no longer independent.

Recall that independence indicates that one event occurring doesn’t affect the probability that another event occurs. We can consider a special case: suppose that we select a first observation with a price of \$107, and that there are only 2 observations in the population of size $N$ that have that specific price. If we sample with replacement, then the probability of the second observation having a price of \$107 is again $\frac{2}{N}$. However, when we sample without replacement, the probability of the second observation having a price of \$107 has changed and is now $\frac{1}{N-1}$. This difference indicates that the two events are not independent.

Most samples are not collected with replacement but are instead collected without replacement from the population, especially because we generally don’t want repeated rows recorded for the same observation in a data frame. What can we do, then?

We can loosen the assumptions needed for these properties to still hold. In general, if $N$ is large, the difference between $\frac{2}{N}$ and $\frac{1}{N-1}$ in the above example is going to be fairly small. Therefore, we want to consider cases where the lack of independence only mildly affects the variability of the sampling distribution. We can allow this mild infraction of the independence assumption when we sample without replacement if (a quick simulation check follows this list):

  • Our sample size is less than 10% of the population size and
  • The sample is randomly generated
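As that quick check, we can rerun our simulation function with withreplace = False and compare. This sketch assumes that n = 40 is well under 10% of the number of rows in df_popn; under that condition, the two standard errors should be very close.

# Sampling distributions with and without replacement, n = 40
with_repl = sampling_distribution(n = 40, withreplace = True)
without_repl = sampling_distribution(n = 40, withreplace = False)

# With a small sampling fraction, these should nearly agree
print(with_repl['x'].std())
print(without_repl['x'].std())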

Proportions as Means

The above properties are all written for sampling distributions of sample means. Does this also hold for proportions calculated from categorical variables?

In fact, there’s a really neat property where proportions are in essence a special case for a mean.

Consider a mean. A mean is calculated as the sum of all of the observations divided by the number of observations.

$\bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i$

Now, consider a proportion. A proportion is calculated as the number of observations that have a certain characteristic divided by the number of observations.

$\hat{p} = \frac{\text{# with characteristic}}{\text{total #}}$

If we define having a certain characteristic as taking the value 1 and not having the characteristic as taking the value 0, then the number of observations that have the characteristic will be the same as the sum of these 0 and 1 values. In other words, let

\begin{equation}
X_i =
\begin{cases}
1 & \text{if observation } i \text{ has the characteristic} \\
0 & \text{if observation } i \text{ does not have the characteristic}
\end{cases}
\end{equation}

$\hat{p} = \frac{\text{# with characteristic}}{\text{total #}} = \frac{\sum_{i = 1}^n x_i}{n} = \frac{1}{n}\sum_{i = 1}^n x_i = \bar{x}$

This demonstrates that a proportion is a special case of a mean.
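We can confirm this equivalence numerically with a small hypothetical 0/1 sample:

# A hypothetical sample: 1 = has the characteristic, 0 = does not
x = pd.Series([1, 0, 0, 1, 1])

# The proportion (3 out of 5) and the mean of the 0/1 values agree
print(x.sum() / len(x))   # 0.6
print(x.mean())           # 0.6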

Then, do the same properties hold for the sampling distribution of a sample proportion?

Number of Repeated Samples

Let’s check using simulations. At first, we’ll assume that we are using a sample size of $n = 40$ and a population proportion of $p = 0.5$.

We’ll perform some simulations as the number of repetitions increases, from 100 to 1000 to 10000.

from scipy.stats import binom

def theoretical_sampling_phat(reps = 1000, n = 40, p = 0.5):
    # INPUT
    # reps, number of samples to collect, set to 1000 by default
    # n, sample size for each sample, set to 40 by default
    # p, the proportion of our characteristic, set to 0.5 by default
    simulated_statistics = []
    for i in range(reps):
        # For each repetition, use binom to simulate the number of successes in n trials,
        # calculate the proportion, and add it to simulated_statistics
        simulated_statistics.append(binom.rvs(n, p) / n)
    simulated_statistics = pd.DataFrame({'x': simulated_statistics})
    # return our simulated sampling distribution for p-hat as a data frame
    return simulated_statistics
sample_props_100reps = theoretical_sampling_phat(reps = 100)
sample_props_1000reps = theoretical_sampling_phat(reps = 1000)
sample_props_10000reps = theoretical_sampling_phat(reps = 10000)
plt.subplot(1,3,1)
sample_props_100reps['x'].hist()
plt.ylabel('Frequency')
plt.title('100 repetitions')
plt.subplot(1,3,2)
sample_props_1000reps['x'].hist()
plt.title('1000 repetitions')
plt.subplot(1,3,3)
sample_props_10000reps['x'].hist()
plt.title('10000 repetitions')
plt.suptitle('Histograms of Sample Proportions')
plt.show()

Histograms displaying how the number of repetitions affects the sampling distribution for sample proportions.

Again, we see that the distribution looks more rounded and filled out with more repetitions, but this is likely just due to having more observations in the specific distribution. We don't observe any obvious differences in the means or standard errors of these distributions.

rep_size = ['Population', 'Reps=100', 'Reps=1000', 'Reps=10000']
mean_of_dist = [0.5, sample_props_100reps['x'].mean(), sample_props_1000reps['x'].mean(), sample_props_10000reps['x'].mean()]
std_of_dist = [(0.5*0.5/40)**(0.5), sample_props_100reps['x'].std(), sample_props_1000reps['x'].std(), sample_props_10000reps['x'].std()]
SampDist = pd.DataFrame({'Repetitions': rep_size,
                         'Mean': mean_of_dist,
                         'Standard Deviation': std_of_dist})
SampDist
Repetitions Mean Standard Deviation
0 Population 0.500000 0.079057
1 Reps=100 0.485500 0.082280
2 Reps=1000 0.499375 0.077346
3 Reps=10000 0.501363 0.079159

Sample Size

We'll also perform a few simulations as the sample size increases with sample sizes of 4, 16, 100, and 1000.

We can again observe how these distributions change.

sample_props_n4 = theoretical_sampling_phat(n = 4)
sample_props_n16 = theoretical_sampling_phat(n = 16)
sample_props_n100 = theoretical_sampling_phat(n = 100)
sample_props_n1000 = theoretical_sampling_phat(n = 1000)
plt.subplot(1,4,1)
sample_props_n4['x'].hist()
plt.ylabel('Frequency')
plt.title('n = 4')
plt.subplot(1,4,2)
sample_props_n16['x'].hist()
plt.title('n = 16')
plt.subplot(1,4,3)
sample_props_n100['x'].hist()
plt.title('n = 100')
plt.subplot(1,4,4)
sample_props_n1000['x'].hist()
plt.title('n = 1000')
plt.suptitle('Histograms of Sample Proportions')
plt.show()

Histograms displaying how the sample size affects the sampling distribution for sample proportions.

samp_size = ['Population', 'n=4', 'n=16', 'n=100', 'n=1000']
mean_of_dist = [0.5, sample_props_n4['x'].mean(), sample_props_n16['x'].mean(), sample_props_n100['x'].mean(), sample_props_n1000['x'].mean()]
std_of_dist = [0.5, sample_props_n4['x'].std(), sample_props_n16['x'].std(), sample_props_n100['x'].std(), sample_props_n1000['x'].std()]
theoretical_std = [0.5, 0.5/2, 0.5/4, 0.5/10, 0.5/(1000)**(0.5)]
SampDist = pd.DataFrame({'Sample': samp_size,
                         'Mean': mean_of_dist,
                         'Standard Deviation': std_of_dist,
                         'Theoretical Standard Error': theoretical_std})
SampDist
Sample Mean Standard Deviation Theoretical Standard Error
0 Population 0.500000 0.500000 0.500000
1 n=4 0.489500 0.248901 0.250000
2 n=16 0.502125 0.124763 0.125000
3 n=100 0.501670 0.050232 0.050000
4 n=1000 0.500412 0.016045 0.015811

Summary measures for the sampling distribution with different sample sizes.

Again, we see that the center for each of our sampling distributions stays fairly constant. It's close to 0.50 for each of our sample sizes.

We can also see that the standard error for each sampling distribution decreases as the sample size increases. We can also confirm that it follows the expected pattern based on the theoretical standard error that we expect.

In terms of the shape, when $n = 4$ we have a distribution that looks fairly discrete; that is, the bars don't appear to be touching each other. Once we get to $n = 16$, the distribution becomes more filled in, and as our sample size increases further, it appears more and more Normal.

Proportion Values

We can also consider what happens as we have different underlying proportions – say p = 0.01, p = 0.1, and p = 0.5. We’ll look at these in combination with the sample size (n = 4, 16, 100, and 1000), since we saw that the sample size was important for our sampling distribution.

sample_props_n4_p01 = theoretical_sampling_phat(n = 4, p = 0.01)
sample_props_n16_p01 = theoretical_sampling_phat(n = 16, p = 0.01)
sample_props_n100_p01 = theoretical_sampling_phat(n = 100, p = 0.01)
sample_props_n1000_p01 = theoretical_sampling_phat(n = 1000, p = 0.01)
sample_props_n4_p1 = theoretical_sampling_phat(n = 4, p = 0.1)
sample_props_n16_p1 = theoretical_sampling_phat(n = 16, p = 0.1)
sample_props_n100_p1 = theoretical_sampling_phat(n = 100, p = 0.1)
sample_props_n1000_p1 = theoretical_sampling_phat(n = 1000, p = 0.1)
plt.subplot(3,4,1)
sample_props_n4_p01['x'].hist()
plt.ylabel('p = 0.01')
plt.title('n = 4')
plt.subplot(3,4,2)
sample_props_n16_p01['x'].hist()
plt.title('n = 16')
plt.subplot(3,4,3)
sample_props_n100_p01['x'].hist()
plt.title('n = 100')
plt.subplot(3,4,4)
sample_props_n1000_p01['x'].hist()
plt.title('n = 1000')
plt.subplot(3,4,5)
sample_props_n4_p1['x'].hist()
plt.ylabel('p = 0.1')
plt.subplot(3,4,6)
sample_props_n16_p1['x'].hist()
plt.subplot(3,4,7)
sample_props_n100_p1['x'].hist()
plt.subplot(3,4,8)
sample_props_n1000_p1['x'].hist()
plt.subplot(3,4,9)
sample_props_n4['x'].hist()
plt.ylabel('p = 0.5')
plt.subplot(3,4,10)
sample_props_n16['x'].hist()
plt.subplot(3,4,11)
sample_props_n100['x'].hist()
plt.subplot(3,4,12)
sample_props_n1000['x'].hist()

plt.suptitle('Histograms of Sample Proportions')
plt.show()

Histograms displaying how the sample size and population proportions affect the sampling distribution for sample proportions.

We can see here how the sample size and the proportion interact with each other. When $p = 0.01$, we don't have an especially Normal distribution when $n = 100$. We don't get a Normal distribution until $n = 1000$. However, when $p = 0.5$, the distribution begins to look fairly Normal when $n = 16$. Therefore, we see that the Normality depends on the interaction of both $n$ and $p$. We will formalize this with the Central Limit Theorem for Proportions below.

Central Limit Theorem for Proportions

Based on the simulations, we can observe that the same (or similar) properties of the sampling distributions for sample means also apply to the sampling distributions for sample proportions. Mathematical support is contained in the Deeper Dive.

When it comes to the statement of the Central Limit Theorem and its assumptions, we do need to make a modification for proportions.

The Central Limit Theorem, when applied to proportions, says that the sampling distribution for a sample proportion, $\hat{p}$, will be:

  • approximately Normally distributed with a mean of $p$ and a standard deviation of $\sqrt{\frac{p(1-p)}{n}}$ as long as the sample is large enough (typically defined as both $np$ and $n(1-p)$ being at least 10)
  • when the sample is taken with replacement
  • where $p$ is the corresponding proportion of the population and $n$ is the sample size for each sample used to calculate a single sample proportion

Note that the large sample size condition is a little more specific when we have proportions compared to means. We say that we want to have $np$ and $n(1-p)$ both be large enough (typically at least 10, but some sources are ok with 5) in order for Normality to accurately describe the shape of the sampling distribution.

The loosening of assumptions above still holds. Generally, when we describe how the Central Limit Theorem applies to proportions we would say that the sampling distribution of the possible values for the sample proportion would be $N(p, \sqrt{\frac{p(1-p)}{n}})$ provided that $np$ and $n(1-p)$ are both at least 10, the sample size is less than 10% of the population size, and the sample is randomly generated from the population.
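Once these conditions are met, we can use the Normal distribution directly in place of a simulation. Here is a brief sketch for p = 0.5 and n = 40 (so that np = n(1-p) = 20, both at least 10), estimating how often a sample proportion above 0.6 would occur:

from scipy.stats import norm

p, n = 0.5, 40

# Check the large sample condition: np and n(1-p) both at least 10
print(n * p >= 10 and n * (1 - p) >= 10)   # True

# Theoretical standard error from the CLT for proportions
se = (p * (1 - p) / n) ** 0.5              # about 0.079, matching our earlier table

# Approximate probability that a sample proportion exceeds 0.6
print(1 - norm.cdf(0.6, loc = p, scale = se))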

Other Sample Statistics

We mentioned at the beginning of this lesson that we make a decision to summarize a sample with a specific sample statistic. How does this decision affect the sampling distribution?

We have statistical theory through the Central Limit Theorem that specifies the sampling distribution for the sample mean and for the sample proportion. Can we use other statistical theory for any other sample statistics?

There are some advanced statistical theories and proofs that can be applied, if certain conditions are met. These require additional concepts from distributions and probability beyond the foundations that we’ve covered here.

The good news, though, is that you can get a sense of the answer yourself using simulations.

Below, I’ll provide an example for the sampling distribution for the minimum of a sample.

What do you observe from this distribution, particularly about the shape, the center, and the spread as the sample size increases?

def sampling_distribution_min(reps = 1000, n = 40, data = df_popn, withreplace = True, var = 'price'):
    # INPUT
    # reps, number of samples to collect, set to 1000 by default
    # n, sample size for each sample, set to 40 by default
    # data, the data frame to use as the population, set to df_popn by default
    # withreplace, whether to sample with replacement, set to True by default
    # var, variable of interest, set to 'price' by default
    simulated_statistics = []
    for i in range(reps):
        # For each repetition, collect a random sample of size n,
        # calculate the statistic of interest, and add it to simulated_statistics
        df_sample = data.sample(n, replace = withreplace)
        simulated_statistics.append(df_sample[var].min())
    simulated_statistics = pd.DataFrame({'x': simulated_statistics})
    # return our sampling distribution as a data frame
    return simulated_statistics
sample_mins_n4 = sampling_distribution_min(n = 4)
sample_mins_n16 = sampling_distribution_min(n = 16)
sample_mins_n100 = sampling_distribution_min(n = 100)
sample_mins_n1000 = sampling_distribution_min(n = 1000)
plt.subplot(1,4,1)
sample_mins_n4['x'].hist()
plt.ylabel('Frequency')
plt.title('n = 4')
plt.subplot(1,4,2)
sample_mins_n16['x'].hist()
plt.title('n = 16')
plt.subplot(1,4,3)
sample_mins_n100['x'].hist()
plt.title('n = 100')
plt.subplot(1,4,4)
sample_mins_n1000['x'].hist()
plt.title('n = 1000')
plt.suptitle('Histograms of Sample Minimum Chicago Airbnb Prices')
plt.show()

Histograms displaying sampling distributions for sample minimum values.

samp_size = ['Population', 'n=4', 'n=16', 'n=100', 'n=1000']
mean_of_dist = [df_popn['price'].min(), sample_mins_n4['x'].mean(), sample_mins_n16['x'].mean(), sample_mins_n100['x'].mean(), sample_mins_n1000['x'].mean()]
std_of_dist = [float('nan'), sample_mins_n4['x'].std(), sample_mins_n16['x'].std(), sample_mins_n100['x'].std(), sample_mins_n1000['x'].std()]
SampDist = pd.DataFrame({'Sample': samp_size,
                         'Mean': mean_of_dist,
                         'Standard Deviation': std_of_dist})
SampDist
Sample Mean Standard Deviation
0 Population 10.000 NaN
1 n=4 66.233 32.776134
2 n=16 38.164 14.32132
3 n=100 22.708 6.122436
4 n=1000 12.998 3.254318

Here, we see that the sampling distribution for the minimum does not appear to be particularly Normal or symmetric in shape. We see that the mean value of the sampling distribution decreases and approaches the true minimum value of \$10 as the sample size gets larger. We also see that the standard deviation decreases as the sample size gets larger, although not necessarily by the exact factor of $\frac{1}{\sqrt{n}}$ that the Central Limit Theorem gives for sample means.
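If you plan to explore several statistics, one option – a sketch, not code from this lesson – is to generalize our simulation function so that the statistic itself becomes an argument:

def sampling_distribution_stat(stat, reps = 1000, n = 40, data = df_popn,
                               withreplace = True, var = 'price'):
    # stat, any function that reduces a series of values to a single number,
    # such as pd.Series.median or pd.Series.max
    simulated_statistics = []
    for i in range(reps):
        df_sample = data.sample(n, replace = withreplace)
        simulated_statistics.append(stat(df_sample[var]))
    # return our sampling distribution as a data frame
    return pd.DataFrame({'x': simulated_statistics})

# For example, a simulated sampling distribution for the sample median
sample_medians_n100 = sampling_distribution_stat(pd.Series.median, n = 100)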

See if you can perform a similar analysis for the maximum of a sample on your own. Pick out another statistic – maybe the median – and see what you can learn about the sampling distribution for that sample statistic. This process of exploring the data and the results of these simulations can answer many questions without relying on the underlying theory.