Sampling Distributions


We just defined a few statistics that we could use to summarize a sample. What would happen if we took a second sample from the same population? Realistically, we would get a different set of observations in our second sample, and from those observations we could recalculate the sample statistic. We probably would not get exactly the same value. Instead, we would likely get something similar to our first calculation. What can we say about the values that a sample statistic takes from sample to sample? We can answer this question by studying sampling distributions.

Sampling Distributions

A sampling distribution is a distribution of the possible values that a sample statistic can take from repeated random samples of the same sample size n when sampling with replacement from the same population.

How is this different from a population distribution? For a population distribution, we are interested in seeing the possible values for a variable from a single observation, along with the corresponding probability for each of the possible values. For a sampling distribution, we are no longer interested in the possible values of a single observation but instead want to know the possible values of a statistic calculated from a sample.

How is this different from a sample distribution? Although the names sampling and sample are similar, the distributions are pretty different. The sample distribution displays the values for a variable for each of the observations in the sample. From that sample distribution, we could calculate the statistic value for that specific sample. That statistic would then be a single observation in the sampling distribution.
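To make the distinction concrete, here is a minimal sketch of the three distributions. It assumes a made-up population of 5,000 right-skewed synthetic prices (not the Airbnb data used below); the names `population`, `sample`, and `sample_means` are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical population: 5,000 right-skewed "prices" (synthetic, not the Airbnb data)
population = pd.Series(rng.gamma(shape=2.0, scale=90.0, size=5000), name="price")

# Sample distribution: the 100 individual values observed in one random sample
sample = population.sample(100, replace=True, random_state=1)

# One statistic from that sample: a single observation in the sampling distribution
one_sample_mean = sample.mean()

# Sampling distribution: the same statistic recomputed over many repeated samples
sample_means = pd.Series(
    [population.sample(100, replace=True, random_state=i).mean() for i in range(1000)]
)

print("Spread of individual values:", population.std())
print("Spread of sample means:     ", sample_means.std())
```

The population and sample distributions describe individual observations, while `sample_means` describes a statistic, which is why its spread is much narrower.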

Visualizing a Sampling Distribution

Let’s see how to construct a sampling distribution below. In this example, we'll construct a sampling distribution for the mean price for a listing of a Chicago Airbnb.

First, we start with the population distribution.

df_popn['price'].hist()
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.title('Histogram of Chicago Airbnb Prices per Night')
plt.show()

Histogram of the population distribution of Chicago Airbnb prices.

df_popn['price'].hist(bins = 1000)
plt.xlim([-5, 1000])
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.title('Histogram of Chicago Airbnb Prices Per Night, Limited to Listings Less than $1000')
plt.show()

Histogram of the population distribution of Chicago Airbnb prices for Airbnbs that are less than $1000 per night.

From the population distribution, we gather a random sample, this time of size 100. We can visualize the sample distribution. Often, the sample distribution will closely mirror (look similar to) the population distribution, since it is made up of a subset of observations from the population. There will likely be minor differences in the distributions between the sample and the population.

df_sample = df_popn.sample(100, replace = True)
df_sample['price'].hist()
plt.xlim([-5, 1000])
plt.xlabel('Price per Night')
plt.ylabel('Frequency')
plt.title('Histogram of A Sample of Chicago Airbnb Prices Per Night, Limited to Less than $1000')
plt.show()

Histogram of a sample of 100 Chicago Airbnb prices.

Then, for that sample, we calculate our statistic.

df_sample['price'].mean()
183.77
df_sample['price'].hist()
plt.xlim([-5, 1000])
plt.xlabel('Price per Night')
plt.ylabel('Frequency')
plt.title('Histogram of A Sample of Chicago Airbnb Prices Per Night, Limited to Less than $1000')
plt.axvline(df_sample['price'].mean(), linestyle = 'dashed', color = 'k')
plt.show()

The added line here shows where the sample mean falls.

We can repeat this for a second sample.

df_sample = df_popn.sample(100, replace = True)
df_sample['price'].hist()
plt.xlim([-5, 1000])
plt.axvline(df_sample['price'].mean(), linestyle = 'dashed', color = 'k')
plt.xlabel('Price per Night')
plt.ylabel('Frequency')
plt.title('Histogram of A Sample of Chicago Airbnb Prices Per Night, Limited to Less than $1000')
plt.show()
print(df_sample['price'].mean())

The added line has moved in this graph and shows where the sample mean for the second sample falls.

196.35

We get a different statistic for our two samples (183.77 and 196.35). Is this a concern?

No, this is expected. We have slightly different observations in each sample, and so it's not surprising that we get slightly different means. Our two sample means (our two statistics) are similar in value to each other.

We can continue by repeating this for a third sample and for a fourth sample.

# Third sample (left panel), drawn from the population
plt.subplot(1,2,1)
df_sample = df_popn.sample(100, replace = True)
df_sample['price'].hist()
plt.xlabel('Price per Night')
plt.ylabel('Frequency')
plt.xlim([-5, 1000])
plt.axvline(df_sample['price'].mean(), linestyle = 'dashed', color = 'k')
# Fourth sample (right panel), drawn from the population
plt.subplot(1,2,2)
df_sample = df_popn.sample(100, replace = True)
df_sample['price'].hist()
plt.xlabel('Price per Night')
plt.ylabel('Frequency')
plt.xlim([-5, 1000])
plt.axvline(df_sample['price'].mean(), linestyle = 'dashed', color = 'k')
plt.suptitle('Histogram of A Sample of Chicago Airbnb Prices Per Night, Limited to Less than $1000')
plt.show()

Two more histograms display where sample means for two additional samples fall.

And so forth until we’ve calculated many sample statistics from each of our repeated samples. In essence, what we'll do is record where the black line is located for each of our samples. We could repeat this process many times, resulting in many samples and many resulting sample statistics.

simulated_means = []
for i in range(1000):
    # Draw a new sample of 100 listings from the population and record its mean
    df_sample = df_popn.sample(100, replace = True)
    simulated_means.append(df_sample['price'].mean())
simulated_means = pd.DataFrame(simulated_means)
simulated_means
          0
0    158.93
1    174.65
2    185.28
3    158.94
4    167.03
..      ...
995  171.57
996  162.54
997  184.29
998  159.83
999  174.21

1000 rows × 1 columns

A data frame is generated that consists of 1000 sample means from 1000 repeated samples from the population.

We can then visualize the distribution of all of these sample statistics. This is the sampling distribution for our sample statistic – the possible values that the sample statistic can take along with how likely each of those values is to occur.

simulated_means.hist()
plt.xlabel('Possible Sample Means from Samples of Size 100')
plt.ylabel('Frequency')
plt.title('Simulated Sampling Distribution of Sample Mean Chicago Airbnb Prices')
plt.show()

A histogram of simulated sample means from 1000 repeated random samples.

We can generate sampling distributions for statistics regardless of whether we are summarizing a quantitative or a categorical variable. Below, you can see code that is used to generate a sampling distribution for a categorical variable (room_type, specifically whether the listing is for an entire home or apartment). What is the sample statistic that is represented in this sampling distribution?

simulated_statistics = []
for i in range(1000):
    # Draw a new sample of 100 listings from the population
    df_sample = df_popn.sample(100, replace = True)
    # The mean of a True/False column is the proportion of True values
    simulated_statistics.append((df_sample['room_type'] == 'Entire home/apt').mean())
simulated_statistics = pd.DataFrame(simulated_statistics)
simulated_statistics.hist()
plt.xlabel('Possible Sample Proportions from Samples of Size 100')
plt.ylabel('Frequency')
plt.title('Simulated Sampling Distribution of Sample Proportions of Chicago Airbnbs that are Entire homes or apartments')
plt.show()

A histogram of simulated sample proportions summarizing a categorical variable from 1000 repeated random samples.

Simulating a Sampling Distribution for a Less Common Statistic

One of the nice features about simulation is that we can generate a sampling distribution for any statistic. For example, suppose that we want to know the lowest price accommodation (per night) if we randomly sample 10 Airbnb listings in Chicago. We can do that, using the code seen below.

simulated_mins = []
for i in range(1000):
    # Draw a new sample of 10 listings from the population and record its minimum price
    df_sample = df_popn.sample(10, replace = True)
    simulated_mins.append(df_sample['price'].min())
simulated_mins = pd.DataFrame(simulated_mins)
simulated_mins.hist()
plt.xlabel('Possible Minimum Chicago Airbnb Prices from a Sample of 10 Listings')
plt.ylabel('Frequency')
plt.title('Sampling Distribution of Sample Minimum Airbnb Prices for 10 Chicago Airbnbs')
plt.show()

A sampling distribution for the minimum value from each sample.

Or, imagine that we would want to select the second lowest priced accommodation per night, if we anticipate that we don’t want to stay at the absolute least expensive Airbnb but still want to be budget conscious.

df_sample = df_popn.sample(10, replace = True)
print(sorted(df_sample['price']))
# Index 1 of the sorted prices is the second lowest price
sorted(df_sample['price'])[1]

[58.0, 125.0, 151.0, 165.0, 168.0, 175.0, 196.0, 286.0, 343.0, 370.0]
125.0

simulated_seconds = []
for i in range(1000):
    # Draw a new sample of 10 listings and record its second lowest price
    df_sample = df_popn.sample(10, replace = True)
    simulated_seconds.append(sorted(df_sample['price'])[1])
simulated_seconds = pd.DataFrame(simulated_seconds)
simulated_seconds.hist()
plt.xlabel('Possible Second Smallest Chicago Airbnb Prices from a Sample of 10 Listings')
plt.ylabel('Frequency')
plt.title('Sampling Distribution of Sample Second Lowest Airbnb Prices for 10 Chicago Airbnbs')
plt.show()

A sampling distribution for the second smallest price from each sample of size 10.

What we've done is create a process that we can use to understand the variability of sample statistics from different samples. This process can be applied regardless of the type of variable and the type of statistic to be recorded, as long as the statistic is recorded as a single number for each sample.
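Because the process is the same for every statistic, it can be wrapped in one reusable function. The following is a sketch rather than the code used in this chapter: `simulate_sampling_distribution` is a hypothetical helper, and `prices` is a synthetic stand-in for a price column.

```python
import numpy as np
import pandas as pd

def simulate_sampling_distribution(values, statistic, n, reps=1000, seed=None):
    """Return `reps` values of `statistic`, each computed from a random
    sample of size `n` drawn with replacement from `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    results = []
    for _ in range(reps):
        draw = rng.choice(values, size=n, replace=True)  # one random sample
        results.append(statistic(draw))                  # one number per sample
    return pd.Series(results)

# Hypothetical stand-in for a population price column (synthetic values)
prices = np.random.default_rng(0).gamma(2.0, 90.0, size=5000)

means = simulate_sampling_distribution(prices, np.mean, n=100, seed=1)
mins = simulate_sampling_distribution(prices, np.min, n=10, seed=1)
second_lowest = simulate_sampling_distribution(
    prices, lambda x: np.sort(x)[1], n=10, seed=1
)
```

Any function that reduces a sample to a single number can be passed as `statistic`, which mirrors the requirement stated above.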

Simulating a Sampling Distribution from a Sample

Above, we saw how to generate a sampling distribution when we have the population available. We were able to generate repeated random samples from the population. We could then calculate our sample statistic for each of our samples in order to generate a sampling distribution.

In the real world, we rarely have access to the population distribution. Instead, we often want to make a statement about the population that is grounded in the information we have from a sample. So, how can we estimate a sampling distribution if we only have a sample?

We saw above that our sample distribution has very similar features to our population (assuming that our sample is representative of the population). We can use our sample as a substitute for the population when generating our sampling distribution.

We will draw a random sample from our sample (we’ll call this a resample). In this case, it is crucial that we sample with replacement and that each resample has the same size as the original sample. Matching the sample size ensures that the statistics calculated from our resamples are comparable to the original statistic. Sampling with replacement ensures that those statistics have some variability – that is, that we don’t simply reproduce the original sample each time.

To illustrate this, we’ll consider this sample of 5 observations.

df5 = df_popn['price'].sample(5)
df5
748     120
5228    155
4966    223
4757     79
186     169
Name: price, dtype: int64

If we take a resample of size 5 without replacement from our sample, then we would get this as our first resample:

df5.sample(5, replace = False)
186     169
4757     79
748     120
4966    223
5228    155
Name: price, dtype: int64

And as our second resample:

df5.sample(5, replace = False)
748     120
5228    155
4757     79
4966    223
186     169
Name: price, dtype: int64
    

And as our third resample:

df5.sample(5, replace = False)
4966    223
4757     79
5228    155
186     169
748     120
Name: price, dtype: int64
    

We get the same resample each time, although in a different order. This isn't happening just by chance for each of these three examples. For each resample, we would have the same sample, since we are sampling 5 observations from a set of 5 observations, and we don’t return an observation back into our pool once it’s been selected.

Now, what happens when we resample (again of size 5) from the original data but this time do so with replacement? That is, we could draw a single observation multiple times.

Our first resample might be this:

df5.sample(5, replace = True)
748     120
4966    223
5228    155
5228    155
186     169
Name: price, dtype: int64
    

With a different second resample:

df5.sample(5, replace = True)
748     120
748     120
748     120
4966    223
4966    223
Name: price, dtype: int64
    

And a different third resample:

df5.sample(5, replace = True)
4757     79
186     169
4966    223
5228    155
4757     79
Name: price, dtype: int64
    

All because we allow a resample to have the same observation repeated.
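The effect on the statistic itself can be checked directly. This sketch rebuilds the five prices shown above as a pandas Series and compares the resample means under each scheme; the `random_state` values are only there to make the run repeatable.

```python
import pandas as pd

# The five prices from the example above
df5 = pd.Series([120, 155, 223, 79, 169],
                index=[748, 5228, 4966, 4757, 186], name="price")

# Without replacement: every resample is a reordering, so the mean never changes
means_without = [df5.sample(5, replace=False, random_state=i).mean() for i in range(200)]

# With replacement: observations can repeat or be left out, so the mean varies
means_with = [df5.sample(5, replace=True, random_state=i).mean() for i in range(200)]

print("Distinct means without replacement:", len(set(means_without)))
print("Distinct means with replacement:   ", len(set(means_with)))
```

Without replacement there is only one distinct mean, so the simulated "sampling distribution" would have no variability at all.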

Typically, however, we have a few more observations than 5. What does this look like if we use our sample of Chicago Airbnb listings?

df_resample = df.sample(700, replace = True)
df_resample['price']
    238     72.0
    453    143.0
    629    618.0
    558    132.0
    212    148.0
           ...  
    411    786.0
    98     210.0
    412    130.0
    360    128.0
    40      70.0
    Name: price, Length: 700, dtype: float64

Our algorithm for generating a sampling distribution using resampling is:

  1. Resample from the original data with replacement, generating the same number of observations as in the original sample
  2. Calculate and save the statistic from the resample
  3. Repeat steps 1 and 2 for as many resamples as desired
  4. The list of statistics calculated from step 2 will serve as the simulated sampling distribution

This algorithm can be used for any possible sample statistic and with minimal assumptions, which is a distinct advantage of using resampling to simulate the sampling distribution.
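The four steps can be sketched as a short function. This is an illustration under stated assumptions: `resample_distribution` is a hypothetical helper name, and `original_sample` is a synthetic set of 700 prices standing in for the real sample.

```python
import numpy as np
import pandas as pd

def resample_distribution(sample, statistic, n_resamples=1000, seed=None):
    """Simulate a sampling distribution by resampling from `sample`."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    stats = []
    for _ in range(n_resamples):                # Step 3: repeat steps 1 and 2
        # Step 1: resample with replacement, same size as the original sample
        resample = rng.choice(sample, size=sample.size, replace=True)
        # Step 2: calculate and save the statistic from the resample
        stats.append(statistic(resample))
    # Step 4: the saved statistics form the simulated sampling distribution
    return pd.Series(stats)

# Hypothetical original sample of 700 prices (synthetic values)
original_sample = np.random.default_rng(0).gamma(2.0, 90.0, size=700)

resampled_medians = resample_distribution(original_sample, np.median, seed=1)
print(resampled_medians.describe())
```

The median is used here only to show that the statistic is interchangeable; passing `np.mean` or any other single-number summary works the same way.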

Simulating a Sampling Distribution using Underlying Theory

While we can always resample from our original sample to simulate a sampling distribution, there are some situations where we can also make use of underlying distributions to simulate a sampling distribution.

For a categorical variable, we can use a Binomial distribution to simulate a new sample.

For example, in the resampling scheme, we would use our original sample. In this instance, the probability that we would sample a listing that is for an entire home or apartment would be the proportion of such listings in the original sample. The same characteristic can be replicated with the Binomial distribution. Recall that the Binomial distribution occurs when we are sampling $n$ observations where each observation is independent and has probability $p$ of success.

df_popn['room_type'].value_counts(normalize = True)
    Entire home/apt    0.779140
    Private room       0.205112
    Shared room        0.009294
    Hotel room         0.006454
    Name: room_type, dtype: float64

This indicates that the probability $p$ of selecting a listing from the population that is an entire home or apartment is 0.779140. Suppose that I want to select 100 observations with replacement from this population. I could select the 100 listings and count the number of listings that correspond to an entire home or apartment. Because I sample with replacement, the probability of selecting a listing that is an entire home or apartment is 0.779140 for each draw.

For example, I could use the following code to draw 100 listings and then count the type of room for each listing.

df_popn['room_type'].sample(100, replace = True).value_counts()
    Entire home/apt    74
    Private room       24
    Hotel room          1
    Shared room         1
    Name: room_type, dtype: int64

Or I could specifically ask Python to sample 100 observations from a population where a proportion of 0.779140 has some characteristic (an Airbnb listing corresponding to an entire home or apartment).

from scipy.stats import binom
binom.rvs(100, 0.77914)
85

From this output, we could continue to use the count or calculate proportions to use as our statistics.
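Repeating the Binomial draw many times gives a simulated sampling distribution for the sample proportion. The sketch below uses the `size` argument of `binom.rvs` to generate 1000 counts at once; the `random_state` value is an arbitrary choice made only so the run is repeatable.

```python
import pandas as pd
from scipy.stats import binom

n = 100      # sample size
p = 0.77914  # population proportion of entire home/apt listings (from the output above)

# 1000 simulated samples: each entry is the count of "successes" out of n draws
simulated_counts = binom.rvs(n, p, size=1000, random_state=0)

# Convert counts to sample proportions
simulated_proportions = pd.Series(simulated_counts / n)

print(simulated_proportions.describe())
```

A histogram of `simulated_proportions` plays the same role as the histogram of sample proportions built earlier by repeatedly sampling listings.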

For a quantitative variable, it is a little more challenging to determine which distribution to use to simulate observations from. Some common options include a Normal or an exponential distribution for continuous variables and a Poisson or a Geometric distribution for discrete variables. In most cases, we would want to select a distribution that most closely matches the population distribution, which we approximate using the observed sample distribution. In some cases, we may have other theoretical distributions that we would choose to use in order to have valid results.
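As a rough sketch of this idea, the code below assumes an exponential distribution whose mean matches the observed sample mean. The exponential choice and the synthetic `observed` values are assumptions made purely for illustration; they are not a claim about which distribution suits the Airbnb prices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical observed sample of 700 prices (synthetic values)
observed = rng.exponential(scale=180.0, size=700)

# Assumed model: exponential with its mean set to the observed sample mean
scale_hat = observed.mean()

# Simulate new samples from the assumed distribution and record each sample mean
simulated_means = pd.Series(
    [rng.exponential(scale=scale_hat, size=observed.size).mean() for _ in range(1000)]
)

print(simulated_means.describe())
```

If the assumed distribution is a poor match for the population, the simulated sampling distribution may be misleading, which is why the choice of distribution matters.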

We will discuss possible options as needed during the following lessons.