Sampling Distribution for Two Populations


We’ve learned the most fundamental building block of statistical inference: the sampling distribution. We will continue returning to the sampling distribution throughout the rest of this module and the last module of the series.

We can certainly generate a sampling distribution for a sample statistic from a single population, as we have been for the past few pages. Many times we would like to understand more about a statistic that compares two different populations. For example, what is the difference in Airbnb prices between listings in Lake View and listings in Logan Square, two neighborhoods separated by the Chicago River?

Statistics Using Two Populations

When we are interested in comparing two different populations (or similarly, two different samples), we have a number of ways to compare the two groups. Recall that back in the Data Science Journey we discussed some of the ways we can start to measure the relationship between two variables or between two distributions. Most of these involve calculating the difference between the same statistic computed for each of the two groups.

For example, we may want to calculate the difference between the sample mean Airbnb price in the Lake View neighborhood and the sample mean Airbnb price in the Logan Square neighborhood ($\bar{x}_1 - \bar{x}_2$). This would allow us to determine whether the sample means (a measure of center) differ between the two neighborhoods.

Since we know that the distribution of Airbnb prices in Chicago is not symmetric, we might prefer a different measure of center. What measure would that be? When we have non-symmetric distributions, the median is a better measure of center. In this case, we might instead prefer to calculate the difference in sample median Airbnb prices between the Lake View and Logan Square neighborhoods.

We could choose to calculate the difference between other characteristics of the two distributions. For example, we could calculate the difference in the sample minimum Airbnb prices (what might be the least expensive Airbnb in each neighborhood?) or the sample maximum Airbnb prices (the most extravagant lodgings) if we are interested in knowing more about the extreme costs in the two neighborhoods. If we are interested in the variability of the two neighborhoods' prices, we could choose to calculate the differences between the sample standard deviations or the sample IQRs, depending on which measure of variability is more appropriate.

While many of our statistics focus on the difference between two characteristics, we aren’t limited to calculating differences using subtraction. We may instead be interested in the ratio of (that is, the result of dividing) two statistics. For example, we could calculate that the sample minimum price for Lake View is 1.15 times (or 15%) more expensive than the sample minimum for Logan Square.

If we are instead summarizing a categorical variable, we could also consider the difference in proportions between the two samples as our statistic of interest ($\hat{p}_1-\hat{p}_2$), for example. We will revisit the theoretical properties associated with this statistic at the end of this page.

df_logan = df[df['neighbourhood_cleansed'] == 'Logan Square']
df_lake = df[df['neighbourhood_cleansed'] == 'Lake View']
df_logan.shape
    (50, 334)
df_lake.shape
    (54, 334)
df_logan['price'].min()
    41.0
df_lake['price'].min()
    47.0
47/41
    1.146341463414634
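The difference in sample proportions mentioned above can be computed in the same style. As a minimal sketch using hypothetical price values (illustration only, not the actual Airbnb data), here is the difference in the proportions of listings priced above \$100:

```python
import numpy as np

# Hypothetical samples of nightly prices (illustration only, not the Airbnb data)
lake_prices = np.array([47, 60, 85, 120, 150, 210, 95, 130, 75, 180])
logan_prices = np.array([41, 55, 70, 110, 90, 160, 65, 100, 85, 140])

# Sample proportion of listings priced above $100 in each group
p_hat_lake = (lake_prices > 100).mean()
p_hat_logan = (logan_prices > 100).mean()

# Difference in sample proportions, p-hat_1 - p-hat_2
diff_props = p_hat_lake - p_hat_logan
```

Here the comparison (a price above \$100) turns the numerical price variable into a categorical yes/no variable, so the mean of the resulting True/False values is exactly the sample proportion.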

Sampling Distributions for Two Populations

For all of these situations, we can simulate the sampling distribution for our statistic of interest, using the data for both populations if we have it or using a resampling method applied to both samples. The process for simulating the sampling distribution is very similar to the process that we followed before. The biggest differences when moving from one population to two are that we now perform a separate sampling step for each group and that we calculate a statistic that compares the two samples.

For example, let’s consider generating a sampling distribution for the difference in median Airbnb prices for Lake View and Logan Square (Lake View - Logan Square). We'll use the population data for Lake View and Logan Square. We can do this using the code below.

df_logan = df_popn[df_popn['neighbourhood_cleansed'] == 'Logan Square']
df_lake = df_popn[df_popn['neighbourhood_cleansed'] == 'Lake View']
df_logan['price'].describe()
    count     451.000000
    mean      155.483370
    std       151.273135
    min        10.000000
    25%        85.000000
    50%       115.000000
    75%       166.000000
    max      1499.000000
    Name: price, dtype: float64
    
df_lake['price'].describe()
    count     614.000000
    mean      163.745928
    std       138.988283
    min        10.000000
    25%        75.000000
    50%       120.000000
    75%       199.750000
    max      1109.000000
    Name: price, dtype: float64
    
def sampling_distr_diff(df1 = df_lake, df2 = df_logan, var = 'price', n1 = 50, n2 = 50,
                        withreplace = True, reps = 1000):
    # INPUT
    # reps, number of samples to collect, set to 1000 by default
    # n1 and n2, sample sizes for each group, set to 50 by default
    # df1 and df2 to use as populations, set to df_lake and df_logan by default
    # withreplace, whether to sample with replacement, set to True by default
    # var, variable of interest, set as price by default
    simulated_statistics = []
    for i in range(reps):
        # For each repetition, collect a random sample of size n1 or n2 from each group,
        # calculate the difference of sample medians, and add to simulated_statistics
        df_sample1 = df1.sample(n1, replace = withreplace)
        df_sample2 = df2.sample(n2, replace = withreplace)
        simulated_statistics.append(df_sample1[var].median() - df_sample2[var].median())
    simulated_statistics = pd.DataFrame({'x': simulated_statistics})
    # returns our sampling distribution as a data frame
    return simulated_statistics

samp_dist = sampling_distr_diff()
samp_dist['x'].hist()
plt.xlabel('Possible Difference in Sample Medians')
plt.ylabel('Frequency')
plt.title('Difference in Sample Median Prices of Airbnbs in Lake View - Logan Square neighborhoods')
plt.show()

Histogram of a sampling distribution for the difference in medians between Lake View and Logan Square neighborhoods.

How would you describe this sampling distribution? What are the three important features of the sampling distribution: its shape, center, and spread?

It looks like the distribution is symmetric. It's hard to say that it's definitively bell-shaped or Normally distributed, since it appears to have limited tails, but it does seem to be fairly symmetric. It is centered around \$40 or so and has values from around -\$40 to around \$125. This indicates that we simulated samples in which the Lake View median Airbnb price ranged from \$40 less expensive to \$125 more expensive than the Logan Square median.
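To put numbers to that description, we can summarize the simulated statistics directly rather than reading the histogram by eye. A sketch using a small stand-in data frame (hypothetical values; the real samp_dist comes from sampling_distr_diff above):

```python
import pandas as pd

# Stand-in for the simulated sampling distribution (hypothetical values)
samp_dist = pd.DataFrame({'x': [-40, -10, 5, 20, 35, 40, 45, 60, 80, 125]})

# Center: the median of the simulated differences
center = samp_dist['x'].median()

# Spread: the standard deviation and the range of the simulated differences
spread = samp_dist['x'].std()
low, high = samp_dist['x'].min(), samp_dist['x'].max()
```

The same three calls applied to the real samp_dist give the center, spread, and range that the histogram only shows approximately.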

Properties of Sampling Distributions for Two Populations

How do the properties of the sampling distribution change now that we are working with statistics that involve two populations (instead of one)? We can consider the features that we control in this sampling distribution:

  • The number of repetitions
  • The size for each sample
  • The calculated sample statistic

Number of Repetitions

Again, what happens as we change the number of repetitions to generate the sampling distribution?

samp_dist_100reps = sampling_distr_diff(reps = 100)
samp_dist_1000reps = sampling_distr_diff(reps = 1000)
samp_dist_10000reps = sampling_distr_diff(reps = 10000)
plt.subplot(1,3,1)
samp_dist_100reps['x'].hist()
plt.ylabel('Frequency')
plt.title('100 repetitions')
plt.subplot(1,3,2)
samp_dist_1000reps['x'].hist()
plt.title('1000 repetitions')
plt.subplot(1,3,3)
samp_dist_10000reps['x'].hist()
plt.title('10000 repetitions')
plt.suptitle('Histograms of Difference of Sample Median Airbnb Prices (Lake View - Logan Square)')
plt.show()

We can use these graphs to determine how the number of repetitions affects the sampling distribution.

What about when we change the sample sizes for the two groups?

samp_dist_n4 = sampling_distr_diff(n1 = 4, n2 = 4)
samp_dist_n16 = sampling_distr_diff(n1 = 16, n2 = 16)
samp_dist_n100 = sampling_distr_diff(n1 = 100, n2 = 100)
plt.subplot(1,3,1)
samp_dist_n4['x'].hist()
plt.ylabel('Frequency')
plt.title('n = 4')
plt.subplot(1,3,2)
samp_dist_n16['x'].hist()
plt.title('n = 16')
plt.subplot(1,3,3)
samp_dist_n100['x'].hist()
plt.title('n = 100')
plt.suptitle('Histograms of Difference of Sample Median Airbnb Prices (Lake View - Logan Square)')
plt.show()

We can use these graphs to determine how the sample sizes for the groups affect the sampling distribution.

The same features and properties that we saw for one population appear here again. The number of repetitions does not substantially alter the sampling distribution. However, as the sample sizes increase, the distribution becomes more bell-shaped and symmetric and less variable, with the center staying in approximately the same location.
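We can also check the sample-size effect on spread with a self-contained simulation. This sketch uses synthetic, right-skewed stand-in populations drawn from gamma distributions (hypothetical values, not the Airbnb data), and the helper simulate_diff_medians is our own illustration, not a function from the lesson:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic, right-skewed stand-ins for the two neighborhood price populations
pop1 = rng.gamma(shape=2.0, scale=80.0, size=600)
pop2 = rng.gamma(shape=2.0, scale=75.0, size=450)

def simulate_diff_medians(n, reps=2000):
    # Simulate the sampling distribution of the difference in sample medians
    stats = []
    for _ in range(reps):
        s1 = rng.choice(pop1, size=n, replace=True)
        s2 = rng.choice(pop2, size=n, replace=True)
        stats.append(np.median(s1) - np.median(s2))
    return np.array(stats)

small_n = simulate_diff_medians(n=4)
large_n = simulate_diff_medians(n=100)
```

Comparing small_n.std() to large_n.std() shows the larger sample sizes producing a noticeably less variable sampling distribution, matching what the histograms suggest.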

Sample Sizes

What should we use as the sample sizes for each of our two groups? Do we have to use the same sample sizes for both groups?

Again, we want our sampling distribution to be helpful in summarizing the possible values for a sample statistic based on the two samples that we are interested in comparing. We would like the sampling distribution to mimic the properties of the sample(s) that we are hoping to summarize, so we want to have the same features in our simulation.

We do not need to have the same sample sizes for both groups.

For example, we might choose to use two different sample sizes in a resampling-based sampling distribution, since the original samples have different sizes.

samp_dist_truen = sampling_distr_diff(n1 = 54, n2 = 50)
samp_dist_truen['x'].hist()
plt.xlabel('Possible Difference of Sample Medians, Lake View - Logan Square')
plt.ylabel('Frequency')
plt.title('Sampling Distribution for Difference of Median Airbnb Prices')
plt.show()

The estimated sampling distribution based on our samples for Chicago Airbnbs.

Using the Central Limit Theorem for Two Populations

We can use theoretical properties about Normal distributions along with our previous Central Limit Theorem properties to say something theoretical about the sampling distribution for the difference of two sample means ($\bar{x}_1 - \bar{x}_2$).

Recall that the Central Limit Theorem says that:

  • The sampling distribution of $\bar{X}$ is Normally distributed with a mean of $\mu$ and a standard deviation of $\frac{\sigma}{\sqrt{n}}$ if:
      • The population the observations come from has a mean of $\mu$ and a standard deviation of $\sigma$,
      • The sample is taken with replacement, and
      • At least one of the following holds: the sample size is large enough (generally accepted to be at least 25) or the population is Normally distributed.

We also relaxed these conditions slightly to allow the sample to be taken without replacement if:

  • The sample size is less than 10% of the population size and
  • The sample is randomly generated

Because we now have two populations, we’ll add a little bit of notation. We’ll add a subscript (a little number to the right) to indicate whether a component is referring to the first population or the second population. As a quick note, how you define population 1 and population 2 is not too important; you can define either population as 1 or 2. But, you will want to be consistent in your definition.

As long as the sampling distributions of $\bar{X}_1$ and $\bar{X}_2$ both meet the conditions for the Central Limit Theorem and are therefore Normally distributed, we can say that $\bar{X}_1 \sim N(\mu_1, \frac{\sigma_1}{\sqrt{n_1}})$ and $\bar{X}_2 \sim N(\mu_2, \frac{\sigma_2}{\sqrt{n_2}})$. Then, we can rely on the properties for adding or subtracting Normally distributed random variables.

We can say that $\bar{X}_1 - \bar{X}_2 \sim N(\mu_1 - \mu_2, \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}})$ if:

  • The sample size $n_1$ is at least 25,
  • Sample 1 is randomly sampled with replacement from the population,
  • The sample size $n_2$ is at least 25,
  • Sample 2 is randomly sampled with replacement from the population,
  • And samples 1 and 2 are independent samples.

We know that sampling with replacement from the population is uncommon. If we sample without replacement instead, then we need to add our two additional conditions for each sample:

  • Sample 1 is randomly generated from Population 1
  • The sample size $n_1$ is less than 10% of the population size of population 1
  • Sample 2 is randomly generated from Population 2
  • The sample size $n_2$ is less than 10% of the population size of population 2

For example, let's consider the sampling distribution for the difference in sample mean Airbnb prices for Lake View - Logan Square neighborhoods. This means that we are defining Lake View to be population 1 and Logan Square to be population 2.

We can find the population mean and population standard deviation for each of our two neighborhoods from our population data.

  • $\mu_\text{Lake} = 163.75$
  • $\sigma_\text{Lake} = 138.99$
  • $n_\text{Lake} = 54$
  • $\mu_\text{Logan} = 155.48$
  • $\sigma_\text{Logan} = 151.27$
  • $n_\text{Logan} = 50$

Then, we know that the sampling distribution for the difference in the sample means should be: $\bar{X}_1 - \bar{X}_2 \sim N(8.27, 28.56)$
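These values come from plugging the rounded population values from the list above into the formula; a quick arithmetic check:

```python
import math

# Rounded population values from the bullet list above
mu_lake, sigma_lake, n_lake = 163.75, 138.99, 54
mu_logan, sigma_logan, n_logan = 155.48, 151.27, 50

# Mean of the sampling distribution: mu_1 - mu_2
mean_diff = mu_lake - mu_logan

# Standard deviation: sqrt(sigma_1^2 / n_1 + sigma_2^2 / n_2)
sd_diff = math.sqrt(sigma_lake**2 / n_lake + sigma_logan**2 / n_logan)
```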

And we can confirm this using simulation.

xbar_lake_less_logan = []
for i in range(1000):
    # Sample from each neighborhood and record the difference in sample means
    df_logan_sample = df_logan.sample(50, replace = True)
    df_lake_sample = df_lake.sample(54, replace = True)
    diff_means = df_lake_sample['price'].mean() - df_logan_sample['price'].mean()
    xbar_lake_less_logan.append(diff_means)
xbar_lake_less_logan = pd.DataFrame({'x': xbar_lake_less_logan})
xbar_lake_less_logan['x'].hist()
plt.xlabel('Possible Differences in Sample Mean Airbnb Prices, Lake View - Logan Square')
plt.ylabel('Frequency')
plt.title('Histogram of Simulated Sampling Distribution')
plt.show()

A histogram of the simulated sampling distribution allows us to observe plausible values for the difference in sample means between the two populations.

xbar_lake_less_logan['x'].describe()
    count    1000.000000
    mean        9.194654
    std        27.857065
    min       -98.597778
    25%        -9.135556
    50%         8.751852
    75%        27.485185
    max       106.177037
    Name: x, dtype: float64
  

Theoretical Sampling Distribution for the Difference of Two Proportions

What about proportions? Can we do the same thing for the sampling distribution for the difference in two sample proportions that we previously did for means? The answer is again yes! We can combine the Central Limit Theorem for proportions with the same rules for combining Normal distributions.

The result is that $\hat{p}_1 - \hat{p}_2 \sim N(p_1 - p_2, \sqrt{\frac{p_1 \times (1-p_1)}{n_1} + \frac{p_2 \times (1-p_2)}{n_2}})$ if:

  • $n_1 \times p_1$ and $n_1 \times (1-p_1)$ are both at least 10
  • Sample 1 is randomly generated without replacement from the population
  • Sample size 1 is less than 10% of the size of population 1
  • $n_2 \times p_2$ and $n_2 \times (1-p_2)$ are both at least 10
  • Sample 2 is randomly generated without replacement from the population
  • Sample size 2 is less than 10% of the size of population 2
  • And the two samples are independent of each other.
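As an illustration of the formula, here is a sketch with hypothetical proportions and sample sizes (chosen for this example, not values from the Airbnb data):

```python
import math

# Hypothetical population proportions and sample sizes (illustration only)
p1, n1 = 0.60, 100
p2, n2 = 0.45, 120

# Check the success/failure conditions: n*p and n*(1-p) are both at least 10
conditions_met = all(v >= 10 for v in (n1*p1, n1*(1-p1), n2*p2, n2*(1-p2)))

# Mean and standard deviation of the sampling distribution of p-hat_1 - p-hat_2
mean_diff = p1 - p2
sd_diff = math.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
```

When the conditions are met, the theoretical sampling distribution here would be approximately Normal with the computed mean and standard deviation, just as it was for the difference of two sample means.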