Calculating Probability for Statistics


Sampling distributions are helpful for many reasons. One of the most useful is that they help us understand the behavior of sample statistics. In this section, we’ll combine what we know about sampling distributions with our earlier probability knowledge to answer questions about the anticipated behavior of sample statistics. In the next module, we’ll see additional uses of the sampling distribution for questions specific to statistical inference, which we’ll describe in more detail in the next section.

Probability Questions

Airbnb started as a way to combat high prices in large cities by encouraging local hosts to list available space for rent. Now, there are Airbnb hosts that are not local, including some properties that are managed by large companies. How much has Airbnb strayed from the original intention?

We’ll consider samples of 26 Airbnb properties. For each sample, we’ll record the proportion of hosts that are local to Chicago. What is the probability that, in a sample of 26 Airbnb listings, 70% or more are from local hosts? What about the probability that 70% or less are hosted by locals? What about the probability that between 55% and 70% are local hosts?

The first step in answering these questions is to generate our sampling distribution. That is, what are the possible values for the sample proportion of local hosts from a sample of 26 listings? To do this, we'll remove any hosts that haven't listed their location. We'll also count a host as local only if the listed location is exactly 'Chicago, IL', so hosts from the Chicago suburbs are treated as non-local.

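import pandas as pd
import matplotlib.pyplot as plt

# df_popn is assumed to be the population data frame of listings loaded earlier
df_host = df_popn['host_location']
df_host = df_host.dropna()  # drop hosts with no listed location
local_hosts = []
for i in range(1000):
    # proportion of local hosts in one random sample of 26 listings
    local_hosts.append((df_host.sample(26, replace = True) == 'Chicago, IL').mean())
local_hosts = pd.DataFrame({'x': local_hosts})
local_hosts['x'].hist()
plt.xlabel('Possible Sample Proportion of Local Hosts, Sample Size of 26')
plt.ylabel('Frequency')
plt.title('Histogram of Sampling Distribution of Proportion of Local Hosts of Chicago Airbnbs')
plt.show()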

Histogram of simulated proportions of local hosts.
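The loop above can also be wrapped in a reusable helper, which is convenient when the same simulation is repeated for different sample sizes. The following is a minimal sketch, not the course's own code: `simulate_proportions` and `toy_hosts` are hypothetical names, and `toy_hosts` is a small synthetic stand-in for `df_popn['host_location']`, not the real Airbnb data.

```python
import numpy as np
import pandas as pd

def simulate_proportions(hosts, n, target="Chicago, IL", reps=1000, seed=0):
    """Return `reps` simulated sample proportions of `target` in samples of size n."""
    rng = np.random.default_rng(seed)
    # True where the host location matches the target, after dropping missing values
    values = (hosts.dropna() == target).to_numpy()
    # One row per simulated sample, drawn with replacement
    draws = rng.choice(values, size=(reps, n), replace=True)
    return pd.Series(draws.mean(axis=1), name="x")

# Synthetic stand-in for the host locations (an assumption, NOT the real data):
# 77 local hosts, 23 suburban hosts, and 10 hosts with no listed location.
toy_hosts = pd.Series(["Chicago, IL"] * 77 + ["Evanston, IL"] * 23 + [None] * 10)

props_26 = simulate_proportions(toy_hosts, n=26)
print(props_26.describe())
```

Drawing all of the samples in one call avoids the explicit loop, and fixing the seed makes the simulated probabilities reproducible from run to run.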

Then, to estimate each probability, we can calculate the proportion of our simulated samples in which 70% or more of the listings had a local host. To do this, we'll use a logical statement; taking the mean of the resulting True/False values gives the proportion that are True.

(local_hosts['x'] >= 0.7).mean()
0.821

For 70% or less being hosted by a local host, we can adjust our logical statement just slightly.

(local_hosts['x'] <= 0.7).mean()
0.179

For between 55% and 70%, we can again adjust our logical statement. Note that the strict inequalities exclude any samples at exactly 55% or exactly 70%.

((local_hosts['x'] > 0.55) & (local_hosts['x'] < 0.70)).mean()
0.176
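The first two estimates add to exactly 1 (0.821 + 0.179), which is worth a quick check. With 26 listings, a sample proportion must be a multiple of 1/26, so no sample can land exactly on 70%, and the events "70% or more" and "70% or less" do not overlap. A minimal sketch of that arithmetic, using only the standard library (the helper name `is_attainable` is hypothetical):

```python
from fractions import Fraction

def is_attainable(proportion, n):
    """Return True if `proportion` equals k/n for some whole number k."""
    return (Fraction(proportion) * n).denominator == 1

cutoff = Fraction(7, 10)  # exact representation of 70%
print(is_attainable(cutoff, 26))   # False: 0.7 * 26 = 18.2 listings
print(is_attainable(cutoff, 100))  # True: 0.7 * 100 = 70 listings

# The closest attainable proportions on either side of 70% when n = 26
below, above = Fraction(18, 26), Fraction(19, 26)
print(float(below), float(above))
```

When the cutoff is attainable, as with samples of 100, a sample at exactly 70% is counted in both events, so the two estimates can add to more than 1.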

How do these numbers change if instead of using 26 Airbnb properties, we collect 100 properties in our sample?

We can calculate the answers to all of these questions using our sampling distributions.

local_hosts_100 = []
for i in range(1000):
    # proportion of local hosts in one random sample of 100 listings
    local_hosts_100.append((df_host.sample(100, replace = True) == 'Chicago, IL').mean())
local_hosts_100 = pd.DataFrame({'x': local_hosts_100})
local_hosts_100['x'].hist()
plt.xlabel('Possible Sample Proportion of Local Hosts, Sample Size of 100')
plt.ylabel('Frequency')
plt.title('Histogram of Sampling Distribution of Proportion of Local Hosts of Chicago Airbnbs')
plt.show()

Histogram of simulated proportions of local hosts for samples of size 100.

print('Probability at least 70%: ', (local_hosts_100['x'] >= 0.7).mean())
print('Probability at most 70%: ', (local_hosts_100['x'] <= 0.7).mean())
print('Probability between 55% and 70%: ', ((local_hosts_100['x'] > 0.55) & (local_hosts_100['x'] < 0.70)).mean())
Probability at least 70%: 0.988
Probability at most 70%: 0.026
Probability between 55% and 70%: 0.012

Notice that the sampling distribution is considerably less variable when each sample has 100 observations. The calculated probabilities are also quite different from our earlier calculations based on 26 observations: far more of the sample proportions now fall at or above 70%.
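The drop in variability matches the usual formula for the standard error of a sample proportion, $\sqrt{p(1-p)/n}$. The sketch below compares the two sample sizes; the value `p = 0.77` is a hypothetical population proportion chosen only for illustration, not a figure from the Airbnb data. The ratio of the two standard errors does not depend on `p`.

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion for samples of size n."""
    return math.sqrt(p * (1 - p) / n)

p = 0.77  # hypothetical population proportion of local hosts (an assumption)
se_26 = se_proportion(p, 26)
se_100 = se_proportion(p, 100)

print(f"SE with n = 26:  {se_26:.4f}")
print(f"SE with n = 100: {se_100:.4f}")
print(f"Ratio: {se_26 / se_100:.2f}")  # equals sqrt(100 / 26), whatever p is
```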

What proportion of out-of-towners hosting the properties would be too many for Airbnb to still be considered a local business? Can you determine the probability that a sample of 26 Airbnbs would exceed your threshold for non-local hosts? See if you can repeat the analyses for this new situation.

One of the benefits of using simulation to generate sampling distributions is that we aren’t limited by the assumptions needed for the Central Limit Theorem. We also aren’t limited to statistics that follow predefined distributions.

In this case, we could also use the Central Limit Theorem to determine the theoretical distribution for the sampling distribution of the sample proportion of non-local hosts, provided that it meets the required assumptions.
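As a rough sketch of what that could look like, the code below checks a common rule of thumb for the Normal approximation (at least 10 expected successes and 10 expected failures) and then approximates a tail probability. The threshold of 10, the helper names, and the population proportion `p = 0.77` are all assumptions made for illustration; the true proportion would come from the population data.

```python
import math
from scipy.stats import norm

def clt_conditions_met(p, n, minimum=10):
    """Rule-of-thumb check: expected successes and failures are both >= minimum."""
    return n * p >= minimum and n * (1 - p) >= minimum

def normal_tail_prob(p, n, cutoff):
    """Approximate P(sample proportion >= cutoff) with a Normal distribution."""
    se = math.sqrt(p * (1 - p) / n)
    return norm.sf(cutoff, loc=p, scale=se)

p = 0.77  # hypothetical population proportion (an assumption)
for n in (26, 100):
    print(n, clt_conditions_met(p, n), round(float(normal_tail_prob(p, n, 0.70)), 3))
```

Under this assumed `p`, samples of 26 fail the check (only about 6 expected non-local hosts), which is one situation where the simulated sampling distribution is the safer tool.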

Percentile Questions

Once we determine the sampling distribution, we can also answer other questions about possible values for our sample statistic. For example, what is the 15th percentile of the proportion of local hosts in samples of size 26? That is, we want to determine the value (quantile) such that 15% of the samples would have a sample proportion less than or equal to this value.

import numpy as np
np.quantile(local_hosts['x'], 0.15)
0.6923076923076923

We may instead be interested in finding the 85th percentile.

np.quantile(local_hosts['x'], 0.85)
0.8461538461538461

The 15th and 85th percentiles together allow us to determine the middle 70% of the sample proportions of local hosts. About 70% of the simulated sample proportions fall between the 15th and 85th percentiles, with about 15% in each tail.
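Both percentiles can be requested in a single call, and the claim about the middle 70% can be checked directly. The sketch below uses a synthetic sampling distribution generated with a hypothetical population proportion of 0.77 rather than the real `local_hosts` data, so the printed values will not match the ones above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for local_hosts['x']: 1000 sample proportions with n = 26,
# drawn using a hypothetical population proportion of 0.77 (an assumption).
props = rng.binomial(n=26, p=0.77, size=1000) / 26

# Passing a list of probabilities returns both quantiles at once
lower, upper = np.quantile(props, [0.15, 0.85])
inside = ((props >= lower) & (props <= upper)).mean()

print(f"Middle 70% runs from {lower:.3f} to {upper:.3f}")
print(f"Share of samples in that range: {inside:.3f}")
```

Because sample proportions take only a limited set of values, many samples tie at the percentile values, so the share inside the range is typically somewhat more than 70%.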

Again, we can answer these questions using the simulated sampling distribution. Alternatively, we can use theoretical properties of the sampling distribution that we expect using the Central Limit Theorem.

Sample Size Calculations

So far, we have been able to answer our questions with the simulated sampling distribution. There is one additional type of question, however, that does rely on the Central Limit Theorem and the corresponding Normal distribution.

In this type of question, you are told a percentile of the sampling distribution and asked to find the sample size. Alternatively, you may be told the standard error and asked to determine the sample size.
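For the standard error version, the calculation only needs the relationship between the standard error, the population standard deviation, and the sample size. As a worked illustration with made-up numbers (a population standard deviation of 16 and a standard error of 2, both hypothetical):

$SE = \frac{\sigma}{\sqrt{n}} \Rightarrow \sqrt{n} = \frac{\sigma}{SE} \Rightarrow n = \left(\frac{\sigma}{SE}\right)^2 = \left(\frac{16}{2}\right)^2 = 64$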

For example, suppose that you are looking at booking a last-minute trip to Chicago. You would like to know the mean number of available days for an Airbnb in the next 30 days. You are told that the population mean is 5 days, the 90th percentile for the sample mean is 8 days, and the population standard deviation is 16 days. Can you use this information to determine the sample size?

To answer this question, it can help to list out what we know:

  • $\mu = 5$
  • $\sigma = 16$
  • 90th percentile is 8

If we assume a Normal distribution, we can determine the z-score that corresponds to the 90th percentile: 1.28.

import scipy.stats
scipy.stats.norm.ppf(0.9)
1.2815515655446004

We can then use this z-score to find the standard error, and from it the sample size, needed so that 8 days corresponds to the 90th percentile. Because the standard error of the sample mean is $\sigma / \sqrt{n}$, we can solve for $n$ directly.

$z = \frac{x - \mu}{\sigma / \sqrt{n}} \Rightarrow 1.28 = \frac{8 - 5}{16/\sqrt{n}} \Rightarrow \sqrt{n} = \frac{1.28 \times 16}{8-5} \Rightarrow n \approx 46.6$

(1.28 * 16 / (8-5)) **2
46.60337777777779

Recall that our sample size $n$ must be a whole number. In this case, we would conclude that this sampling distribution of sample means was generated using a sample size of approximately 47 Airbnb listings.
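The same steps can be collected into a small function. This is a sketch under the same Normal assumption as above; the function name `sample_size_from_percentile` is hypothetical. It uses the unrounded z-score, so the result differs slightly from the hand calculation with 1.28 but rounds to the same whole number.

```python
from scipy.stats import norm

def sample_size_from_percentile(mu, sigma, value, percentile):
    """Solve value = mu + z * sigma / sqrt(n) for n, assuming a Normal sampling distribution."""
    z = norm.ppf(percentile)  # z-score for the requested percentile
    return (z * sigma / (value - mu)) ** 2

n_raw = float(sample_size_from_percentile(mu=5, sigma=16, value=8, percentile=0.90))
n = round(n_raw)  # nearest whole number of listings
print(f"n is about {n_raw:.2f}, so roughly {n} listings")
```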