Population Parameters and Sample Statistics


Our previous module focused on using our population data to make statements about samples. Specifically, we were able to calculate the variability of sample statistics and were able to calculate probabilities from the sampling distributions.

We are going to transition from our previous discussion to the situation where we have a sample and would like to make a statement about the corresponding population. We will use many of the theoretical foundations from the previous module in this process and apply it to a more realistic scenario.

Before we turn to the approaches to inference, in which we use our sample data to make a statement about the population, we will first describe the setup of the scenario.

Parameter and Statistic Definitions

Inference is based on using samples to make statements about the population. What do we use to do this?

From the samples, we calculate statistics, or summary measures of characteristics from the sample. In other words, a statistic is a number that has been calculated using sample data. Generally, a statistic is known, since we calculate it from known sample data. However, statistics can also be random variables (or unknown quantities with a possible distribution) if we have not yet generated the corresponding sample.

If we had census data from a population available to us, we could calculate parameters, or corresponding summary measures of characteristics from the population. The parameter is the true but often unknown value that we would ideally like to know. Since populations are generally fixed, a parameter is generally also a fixed number. However, unless we know and have access to the full population data, we won't know what the fixed number is.
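The distinction can be sketched with a small simulation. The population below is made up for illustration (it is not the Airbnb data), and the names `population`, `pop_median`, and `sample_medians` are hypothetical. The parameter is computed once from the full population, while the statistic is recomputed for each new sample and so changes from sample to sample.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: nightly prices for 5,000 made-up listings.
population = [round(random.lognormvariate(5, 0.6)) for _ in range(5000)]

# Parameter: computed once from the full population, so it is a fixed number.
pop_median = statistics.median(population)

# Statistic: computed from a sample, so it changes whenever the sample changes.
sample_medians = [
    statistics.median(random.sample(population, 700)) for _ in range(3)
]

print("population median (parameter):", pop_median)
print("sample medians (statistics):", sample_medians)
```

Before a sample is drawn, its median is a random variable. Once the sample is in hand, the median is a known number.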

Common Parameters and Statistics

The definitions of statistics and parameters can be abstract, so it can help to provide examples for these quantities.

Below you can find a table of common statistics and corresponding parameters for both categorical and quantitative variables. Symbols are provided for those with commonly accepted symbols.

| Value | Statistic | Parameter |
|---|---|---|
| For Categorical Variables | | |
| Proportion | $\hat{p}$ or $\hat{\pi}$ | $p$ or $\pi$ |
| For Quantitative Variables | | |
| Mean | $\bar{x}$ or $\hat{\mu}$ | $\mu$ |
| Standard Deviation | $\hat{\text{sd}}(x)$ or $\hat{\sigma}$ or $s$ | $\sigma$ |
| Variance | $\hat{\text{Var}}(X)$ or $\hat{\sigma}^2$ | $\sigma^2$ |
| Median | $m$ or $\hat{M}$ | $M$ |
| Minimum | | |
| Maximum | | |

Common summary measures for data, including symbols for values from a sample and from a population.
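As a reminder of how a few of these quantities are computed, the formulas below assume a sample of $n$ observations $x_1, \dots, x_n$ and a population of $N$ values. The $n - 1$ divisor in the sample variance is one common convention (it is the default in pandas `.var()` and `.std()`); the table above does not say which divisor it intends, so treat that choice as an assumption. The sample standard deviation is $s = \sqrt{s^2}$, and likewise $\sigma = \sqrt{\sigma^2}$.

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i
$$

$$
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
$$

$$
\hat{p} = \frac{\text{number of sample observations in the category}}{n}, \qquad p = \frac{\text{number of population observations in the category}}{N}
$$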

Other common quantities, specifically for quantitative variables, include the proportion of observations that are equal to one specific value.

Notationally, many of the sample statistics will have a "hat" added to the top of the symbol for the parameter. Many (but not all) of the parameters use a Greek letter to represent that they are calculated for a population.

If you intend to use a statistic or parameter that does not have a commonly accepted symbol, it is crucial to communicate what any symbol you choose represents. Indeed, as we continue forward, we will generally define each symbol in context with a brief description of what it represents, e.g. sample mean for $\bar{x}$.

Examples of Parameters and Statistics

To further clarify the difference between parameters and statistics, let's look at some examples below.

import pandas as pd
df_popn = pd.read_csv('chicago_listings.csv')
df_june = pd.read_csv('june22_listings.csv')
df_sample = pd.read_csv('sample_chicago_listings.csv')
df_sample['price'].median()
126.0

The median price ($m$) for a Chicago Airbnb for our sample was \$126 per night. How does this correspond to the value for the population?

df_popn['price'].median()
124.0

The median price for our sample was similar to the median price for all Chicago Airbnbs from March 2023, which is \$124 per night ($M_\text{March 2023}$).

We can see that this median price is different from the previous median price for all Chicago Airbnbs from June 2022, which is \$150 per night ($M_\text{June 2022}$), as calculated below.

df_june['price'].median()
150.0

We can also calculate the median required stay (nights) for the sample and the corresponding population.

df_sample['minimum_nights'].median()
2.0
df_popn['minimum_nights'].median()
2.0

We see that the median required stay for our sample of 700 Chicago Airbnbs from March 2023 was 2 nights ($m$), which is the same as the median required stay for the population of all Chicago Airbnbs from March 2023 ($M$).

We can also calculate the proportion of local hosts for our sample and for our population. Here, we will need to make a decision on what to do with missing data. Similar to the process followed in the last section, we will remove any observations with missing host locations from the data.

(df_sample['host_location'].dropna() == 'Chicago, IL').mean()
0.8042328042328042
(df_popn['host_location'].dropna() == 'Chicago, IL').mean()
0.7861526357199056

Our sample of Chicago Airbnbs had $\hat{p} = 0.8042$ or $80.42\%$ of the listings hosted by a local host. On the other hand, our population of Chicago Airbnbs has $p = 0.7862$ or $78.62\%$ local hosts.
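The choice of how to handle missing host locations changes the denominator of the proportion. The sketch below uses six made-up listings (not the Airbnb data) and hypothetical variable names to compare dropping missing values, as done above, with keeping every row, in which case a missing location is counted as non-local.

```python
import pandas as pd

# Hypothetical host locations for six listings; None marks a missing value.
host_location = pd.Series(
    ["Chicago, IL", "Chicago, IL", None, "Evanston, IL", None, "Chicago, IL"]
)

# Approach used in the text: drop missing values, then take the proportion.
is_local = host_location.dropna() == "Chicago, IL"
p_hat_dropped = is_local.mean()  # 3 local out of 4 known locations

# Alternative (not used in the text): keep every row, so missing counts as non-local.
p_hat_all_rows = (host_location == "Chicago, IL").mean()  # 3 local out of 6 rows

print(p_hat_dropped, p_hat_all_rows)
```

Dropping missing values describes only the listings with a known host location. Whether that matches the listings with missing locations is a separate question that the data alone cannot settle.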

Finally, we may also be interested in calculating the average number of people that a listing accommodates. The accommodates variable is cut off (censored) at 16, so any listing that can accommodate more than 16 people will also be recorded as 16.

df_sample['accommodates'].mean()
4.417142857142857
df_popn['accommodates'].mean()
4.314702465470504

Our sample of 700 Chicago Airbnbs had a mean number of people accommodated in a listing of 4.42 ($\bar{x}$). The corresponding population of Chicago Airbnbs (in March 2023) has a mean number of people accommodated in a listing of 4.31 people ($\mu$).

We know that these means are calculated using a censored variable that includes some measurement error. How many of our observations are affected by this censoring? While we can't answer this exactly, we can at least answer how many listings had 16 recorded as the number of people the listing can accommodate.

(df_sample['accommodates'] == 16).mean()
0.012857142857142857
(df_popn['accommodates'] == 16).mean()
0.015748031496062992

We see that the proportion of listings that are recorded as accommodating 16 was 1.29% or $\hat{p} = 0.0129$ for our sample of Chicago Airbnb listings. The corresponding value for the population is $p = 0.0157$ or 1.57%. This means that up to about 1.6% of the accommodates values in the population may be underestimates of the true number the listing can accommodate, so the means that we calculated above could be underestimates.
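To see why censoring can only pull the mean down, consider a minimal sketch with made-up capacities (the values and the names `true_capacity` and `recorded` are hypothetical). Capping values at 16 with `clip` mimics how the accommodates variable is recorded.

```python
import pandas as pd

# Hypothetical true capacities; two listings can host more than 16 people.
true_capacity = pd.Series([2, 4, 4, 6, 8, 16, 20, 24])

# Censor at 16, mimicking how the accommodates variable is recorded.
recorded = true_capacity.clip(upper=16)

# Share of listings recorded as 16; this includes listings that truly hold 16.
share_at_cap = (recorded == 16).mean()

print(recorded.mean(), true_capacity.mean(), share_at_cap)
```

The share recorded at the cap is an upper bound on the share affected by censoring, since a listing that truly accommodates exactly 16 is recorded correctly.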

Above, we saw that almost all of the statistic and corresponding parameter values were slightly different. Based on our experiences with sampling distributions, we know that we don't need to be too alarmed by the minor differences between our statistic and parameter values. We expect our statistics to vary from the true values, but we also expect them to be similar to the true values.
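One way to put the size of these gaps side by side is to compute the difference between each statistic and its parameter. The sketch below copies the values from the outputs shown above; the dictionary layout and the choice to report a relative difference are illustrative assumptions, not part of the original analysis.

```python
# (statistic, parameter) pairs copied from the outputs shown above.
pairs = {
    "median price": (126.0, 124.0),
    "median minimum nights": (2.0, 2.0),
    "proportion local hosts": (0.8042328042328042, 0.7861526357199056),
    "mean accommodates": (4.417142857142857, 4.314702465470504),
    "proportion recorded as 16": (0.012857142857142857, 0.015748031496062992),
}

differences = {}
for name, (statistic, parameter) in pairs.items():
    abs_diff = statistic - parameter   # statistic minus parameter
    rel_diff = abs_diff / parameter    # difference as a fraction of the parameter
    differences[name] = (abs_diff, rel_diff)
    print(f"{name}: difference = {abs_diff:+.4f} ({rel_diff:+.1%} of the parameter)")
```

The proportion recorded as 16 has the smallest absolute gap among the quantities that differ, yet the largest relative gap, because the parameter itself is close to zero.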

Note that we generally don't have the population data available to us, so we typically can't say anything definitive about a parameter.

Also, I'll often use the past tense to describe a statistic. That is, I'll say "the sample statistic was a certain value." This helps to indicate the statistic is specific to a given sample that I collected in the past. When I talk about an unknown parameter, I'll often use the present tense or a future tense to describe this value. This helps to indicate that it is unknown and applies to a current population.

Now that we've seen a few examples of parameters and statistics, let's consider how we can use our statistics to make statements about parameters. We'll start with confidence intervals before continuing to hypothesis testing.