Describing a Sample with Visualizations and Statistics


Let’s say that we want to summarize a variable based on sample data. While there are many choices for how to summarize a sample, there are a few guiding questions:

  • What is the variable type? Categorical or quantitative?
  • What do you want to summarize about the variable?

Sample Distributions (typically for Quantitative Variables)

Suppose I generate a sample from my population and want to start summarizing the information contained in a variable that I have recorded. A natural first step is to visualize the distribution of that variable in the sample.
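As a rough sketch of what generating a sample can look like, the code below draws a simple random sample with pandas' DataFrame.sample. The population here is a small made-up data frame (not the Airbnb data), and the sample size and random_state are arbitrary choices for illustration; the sampling method used for the Airbnb sample may differ.

```python
import pandas as pd

# Hypothetical stand-in for a population: 1,000 listings with made-up minimum stays
population = pd.DataFrame({'minimum_nights': [1, 2, 2, 3, 30] * 200})

# Draw a simple random sample of 100 rows without replacement;
# random_state fixes the draw so that it can be reproduced
sample = population.sample(n=100, random_state=1)

print('Population size:', len(population))
print('Sample size:', len(sample))
```

Because sampling is done without replacement by default, no listing can appear in the sample twice.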

Let's consider the minimum stay required. We'll look at our sample of 700 Chicago Airbnb listings.

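import pandas as pd
import matplotlib.pyplot as plt

# Load the sample of 700 Chicago Airbnb listings
df = pd.read_csv('sample_chicago_listings.csv')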
df['minimum_nights'].hist()
plt.title('Histogram of Minimum Nights Required for Chicago Airbnb Listings')
plt.xlabel('Minimum Nights Required')
plt.ylabel('Frequency')
plt.show()

Histogram of the minimum nights required for the sample of Chicago Airbnb listings.

The distribution of minimum nights for the sample of Chicago Airbnb listings was right skewed, with a center between 0 and 15 nights, values ranging from around 1 to around 175 nights, and upper outliers.

Because our sample comes from the population, we anticipate that our sample distribution should mirror our population distribution. In this case, because we have a population distribution available to us, we can also use that to confirm.

df_popn = pd.read_csv('chicago_listings.csv')
df_popn['minimum_nights'].hist()
plt.title('Histogram of Minimum Nights Required for Population of Chicago Airbnb Listings')
plt.xlabel('Minimum Nights Required')
plt.ylabel('Frequency')
plt.show()

Histogram of the minimum nights required for the population of Chicago Airbnb listings.

df_popn['minimum_nights'].describe()
count 7747.000000
mean 14.459533
std 42.247809
min 1.000000
25% 2.000000
50% 2.000000
75% 32.000000
max 1125.000000

Summary measures for the minimum nights required for the population of Chicago Airbnb listings.

df['minimum_nights'].describe()
count 700.000000
mean 12.524286
std 19.018573
min 1.000000
25% 1.000000
50% 2.000000
75% 32.000000
max 180.000000

Summary measures for the minimum nights required for the sample of 700 Chicago Airbnb listings.

We can see from both the histograms and the numerical summaries that the sample distribution shares a number of features with the population distribution. Both are right skewed; the minimums, medians, and Q3s are exactly the same; and the Q1s are similar.

However, we also see that the maximums of the two distributions are quite different, and that the right tail for the population is substantially longer. The longer right tail also explains why the standard deviation is quite a bit larger for the population than for our random sample. Combined, this demonstrates that our sample distribution does a good job of representing the population but will not perfectly represent every aspect of the population.
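One way to make this kind of comparison easier to read is to place the two sets of summary measures side by side. The sketch below does this with pd.concat on made-up, right-skewed data generated with NumPy; the data, sizes, and seeds are assumptions for illustration, not the Airbnb values.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed "population" (not the Airbnb data)
rng = np.random.default_rng(0)
popn_nights = pd.Series(rng.exponential(scale=10, size=5000).round() + 1)

# Simple random sample of 500 values from that population
sample_nights = popn_nights.sample(n=500, random_state=0)

# Place the two sets of summary measures side by side
comparison = pd.concat(
    [popn_nights.describe(), sample_nights.describe()],
    axis=1,
    keys=['population', 'sample'],
)
print(comparison)
```

Since the sample is drawn from the population, its maximum can never exceed the population maximum, which is one reason the right tail of a sample often looks shorter.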

Descriptive Statistics for Quantitative Variables

For quantitative variables, the most common summary measure is the mean of that variable. Other common summary measures include the median, minimum, and maximum. Depending on the question of interest, other summary measures could be selected as most relevant or most appropriate.

We'll continue exploring the minimum stay (in nights) that a listing requires for our sample of 700 Chicago Airbnb listings.

At first, we may be interested in the mean minimum stay required.

print('Mean minimum required stay: ', df['minimum_nights'].mean()) 
Mean minimum required stay: 12.524285714285714

The mean minimum required stay for our sample of 700 Chicago Airbnb listings was 12.52 nights.

For this sample of Chicago Airbnb listings, the mean minimum stay required was 12.52 nights.

This seems like a large number of nights. What was the smallest minimum stay? The largest minimum stay required?

print('Smallest minimum required stay: ', df['minimum_nights'].min())
Smallest minimum required stay: 1

The smallest minimum required stay for our sample of 700 Chicago Airbnb listings was 1 night.

print('Largest minimum required stay: ', df['minimum_nights'].max())
Largest minimum required stay: 180

The largest minimum required stay for our sample of 700 Chicago Airbnb listings was 180 nights.

For this sample, the smallest minimum stay was 1 night, and the largest minimum stay was 180 nights. A listing with a minimum stay of 180 nights (almost 6 months!) seems quite unusual.

Were there many with this requirement?

print('Number of listings with 180 nights required: ', (df['minimum_nights'] == 180).sum())
Number of listings with 180 nights required: 3

Three listings had required stays of 180 nights!

There were 3 listings that each had the minimum nights listed as 180 nights! I have to wonder how frequently those listings would be booked.

I would imagine that listings with a minimum stay of 1 night would be more common. Let's check to see how many of those are in this sample.

print('Number of listings with only 1 night required: ', (df['minimum_nights'] == 1).sum())
Number of listings with only 1 night required: 181

181 listings had a minimum stay of only 1 night.

Yes, there were 181 listings that had a minimum stay of only 1 night. This seems much more common.

We might start to anticipate that there is a right skewed distribution with a long right tail, based on the large minimum stay that is less common and a small minimum stay that is more common. In this instance, what would be the correct center to use?

We should use the median, to more accurately summarize where the center of the distribution is.

print('Median minimum required stay: ', df['minimum_nights'].median())
Median minimum required stay: 2.0

The median minimum required stay for our sample of 700 Chicago Airbnb listings was 2.0 nights.

We see that the median minimum stay was 2 nights, so we can anticipate that there are likely a large number of listings that require someone to stay for at least 2 nights.
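To see why the median is the safer choice of center for a skewed distribution, consider a tiny made-up set of minimum stays (these values are invented for illustration and are not taken from the sample). Adding a single 180-night listing moves the mean a great deal but leaves the median where it was.

```python
import pandas as pd

# Made-up minimum stays, with and without one extreme listing
stays = pd.Series([1, 1, 2, 2, 3])
stays_with_outlier = pd.Series([1, 1, 2, 2, 3, 180])

print('Without outlier: mean =', stays.mean(), ', median =', stays.median())
print('With outlier:    mean =', stays_with_outlier.mean(),
      ', median =', stays_with_outlier.median())
```

The mean jumps from 1.8 to 31.5 nights, while the median stays at 2 nights in both cases.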

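Finally, we may also be interested in learning about the variability. If we are using the median, we would likely want to pair it with the IQR or another resistant measure of variability to summarize this sample. (The range is simple to compute, but it depends entirely on the two most extreme values, so it is not resistant to outliers.) Here, we'll choose to use the IQR.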

import numpy as np

# 75th and 25th percentiles of the minimum nights required
q3, q1 = np.percentile(df['minimum_nights'], [75, 25])
iqr = q3 - q1
print('IQR for the minimum required stay: ', iqr)
IQR for the minimum required stay: 31.0

The IQR for the minimum required stay for our sample of 700 Chicago Airbnb listings was 31.0 nights.

The IQR for this distribution was 31 nights.
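As a small illustration of how differently measures of variability react to extreme values, the sketch below computes the IQR and the range for a made-up set of stays, with and without one 180-night listing. The helper names iqr and value_range are hypothetical, and the quartiles use pandas' default linear interpolation, which is only one of several quartile conventions.

```python
import pandas as pd

def iqr(values):
    """Interquartile range: Q3 minus Q1, using pandas' default linear interpolation."""
    return values.quantile(0.75) - values.quantile(0.25)

def value_range(values):
    """Range: maximum minus minimum."""
    return values.max() - values.min()

# Made-up minimum stays, with and without one extreme listing
stays = pd.Series([1, 1, 2, 2, 3])
stays_with_outlier = pd.Series([1, 1, 2, 2, 3, 180])

for label, values in [('without outlier', stays), ('with outlier', stays_with_outlier)]:
    print(label, '| IQR:', iqr(values), '| range:', value_range(values))
```

In this toy example the single extreme value changes the IQR only slightly, while the range grows from 2 to 179 nights.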

Note that for all of these examples, I chose to calculate a number to summarize some portion of the distribution. These are all descriptive statistics for my sample.

While I could also visualize this distribution, and that is a very helpful step in summarizing a distribution, we are going to start focusing on the numerical summaries of a distribution for the remainder of our content on statistical inference.

Descriptive Statistics for Categorical Variables

For categorical variables, one common summary measure is the proportion of observations that have a certain characteristic. This can be described as a proportion, rate, or percent.

For example, from our sample of Chicago Airbnb listings, we could ask what proportion of listings have an owner who typically responds within a day.

df['host_response_time'].value_counts(normalize = True)
within an hour 0.850649
within a few hours 0.095779
within a day 0.043831
a few days or more 0.009740

Distribution with proportions of host response times for the sample of Chicago Airbnb listings.

Based on this analysis, one person might say that 4.4% of Chicago Airbnb listings had owners who responded within a day.

Based on this same analysis, a second person might say that 99.02% of Chicago Airbnb listings had owners who responded within a day (0.8506 + 0.0958 + 0.0438).

Who is correct?

Both could be correct. Answering this question relies on whether responding within an hour counts as responding within a day. You may want to return to your stakeholder and/or situation to determine how to interpret this question.

For example, suppose you are a user who will be talking with a few different hosts. You want to have a sense of whether you should expect a response from the hosts within a day, where a response within an hour still counts as occurring within the day. In this situation, you'd expect an answer like the second person's above.

On the other hand, suppose that you are a user who sent a message to a host an hour or two ago. You haven't heard back, and you'd like to consider what proportion of hosts typically respond in more than a few hours but still within a day. In other words, is there still hope that the host might respond to you? In this case, you would want an answer like the one the first person gave above.

Similarly, you might consider how hosts behave and what type of answer they may prefer to hear, or how an Airbnb executive might expect an answer. This decision, again, may be determined by the context of the analysis.
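Both readings of "within a day" can be computed from the same set of counts. The sketch below rebuilds the proportions from the category counts of the recorded response times in this sample (524, 59, 27, and 6); the variable names are hypothetical and chosen only to make the two interpretations explicit.

```python
import pandas as pd

# Counts of recorded host response times in the sample
counts = pd.Series({
    'within an hour': 524,
    'within a few hours': 59,
    'within a day': 27,
    'a few days or more': 6,
})
proportions = counts / counts.sum()

# Reading 1: only the "within a day" category itself
exactly_within_a_day = proportions['within a day']

# Reading 2: any response that arrives within a day, including faster ones
within_a_day_or_sooner = proportions[
    ['within an hour', 'within a few hours', 'within a day']
].sum()

print(f'Only the "within a day" category: {exactly_within_a_day:.1%}')
print(f'Within a day or sooner: {within_a_day_or_sooner:.1%}')
```

Naming the two quantities separately makes it harder to report one when the stakeholder was asking about the other.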

Now, we might continue exploring the data. We can look at the specific counts rather than the proportions as calculated above using the normalize = True argument.

df['host_response_time'].value_counts()
within an hour 524
within a few hours 59
within a day 27
a few days or more 6

Distribution with counts of host response times for the sample of Chicago Airbnb listings.

df['host_response_time'].value_counts().sum()
616

There are 616 values recorded for the host response time in our sample of 700 Chicago Airbnb listings.

Notice that our host_response_time variable only contains 616 observations, although there are 700 observations in the data frame. That means there are 84 listings of our sample (over 10%) that don't have a value recorded for the host response time. Is there anything we can say about these 84 listings?
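The gap between recorded and missing values can also be checked directly. The sketch below uses a tiny made-up series (not the Airbnb data) to show that value_counts skips missing values by default, that isna().sum() counts them, and that passing dropna=False to value_counts lists them alongside the other categories.

```python
import numpy as np
import pandas as pd

# Made-up response times with two missing entries
response_time = pd.Series(
    ['within an hour', np.nan, 'within a day', 'within an hour', np.nan]
)

n_recorded = response_time.value_counts().sum()   # missing values are skipped
n_missing = response_time.isna().sum()            # count of missing values
counts_with_missing = response_time.value_counts(dropna=False)

print('Recorded:', n_recorded)
print('Missing:', n_missing)
print(counts_with_missing)
```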

We'll start by adding a new variable called missing_response_time that records whether the response time is missing. We can then continue by exploring whether there is an association between the response time being missing and other characteristics. We'll consider whether the host is a superhost, whether the listing has availability, how many total listings the host has, how long it has been since the last review on the listing, and the number of reviews the listing has received.

df['missing_response_time'] = df['host_response_time'].isna()

We added a new variable, missing_response_time, to our data frame to assist with later analyses.

pd.crosstab(df['missing_response_time'], df['host_is_superhost'], normalize = 'index')
host_is_superhost         False      True
missing_response_time
False                  0.579545  0.420455
True                   0.916667  0.083333

Proportion of superhosts, grouped based on whether the host response time was missing or present.
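The normalize = 'index' argument is what makes each row of such a table sum to 1, so that the two groups can be compared even though they differ in size. A minimal sketch on a made-up data frame (six invented rows, not the Airbnb sample) shows the effect.

```python
import pandas as pd

# Made-up data: 4 listings with a response time recorded, 2 without
toy = pd.DataFrame({
    'missing_response_time': [False, False, False, False, True, True],
    'host_is_superhost': [True, True, False, False, False, False],
})

# Proportions are computed within each row (each missing_response_time group)
row_props = pd.crosstab(
    toy['missing_response_time'], toy['host_is_superhost'], normalize='index'
)
print(row_props)
print(row_props.sum(axis=1))  # each row sums to 1
```

With normalize = 'columns' the proportions would instead be computed within each column, which answers a different question.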

pd.crosstab(df['missing_response_time'], df['has_availability'], normalize = 'index')
has_availability          False      True
missing_response_time
False                  0.003247  0.996753
True                   0.226190  0.773810

Proportion of listings with availability, grouped based on whether the host response time was missing or present.

import seaborn as sns

sns.boxplot(x = 'missing_response_time', y = 'host_total_listings_count', data = df)
plt.ylim([-1, 100])
plt.xlabel('Is Host Response Time Missing?')
plt.ylabel('Total Listings of Host (Cut Off at 100)')
plt.title('Boxplot of Total Listings')
plt.show()

Side-by-side boxplots of the total host listing counts based on whether the response time was missing or provided for the sample of Chicago Airbnb listings.

df.groupby(by='missing_response_time')['host_total_listings_count'].describe()
                       count        mean          std  min  25%  50%   75%     max
missing_response_time
False                  616.0  462.094156  1470.566468  0.0  3.0  6.5  29.0  8342.0
True                    84.0    7.630952    25.052610  1.0  1.0  2.0   4.0   184.0

Summary measures of the total host listing counts, grouped based on whether the host response time was missing or present.

sns.boxplot(x = 'missing_response_time', y = 'last_review', data = df)
plt.xlabel('Is Host Response Time Missing?')
plt.ylabel('Days Since Listing Last Reviewed')
plt.title('Boxplot of Last Review')
plt.show()

Side-by-side boxplots of the number of days since the last review based on whether the response time was missing or provided for the sample of Chicago Airbnb listings.

df.groupby(by = 'missing_response_time')['last_review'].describe()
                       count        mean         std  min     25%    50%      75%     max
missing_response_time
False                  528.0  119.566288  236.707001  1.0   10.00   31.5   120.25  2122.0
True                    62.0  660.951613  553.214642  3.0  160.25  538.5  1151.75  1880.0

Distribution of the days since the last review separately, grouped based on whether the host response time was missing or present.

sns.boxplot(x = 'missing_response_time', y = 'number_of_reviews', data = df)
plt.xlabel('Is Host Response Time Missing?')
plt.ylabel('Number of Reviews on Listing')
plt.title('Boxplot of Number of Listing Reviews')
plt.show()

Side-by-side boxplots of the number of reviews based on whether the response time was missing or provided for the sample of Chicago Airbnb listings.

df.groupby(by = 'missing_response_time')['number_of_reviews'].describe()
                       count       mean        std  min  25%   50%    75%    max
missing_response_time
False                  616.0  49.975649  77.677681  0.0  2.0  18.0  67.00  661.0
True                    84.0  46.190476  95.304771  0.0  0.0   9.5  37.25  645.0

Distribution of the number of reviews separately, grouped based on whether the host response time was missing or present.

From these five summaries, it does appear that the 84 listings that are missing the response time variable may differ in a few ways from those that have this variable recorded:

  • a smaller proportion of these hosts are superhosts (8% compared to 42% of those who aren't missing)
  • a larger proportion of these hosts do not have availability to be booked (22.6% compared to 0.3% of those who aren't missing)
  • a smaller number of listings per host (75% of these listings have hosts with 4 or fewer total listings, compared to 29 or fewer when the response time is recorded)
  • a longer time since the last review (median of about 540 days compared to about 30 days for those who aren't missing)
  • a smaller typical number of reviews (median of 9.5 compared to 18 for those who aren't missing)

All of this information suggests that those who are missing the response time may not be as active on Airbnb currently or recently. Many of them may have limited historical bookings on their listings, or may not currently be listing the unit. For this question, it may be reasonable to omit that group of listings, as they may not be functional current listings. However, it is worth exploring this relationship further, and mentioning this limitation in any conclusions.