Summarizing Variables with Statistics, Tables, & Plots
On this page, we will review some of the methods that can be used to create and summarize variables using calculations, tables, and visualizations.
This page will focus on research questions surrounding the size of Chicago Airbnb hosts as calculated on the last page.
How do hosts with more properties (defined as having 3 or more properties listed on Airbnb located in Chicago) compare with hosts with fewer properties (defined as having 1 or 2 properties listed on Airbnb located in Chicago)?
Logical Statements
To make answering this question easier, we will add a new variable to our data that records whether we consider the host to be large (having 3 or more properties).
We can do so using a logical statement. You saw this previously when we created boolean variables a few pages ago.
A logical statement is a statement that can be evaluated as being either True or False. These statements typically include some kind of comparison, such as:
- "less than" with the < symbol
- "less than or equal to" with the <= symbols
- "greater than" with the > symbol
- "greater than or equal to" with the >= symbols
- "exactly equal to" with the == symbols
- "not equal to" with the != symbols
df_host['Chicago_listings_count'] >= 3
0        True
1       False
2       False
3       False
4       False
...
3585    False
3586    False
3587    False
3588    False
3589    False
Name: Chicago_listings_count, Length: 3590, dtype: bool
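As a quick illustration of the comparison operators listed above, here is a minimal sketch on a small made-up Series. The name listings and its values are hypothetical and are not taken from df_host. Each comparison is applied element by element and returns a Boolean Series.

```python
import pandas as pd

# Hypothetical listing counts for five hosts (illustration only)
listings = pd.Series([6, 1, 1, 2, 3])

at_least_three = listings >= 3   # True where the count is 3 or more
exactly_one = listings == 1      # True where the count is exactly 1
not_one = listings != 1          # True where the count is anything but 1

print(at_least_three.tolist())   # [True, False, False, False, True]
print(exactly_one.tolist())      # [False, True, True, False, False]
print(not_one.tolist())          # [True, False, False, True, True]
```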
Once we have the variable as we would like it, we can specify that we want it added to our original host data. One method to do so is to specify a new column within the data frame that you would like to serve as the variable for having more Chicago Airbnb properties.
df_host['many_properties'] = (df_host['Chicago_listings_count'] >= 3)
df_host.head()
 | host_id | mean_bedrooms | Chicago_listings_count | host_name | host_since | host_location | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_total_listings_count | host_has_profile_pic | host_identity_verified | more_properties | many_properties |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2153 | 1.0 | 6 | Linda | 2008-08-16 | Munster, IN | within an hour | 92.0 | 100.0 | True | 24 | True | True | True | True |
1 | 2613 | 1.0 | 1 | Rebecca | 2008-08-29 | Chicago, IL | within an hour | 100.0 | 97.0 | True | 1 | True | True | False | False |
2 | 4434 | 3.0 | 1 | Kellen | 2008-11-20 | Chicago, IL | within an hour | 100.0 | 92.0 | False | 5 | True | True | False | False |
3 | 6162 | 2.0 | 1 | Jackie | 2009-01-08 | Chicago, IL | within a few hours | 80.0 | 30.0 | False | 2 | True | True | False | False |
4 | 7529 | 1.0 | 1 | Emily | 2009-02-07 | Chicago, IL | within an hour | 100.0 | 100.0 | True | 1 | True | True | False | False |
Note the added variable many_properties at the end of this dataset.
Proportions as Means
Earlier, we calculated by hand the proportion of Airbnb hosts that were considered to be large hosts in Chicago. We can use a helpful shortcut for this calculation that is specific to Boolean variables. Calculating the mean of a Boolean variable is the same thing as calculating the proportion of observations that have that characteristic.
df_host['many_properties'].mean()
0.13760445682451253
We can confirm that 13.76% of Chicago Airbnb hosts have 3 or more properties located inside the city.
Python evaluates True as the value of 1 and False as the value of 0 when using a Boolean variable in a calculation. This allows us to use the mean function to calculate a proportion.
We'll return to this shortcut in a few modules and provide some additional support for why this shortcut works.
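To see the shortcut at work on a small scale, the sketch below compares the by-hand proportion with the mean of a Boolean variable. The Series many is a hypothetical stand-in for df_host['many_properties'] with made-up values.

```python
import pandas as pd

# Hypothetical Boolean variable for ten hosts (illustration only)
many = pd.Series([True, False, False, False, True,
                  False, False, False, False, False])

by_hand = many.sum() / len(many)   # number of True values divided by the total
shortcut = many.mean()             # True counts as 1, False counts as 0

print(by_hand, shortcut)           # 0.2 0.2
```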
Value Counts
We used value_counts to generate the Chicago_listings_count variable in the df_host data frame.
Here, we'll formally introduce the value_counts function and describe options that can be used to perform additional calculations quickly with this function.
First, value_counts returns the unique values that a variable takes, along with a count of how frequently each of those values occurs in the data.
For example, we previously used this to determine the distribution of the number of listings that hosts have in Chicago.
df_host['Chicago_listings_count'].value_counts()
1      2657
2       439
3       172
4        84
5        48
6        42
10       19
8        19
7        19
12       12
11       11
9         9
13        7
16        6
17        5
21        4
15        4
14        3
31        3
18        3
22        3
23        2
24        2
19        2
39        2
47        1
64        1
30        1
40        1
27        1
33        1
658       1
34        1
75        1
25        1
32        1
63        1
38        1
Name: Chicago_listings_count, dtype: int64
These are printed in order of the most common value to the least common value. We see that 2,657 hosts have a single listing, 439 have 2 listings in Chicago, and 172 have 3 listings in Chicago.
The counts can be helpful, but we may also want to know the proportion of each level. We could do this using logical statements and the mean function as previously described.
(df_host['Chicago_listings_count'] == 1).mean()
0.7401114206128133
We see that 74% of the hosts in Chicago have exactly 1 listing. If we wanted to calculate this for every value in the data, we would need to repeat the calculation 38 times. Instead, we can add an additional input to the value_counts function to calculate the proportions. Setting the normalize input to True allows us to specify that we would like Python to report the proportions for each group.
df_host['Chicago_listings_count'].value_counts(normalize = True)
1      0.740111
2      0.122284
3      0.047911
4      0.023398
5      0.013370
6      0.011699
10     0.005292
8      0.005292
7      0.005292
12     0.003343
11     0.003064
9      0.002507
13     0.001950
16     0.001671
17     0.001393
21     0.001114
15     0.001114
14     0.000836
31     0.000836
18     0.000836
22     0.000836
23     0.000557
24     0.000557
19     0.000557
39     0.000557
47     0.000279
64     0.000279
30     0.000279
40     0.000279
27     0.000279
33     0.000279
658    0.000279
34     0.000279
75     0.000279
25     0.000279
32     0.000279
63     0.000279
38     0.000279
Name: Chicago_listings_count, dtype: float64
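The relationship between the counts and the normalized output can be checked on a small example. The sketch below uses a hypothetical Series named listings with made-up values and shows that normalize = True matches dividing each count by the number of observations.

```python
import pandas as pd

# Hypothetical listing counts for eight hosts (illustration only)
listings = pd.Series([1, 1, 1, 1, 2, 2, 3, 1])

counts = listings.value_counts()                      # frequency of each value
proportions = listings.value_counts(normalize=True)   # share of each value

# Dividing each count by the number of hosts gives the same proportions
by_hand = counts / len(listings)

print(proportions.loc[1])   # 0.625, since 5 of the 8 hosts have 1 listing
print(by_hand.loc[1])       # 0.625
```

Because the proportions cover every unique value, they add to 1 across the whole output.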
We can use the value_counts function for categorical variables in addition to quantitative variables.
For example, the following output demonstrates that 86.78% of Chicago Airbnb hosts are locals themselves, while the remaining hosts are located elsewhere. Note that the output skips over some of the middle values, since there are 225 locations recorded for Chicago Airbnb hosts. We can observe that there is not a second primary location where Chicago Airbnb hosts are located, since the second most common host location (United States) accounts for only 0.58% of hosts.
df_host['host_location'].value_counts(normalize = True)
Chicago, IL                0.867752
United States              0.005765
Los Angeles, CA            0.004747
Illinois, United States    0.004069
New York, NY               0.004069
...
Wauconda, IL               0.000339
La Grange Park, IL         0.000339
Cleveland, OH              0.000339
Countryside, IL            0.000339
High Point, NC             0.000339
Name: host_location, Length: 225, dtype: float64
Comparing Distributions
Now, we can start to compare the distribution of mean bedrooms per listing for hosts depending on whether the host has many or few listings in Chicago. Recall that this was our research question of interest.
Let's start with some visualizations that allow us to answer this question.
Violinplots
We may start with violinplots. These allow us to visualize a smoothed form of a histogram for each of two different groups. We'll use the seaborn package to help with this visualization. The sns.set() code simply sets the visualization theme for the resulting graphs.
The matplotlib.pyplot package allows us to further adjust the labels for each of these graphs.
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
The first line of code provides all information needed for generating the graph, and the following four lines of code adjust the formatting of this graph. From the seaborn package, we supply the violinplot function with the variables to place on the x-axis, y-axis, and the dataframe from which to find these variables. Python then plots each of these variables on the appropriate axes and adds the labels as specified by the last lines of code.
sns.violinplot(y = 'mean_bedrooms', x = 'many_properties', data = df_host)
plt.xlabel("Host has Many Properties (3+)")
plt.ylabel('Mean Number of Bedrooms')
plt.title('Violinplots of Mean Number of Bedrooms for Chicago Airbnb Hosts')
plt.show()
Violinplot of the Mean Number of Bedrooms for Airbnb hosts.
The first thing that I notice when looking at this violinplot is that those hosts that have 1 or 2 properties seem to have a distribution that is discrete. That is, we can see that there are more common and less common values by the very wavy nature of the distribution. This is because when you are averaging either 1 or 2 whole numbers, every average will either be a whole number or a number ending in .5. This means there are only a limited, discrete set of values that the mean can take for these hosts.
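The discreteness described above can be checked directly. The sketch below lists every possible average for a host with one or two listings, assuming (hypothetically, to keep the output short) that each listing has between 1 and 4 bedrooms.

```python
from itertools import combinations_with_replacement

# Possible bedroom counts for a single listing (hypothetical range of 1 to 4)
bedrooms = range(1, 5)

# Averages for hosts with exactly one listing
one_listing = {b / 1 for b in bedrooms}

# Averages for hosts with exactly two listings
two_listings = {(a + b) / 2 for a, b in combinations_with_replacement(bedrooms, 2)}

print(sorted(one_listing | two_listings))
# [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
```

Hosts with three or more listings can produce many more distinct averages (thirds, quarters, and so on), which is consistent with their smoother-looking distribution.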
The second thing that I notice is that each distribution has a long right tail. This suggests that the distributions are not symmetric. The median would be the more appropriate measure of center to compare these two distributions.
The spread for those with fewer properties is larger, although it does appear that the bulk of the data lies within the same range. The maximum for the smaller hosts is clearly larger than the maximum for larger hosts. This is likely influenced by the extreme values being more obvious when they appear in only one or two units. When hosts have more than 2 units, the appearance of extreme values would be counteracted by the more common values.
I also notice that even though the maximum number of bedrooms that Airbnb allows is 16, I do not see any hosts with an average of 16 bedrooms. What types of hosts have the large units, then? This graph doesn't allow us to answer this question. Instead, we would need to turn to our original data frame to analyze those listings that can accommodate large groups.
It might be easier to compare these two distributions using numerical summaries.
Numerical Summaries
To calculate numerical summaries for these two groups, we can filter the data to each group separately and then use the describe function to provide initial calculations.
df_host['mean_bedrooms'][df_host['many_properties'] == True].describe()
count    489.000000
mean       1.835768
std        0.921820
min        1.000000
25%        1.000000
50%        1.588235
75%        2.250000
max        6.400000
Name: mean_bedrooms, dtype: float64
df_host['mean_bedrooms'][df_host['many_properties'] == False].describe()
count    2980.000000
mean        1.995638
std         1.077470
min         1.000000
25%         1.000000
50%         2.000000
75%         3.000000
max        12.000000
Name: mean_bedrooms, dtype: float64
The numerical summaries support the initial observations from the violinplots. Those hosts who have fewer properties have a much higher maximum (12 compared to 6.4). The mean, median, standard deviation, and IQR are all higher for the hosts that have fewer properties. As described before, we would want to choose the median and IQR to represent the number of bedrooms, due to the shape of these distributions.
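As an alternative to filtering twice, the same summaries can be produced in one step by grouping. This is a sketch of a different approach from the one used above, shown on a hypothetical miniature data frame named toy with made-up values. The added IQR column is computed from the quartiles that describe reports.

```python
import pandas as pd

# Hypothetical miniature version of the host data (illustration only)
toy = pd.DataFrame({
    'many_properties': [True, True, True, False, False, False, False],
    'mean_bedrooms':   [1.0, 1.5, 2.0, 1.0, 2.0, 3.0, 4.0],
})

# One summary row per group instead of one describe call per filter
summary = toy.groupby('many_properties')['mean_bedrooms'].describe()

# IQR is the distance between the third and first quartiles
summary['IQR'] = summary['75%'] - summary['25%']

print(summary[['count', '50%', 'IQR']])
```

The rows are ordered False then True, so the two groups can be read side by side.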
When observing the distribution using numerical summaries alone, it can be challenging to compare quickly between two groups. For this reason, we often use boxplots to improve the speed with which we can compare between these two groups.
Boxplots
Boxplots are a method of graphically displaying five summary measures for each group. The code is similar to that used for violinplots, except we replace the violinplot function with boxplot.
sns.boxplot(x = 'many_properties', y = 'mean_bedrooms', data = df_host)
plt.xlabel("Host has Many Properties (3+)")
plt.ylabel('Mean Number of Bedrooms')
plt.title('Boxplots of Mean Number of Bedrooms for Chicago Airbnb Hosts')
plt.show()
The boxplot highlights a few summary measures for the mean number of bedrooms of Airbnb hosts.
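To connect the picture to numbers, the sketch below computes the quantities that a boxplot draws, using a hypothetical Series named values with made-up data. It assumes the default whisker rule used by seaborn and matplotlib, in which whiskers reach the most extreme points within 1.5 times the IQR of the box and anything beyond is drawn as an individual point.

```python
import pandas as pd

# Hypothetical mean-bedroom values for a handful of hosts (illustration only)
values = pd.Series([1.0, 1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 6.0])

# Quartiles that define the box and the line inside it
q1, median, q3 = values.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1

# Fences under the default 1.5 * IQR whisker rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn individually as outliers
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)      # 1.375 2.0 2.625
print(outliers.tolist())   # [6.0]
```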
Each of these two plots could also be generated for a single quantitative variable, as well. We do not need to include the second variable that records whether the host has many properties to generate the plot.
Using these previous analyses, we can now answer our question of interest. It does not appear that there is much of a relationship between how many properties a host has in Chicago and the mean number of bedrooms for each host's listings. If anything, hosts who have fewer properties generally have slightly larger properties (a median of 2 bedrooms, compared to roughly 1.59 for hosts with many properties).
Two-Way Tables and Two-Way Graphs
Now, let's focus on the relationship between a host being a superhost and a host having many properties.
Both being a superhost and having many properties only have two values. That means that there are four possible combinations of the two variables: a superhost with many properties, a superhost without many properties, not a superhost with many properties, and not a superhost without many properties.
Numerical Summaries Using Tables
We might first be interested in how many hosts fall into each of these four categories. We can do this by creating a two-way table using the crosstab function from within the pandas library. The first variable provided will be displayed as the rows, and the second variable will be used as the columns.
pd.crosstab(df_host['host_is_superhost'], df_host['many_properties'])
many_properties | False | True |
---|---|---|
host_is_superhost | ||
False | 1899 | 278 |
True | 1197 | 216 |
A two-way table with counts for the number of hosts that are contained within each category.
The counts are helpful as a way to first observe how many hosts fall into each category. For example, we see that 1,899 hosts are not superhosts and have only one or two properties. Of the four categories, we also see that this is the most common category for a host.
The counts can be challenging to interpret, so we might also want to observe the proportion of hosts that fall into each category. We can again use the normalize input to instruct Python to report the proportion of hosts in each category.
pd.crosstab(df_host['host_is_superhost'], df_host['many_properties'], normalize = True)
many_properties | False | True |
---|---|---|
host_is_superhost | ||
False | 0.528969 | 0.077437 |
True | 0.333426 | 0.060167 |
A two-way table with proportions calculated based on the total number of Airbnb hosts.
52.90% of hosts have one or two listings in Chicago and are not superhosts.
We may actually wonder if the rate of having multiple properties is different between those who are superhosts and those who are not. To calculate these rates, we will change the input for the normalize argument. We can specify that we want to normalize by the index, or the row. This means that the proportions within each row will add to 1.
pd.crosstab(df_host['host_is_superhost'], df_host['many_properties'], normalize = 'index')
many_properties | False | True |
---|---|---|
host_is_superhost | ||
False | 0.872301 | 0.127699 |
True | 0.847134 | 0.152866 |
A two-way table with proportions calculated based on whether the host is or is not a superhost.
We are now able to identify that 15.29% of superhosts have 3 or more Airbnb listings in Chicago, while only 12.77% of hosts that are not superhosts have 3 or more Airbnb listings in Chicago. In other words, it does seem like hosts who are superhosts have a higher rate of having three or more properties than hosts who are not superhosts. However, most hosts do not have three or more properties.
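The row-wise behavior of normalize = 'index' can be verified without the original data. The sketch below rebuilds one row per host from the four counts in the first two-way table, storing them in a hypothetical data frame named hosts (not the df_host used above), and then checks that each row of the normalized table adds to 1.

```python
import numpy as np
import pandas as pd

# Counts from the first two-way table: (host_is_superhost, many_properties)
counts = {(False, False): 1899, (False, True): 278,
          (True, False): 1197, (True, True): 216}

# Repeat each combination as many times as it was counted
reps = list(counts.values())
hosts = pd.DataFrame({
    'host_is_superhost': np.repeat([k[0] for k in counts], reps),
    'many_properties':   np.repeat([k[1] for k in counts], reps),
})

by_row = pd.crosstab(hosts['host_is_superhost'], hosts['many_properties'],
                     normalize='index')

print(by_row.round(6))        # matches the row-normalized table above
print(by_row.sum(axis=1))     # each row adds to 1, up to floating-point rounding
```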
The same relationship may change as we change our groupings. Are those that have three or more Chicago properties more often superhosts compared to hosts with one or two Chicago properties?
pd.crosstab(df_host['many_properties'], df_host['host_is_superhost'], normalize = 'index')
host_is_superhost | False | True |
---|---|---|
many_properties | ||
False | 0.613372 | 0.386628 |
True | 0.562753 | 0.437247 |
A two-way table with proportions calculated based on whether the host has 3 or more properties.
Now we see that 38.66% of hosts who have one or two properties are superhosts, and 43.72% of hosts who have three or more properties are superhosts.
Our calculated proportions are quite different. This is because each table divides by a different group total. In the first example, we calculated proportions for superhosts and non-superhosts separately, so each proportion is out of the number of hosts with that superhost status. In the second example, we calculated proportions for hosts with many properties and hosts with fewer properties separately, so each proportion is out of the number of hosts in that property group.
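The difference comes down to the denominator. As a short worked check using the counts from the first two-way table, the same 216 hosts (superhosts with many properties) are divided by two different totals. The variable names below are hypothetical.

```python
# Counts taken from the first two-way table
superhost_many = 216       # superhosts with 3 or more properties
superhost_few = 1197       # superhosts with 1 or 2 properties
not_superhost_many = 278   # non-superhosts with 3 or more properties

# Denominator 1: all superhosts
superhosts_total = superhost_few + superhost_many    # 1413

# Denominator 2: all hosts with 3 or more properties
many_total = not_superhost_many + superhost_many     # 494

print(round(superhost_many / superhosts_total, 6))   # 0.152866
print(round(superhost_many / many_total, 6))         # 0.437247
```

The first value is the share of superhosts who have many properties, and the second is the share of hosts with many properties who are superhosts.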
Visualizations as Summaries
Again, a visualization can be helpful to summarize the relationship between two categorical variables. There are various ways to display the same information, just as the two-way tables above each indicate different information.
For example, the following two graphs display the same information that is shown in the last two tables above.
superhost_table = pd.crosstab(df_host['host_is_superhost'], df_host['many_properties'], normalize = 'index')
superhost_table.plot.bar()
plt.xlabel('Is the Host a Superhost?')
plt.ylabel('Proportion')
plt.title('Barplot of Chicago Airbnb Host Status of Having Many Listings and Being a Superhost')
plt.ylim([0, 1])
plt.legend(loc = 'upper center')
plt.show()
Side-by-side barplot to display the proportion of hosts with many properties based on whether or not the host is a superhost.
properties_table = pd.crosstab(df_host['many_properties'], df_host['host_is_superhost'], normalize = 'index')
properties_table.plot.bar()
plt.xlabel('Does the Host Have Many Chicago Listings?')
plt.ylabel('Proportion')
plt.title('Barplot of Chicago Airbnb Host Status of Having Many Listings and Being a Superhost')
plt.ylim([0, 1])
plt.legend(loc = 'upper center')
plt.show()
Side-by-side barplot to display the proportion of hosts that are superhosts based on whether or not the host has many properties in Chicago.