Summarizing Variables with Statistics, Tables, & Plots

On this page, we will review some of the methods that can be used to create and summarize variables using calculations, tables, and visualizations.

This page will focus on research questions surrounding the size of Chicago Airbnb hosts as calculated on the last page.

How do hosts with more properties (defined as having 3 or more properties listed on Airbnb located in Chicago) compare with hosts with fewer properties (defined as having 1 or 2 properties listed on Airbnb located in Chicago)?

Logical Statements

To make answering this question easier, we will add a new variable to our data that records whether we consider the host to be large (having 3 or more properties).

We can do so using a logical statement. You saw this previously when we created boolean variables a few pages ago.

A logical statement is a statement that can be evaluated as either True or False. These statements typically include some kind of comparison, such as:

"less than" with the < symbol
"less than or equal to" with the <= symbols
"greater than" with the > symbol
"greater than or equal to" with the >= symbols
"exactly equal to" with the == symbols
"not equal to" with the != symbols

df_host['Chicago_listings_count'] >= 3

    0        True
    1       False
    2       False
    3       False
    4       False
            ...  
    3585    False
    3586    False
    3587    False
    3588    False
    3589    False
    Name: Chicago_listings_count, Length: 3590, dtype: bool
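
To see how each comparison behaves, here is a minimal sketch that applies all six operators to a small, made-up series of listing counts. The values are hypothetical and are not taken from df_host.

```python
import pandas as pd

# Hypothetical listing counts for five hosts (not taken from df_host)
counts = pd.Series([6, 1, 1, 3, 2], name='Chicago_listings_count')

# Each comparison is evaluated host by host and returns a Boolean series
print((counts < 3).tolist())   # [False, True, True, False, True]
print((counts <= 3).tolist())  # [False, True, True, True, True]
print((counts > 3).tolist())   # [True, False, False, False, False]
print((counts >= 3).tolist())  # [True, False, False, True, False]
print((counts == 3).tolist())  # [False, False, False, True, False]
print((counts != 3).tolist())  # [True, True, True, False, True]
```

Notice that the host with exactly 3 listings is True for >= but False for >, which is why the choice of symbol matters when defining a cutoff.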

Once we have the variable as we would like it, we can add it to our original host data. One way to do so is to assign the result to a new column of the data frame; here we name that column many_properties.

df_host['many_properties'] = (df_host['Chicago_listings_count'] >= 3)
df_host.head()

	host_id	mean_bedrooms	Chicago_listings_count	host_name	host_since	host_location	host_response_time	host_response_rate	host_acceptance_rate	host_is_superhost	host_total_listings_count	host_has_profile_pic	host_identity_verified	more_properties	many_properties
0	2153	1.0	6	Linda	2008-08-16	Munster, IN	within an hour	92.0	100.0	True	24	True	True	True	True
1	2613	1.0	1	Rebecca	2008-08-29	Chicago, IL	within an hour	100.0	97.0	True	1	True	True	False	False
2	4434	3.0	1	Kellen	2008-11-20	Chicago, IL	within an hour	100.0	92.0	False	5	True	True	False	False
3	6162	2.0	1	Jackie	2009-01-08	Chicago, IL	within a few hours	80.0	30.0	False	2	True	True	False	False
4	7529	1.0	1	Emily	2009-02-07	Chicago, IL	within an hour	100.0	100.0	True	1	True	True	False	False

Note the added variable many_properties at the end of this dataset.

Proportions as Means

Earlier, we calculated by hand the proportion of Airbnb hosts that were considered to be large hosts in Chicago. We can use a helpful shortcut for this calculation that is specific to Boolean variables. Calculating the mean of a Boolean variable is the same thing as calculating the proportion of observations that have that characteristic.

df_host['many_properties'].mean()

0.13760445682451253

We can confirm that 13.76% of Airbnb hosts from Chicago have 3 or more properties located inside the city.

Python evaluates True as the value of 1 and False as the value of 0 when using a Boolean variable in a calculation. This allows us to use the mean function to calculate a proportion.
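
As a quick check of this shortcut, the sketch below compares the proportion calculated by hand with the mean for a small, made-up Boolean variable. The five values are hypothetical.

```python
import pandas as pd

# Hypothetical Boolean variable for five hosts (not taken from df_host)
many = pd.Series([True, False, False, True, False])

# True counts as 1 and False as 0, so sum() counts the True values
by_hand = many.sum() / len(many)

# The mean adds up the same 1s and 0s and divides by the same total
shortcut = many.mean()

print(by_hand, shortcut)  # 0.4 0.4
```

Both calculations divide the number of True values (2) by the number of observations (5), so they must agree.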

We'll return to this shortcut in a few modules and provide some additional support for why this shortcut works.

Value Counts

We used value_counts to generate the Chicago_listings_count variable in the df_host data frame.

Here, we'll formally introduce the value_counts function and describe an option that can be used to perform additional calculations quickly with this function.

First, value_counts returns the unique values that a variable takes, along with a count of how frequently each of those values occurs in the data.

For example, we previously used this to determine the distribution of the number of listings that hosts have in Chicago.

df_host['Chicago_listings_count'].value_counts()

    1      2657
    2       439
    3       172
    4        84
    5        48
    6        42
    10       19
    8        19
    7        19
    12       12
    11       11
    9         9
    13        7
    16        6
    17        5
    21        4
    15        4
    14        3
    31        3
    18        3
    22        3
    23        2
    24        2
    19        2
    39        2
    47        1
    64        1
    30        1
    40        1
    27        1
    33        1
    658       1
    34        1
    75        1
    25        1
    32        1
    63        1
    38        1
    Name: Chicago_listings_count, dtype: int64

These are printed in order of the most common value to the least common value. We see that 2,657 hosts have a single listing, 439 have 2 listings in Chicago, and 172 have 3 listings in Chicago.

The counts can be helpful, but we may also want to know the proportion of each level. We could do this using logical statements and the mean function as previously described.

(df_host['Chicago_listings_count'] == 1).mean()

0.7401114206128133

We see that 74% of the hosts in Chicago have exactly 1 listing. If we wanted to calculate this for every value in the data, we would need to repeat the calculation 38 times. Instead, we can add an additional input to the value_counts function to calculate the proportions. Setting the normalize input to True allows us to specify that we would like Python to report the proportions for each group.

df_host['Chicago_listings_count'].value_counts(normalize = True)

    1      0.740111
    2      0.122284
    3      0.047911
    4      0.023398
    5      0.013370
    6      0.011699
    10     0.005292
    8      0.005292
    7      0.005292
    12     0.003343
    11     0.003064
    9      0.002507
    13     0.001950
    16     0.001671
    17     0.001393
    21     0.001114
    15     0.001114
    14     0.000836
    31     0.000836
    18     0.000836
    22     0.000836
    23     0.000557
    24     0.000557
    19     0.000557
    39     0.000557
    47     0.000279
    64     0.000279
    30     0.000279
    40     0.000279
    27     0.000279
    33     0.000279
    658    0.000279
    34     0.000279
    75     0.000279
    25     0.000279
    32     0.000279
    63     0.000279
    38     0.000279
    Name: Chicago_listings_count, dtype: float64
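
To make clear what normalize = True is doing, the following sketch (using a small, made-up series rather than df_host) shows that the proportions are simply the counts divided by the total number of observations.

```python
import pandas as pd

# Hypothetical listing counts for eight hosts (not taken from df_host)
listings = pd.Series([1, 1, 1, 2, 2, 3, 1, 6])

counts = listings.value_counts()                # how often each value occurs
props = listings.value_counts(normalize=True)   # proportion for each value
manual = counts / counts.sum()                  # the same proportions, by hand

print(props)
print(manual)
```

Because every observation falls under exactly one value, the proportions always add to 1.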

We can use the value_counts function for categorical variables in addition to quantitative variables.

For example, the following output demonstrates that 86.78% of Chicago Airbnb hosts are locals themselves, while the remaining hosts are located elsewhere. Note that the output skips over some of the middle values, since there are 225 locations recorded for Chicago Airbnb hosts. We can observe that there is not a second primary location where Chicago Airbnb hosts are located, since the second most common host location (United States) accounts for only 0.58% of hosts.

df_host['host_location'].value_counts(normalize = True)

    Chicago, IL                0.867752
    United States              0.005765
    Los Angeles, CA            0.004747
    Illinois, United States    0.004069
    New York, NY               0.004069
                                 ...   
    Wauconda, IL               0.000339
    La Grange Park, IL         0.000339
    Cleveland, OH              0.000339
    Countryside, IL            0.000339
    High Point, NC             0.000339
    Name: host_location, Length: 225, dtype: float64

Comparing Distributions

Now, we can start to compare the distribution of mean bedrooms per listing for hosts depending on whether the host has many or few listings in Chicago. Recall that this was our research question of interest.

Let's start with some visualizations that allow us to answer this question.

Violinplots

We may start with violinplots. These allow us to visualize a smoothed form of a histogram between two different groups. We'll use the seaborn package to help with this visualization. The sns.set() code simply sets the visualization theme for the resulting graphs.

The matplotlib.pyplot package allows us to further adjust the labels for each of these graphs.

import seaborn as sns; sns.set()
import matplotlib.pyplot as plt

The first line of code provides all information needed for generating the graph, and the following four lines of code adjust the formatting of this graph. From the seaborn package, we supply the violinplot function with the variables to place on the x-axis, y-axis, and the dataframe from which to find these variables. Python then plots each of these variables on the appropriate axes and adds the labels as specified by the last lines of code.

sns.violinplot(y = 'mean_bedrooms', x = 'many_properties', data = df_host)
plt.xlabel("Host has Many Properties (3+)")
plt.ylabel('Mean Number of Bedrooms')
plt.title('Violinplots of Mean Number of Bedrooms for Chicago Airbnb Hosts')
plt.show()

Violinplot of the Mean Number of Bedrooms for Airbnb hosts.

The first thing that I notice when looking at this violinplot is that hosts with 1 or 2 properties seem to have a discrete distribution: the very wavy shape shows that some values are much more common than others. This is because the average of either 1 or 2 whole numbers is always a whole number or a number ending in .5, so the mean can take only a limited set of values.
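
A small sketch can confirm this. Assuming, for illustration only, that listings have between 1 and 4 bedrooms, we can list every mean that a host with one or two listings could have.

```python
from itertools import combinations_with_replacement

# Hypothetical range of bedroom counts for a single listing (1 through 4)
bedrooms = range(1, 5)

# Possible means for hosts with exactly one listing
one_listing = {float(b) for b in bedrooms}

# Possible means for hosts with exactly two listings
two_listings = {(a + b) / 2 for a, b in combinations_with_replacement(bedrooms, 2)}

possible_means = sorted(one_listing | two_listings)
print(possible_means)  # [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
```

Every possible mean is a multiple of 0.5, which matches the evenly spaced bumps in the violinplot for hosts with fewer properties.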

The second thing that I notice is that each distribution has a long right tail. This suggests that the distributions are not symmetric. The median would be the more appropriate measure of center to compare these two distributions.

The spread for those with fewer properties is larger, although it does appear that the bulk of the data lies within the same range. The maximum for the smaller hosts is clearly larger than the maximum for larger hosts. This is likely influenced by the extreme values being more obvious when they appear in only one or two units. When hosts have more than 2 units, the appearance of extreme values would be counteracted by the more common values.

I also notice that even though the maximum number of bedrooms that Airbnb allows is 16, I do not see any hosts with an average of 16 bedrooms. What types of hosts have the large units, then? This graph doesn't allow us to answer this question. Instead, we would need to turn to our original data frame to analyze those listings that can accommodate large groups.

It might be easier to compare these two distributions using numerical summaries.

Numerical Summaries

To calculate numerical summaries for these two groups, we can filter the data to each group separately and then use the describe function to provide initial calculations.

df_host['mean_bedrooms'][df_host['many_properties'] == True].describe()

    count    489.000000
    mean       1.835768
    std        0.921820
    min        1.000000
    25%        1.000000
    50%        1.588235
    75%        2.250000
    max        6.400000
    Name: mean_bedrooms, dtype: float64

df_host['mean_bedrooms'][df_host['many_properties'] == False].describe()

    count    2980.000000
    mean        1.995638
    std         1.077470
    min         1.000000
    25%         1.000000
    50%         2.000000
    75%         3.000000
    max        12.000000
    Name: mean_bedrooms, dtype: float64

The numerical summaries support the initial observations from the violinplots. Those hosts who have fewer properties have a much higher maximum (12 compared to 6.4). The mean, median, standard deviation, and IQR are all higher for the hosts that have fewer properties. As described before, we would want to choose the median and IQR to represent the number of bedrooms, due to the shape of these distributions.
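
As an alternative to filtering each group separately, pandas can produce both summaries in one step with groupby. The sketch below uses a small, made-up data frame (the values are hypothetical, not from df_host) and also adds the IQR, which describe does not report directly.

```python
import pandas as pd

# Hypothetical data for seven hosts (not taken from df_host)
toy = pd.DataFrame({
    'many_properties': [True, True, True, False, False, False, False],
    'mean_bedrooms':   [1.0, 1.5, 2.5, 1.0, 2.0, 3.0, 6.0],
})

# One row of summary statistics per group (False first, then True)
summary = toy.groupby('many_properties')['mean_bedrooms'].describe()

# The IQR is the distance between the third and first quartiles
summary['IQR'] = summary['75%'] - summary['25%']

print(summary[['count', '50%', 'IQR', 'max']])
```

Placing the two groups in adjacent rows can make it easier to compare the median and IQR than reading two separate outputs.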

When observing the distribution using numerical summaries alone, it can be challenging to compare quickly between two groups. For this reason, we often use boxplots to improve the speed with which we can compare between these two groups.

Boxplots

Boxplots are a method of displaying graphically five summary measures for each group. The code is similar to that used for violinplots, except we replace the violinplot function with boxplot.

sns.boxplot(x = 'many_properties', y = 'mean_bedrooms', data = df_host)
plt.xlabel("Host has Many Properties (3+)")
plt.ylabel('Mean Number of Bedrooms')
plt.title('Boxplots of Mean Number of Bedrooms for Chicago Airbnb Hosts')
plt.show()

The boxplot highlights a few summary measures for the mean number of bedrooms of Airbnb hosts.
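
The five summary measures behind a boxplot are the minimum, first quartile, median, third quartile, and maximum. Here is a minimal sketch that calculates them for a made-up set of values (not from df_host). It also calculates the upper cutoff of 1.5 times the IQR above the third quartile, which, to the best of my understanding, is the default rule seaborn uses to decide which points to draw individually beyond the whisker.

```python
import pandas as pd

# Hypothetical mean-bedroom values for nine hosts (not taken from df_host)
bedrooms = pd.Series([1.0, 1.0, 1.0, 1.5, 2.0, 2.0, 3.0, 3.0, 8.0])

# Minimum, first quartile, median, third quartile, maximum
five_num = bedrooms.quantile([0, 0.25, 0.5, 0.75, 1])
print(five_num.tolist())  # [1.0, 1.0, 2.0, 3.0, 8.0]

# Values more than 1.5 * IQR above the third quartile are flagged
iqr = five_num.loc[0.75] - five_num.loc[0.25]
upper_fence = five_num.loc[0.75] + 1.5 * iqr
outliers = bedrooms[bedrooms > upper_fence].tolist()
print(upper_fence, outliers)  # 6.0 [8.0]
```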

Each of these two plots could also be generated for a single quantitative variable. We do not need to include the second variable that records whether the host has many properties to generate the plot.

Using these previous analyses, we can now answer our question of interest. It does not appear that there is much of a relationship between how many properties a host has in Chicago and the mean number of bedrooms for each host's listings. If anything, it does appear that those hosts who have fewer properties generally have larger properties, which is opposite of the original question of interest.

Two-Way Tables and Two-Way Graphs

Now, let's focus on the relationship between a host being a superhost and a host having many properties.

Being a superhost and having many properties each take only two values. That means that there are four possible combinations of the two variables: a superhost with many properties, a superhost without many properties, not a superhost with many properties, and not a superhost without many properties.

Numerical Summaries Using Tables

We might first be interested in how many hosts fall into each of these four categories. We can do this by creating a two-way table using the crosstab function from within the pandas library. The first variable provided will be displayed as the rows, and the second variable will be used as the columns.

pd.crosstab(df_host['host_is_superhost'], df_host['many_properties'])

many_properties	False	True
host_is_superhost
False	1899	278
True	1197	216

A two-way table with counts for the number of hosts that are contained within each category.

The counts are helpful as a way to first observe how many hosts fall into each category. For example, we see that 1,899 hosts are not superhosts and have only one or two properties. Of the four categories, we also see that this is the most common category for a host.

The counts can be challenging to interpret, so we might also want to observe the proportion of hosts that fall into each category. We can again use the normalize input to instruct Python to report the proportion of hosts in each category.

pd.crosstab(df_host['host_is_superhost'], df_host['many_properties'], normalize = True)

many_properties	False	True
host_is_superhost
False	0.528969	0.077437
True	0.333426	0.060167

A two-way table with proportions calculated based on the total number of Airbnb hosts.

52.90% of hosts have one or two listings in Chicago and are not superhosts.
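
To see where these proportions come from, the sketch below rebuilds the table of counts shown earlier by hand and divides every cell by the total number of hosts. Typing the counts in directly is only for illustration; with the real data, crosstab does this for us.

```python
import pandas as pd

# Counts copied from the two-way table of counts above
counts = pd.DataFrame(
    [[1899, 278], [1197, 216]],
    index=pd.Index([False, True], name='host_is_superhost'),
    columns=pd.Index([False, True], name='many_properties'),
)

# Total number of hosts across all four categories
total = counts.to_numpy().sum()
print(total)  # 3590

# Dividing each cell by the total gives the same result as normalize = True
overall = counts / total
print(overall.round(6))
```

Since every host falls into exactly one of the four cells, the four proportions add to 1.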

We may actually wonder if the rate of having multiple properties is different between those who are superhosts and those who are not. To calculate these rates, we will change the input for the normalize argument. We can specify that we want to normalize by the index, or the row. This means that the proportions within each row will add to 1.

pd.crosstab(df_host['host_is_superhost'], df_host['many_properties'], normalize = 'index')

many_properties	False	True
host_is_superhost
False	0.872301	0.127699
True	0.847134	0.152866

A two-way table with proportions calculated based on whether the host is or is not a superhost.

We are now able to identify that 15.29% of superhosts have 3 or more Airbnb listings in Chicago, while only 12.77% of hosts that are not superhosts have 3 or more Airbnb listings in Chicago. In other words, it does seem like hosts who are superhosts have a higher rate of having three or more properties than hosts who are not superhosts. However, most hosts do not have three or more properties.
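
Normalizing by the index amounts to dividing each row of counts by that row's total. As a rough sketch, again typing in the counts from the first two-way table for illustration, we can reproduce the row proportions by hand and confirm that each row adds to 1.

```python
import pandas as pd

# Counts copied from the two-way table of counts above
counts = pd.DataFrame(
    [[1899, 278], [1197, 216]],
    index=pd.Index([False, True], name='host_is_superhost'),
    columns=pd.Index([False, True], name='many_properties'),
)

# Number of hosts in each row: non-superhosts, then superhosts
row_totals = counts.sum(axis=1)
print(row_totals.tolist())  # [2177, 1413]

# Divide each row by its own total, as normalize = 'index' does
by_row = counts.div(row_totals, axis=0)
print(by_row.round(6))

# Each row of proportions adds to 1
print(by_row.sum(axis=1).round(6).tolist())
```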

The same relationship may change as we change our groupings. Are those that have three or more Chicago properties more often superhosts compared to hosts with one or two Chicago properties?

pd.crosstab(df_host['many_properties'], df_host['host_is_superhost'], normalize = 'index')

host_is_superhost	False	True
many_properties
False	0.613372	0.386628
True	0.562753	0.437247

A two-way table with proportions calculated based on whether the host has 3 or more properties.

Now we see that 38.66% of hosts who have one or two properties are superhosts, and 43.72% of hosts who have three or more properties are superhosts.

Our calculated proportions are quite different. This is because the definition of each group is different. In the first example, we were calculating proportions for superhosts and non-superhosts separately. In the second example, we were calculating proportions for hosts with many properties compared to those with fewer properties.
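
One way to see the difference is to notice that both calculations use the same numerator, the 216 hosts who are superhosts with many properties, but divide by different group totals. The short sketch below uses the counts from the first two-way table.

```python
# Counts taken from the two-way table of counts above
both = 216                 # superhosts with 3 or more properties
superhosts = 1197 + 216    # all superhosts
many = 278 + 216           # all hosts with 3 or more properties

# Among superhosts, the proportion with many properties
p_many_given_super = both / superhosts
print(round(p_many_given_super, 6))  # 0.152866

# Among hosts with many properties, the proportion who are superhosts
p_super_given_many = both / many
print(round(p_super_given_many, 6))  # 0.437247
```

There are far fewer hosts with many properties (494) than superhosts (1,413), so the same 216 hosts make up a much larger share of the smaller group.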

Visualizations as Summaries

Again, a visualization can be helpful to summarize the relationship between two categorical variables. There are various ways to display the same information, just as the two-way tables above each indicate different information.

For example, the following two graphs display the same information that is shown in the last two tables above.

superhost_table = pd.crosstab(df_host['host_is_superhost'], df_host['many_properties'], normalize = 'index')
superhost_table.plot.bar()
plt.xlabel('Is the Host a Superhost?')
plt.ylabel('Proportion')
plt.title('Barplot of Chicago Airbnb Host Status of Having Many Listings and Being a Superhost')
plt.ylim([0, 1])
plt.legend(loc = 'upper center')
plt.show()

Side-by-side barplot to display the proportion of hosts with many properties based on whether or not the host is a superhost.

properties_table = pd.crosstab(df_host['many_properties'], df_host['host_is_superhost'], normalize = 'index')
properties_table.plot.bar()
plt.xlabel('Does the Host Have Many Chicago Listings?')
plt.ylabel('Proportion')
plt.title('Barplot of Chicago Airbnb Host Status of Having Many Listings and Being a Superhost')
plt.ylim([0, 1])
plt.legend(loc = 'upper center')
plt.show()

Side-by-side barplot to display the proportion of hosts that are superhosts based on whether or not the host has many properties in Chicago.
