Reshaping and Merging Data


The original form of the data does not always allow us to answer our questions of interest. In these cases, we may need to combine or adjust the original data in order to answer the question of interest. On this page, you will see examples of reshaping data, copying data, merging data, and creating a new data frame as tools that can be used to adjust data and allow us to answer our research questions of interest.

Reshaping Data

Recall that we introduced a research question of interest a few pages ago: Do hosts with more properties generally have larger properties than hosts with fewer properties?

I can adjust my available data to answer this question. Before doing so, I am going make my question more specific, to ensure that I can clearly answer the question.

Do hosts with more properties (defined as having 3 or more properties listed on Chicago Airbnb) have larger properties (defined by the mean number of bedrooms per property) compared to hosts with fewer properties (defined as having 2 or fewer properties listed on Chicago Airbnb)?

We should have all of the information that we need to answer this question. To address this question directly, we need to adjust the data so that the host is the observational unit. First, let's explore the number of properties that each host has on Airbnb. We can do so using the value_counts function, which reports the number of times that a given value appears in the dataset.

df['host_id'].value_counts()
    107434423    658
    3965428       75
    395498979     64
    47172572      63
    396270947     47
                ... 
    24203102       1
    48302234       1
    31860753       1
    374295         1
    2613           1
    Name: host_id, Length: 3590, dtype: int64
    

The host with the largest number of listings has 658 listings in Chicago. There appear to be many hosts with a single listing in Chicago on Airbnb. From this output, we can also see that we have 3,590 different hosts with at least one listing in Chicago on Airbnb.

Now, we may want to learn more about the number of properties that a host has on Airbnb. We could, for example, count the number of hosts who have a single listing. To do so, we need to return to our original summary from above and chain another value_counts function to the end.

df['host_id'].value_counts().value_counts()
    1      2657
    2       439
    3       172
    4        84
    5        48
    6        42
    8        19
    10       19
    7        19
    12       12
    11       11
    9         9
    13        7
    16        6
    17        5
    15        4
    21        4
    31        3
    14        3
    22        3
    18        3
    19        2
    23        2
    24        2
    39        2
    75        1
    25        1
    27        1
    30        1
    32        1
    33        1
    34        1
    38        1
    40        1
    47        1
    63        1
    64        1
    658       1
    Name: host_id, dtype: int64
    

We can see that a vast majority of hosts have just 1 listing from Chicago on Airbnb. In fact, 2,657 out of the 3,590 hosts with a listing in Chicago on Airbnb have a single listing. We do still have 494 hosts that we would consider to have more properties (13.76% of our available hosts).

Creating a Data Frame

Above, we see that we can observe how many listings each host has. However, it might be nice to have this information available in a data frame for future analysis. How can we create a data frame with this information? Ideally, with 3,590 hosts, we would like to do so automatically rather than manually.

To do so, we can instruct Python to create a data frame from the available data.

df_host_listings = pd.DataFrame(df['host_id'].value_counts())
df_host_listings.head()
host_id
107434423 658
3965428 75
395498979 64
47172572 63
396270947 47

The first few observations for the number of listings that a host has.

Using pd.DataFrame, we have instructed Python to change the form of the value_counts from above to a data frame type. However, when we look at this data frame above, it does not have our columns as we would typically expect them, and the labels do not appear as we might expect. We can adjust this using additional code.

First, we will reset the index using reset_index so that the index has values from 0 to 3,589. Our original index becomes a new column.

df_host_listings = df_host_listings.reset_index()
df_host_listings.head()
index host_id
0 107434423 658
1 3965428 75
2 395498979 64
3 47172572 63
4 396270947 47

The first few observations for the number of listings that a host has, with the index reset.

Then, we will adjust the column names for this data frame using the columns feature of a data frame.

df_host_listings.columns
 Index(['index', 'host_id'], dtype='object')
df_host_listings.columns = ['host_id', 'Chicago_listings_count']
df_host_listings.head()
index host_id
0 107434423 658
1 3965428 75
2 395498979 64
3 47172572 63
4 396270947 47

The first few observations for the number of listings that a host has, with the column names corrected.

We now have a new data frame that includes the host ids and the number of listings that each host has in Chicago.

If we needed to create a data frame by hand, we can do so with similar original code. In this example, we would provide our variables inside of curly bracket { }, with the variable names in quotes and the values for each variable provided as a list surrounded by square brackets [ ] separated by a colon.

pd.DataFrame({'a': [1, 2, 3], 'b': ['house', 'apt', 'duplex'], 'c': [20, 40, 30]})
a b c
0 1 house 20
1 2 apt 40
2 3 duplex 30

A data frame can be generated manually, if needed.

Reshaping the Data, Part 2

Above, we saw that we were able to create a new data frame that counted how many listing each of the hosts had in Chicago on Airbnb. Now, however, we also want to record the mean number of bedrooms per property for each of the hosts.

Data Science Discovery demonstrated the use of the groupby function, which allows you to calculate a summary measure of a variable after creating groups as specified by another variable in the data. We will use this tool below to calculate the mean number of bedrooms per property for each of our hosts.

df_host_bedrooms = df[['host_id', 'bedrooms']].groupby(by = 'host_id').mean()
df_host_bedrooms.head()
bedrooms
host_id
2153 1.0
2613 1.0
4434 3.0
6162 2.0
7529 1.0

A summary measure of the mean number of bedrooms per property that each host has.

df_host_bedrooms.shape
(3590, 1)

We see that the data is returned to us in a similar form as with the value_counts, where the host_id is serving as the index and the mean number of bedrooms is a single column in the resulting data frame. We can similarly adjust this data so that the host_id is retained as a column.

df_host_bedrooms = df_host_bedrooms.reset_index()
df_host_bedrooms.columns = ['host_id', 'mean_bedrooms']
df_host_bedrooms.head()
host_id mean_bedrooms
0 2153 1.0
1 2613 1.0
2 4434 3.0
3 6162 2.0
4 7529 1.0

The version of the bedrooms dataset, with corrections and adjustments in formatting.

Merging Data

We have all of the necessary information calculated in our two data frames above: df_host_listings and df_host_bedrooms. However, we need to incorporate these two data frames into one to be able to address our question fully.

We can do this with the merge function. This function will combine two data frames while matching rows based on some criteria. Here, we'd like to match the two data frames using the host_id variable, so that our resulting data frame contains observations for each host with the number of properties and the mean number of bedrooms recorded for each host. The input on specifies the variable used to match values from one data frame to the other.

df_host = df_host_listings.merge(df_host_bedrooms, on = 'host_id')
df_host.head()
host_id Chicago_listings_count mean_bedrooms
0 107434423 658 1.332075
1 3965428 75 1.344262
2 395498979 64 2.525424
3 47172572 63 1.652174
4 396270947 47 1.425532

The merged data contains information on both the number of listings and the mean number of bedrooms for each host based on their Chicago Airbnb listings.

df_host.shape
(3590, 3)

Our resulting df_host data frame added the mean bedrooms variable to the host listings variable. It was able to reorder the mean bedrooms variable to appropriately match the listings variable by searching for the appropriate host_id.

With this dataset, we are able to answer our original question of interest. We will describe approaches to answer this data on the next page.

Before we turn to the rest of the analysis, let's explore the capabilities of the merge function a little more.

First, does it matter the order with which we specify our two data frames to merge? Above, we added the bedroom variable to the original listings data frame. Could we do the opposite?

df_host = df_host_bedrooms.merge(df_host_listings, on = 'host_id')
df_host.head()
host_id mean_bedrooms Chicago_listings_count
0 2153 1.0 6
1 2613 1.0 1
2 4434 3.0 1
3 6162 2.0 1
4 7529 1.0 1

A second way to merge the two host datasets. The same information is reported, although arranged in a different manner.

We see that our hosts are provided in different orders in the two data frames. The order of the hosts in the second data frame matches the original order in the df_host_bedrooms data. The order of the hosts will match the original order of the first data frame provided. However, do the values match? We can check that the values match for a few hosts by sorting our new data frame.

By default, the values are sorted in ascending order (from small to big). We will request instead that these values be sorted in the opposite order.

df_host.sort_values(by = 'Chicago_listings_count', ascending = False).head()
host_id mean_bedrooms Chicago_listings_count
1932 107434423 1.332075 658
262 3965428 1.344262 75
3133 395498979 2.525424 64
1329 47172572 1.652174 63
3136 396270947 1.425532 47

After sorting, the beginnings of both merged data display the same results.

It does appear that the values match between these two dataset, although we see that the order of the columns have been switched.

Above, we observed a merge from two data sets, where each host appeared exactly once in each data set. Not all merges have exact matches in each dataset. We will now demonstrate how to adjust the merge function to account for this type of scenario.

First, consider the situation where we answer our research question of interest but now would like to answer follow up questions about the host. For example, maybe we would like to know more about characteristics of larger property owners, using their response rates, for example.

We can merge information from our original listings data frame with the host data frame, even though the hosts appear multiple times in the listings data frame. To do so, we will specify only host-specific variables and drop duplicates before our merge. Without removing duplicates, we would have a new row for each time the host appears in the original data frame.

host_vars = ['host_id', 'host_name', 'host_since', 'host_location', 'host_response_time', 
'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
'host_total_listings_count', 'host_has_profile_pic', 'host_identity_verified']
df_host_values = df[host_vars].drop_duplicates()
df_host = df_host.merge(df_host_values, on = 'host_id')
df_host.head()
host_id mean_bedrooms Chicago_listings_count host_name host_since host_location host_response_time host_response_rate host_acceptance_rate host_is_superhost host_total_listings_count host_has_profile_pic host_identity_verified
0 2153 1.0 6 Linda 2008-08-16 Munster, IN within an hour 92.0 100.0 True 24 True True
1 2613 1.0 1 Rebecca 2008-08-29 Chicago, IL within an hour 100.0 97.0 True 1 True True
2 4434 3.0 1 Kellen 2008-11-20 Chicago, IL within an hour 100.0 92.0 False 5 True True
3 6162 2.0 1 Jackie 2009-01-08 Chicago, IL within a few hours 80.0 30.0 False 2 True True
4 7529 1.0 1 Emily 2009-02-07 Chicago, IL within an hour 100.0 100.0 True 1 True True

We can merge information from a dataset that has additional variables.

Now, we have our host-specific information contained within the df_host data frame, so that we can address many host-based questions quickly.

What if we instead wanted to ask a question based on the listing? For example, we may want to learn more about if the listings in a neighborhood have hosts who are larger and have more properties in Chicago or if they are hosts who have a smaller number of properties. To do this, we want to add the Chicago_listings_count and mean_bedrooms variables to our original listings data frame.

df = df.merge(df_host[['host_id', 'mean_bedrooms', 'Chicago_listings_count']], on = 'host_id')
df.shape
(7747, 77)

Through this data manipulation, we now have variables in our data frames that were not originally there. This enables us to answer more questions and in different ways. Let's return now to our original question of interest and explore what we can say about larger and smaller Airbnb hosts in the city of Chicago.