Reshaping and Merging Data
The original form of the data does not always allow us to answer our questions of interest. In these cases, we may need to combine or adjust the original data in order to answer the question of interest. On this page, you will see examples of reshaping data, copying data, merging data, and creating a new data frame as tools that can be used to adjust data and allow us to answer our research questions of interest.
Reshaping Data
Recall that we introduced a research question of interest a few pages ago: Do hosts with more properties generally have larger properties than hosts with fewer properties?
I can adjust my available data to answer this question. Before doing so, I am going make my question more specific, to ensure that I can clearly answer the question.
Do hosts with more properties (defined as having 3 or more properties listed on Chicago Airbnb) have larger properties (defined by the mean number of bedrooms per property) compared to hosts with fewer properties (defined as having 2 or fewer properties listed on Chicago Airbnb)?
We should have all of the information that we need to answer this question. To address this question directly, we need to adjust the data so that the host is the observational unit. First, let's explore the number of properties that each host has on Airbnb. We can do so using the value_counts
function, which reports the number of times that a given value appears in the dataset.
df['host_id'].value_counts()
107434423 658 3965428 75 395498979 64 47172572 63 396270947 47 ... 24203102 1 48302234 1 31860753 1 374295 1 2613 1 Name: host_id, Length: 3590, dtype: int64
The host with the largest number of listings has 658 listings in Chicago. There appear to be many hosts with a single listing in Chicago on Airbnb. From this output, we can also see that we have 3,590 different hosts with at least one listing in Chicago on Airbnb.
Now, we may want to learn more about the number of properties that a host has on Airbnb. We could, for example, count the number of hosts who have a single listing. To do so, we need to return to our original summary from above and chain another value_counts
function to the end.
df['host_id'].value_counts().value_counts()
1 2657 2 439 3 172 4 84 5 48 6 42 8 19 10 19 7 19 12 12 11 11 9 9 13 7 16 6 17 5 15 4 21 4 31 3 14 3 22 3 18 3 19 2 23 2 24 2 39 2 75 1 25 1 27 1 30 1 32 1 33 1 34 1 38 1 40 1 47 1 63 1 64 1 658 1 Name: host_id, dtype: int64
We can see that a vast majority of hosts have just 1 listing from Chicago on Airbnb. In fact, 2,657 out of the 3,590 hosts with a listing in Chicago on Airbnb have a single listing. We do still have 494 hosts that we would consider to have more properties (13.76% of our available hosts).
Creating a Data Frame
Above, we see that we can observe how many listings each host has. However, it might be nice to have this information available in a data frame for future analysis. How can we create a data frame with this information? Ideally, with 3,590 hosts, we would like to do so automatically rather than manually.
To do so, we can instruct Python to create a data frame from the available data.
df_host_listings = pd.DataFrame(df['host_id'].value_counts())
df_host_listings.head()
host_id | |
---|---|
107434423 | 658 |
3965428 | 75 |
395498979 | 64 |
47172572 | 63 |
396270947 | 47 |
The first few observations for the number of listings that a host has.
Using pd.DataFrame
, we have instructed Python to change the form of the value_counts from above to a data frame type. However, when we look at this data frame above, it does not have our columns as we would typically expect them, and the labels do not appear as we might expect. We can adjust this using additional code.
First, we will reset the index using reset_index
so that the index has values from 0 to 3,589. Our original index becomes a new column.
df_host_listings = df_host_listings.reset_index()
df_host_listings.head()
index | host_id | |
---|---|---|
0 | 107434423 | 658 |
1 | 3965428 | 75 |
2 | 395498979 | 64 |
3 | 47172572 | 63 |
4 | 396270947 | 47 |
The first few observations for the number of listings that a host has, with the index reset.
Then, we will adjust the column names for this data frame using the columns
feature of a data frame.
df_host_listings.columns
Index(['index', 'host_id'], dtype='object')
df_host_listings.columns = ['host_id', 'Chicago_listings_count']
df_host_listings.head()
index | host_id | |
---|---|---|
0 | 107434423 | 658 |
1 | 3965428 | 75 |
2 | 395498979 | 64 |
3 | 47172572 | 63 |
4 | 396270947 | 47 |
The first few observations for the number of listings that a host has, with the column names corrected.
We now have a new data frame that includes the host ids and the number of listings that each host has in Chicago.
If we needed to create a data frame by hand, we can do so with similar original code. In this example, we would provide our variables inside of curly bracket { }, with the variable names in quotes and the values for each variable provided as a list surrounded by square brackets [ ] separated by a colon.
pd.DataFrame({'a': [1, 2, 3], 'b': ['house', 'apt', 'duplex'], 'c': [20, 40, 30]})
a | b | c | |
---|---|---|---|
0 | 1 | house | 20 |
1 | 2 | apt | 40 |
2 | 3 | duplex | 30 |
A data frame can be generated manually, if needed.
Reshaping the Data, Part 2
Above, we saw that we were able to create a new data frame that counted how many listing each of the hosts had in Chicago on Airbnb. Now, however, we also want to record the mean number of bedrooms per property for each of the hosts.
Data Science Discovery demonstrated the use of the groupby
function, which allows you to calculate a summary measure of a variable after creating groups as specified by another variable in the data. We will use this tool below to calculate the mean number of bedrooms per property for each of our hosts.
df_host_bedrooms = df[['host_id', 'bedrooms']].groupby(by = 'host_id').mean()
df_host_bedrooms.head()
bedrooms | |
---|---|
host_id | |
2153 | 1.0 |
2613 | 1.0 |
4434 | 3.0 |
6162 | 2.0 |
7529 | 1.0 |
A summary measure of the mean number of bedrooms per property that each host has.
df_host_bedrooms.shape
(3590, 1)
We see that the data is returned to us in a similar form as with the value_counts, where the host_id
is serving as the index and the mean number of bedrooms is a single column in the resulting data frame. We can similarly adjust this data so that the host_id is retained as a column.
df_host_bedrooms = df_host_bedrooms.reset_index()
df_host_bedrooms.columns = ['host_id', 'mean_bedrooms']
df_host_bedrooms.head()
host_id | mean_bedrooms | |
---|---|---|
0 | 2153 | 1.0 |
1 | 2613 | 1.0 |
2 | 4434 | 3.0 |
3 | 6162 | 2.0 |
4 | 7529 | 1.0 |
The version of the bedrooms dataset, with corrections and adjustments in formatting.
Merging Data
We have all of the necessary information calculated in our two data frames above: df_host_listings and df_host_bedrooms. However, we need to incorporate these two data frames into one to be able to address our question fully.
We can do this with the merge
function. This function will combine two data frames while matching rows based on some criteria. Here, we'd like to match the two data frames using the host_id variable, so that our resulting data frame contains observations for each host with the number of properties and the mean number of bedrooms recorded for each host. The input on
specifies the variable used to match values from one data frame to the other.
df_host = df_host_listings.merge(df_host_bedrooms, on = 'host_id')
df_host.head()
host_id | Chicago_listings_count | mean_bedrooms | |
---|---|---|---|
0 | 107434423 | 658 | 1.332075 |
1 | 3965428 | 75 | 1.344262 |
2 | 395498979 | 64 | 2.525424 |
3 | 47172572 | 63 | 1.652174 |
4 | 396270947 | 47 | 1.425532 |
The merged data contains information on both the number of listings and the mean number of bedrooms for each host based on their Chicago Airbnb listings.
df_host.shape
(3590, 3)
Our resulting df_host data frame added the mean bedrooms variable to the host listings variable. It was able to reorder the mean bedrooms variable to appropriately match the listings variable by searching for the appropriate host_id.
With this dataset, we are able to answer our original question of interest. We will describe approaches to answer this data on the next page.
Before we turn to the rest of the analysis, let's explore the capabilities of the merge
function a little more.
First, does it matter the order with which we specify our two data frames to merge? Above, we added the bedroom variable to the original listings data frame. Could we do the opposite?
df_host = df_host_bedrooms.merge(df_host_listings, on = 'host_id')
df_host.head()
host_id | mean_bedrooms | Chicago_listings_count | |
---|---|---|---|
0 | 2153 | 1.0 | 6 |
1 | 2613 | 1.0 | 1 |
2 | 4434 | 3.0 | 1 |
3 | 6162 | 2.0 | 1 |
4 | 7529 | 1.0 | 1 |
A second way to merge the two host datasets. The same information is reported, although arranged in a different manner.
We see that our hosts are provided in different orders in the two data frames. The order of the hosts in the second data frame matches the original order in the df_host_bedrooms data. The order of the hosts will match the original order of the first data frame provided. However, do the values match? We can check that the values match for a few hosts by sorting our new data frame.
By default, the values are sorted in ascending order (from small to big). We will request instead that these values be sorted in the opposite order.
df_host.sort_values(by = 'Chicago_listings_count', ascending = False).head()
host_id | mean_bedrooms | Chicago_listings_count | |
---|---|---|---|
1932 | 107434423 | 1.332075 | 658 |
262 | 3965428 | 1.344262 | 75 |
3133 | 395498979 | 2.525424 | 64 |
1329 | 47172572 | 1.652174 | 63 |
3136 | 396270947 | 1.425532 | 47 |
After sorting, the beginnings of both merged data display the same results.
It does appear that the values match between these two dataset, although we see that the order of the columns have been switched.
Above, we observed a merge from two data sets, where each host appeared exactly once in each data set. Not all merges have exact matches in each dataset. We will now demonstrate how to adjust the merge
function to account for this type of scenario.
First, consider the situation where we answer our research question of interest but now would like to answer follow up questions about the host. For example, maybe we would like to know more about characteristics of larger property owners, using their response rates, for example.
We can merge information from our original listings data frame with the host data frame, even though the hosts appear multiple times in the listings data frame. To do so, we will specify only host-specific variables and drop duplicates before our merge. Without removing duplicates, we would have a new row for each time the host appears in the original data frame.
host_vars = ['host_id', 'host_name', 'host_since', 'host_location', 'host_response_time',
'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
'host_total_listings_count', 'host_has_profile_pic', 'host_identity_verified']
df_host_values = df[host_vars].drop_duplicates()
df_host = df_host.merge(df_host_values, on = 'host_id')
df_host.head()
host_id | mean_bedrooms | Chicago_listings_count | host_name | host_since | host_location | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_total_listings_count | host_has_profile_pic | host_identity_verified | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2153 | 1.0 | 6 | Linda | 2008-08-16 | Munster, IN | within an hour | 92.0 | 100.0 | True | 24 | True | True |
1 | 2613 | 1.0 | 1 | Rebecca | 2008-08-29 | Chicago, IL | within an hour | 100.0 | 97.0 | True | 1 | True | True |
2 | 4434 | 3.0 | 1 | Kellen | 2008-11-20 | Chicago, IL | within an hour | 100.0 | 92.0 | False | 5 | True | True |
3 | 6162 | 2.0 | 1 | Jackie | 2009-01-08 | Chicago, IL | within a few hours | 80.0 | 30.0 | False | 2 | True | True |
4 | 7529 | 1.0 | 1 | Emily | 2009-02-07 | Chicago, IL | within an hour | 100.0 | 100.0 | True | 1 | True | True |
We can merge information from a dataset that has additional variables.
Now, we have our host-specific information contained within the df_host data frame, so that we can address many host-based questions quickly.
What if we instead wanted to ask a question based on the listing? For example, we may want to learn more about if the listings in a neighborhood have hosts who are larger and have more properties in Chicago or if they are hosts who have a smaller number of properties. To do this, we want to add the Chicago_listings_count and mean_bedrooms variables to our original listings data frame.
df = df.merge(df_host[['host_id', 'mean_bedrooms', 'Chicago_listings_count']], on = 'host_id')
df.shape
(7747, 77)
Through this data manipulation, we now have variables in our data frames that were not originally there. This enables us to answer more questions and in different ways. Let's return now to our original question of interest and explore what we can say about larger and smaller Airbnb hosts in the city of Chicago.