Reshaping and Merging Data

← Missing Data Next: Summarizing Variables with Statistics, Tables, & Plots →

The original form of the data does not always allow us to answer our questions of interest. In these cases, we may need to combine or adjust the original data in order to answer the question of interest. On this page, you will see examples of reshaping data, copying data, merging data, and creating a new data frame as tools that can be used to adjust data and allow us to answer our research questions of interest.

Reshaping Data

Recall that we introduced a research question of interest a few pages ago: Do hosts with more properties generally have larger properties than hosts with fewer properties?

I can adjust my available data to answer this question. Before doing so, I am going make my question more specific, to ensure that I can clearly answer the question.

Do hosts with more properties (defined as having 3 or more properties listed on Chicago Airbnb) have larger properties (defined by the mean number of bedrooms per property) compared to hosts with fewer properties (defined as having 2 or fewer properties listed on Chicago Airbnb)?

We should have all of the information that we need to answer this question. To address this question directly, we need to adjust the data so that the host is the observational unit. First, let's explore the number of properties that each host has on Airbnb. We can do so using the value_counts function, which reports the number of times that a given value appears in the dataset.

df['host_id'].value_counts()

    107434423    658
    3965428       75
    395498979     64
    47172572      63
    396270947     47
                ... 
    24203102       1
    48302234       1
    31860753       1
    374295         1
    2613           1
    Name: host_id, Length: 3590, dtype: int64

The host with the largest number of listings has 658 listings in Chicago. There appear to be many hosts with a single listing in Chicago on Airbnb. From this output, we can also see that we have 3,590 different hosts with at least one listing in Chicago on Airbnb.

Now, we may want to learn more about the number of properties that a host has on Airbnb. We could, for example, count the number of hosts who have a single listing. To do so, we need to return to our original summary from above and chain another value_counts function to the end.

df['host_id'].value_counts().value_counts()

    1      2657
    2       439
    3       172
    4        84
    5        48
    6        42
    8        19
    10       19
    7        19
    12       12
    11       11
    9         9
    13        7
    16        6
    17        5
    15        4
    21        4
    31        3
    14        3
    22        3
    18        3
    19        2
    23        2
    24        2
    39        2
    75        1
    25        1
    27        1
    30        1
    32        1
    33        1
    34        1
    38        1
    40        1
    47        1
    63        1
    64        1
    658       1
    Name: host_id, dtype: int64

We can see that a vast majority of hosts have just 1 listing from Chicago on Airbnb. In fact, 2,657 out of the 3,590 hosts with a listing in Chicago on Airbnb have a single listing. We do still have 494 hosts that we would consider to have more properties (13.76% of our available hosts).

Creating a Data Frame

Above, we see that we can observe how many listings each host has. However, it might be nice to have this information available in a data frame for future analysis. How can we create a data frame with this information? Ideally, with 3,590 hosts, we would like to do so automatically rather than manually.

To do so, we can instruct Python to create a data frame from the available data.

df_host_listings = pd.DataFrame(df['host_id'].value_counts())

df_host_listings.head()

	host_id
107434423	658
3965428	75
395498979	64
47172572	63
396270947	47

The first few observations for the number of listings that a host has.

Using pd.DataFrame, we have instructed Python to change the form of the value_counts from above to a data frame type. However, when we look at this data frame above, it does not have our columns as we would typically expect them, and the labels do not appear as we might expect. We can adjust this using additional code.

First, we will reset the index using reset_index so that the index has values from 0 to 3,589. Our original index becomes a new column.

df_host_listings = df_host_listings.reset_index()

df_host_listings.head()

	index	host_id
0	107434423	658
1	3965428	75
2	395498979	64
3	47172572	63
4	396270947	47

The first few observations for the number of listings that a host has, with the index reset.

Then, we will adjust the column names for this data frame using the columns feature of a data frame.

df_host_listings.columns

 Index(['index', 'host_id'], dtype='object')

df_host_listings.columns = ['host_id', 'Chicago_listings_count']
df_host_listings.head()

	index	host_id
0	107434423	658
1	3965428	75
2	395498979	64
3	47172572	63
4	396270947	47

The first few observations for the number of listings that a host has, with the column names corrected.

We now have a new data frame that includes the host ids and the number of listings that each host has in Chicago.

If we needed to create a data frame by hand, we can do so with similar original code. In this example, we would provide our variables inside of curly bracket { }, with the variable names in quotes and the values for each variable provided as a list surrounded by square brackets [ ] separated by a colon.

pd.DataFrame({'a': [1, 2, 3], 'b': ['house', 'apt', 'duplex'], 'c': [20, 40, 30]})

	a	b	c
0	1	house	20
1	2	apt	40
2	3	duplex	30

A data frame can be generated manually, if needed.

Reshaping the Data, Part 2

Above, we saw that we were able to create a new data frame that counted how many listing each of the hosts had in Chicago on Airbnb. Now, however, we also want to record the mean number of bedrooms per property for each of the hosts.

Data Science Discovery demonstrated the use of the groupby function, which allows you to calculate a summary measure of a variable after creating groups as specified by another variable in the data. We will use this tool below to calculate the mean number of bedrooms per property for each of our hosts.

df_host_bedrooms = df[['host_id', 'bedrooms']].groupby(by = 'host_id').mean()

df_host_bedrooms.head()

	bedrooms
host_id
2153	1.0
2613	1.0
4434	3.0
6162	2.0
7529	1.0

A summary measure of the mean number of bedrooms per property that each host has.

df_host_bedrooms.shape

(3590, 1)

We see that the data is returned to us in a similar form as with the value_counts, where the host_id is serving as the index and the mean number of bedrooms is a single column in the resulting data frame. We can similarly adjust this data so that the host_id is retained as a column.

df_host_bedrooms = df_host_bedrooms.reset_index()
df_host_bedrooms.columns = ['host_id', 'mean_bedrooms']
df_host_bedrooms.head()

	host_id	mean_bedrooms
0	2153	1.0
1	2613	1.0
2	4434	3.0
3	6162	2.0
4	7529	1.0

The version of the bedrooms dataset, with corrections and adjustments in formatting.

Merging Data

We have all of the necessary information calculated in our two data frames above: df_host_listings and df_host_bedrooms. However, we need to incorporate these two data frames into one to be able to address our question fully.

We can do this with the merge function. This function will combine two data frames while matching rows based on some criteria. Here, we'd like to match the two data frames using the host_id variable, so that our resulting data frame contains observations for each host with the number of properties and the mean number of bedrooms recorded for each host. The input on specifies the variable used to match values from one data frame to the other.

df_host = df_host_listings.merge(df_host_bedrooms, on = 'host_id')
df_host.head()

	host_id	Chicago_listings_count	mean_bedrooms
0	107434423	658	1.332075
1	3965428	75	1.344262
2	395498979	64	2.525424
3	47172572	63	1.652174
4	396270947	47	1.425532

The merged data contains information on both the number of listings and the mean number of bedrooms for each host based on their Chicago Airbnb listings.

df_host.shape

(3590, 3)

Our resulting df_host data frame added the mean bedrooms variable to the host listings variable. It was able to reorder the mean bedrooms variable to appropriately match the listings variable by searching for the appropriate host_id.

With this dataset, we are able to answer our original question of interest. We will describe approaches to answer this data on the next page.

Before we turn to the rest of the analysis, let's explore the capabilities of the merge function a little more.

First, does it matter the order with which we specify our two data frames to merge? Above, we added the bedroom variable to the original listings data frame. Could we do the opposite?

df_host = df_host_bedrooms.merge(df_host_listings, on = 'host_id')
df_host.head()

	host_id	mean_bedrooms	Chicago_listings_count
0	2153	1.0	6
1	2613	1.0	1
2	4434	3.0	1
3	6162	2.0	1
4	7529	1.0	1

A second way to merge the two host datasets. The same information is reported, although arranged in a different manner.

We see that our hosts are provided in different orders in the two data frames. The order of the hosts in the second data frame matches the original order in the df_host_bedrooms data. The order of the hosts will match the original order of the first data frame provided. However, do the values match? We can check that the values match for a few hosts by sorting our new data frame.

By default, the values are sorted in ascending order (from small to big). We will request instead that these values be sorted in the opposite order.

df_host.sort_values(by = 'Chicago_listings_count', ascending = False).head()

	host_id	mean_bedrooms	Chicago_listings_count
1932	107434423	1.332075	658
262	3965428	1.344262	75
3133	395498979	2.525424	64
1329	47172572	1.652174	63
3136	396270947	1.425532	47

After sorting, the beginnings of both merged data display the same results.

It does appear that the values match between these two dataset, although we see that the order of the columns have been switched.

Above, we observed a merge from two data sets, where each host appeared exactly once in each data set. Not all merges have exact matches in each dataset. We will now demonstrate how to adjust the merge function to account for this type of scenario.

First, consider the situation where we answer our research question of interest but now would like to answer follow up questions about the host. For example, maybe we would like to know more about characteristics of larger property owners, using their response rates, for example.

We can merge information from our original listings data frame with the host data frame, even though the hosts appear multiple times in the listings data frame. To do so, we will specify only host-specific variables and drop duplicates before our merge. Without removing duplicates, we would have a new row for each time the host appears in the original data frame.

host_vars = ['host_id', 'host_name', 'host_since', 'host_location', 'host_response_time', 
             'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 
             'host_total_listings_count', 'host_has_profile_pic', 'host_identity_verified']
df_host_values = df[host_vars].drop_duplicates()

df_host = df_host.merge(df_host_values, on = 'host_id')

df_host.head()

	host_id	mean_bedrooms	Chicago_listings_count	host_name	host_since	host_location	host_response_time	host_response_rate	host_acceptance_rate	host_is_superhost	host_total_listings_count	host_has_profile_pic	host_identity_verified
0	2153	1.0	6	Linda	2008-08-16	Munster, IN	within an hour	92.0	100.0	True	24	True	True
1	2613	1.0	1	Rebecca	2008-08-29	Chicago, IL	within an hour	100.0	97.0	True	1	True	True
2	4434	3.0	1	Kellen	2008-11-20	Chicago, IL	within an hour	100.0	92.0	False	5	True	True
3	6162	2.0	1	Jackie	2009-01-08	Chicago, IL	within a few hours	80.0	30.0	False	2	True	True
4	7529	1.0	1	Emily	2009-02-07	Chicago, IL	within an hour	100.0	100.0	True	1	True	True

We can merge information from a dataset that has additional variables.

Now, we have our host-specific information contained within the df_host data frame, so that we can address many host-based questions quickly.

What if we instead wanted to ask a question based on the listing? For example, we may want to learn more about if the listings in a neighborhood have hosts who are larger and have more properties in Chicago or if they are hosts who have a smaller number of properties. To do this, we want to add the Chicago_listings_count and mean_bedrooms variables to our original listings data frame.

df = df.merge(df_host[['host_id', 'mean_bedrooms', 'Chicago_listings_count']], on = 'host_id')

df.shape

(7747, 77)

Through this data manipulation, we now have variables in our data frames that were not originally there. This enables us to answer more questions and in different ways. Let's return now to our original question of interest and explore what we can say about larger and smaller Airbnb hosts in the city of Chicago.

← Missing Data Next: Summarizing Variables with Statistics, Tables, & Plots →