Cleaning and Preparing Data
Most data will not arrive perfect and ready for analysis. Data scientists will typically need to engage with cleaning and preparing the data before calculating numerical summaries, creating visualizations, and making conclusions from the available data.
Identifying Variables for Adjustment
One of the first steps in preparing data for cleaning is observing and identifying which variables may need to be cleaned. Depending on the number of variables in the data, you may be able to do this by eye or you may need a more systematic process. The data may also come with documentation that includes dictionaries for variables, ways in which the data have already been prepared, and suggestions for how to analyze the data.
We'll briefly look through the data as it originally was posted to Inside Airbnb.
df.dtypes
id int64 listing_url object scrape_id int64 last_scraped object source object name object description object neighborhood_overview object picture_url object host_id int64 host_url object host_name object host_since object host_location object host_about object host_response_time object host_response_rate object host_acceptance_rate object host_is_superhost object host_thumbnail_url object host_picture_url object host_neighbourhood object host_listings_count int64 host_total_listings_count int64 host_verifications object host_has_profile_pic object host_identity_verified object neighbourhood object neighbourhood_cleansed object neighbourhood_group_cleansed float64 latitude float64 longitude float64 property_type object room_type object accommodates int64 bathrooms float64 bathrooms_text object bedrooms float64 beds float64 amenities object price object minimum_nights int64 maximum_nights int64 minimum_minimum_nights int64 maximum_minimum_nights int64 minimum_maximum_nights int64 maximum_maximum_nights int64 minimum_nights_avg_ntm float64 maximum_nights_avg_ntm float64 calendar_updated float64 has_availability object availability_30 int64 availability_60 int64 availability_90 int64 availability_365 int64 calendar_last_scraped object number_of_reviews int64 number_of_reviews_ltm int64 number_of_reviews_l30d int64 first_review object last_review object review_scores_rating float64 review_scores_accuracy float64 review_scores_cleanliness float64 review_scores_checkin float64 review_scores_communication float64 review_scores_location float64 review_scores_value float64 license object instant_bookable object calculated_host_listings_count int64 calculated_host_listings_count_entire_homes int64 calculated_host_listings_count_private_rooms int64 calculated_host_listings_count_shared_rooms int64 reviews_per_month float64 dtype: object
In a brief review of the above information, I notice a few variables with unexpected data types. For example, I see that the following variables are recorded as objects but seem like they would be more appropriate as a numerical variable:
host_response_rate
host_acceptance_rate
price
I also see that the following variables are dichotomous variables, and so I would like them to be recorded as boolean variables in Python:
host_is_superhost
host_has_profile_pic
host_identity_verified
has_availability
instant_bookable
Finally, I notice that some of these variables include information about calendar dates:
host_since
calendar_last_scraped
first_review
last_review
We will want to adjust these variable types so that we can use them correctly later. While we won't be able to provide examples for every variable adjustment that might be needed, there are a number of resources online to help for more specific ways of adjusting variables.
Recommendations while attempting to adjust variables include:
- Understanding the desired variables and what is required for analysis with input from key stakeholders,
- Determining what needs to be adjusted about the variable for cleaning,
- Checking your work after each partial step through visualization,
- Asking for help through your resources, whether it's coworkers, course staff, or online message boards.
Creating Boolean Variables
Since I will be recreating five variables as boolean variable types, I will write a function to help with this adjustment. The benefit of writing a function is that the same function can be reused multiple times without needing to copy and paste the some code with minor adjustments.
def make_boolean(var, true_condition):
# Creates a boolean variable
# True is recorded when the true_condition holds
return (var == true_condition)
df['host_is_superhost'] = make_boolean(df['host_is_superhost'], 't')
df['host_has_profile_pic'] = make_boolean(df['host_has_profile_pic'], 't')
df['host_identity_verified'] = make_boolean(df['host_identity_verified'], 't')
df['has_availability'] = make_boolean(df['has_availability'], 't')
df['instant_bookable'] = make_boolean(df['instant_bookable'], 't')
Throughout the analysis process, it is recommended that you perform regular checks of your work. I recommend stopping after each step to allow you to identify errors earlier, so that it is easier to identify the code that isn't working like you hoped and to correct the errors.
df['host_is_superhost'].value_counts()
False 4898 True 2849 Name: host_is_superhost, dtype: int64
Creating Numeric Variables
Some of the variables that we anticipated would be recorded as numeric variables are not numeric. For example, we see that price, host acceptance rate, and host response rate all seem like they would be recorded as numbers but are not.
For this example, how should we clean these variables? Let's first look at how the variables currently exist in the data.
df['price'].head()
0 $90.00 1 $379.00 2 $479.00 3 $479.00 4 $79.00 Name: price, dtype: object
df['host_response_rate'].head()
0 100% 1 100% 2 100% 3 100% 4 100% Name: host_response_rate, dtype: object
For each of these variables, an additional symbol (either the '$' or '%' sign) results in the values being recorded as objects. Note that numerical variables can only understand numbers and numbers alone.
If we were to try to calculate the mean of one of these variables, would it work?
df['price'].mean()
We see an error is returned, with a message that include that Python could not convert the string values to float (numeric) values. This error is returned due to the '$' symbol that is included in the data.
In order to adjust this variable, we need to first remove any symbols. We first remove the '$' symbol at the beginning of each entry by specifying that we want to retain only the values starting with the second location of a string.
df['price'] = df['price'].str[1:]
df['price'].head()
0 90.00 1 379.00 2 479.00 3 479.00 4 79.00 Name: price, dtype: object
In looking through some of the possible values, we also see that some were recorded with commas included in the price. We would also want to remove these symbols so that we can adjust the observations to numerical values. We can use the following code to replace any comma within the string with blank space.
df['price'].iloc[50:60]
50 160.00 51 230.00 52 130.00 53 1,234.00 54 91.00 55 379.00 56 356.00 57 113.00 58 295.00 59 75.00 Name: price, dtype: object
df['price'] = df['price'].str.replace(',', '')
Then, once the values only include numbers, we can adjust the data to record the value as a float variable.
df['price'] = df['price'].astype(float)
df['price'].head()
0 90.0 1 379.0 2 479.0 3 479.0 4 79.0 Name: price, dtype: float64
df['price'].mean()
184.2859171292113
I now see that I can complete a calculation for the mean price per night for an Airbnb in Chicago, where earlier I received an error.
I can try to clean the host response rate and host acceptance rate variables similarly, although this time I want to remove the '%' sign that are present for both variables before changing them to float variables.
df['host_response_rate'] = df['host_response_rate'].str[:-1]
df['host_response_rate'].head()
0 100 1 100 2 100 3 100 4 100 Name: host_response_rate, dtype: object
df['host_response_rate'] = df['host_response_rate'].astype(float)
df['host_response_rate'].mean()
97.45249740625464
We'll also adjust the host acceptance rate similarly.
df['host_acceptance_rate'] = df['host_acceptance_rate'].str[:-1]
df['host_acceptance_rate'] = df['host_acceptance_rate'].astype(float)
df['host_acceptance_rate'].mean()
92.10460789925587
Adjusting Time Variables
To adjust time-based variables, another package can be used. This will not be a major focus of our course. You can view the code that was used to make this adjustment in the Deeper Dive page of this section. A full description of all of the abilities of the datetime
package are beyond the scope of our course, with additional internet resources available for those interested in learning more.
For the remainder of the modules, we will use data with the date and the time adjusted, so that they record the number of days between the event of interest and when the Airbnb data was obtained.