Cleaning and Preparing Data
Most data will not arrive perfect and ready for analysis. Data scientists will typically need to engage with cleaning and preparing the data before calculating numerical summaries, creating visualizations, and making conclusions from the available data.
Identifying Variables for Adjustment
One of the first steps in preparing data for cleaning is observing and identifying which variables may need to be cleaned. Depending on the number of variables in the data, you may be able to do this by eye or you may need a more systematic process. The data may also come with documentation that includes dictionaries for variables, ways in which the data have already been prepared, and suggestions for how to analyze the data.
We'll briefly look through the data as it originally was posted to Inside Airbnb.
pd.set_option('display.max_rows', 80)
We can adjust our settings to see all output. Here, we will print at least 80 rows.
df.dtypes
id int64 listing_url object scrape_id int64 last_scraped object source object name object description object neighborhood_overview object picture_url object host_id int64 host_url object host_name object host_since object host_location object host_about object host_response_time object host_response_rate object host_acceptance_rate object host_is_superhost object host_thumbnail_url object host_picture_url object host_neighbourhood object host_listings_count int64 host_total_listings_count int64 host_verifications object host_has_profile_pic object host_identity_verified object neighbourhood object neighbourhood_cleansed object neighbourhood_group_cleansed float64 latitude float64 longitude float64 property_type object room_type object accommodates int64 bathrooms float64 bathrooms_text object bedrooms float64 beds float64 amenities object price object minimum_nights int64 maximum_nights int64 minimum_minimum_nights int64 maximum_minimum_nights int64 minimum_maximum_nights int64 maximum_maximum_nights int64 minimum_nights_avg_ntm float64 maximum_nights_avg_ntm float64 calendar_updated float64 has_availability object availability_30 int64 availability_60 int64 availability_90 int64 availability_365 int64 calendar_last_scraped object number_of_reviews int64 number_of_reviews_ltm int64 number_of_reviews_l30d int64 first_review object last_review object review_scores_rating float64 review_scores_accuracy float64 review_scores_cleanliness float64 review_scores_checkin float64 review_scores_communication float64 review_scores_location float64 review_scores_value float64 license object instant_bookable object calculated_host_listings_count int64 calculated_host_listings_count_entire_homes int64 calculated_host_listings_count_private_rooms int64 calculated_host_listings_count_shared_rooms int64 reviews_per_month float64 dtype: object
We can look through the variable types for all 75 variables.
In a brief review of the above information, I notice a few variables with unexpected data types. For example, I see that the following variables are recorded as objects but seem like they would be more appropriate as a numerical variable:
host_response_rate
host_acceptance_rate
price
I also see that the following variables are dichotomous variables, and so I would like them to be recorded as boolean variables in Python:
host_is_superhost
host_has_profile_pic
host_identity_verified
has_availability
instant_bookable
Finally, I notice that some of these variables include information about calendar dates:
host_since
calendar_last_scraped
first_review
last_review
We will want to adjust these variable types so that we can use them correctly later. While we won't be able to provide examples for every variable adjustment that might be needed for any dataset, we hope this gives you some foundations for cleaning data. Should you need a specific type of data cleaning, there are a number of resources online to help guide you in adjusting variables based on your situation.
General recommendations while attempting to adjust variables include:
- Understanding the desired variables,
- Understanding what is required for analysis with input from key stakeholders,
- Determining what needs to be adjusted about the variable for cleaning,
- Checking your work after each partial step through visualization, and
- Asking for help through your resources, whether it's coworkers, course staff, or online message boards.
Creating Boolean Variables
Since I will be recreating five variables as boolean variable types, I will write a function to help with this adjustment. The benefit of writing a function is that the same function can be reused multiple times without needing to copy and paste the some code with minor adjustments.
def make_boolean(var, true_condition):
# Creates a boolean variable
# True is recorded when the true_condition holds
return (var == true_condition)
df['host_is_superhost'] = make_boolean(df['host_is_superhost'], 't')
df['host_has_profile_pic'] = make_boolean(df['host_has_profile_pic'], 't')
df['host_identity_verified'] = make_boolean(df['host_identity_verified'], 't')
df['has_availability'] = make_boolean(df['has_availability'], 't')
df['instant_bookable'] = make_boolean(df['instant_bookable'], 't')
Throughout the analysis process, it is recommended that you perform regular checks of your work. I recommend stopping after each step to allow you to identify errors earlier, so that it is easier to identify the code that isn't working like you hoped and to correct the errors.
df['host_is_superhost'].value_counts()
False 4898 True 2849 Name: host_is_superhost, dtype: int64
We see that the superhost variable is now recorded with True and False levels.
Creating Numeric Variables
Some of the variables that we anticipated would be recorded as numeric variables are not numeric. For example, we see that price, host acceptance rate, and host response rate all seem like they would be recorded as numbers but are not.
How should we prepare and clean these variables? Let's first look at how the variables currently exist in the data.
df['price'].head()
0 $90.00 1 $379.00 2 $479.00 3 $479.00 4 $79.00 Name: price, dtype: object
df['host_response_rate'].head()
0 100% 1 100% 2 100% 3 100% 4 100% Name: host_response_rate, dtype: object
We see that the price includes $ signs and the rate includes '%' signs.
For each of these variables, an additional symbol (either the '$' or '%' sign) results in the values being recorded as objects. Note that Python can only understand numbers and numbers alone (or decimal points) as part of a numerical variable.
If we were to try to calculate the mean of one of these variables, would it work?
df['price'].mean()

The first few lines of a lengthy error message.
We see an error is returned, with a message that includes that Python could not convert the string values to float (numeric) values. This error is returned in part due to the '$' symbol that is included in the data.
In order to adjust this variable, we need to first remove any symbols. We first remove the '$' symbol at the beginning of each entry by specifying that we want to retain only the values starting with the second location of a string (from the [1:]
in the code below).
df['price'] = df['price'].str[1:]
df['price'].head()
0 90.00 1 379.00 2 479.00 3 479.00 4 79.00 Name: price, dtype: object
The first few values for price after removing the leading $.
In looking through some of the possible values, we also see that some were recorded with commas included in the price. We would also want to remove these symbols so that we can adjust the observations to numerical values. We can use the following code to replace any comma within the string with blank space.
df['price'].iloc[50:60]
50 160.00 51 230.00 52 130.00 53 1,234.00 54 91.00 55 379.00 56 356.00 57 113.00 58 295.00 59 75.00 Name: price, dtype: object
The 50th to 60th observations for price.
df['price'] = df['price'].str.replace(',', '')
Then, once the values only include numbers, we can adjust the data to record the value as a float variable.
df['price'] = df['price'].astype(float)
df['price'].head()
0 90.0 1 379.0 2 479.0 3 479.0 4 79.0 Name: price, dtype: float64
The price variable is now recognized as a numerical float variable in Python.
df['price'].mean()
184.2859171292113
We can now calculate the mean price per night!
I now see that I can complete a calculation for the mean price per night for an Airbnb in Chicago, where earlier I received an error.
I can try to clean the host response rate and host acceptance rate variables similarly, although this time I want to remove the '%' sign that are present for both variables before changing them to float variables. The '%' sign is the last value in each of the strings, so I can pull out that entry using the [:-1]
code below. Note that I could have also used the replace
from earlier; there can be multiple coding approaches in Python to achieve the same end result!
df['host_response_rate'] = df['host_response_rate'].str[:-1]
df['host_response_rate'].head()
0 100 1 100 2 100 3 100 4 100 Name: host_response_rate, dtype: object
The rates are now recorded without % signs, but is still an object variable.
df['host_response_rate'] = df['host_response_rate'].astype(float)
df['host_response_rate'].mean()
97.45249740625464
After switching the variable type to a float variable, we can calculate the mean host response rate!
We'll also adjust the host acceptance rate similarly.
df['host_acceptance_rate'] = df['host_acceptance_rate'].str[:-1]
df['host_acceptance_rate'] = df['host_acceptance_rate'].astype(float)
df['host_acceptance_rate'].mean()
92.10460789925587
The mean acceptance rate can be calculated!
Adjusting Time Variables
To adjust time-based variables, another package can be used. This will not be a major focus of our course. You can view the code that was used to make this adjustment in the Deeper Dive page of this section. A full description of all of the abilities of the datetime
package are beyond the scope of our course, with additional internet resources available for those interested in learning more.
For the remainder of the modules, we will use data with the date and the time adjusted, so that they record the number of days between the event of interest (as defined by the variable) and when the Airbnb data was obtained.