Cleaning and Preparing Data


Most data will not arrive perfect and ready for analysis. Data scientists will typically need to engage with cleaning and preparing the data before calculating numerical summaries, creating visualizations, and making conclusions from the available data.

Identifying Variables for Adjustment

One of the first steps in preparing data for cleaning is observing and identifying which variables may need to be cleaned. Depending on the number of variables in the data, you may be able to do this by eye or you may need a more systematic process. The data may also come with documentation that includes dictionaries for variables, ways in which the data have already been prepared, and suggestions for how to analyze the data.

We'll briefly look through the data as it originally was posted to Inside Airbnb.

pd.set_option('display.max_rows', 80)

We can adjust our settings to see all output. Here, we will print at least 80 rows.

df.dtypes
        id                                                int64
        listing_url                                      object
        scrape_id                                         int64
        last_scraped                                     object
        source                                           object
        name                                             object
        description                                      object
        neighborhood_overview                            object
        picture_url                                      object
        host_id                                           int64
        host_url                                         object
        host_name                                        object
        host_since                                       object
        host_location                                    object
        host_about                                       object
        host_response_time                               object
        host_response_rate                               object
        host_acceptance_rate                             object
        host_is_superhost                                object
        host_thumbnail_url                               object
        host_picture_url                                 object
        host_neighbourhood                               object
        host_listings_count                               int64
        host_total_listings_count                         int64
        host_verifications                               object
        host_has_profile_pic                             object
        host_identity_verified                           object
        neighbourhood                                    object
        neighbourhood_cleansed                           object
        neighbourhood_group_cleansed                    float64
        latitude                                        float64
        longitude                                       float64
        property_type                                    object
        room_type                                        object
        accommodates                                      int64
        bathrooms                                       float64
        bathrooms_text                                   object
        bedrooms                                        float64
        beds                                            float64
        amenities                                        object
        price                                            object
        minimum_nights                                    int64
        maximum_nights                                    int64
        minimum_minimum_nights                            int64
        maximum_minimum_nights                            int64
        minimum_maximum_nights                            int64
        maximum_maximum_nights                            int64
        minimum_nights_avg_ntm                          float64
        maximum_nights_avg_ntm                          float64
        calendar_updated                                float64
        has_availability                                 object
        availability_30                                   int64
        availability_60                                   int64
        availability_90                                   int64
        availability_365                                  int64
        calendar_last_scraped                            object
        number_of_reviews                                 int64
        number_of_reviews_ltm                             int64
        number_of_reviews_l30d                            int64
        first_review                                     object
        last_review                                      object
        review_scores_rating                            float64
        review_scores_accuracy                          float64
        review_scores_cleanliness                       float64
        review_scores_checkin                           float64
        review_scores_communication                     float64
        review_scores_location                          float64
        review_scores_value                             float64
        license                                          object
        instant_bookable                                 object
        calculated_host_listings_count                    int64
        calculated_host_listings_count_entire_homes       int64
        calculated_host_listings_count_private_rooms      int64
        calculated_host_listings_count_shared_rooms       int64
        reviews_per_month                               float64
        dtype: object
    

We can look through the variable types for all 75 variables.

In a brief review of the above information, I notice a few variables with unexpected data types. For example, I see that the following variables are recorded as objects but seem like they would be more appropriate as a numerical variable:

  • host_response_rate
  • host_acceptance_rate
  • price

I also see that the following variables are dichotomous variables, and so I would like them to be recorded as boolean variables in Python:

  • host_is_superhost
  • host_has_profile_pic
  • host_identity_verified
  • has_availability
  • instant_bookable

Finally, I notice that some of these variables include information about calendar dates:

  • host_since
  • calendar_last_scraped
  • first_review
  • last_review

We will want to adjust these variable types so that we can use them correctly later. While we won't be able to provide examples for every variable adjustment that might be needed, there are a number of resources online to help for more specific ways of adjusting variables.

Recommendations while attempting to adjust variables include:

  • Understanding the desired variables and what is required for analysis with input from key stakeholders,
  • Determining what needs to be adjusted about the variable for cleaning,
  • Checking your work after each partial step through visualization,
  • Asking for help through your resources, whether it's coworkers, course staff, or online message boards.

Creating Boolean Variables

Since I will be recreating five variables as boolean variable types, I will write a function to help with this adjustment. The benefit of writing a function is that the same function can be reused multiple times without needing to copy and paste the some code with minor adjustments.

def make_boolean(var, true_condition):
# Creates a boolean variable
# True is recorded when the true_condition holds
return (var == true_condition)
df['host_is_superhost'] = make_boolean(df['host_is_superhost'], 't')
df['host_has_profile_pic'] = make_boolean(df['host_has_profile_pic'], 't')
df['host_identity_verified'] = make_boolean(df['host_identity_verified'], 't')
df['has_availability'] = make_boolean(df['has_availability'], 't')
df['instant_bookable'] = make_boolean(df['instant_bookable'], 't')

Throughout the analysis process, it is recommended that you perform regular checks of your work. I recommend stopping after each step to allow you to identify errors earlier, so that it is easier to identify the code that isn't working like you hoped and to correct the errors.

df['host_is_superhost'].value_counts()
    
        False    4898
        True     2849
        Name: host_is_superhost, dtype: int64

We see that the superhost variable is now recorded with True and False levels.

Creating Numeric Variables

Some of the variables that we anticipated would be recorded as numeric variables are not numeric. For example, we see that price, host acceptance rate, and host response rate all seem like they would be recorded as numbers but are not.

For this example, how should we clean these variables? Let's first look at how the variables currently exist in the data.

df['price'].head()
        
        0     $90.00
        1    $379.00
        2    $479.00
        3    $479.00
        4     $79.00
        Name: price, dtype: object
    
df['host_response_rate'].head()
            
        0    100%
        1    100%
        2    100%
        3    100%
        4    100%
        Name: host_response_rate, dtype: object
    

We see that the price includes $ signs and the rate includes '%' signs.

For each of these variables, an additional symbol (either the '$' or '%' sign) results in the values being recorded as objects. Note that numerical variables can only understand numbers and numbers alone.

If we were to try to calculate the mean of one of these variables, would it work?

df['price'].mean()

The first few lines of a lengthy error message.

We see an error is returned, with a message that include that Python could not convert the string values to float (numeric) values. This error is returned due to the '$' symbol that is included in the data.

In order to adjust this variable, we need to first remove any symbols. We first remove the '$' symbol at the beginning of each entry by specifying that we want to retain only the values starting with the second location of a string.

df['price'] = df['price'].str[1:]
df['price'].head()
 
        0     90.00
        1    379.00
        2    479.00
        3    479.00
        4     79.00
        Name: price, dtype: object
    

The first few values for price after removing the leading $.

In looking through some of the possible values, we also see that some were recorded with commas included in the price. We would also want to remove these symbols so that we can adjust the observations to numerical values. We can use the following code to replace any comma within the string with blank space.

df['price'].iloc[50:60]
     
        50      160.00
        51      230.00
        52      130.00
        53    1,234.00
        54       91.00
        55      379.00
        56      356.00
        57      113.00
        58      295.00
        59       75.00
        Name: price, dtype: object
    

The 50th to 60th observations for price.

df['price'] = df['price'].str.replace(',', '')

Then, once the values only include numbers, we can adjust the data to record the value as a float variable.

df['price'] = df['price'].astype(float)
df['price'].head()
         
        0     90.0
        1    379.0
        2    479.0
        3    479.0
        4     79.0
        Name: price, dtype: float64
    

The price variable is now recognized as a numerical float variable in Python.

df['price'].mean()
184.2859171292113

We can now calculate the mean price per night!

I now see that I can complete a calculation for the mean price per night for an Airbnb in Chicago, where earlier I received an error.

I can try to clean the host response rate and host acceptance rate variables similarly, although this time I want to remove the '%' sign that are present for both variables before changing them to float variables.

df['host_response_rate'] = df['host_response_rate'].str[:-1]
df['host_response_rate'].head()
             
        0    100
        1    100
        2    100
        3    100
        4    100
        Name: host_response_rate, dtype: object
    

The rates are now recorded without % signs, but is still an object variable.

df['host_response_rate'] = df['host_response_rate'].astype(float)
df['host_response_rate'].mean()
97.45249740625464

After switching the variable type to a float variable, we can calculate the mean host response rate!

We'll also adjust the host acceptance rate similarly.

df['host_acceptance_rate'] = df['host_acceptance_rate'].str[:-1]
df['host_acceptance_rate'] = df['host_acceptance_rate'].astype(float)
df['host_acceptance_rate'].mean()
92.10460789925587

The mean acceptance rate can be calculated!

Adjusting Time Variables

To adjust time-based variables, another package can be used. This will not be a major focus of our course. You can view the code that was used to make this adjustment in the Deeper Dive page of this section. A full description of all of the abilities of the datetime package are beyond the scope of our course, with additional internet resources available for those interested in learning more.

For the remainder of the modules, we will use data with the date and the time adjusted, so that they record the number of days between the event of interest and when the Airbnb data was obtained.