Deeper Dive in Data Cleaning


Python has a couple of data types used for recording time data. Time-based data can carry rich information, including the year, month, day, and time of day. With that in mind, it can be used in a number of different ways.

How did we prepare and adjust some of the time-based data recorded in the original Inside Airbnb data?

Adjusting Time Data

Originally, the time data was read into Python as an object variable type, meaning the dates were stored as plain text. The first step was to help Python recognize each of the date variables as a date by changing it from the object type to a time-based type. To do this, we use the to_datetime function from the pandas package.

import pandas as pd
df['host_since'].head()
    0    2008-08-29
    1    2013-10-08
    2    2013-10-08
    3    2013-10-08
    4    2013-10-17
    Name: host_since, dtype: object
df['host_since'] = pd.to_datetime(df['host_since'])
df['host_since'].head()
    0   2008-08-29
    1   2013-10-08
    2   2013-10-08
    3   2013-10-08
    4   2013-10-17
    Name: host_since, dtype: datetime64[ns]

Note that the data still appears to contain the same information, but is now stored with the datetime64[ns] data type in pandas. This means that Python now recognizes these values as calendar dates and can perform date-based calculations with them.
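As a minimal sketch of what becomes possible after the conversion, the example below uses a small made-up series in place of the real host_since column. The sample values and the variable names (dates, years, months, span) are assumptions for illustration only and are not part of the Inside Airbnb preparation.

```python
import pandas as pd

# Hypothetical sample standing in for the host_since column
dates = pd.Series(["2008-08-29", "2013-10-08", "2013-10-17"], name="host_since")
dates = pd.to_datetime(dates)

# Calendar components are available through the .dt accessor
years = dates.dt.year
months = dates.dt.month

# Comparisons and subtraction treat the values as points in time
earliest = dates.min()
span = dates.max() - dates.min()

print(years.tolist())
print(months.tolist())
print(earliest, span)
```

None of these operations would work on the original object column, because pandas would treat the values as plain text.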

To analyze this variable in a standard way, we decided to measure how long each host had been a host on Airbnb by calculating the number of days between the date they became a host and the date the data was obtained from Airbnb. The date the data was obtained is recorded in the last_scraped variable, so we first convert last_scraped to a datetime as well.

df['last_scraped'] = pd.to_datetime(df['last_scraped'])
df['host_since'] = df['last_scraped'] - df['host_since']
df['host_since'].head()
    0   5315 days
    1   3449 days
    2   3449 days
    3   3449 days
    4   3440 days
    Name: host_since, dtype: timedelta64[ns]
df['host_since'].mean()
Timedelta('2167 days 21:26:50.171679360')

We see that we were able to calculate the number of days each host had been with Airbnb, and could perform additional calculations using this data. However, the mean is reported as a Timedelta that includes hours, minutes, and seconds as well as days, which may pose challenges if used for future calculations. We'll change this variable to a plain number without the days unit attached, with the understanding that all time differences are measured in days.

df['host_since'] = df['host_since'].dt.days
df['host_since'].head()
    0    5315
    1    3449
    2    3449
    3    3449
    4    3440
    Name: host_since, dtype: int64
df['host_since'].mean()
2167.8936362462887

The host_since variable is now recorded as an integer type variable (int64), and the mean is reported as a plain number of days (a float, about 2167.89) rather than as a Timedelta. This adjustment makes the variable easier to use in future calculations.
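The .dt.days step keeps only the whole-day part of each difference. That loses nothing here, since the dates shown above carry no time of day. As a hedged sketch of the alternative, dividing by a one-day Timedelta keeps any partial day as a fraction. The durations below are made up for illustration.

```python
import pandas as pd

# Hypothetical durations; the second one includes a partial day
durations = pd.Series(pd.to_timedelta(["5315 days", "3449 days 12:00:00"]))

# Integer days: the 12-hour remainder is dropped
whole_days = durations.dt.days

# Float days: the 12-hour remainder is kept as 0.5
fractional_days = durations / pd.Timedelta(days=1)

print(whole_days.tolist())
print(fractional_days.tolist())
print(whole_days.mean())
```

Which form is preferable depends on whether the source columns record a time of day.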

The same process can be repeated for each of the other time variables, and this has been done for the data analyzed in the later modules.
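One way to repeat the process is to wrap the steps in a small helper. The sketch below is an illustration rather than the code actually used for the later modules. The function name days_before, the first_review column, and all sample values (including the last_scraped date) are assumptions.

```python
import pandas as pd

def days_before(df, date_cols, reference_col="last_scraped"):
    """Return a copy of df in which each column in date_cols is replaced
    by the number of whole days between it and reference_col."""
    out = df.copy()
    reference = pd.to_datetime(out[reference_col])
    for col in date_cols:
        # Same three steps as above: parse, subtract, keep whole days
        out[col] = (reference - pd.to_datetime(out[col])).dt.days
    return out

# Hypothetical two-row frame; first_review is an assumed column name
sample = pd.DataFrame({
    "last_scraped": ["2023-03-19", "2023-03-19"],
    "host_since": ["2008-08-29", "2013-10-08"],
    "first_review": ["2009-01-01", "2014-01-01"],
})

cleaned = days_before(sample, ["host_since", "first_review"])
print(cleaned[["host_since", "first_review"]])
```

The helper works on a copy, so the original date columns remain available if they are needed later. Rows with a missing date would come through as NaN rather than as an integer.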