Single Variable Descriptive Analytics and Data Manipulation


Before proceeding further down the data science pipeline and performing analyses like building a predictive model, it's useful to apply some basic descriptive analytics techniques to the dataset. The general goal of descriptive analytics is to "describe" the dataset insightfully and thoroughly. Basic techniques include calculating and evaluating relevant summary statistics of the data and plotting relevant visualizations.

First, describing just a single variable in the dataset, for instance, can lead to insightful data cleaning decisions that might need to be made before building your predictive model.

Single Categorical Variable

Summary Statistics

For instance, we know that we'd like to include the neighborhood variable as one of our explanatory variables in the linear regression model. This is a categorical variable, so we can use the value_counts() function, as shown below, to calculate summary statistics for it: the number and percentage of observations that belong to each level of the variable, respectively.

import pandas as pd                # assumed: standard imports for this section
import seaborn as sns
import matplotlib.pyplot as plt

neighborhood_counts = df['neighborhood'].value_counts()
neighborhood_counts
    Near North Side    629
    West Town          583
    Lake View          422
    Logan Square       350
    Near West Side     339
                      ... 
    West Elsdon          1
    Burnside             1
    Edison Park          1
    Mount Greenwood      1
    East Side            1
    Name: neighborhood, Length: 75, dtype: int64
neighborhood_percentages = df['neighborhood'].value_counts(normalize=True)
neighborhood_percentages
    Near North Side    0.116030
    West Town          0.107545
    Lake View          0.077845
    Logan Square       0.064564
    Near West Side     0.062535
                         ...   
    West Elsdon        0.000184
    Burnside           0.000184
    Edison Park        0.000184
    Mount Greenwood    0.000184
    East Side          0.000184
    Name: neighborhood, Length: 75, dtype: float64

Visualizations

We can also visualize each of these percentages, say, in a barplot by using the sns.barplot() function and the following inputs:

  • x: the .index of the neighborhood_percentages pandas series that was created (i.e., the neighborhoods) and
  • y: the values of the neighborhood_percentages pandas series that was created (i.e., the percentage of observations that belong to each neighborhood)
plt.figure(figsize=(20,8))
sns.barplot(x=neighborhood_percentages.index, y=neighborhood_percentages.values)
plt.xticks(rotation=90)
plt.show()

Data Cleaning Insights and Rationale

There are quite a lot of distinct neighborhoods in this dataset, most of which have relatively few listings. Including a categorical explanatory variable whose levels contain only a few observations in a predictive model like linear regression can lead to several issues. One of these issues is what we call overfitting, which we'll talk more about in a later module.

Thus in order to avoid overfitting, we may decide to only keep the top 5 most popular neighborhoods in this dataset: Near North Side, West Town, Lake View, Logan Square, and Near West Side.

top_neigh = df['neighborhood'].value_counts().index[0:5]
top_neigh
    Index(['Near North Side', 'West Town', 'Lake View', 'Logan Square',
           'Near West Side'],
          dtype='object')

df = df[df['neighborhood'].isin(top_neigh)]
df.shape
    (2323, 6) 

Also, for similar reasons, let's just keep the top two most popular room_type levels: Entire home/apt and Private room.

room_type_counts = df['room_type'].value_counts()
room_type_counts
    Entire home/apt    1976
    Private room        297
    Hotel room           42
    Shared room           8
    Name: room_type, dtype: int64
sns.barplot(x=room_type_counts.index, y=room_type_counts.values)
plt.show()
top_room = df['room_type'].value_counts().index[0:2]
top_room
    Index(['Entire home/apt', 'Private room'], dtype='object')
df = df[df['room_type'].isin(top_room)]
df.shape
(2273, 6)

Note, however, that the downside of these data cleaning techniques is that the model we build will only be able to reliably make predictions for these top 5 neighborhoods and top 2 room types. The number of rows in our dataframe has now also dropped from 5421 to 2273.

Missing Values

Furthermore, before proceeding, let's drop all rows with missing values from the dataset. We see that this deletes 212 more observations.

df.isna().sum()
    price             0
    neighborhood      0
    room_type         0
    accommodates      0
    bedrooms        203
    beds             16
    dtype: int64
df = df.dropna()
df.shape
(2061, 6)
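As a side note, dropna() can also be restricted to particular columns with its subset= parameter, in case we only want to drop rows that are missing specific variables. A minimal sketch with a small hypothetical dataframe (not our Airbnb data):

```python
import pandas as pd
import numpy as np

# Small hypothetical dataframe with a missing-value pattern like ours
demo = pd.DataFrame({
    'price':    [100, 150, 200, 250],
    'bedrooms': [1.0, np.nan, 2.0, np.nan],
    'beds':     [1.0, 2.0, np.nan, 2.0],
})

# With no arguments, dropna() removes any row containing at least one NaN
print(demo.dropna().shape)

# subset= restricts the missing-value check to particular columns
print(demo.dropna(subset=['beds']).shape)
```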

Single Numerical Variable

Similarly, describing each of the numerical variables that we intend to include in our model may, for instance, yield insights about which model is most appropriate for making these predictions, as well as other data cleaning measures we should take.

Summary Statistics

We can use the .describe() function to quickly calculate a set of useful summary statistics for each numerical variable in a given dataset. These summary statistics include the:

  • number of non-missing observations for each variable
  • mean
  • standard deviation
  • minimum
  • Q1 (25%)
  • median (50%)
  • Q3 (75%)
  • maximum
df.describe()
                  price  accommodates     bedrooms         beds
    count   2061.000000   2061.000000  2061.000000  2061.000000
    mean     250.710335      4.706938     1.918001     2.408054
    std     2209.424510      3.134291     1.158578     1.934589
    min       19.000000      1.000000     1.000000     1.000000
    25%       99.000000      2.000000     1.000000     1.000000
    50%      149.000000      4.000000     2.000000     2.000000
    75%      219.000000      6.000000     2.000000     3.000000
    max    99998.000000     16.000000    12.000000    21.000000

These basic summary statistics of the numerical variables can tell us quite a few important things that may impact the model we build.

  1. The maximum Airbnb price is set at $99,998 a night! Might this be an error? Let's inspect this more below with a boxplot.
  2. The standard deviations of each of our intended numerical explanatory variables are quite different. This fact will impact how we are able to interpret the slopes in our resulting model. We'll discuss this more in section 08-07.

A more complete analysis would visualize and further examine each numerical variable individually, but for now let's just examine the price variable.

When describing a numerical variable to someone, it is important to be thorough in describing the four most important components of the numerical variable. These are:

  1. Shape
    • skew
    • modality
  2. Measure of center
    • mean
    • median
  3. Measure of spread
    • standard deviation
    • IQR
    • range
  4. Any outliers
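Most of these components can also be computed numerically with pandas before plotting anything. A sketch using a small hypothetical right-skewed price sample (not our actual data):

```python
import pandas as pd

# Hypothetical right-skewed price sample, standing in for df['price']
prices = pd.Series([19, 99, 120, 149, 150, 219, 400, 1200])

mean = prices.mean()                    # measure of center (pulled by skew)
median = prices.median()                # robust measure of center
std = prices.std()                      # measure of spread (inflated by skew)
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1                           # robust measure of spread
skew = prices.skew()                    # positive value indicates right skew

print(mean, median, std, iqr, skew)
```

Shape (modality) and outliers are still easiest to judge from the plots discussed next.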

To visualize and further describe a single numerical variable, we can use any of the following plots:

  • histogram
  • boxplot
  • violinplot

Each of these three plots are able to describe some of these four components, but not all.

Visualizations

Histograms

For instance, we might create a histogram below. However, interestingly this does not look like a very insightful histogram. There's only one bar. Why might this be?

df['price'].hist()
plt.title('Chicago Airbnb Listings')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

Boxplot

Recall in our summary statistics, we saw an Airbnb that was $99,998 a night. This large observation may be impacting our ability to see the shape of our distribution with this plot above. In other words, this Airbnb might be an outlier.

Remember, a boxplot can identify outliers. That is, it shows you the points in the plot that are either:

  • $> Q3 + 1.5 \cdot IQR$
  • $< Q1 - 1.5 \cdot IQR$
sns.boxplot(df['price'])
plt.title('Chicago Airbnb Listings')
plt.show()
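These outlier fences can also be computed by hand. A sketch with a small hypothetical price sample (not the full dataset):

```python
import pandas as pd

# Small hypothetical price sample, including one suspiciously large value
prices = pd.Series([99, 120, 149, 150, 180, 219, 500, 99998])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr   # points below this are flagged as outliers
upper_fence = q3 + 1.5 * iqr   # points above this are flagged as outliers

outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(outliers.tolist())
```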

We do in fact see a very large outlier. Because this outlier might be an error in the dataset, let's get rid of it and see how this changes our visualizations.

df = df[df['price'] < 20000]
print(df.shape)
sns.boxplot(df['price'])
plt.title('Chicago Airbnb Listings')
plt.show()
(2060, 6)

By getting rid of this single observation, we do see more outliers in the resulting boxplot. There are pros and cons to leaving outliers in a dataset that you plan to use for modeling purposes.

  • One con is that it can inflate the RMSE, a measure of model accuracy, making your model performance look worse, in general, than it actually is.
  • However, it's likely that many of these outliers are legitimate Airbnb listings. Therefore by getting rid of them, your model may become less capable of predicting all Chicago Airbnb listings in general (only choosing to focus on the cheaper listings).

Suppose we'd like to be inclusive of the more expensive Airbnb listing hosts, so we'll decide to leave them in for this analysis.

This single listing deletion yielded a more insightful histogram plot as well.

df['price'].hist()
plt.title('Chicago Airbnb Listings')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

Summarizing

From our histogram plot we can determine/estimate the following.

  1. Shape
    • the prices are skewed to the right
    • the prices are unimodal
  2. Measure of center
    • the median is usually close to the peak of a unimodal distribution

The fact that this price distribution is not symmetric means that the median (as opposed to the mean) and the IQR (as opposed to the standard deviation) are better measures of center and spread, respectively, for describing the price distribution. Remember that the mean tends to get dragged out in the direction of the skew and/or outliers and can thus become a misleading measure of center. Similarly, the standard deviation can become an inflated and misleading measure of spread because of the skew and outliers.
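We can demonstrate the mean getting dragged out by skew with synthetic data. A sketch using a right-skewed log-normal sample (hypothetical, not our Airbnb prices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed sample (log-normal), loosely mimicking nightly prices
skewed = rng.lognormal(mean=5, sigma=1, size=10_000)

# The long right tail drags the mean well above the median
print(np.mean(skewed), np.median(skewed))
```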

Notice that the histogram CANNOT reliably/easily tell us about:

  • any outliers
  • IQR

From our boxplot we can determine the following.

  1. Shape
    • the prices are skewed to the right
  2. Measure of Center
    • the median price is 149 dollars
  3. Measure of spread
    • the IQR of the prices is 120 dollars
  4. Outliers
    • there are many outlier listings with high prices
median = df['price'].median()
median
149.0
Q3 = df['price'].quantile(0.75)
Q1 = df['price'].quantile(0.25)

IQR = Q3-Q1
IQR
120.0

Notice that our boxplot CANNOT tell us the modality of the distribution. There could be many "modes" in this dataset that would be masked by the boxplot.
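We can illustrate this masking with synthetic data: a clearly bimodal sample whose five-number summary (what a boxplot draws) looks entirely unremarkable. A hypothetical sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# A clearly bimodal sample: two well-separated clusters
bimodal = np.concatenate([
    rng.normal(100, 10, 1000),   # cluster of cheaper "listings"
    rng.normal(300, 10, 1000),   # cluster of pricier "listings"
])

# The quartiles give no hint of the two modes; the median even falls
# in the nearly empty gap between the clusters
q1, med, q3 = np.percentile(bimodal, [25, 50, 75])
print(q1, med, q3)
```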

Violinplot

We can, however, use what we call a violinplot to visualize the modality and skew of the distribution. This plot visualizes what we call the density curve of the histogram. This density curve is what you might see if you overlaid a "spaghetti noodle" over your histogram. This density curve is then symmetrically mirrored to resemble a shape like a violin.

The black and white interior of this shape also show most of the components of a boxplot including the:

  • median: white dot
  • Q1 and Q3 (left and right edges of the inner box)
  • min whisker reach and max whisker reach (left and right lines emanating out of the box)

However, this plot does not show you any outliers.

sns.violinplot(df['price'])
plt.title('Chicago Airbnb Listings')
plt.show()