Single Variable Descriptive Analytics and Data Manipulation


Before proceeding further down the data science pipeline and performing analyses like building a predictive model, it's useful to perform what we might call more "basic descriptive analytics" techniques on the dataset. The general goal of performing descriptive analytics is to "describe" the dataset insightfully and thoroughly.

These insights might impact the following elements of the data science pipeline in pursuit of our goal.

  • We may learn that we need to rephrase our research goal/question. For instance, our preliminary analysis might reveal that this dataset is in fact not representative of all Chicago Airbnb listings. If this were the case, we would want to narrow the scope of our research goal to focus only on predicting the price for a certain subset of Chicago Airbnb listings.
  • On the other hand, if we don't want to have to rephrase our research goal/question, we may decide to collect another dataset that would help us effectively pursue/answer our original research goal/question.
  • We may learn that additional data cleaning/manipulation is required in order for us to more effectively pursue our research goal.
  • We may learn more about what predictive models may be more or less effective at helping us meet our research goal.

Some basic descriptive analytics techniques might include calculating and evaluating effective summary statistics of the data and plotting effective visualizations. We consider summary statistics and visualizations to be effective if they help us extract a high amount of informative insights with respect to our main research goal/question. The types of summary statistics and visualizations that will be the most effective depend on whether each of the variables that you are considering is numerical or categorical.
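
For example, a quick way to check which variables pandas treats as numerical versus categorical is to inspect the dataframe's data types. This is a minimal sketch, assuming the listings have already been loaded into the dataframe df used throughout this section.

print(df.dtypes)                                    # data type of each column
print(df.select_dtypes(include='number').columns)   # numerical variables
print(df.select_dtypes(exclude='number').columns)   # non-numerical (categorical) variables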

First, trying to describe just a single variable in the dataset, for instance, can help lead to insightful data cleaning decisions that might need to be made before building your predictive model.

Single Categorical Variable

Summary Statistics

For instance, we know that we'd like to include the neighborhood variable as one of our explanatory variables in the linear regression model. This is a categorical variable, so we can use the value_counts() function as shown below to calculate summary statistics for this variable: the number and the percentage of observations that belong to each level of the variable, respectively.

neighborhood_counts = df['neighborhood'].value_counts()
neighborhood_counts
Near North Side    629
West Town          583
Lake View          422
Logan Square       350
Near West Side     339
                  ...
West Elsdon          1
Burnside             1
Edison Park          1
Mount Greenwood      1
East Side            1
Name: neighborhood, Length: 75, dtype: int64
neighborhood_percentages = df['neighborhood'].value_counts(normalize=True)
neighborhood_percentages
Near North Side    0.116030
West Town          0.107545
Lake View          0.077845
Logan Square       0.064564
Near West Side     0.062535
                     ...
West Elsdon        0.000184
Burnside           0.000184
Edison Park        0.000184
Mount Greenwood    0.000184
East Side          0.000184
Name: neighborhood, Length: 75, dtype: float64

Visualizations

We can also visualize each of these percentages, say, in a barplot by using the sns.barplot() function and the following inputs:

  • the .index of the neighborhood_percentages pandas series that was created (i.e., the neighborhood names) and
  • the values of the neighborhood_percentages pandas series that was created (i.e., the percentage of observations that belong to each neighborhood)
# (imports shown here in case they haven't been run earlier in the notebook)
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(20,8))
sns.barplot(x=neighborhood_percentages.index, y=neighborhood_percentages.values)
plt.xticks(rotation=90)
plt.show()

Data Cleaning Insights and Rationale

There are quite a lot of distinct neighborhoods in this dataset, most of which do not have many listings that belong to them. Including categorical explanatory variables that have levels with only a few observations in some predictive models, like linear regression, can lead to several issues. One of these issues is what we call overfitting, which we'll talk more about in a later module.
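
To quantify just how sparse these levels are, we can count how many neighborhoods fall below some minimum number of listings. This is a minimal sketch reusing the neighborhood_counts series computed above; the threshold of 10 listings is an arbitrary illustrative choice.

# Count how many neighborhood levels have fewer than 10 listings.
rare = neighborhood_counts[neighborhood_counts < 10]
print(len(rare), 'of', len(neighborhood_counts), 'neighborhoods have fewer than 10 listings')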

Thus in order to avoid overfitting, we may decide to only keep the top 5 most popular neighborhoods in this dataset: Near North Side, West Town, Lake View, Logan Square, and Near West Side.

top_neigh = df['neighborhood'].value_counts().index[0:5]
top_neigh
Index(['Near North Side', 'West Town', 'Lake View', 'Logan Square', 'Near West Side'], dtype='object')

df=df[df['neighborhood'].isin(top_neigh)]
df.shape
    (2323, 6) 

Also, for similar reasons, let's keep just the top two most popular room_type values: Entire home/apt and Private room.

room_type_counts = df['room_type'].value_counts()
room_type_counts

Entire home/apt 1976
Private room 297
Hotel room 42
Shared room 8
Name: room_type, dtype: int64

sns.barplot(x=room_type_counts.index, y=room_type_counts.values)
plt.show()
top_room = df['room_type'].value_counts().index[0:2]
top_room
Index(['Entire home/apt', 'Private room'], dtype='object')
df=df[df['room_type'].isin(top_room)]
df.shape
(2273, 6)

Note, however, that the downside of these data cleaning techniques is that the model we build will only be able to reliably make predictions for these top 5 neighborhoods and top 2 room types. The number of rows in our dataframe has now also dropped from 5421 to 2273.
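
To quantify this, a quick calculation using the row counts reported above shows what fraction of the original data we have retained.

# Fraction of the original 5421 listings that survive both filters.
print(2273 / 5421)   # roughly 0.42, i.e. about 42% of the original rows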

Missing Values

Furthermore, before proceeding, let's drop all rows with missing values in the dataset. We see that this deletes 212 more observations.

df.isna().sum()
    price             0
    neighborhood      0
    room_type         0
    accommodates      0
    bedrooms        203
    beds             16
    dtype: int64
df=df.dropna()
df.shape
(2061, 6)
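
As a quick sanity check, we can confirm that no missing values remain after the drop.

# Total count of missing cells across the whole dataframe; should now be 0.
print(df.isna().sum().sum())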

Single Numerical Variable

Similarly, describing each of the numerical variables that we intend to include in our model may yield insights about which model is most appropriate for making these predictions, as well as about other data cleaning measures we should take.

Summary Statistics

We can use the .describe() function to quickly calculate a set of useful summary statistics for each numerical variable in a given dataset. These summary statistics include the:

  • number of non-missing observations for each variable
  • mean
  • standard deviation
  • minimum
  • Q1 (25%)
  • median (50%)
  • Q3 (75%)
  • maximum
df.describe()
              price  accommodates     bedrooms         beds
count   2061.000000   2061.000000  2061.000000  2061.000000
mean     250.710335      4.706938     1.918001     2.408054
std     2209.424510      3.134291     1.158578     1.934589
min       19.000000      1.000000     1.000000     1.000000
25%       99.000000      2.000000     1.000000     1.000000
50%      149.000000      4.000000     2.000000     2.000000
75%      219.000000      6.000000     2.000000     3.000000
max    99998.000000     16.000000    12.000000    21.000000

These basic summary statistics of the numerical variables can tell us quite a few important things that may impact the model we build.

  1. The maximum Airbnb price is set at $99,998 a night! Might this be an error? Let's inspect this further below with a boxplot.
  2. The standard deviations of each of our intended numerical explanatory variables are quite different. This fact will impact how we are able to interpret the slopes in our resulting model. We'll discuss this more in section 08-07.

A more complete analysis would visualize and further examine each numerical variable individually, but for now let's just examine the price variable.

When describing a numerical variable to someone, it is important to be thorough in describing the four most important components of the numerical variable. These are:

  1. Shape
    • skew
    • modality
  2. Measure of center
    • mean
    • median
  3. Measure of spread
    • standard deviation
    • IQR
    • range
  4. Any outliers
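
Most of these components can also be computed numerically. Below is a minimal sketch for the price variable; note that modality is best judged visually from a plot, and the .skew() function only gives a numerical hint about the direction of skew.

# Shape: a positive skewness value suggests right skew.
print(df['price'].skew())

# Measures of center.
print(df['price'].mean(), df['price'].median())

# Measures of spread: standard deviation, IQR, and range.
Q1, Q3 = df['price'].quantile([0.25, 0.75])
print(df['price'].std(), Q3 - Q1, df['price'].max() - df['price'].min())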

To visualize and further describe a single numerical variable, we can use any of the following plots:

  • histogram
  • boxplot
  • violinplot

Each of these three plots is able to describe some of these four components, but not all.

Visualizations

Histograms

For instance, we might create the histogram below. Interestingly, however, this does not look like a very insightful histogram: there's only one bar. Why might this be?

df['price'].hist()
plt.title('Chicago Airbnb Listings')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
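
One way to see why is to inspect the histogram's bin edges directly. With the default of 10 equal-width bins stretched over the full price range, nearly every listing lands in the very first bin. A sketch of this check:

import numpy as np

# Compute the default 10 equal-width bins over the full price range.
# Because the maximum price is so extreme, the first bin swallows almost all listings.
counts, edges = np.histogram(df['price'], bins=10)
print(counts)
print(edges)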

Boxplot

Recall that in our summary statistics we saw an Airbnb that was $99,998 a night. This very large observation may be impairing our ability to see the shape of the distribution in the plot above. In other words, this Airbnb might be an outlier.

Remember, a boxplot can identify outliers. That is, it shows you the points in the plot that are either:

  • $> Q3 + 1.5 \cdot IQR$
  • $< Q1 - 1.5 \cdot IQR$
sns.boxplot(x=df['price'])
plt.title('Chicago Airbnb Listings')
plt.show()

We do in fact see a very large outlier. Because this outlier might be an error in the dataset, let's get rid of it and see how this changes our visualizations.

df=df[df['price']<20000]
print(df.shape)
sns.boxplot(x=df['price'])
plt.title('Chicago Airbnb Listings')
plt.show()
(2060, 6)

By getting rid of this single observation, we do see more outliers in the resulting boxplot. There are pros and cons to leaving outliers in a dataset that you plan to use for machine learning purposes.

Potential Cons

  • Depending on where your outliers are in relation to the non-outlier points, one potential con is that leaving the outliers in the dataset could inflate the RMSE, a measure of model error, making your model performance look worse, in general, than it actually is (the sketch below illustrates this).
  • In addition, depending on where your outliers are in relation to the non-outlier points, keeping these outliers could make your fitted linear regression model dramatically different than if you had left them out. Depending on the location of the outliers, the linear regression fit may be pulled closer to the outliers in order to reduce the large residuals that they contribute to the overall fit of the model. However, this may come at the expense of worsening the fit for the non-outlier points.
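
To make the first point concrete, here is a small hypothetical sketch of how a single enormous residual can inflate the RMSE. The residual values below are made up purely for illustration.

import numpy as np

def rmse(residuals):
    # Root mean squared error of a set of residuals.
    return np.sqrt(np.mean(residuals ** 2))

# Ten modest residuals vs. the same ten plus one huge outlier residual.
small_residuals = np.array([5, -3, 4, -6, 2, -2, 7, -5, 3, -4], dtype=float)
with_outlier = np.append(small_residuals, 5000.0)

print(rmse(small_residuals))   # around 4.4
print(rmse(with_outlier))      # around 1500, inflated by the single outlier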

Potential Pros

  • However, it's likely that many of these outliers are legitimate Airbnb listings. Therefore, by getting rid of them, your model may become less capable of predicting prices for Chicago Airbnb listings in general (effectively choosing to focus only on the cheaper listings).

Suppose we'd like to be inclusive of the more expensive Airbnb listings, so we'll decide to leave them in for this analysis.

Deleting this single listing also yielded a more insightful histogram.

df['price'].hist()
plt.title('Chicago Airbnb Listings')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

Summarizing

From our histogram plot we can determine/estimate the following.

  1. Shape
    • the prices are skewed to the right
    • the prices are unimodal
  2. Measure of center
    • the median is usually close to the peak of a unimodal distribution

The fact that this price distribution is not symmetric means that the median (as opposed to the mean) and the IQR (as opposed to the standard deviation) are better measures of center and spread, respectively, for describing the price distribution. Remember that the mean tends to get dragged in the direction of the skew and/or outliers and can thus become a misleading measure of center. Similarly, the standard deviation can become an inflated and misleading measure of spread because of the skew and outliers.
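
We can see this effect directly in our data, where the right skew and the high-priced outliers pull the mean well above the median.

# The skew drags the mean noticeably above the median.
print(df['price'].mean())
print(df['price'].median())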

Notice that the histogram CANNOT easily tell us about:

  • any outliers
  • IQR

From our boxplot we can determine the following.

  1. Shape
    • the prices are skewed to the right
  2. Measure of Center
    • the median price is 149 dollars
  3. Measure of spread
    • the IQR of the prices is 120 dollars
  4. Outliers
    • there are many outlier listings with high prices
median = df['price'].median()
median
149.0
Q3 = df['price'].quantile(0.75)
Q1 = df['price'].quantile(0.25)

IQR = Q3-Q1
IQR
120.0
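
Using these values, we can also compute the boxplot's outlier fences directly and count how many listings fall outside them.

# Boxplot fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR.
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
outliers = df[(df['price'] < lower_fence) | (df['price'] > upper_fence)]
print(lower_fence, upper_fence)
print(len(outliers))   # number of listings the boxplot flags as outliers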

Notice that our boxplot CANNOT tell us the modality of the distribution. There could be many "modes" in this dataset that would be masked by the boxplot.

Violinplot

We can, however, use what we call a violinplot to visualize the modality and skew of the distribution. This plot visualizes what we call the density curve of the histogram. This density curve is what you might see if you overlaid a "spaghetti noodle" over your histogram (we sketch this overlay at the end of this section). This density curve is then mirrored symmetrically to resemble a shape like a violin.

The black and white interior of this shape also shows most of the components of a boxplot, including the:

  • median: white dot
  • Q1 and Q3 (left and right edges of the inner box)
  • min whisker and max whisker (left and right lines emanating out of the box)

However, this plot does not show you any outliers.

sns.violinplot(x=df['price'])
plt.title('Chicago Airbnb Listings')
plt.show()
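
Finally, to see where the violin's outline comes from, we can overlay the density curve directly on the price histogram. This is a sketch using seaborn's histplot function with its kde option enabled.

# Histogram of price with the kernel density estimate (the "spaghetti noodle") overlaid.
sns.histplot(df['price'], kde=True)
plt.title('Chicago Airbnb Listings')
plt.xlabel('Price')
plt.show()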