Single Variable Descriptive Analytics and Data Manipulation
Before proceeding further down the data science pipeline and performing analyses like building a predictive model, it's useful to perform what we might call more "basic descriptive analytics" techniques on the dataset. The general goal of performing descriptive analytics is to "describe" the dataset insightfully and thoroughly. Some basic descriptive analytics techniques might include calculating and evaluating relevant summary statistics of the data and plotting relevant visualizations.
First, trying to describe just a single variable in the dataset, for instance, can help lead to insightful data cleaning decisions that might need to be made before building your predictive model.
Single Categorical Variable
Summary Statistics
For instance, we know that we'd like to include the neighborhood variable as one of our explanatory variables in the linear regression model. This is a categorical variable, so we can use the value_counts() function as shown below to calculate summary statistics for this variable including: the number and percentage of observations that belong to each level of the variable, respectively.
neighborhood_counts = df['neighborhood'].value_counts()
neighborhood_counts
Near North Side 629 West Town 583 Lake View 422 Logan Square 350 Near West Side 339 ... West Elsdon 1 Burnside 1 Edison Park 1 Mount Greenwood 1 East Side 1 Name: neighborhood, Length: 75, dtype: int64
neighborhood_percentages = df['neighborhood'].value_counts(normalize=True)
neighborhood_percentages
Near North Side 0.116030 West Town 0.107545 Lake View 0.077845 Logan Square 0.064564 Near West Side 0.062535 ... West Elsdon 0.000184 Burnside 0.000184 Edison Park 0.000184 Mount Greenwood 0.000184 East Side 0.000184 Name: neighborhood, Length: 75, dtype: float64
Visualizations
We can also visualize each of these percentages, say, in a barplot by using the sns.barplot() function and the following inputs:
- .index of the neighborhood_percentages pandas series that was created (ie. the neighborhoods) and the
- values of the neighborhood_percentages pandas series that was created (ie. the percentage of each observation that belongs to each neighborhood)
plt.figure(figsize=(20,8))
sns.barplot(neighborhood_percentages.index, neighborhood_percentages)
plt.xticks(rotation=90)
plt.show()

Data Cleaning Insights and Rationale
There's quite a lot of distinct neighborhoods in this dataset, most of which do not have that many listings that belong to them. Including categorical explanatory variables in some predictive models like linear regression that have levels with only a few number of observations can lead to several issues. One of these issues is what we call overfitting , which we'll talk more about in a later module.
Thus in order to avoid overfitting, we may decide to only keep the top 5 most popular neighborhoods in this dataset: Near North Side, West Town, Lake View, Logan Square, and Near West Side.
top_neigh = df['neighborhood'].value_counts().index[0:5]
top_neigh
Index(['Near North Side', 'West Town', 'Lake View', 'Logan Square', 'Near West Side'], dtype='object')
df=df[df['neighborhood'].isin(top_neigh)]
df.shape
(2323, 6)
Also, for similar reasons, let's just keep the top two most popular room_types: entire home/apt, private room.
room_type_counts = df['room_type'].value_counts()
room_type_counts
Entire home/apt 1976 Private room 297 Hotel room 42 Shared room 8 Name: room_type, dtype: int64
sns.barplot(room_type_counts.index, room_type_counts)
plt.show()
top_room = df['room_type'].value_counts().index[0:2]
top_room
Index(['Entire home/apt', 'Private room'], dtype='object')
df=df[df['room_type'].isin(top_room)]
df.shape
(2273, 6)
Note however, that the downside of these data cleaning techniques is that our model that we will build will only be able to reliably make predictions for these top 5 neighborhoods and top 2 room types. The number of rows in our dataframe has now also dropped from 5421 to 2273.
Missing Values
Furthermore, before proceeding let's drop all rows with missing values in the dataset. We see that this deletes 212 more observations.
df.isna().sum()
price 0 neighborhood 0 room_type 0 accommodates 0 bedrooms 203 beds 16 dtype: int64
df=df.dropna()
df.shape
(2061, 6)
Single Numerical Variable
Similarly, describing each of the numerical variables that we intend to include in our model may, for instance yield insights about what model is most appropriate to use to make these predictions as well as other data cleaning measures we should take.
Summary Statistics
We can use the .describe() function to quickly calculate a set of useful summary statistics for each numerical variable in a given dataset. These summary statistics include the:
- number of non-missing observations for each variable
- mean
- standard deviation
- minimum
- Q1 (25%)
- median (50%)
- Q3 (75%)
- maximum
df.describe()
price | accommodates | bedrooms | beds | |
---|---|---|---|---|
count | 2061.000000 | 2061.000000 | 2061.000000 | 2061.000000 |
mean | 250.710335 | 4.706938 | 1.918001 | 2.408054 |
std | 2209.424510 | 3.134291 | 1.158578 | 1.934589 |
min | 19.000000 | 1.000000 | 1.000000 | 1.000000 |
25% | 99.000000 | 2.000000 | 1.000000 | 1.000000 |
50% | 149.000000 | 4.000000 | 2.000000 | 2.000000 |
75% | 219.000000 | 6.000000 | 2.000000 | 3.000000 |
max | 99998.000000 | 16.000000 | 12.000000 | 21.000000 |
These basic summary statistics of the numerical variables can tell us quite few important things the may impact our model that we might build.
- The maximum Airbnb price is set at $99,998 a night! Might this be an error? Let's insepct this more below with a boxplot.
- The standard deviations of each of our intended numerical explanatory variables are quite different. This fact will impact how we are able to interpret the slopes in our resulting model. We'll discuss this more in section 08-07.
A more complete analysis would visualize and further examine each numerical variable individually, but for now let's just examine the price variable.
When describing a numerical variable to someone, it is important to be thorough in describing the four most important components of the numerical variable. These are:
- Shape
- skew
- modality
- Measure of center
- mean
- median
- Measure of spread
- standard deviation
- IQR
- range
- Any outliers
To visualize and further describe a single numerical variable, we can use any of the following plots:
- histogram
- boxplot
- violinplot
Each of these three plots are able to describe some of these four components, but not all.
Visualizations
Histograms
For instance, we might create a histogram below. However, interestingly this does not look like a very insightful histogram. There's only one bar. Why might this be?
df['price'].hist()
plt.title('Chicago Airbnb Listings')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
Boxplot
Recall in our summary statistics, we saw an Airbnb that was $99,998 a night. This large observation may be impacting our ability to see the shape of our distribution with this plot above. In other words, this Airbnb might be an outlier.
Remember, a boxplot can identify outliers. That is, it shows you the points in the plot that are either:
- $> Q3+1.5IQR$
- $< Q1-1.5IQR$
sns.boxplot(df['price'])
plt.title('Chicago Airbnb Listings')
plt.show()
We do in fact see a very large outlier. Because this outlier might be an error in the dataset, let's get rid of it and see how this changes our visualizations.
df=df[df['price']<20000]
print(df.shape)
sns.boxplot(df['price'])
plt.title('Chicago Airbnb Listings')
plt.show()
(2060, 6)
By getting rid of this single observation, we do see more outliers in the resulting boxplot. There are pros and cons to leaving outliers in a dataset that you plan to use for modeling purposes.
- One con is that it can inflate the RMSE, a measure of model accuracy, making your model performance look worse, in general, than it actually is.
- However, it's likely that many of these outliers are legitimate Airbnb listings. Therefore by getting rid of them, your model may become less capable of predicting all Chicago Airbnb listings in general (only choosing to focus on the cheaper listings).
Suppose we'd like to be inclusive of the more expensive Airbnb listing hosts, so we'll decide to leave them in in this analysis.
This single listing deletion yielded a more insightful histogram plot as well.
df['price'].hist()
plt.title('Chicago Airbnb Listings')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
Summarizing
From our histogram plot we can determine/estimate the following.
- Shape
- the prices are skewed to the right
- the prices are unimodal
- Measure of center
- the median is usually close to the peak of a unimodal distribution
The fact that this price distribution is not symmetric means that the median (as opposed to the mean) and the IQR (as oppposed to the standard deviation) are better measures of center and measures of spread, respectively to describe the price distribution. Remember that the mean tends to get dragged out in the direction of the skew and/or outliers and can thus become a misleading measure of center. Similarly the standard deviation can become an inflated and misleading measure of spread because of the skew and outliers.
Notice, that the histogram CANNOT reliably/easily tell us about:
- any outliers
- IQR
From our boxplot we can determine the following.
- Shape
- the prices are skewed to the right
- Measure of Center
- the median price is 149 dollars
- Measure of spread
- the IQR price is 120 dollars
- Outliers
- there are many outlier listings with high prices
median = df['price'].median()
median
149.0
Q3 = df['price'].quantile(0.75)
Q1 = df['price'].quantile(0.25)
IQR = Q3-Q1
IQR
120.0
Notice that our boxplot CANNOT tell us the modality of the distribution. There could be many "modes" in this dataset that would be masked by the boxplot.
Violinplot
We can, however, use what we call a violinplot to visualize the modality and skew of the distribution. This plot visualizes what we call the density curve of the histogram. This density curve is what you might see if you overlayed a "spaghetti noodle" over your histogram. This density curve is then symmetrically mirrored to resemble a shape like a violin.
The black and white interior of this shape also show most of the components of a boxplot including the:
- median: white dot
- Q1 and Q3 (left and right edges of the inner box)
- min whisker reach and max whisker reach (left and right lines emanating out of the box)
However, this plot does not show you any outliers.
sns.violinplot(df['price'])
plt.title('Chicago Airbnb Listings')
plt.show()