Describing Associations between Two Variables


Next, measuring and describing the nature of the association between pairs of variables in the dataset can also give us insights as to what additional data cleaning techniques we might need to employ before modeling, as well as what types of models might be good to use. Again, the best types of summary statistics and visualizations that you might use to try to describe these associations are going to depend on the types of variables that you are dealing with (i.e., numerical vs. categorical).
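The code examples in this section assume that pandas, matplotlib, and seaborn have already been imported and that the Airbnb listings data has been loaded into a dataframe called df. A minimal setup sketch (the file name below is an assumption; substitute the path to your own listings file):

# Setup assumed throughout this section (file name is hypothetical)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('listings.csv')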

Two Categorical Variables

Summary Statistics

To calculate summary statistics which help us evaluate the association between two (or more) categorical variables, we can use the pd.crosstab() function to create what we call a contingency table.

For instance, the table below counts up how many listings belong to each combination of neighborhood and room_type.

pd.crosstab(df['neighborhood'], df['room_type'])
room_type        Entire home/apt  Private room
neighborhood
Lake View                    338            58
Logan Square                 292            44
Near North Side              472            47
Near West Side               242            57
West Town                    433            77

Similarly, we can use the normalize='index' parameter to find the percentage of a given neighborhood's listings that are of each room type. For instance, this tells us that:

  • 85% of Lake View listings are entire home/apartment listings
  • 15% of Lake View listings are private room listings
temp = pd.crosstab(df['neighborhood'], df['room_type'], normalize='index')
temp
room_type        Entire home/apt  Private room
neighborhood
Lake View               0.853535      0.146465
Logan Square            0.869048      0.130952
Near North Side         0.909441      0.090559
Near West Side          0.809365      0.190635
West Town               0.849020      0.150980

Visualizations

We can visualize this resulting dataframe in a bar plot with the .plot.bar() function.

The plot below can help us determine that there is an association between room_type and neighborhood.

How do we know this?

For instance, notice that the percentage of Near West Side listings that are private rooms (19.1%) is about twice as high as the percentage of Near North Side listings that are private rooms (9.1%). This difference suggests that there is an association between room_type and neighborhood in this dataset.

temp.plot.bar()
plt.legend(loc='upper right')
plt.ylabel('Relative Frequency')
plt.show()

Recognizing an Association between Two Categorical Variables
In general, the fact that at least one of these "colors" (i.e., room types) has neighborhood percentages that noticeably differ in the plot above suggests an association.
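
As a rough way to put a number on this idea (a sketch, not a formal test), we can compute, for each room type, the gap between its largest and smallest neighborhood percentage in the temp dataframe above. Large gaps point toward an association; gaps near zero for every room type would point toward a lack of one.

# Range of each room type's share across the neighborhoods (rough heuristic)
temp.max() - temp.min()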

Recognizing a Lack of Association between Two Categorical Variables
On the other hand, we would say that there is NOT an association between room_type and neighborhood if, for instance, we had seen the following:

  • all neighborhoods were about equally likely to be a private room listing (i.e., all orange bar heights were about the same)
  • all neighborhoods were about equally likely to be an entire home/apartment listing (i.e., all blue bar heights were about the same)

We could have flipped the ordering of the two categorical variables to produce a slightly different bar plot.

For instance, this tells us that:

  • 20.5% of private rooms in the dataset are in Lake View,
  • 15.5% of private rooms in the dataset are in Logan Square,
  • etc.
temp = pd.crosstab(df['room_type'], df['neighborhood'], normalize='index')
temp
neighborhood     Lake View  Logan Square  Near North Side  Near West Side  West Town
room_type
Entire home/apt   0.190208      0.164322         0.265616        0.136185   0.243669
Private room      0.204947      0.155477         0.166078        0.201413   0.272085

However, regardless of which ordering of the categorical variables you used, your interpretation as to whether there was an association between the two categorical variables would be the same.

temp.plot.bar()
plt.legend(loc='upper right')
plt.ylabel('Relative Frequency')
plt.show()

One Categorical and One Numerical Variable

To visualize the nature of the association between a categorical variable and a numerical variable, we can use either what we call a side-by-side boxplot visualization or a side-by-side violin plot visualization.

For instance, by using the sns.boxplot() function below, we can create two boxplots of price distributions. The first boxplot is for the entire home/apartment listings and the second boxplot is for private room listings. Notice how in the parameters we place the room_type on the x-axis and the price on the y-axis.

sns.boxplot(x='room_type', y='price', data=df)
plt.show()

The outliers in these boxplots make the two distributions difficult to compare, so let's use plt.ylim([0, 1000]) to look only at y-axis values in the plot that are between 0 and 1000.

Is there an association between room type and price?

Because we see a slight separation between the price IQR (i.e., box height) of the entire home/apartment listings and the IQR of the private room listings, this might indicate that there is a moderate association between the two variables room type and price in this dataset.

This is useful information for our model! This indicates that using room_type as an explanatory variable might be useful to predict price.

sns.boxplot(x='room_type', y='price', data=df)
plt.ylim([0,1000])
plt.show()

Describing the Association

Furthermore, we can describe the nature of this association by comparing the four things that best summarize a single numerical variable distribution.
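
Before comparing each of these in turn, note that a quick sketch of most of these summaries at once is available by grouping on room_type and calling .describe():

# Count, mean, std, min, quartiles, and max of price for each room type
df.groupby('room_type')['price'].describe()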

  1. Measure of Center Comparison
    • The median listing price of entire home/apartment listings (157 dollars) is COMPARATIVELY higher than the median listing price of private rooms (70 dollars). We say that one median value is comparatively higher because the price IQRs of the two listing types do not overlap.

    • We choose to use the median as opposed to the mean measure of center because at least one of these price distributions is skewed.

    • We can use the .groupby() function and the .median() function, grouping by room_type to quickly find the median price of each room_type.

df[['price', 'room_type']].groupby(['room_type']).median()
                 price
room_type
Entire home/apt  157.0
Private room      70.0
  2. Measure of Spread Comparison
    • The IQR of entire home/apartment listing prices (115 dollars) is much larger than the IQR of private room listing prices (53.5 dollars).

    • We choose to use the IQR as opposed to the standard deviation measure of spread because at least one of these price distributions is skewed.

    • We can use the .groupby() function and the .quantile() function, grouping by room_type, to quickly find the price IQR of each room_type as shown below.

grouped = df[['price', 'room_type']].groupby('room_type')
grouped.quantile(0.75) - grouped.quantile(0.25)
                 price
room_type
Entire home/apt  115.0
Private room      53.5
  3. Shape Comparison

The shapes of both room type price distributions are right skewed and unimodal.

sns.violinplot(x='room_type', y='price', data=df)
#plt.ylim([0,1000])
plt.show()
  4. Outliers

Both room types have quite a few high price outliers.
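
As a sketch of one common convention, we could count these high outliers per room type using the 1.5*IQR rule that boxplots use for their upper fence (the helper function below is our own, not a pandas built-in):

# Count high-price outliers per room type using the boxplot 1.5*IQR rule
def count_high_outliers(prices):
    q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
    upper_fence = q3 + 1.5 * (q3 - q1)
    return (prices > upper_fence).sum()

df.groupby('room_type')['price'].apply(count_high_outliers)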

Is there an association between neighborhood and price?

On the other hand, looking at the plot below we see a much weaker association between neighborhood and price, because there is much more overlap in ALL of the boxplots' IQRs. However, because there is some slight separation, say, between the Near North Side IQR and the Logan Square IQR, we can say that there is some association. Thus, we'll choose to include neighborhood as one of our explanatory variables in our linear regression model to predict price.

sns.boxplot(x='neighborhood', y='price', data=df)
plt.ylim([0,1000])
plt.show()

Two Numerical Variables

To visualize the nature of the association between two numerical variables we can use a scatterplot.

Is there an association between bedrooms and beds?

It makes sense that in the plot below we see a moderately strong, positive, linear association between the number of bedrooms and the number of beds in an Airbnb.

  1. The relationship is positive, as the trend of the data goes up and to the right.
  2. The relationship is linear, as the best-fitting curve for this data would be a line.
  3. The relationship is moderately strong, as the points do not stray too far above or below this line.
  4. Furthermore, there do not seem to be any strong outliers.

In general, an effective and thorough way to communicate the nature of an association between two numerical variables is to describe the following.

  1. Direction (positive/negative) of the trend
  2. Shape of the trend (linear/nonlinear)
  3. Strength of the trend (none, weak, moderate, strong, etc.)
  4. Any outliers

We can use the sns.scatterplot() function to create a basic scatterplot.

sns.scatterplot(x='bedrooms', y='beds', data=df)
plt.show()

The term "strong" is a subjective assessment. Because the relationship between two numerical variables is linear we can use the correlation to measure the strength and the direction of this relationship. Recall that the correlation ranges from $[-1.0, 1.0]$ and that the closer the |correlation| is to 1.0, the stronger the association. Furthermore, a positive correlation indicates that the relationship is positive, and a negative correlation indicates that the relationship is negative.

Thus, the correlation of 0.859 between bedrooms and beds further validates that the linear relationship is strong and positive.

df[['bedrooms', 'beds']].corr()
          bedrooms      beds
bedrooms  1.000000  0.859116
beds      0.859116  1.000000

If we want to overlay a "best fit line" in a scatterplot, then we can use the sns.lmplot() function.

sns.lmplot(x='bedrooms', y='beds', data=df, ci=None)
plt.show()

Warning! If the relationship between your two numerical variables is NONLINEAR, the correlation is not an effective way to measure the strength and direction of the relationship. If you use the correlation on a relationship that is not linear, your interpretations may be misleading!
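
As a quick illustration of this warning (a sketch using synthetic data, not the Airbnb dataset), a perfect but nonlinear relationship can have a correlation near zero:

# A perfect quadratic (nonlinear) relationship with correlation of about 0
import numpy as np
x = pd.Series(np.linspace(-3, 3, 101))
y = x ** 2
print(x.corr(y))  # approximately 0, even though y is completely determined by x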

We can quickly visualize the nature of the relationships between all of our numerical variables at once using the sns.pairplot() function.

sns.pairplot(df)
plt.show()

These plots yield some interesting insights for our intended model.

  1. Our intended numerical explanatory variables (i.e., accommodates, bedrooms, and beds) all have strong linear relationships with each other. The correlations between each pair of explanatory variables are 0.856, 0.877, and 0.859.
  2. On the other hand, the relationship between each explanatory variable and price (our response variable) is only moderately strong (0.586, 0.611, 0.577).
df.corr(numeric_only=True)
                 price  accommodates  bedrooms      beds
price         1.000000      0.585688  0.611560  0.576968
accommodates  0.585688      1.000000  0.856451  0.876848
bedrooms      0.611560      0.856451  1.000000  0.859116
beds          0.576968      0.876848  0.859116  1.000000

So on one hand, the moderately strong linear relationships of accommodates, bedrooms, and beds with price indicate that using these three variables as explanatory variables in our model might be a good idea.

However, as we'll see in Section 08-08, because our three numerical explanatory variables have a strong linear relationship with each other, our model is going to run into some problems.