Describing Associations between Three Variables


Finally, there are many ways to visualize the relationship between three or more variables in a dataset. A plot of three variables can often help us answer a question of the following form:

"How does the association between x and y change for different values of z?"

When we ask a question with the form above, we say that we are describing the association between x and y, while controlling for z.

The best way to answer a question like this with a visualization again depends on the types of variables that you are dealing with (i.e., numerical vs. categorical).

Association between Two Numerical Variables - Controlling for a Categorical Variable

For instance, we might ask the following question:

"How does the relationship between accommodates and price change based on the room type?"

Because the two variables whose association we are measuring are numerical, and the variable that we are controlling for is categorical, we can draw a scatterplot of price against accommodates and color-code the points by room type. To do so, we pass room_type to the hue parameter of the sns.scatterplot() function.

sns.scatterplot(x='accommodates', y='price', hue='room_type', data=df)
plt.show()

We can also fit a separate best fit line for each room type by using the sns.lmplot() function and specifying the hue parameter.

sns.lmplot(x='accommodates', y='price', hue='room_type', ci=None, data=df)
plt.show()

Furthermore, we can quickly calculate the correlation between accommodates and price for each room type by grouping by room_type with the .groupby() method and then calling the .corr() method.

df[['room_type', 'accommodates', 'price']].groupby(['room_type']).corr()
                              accommodates     price
room_type
Entire home/apt accommodates      1.000000  0.579332
                price             0.579332  1.000000
Private room    accommodates      1.000000  0.529313
                price             0.529313  1.000000
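The grouped .corr() call returns a full 2x2 matrix for every room type, even though we only need one number from each. A minimal sketch of a more compact alternative follows. It assumes a small made-up table (the toy dataframe and its values are invented for illustration, not taken from the listings data) and uses the pandas Series.corr() method to compute one Pearson correlation per group.

```python
import pandas as pd

# Toy stand-in for the listings data; every value here is invented.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 4,
    "accommodates": [2, 4, 6, 8, 1, 2, 3, 4],
    "price": [100, 180, 310, 390, 40, 70, 65, 110],
})

# One Pearson correlation per room type, instead of a 2x2 matrix per group.
corr_by_room = {
    room: group["accommodates"].corr(group["price"])
    for room, group in toy.groupby("room_type")
}

print(pd.Series(corr_by_room, name="corr").round(3))
```

Each value in the result corresponds to the off-diagonal entry of that room type's correlation matrix.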

Thus, with the plots and summary statistics above, we can answer our question as follows.

Question: "How does the relationship between accommodates and price change based on the room type?"

Answer:

1. Direction Change

The direction of the association is positive for both room types.

2. Shape Change

The shape of the association is linear for both room types.

3. Strength Change

The relationship between accommodates and price is slightly stronger for entire home/apartment listings (R=0.58) than it is for private room listings (R=0.53).

4. Outliers Change

There are a few outliers in the relationships between accommodates and price for both room types.

5. Slope Change

The slope of the best fit line for private room listings is slightly higher than it is for entire home/apartment listings.

Thus, if we were to increase the number of people a listing accommodates by 1, we would expect, on average, a larger price increase for private room listings than for entire home/apartment listings.
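The slope comparison above is read off the plot. As a hedged sketch of how one could put numbers on it, the helper below (slope_by_group is a hypothetical name, not part of seaborn or pandas) fits a separate least-squares line to each group with np.polyfit. The toy data is invented and lies exactly on two straight lines so that the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Invented data: each room type lies exactly on a straight line.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 4 + ["Private room"] * 4,
    "accommodates": [2, 4, 6, 8, 1, 2, 3, 4],
    "price": [140, 220, 300, 380, 40, 85, 130, 175],
})

def slope_by_group(data, x, y, group):
    """Least-squares slope of y on x within each level of `group`."""
    slopes = {}
    for level, sub in data.groupby(group):
        # polyfit returns coefficients from highest degree down: [slope, intercept]
        slope, _intercept = np.polyfit(sub[x], sub[y], deg=1)
        slopes[level] = slope
    return pd.Series(slopes, name="slope")

slopes = slope_by_group(toy, "accommodates", "price", "room_type")
print(slopes.round(2))
```

In this toy example the private room slope (45 per additional guest) is slightly higher than the entire home/apartment slope (40). Applying the same helper to the real dataframe would give the slopes of the lines that sns.lmplot() draws.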

Note: There are many more changes you could compare here, but we've covered some of the basics.

Association between Two Numerical Variables - Controlling for a Numerical Variable

For instance, we might ask the following question.

"How does the relationship between accommodates and beds change based on the number of bedrooms?"

Because the two variables whose association we are measuring are numerical, we can again create a scatterplot of these two variables. This time we color-code the points by the numerical variable bedrooms, again by specifying the hue parameter in the sns.scatterplot() function.

sns.scatterplot(x='accommodates', y='beds', hue='bedrooms', data=df)
plt.show()
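The color-coded scatterplot shows the pattern visually but does not quantify it. One possible way to do so, offered as a sketch rather than as the method used in this section, is to bin the control variable with pd.cut() and compute the correlation between accommodates and beds within each bin. The toy data, the bin edges, and the bin labels below are all assumptions chosen for illustration.

```python
import pandas as pd

# Invented data standing in for the listings dataframe.
toy = pd.DataFrame({
    "bedrooms":     [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4],
    "accommodates": [1, 2, 2, 3, 3, 4, 5, 6, 6, 7, 8, 10],
    "beds":         [1, 1, 2, 2, 2, 2, 3, 4, 3, 4, 5, 6],
})

# Bin the numerical control variable; the edges here are an arbitrary choice.
toy["bedroom_bin"] = pd.cut(
    toy["bedrooms"],
    bins=[0, 1, 2, 10],
    labels=["0-1", "2", "3+"],
    include_lowest=True,  # keep listings with 0 bedrooms in the first bin
)

# Correlation between accommodates and beds within each bedroom bin.
corr_by_bin = {
    label: sub["accommodates"].corr(sub["beds"])
    for label, sub in toy.groupby("bedroom_bin", observed=True)
}

print(pd.Series(corr_by_bin, name="corr").round(3))
```

If the correlations differ noticeably from bin to bin, that suggests the strength of the relationship changes with the number of bedrooms. The choice of bins affects the result, so it is worth trying more than one.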

Association between a Numerical and Categorical Variable - Controlling for another Categorical Variable

We might ask the following question.

Question: "How does the relationship between price and room type change based on the neighborhood?"

Because the two variables whose association we are measuring are numerical (price) and categorical (room type), and the variable that we are controlling for is categorical (neighborhood), we can plot side-by-side boxplots or violin plots with neighborhood on the x-axis and price on the y-axis using the sns.boxplot() or sns.violinplot() function. We can then color-code by the room type labels by specifying the hue parameter.

plt.figure(figsize=(8,5))
sns.boxplot(x='neighborhood', y='price', hue='room_type', data=df)
plt.ylim([0,1000])
plt.show()

Answer:

1. Measure of Center Comparison Change

For all neighborhoods except for Near North Side, the median price for the entire home/apartment listings is considerably higher than it is for private room listings.

2. Measure of Spread Comparison Change

For all neighborhoods except for Near North Side, the IQR of price for the entire home/apartment listings is considerably larger than it is for private room listings.

3. Shape Comparison Change

For all neighborhoods except for Near North Side, the price of entire home/apartment listings is much more right-skewed than it is for private room listings.

Each of the 10 price distributions above is unimodal (see violin plots below).

4. Outliers Comparison Change

For all neighborhoods, the entire home/apartment listings have far more high-price outliers than the private room listings.

plt.figure(figsize=(8,5))
sns.violinplot(x='neighborhood', y='price', hue='room_type', data=df)
plt.show()
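The center and spread comparisons in the answer above are read off the boxplots. A minimal sketch of how they could be checked numerically is to group by both categorical variables and aggregate price. The toy data and the neighborhood names below are invented, and iqr is a small helper defined here, not a pandas built-in.

```python
import pandas as pd

# Invented data; "Hood A" and "Hood B" are placeholder neighborhood names.
toy = pd.DataFrame({
    "neighborhood": ["Hood A"] * 6 + ["Hood B"] * 6,
    "room_type": (["Entire home/apt"] * 3 + ["Private room"] * 3) * 2,
    "price": [100, 150, 260, 40, 50, 60, 90, 120, 200, 45, 55, 80],
})

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return values.quantile(0.75) - values.quantile(0.25)

# Median and IQR of price for every (neighborhood, room type) combination.
summary = toy.groupby(["neighborhood", "room_type"])["price"].agg(["median", iqr])

print(summary)
```

Each row of the resulting table corresponds to one box in the side-by-side boxplot, so the medians and IQRs can be compared across room types within each neighborhood.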

Our answer above opens up an interesting side question that a thorough data scientist might want to explore: "Why was the Near North Side neighborhood so different?"