Variable Transformations


Throughout this module so far, our research goal has been to predict the price of a new Airbnb listing. Luckily, we saw that there was a linear relationship between the explanatory variables and the price response variable in our final model (i.e., the linearity condition was met). Therefore, a linear regression model was a suitable model to use to predict price.

y_pred = final_model.predict(X_train_dummies.drop(['beds','room_type*accommodates', 'room_type*bedrooms'], axis=1))
residuals = y_train - y_pred

plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted (Predicted) values')
plt.ylabel('Residuals')
plt.title('Fitted values vs. Residuals Plot')
plt.show()

But this raises the question: what should we do when the linearity condition is not met by a given dataset? Should we abandon using a linear regression model altogether?

Not necessarily! There are some "hacks" that we, as data scientists, can apply: we can try transforming one or more of our variables in an attempt to have the resulting model meet the linearity condition.

Predicting the MPG of Cars

To give you an idea of how we might use this variable transformation "hack", let's introduce a new dataset and a new research goal in which the linearity condition is not going to immediately work out perfectly for us.

Let's read in the dataset below, which contains information on 398 cars from the 1970s and 1980s. Let's say that our research goal in this case is the following.

Research Goal: Build a well-suited linear regression model that enables you to predict the mpg of a car in this dataset given its weight and acceleration.

df_cars=pd.read_csv('auto-mpg.csv', na_values=['data missing'])
df_cars = df_cars[['mpg', 'weight','acceleration']]
df_cars.head()
mpg weight acceleration
0 18.0 3504 12.0
1 15.0 3693 11.5
2 18.0 3436 11.0
3 16.0 3433 12.0
4 17.0 3449 10.5
df_cars.shape
(398, 3)

Descriptive Analytics

Before building our model, we should explore the nature of the variables in this dataset to see whether this yields any insights about how to build our model, or reveals any data cleaning we might need to perform first.

sns.pairplot(df_cars)
plt.show()

Pre-Check for Multicollinearity

Luckily, there does not seem to be a strong linear relationship between our two numerical explanatory variables weight and acceleration. So we do not need to worry as much about our slope interpretations being distorted by multicollinearity.
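If we wanted a quick numerical check to back up this visual impression, one option (a small supplementary sketch, not required for the analysis above) is to look at the correlation between the two explanatory variables.

# A correlation close to 0 between the explanatory variables suggests
# multicollinearity is not a major concern for this model.
df_cars[['weight', 'acceleration']].corr()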

Pre-Check for Linearity Condition

However, unfortunately, we can see that the relationship between weight (an explanatory variable) and mpg (the response variable) does not look linear on its own. Similarly, the relationship between acceleration (an explanatory variable) and mpg (the response variable) does not look quite linear either.

This should make us suspicious that the linearity condition of our originally intended linear regression model below will not be met. In other words, we should be suspicious that the relationship between the explanatory variables (weight and acceleration together) and the response variable (mpg) is not linear.

Linearity Condition of Initial Model

Let's verify this suspicion by fitting our initial linear regression model below and creating its corresponding fitted values vs. residuals plot with the full dataset.

Initial Linear Regression Model

First we fit our initial linear regression model below.

$\hat{mpg}=\hat{\beta}_0+\hat{\beta}_1weight+\hat{\beta}_2acceleration$

X = df_cars[['weight','acceleration']]
X.head()
weight acceleration
0 3504 12.0
1 3693 11.5
2 3436 11.0
3 3433 12.0
4 3449 10.5
y=df_cars['mpg']
y.head()
    0    18.0
    1    15.0
    2    18.0
    3    16.0
    4    17.0
    Name: mpg, dtype: float64
initial_model = LinearRegression()
initial_model.fit(X,y)

And then we create its fitted values vs. residuals plot with the full dataset.

y_pred = initial_model.predict(X)
residuals = y - y_pred

plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted (Predicted) values')
plt.ylabel('Residuals')
plt.title('Initial Model \n Fitted values vs. Residuals Plot')
plt.show()

We can see from the plot above that our initial model and our dataset do not meet the linearity condition, because there exists at least one small-width x-axis box that does not have an even distribution of positive and negative residuals. For instance, the far-left box would have far more positive residuals than negative residuals.
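To make this "small-width box" idea a little more concrete, here is a rough sketch (not part of the original analysis) that bins the fitted values into ten boxes and reports the share of positive residuals in each one; boxes with a share far from 0.5 point to the same violation we see visually.

# Rough numerical version of the "small-width boxes" check (illustrative only):
# bin the fitted values and compute the share of positive residuals per bin.
check = pd.DataFrame({'fitted': y_pred, 'positive_resid': residuals.values > 0})
check['bin'] = pd.cut(check['fitted'], bins=10)
print(check.groupby('bin', observed=True)['positive_resid'].mean())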

Thus, our initial linear regression model below would not be a suitable fit for the dataset.

$\hat{mpg}=\hat{\beta}_0+\hat{\beta}_1weight+\hat{\beta}_2acceleration$

R^2 vs. Linearity Condition

Do note, however, that the $R^2=0.698$ of this model on the full dataset is still somewhat high. This indicates that our linear model still has a relatively decent fit, even though it is most likely not the most suitable model to use.

r2_score(y, y_pred)
0.6982595061815189

Trying out a Variable Transformation

Let's, for instance, inspect the nonlinear relationship between mpg and one of our explanatory variables, weight, specifically.

We might wonder to ourselves: if we can somehow slightly "squash" or "push down" the points in this dataset that have high mpg, more so than the points that have low mpg, we might actually be able to "force" this relationship to look more linear.

sns.lmplot(x='weight', y='mpg', data=df_cars)
plt.show()

Let's recall, for instance, what the natural log function ln() looks like. The higher the input into the ln() function is, the more the function "squashes" that input.

mpg = np.linspace(5, 50, 100)
ln_mpg = np.log(mpg)

plt.plot(mpg, ln_mpg)
plt.xlabel('mpg')
plt.ylabel('ln(mpg)')
plt.show()

Let's use this concept to transform our response variable mpg into ln(mpg).

ln_y = np.log(y)
ln_y.head()
    0    2.890372
    1    2.708050
    2    2.890372
    3    2.772589
    4    2.833213
    Name: mpg, dtype: float64

Log-Transformed Linear Regression Model

Next, we use this transformed variable to fit a new candidate linear regression model below.

$\hat{ln(mpg)}=\hat{\beta}_0+\hat{\beta}_1weight+\hat{\beta}_2acceleration$

log_trans_model = LinearRegression()
log_trans_model.fit(X,ln_y)

And then we create the corresponding fitted values vs. residuals plot for this model and the full dataset below.

Note that your residuals here should now be calculated as:

$$residual_i=\ln(mpg_i)-\hat{ln(mpg_i)}$$

ln_y_pred = log_trans_model.predict(X)
residuals = ln_y - ln_y_pred

plt.scatter(ln_y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted (Predicted) values')
plt.ylabel('Residuals')
plt.title('Log-Transformed Model \n Fitted values vs. Residuals Plot')
plt.show()

We can see that by using $ln(mpg)$ as our response variable instead of $mpg$, the linearity condition for the resulting log-transformed model is much closer to being met. Most of our small-width boxes now contain a roughly even distribution of positive and negative residuals.

pd.DataFrame([log_trans_model.intercept_] + list(log_trans_model.coef_.T),
             index=['intercept'] + list(X.columns))
0
intercept 3.909704
weight -0.000335
acceleration 0.011977

Therefore, it just so happens that our "hack" worked out! Our resulting log-transformed model is a suitable linear regression model to use.

Final Model

$$\hat{ln(mpg)}=3.91-0.000335weight+0.012acceleration$$

Also, notice how the $R^2$ of the new log-transformed model increased to 0.77.

r2_score(ln_y, ln_y_pred)
0.7744264429992024

Making Predictions with Variable Transformed Models

Suppose now that we would like to predict the mpg of a car with a weight of 3500 lbs and an acceleration of 13. We should be careful about how we use our transformed model to make this prediction.

new_point= pd.DataFrame({'weight':[3500], 'acceleration':[13]})
new_point
weight acceleration
0 3500 13

Recall that the response variable we are predicting with our log-transformed model is now $ln(mpg)$. Thus, when we use the .predict() function, we are actually predicting that the log of the mpg of this car is 2.89.

$$\hat{ln(mpg)}=3.91-0.000335(3500)+0.012(13)=2.89$$

log_trans_model.predict(new_point)
array([2.89327317])

Thus, to get the $\hat{mpg}$ on the left-hand side by itself, all we need to do is exponentiate both sides of this equation.

$$\hat{ln(mpg)}=2.89$$
$$e^{\hat{ln(mpg)}}=e^{2.89}$$
$$\hat{mpg}=17.99$$

np.exp(2.89)
17.993309601550315
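Putting these two steps together, a convenient pattern (just a restatement of what we did above) is to exponentiate the model's prediction directly.

# Predict ln(mpg) for the new car, then undo the log transform with exp()
# to get the prediction back on the original mpg scale.
predicted_ln_mpg = log_trans_model.predict(new_point)
predicted_mpg = np.exp(predicted_ln_mpg)
print(predicted_mpg)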

Research Goal Conclusion and Follow Ups

Conclusion

Thus, in conclusion, by using a variable transformation we have satisfied our research goal. We have found a well-suited linear regression model that enables us to predict the mpg of a car in this dataset given its weight and acceleration.

$$\hat{ln(mpg)}=3.91-0.000335weight+0.012acceleration$$

An additional benefit of this log-transformed model is that its $R^2=0.77$ is relatively high; the fit of the model increased after applying the log transformation.
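As a side note, this $R^2=0.77$ is computed on the $ln(mpg)$ scale, so it is not a perfect apples-to-apples comparison with the initial model's $R^2=0.698$. If you wanted to compare both models on the original mpg scale, one rough option (not something we did above) is to back-transform the log-model's fitted values before computing $R^2$.

# Optional: back-transform the log-model's fitted values to the mpg scale
# before computing R^2, so the two models are compared on the same scale.
r2_score(y, np.exp(ln_y_pred))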

Follow Ups

Note that our main research goal was simply to find a linear regression model that satisfies the linearity condition. It just so happens that transforming the response variable with the natural log function worked here, yielding a fitted values vs. residuals plot that shows the linearity condition being met.

This specific technique, however, may not work with all datasets. Some other things you could try are the following.

1. Transforming one or more explanatory variables
Rather than (or in addition to) transforming the response variable with a given function, you could also try transforming one or more of the explanatory variables with a given function (a small sketch of this idea appears after this list).

2. Using Different Functions
Rather than using the natural log function to transform a variable, you could use many other functions. Some common transformations include:

  • $\sqrt{x}$
  • $x^2$, $x^3$, etc.
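For instance, here is a minimal sketch of the first idea, assuming we still have X, y, np, and LinearRegression available from above. It simply swaps weight for sqrt(weight) as a candidate transformation; there is no guarantee that this particular choice improves the fit for this dataset.

# Illustrative sketch: transform an explanatory variable instead of the response.
# Replace weight with sqrt(weight) as one candidate transformation.
X_transformed = X.copy()
X_transformed['sqrt_weight'] = np.sqrt(X_transformed['weight'])
X_transformed = X_transformed.drop('weight', axis=1)

candidate_model = LinearRegression()
candidate_model.fit(X_transformed, y)

# As before, we would then inspect this candidate model's fitted values vs.
# residuals plot to see whether the linearity condition is met.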

🥳🥳🥳 Module Conclusion🥳🥳🥳

This concludes Module 08! In Module 09 we'll discuss even more sophisticated machine learning techniques that can help us build even better linear regression models that yield better predictions for new datasets.