Evaluating your Linear Regression Model for Machine Learning and Interpretation Purposes


Suitability of the Linear Regression Model in General

When deciding whether a linear regression model is the right type of model for your dataset, there are, to start off with, two core assumptions about your dataset that should be met.

  1. Your response variable should be numerical.
  2. The relationship between your explanatory variables (all together) and your response variable should be linear.
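The first assumption is usually easy to verify directly. Here is a quick sanity check (a minimal sketch, assuming the df_train dataset and price column used throughout this lesson) that the response variable is stored as a numeric type:

import pandas as pd

print(df_train['price'].dtype)                           # should be a numeric dtype such as int64 or float64
print(pd.api.types.is_numeric_dtype(df_train['price']))  # True if the response variable is numerical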

Linear Relationship between the Response Variable and Explanatory Variables

How do we evaluate this second assumption in a simple linear regression model?

For a moment, let's only consider our simple linear regression model where we want to predict price with just the number of beds. We can see that this relationship is mostly linear (as opposed to nonlinear), and thus a linear regression curve is the most sensible curve to fit (as opposed to a nonlinear regression curve).

import seaborn as sns
import matplotlib.pyplot as plt

sns.lmplot(x='beds', y='price', data=df_train, ci=None)
plt.title('Training Dataset')
plt.show()

But how could we articulate why this is a linear relationship?

Generally speaking, we can see from this simple linear regression curve $\hat{price}=40.86+66.62beds$ that, as we move up the line, our residuals look like they will be an even mixture of positive and negative.

Let's double check this with another plot below, which plots the model's

  • set of predicted (ie. fitted) values on the x-axis and
  • the residuals of these predicted values on the y-axis.

We call this a fitted values vs. residuals plot.

# Compute the simple model's fitted values and residuals on the training data
y_pred = model.predict(X_train[['beds']])
residuals = y_train - y_pred

# Plot residuals against fitted values, with a reference line at zero
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted (Predicted) values')
plt.ylabel('Residuals')
plt.title('Fitted values vs. Residuals Plot')
plt.show()

We see this same relationship in this plot as well. Specifically, as we move from left to right in the plot above (equivalently, as we move "up" the linear curve's predictions), our residuals are roughly an even distribution of positives and negatives (with some exceptions on the far right).

Practically speaking, what we might do is create small, equally sized boxes over the range of fitted values. If the number of negative and positive residuals is roughly the same in ALL boxes, then we can infer that the relationship between the explanatory variable(s) and the response variable is indeed linear. In the 4 boxes that we drew below, the one on the far right does show more negative residuals; however, given that there are only a few observations in this box, we can still say that the relationship between the explanatory variable(s) and the response variable is mostly linear.
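As a rough sketch of this idea (assuming the y_pred and residuals arrays computed for the plot above), we can bin the fitted values into a few equal-width boxes and count the positive and negative residuals in each:

import numpy as np
import pandas as pd

# Split the range of fitted values into 4 equal-width "boxes"
boxes = pd.cut(y_pred, bins=4)

# Tally positive vs. negative residuals within each box
signs = np.where(np.asarray(residuals) > 0, 'positive', 'negative')
print(pd.crosstab(boxes, signs))
# The linearity assumption looks reasonable if each box has a roughly even split.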

Example of a Nonlinear Relationship

By contrast, suppose we had this artificial 2-d dataset, which clearly has a nonlinear relationship. We can see that, as we move up the best fit linear regression line, there would be parts of the line with far more positive residuals than negative residuals (and vice versa).

We can similarly take this dataset, fit a linear regression curve to it ($\hat{y}=267.4+10x$), and create a corresponding fitted values vs. residuals plot shown below.

We can also use this fitted values vs. residuals plot to infer that the relationship between the explanatory variable(s) x and the response variable y is not linear in this dataset because there exists at least one "x-axis window" in which there is not an even distribution of positive and negative residuals.
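The artificial dataset itself isn't shown here, but here is a minimal sketch of how you could reproduce this kind of diagnostic on a hypothetical nonlinear dataset that we generate ourselves:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate a clearly nonlinear (curved) relationship between x and y
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)
y = 0.05 * x ** 2 + rng.normal(0, 30, size=200)

# Fit a (misspecified) linear regression curve and plot fitted values vs. residuals
curve = LinearRegression().fit(x.reshape(-1, 1), y)
fitted = curve.predict(x.reshape(-1, 1))
resid = y - fitted

plt.scatter(fitted, resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted (Predicted) values')
plt.ylabel('Residuals')
plt.title('Fitted values vs. Residuals Plot (nonlinear example)')
plt.show()
# The left and right ends of this plot show mostly positive residuals while the
# middle shows mostly negative ones, so at least one "x-axis window" fails the check.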

Using a Fitted Values vs. Residuals Plot to Evaluate Linearity Assumption

Now, if we were dealing with a multiple linear regression model with two explanatory variables $x_1$ and $x_2$ and trying to predict some response variable $y$, then our fitted linear regression curve $\hat{y}=\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2$ would actually be represented by a 3-d plane. And ideally our observations would have a roughly even distribution of positive and negative residuals anywhere we were on the plane.

Unfortunately, we are unable to visualize linear regression curves with more than two explanatory variables. However, we can use the corresponding fitted values vs. residuals plot to infer if the explanatory variables do indeed have a linear relationship with the response variable using the following steps and interpretation.

  1. Create a series of small-width boxes in your fitted values vs. residuals plot, going from left to right.
  2. Interpretation:
       • Linearity Assumption Satisfied: If ALL of the boxes have a roughly even distribution of positive and negative residuals, then we say that there is a linear relationship between the explanatory variables and the response variable in this dataset. Thus, a linear regression model is a suitable model for this dataset.
       • Linearity Assumption Not Satisfied: If AT LEAST ONE of the boxes DOES NOT have a roughly even distribution of positive and negative residuals, then we say that there is not a linear relationship between the explanatory variables and the response variable in this dataset. Thus, a linear regression model is not a suitable model for this dataset.

Assessing the Model Fit of a Dataset

Next, suppose that you've fit a linear regression model to your dataset and you've used its corresponding fitted values vs. residuals plot to verify that there is indeed a linear relationship between the explanatory variables and the response variable, and thus that a linear regression model is suitable for this dataset. Unfortunately, this does not necessarily mean that your linear regression model will be good at predicting price (for either your training or your test dataset). So how can we assess how good the fit of this linear regression model is for a given dataset?

RMSE

In Data Science Discovery we introduced the root mean square error (RMSE) $= \sqrt{\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{n}}$ as a metric that evaluates the fit of a given predictive model for a given dataset.

But what does this RMSE actually represent? Well, $(y_i-\hat{y}_i)^2$ represents the square of the error (ie. residual) of a single observation. Then we can think of $\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{n}$ as the mean of each of these squared error terms. And by taking the square root of $\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{n}$, we can approximately interpret the whole RMSE as essentially the average error of each response variable value in the dataset.
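To make this concrete, here is a minimal sketch of the RMSE formula computed by hand (assuming y_train and the training predictions y_pred_train computed in the next subsection):

import numpy as np

squared_errors = (np.asarray(y_train) - np.asarray(y_pred_train)) ** 2  # (y_i - y_hat_i)^2 for each observation
rmse_by_hand = np.sqrt(squared_errors.mean())                           # square root of the mean squared error
print(rmse_by_hand)  # should match sklearn's mean_squared_error(..., squared=False) below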

Problems with RMSE

If we calculate the RMSE of our fitted linear regression model with our training dataset we get $165.91.

from sklearn.metrics import mean_squared_error

y_pred_train = main_model.predict(X_train_dummies)
rmse = mean_squared_error(y_train, y_pred_train, squared=False)
rmse
165.9138131819029

And if we calculate the RMSE of our fitted linear regression model with our test dataset we get $178.25.

y_pred_test = main_model.predict(X_test_dummies)
rmse = mean_squared_error(y_test, y_pred_test, squared=False)
rmse
178.24942782040878

Ideally, we would have no model error for any of our observations. So ideally the overall RMSE would be 0 and the closer the RMSE is to 0, the better.

But in the likely event that the RMSE is not 0, how are we to evaluate whether a test data RMSE of $178.25 represents a lot of price error, comparatively little price error, or anything in between? The answer is highly contextual and depends on the response variable that you're measuring.

Question: If we were to build another linear regression model that predicted the height of the world's tallest mountains (in cm) and our model were also to yield an RMSE of 178.25, which model would you say has better performance?

  1. Our Airbnb price prediction model
  2. The mountain height prediction model

R^2

Thus, one of the issues with using RMSE to evaluate your model results is that it is not a relative evaluation metric. That is, you cannot reliably compare the RMSE of models applied to different datasets, because whether a given RMSE value is good or bad is highly dependent on the type of response variable that you're dealing with.

Let's try to come up with a relative model evaluation metric that can reliably be used to measure model performance across different types of datasets.

SSE

Recall that when fitting a linear regression model with a training dataset, we are attempting to find the optimal intercept and slope values that minimize the following function.

$\sum_{i=1}^n(y_i - \hat{y}_i)^2=\sum_{i=1}^n\left(y_i - (\hat{\beta}_0+\hat{\beta}_1x_{i,1}+\ldots+\hat{\beta}_8x_{i,8})\right)^2$

After we've found these optimal intercept and slope values (let's call them $\hat{\beta}_0,\hat{\beta}_1,...,\hat{\beta}_8$), we call this minimal value the sum square error (SSE) of the model. We can actually calculate the SSE of any given dataset (training or test, for instance) using this equation.

$SSE=\sum_{i=1}^n(y_i - \hat{y}_i)^2=\sum_{i=1}^n\left(y_i - (\hat{\beta}_0+\hat{\beta}_1x_{i,1}+\ldots+\hat{\beta}_8x_{i,8})\right)^2$

This SSE can be thought of as the amount of response variable variability in the dataset that is NOT explained by the model.

Unfortunately, the SSE is also not a relative metric. How do we know if this response variable variability is comparatively low or high? Again it's going to depend on the nature of your response variable.

To help turn our SSE into a relative metric, we'll define two more terms.

SST

We call the sum square total (SST) the result of the following equation for a given dataset.

$SST=\sum_{i=1}^n (y_i-\bar{y})^2$

This term might look familiar: it's actually the numerator of the variance equation for the response variable $y$. Because this term measures the amount of deviation of the response variable observations from the mean of the response variable, we say that the SST represents the total amount of response variable variability in the dataset.

SSR

Thus, if we want to measure the amount of response variable variability in the dataset that IS explained by the model, we can define one more term, the sum square regression (SSR), which is

$SSR=SST-SSE$.

R^2

Unfortunately, the SSR is still not a relative term. However, if we divide the SSR by the SST, then we get what we call the R^2, which represents the PERCENT of response variable variability in the dataset that IS explained by the model. Because this R^2 term is now a percent, we can faithfully compare the R^2 of different models and different datasets as one way to evaluate the fit of our linear regression models.

$R^2 = \frac{SSR}{SST}=\frac{SST-SSE}{SST}=1-\frac{SSE}{SST}$

Range of R^2

[0,1]

Because ideally we would like for 100% of our response variable variability to be explained by the model, the best model fit is represented by an R^2=1. Similarly, the worst model fit would be represented by an R^2=0.
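To connect this formula back to code, here is a minimal sketch (assuming y_train and y_pred_train from the RMSE section) that computes SSE, SST, and R^2 by hand and compares the result against sklearn's r2_score:

import numpy as np
from sklearn.metrics import r2_score

y = np.asarray(y_train)
y_hat = np.asarray(y_pred_train)

sse = np.sum((y - y_hat) ** 2)     # response variable variability NOT explained by the model
sst = np.sum((y - y.mean()) ** 2)  # total response variable variability
print(1 - sse / sst)               # R^2 = 1 - SSE/SST
print(r2_score(y_train, y_pred_train))  # the two values should agree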

Model R^2 of the Training and Test Dataset

Below we calculate the R^2 of our linear regression model on the training dataset as 0.45. This means that:

  • 45% of the training dataset price variability is explained by the model and
  • 55% (= 1 - 0.45) of the training dataset price variability is not explained by the model.

from sklearn.metrics import r2_score
r2_score(y_train, y_pred_train)
0.448434174678492

Below we calculate the R^2 of our linear regression model on the test dataset as 0.325. This means that:

  • 32.5% of the test dataset price variability is explained by the model and
  • 67.5% (= 1 - 0.325) of the test dataset price variability is not explained by the model.

from sklearn.metrics import r2_score
r2_score(y_test, y_pred_test)
0.3252232124705081

We can see that the model was a better fit for the training dataset than it was for the test dataset. This makes sense, as it was the training dataset that was used to select the best intercept and slopes, trying to keep the SSE small, which in turn makes the $SSR = SST - SSE$ larger, and hence makes $R^2=\frac{SSR}{SST}$ larger.

However, given that the goal of our fitted model is to predict the price of new Airbnb listings that were not considered when coming up with the best intercept and slopes of our model, we would expect the model's performance on such listings to more closely resemble the test dataset R^2.

Trusting your Model's Predictions

Question : Now, suppose that you've found that your model has a high R^2 for both the training dataset and the test dataset. Does this mean that you can trust your model to yield accurate predictions for any input that you give it?

No! For instance, recall that in Data Science Discovery we cautioned against making extrapolations with a fitted model.

Let's consider our current fitted model which predicts price with our 5 explanatory variables.

$\hat{price}=-77.9+9.24accommodates+86.17bedrooms+12.10beds-12.47neighborhood_{Logan_Square}+103.03neighborhood_{Near_North_Side}+47.45neighborhood_{Near_West_Side}+35.58neighborhood_{West_Town}+14.62roomtype_{private_room}$

We have calculated some summary statistics for the numerical variables in our training dataset using the describe() function.

df_train.describe()
accommodates bedrooms beds price
count 1648.000000 1648.000000 1648.000000 1648.000000
mean 4.739078 1.933252 2.437500 203.244539
std 3.181974 1.183949 2.003573 223.468171
min 1.000000 1.000000 1.000000 19.000000
25% 2.000000 1.000000 1.000000 100.000000
50% 4.000000 2.000000 2.000000 149.000000
75% 6.000000 2.000000 3.000000 219.250000
max 16.000000 12.000000 21.000000 4500.000000

Question : Should you trust the price predicted by this model for an Airbnb listing which accommodates 10 guests, has 14 bedrooms, and 10 beds?

No! An Airbnb with 14 bedrooms is outside of the range [1,12] of the number of bedrooms in the training dataset. Therefore, this would be an extrapolation.
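A simple way to guard against this is to compare any new input against the ranges seen in the training dataset before trusting a prediction. Here is a minimal sketch (assuming df_train from this lesson; the new listing below is hypothetical):

new_listing = {'accommodates': 10, 'bedrooms': 14, 'beds': 10}

for column, value in new_listing.items():
    low, high = df_train[column].min(), df_train[column].max()
    print(f"{column}={value} inside training range [{low}, {high}]: {low <= value <= high}")
# bedrooms=14 falls outside [1, 12], so a prediction for this listing would be an
# extrapolation and should not be trusted.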

Summary

Just because this linear regression model is suitable and a good fit for this training dataset within the range of its explanatory variables (ie. accommodates: [1,16], bedrooms: [1,12], beds: [1,21]) does not necessarily mean that it would be suitable and a good fit for a more expansive dataset that falls outside of this range. Without such a dataset, we have no way of knowing whether the model's suitability and fit would start to decline outside of this range.

Trusting your Interpretation of your Model’s Slopes

Noticing Multicollinearity

Now, for instance, let's suppose that you discovered a recent update to your dataset in which one of your training dataset listings actually had 6 beds as opposed to 2 beds. Suppose in your attempt to fix this you hastily just added a new column called beds_new and didn't drop the beds column, leaving you with the following training dataset.

# Copy the training data and add the "corrected" beds column without dropping beds
df_train_new = df_train.copy()
df_train_new['beds_new'] = df_train['beds']
df_train_new.iloc[0, 6] = 6  # set beds_new (column 6) to the corrected value for the first listing
df_train_new[['price', 'beds', 'beds_new']].head()
price beds beds_new
1193 85 2.0 6.0
1392 602 4.0 4.0
338 335 5.0 5.0
1844 395 2.0 2.0
359 166 2.0 2.0

Then suppose your colleague hastily picked up this training dataset and decided to fit a linear regression model that predicted price with the following explanatory variables.

  1. beds
  2. beds_new

The resulting linear regression model was the following.

import pandas as pd
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(df_train_new[['beds', 'beds_new']], y_train)
pd.DataFrame([model.intercept_]+list(model.coef_.T), index=['intercept','beds', 'beds_new'])
0
intercept 40.927729
beds 88.902492
beds_new -22.288785

This model had an R^2=0.3569 for the training dataset.

y_pred_train = model.predict(df_train_new[['beds', 'beds_new']])
r2_score(y_train, y_pred_train)
0.35686243191891487

Now, let's interpret these two coefficients for beds and beds_new.

  1. beds: All else held equal, if we were to increase the number of beds in a Chicago Airbnb listing by one, then we would expect the price to increase by $88.90 on average.

  2. beds_new: All else held equal, if we were to increase the number of beds in a Chicago Airbnb listing by one, then we would expect the price to decrease by $22.29 on average.

Remember that technically these two columns (beds and beds_new) actually represent the same thing (ie. number of beds). And practically speaking it makes sense that one of them should have been left out as an explanatory variable to avoid confusion, at the very least.

But notice how the two slopes for these two variables give us wildly different interpretations. Given that these two variables are practically the same except for one observation, shouldn't these two slopes look more similar to each other?

Also notice that if we plot the relationships between beds vs. price and beds_new vs. price, we see that both of these relationships are positive. Thus beds_new having a negative slope of -22.29 in the linear regression model seems a bit puzzling.

sns.lmplot(x='beds', y='price', data=df_train_new, ci=None)
plt.title('Training Data')
plt.show()
sns.lmplot(x='beds_new', y='price', data=df_train_new, ci=None)
plt.title('Training Data')
plt.show()

Ultimately, what we'd like to know is which one of these slope interpretations can we trust when it comes to interpreting and summarizing the general relationship between the number of beds in a Chicago Airbnb and the price, in the presence of the other variable being included in the model.

Unfortunately, the answer is that potentially BOTH of these slopes are misleading from a general interpretation perspective. We can see in the scatterplot below, and from a correlation of 0.9988 between beds and beds_new, that there is a strong linear relationship between these two explanatory variables. When at least two explanatory variables in a linear regression model have a strong linear relationship with each other, we say that the two explanatory variables are collinear and that the model has multicollinearity.

sns.scatterplot(x='beds', y='beds_new', data=df_train_new)
plt.title('Training Dataset')
plt.show()
df_train_new[['beds', 'beds_new']].corr()
beds beds_new
beds 1.000000 0.998792
beds_new 0.998792 1.000000

Closer Examination

Why do we think the linear regression model gave such wildly different slopes for beds and beds_new when it came to predicting price?

Notice how if we fit a linear regression model predicting price with just beds, then we get a more sensible linear regression model which has a positive slope for beds: $\hat{price}=40.86+66.62beds$. This is more consistent with the positive relationship we see in the scatterplot of beds and price.

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(df_train_new[['beds']], y_train)
pd.DataFrame([model.intercept_]+list(model.coef_.T), index=['intercept','beds'])
0
intercept 40.859250
beds 66.619606

However, notice that the R^2 of this model on the training dataset ever so slightly decreased to R^2=0.3567. This is an unfortunate reality when it comes to using the R^2 to measure how well a model fits the training dataset it was trained on. No matter how superfluous, redundant, or collinear an explanatory variable is, the training dataset R^2 will never get any better (ie. higher) if you remove this variable. It will most likely get worse (ie. lower)!

y_pred_train = model.predict(df_train_new[['beds']])
r2_score(y_train, y_pred_train)
0.3567658509296193

Because a linear regression model is designed to minimize the $SSE=\sum_{i=1}^n(y_i-\hat{y}_i)^2$, it will do whatever it takes to make this value even smaller, even if this means selecting slope values that start to lose interpretive value.

Notice how the price prediction for a listing with 2 beds (and beds_new = 2) below ends up being very close under both models. Given that most of these training dataset predictions will be extremely close, there is very little forcing the first model to select a slope for beds_new that is sensible.

  • Model 1: $\hat{price}=40.93+88.90beds-22.29beds_new = 40.93+88.90(2)-22.29(2) = 174.15$
  • Model 2: $\hat{price}=40.86+66.62beds = 40.86+66.62(2) = 174.10$
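We can check this across the whole training dataset with a short sketch (assuming df_train_new and y_train from above; the variable names model_with_both and model_beds_only are ours, not from the lesson):

import numpy as np
from sklearn.linear_model import LinearRegression

model_with_both = LinearRegression().fit(df_train_new[['beds', 'beds_new']], y_train)
model_beds_only = LinearRegression().fit(df_train_new[['beds']], y_train)

pred_both = model_with_both.predict(df_train_new[['beds', 'beds_new']])
pred_beds = model_beds_only.predict(df_train_new[['beds']])

# Based on the fitted coefficients above, the two sets of predictions should be nearly
# identical for every listing except the single one whose beds_new value we edited.
print(np.median(np.abs(pred_both - pred_beds)))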

Dealing with Multicollinearity

When a model has multicollinearity, this can cause several issues including the following.

  1. You may not be able to trust the interpretation of the resulting slopes in your model.
  2. Your model may be overfitted to the training dataset.

One of the downsides to overfitting a model to the training dataset is that the model may be a worse fit for the test dataset.

Notice how the R^2 of the test dataset actually got worse when we included the beds_new variable in the model. Thus, we can infer that using beds_new would yield worse performance when it comes to predicting price for new datasets.
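As a rough check of this claim, here is a sketch that evaluates both models on the test dataset (assuming X_test_dummies and y_test from earlier; since the test data was never edited, we create a beds_new column equal to beds):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X_test_beds = X_test_dummies[['beds']].copy()
X_test_beds['beds_new'] = X_test_beds['beds']  # test listings were never edited

model_with_both = LinearRegression().fit(df_train_new[['beds', 'beds_new']], y_train)
model_beds_only = LinearRegression().fit(df_train_new[['beds']], y_train)

print(r2_score(y_test, model_beds_only.predict(X_test_dummies[['beds']])))  # beds only
print(r2_score(y_test, model_with_both.predict(X_test_beds)))               # beds + beds_new (collinear)
# The collinear model's test R^2 should be no better, and is likely slightly worse.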

One technique in dealing with multicollinearity is to delete enough explanatory variables until your model no longer has any issues with multicollinearity.

Example

Let's return back to our original research goal of predicting the price of new Chicago Airbnb listings using the following 5 explanatory variables.

  1. Neighborhood
  2. Room type
  3. accommodates
  4. bedrooms
  5. beds

We fit the following model, but can we trust the interpretative value of these slopes?

$\hat{price}=-77.9+9.24accommodates+86.17bedrooms+12.10beds-12.47neighborhood_{Logan_Square}+103.03neighborhood_{Near_North_Side}+47.45neighborhood_{Near_West_Side}+35.58neighborhood_{West_Town}+14.62roomtype_{private_room}$

Unfortunately, if we look at the relationship between each pair of our numerical explanatory variables below, we can see that we should be skeptical. Because there is at least one pair of numerical explanatory variables with a strong linear relationship (in fact, ALL of the pairs have one), our model has multicollinearity. Thus, our model may be overfit to the training dataset, and we should be skeptical of the interpretive value of our slopes.

sns.pairplot(X_train)
plt.show()
X_train.corr()
accommodates bedrooms beds
accommodates 1.000000 0.854718 0.879999
bedrooms 0.854718 1.000000 0.860562
beds 0.879999 0.860562 1.000000

Because all three of these numerical variables have strong correlations with each other, we should only leave one of these variables in the model to get rid of the multicollinearity issue.

So which one should we leave in?

One approach to answering this is to try out three candidate models, each using only one of the collinear numerical explanatory variables. We can see that the R^2 of the test dataset is highest when we leave in accommodates. So let's drop beds and bedrooms.

Candidate Model 1 (Test R^2=0.26)

$\hat{price}=\hat{\beta}_0+\mathbf{\hat{\beta}_1beds}+\hat{\beta}_2neighborhood_{Logan_Square}+\hat{\beta}_3neighborhood_{Near_North_Side}+\hat{\beta}_4neighborhood_{Near_West_Side}+\hat{\beta}_5neighborhood_{West_Town}+\hat{\beta}_6roomtype_{private_room}$

test_model = LinearRegression()
test_model.fit(X_train_dummies.drop(['bedrooms','accommodates'], axis=1), y_train)

y_pred_test = test_model.predict(X_test_dummies.drop(['bedrooms','accommodates'], axis=1))

r2_score(y_test, y_pred_test)
0.2562558334384033

Candidate Model 2 (Test R^2=0.31)

$\hat{price}=\hat{\beta}_0+\mathbf{\hat{\beta}_1bedrooms}+\hat{\beta}_2neighborhood_{Logan_Square}+\hat{\beta}_3neighborhood_{Near_North_Side}+\hat{\beta}_4neighborhood_{Near_West_Side}+\hat{\beta}_5neighborhood_{West_Town}+\hat{\beta}_6roomtype_{private_room}$

test_model = LinearRegression()
test_model.fit(X_train_dummies.drop(['beds','accommodates'], axis=1), y_train)

y_pred_test = test_model.predict(X_test_dummies.drop(['beds','accommodates'], axis=1))

r2_score(y_test, y_pred_test)
0.3051361985583356

Candidate Model 3 (Test R^2=0.32)

$\hat{price}=\hat{\beta}_0+\mathbf{\hat{\beta}_1accommodates}+\hat{\beta}_2neighborhood_{Logan_Square}+\hat{\beta}_3neighborhood_{Near_North_Side}+\hat{\beta}_4neighborhood_{Near_West_Side}+\hat{\beta}_5neighborhood_{West_Town}+\hat{\beta}_6roomtype_{private_room}$

test_model = LinearRegression()
test_model.fit(X_train_dummies.drop(['bedrooms','beds'], axis=1), y_train)

y_pred_test = test_model.predict(X_test_dummies.drop(['bedrooms','beds'], axis=1))

r2_score(y_test, y_pred_test)
0.32336698322666924

We get the following new model, in which we can trust the interpretative value of the slopes more, and which is less likely to be overfit to the training dataset.

new_model = LinearRegression()
new_model.fit(X_train_dummies.drop(['bedrooms','beds'], axis=1), y_train)
pd.DataFrame([new_model.intercept_]+list(new_model.coef_.T),
index=['intercept']+list(X_train_dummies.columns.drop(['bedrooms','beds'])))
0
intercept -36.668703
accommodates 42.869466
neighborhood_Logan Square -3.516044
neighborhood_Near North Side 91.599767
neighborhood_Near West Side 43.789388
neighborhood_West Town 31.713225
room_type_Private room 1.891156

Searching for a Model with the Best Test Data R^2

Notice that the R^2 of this new model for the test dataset (0.323) is slightly lower than it was for the old model with the test dataset (0.325). Thus, while deleting all but one of the collinear explanatory variables helped bring about a more sensible interpretation of the resulting slopes that we can trust more, unfortunately in this case deleting beds and bedrooms gave us a slightly worse fit for the test dataset.

However, if we were to try out each of the following candidate models (where we only delete one of the 3 collinear variables at a time), then we end up finding a model that has the highest test dataset R^2 that we have seen so far (0.329).

Candidate Model 4 (Test R^2=0.31)

$\hat{price}=\hat{\beta}_0+\mathbf{\hat{\beta}_1beds}+\mathbf{\hat{\beta}_2accommodates}+\hat{\beta}_3neighborhood_{Logan_Square}+\hat{\beta}_4neighborhood_{Near_North_Side}+\hat{\beta}_5neighborhood_{Near_West_Side}+\hat{\beta}_6neighborhood_{West_Town}+\hat{\beta}_7roomtype_{private_room}$

test_model = LinearRegression()
test_model.fit(X_train_dummies.drop(['bedrooms'], axis=1), y_train)

y_pred_test = test_model.predict(X_test_dummies.drop(['bedrooms'], axis=1))

r2_score(y_test, y_pred_test)
0.3111943022376431

Candidate Model 5 (Test R^2=0.329)

$\hat{price}=\hat{\beta}_0+\mathbf{\hat{\beta}_1bedrooms}+\mathbf{\hat{\beta}_2accommodates}+\hat{\beta}_3neighborhood_{Logan_Square}+\hat{\beta}_4neighborhood_{Near_North_Side}+\hat{\beta}_5neighborhood_{Near_West_Side}+\hat{\beta}_6neighborhood_{West_Town}+\hat{\beta}_7roomtype_{private_room}$

test_model = LinearRegression()
test_model.fit(X_train_dummies.drop(['beds'], axis=1), y_train)

y_pred_test = test_model.predict(X_test_dummies.drop(['beds'], axis=1))

r2_score(y_test, y_pred_test)
0.3293011996166699

Candidate Model 6 (Test R^2=0.3099)

$\hat{price}=\hat{\beta}_0+\mathbf{\hat{\beta}_1beds}+\mathbf{\hat{\beta}_2bedrooms}+\hat{\beta}_3neighborhood_{Logan_Square}+\hat{\beta}_4neighborhood_{Near_North_Side}+\hat{\beta}_5neighborhood_{Near_West_Side}+\hat{\beta}_6neighborhood_{West_Town}+\hat{\beta}_7roomtype_{private_room}$

test_model = LinearRegression()
test_model.fit(X_train_dummies.drop(['accommodates'], axis=1), y_train)

y_pred_test = test_model.predict(X_test_dummies.drop(['accommodates'], axis=1))

r2_score(y_test, y_pred_test)
 0.3099441057668366

Thus, we may decide to set our final model to the following, which just deletes beds and leaves in both bedrooms and accommodates. However, do note that we are now again including two explanatory variables that are collinear. Thus, our resulting slope interpretations may still be misleading.

$\hat{price}=-85.59+13.36accommodates+94.59bedrooms-11.30neighborhood_{Logan_Square}+105.15neighborhood_{Near_North_Side}+48.06neighborhood_{Near_West_Side}+36.07neighborhood_{West_Town}+17.16roomtype_{private_room}$

new_model = LinearRegression()
new_model.fit(X_train_dummies.drop(['beds'], axis=1), y_train)
pd.DataFrame([new_model.intercept_]+list(new_model.coef_.T),
index=['intercept']+list(X_train_dummies.columns.drop(['beds'])))
0
intercept -85.589216
accommodates 13.363291
bedrooms 94.593238
neighborhood_Logan Square -11.292819
neighborhood_Near North Side 105.149914
neighborhood_Near West Side 48.056851
neighborhood_West Town 36.070250
room_type_Private room 17.156893

However, by trying to find the model with the best fit for the test dataset, we were able to move one step closer in the pursuit of our overarching research goal. That is, we found a model that we can infer will perform better when it comes to predicting the price of new Chicago Airbnb listings compared to the other models we have looked at so far.
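Finally, instead of fitting each candidate model by hand, you could loop over the candidate drop-sets considered above and report each test dataset R^2 in one go. A minimal sketch (assuming X_train_dummies, X_test_dummies, y_train, and y_test from earlier):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

drop_sets = [
    ['bedrooms', 'accommodates'],  # keep beds only (Candidate Model 1)
    ['beds', 'accommodates'],      # keep bedrooms only (Candidate Model 2)
    ['bedrooms', 'beds'],          # keep accommodates only (Candidate Model 3)
    ['bedrooms'],                  # keep beds + accommodates (Candidate Model 4)
    ['beds'],                      # keep bedrooms + accommodates (Candidate Model 5)
    ['accommodates'],              # keep beds + bedrooms (Candidate Model 6)
]

for drop in drop_sets:
    candidate = LinearRegression().fit(X_train_dummies.drop(columns=drop), y_train)
    test_r2 = r2_score(y_test, candidate.predict(X_test_dummies.drop(columns=drop)))
    print(f"dropping {drop}: test R^2 = {test_r2:.3f}")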