Airbnb Research Goal Conclusion


We've tried out and evaluated many linear regression models in this module in pursuit of our main research goal: predicting the price of new Chicago Airbnb listings. In any given data science project, it's good practice to summarize and discuss your main findings with respect to your main research goal, as well as to discuss the pros and cons of your analysis and findings. If your analysis has any cons (almost all do), then you should discuss avenues for future work that might address these shortcomings.

Let's do that now!

Evaluating the Main Research Goal

Our main research goal was to fit a model that would be able to predict the price of new Chicago Airbnb listings. We trained many models with our training dataset and evaluated their predictive performance on the test dataset using the test dataset $R^2$ value. Out of all the models that we tried, the one with the highest test dataset $R^2$ was the following.

$$\widehat{\text{price}} = -85.59 + 13.36\,\text{accommodates} + 94.59\,\text{bedrooms} - 11.30\,\text{neighborhood}_{\text{Logan Square}} + 105.15\,\text{neighborhood}_{\text{Near North Side}} + 48.06\,\text{neighborhood}_{\text{Near West Side}} + 36.07\,\text{neighborhood}_{\text{West Town}} + 17.16\,\text{roomtype}_{\text{private room}}$$
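
For instance, a hypothetical new listing (the values here are made up purely for illustration) that accommodates 4 guests, has 2 bedrooms, is in the Near North Side neighborhood, and is a private room would get a predicted price of about $279 a night.

# Plugging a hypothetical listing into the fitted equation:
# accommodates = 4, bedrooms = 2, Near North Side, private room
price_hat = (-85.59
             + 13.36 * 4    # accommodates
             + 94.59 * 2    # bedrooms
             + 105.15 * 1   # neighborhood: Near North Side (other neighborhood dummies are 0)
             + 17.16 * 1)   # room type: private room
price_hat  # approximately 279.34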

This model's test dataset $R^2$ was 0.329, which means that 32.9% of the variability of the price variable in the test dataset was explained by this model. Because the test dataset listings were not used to fit the model, we might infer that the model's performance with new Chicago Airbnb listings that haven't been priced yet will be very similar.
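
For reference, this test dataset $R^2$ can be computed by comparing the final model's test dataset predictions to the actual test dataset prices. A minimal sketch with scikit-learn, assuming X_test_dummies and y_test were created in the same way as their training dataset counterparts:

from sklearn.metrics import r2_score

# Predict prices for the held-out test dataset listings
y_test_pred = final_model.predict(X_test_dummies.drop(
    ['beds', 'room_type*accommodates', 'room_type*bedrooms'], axis=1))

# Proportion of test dataset price variability explained by the model
r2_score(y_test, y_test_pred)  # 0.329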

One might interpret this low percentage of 32.9%, closer to 0% than to 100%, as indicating that our best model is not a good fit. However, an $R^2$ value should always be interpreted in a context-dependent way. It may be the case that accurately predicting the price of Airbnb listings is a very challenging task, in which case a model with a test dataset $R^2 = 0.329$ might actually be pretty good. Given the importance of context, it's best to consult with subject matter experts about whether your $R^2$ is good or bad.

Evaluating Additional (Implicit) Research Goals

While our main research goal was to build a model that yields accurate predictions for new Airbnb listings, having a model that is suitable for the task and whose slopes are interpretable is also a worthwhile secondary goal in any given machine learning analysis.

Model Suitability

Given that our response variable is numerical and the linearity condition is met (as shown below), we can say that a linear regression model is suitable for the task of predicting our response variable.

We know that our linearity condition is met based on the fitted values vs. residuals plot that we create below for this final model and the training dataset. For the most part, if we imagine a series of small-width vertical boxes going from left to right across the plot, most of the boxes contain a roughly even balance of negative and positive points (i.e., residuals). This suggests that there is a linear relationship between the chosen explanatory variables and the response variable in our training dataset.

import matplotlib.pyplot as plt

# Compute fitted values and residuals for the final model on the training dataset
y_pred = final_model.predict(X_train_dummies.drop(
    ['beds', 'room_type*accommodates', 'room_type*bedrooms'], axis=1))
residuals = y_train - y_pred

plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='--')  # residual = 0 reference line
plt.xlabel('Fitted (Predicted) values')
plt.ylabel('Residuals')
plt.title('Fitted values vs. Residuals Plot')
plt.show()

Interpreting Model Slopes

Unfortunately, what we discovered in Section 08-08 is that the final model that yielded the best test dataset predictions had two numerical explanatory variables (bedrooms and accommodates) that were collinear ($R = 0.85$). Because of this, we should proceed with caution when it comes to trusting the interpretations of the slopes in our model.

import seaborn as sns

# Visualize the collinearity between bedrooms and accommodates
sns.scatterplot(x='bedrooms', y='accommodates', data=X_train_dummies)
plt.title('Airbnb Training Dataset')
plt.show()

# Pairwise correlation of the two collinear explanatory variables
X_train_dummies[['bedrooms', 'accommodates']].corr()
              bedrooms  accommodates
bedrooms      1.000000      0.854718
accommodates  0.854718      1.000000
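
One way to go beyond this pairwise correlation and quantify how much the collinearity inflates the uncertainty of each slope estimate is the variance inflation factor (VIF). The VIF was not part of our original analysis, but a minimal sketch using statsmodels is below.

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# VIF for each of the two collinear numerical explanatory variables
X_vif = sm.add_constant(X_train_dummies[['bedrooms', 'accommodates']])
for i, col in enumerate(X_vif.columns):
    if col != 'const':
        print(col, variance_inflation_factor(X_vif.values, i))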

In addition, our analysis in Section 08-09 indicated that there may be an interaction between at least one pair of explanatory variables in our training dataset; however, no interaction terms were included in our final model. For instance, we saw that the relationship (i.e., the best fit line slope) between bedrooms and price differs across room types. Because we did not include an interaction term between room type and bedrooms in our final model, this differing relationship is not reflected in our final model's slopes.

# Separate best fit lines of price vs. bedrooms for each room type
sns.lmplot(x='bedrooms', y='price', hue='room_type', ci=False, data=df)
plt.show()
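
If we had wanted to capture this differing relationship, one option would have been to fit a candidate model that includes a room type × bedrooms interaction term. A minimal sketch using the statsmodels formula interface (assuming the price, bedrooms, and room_type columns of df):

import statsmodels.formula.api as smf

# 'bedrooms * room_type' expands to both main effects plus their interaction,
# letting the bedrooms slope differ by room type
interaction_model = smf.ols('price ~ bedrooms * room_type', data=df).fit()
interaction_model.params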

Some Additional Analysis Shortcomings and Future Work

Not an Exhaustive Exploration of All Explanatory Variables

In our analysis above we used just 5 potential explanatory variables to predict Airbnb price. We also tried out a few candidate models in which we removed a subset of these explanatory variables from the model. However, there actually exist $2^5=32$ potential candidate models that we could have fit and evaluated based on removing a given subset of explanatory variables. Thus, we did not try out every possible candidate model in this analysis, and it's possible that there exists an even better candidate model that would have achieved an even higher test dataset $R^2$. In module 9 we'll talk about some ways of efficiently searching for the best candidate model in machine learning tasks.
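
That said, with only 5 candidate explanatory variables, exhaustively fitting every (non-empty) subset is actually feasible. A minimal brute-force sketch is below, where var_columns is a hypothetical mapping from each explanatory variable to its (assumed) column name(s) in our dummy-encoded datasets:

from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical mapping: each explanatory variable -> its dummy-encoded column(s)
var_columns = {
    'accommodates': ['accommodates'],
    'bedrooms': ['bedrooms'],
    'beds': ['beds'],
    'neighborhood': ['neighborhood_Logan_Square', 'neighborhood_Near_North_Side',
                     'neighborhood_Near_West_Side', 'neighborhood_West_Town'],
    'room_type': ['roomtype_private_room'],
}

# Fit and score every non-empty subset of explanatory variables
best_subset, best_r2 = None, float('-inf')
for k in range(1, len(var_columns) + 1):
    for subset in combinations(var_columns, k):
        cols = [c for var in subset for c in var_columns[var]]
        model = LinearRegression().fit(X_train_dummies[cols], y_train)
        r2 = r2_score(y_test, model.predict(X_test_dummies[cols]))
        if r2 > best_r2:
            best_subset, best_r2 = subset, r2

print(best_subset, best_r2)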

Furthermore, there exist many other variables in our Chicago Airbnb listings dataset. We could have explored more than just these 5 explanatory variables in this analysis and perhaps yielded an even better test dataset $R^2$.

Is Linear Regression the Best?

We only tried out a linear regression model in this analysis. Could it be the case that another predictive model would have done better?
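
For instance, we could have fit a more flexible model, such as a random forest, on the same training dataset and compared its test dataset $R^2$ against our linear model's 0.329. A minimal (untuned) sketch with scikit-learn, reusing the same columns as the final linear model:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Fit a random forest on the same explanatory columns as the final linear model
cols_dropped = ['beds', 'room_type*accommodates', 'room_type*bedrooms']
rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(X_train_dummies.drop(cols_dropped, axis=1), y_train)

# Compare against the linear model's test dataset R^2 of 0.329
r2_score(y_test, rf.predict(X_test_dummies.drop(cols_dropped, axis=1)))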

Using a Single Training Dataset and Test Dataset

Recall how we fixed a random seed when we randomly selected 80% of the listings for our training dataset and the remaining 20% for our test dataset. We then based our decision about which model was best on this single randomly selected test dataset. We should be a bit wary of this.

Could it be the case that if we had randomly selected a different 20% of observations for the test dataset, our test dataset $R^2$ values for the candidate models would have been slightly different? Perhaps we might have even selected a different candidate model as our final model based on this difference.
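
One quick way to get a feel for this sensitivity is to repeat the 80/20 split under several different random seeds and watch how much the final model's test dataset $R^2$ moves around. A minimal sketch, assuming X_final holds the final model's explanatory columns and y holds price for the full (filtered) dataset:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Re-split, refit, and re-evaluate the final model specification several times
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X_final, y,
                                              test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    print(seed, r2_score(y_te, model.predict(X_te)))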

In module 9 we will discuss how to deal with the random nature of creating a training and test dataset when evaluating how well a model might perform with new datasets.

Subset of the Chicago Airbnb Population

Let's also not forget that the dataset we used to train and test this model is not actually the complete population of all Chicago Airbnb listings from March 19, 2023. We filtered for only the top 5 most popular neighborhoods, as well as just private room and whole house/apartment listings. Therefore, if you wanted to predict the price of an Airbnb listing that, say, was not in one of these top 5 neighborhoods, you could not use this model.

Furthermore, of this filtered dataset, we further filtered out the 9% of Airbnbs that had missing values. Unfortunately, if there exists any pattern to the "missingness" of observations in this dataset, then our model may not be as reliable for the types of observations that are more likely to have missing values.
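
One quick (though far from conclusive) way to look for such a pattern is to compare the listings that had missing values against those that didn't, before they were dropped. A minimal sketch, assuming a hypothetical df_unfiltered that still contains the rows with missing values:

# Flag listings that have a missing value in at least one column
has_missing = df_unfiltered.isna().any(axis=1)
print(has_missing.mean())  # roughly 0.09 in our dataset

# Compare, e.g., the price distributions of the two groups
print(df_unfiltered.groupby(has_missing)['price'].describe())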