Introduction


Research Goal (Continued): Predicting Airbnb Prices for New Datasets

Let's return to the Chicago Airbnb listings dataset that we cleaned for the purposes of building a linear regression model in Module 8. We'll also return to our same research goal from Module 8. That is, we'd like to build a predictive model that predicts the price of new Chicago Airbnb listings using some subset (or all) of the following possible explanatory variables.

  1. Neighborhood
  2. Room type
  3. How many people the listing accommodates
  4. How many bedrooms the listing has
  5. How many beds the listing has

Note: In order to keep things simpler, we will not explore interaction terms in Module 9. However, interaction terms can be incorporated into all of these techniques.

In Module 8 we tried out many, but not all, of the $2^5=32$ possible linear regression models that could be built from some combination of these 5 explanatory variables (not including interaction term models). Among the handful of candidate models we tried, we found that the linear regression model with the best test dataset $R^2$ was the one below, with $R^2=0.329$. We can infer that this may be the best trained model for predicting the price of new Chicago Airbnb listings whose actual prices we don't know (again, not including interaction term models).

Best Model Found So Far (without Interaction Terms)

\[ \begin{aligned} \hat{price} &= -85.59 \\ &\quad + 13.36 \cdot \text{accommodates} \\ &\quad + 94 \cdot \text{bedrooms} \\ &\quad - 11.30 \cdot \text{neighborhood}_{\text{Logan\_Square}} \\ &\quad + 105.15 \cdot \text{neighborhood}_{\text{Near\_North\_Side}} \\ &\quad + 48.06 \cdot \text{neighborhood}_{\text{Near\_West\_Side}} \\ &\quad + 36.07 \cdot \text{neighborhood}_{\text{West\_Town}} \\ &\quad + 17.16 \cdot \text{roomtype}_{\text{private\_room}} \end{aligned} \]
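To make the fitted equation concrete, here is a small sketch that applies these coefficients to a hypothetical listing. The function name, the input encoding, and the example listing are our own illustrative choices, not part of the module; unlisted neighborhoods and the "entire home" room type serve as the baseline categories (indicator variables equal to 0).

```python
def predict_price(accommodates, bedrooms, neighborhood, room_type):
    """Apply the fitted regression equation (coefficients copied from the text)."""
    neighborhood_coefs = {
        "Logan Square": -11.30,
        "Near North Side": 105.15,
        "Near West Side": 48.06,
        "West Town": 36.07,
    }
    return (-85.59
            + 13.36 * accommodates
            + 94 * bedrooms
            + neighborhood_coefs.get(neighborhood, 0)   # baseline neighborhood -> 0
            + (17.16 if room_type == "private room" else 0))

# Hypothetical listing: accommodates 4, 2 bedrooms, Near North Side, entire home
print(round(predict_price(4, 2, "Near North Side", "entire home"), 2))  # 261.0
```

Note how each indicator coefficient simply shifts the predicted price up or down relative to the baseline categories.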

Feature Selection Techniques

But given that we didn't try out all possible linear regression models, it's possible that there exists a linear regression model that would yield even better performance on new datasets. Thus, in this module we will discuss some popular feature selection techniques (i.e., explanatory variable selection techniques), which can be used to efficiently search for the regression model that optimizes a metric you're interested in.
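For scale, the candidate space here is small enough to enumerate directly. The sketch below (assuming nothing beyond the five variable names listed above) lists all 32 feature subsets that an exhaustive, best-subset search would need to fit and score:

```python
from itertools import combinations

# The five candidate explanatory variables from the list above
features = ["neighborhood", "room_type", "accommodates", "bedrooms", "beds"]

# All subsets of size 0 through 5: 2^5 = 32 candidate models
subsets = [combo for r in range(len(features) + 1)
           for combo in combinations(features, r)]
print(len(subsets))  # 32
```

With only 5 variables this brute force is feasible, but the count doubles with each added variable, which is why the more efficient search strategies covered in this module matter.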

Cross-Validation Techniques

Furthermore, in Module 8 we randomly split our dataset into a training dataset that was used to train each candidate model and a test dataset that was used to evaluate each candidate model. We used our test dataset performance to infer how well each model might predict new Chicago Airbnb listings. However, as we'll also discuss in this module, there are both pros and cons to using a single training and test dataset for this purpose. Thus, in this module we'll discuss what we call cross-validation techniques, which expand upon this idea of using training and test datasets to gauge model performance on new datasets.

Module Outline

  • Overfitting vs. Underfitting: In section 2 we'll discuss what it means to overfit vs. underfit a model as well as some ways in which this can happen.
  • Parsimonious Models: In section 3 we'll discuss the concept of finding a parsimonious model, which can help us avoid overfitting and underfitting. And we'll talk about how to measure the parsimoniousness of a linear regression model.
  • Feature Selection Techniques: In sections 4-8 we'll describe and apply some of the most common feature selection techniques.
  • Cross-Validation Techniques: In section 9 we'll describe and apply some of the most common cross-validation techniques.
  • Principal Component Regression: Finally, in section 10 we'll introduce what's known as principal component analysis and show how it can be used as another type of "feature selection" technique.