A Machine Learning Technique for Finding Good Predictions for New Datasets


In Section 08-02, we got a sense of which explanatory variables might be useful for predicting the price of a new Chicago Airbnb listing (particularly listings in Lake View, Logan Square, West Town, Near West Side, or Near North Side that are either an entire home/apartment or a private room).

We specifically explored the following 5 potential explanatory variables:

  • Neighborhood
  • Room type
  • Accommodates
  • Beds
  • Bedrooms

But before we fit our multiple linear regression model, let's think carefully about our main research goal. Ideally, we would like the multiple linear regression model that we fit to yield accurate price predictions for new properties, that is, properties for which a price has yet to be chosen.

In Data Science Discovery we talked about two ways to evaluate the performance of a linear regression model when we knew what the "right answer" (i.e., the actual response variable value) was for a given observation.

  • Residual: When evaluating the model's performance on just a single observation, we calculated the $residual_i=y_i-\hat{y}_i$.
    So in our case, this is the ACTUAL price of a listing minus its PREDICTED price.

  • Root Mean Square Error (RMSE): When evaluating the model's performance on all observations in a given dataset, one way to do this is with the $RMSE = \sqrt{\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{n}}$. (A short sketch of computing both quantities by hand appears below.)
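
For concreteness, here is a minimal sketch (not part of the original course code) of how both quantities could be computed by hand with numpy. The actual prices are taken from the listings shown later in this section; the predicted prices are made-up values purely for illustration.

import numpy as np

y = np.array([85, 602, 335, 395, 166])        # actual listing prices
y_hat = np.array([120, 550, 360, 300, 150])   # hypothetical predicted prices (illustration only)

residuals = y - y_hat                          # residual_i = y_i - y_hat_i for each listing
rmse = np.sqrt(np.mean((y - y_hat) ** 2))      # RMSE = square root of the mean squared residual

print(residuals)   # [-35  52 -25  95  16]
print(rmse)        # approximately 52.6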

But if we don't know the ACTUAL price $y_i$ of a new property, then how are we supposed to evaluate how well our model will do when it comes to predicting the price of these new properties?

Train-Test-Split Method

One Approach:

One way that data scientists try to answer this question is by taking the original dataset in hand and randomly splitting it into two datasets: a training dataset and a test dataset.

  1. Training Dataset

The purpose of the training dataset is to train the machine learning model. So in our case, we will use only the training dataset observations to come up with the best intercept and slopes $\hat{\beta}_0,\hat{\beta}_1,...,\hat{\beta}_8$ for our linear regression model (with the two categorical variables represented by indicator variables, our five explanatory variables yield eight slope terms).

$\hat{y}=\hat{\beta}_0+\hat{\beta}_1 x_1+\hat{\beta}_2 x_2+\cdots+\hat{\beta}_8 x_8$

The training dataset is usually the larger of the two, because ideally we don't want to sacrifice too much useful data that could go toward providing a better fit of the model. It usually consists of about 80% of the observations, randomly selected from the original dataset.

  2. Test Dataset

The purpose of the test dataset is to test the machine learning model that has been fit with the training dataset. So for instance, we may calculate the RMSE of the test dataset.

We can use the train_test_split() function to randomly split our df dataframe into a training dataset and a test dataset. We set test_size=0.2 so that the test dataset contains about 20% of the observations. We also fix a random_state so that the split is reproducible.

from sklearn.model_selection import train_test_split
# Randomly split df into an 80% training dataset and a 20% test dataset;
# fixing random_state makes the split reproducible
df_train, df_test = train_test_split(df, test_size=0.2, random_state=101)
print(df_train.shape[0]/df.shape[0])  # fraction of observations in the training dataset
df_train.head()
0.8
      price     neighborhood        room_type  accommodates  bedrooms  beds
1193     85     Logan Square  Entire home/apt             5       2.0   2.0
1392    602     Logan Square  Entire home/apt             8       4.0   4.0
338     335        Lake View  Entire home/apt            12       4.0   5.0
1844    395        West Town  Entire home/apt             5       2.0   2.0
359     166        Lake View  Entire home/apt             4       2.0   2.0
print(df_test.shape[0]/df.shape[0])  # fraction of observations in the test dataset
df_test.head()
0.2
      price     neighborhood        room_type  accommodates  bedrooms  beds
2592    179   Near West Side  Entire home/apt             2       1.0   1.0
139      63        Lake View     Private room             1       1.0   1.0
471      61        Lake View  Entire home/apt             2       1.0   1.0
2015    122        West Town  Entire home/apt             5       2.0   2.0
3324    425  Near North Side  Entire home/apt             6       2.0   3.0
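
As a quick supplementary check of the reproducibility point above (this check is not part of the original notebook): splitting again with the same random_state reproduces exactly the same rows.

# Repeating the split with the same random_state returns identical training/test rows
df_train2, df_test2 = train_test_split(df, test_size=0.2, random_state=101)
print(df_train.index.equals(df_train2.index))   # True
print(df_test.index.equals(df_test2.index))     # True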

Idea Behind This Approach

Training RMSE < Test RMSE

We can actually calculate both the RMSE of the training dataset and the RMSE of the test dataset for a given model that was fit with only the training dataset. We'll do this comparison in Module 09.

What we'll actually find is that the model's predictions for the training dataset end up being better than its predictions for the test dataset. In other words, the model predicts the observations that were used to train it better than the observations that were not.

Expecting Training Dataset Fit to Be Better

This scenario often ends up being the case, and it is something we would expect mathematically. That is, we expect a model that was fit with the training dataset to fit the training dataset better than it fits the test dataset.

Linear Regression Explicitly Minimizes Training Data SSE and RMSE

In Section 08-05 we'll discuss how the optimal intercept and slopes $\hat{\beta}_0,\hat{\beta}_1,...,\hat{\beta}_8$ of a linear regression model $\hat{y}=\hat{\beta}_0+\hat{\beta}_1 x_1+\hat{\beta}_2 x_2+\cdots+\hat{\beta}_8 x_8$ are those that minimize what we call the sum of squared errors, $SSE=\sum_{i=1}^n(y_i-\hat{y}_i)^2$, computed on the dataset that was used to fit the model.

Thus, if we used a training dataset to fit our linear regression model, this means that we deliberately selected the values of $\hat{\beta}_0,\hat{\beta}_1,...,\hat{\beta}_8$ that keep the $SSE=\sum_{i=1}^n(y_i-\hat{y}_i)^2$ of the training dataset as low as possible, which in turn keeps the $RMSE = \sqrt{\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{n}}=\sqrt{\frac{SSE}{n}}$ of the training dataset as low as possible.

Linear Regression Does Not Explicitly Minimize Test Data SSE and RMSE

On the other hand, our linear regression model that was fit with the training dataset was NOT explicitly trying to keep the SSE (and therefore the RMSE) of the test dataset as low as possible. Thus we would expect the SSE and RMSE of the test dataset to be higher than those of the training dataset.

Linear Regression Does Not Explicitly Minimize New Dataset SSE and RMSE

Similarly, because a dataset of new listings was not used to train the model, we would also expect the SSE and RMSE of such a new dataset to be higher than those of the training dataset.

So we can expect the fit of our model on new listings to more closely resemble the RMSE of the test dataset than that of the training dataset.

Using Training and Test Datasets to Select Useful Models

Thus, in general, if we know that our research goal is to build a machine learning model that will yield good predictions for new datasets, then we can make decisions about how to build our model based on which candidate model yields the best performance on the test dataset, as in the sketch below.
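
To make this concrete, here is a minimal sketch of what such a comparison could look like, using only the numeric explanatory variables so that no encoding of the categorical variables is needed. The candidate feature sets are just illustrative choices, and the sketch assumes the df_train and df_test created above have no missing values in these columns.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical candidate models, each defined by a set of explanatory variables
candidate_feature_sets = [
    ['accommodates'],
    ['accommodates', 'bedrooms'],
    ['accommodates', 'bedrooms', 'beds'],
]

for features in candidate_feature_sets:
    model = LinearRegression()
    model.fit(df_train[features], df_train['price'])    # fit using ONLY the training dataset
    test_preds = model.predict(df_test[features])       # predict prices in the test dataset
    test_rmse = np.sqrt(np.mean((df_test['price'] - test_preds) ** 2))
    print(features, round(test_rmse, 2))                # lower test RMSE = better candidate

Whichever candidate yields the lowest test RMSE is the one we would expect to predict the prices of new listings best.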

Features Matrices and Target Arrays

Recall that when we use the LinearRegression() function, we need to create a dataframe made up of just the explanatory variables (which is called the features matrix X) and a dataframe/series made up of the response variable (which is called the target array y).

Hence we will create a training features matrix...

X_train = df_train.drop(['price'], axis=1)
X_train.head()
      neighborhood        room_type  accommodates  bedrooms  beds
1193  Logan Square  Entire home/apt             5       2.0   2.0
1392  Logan Square  Entire home/apt             8       4.0   4.0
338      Lake View  Entire home/apt            12       4.0   5.0
1844     West Town  Entire home/apt             5       2.0   2.0
359      Lake View  Entire home/apt             4       2.0   2.0

... and a training target array...

y_train = df_train['price']
y_train.head()
1193     85
1392    602
338     335
1844    395
359     166
Name: price, dtype: int64

... and a test features matrix...

X_test = df_test.drop(['price'], axis=1)
X_test.head()
         neighborhood        room_type  accommodates  bedrooms  beds
2592   Near West Side  Entire home/apt             2       1.0   1.0
139         Lake View     Private room             1       1.0   1.0
471         Lake View  Entire home/apt             2       1.0   1.0
2015        West Town  Entire home/apt             5       2.0   2.0
3324  Near North Side  Entire home/apt             6       2.0   3.0

... and a test target array.

y_test = df_test['price']
y_test.head()
    2592    179
    139      63
    471      61
    2015    122
    3324    425
    Name: price, dtype: int64
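
With the training and test features matrices and target arrays in hand, here is a minimal sketch of the comparison previewed earlier in this section: fit the model with only the training data, then compute the RMSE of both datasets. The one-hot encoding via pd.get_dummies is an assumption of this sketch (the encoding approach used later in the course may differ), and the sketch assumes the features matrices have no missing values.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Represent the two categorical variables with indicator (dummy) columns;
# drop_first=True keeps one baseline level per categorical variable (an assumption of this sketch)
X_train_enc = pd.get_dummies(X_train, columns=['neighborhood', 'room_type'], drop_first=True)
X_test_enc = pd.get_dummies(X_test, columns=['neighborhood', 'room_type'], drop_first=True)

# Make sure the test columns line up with the training columns
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)

# Fit the linear regression model using ONLY the training dataset
model = LinearRegression()
model.fit(X_train_enc, y_train)

# Evaluate with RMSE on both the training and the test dataset
def rmse(y_actual, y_predicted):
    return np.sqrt(np.mean((y_actual - y_predicted) ** 2))

print('training RMSE:', rmse(y_train, model.predict(X_train_enc)))
print('test RMSE:', rmse(y_test, model.predict(X_test_enc)))

Based on the reasoning in this section, we would typically expect the training RMSE to come out lower than the test RMSE, and it is the test RMSE that better reflects how the model is likely to perform on new listings.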