Predicting Airbnb Prices for New Datasets


Let's return to our cleaned Chicago Airbnb dataset from Module 7. Suppose that you are a data scientist at Airbnb, and company research has found that many hosts with new properties are unsure what daily price to set when they first list. You'd like to design a new service on the Airbnb website that recommends a good starting price to these hosts, based on listings in the same city and other property-related information, including:

  1. Neighborhood
  2. Room type
  3. How many people the listing accommodates
  4. How many bedrooms the listing has
  5. How many beds the listing has

print(df.shape)
df = df[['price', 'neighborhood', 'room_type', 'accommodates', 'bedrooms', 'beds']]
df.head()

(5421, 333)

   price neighborhood        room_type  accommodates  bedrooms  beds
0     90    Hyde Park     Private room             1       1.0   1.0
1    125    Hyde Park  Entire home/apt             6       3.0   3.0
2     77    Hyde Park     Private room             2       1.0   1.0
3     76    Hyde Park     Private room             1       1.0   1.0
4     46    Hyde Park     Private room             2       1.0   1.0

df.shape

(5421, 6)

In other words, we'd like to predict price (a numerical response variable) from 5 explanatory variables. In Data Science Discovery, we learned how to fit a multiple linear regression model, which can help us do this. Remember, this is a multiple linear regression model rather than a simple linear regression model because we are now working with more than one explanatory variable.

$\hat{y}=\hat{\beta}_0+\hat{\beta}_1 x_1+\hat{\beta}_2 x_2+\cdots+\hat{\beta}_8 x_8$
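
To make the shape of this equation concrete, here is a minimal sketch of estimating an intercept and slopes by least squares and then using them to predict. It uses NumPy rather than whichever fitting tool the later sections use, and only two numerical explanatory variables. The toy values, the made-up coefficients (10, 20, 5), and the helper `predict` are assumptions for illustration only; none of them come from the Airbnb data.

```python
import numpy as np

# Hypothetical toy data (NOT from the Airbnb dataset): two numerical explanatory variables.
accommodates = np.array([1, 2, 4, 6, 3, 2], dtype=float)
bedrooms = np.array([1, 1, 2, 3, 2, 1], dtype=float)

# Prices generated exactly from made-up coefficients so the fit is easy to check.
price = 10 + 20 * accommodates + 5 * bedrooms

# Design matrix: a column of ones for the intercept, then one column per slope.
X = np.column_stack([np.ones_like(accommodates), accommodates, bedrooms])

# Least squares estimates: beta_hat[0] is the intercept, beta_hat[1:] are the slopes.
beta_hat, *_ = np.linalg.lstsq(X, price, rcond=None)

def predict(x_new, beta):
    """Evaluate y_hat = beta_0 + beta_1 * x_1 + ... + beta_p * x_p."""
    return beta[0] + np.dot(beta[1:], x_new)

# Predicted price for a hypothetical new listing: accommodates 4, has 2 bedrooms.
new_listing_prediction = predict(np.array([4.0, 2.0]), beta_hat)
print(np.round(beta_hat, 2), round(float(new_listing_prediction), 2))
```

Because the toy prices contain no noise, the fitted coefficients match the made-up ones up to floating-point error. With real listings the estimates would only approximate the underlying relationship.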

Note: The multiple linear regression model that we're going to fit in this module will take the form above, with 8 slopes. Notice that this is more than the 5 explanatory variables that we'd like to consider. In Section 08-06 we'll talk about why this is.
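
As a preview, one common way a model ends up with more slopes than explanatory variables is that each categorical variable is encoded as several 0/1 indicator columns. The sketch below shows this with `pandas.get_dummies` on a small, hypothetical table. The rows and the names `toy`, `design`, `n_variables`, and `n_slopes` are assumptions for illustration, and this is not necessarily how the 8 slopes in this module arise.

```python
import pandas as pd

# Hypothetical toy listings (NOT the real data): two categorical variables, one numerical.
toy = pd.DataFrame({
    'neighborhood': ['Hyde Park', 'Logan Square', 'Hyde Park', 'Lake View'],
    'room_type': ['Private room', 'Entire home/apt', 'Private room', 'Entire home/apt'],
    'accommodates': [1, 6, 2, 4],
})

# Encode each categorical variable as 0/1 indicator columns,
# dropping one level per variable to serve as the baseline.
design = pd.get_dummies(toy, columns=['neighborhood', 'room_type'], drop_first=True)

n_variables = toy.shape[1]   # explanatory variables before encoding
n_slopes = design.shape[1]   # columns after encoding, one slope each

print(n_variables, n_slopes)
print(list(design.columns))
```

Here 3 explanatory variables become 4 columns: a categorical variable with k levels contributes k - 1 indicator columns, while a numerical variable contributes one.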