Simulations for Difference Data


Suppose that I want to learn how much the cost of Airbnb in Chicago has increased over the past 9 months. To answer this question, I gather data from June 10, 2022 in addition to the March 19, 2023 data that we have been using for most of this course.

I could generate a random sample of Airbnbs from June 2022 (from df_june) and another random sample of Airbnbs from March 2023 (from df_popn). I could then attempt to use that data directly to estimate the difference in price. I would need to set my sample sizes; we'll use 50 listings from each time frame. I might get a simulated sampling distribution like the one below:

sampling_dist = []
for i in range(1000):
    # Draw two separate samples of 50 listings, one from each time frame
    df_june_sample = df_june.sample(50, replace = True)
    df_popn_sample = df_popn.sample(50, replace = True)
    # Statistic: difference in sample mean prices, March 2023 - June 2022
    mystat = df_popn_sample['price'].mean() - df_june_sample['price'].mean()
    sampling_dist.append(mystat)
# Store the 1000 simulated statistics and plot their distribution
sampling_dist = pd.DataFrame({'x': sampling_dist})
sampling_dist['x'].hist()
plt.xlabel('Possible Difference in Sample Means, March 2023 - June 2022')
plt.ylabel('Frequency')
plt.title('Histogram of Average Chicago Airbnb Price Increases from June 2022 to March 2023')
plt.show()

Histogram of the sampling distribution using a difference of observations from June 2022 and from March 2023.

sampling_dist.describe()
                     x
    count  1000.000000
    mean    -36.223480
    std      46.162097
    min    -234.600000
    25%     -65.330000
    50%     -34.980000
    75%      -7.945000
    max     223.000000

Consider what happens if the same Airbnb is included in both of my samples. Would my two samples be independent? The answer is no! The price for the Airbnb 9 months later depends on the earlier price, because both are recorded for the same unit. As in the regression example, I did not consider the underlying structure of the data in the simulation above. In this case, I generated two different samples and tried to use them to understand any shift in prices over the last 9 months. In general, my estimate should center around the true increase. However, I can create a much less variable estimate; that is, an estimate that is closer to the true increase for more of the samples.

To accomplish this, we will take a single random sample of Airbnb listings. For each listing, we will then calculate the price difference between the two dates. Now we have only a single variable of interest. In this case, I am better able to isolate the variable that I am truly interested in learning more about: the price increase between the two time frames. If I had two separate samples, I might be concerned that my two samples were not comparable to each other in the first place. For example, what if I happened to sample more expensive units in the earlier time frame and less expensive units in the later time frame? That might make it appear that the price is decreasing when in fact the location or the quality of the sampled units is changing between the two time frames.

With the difference variable, I can better isolate my variable of interest and more appropriately determine if (and how) the prices are changing over time, by ensuring that the prices in the two time frames are as comparable as possible. The motivation is similar to the reason that randomized controlled trials are the gold standard of experiments: we are trying to ensure that we have the most comparable groups possible, or the best possible way to measure the comparison.
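The reduction in variability can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the toy population, its size, the names price_early and price_late, and the distributions used to generate the prices are hypothetical and are not the Airbnb data. The sketch compares the spread of the difference in two independent sample means with the spread of the mean within-listing change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population (not the Airbnb data): each listing has an earlier
# price and a later price, where the later price is the earlier price plus a
# listing-specific change. The two prices are therefore strongly correlated.
n_listings = 5000
price_early = rng.gamma(shape=2.0, scale=100.0, size=n_listings)
price_late = price_early + rng.normal(loc=-35.0, scale=40.0, size=n_listings)
price_change = price_late - price_early

unpaired_means = []
paired_means = []
for i in range(1000):
    # Two separate samples: the listings in one sample are unrelated to the other
    idx_early = rng.integers(0, n_listings, size=50)
    idx_late = rng.integers(0, n_listings, size=50)
    unpaired_means.append(price_late[idx_late].mean() - price_early[idx_early].mean())
    # One sample of listings: average the within-listing price change
    idx = rng.integers(0, n_listings, size=50)
    paired_means.append(price_change[idx].mean())

unpaired_sd = np.std(unpaired_means)
paired_sd = np.std(paired_means)
print(f"spread using two separate samples: {unpaired_sd:.2f}")
print(f"spread using paired differences:   {paired_sd:.2f}")
```

In this toy setup both approaches center near the same value, but the paired version varies much less from sample to sample, because the listing-to-listing variation in price level cancels out when each listing is compared with itself.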

To actually accomplish this through analysis, we will need to pair the prices for the two different dates together by matching the two data frames. What limitations might arise in this process? For example, we would only be able to use those units that are listed on both dates. Do we generally have most units still available for analysis? Or have we lost many units due to the missing data? If we’ve lost a lot of units, is it that most of the units left the market? Did many units enter the market during that time frame? What other implications and limitations does that have for our final conclusions from this analysis?
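One way to start answering these questions is to count how many listings appear in only one of the two data sets. The sketch below uses two small hypothetical data frames (early and late, with made-up ids and prices) rather than df_june and df_popn. It relies on the how and indicator arguments of pd.merge.

```python
import pandas as pd

# Hypothetical toy data: listing ids and prices on two dates
early = pd.DataFrame({'id': [1, 2, 3, 4], 'price': [100, 80, 150, 60]})
late = pd.DataFrame({'id': [2, 3, 4, 5, 6], 'price': [85, 140, 60, 200, 95]})

# An outer merge keeps every listing; indicator=True adds a '_merge' column
# recording whether each id came from the left frame, the right frame, or both
both = pd.merge(late, early, on='id', how='outer', indicator=True,
                suffixes=('_late', '_early'))
counts = both['_merge'].value_counts()

n_both = counts['both']          # present on both dates, usable for pairing
n_entered = counts['left_only']  # only in the later data (possibly new listings)
n_left = counts['right_only']    # only in the earlier data (possibly removed)
print(n_both, n_entered, n_left)
```

A listing that is missing from one scrape has not necessarily entered or left the market, so these counts are only a first look at where the unmatched listings come from.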

Now that we have our analysis plan in place, let’s go ahead and simulate the sampling distribution and communicate our findings.

df = pd.merge(df_popn, df_june, on = 'id')
df.shape
(4456, 351)
df.head()
Unnamed: 0 id listing_url scrape_id last_scraped source name_x description neighborhood_overview picture_url ... room_type_y price_y minimum_nights_y number_of_reviews_y last_review_y reviews_per_month_y calculated_host_listings_count_y availability_365_y number_of_reviews_ltm_y license_y
0 0 2384.0 https://www.airbnb.com/rooms/2384 2.023030e+13 3/19/23 city scrape Hyde Park - Walk to UChicago You are invited to be the sole Airbnb guest in... The apartment is less than one block from beau... https://a0.muscache.com/pictures/acf6b3c0-47f2... ... Private room 92 3 198 2022-05-22 2.19 1 326 15 R17000015609
1 1 1837153.0 https://www.airbnb.com/rooms/1837153 2.023030e+13 3/19/23 city scrape Musician's Quarters Host Chester of Musicians Quarters is offering... NaN https://a0.muscache.com/pictures/97205fb8-421c... ... Entire home/apt 115 32 56 2021-11-27 0.55 1 365 7 City registration pending
2 2 2604454.0 https://www.airbnb.com/rooms/2604454 2.023030e+13 3/19/23 previous scrape Cozy Single-Family Home near University of Chi... Comfortable House in Hyde Park: This beautiful... The house is essentially on the University of ... https://a0.muscache.com/pictures/47138943/783a... ... Entire home/apt 125 1 103 2020-02-16 1.08 1 0 0 R17000013467
3 3 3517984.0 https://www.airbnb.com/rooms/3517984 2.023030e+13 3/19/23 city scrape Private room w/bath in urban canopy Private bedroom w/attached private bath in a g... Hyde Park is a strange little pocket of the wo... https://a0.muscache.com/pictures/48801062/aa86... ... Private room 79 1 344 2022-06-09 3.61 1 253 31 R17000015187
4 4 5297152.0 https://www.airbnb.com/rooms/5297152 2.023030e+13 3/19/23 city scrape Fresh and Sunny Bed & Bath by UofC Our private bed and bath offers high ceilings,... Parks galore in Hyde Park! Walk 15 minutes eas... https://a0.muscache.com/pictures/773dd6ff-dbb0... ... Private room 70 7 201 2022-05-30 2.29 3 130 22 R17000014580

5 rows × 351 columns

First five rows of the merged data frame of Chicago Airbnb listings that appear in both the March 2023 and June 2022 data.

Merging the two data frames on the listing id, we see that we do lose a sizable number of listings. Only 4,456 listings appear in both data sets, whereas the two original data sets contained 6,717 and 7,743 listings.

That means that we have only retained 66% and 58% of each original set of listings, respectively.
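The retention percentages follow directly from the counts reported above. A short check, using only those three numbers:

```python
n_common = 4456            # listings present in both data sets
n_original = [6717, 7743]  # listings in each original data set

# Share of each original data set that survives the merge
retained = [n_common / n for n in n_original]
for n, share in zip(n_original, retained):
    print(f"{share:.1%} of {n} listings retained")
```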

Next, we can define our new variable of interest: the change in price between the older price and the more recent price. If the price change is positive, that indicates that the price has increased over time. If negative, then the price for the listing has decreased in the past 9 months.

We can then observe the distribution of this price change.

df['price_change'] = df['price_x'] - df['price_y']
df['price_change'].describe()
    count    4456.000000
    mean      -34.498878
    std       143.871562
    min     -3360.000000
    25%       -45.000000
    50%        -8.000000
    75%         0.000000
    max      5007.000000
    Name: price_change, dtype: float64

Note that there are some pretty unusual and concerning values for the price change, especially the maximum and minimum price changes. These might warrant some additional exploration. For example, one of the listings was no longer available the last time that I checked, suggesting that the host may have tried to effectively remove the listing by raising the price substantially. Some of the listings with the largest price reductions appear to have been newly refurbished units whose prices were then lowered as the units became older.

Some of this is speculation. However, all of this warrants additional exploration.

Finally, using the newly created price_change variable, we can generate a new sampling distribution.

new_sampling_dist = []
for i in range(1000):
    df_sample = df.sample(50, replace = True)
    new_sampling_dist.append(df_sample['price_change'].mean())
new_sampling_dist = pd.DataFrame({'x': new_sampling_dist})
sampling_dist.describe() # not using the differences
                     x
    count  1000.000000
    mean    -36.223480
    std      46.162097
    min    -234.600000
    25%     -65.330000
    50%     -34.980000
    75%      -7.945000
    max     223.000000

Summary measures for the difference in price per night of Chicago Airbnbs between June 2022 and March 2023, not matched between the two time frames.

new_sampling_dist.describe() # using the difference variable
                     x
    count  1000.000000
    mean    -35.104860
    std      20.056687
    min    -156.140000
    25%     -43.785000
    50%     -34.100000
    75%     -25.680000
    max      83.140000

Summary measures for the difference in price per night of Chicago Airbnbs between June 2022 and March 2023, matched between the two time frames.

new_sampling_dist['x'].hist()  # plot the paired (difference-variable) sampling distribution
plt.title('Histogram of Average Change in Prices for Chicago Airbnbs, March 2023 - June 2022')
plt.xlabel('Possible Sample Means of Change of Prices for 50 Chicago Airbnbs')
plt.ylabel('Frequency')
plt.show()

Histogram for the paired difference in prices by each unit.

df_small = df[['id', 'listing_url', 'name_x', 'name_y', 'host_id_x', 'host_id_y', 'price_x', 'price_y', 'price_change', 'room_type_x', 'room_type_y', 'minimum_nights_x', 'minimum_nights_y']]
df_small[df_small['price_change'] < -1000]
id listing_url name_x name_y host_id_x host_id_y price_x price_y price_change room_type_x room_type_y minimum_nights_x minimum_nights_y
2265 53412390.0 https://www.airbnb.com/rooms/53412390 Cloud9 | Up to 14 ppl | The Hudson Group book Cloud9's 2 Penthouse 6 bedrooms for... 248760412 248760412 951 3502 -2551 Entire home/apt Entire home/apt 3 1
2951 44115433.0 https://www.airbnb.com/rooms/44115433 The Orleans - Entire Building - With Rooftop D... The Orleans - Entire Building - With Rooftop D... 170785489 170785489 1231 3042 -1811 Entire home/apt Entire home/apt 2 2
2958 44161649.0 https://www.airbnb.com/rooms/44161649 2 Newly Built Luxury Condos with Private Roofd... 2 Newly Built Luxury Condos with Private Roofd... 170785489 170785489 535 1885 -1350 Entire home/apt Entire home/apt 2 2
2959 44162119.0 https://www.airbnb.com/rooms/44162119 2 Newly Built Luxury Condos with Private Roofd... 2 Newly Built Luxury Condos with Private Roofd... 170785489 170785489 535 1885 -1350 Entire home/apt Entire home/apt 2 2
3135 53111489.0 https://www.airbnb.com/rooms/53111489 New Construction Building in River North! New Construction Building in River North! 170785489 170785489 1282 4642 -3360 Entire home/apt Entire home/apt 2 2

Airbnb listings that have an unusual change in their prices between June 2022 and March 2023.

Finally, I have generated a sampling distribution for a new variable: the change in price between June 2022 and March 2023. Compared to the original distribution that didn't use the difference variable, the new sampling distribution has a much smaller standard error for the statistic of interest (about 20.06 compared to 46.16 in the summaries above). This is because we were able to isolate the price difference for each listing, rather than comparing averages taken over two separate sets of listings.

From this sampling distribution, I can conclude that on average, listings in Chicago became a little less expensive per night. The average reduction is around $35 per night. I can also see that the mean price change for 50 listings will typically be between $100 less per night and around $25 more per night in any repeated random sample.
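A range of typical sample means can be read off a simulated sampling distribution with percentiles. The sketch below is only an illustration under stated assumptions: price_change is a synthetic, hypothetical stand-in for df['price_change'] (normally distributed, unlike the real data), and treating "typical" as the middle 95% of the simulated means is one possible choice, not something fixed by the analysis above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for df['price_change'] (synthetic, not the Airbnb data)
price_change = rng.normal(loc=-35.0, scale=140.0, size=4456)

# Simulated sampling distribution of the mean change for samples of 50 listings
sample_means = np.array([
    rng.choice(price_change, size=50, replace=True).mean()
    for _ in range(1000)
])

# Standard error: the standard deviation of the simulated sample means
standard_error = sample_means.std()
# Assumed definition of "typical": the middle 95% of the simulated means
lower, upper = np.percentile(sample_means, [2.5, 97.5])
print(f"standard error: {standard_error:.2f}")
print(f"middle 95% of sample means: {lower:.2f} to {upper:.2f}")
```

With the real data, the same two lines could be applied to new_sampling_dist['x'] to attach numbers to a statement about where the sample mean typically falls.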

Above, we considered a few limitations to our analysis. These limitations mostly centered on why listings might not be present in both time frames. However, as noted above, there may also be concerns about the accuracy of the price differences between the two time frames. Some of these may stem from decisions by the host to discourage or encourage bookings. Others may be due to the listings not being comparable between the two time frames; for example, in some instances, a unit may have been refurbished, updated, or otherwise changed. Such changes could alter the number of people a listing can accommodate, the number of bathrooms, or other amenities, which could explain the difference in pricing.

Overall, to fully explore the differences in units would require a more in-depth exploration of the data. For this analysis, though, we will simply conclude that it appears that Airbnbs in March 2023 tended to be less expensive than the same Airbnbs in June 2022.