Simulations for Difference Data
Suppose that I want to learn how much the cost of Airbnb in Chicago has increased over the past 9 months. To answer this question, I gather data from June 10, 2022 in addition to the March 19, 2023 data that we have been using for most of this course.
I could generate a random sample of Airbnbs from June 2022 (from df_june) and another random sample of Airbnbs from March 2023 (from df_popn). I could then attempt to use that data directly to estimate the difference in price. I would need to set my sample sizes; here I will use 50 listings from each time frame. I might get a simulated sampling distribution like the one below:
import pandas as pd
import matplotlib.pyplot as plt

# Simulate 1,000 differences in sample means from two separate samples
sampling_dist = []
for i in range(1000):
    # Draw 50 listings from each time frame, independently of each other
    df_june_sample = df_june.sample(50, replace = True)
    df_popn_sample = df_popn.sample(50, replace = True)
    # Statistic: March 2023 mean price minus June 2022 mean price
    mystat = df_popn_sample['price'].mean() - df_june_sample['price'].mean()
    sampling_dist.append(mystat)
sampling_dist = pd.DataFrame({'x': sampling_dist})

sampling_dist['x'].hist()
plt.xlabel('Possible Difference in Sample Means, March 2023 - June 2022')
plt.ylabel('Frequency')
plt.title('Histogram of Average Chicago Airbnb Price Increases from June 2022 to March 2023')
plt.show()

Histogram of the sampling distribution using a difference of observations from June 2022 and from March 2023.
sampling_dist.describe()
 | x |
---|---|
count | 1000.000000 |
mean | -36.223480 |
std | 46.162097 |
min | -234.600000 |
25% | -65.330000 |
50% | -34.980000 |
75% | -7.945000 |
max | 223.000000 |
Consider what happens if the same Airbnb is included in each of my samples. Would my two samples be independent? The answer is no! The price for the Airbnb 9 months later depends on the earlier price, because both are recorded for the same unit. Similar to the example with regression, I did not consider the underlying structure of the data in the simulation above. In this case, I generated two different samples and tried to use them to understand any shifts in prices over the last 9 months. In general, my estimate should center around the true increase. However, I can create a much less variable estimate; that is, an estimate that is closer to the true increase for more of the samples.
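One way to see why a paired approach can be less variable is to compare the variance of the two estimators. The notation here is an assumption introduced for illustration, not something defined elsewhere in these notes: $\sigma_M$ and $\sigma_J$ are the population standard deviations of the March and June prices, $\rho$ is the correlation between a listing's two prices, $n$ is the sample size, and sampling is with replacement.

```latex
% Two separate, independent samples of size n (one per date)
\operatorname{Var}\left(\bar{X}_{M} - \bar{X}_{J}\right)
  = \frac{\sigma_M^2}{n} + \frac{\sigma_J^2}{n}

% One sample of n listings, with D_i = X_{M,i} - X_{J,i} for each listing
\operatorname{Var}\left(\bar{D}\right)
  = \frac{\sigma_M^2 + \sigma_J^2 - 2\rho\,\sigma_M\sigma_J}{n}
```

Whenever a listing's two prices are positively correlated ($\rho > 0$), the second variance is smaller than the first, and the reduction grows as the correlation gets stronger.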
To accomplish this, we will take a single random sample of Airbnb listings and calculate, for each listing, the price difference between the two dates. We then have only a single variable of interest. In this case, I am better able to isolate the variable that I am truly interested in learning more about: the price increase between the two time frames. If I had two separate samples, I might be concerned that my two samples were not comparable to each other originally. For example, what if I happened to sample more expensive units in the earlier time frame and less expensive units in the later time frame? That might make it appear that the price is decreasing when actually the location or the quality of the sampled units is changing between the two time frames.
With the difference variable, I can better isolate my variable of interest and more appropriately determine if (and how) the prices are changing over time, by ensuring that the prices in the two time frames are as comparable as possible. The motivation is similar to the reason that randomized controlled trials are the gold standard of experiments: we are trying to ensure that we have the most comparable groups possible, or the best way to measure the comparison possible.
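The comparison between the two sampling approaches can be sketched on synthetic data. Everything in this block is hypothetical: the prices are randomly generated rather than taken from the Airbnb data, and the names `price_june`, `price_march`, and `price_change` are made up for the illustration. The only assumption that matters is that each listing's later price is its earlier price plus a relatively small change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of 5,000 listings (NOT the real Airbnb data):
# each March price is the June price plus a small, noisy change.
n_listings = 5000
price_june = rng.gamma(shape=2.0, scale=80.0, size=n_listings)
price_march = price_june + rng.normal(loc=-10.0, scale=20.0, size=n_listings)
price_change = price_march - price_june

unpaired_means, paired_means = [], []
for _ in range(1000):
    # Approach 1: two separate samples of 50, one from each time frame
    june_idx = rng.integers(0, n_listings, size=50)
    march_idx = rng.integers(0, n_listings, size=50)
    unpaired_means.append(price_march[march_idx].mean() - price_june[june_idx].mean())

    # Approach 2: one sample of 50 listings, using each listing's own price change
    pair_idx = rng.integers(0, n_listings, size=50)
    paired_means.append(price_change[pair_idx].mean())

# Standard deviation of each simulated sampling distribution (the standard error)
unpaired_se = np.std(unpaired_means, ddof=1)
paired_se = np.std(paired_means, ddof=1)
print(f"unpaired SE: {unpaired_se:.2f}, paired SE: {paired_se:.2f}")
```

In this toy setting both approaches center on the same average change, but the paired estimate varies far less from sample to sample, because the large listing-to-listing differences in price cancel out within each pair.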
To actually accomplish this through analysis, we will need to pair the prices for the two different dates together by matching the two data frames. What limitations might arise in this process? For example, we would only be able to use those units that are listed on both dates. Do we generally have most units still available for analysis? Or have we lost many units due to the missing data? If we’ve lost a lot of units, is it that most of the units left the market? Did many units enter the market during that time frame? What other implications and limitations does that have for our final conclusions from this analysis?
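One way to start answering these questions is to count how many listings appear on both dates, only on the later date, or only on the earlier date. The sketch below uses two tiny made-up data frames (`toy_march` and `toy_june` are hypothetical stand-ins, not the course data) and the `indicator` option of `pd.merge` with an outer join.

```python
import pandas as pd

# Hypothetical toy data standing in for the March 2023 and June 2022 listings
toy_march = pd.DataFrame({'id': [1, 2, 3, 5], 'price': [100, 80, 150, 60]})
toy_june = pd.DataFrame({'id': [1, 2, 4], 'price': [110, 85, 200]})

# An outer merge keeps every listing; the _merge column records where each id was found
merged = pd.merge(toy_march, toy_june, on='id', how='outer', indicator=True)
counts = merged['_merge'].value_counts()

print(counts['both'])        # listed on both dates
print(counts['left_only'])   # listed only in March (entered the market)
print(counts['right_only'])  # listed only in June (left the market)
```

With the real data frames in place of the toy ones, the same counts would show how many units entered or left the market between the two dates.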
Now that we have our analysis plan in place, let’s go ahead and simulate the sampling distribution and communicate our findings.
df = pd.merge(df_popn, df_june, on = 'id')
df.shape
(4456, 351)
df.head()
Unnamed: 0 | id | listing_url | scrape_id | last_scraped | source | name_x | description | neighborhood_overview | picture_url | ... | room_type_y | price_y | minimum_nights_y | number_of_reviews_y | last_review_y | reviews_per_month_y | calculated_host_listings_count_y | availability_365_y | number_of_reviews_ltm_y | license_y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2384.0 | https://www.airbnb.com/rooms/2384 | 2.023030e+13 | 3/19/23 | city scrape | Hyde Park - Walk to UChicago | You are invited to be the sole Airbnb guest in... | The apartment is less than one block from beau... | https://a0.muscache.com/pictures/acf6b3c0-47f2... | ... | Private room | 92 | 3 | 198 | 2022-05-22 | 2.19 | 1 | 326 | 15 | R17000015609 |
1 | 1 | 1837153.0 | https://www.airbnb.com/rooms/1837153 | 2.023030e+13 | 3/19/23 | city scrape | Musician's Quarters | Host Chester of Musicians Quarters is offering... | NaN | https://a0.muscache.com/pictures/97205fb8-421c... | ... | Entire home/apt | 115 | 32 | 56 | 2021-11-27 | 0.55 | 1 | 365 | 7 | City registration pending |
2 | 2 | 2604454.0 | https://www.airbnb.com/rooms/2604454 | 2.023030e+13 | 3/19/23 | previous scrape | Cozy Single-Family Home near University of Chi... | Comfortable House in Hyde Park: This beautiful... | The house is essentially on the University of ... | https://a0.muscache.com/pictures/47138943/783a... | ... | Entire home/apt | 125 | 1 | 103 | 2020-02-16 | 1.08 | 1 | 0 | 0 | R17000013467 |
3 | 3 | 3517984.0 | https://www.airbnb.com/rooms/3517984 | 2.023030e+13 | 3/19/23 | city scrape | Private room w/bath in urban canopy | Private bedroom w/attached private bath in a g... | Hyde Park is a strange little pocket of the wo... | https://a0.muscache.com/pictures/48801062/aa86... | ... | Private room | 79 | 1 | 344 | 2022-06-09 | 3.61 | 1 | 253 | 31 | R17000015187 |
4 | 4 | 5297152.0 | https://www.airbnb.com/rooms/5297152 | 2.023030e+13 | 3/19/23 | city scrape | Fresh and Sunny Bed & Bath by UofC | Our private bed and bath offers high ceilings,... | Parks galore in Hyde Park! Walk 15 minutes eas... | https://a0.muscache.com/pictures/773dd6ff-dbb0... | ... | Private room | 70 | 7 | 201 | 2022-05-30 | 2.29 | 3 | 130 | 22 | R17000014580 |
5 rows × 351 columns
First five rows of the merged data frame of Chicago Airbnb listings that appear on both dates.
Merging the two data frames on the listing id, we see that we lose a sizable number of listings: only 4,456 listings appear in both data sets. Note that the two original data sets contained 6,717 and 7,743 listings.
That means that we have retained only about 66% and 58% of each original set of listings, respectively.
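The retention percentages follow directly from the listing counts quoted above. A quick check, with variable names chosen only for this illustration:

```python
# Counts quoted in the text
n_matched = 4456   # listings present in both data sets
n_first = 6717     # listings in the first original data set
n_second = 7743    # listings in the second original data set

share_first = n_matched / n_first
share_second = n_matched / n_second

print(f"{share_first:.1%} and {share_second:.1%} of the original listings retained")
```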
Next, we can define our new variable of interest: the change in price between the older price and the more recent price. If the price change is positive, that indicates that the price has increased over time. If negative, then the price for the listing has decreased in the past 9 months.
We can then observe the distribution of this price change.
df['price_change'] = df['price_x'] - df['price_y']
df['price_change'].describe()
count    4456.000000
mean      -34.498878
std       143.871562
min     -3360.000000
25%       -45.000000
50%        -8.000000
75%         0.000000
max      5007.000000
Name: price_change, dtype: float64
Note that there are some unusual and concerning values for the price change, especially the maximum and minimum. These might warrant some additional exploration. For example, one of the listings was no longer available the last time that I checked, indicating that the host may have tried to effectively take the listing off the market by increasing the price substantially. Some of the largest price reductions appeared to be for listings that may have originally been newly refurbished units whose prices then decreased as the units became older.
Some of this is speculation. However, all of this warrants additional exploration.
Finally, using the newly created price_change variable, we can generate a new sampling distribution.
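# Simulate 1,000 sample means of the paired price change
new_sampling_dist = []
for i in range(1000):
    # One sample of 50 listings; each row already holds that listing's price change
    df_sample = df.sample(50, replace = True)
    new_sampling_dist.append(df_sample['price_change'].mean())
new_sampling_dist = pd.DataFrame({'x': new_sampling_dist})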
sampling_dist.describe() # not using the differences
 | x |
---|---|
count | 1000.000000 |
mean | -36.223480 |
std | 46.162097 |
min | -234.600000 |
25% | -65.330000 |
50% | -34.980000 |
75% | -7.945000 |
max | 223.000000 |
Summary measures for the difference in price per night of Chicago Airbnbs between June 2022 and March 2023, not matched between the two time frames.
new_sampling_dist.describe() # using the difference variable
 | x |
---|---|
count | 1000.000000 |
mean | -35.104860 |
std | 20.056687 |
min | -156.140000 |
25% | -43.785000 |
50% | -34.100000 |
75% | -25.680000 |
max | 83.140000 |
Summary measures for the difference in price per night of Chicago Airbnbs between June 2022 and March 2023, matched between the two time frames.
new_sampling_dist['x'].hist()
plt.title('Histogram of Average Change in Prices for Chicago Airbnbs, March 2023 - June 2022')
plt.xlabel('Possible Sample Means of Change of Prices for 50 Chicago Airbnbs')
plt.ylabel('Frequency')
plt.show()

Histogram of the sampling distribution for the paired difference in prices for each unit.
df_small = df[['id', 'listing_url', 'name_x', 'name_y', 'host_id_x', 'host_id_y', 'price_x', 'price_y', 'price_change', 'room_type_x', 'room_type_y', 'minimum_nights_x', 'minimum_nights_y']]
df_small[df_small['price_change'] < -1000]
id | listing_url | name_x | name_y | host_id_x | host_id_y | price_x | price_y | price_change | room_type_x | room_type_y | minimum_nights_x | minimum_nights_y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2265 | 53412390.0 | https://www.airbnb.com/rooms/53412390 | Cloud9 \| Up to 14 ppl \| The Hudson | Group book Cloud9's 2 Penthouse 6 bedrooms for... | 248760412 | 248760412 | 951 | 3502 | -2551 | Entire home/apt | Entire home/apt | 3 | 1 |
2951 | 44115433.0 | https://www.airbnb.com/rooms/44115433 | The Orleans - Entire Building - With Rooftop D... | The Orleans - Entire Building - With Rooftop D... | 170785489 | 170785489 | 1231 | 3042 | -1811 | Entire home/apt | Entire home/apt | 2 | 2 |
2958 | 44161649.0 | https://www.airbnb.com/rooms/44161649 | 2 Newly Built Luxury Condos with Private Roofd... | 2 Newly Built Luxury Condos with Private Roofd... | 170785489 | 170785489 | 535 | 1885 | -1350 | Entire home/apt | Entire home/apt | 2 | 2 |
2959 | 44162119.0 | https://www.airbnb.com/rooms/44162119 | 2 Newly Built Luxury Condos with Private Roofd... | 2 Newly Built Luxury Condos with Private Roofd... | 170785489 | 170785489 | 535 | 1885 | -1350 | Entire home/apt | Entire home/apt | 2 | 2 |
3135 | 53111489.0 | https://www.airbnb.com/rooms/53111489 | New Construction Building in River North! | New Construction Building in River North! | 170785489 | 170785489 | 1282 | 4642 | -3360 | Entire home/apt | Entire home/apt | 2 | 2 |
Airbnb listings that have an unusual change in their prices between June 2022 and March 2023.
Finally, I have generated a sampling distribution for a new variable: the change in price between June 2022 and March 2023. Compared to the original distribution that didn't use the difference variable, this sampling distribution has a much smaller standard error for the statistic of interest (about 20.06 compared to 46.16 in the summaries above). This is because we were able to isolate the price difference for each listing, rather than comparing averages taken over two separate sets of listings.
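As a rough sanity check on the paired result, the standard error of a sample mean can be approximated by the standard deviation of the variable divided by the square root of the sample size. The sketch below plugs in the standard deviation of `price_change` reported by `describe()` above; treating that value as the population standard deviation is an approximation.

```python
import math

sd_price_change = 143.871562  # std from df['price_change'].describe()
n = 50                        # listings per simulated sample

# Approximate standard error of the mean price change for samples of 50
approx_se = sd_price_change / math.sqrt(n)
print(f"approximate standard error: {approx_se:.2f}")
```

The approximation lands close to the standard deviation of about 20.06 reported for the simulated paired sampling distribution, which suggests the simulation is behaving as expected.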
From this sampling distribution, I can conclude that on average, listings in Chicago became a little less expensive per night. The average reduction is around $35 per night. I can also see that the mean price change for 50 listings will typically be between $100 less per night and around $25 more per night in any repeated random sample.
Above, we considered a few limitations to our analysis. These limitations mostly centered on why listings might not be present in both time frames. However, as noted above, there may also be concerns surrounding the accuracy of the values for the price differences between the two time frames. Some of these may stem from decisions by the host to discourage or encourage bookings. Others may be due to the listings not being comparable between the two time frames; for example, in some instances, a unit may have been refurbished, updated, or otherwise changed. These changes could alter the number of people a listing can accommodate, the number of bathrooms, or other amenities, which could explain the difference in pricing.
Overall, fully exploring the differences in units would require a more in-depth exploration of the data. For this analysis, though, we will simply conclude that it appears that Airbnbs in March 2023 tended to be less expensive than the same Airbnbs in June 2022.