Simulations for Difference Data


Suppose that I want to learn how much the cost of Airbnb listings in Chicago has increased over the past 9 months. To answer this question, I gather data from June 10, 2022 in addition to the March 19, 2023 data that we have been using for most of this course.

I could generate a random sample of Airbnbs from June 2022 (from df_june) and another random sample of Airbnbs from March 2023 (from df_popn). I could then attempt to use that data directly to estimate the difference in price. I would need to set my sample sizes; we'll use 50 listings from each time frame. I might get a simulated sampling distribution like the one below:

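import pandas as pd
import matplotlib.pyplot as plt

# Simulate 1,000 differences in sample means from two separate samples of 50 listings
sampling_dist = []
for i in range(1000):
    df_june_sample = df_june.sample(50, replace = True)
    df_popn_sample = df_popn.sample(50, replace = True)
    # March 2023 sample mean minus June 2022 sample mean
    mystat = df_popn_sample['price'].mean() - df_june_sample['price'].mean()
    sampling_dist.append(mystat)
sampling_dist = pd.DataFrame({'x': sampling_dist})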

sampling_dist['x'].hist()
plt.xlabel('Possible Difference in Sample Means, March 2023 - June 2022')
plt.ylabel('Frequency')
plt.title('Histogram of Average Chicago Airbnb Price Increases from June 2022 to March 2023')
plt.show()

Histogram of the sampling distribution using a difference of observations from June 2022 and from March 2023.

sampling_dist.describe()

	x
count	1000.000000
mean	-36.223480
std	46.162097
min	-234.600000
25%	-65.330000
50%	-34.980000
75%	-7.945000
max	223.000000

Consider what happens if the same Airbnb is included in both of my samples. Would my two samples be independent? The answer is no! The price for the Airbnb 9 months later depends on the earlier price, because both are recorded for the same unit. Similar to the example with regression, I did not consider the underlying structure of the data in the simulation above. In this case, I generated two different samples and tried to use them to understand any shifts in prices over the last 9 months. My estimate should still center around the true change in price. However, I can create a much less variable estimate; that is, an estimate that is closer to the true change for more of the samples.

To accomplish this, we will take a single random sample of Airbnb listings and calculate, for each listing, the price difference between the two dates. Now we have only a single variable of interest. In this case, I am better able to isolate the variable that I am truly interested in learning more about: the price increase between the two time frames. If I had two separate samples, I might be concerned that my two samples were not comparable to each other to begin with. For example, what if I happened to sample more expensive units in the earlier time frame and less expensive units in the later time frame? That might make it appear that the price is decreasing when in fact the location or the quality of the sampled units differs between the two time frames.

With the difference variable, I can better isolate my variable of interest and more appropriately determine if (and how) the prices are changing over time, by ensuring that the prices in the two time frames are as comparable as possible. This is the same motivation that makes randomized controlled trials the gold standard of experiments: we are trying to ensure that we have the most comparable groups possible, or the best way to measure the comparison possible.
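To make the benefit of pairing concrete, here is a minimal sketch on synthetic data, not the Airbnb data. The population, the $30 average drop, and the noise level are all assumptions chosen for illustration. It compares the spread of the difference in means from two separate samples with the spread of the mean of paired differences.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population (NOT the Airbnb data): each unit has an earlier price
# and a later price that is closely tied to the earlier one.
n_units = 5000
earlier = rng.gamma(shape=2.0, scale=80.0, size=n_units)
later = earlier - 30 + rng.normal(0, 25, size=n_units)
true_diff = (later - earlier).mean()

def simulate(paired, n=50, reps=2000):
    """Return simulated mean differences (later - earlier) for samples of size n."""
    stats = []
    for _ in range(reps):
        if paired:
            # One sample of units; use each unit's own price change.
            idx = rng.integers(0, n_units, size=n)
            stats.append(np.mean(later[idx] - earlier[idx]))
        else:
            # Two separate samples, one per time frame.
            idx_a = rng.integers(0, n_units, size=n)
            idx_b = rng.integers(0, n_units, size=n)
            stats.append(np.mean(later[idx_b]) - np.mean(earlier[idx_a]))
    return np.array(stats)

unpaired = simulate(paired=False)
paired = simulate(paired=True)

print("true mean change:", round(float(true_diff), 2))
print("separate samples: center", round(float(unpaired.mean()), 2),
      "spread", round(float(unpaired.std()), 2))
print("paired differences: center", round(float(paired.mean()), 2),
      "spread", round(float(paired.std()), 2))
```

Both approaches center near the same value, but the paired version varies far less from sample to sample, because each unit's earlier price cancels out of its own difference.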

To actually accomplish this through analysis, we will need to pair the prices for the two different dates together by matching the two data frames. What limitations might arise in this process? For example, we would only be able to use those units that are listed on both dates. Do we generally have most units still available for analysis? Or have we lost many units due to the missing data? If we’ve lost a lot of units, is it that most of the units left the market? Did many units enter the market during that time frame? What other implications and limitations does that have for our final conclusions from this analysis?
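One way to answer these questions is to count how many listings appear in both data sets, only the earlier one, or only the later one. The sketch below uses small made-up tables (hypothetical ids and prices) and the `how` and `indicator` options of `pd.merge`; it illustrates the bookkeeping and is not the analysis run in this section.

```python
import pandas as pd

# Toy stand-ins for the two snapshots (hypothetical ids and prices).
march = pd.DataFrame({'id': [1, 2, 3, 5], 'price': [100, 80, 150, 60]})
june = pd.DataFrame({'id': [1, 2, 4, 5, 6], 'price': [110, 85, 200, 60, 95]})

# An outer merge keeps every listing. indicator=True adds a '_merge' column
# recording whether each id was found in the left table, the right table, or both.
both = pd.merge(march, june, on='id', how='outer', indicator=True,
                suffixes=('_march', '_june'))
counts = both['_merge'].value_counts()
print(counts)
```

Here `left_only` rows are listings seen only in the March table (they entered the market or were not listed in June), and `right_only` rows are listings seen only in the June table (they left the market or were not listed in March).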

Now that we have our analysis plan in place, let’s go ahead and simulate the sampling distribution and communicate our findings.

df = pd.merge(df_popn, df_june, on = 'id')
df.shape

(4456, 351)

df.head()

	Unnamed: 0	id	listing_url	scrape_id	last_scraped	source	name_x	description	neighborhood_overview	picture_url	...	room_type_y	price_y	minimum_nights_y	number_of_reviews_y	last_review_y	reviews_per_month_y	calculated_host_listings_count_y	availability_365_y	number_of_reviews_ltm_y	license_y
0	0	2384.0	https://www.airbnb.com/rooms/2384	2.023030e+13	3/19/23	city scrape	Hyde Park - Walk to UChicago	You are invited to be the sole Airbnb guest in...	The apartment is less than one block from beau...	https://a0.muscache.com/pictures/acf6b3c0-47f2...	...	Private room	92	3	198	2022-05-22	2.19	1	326	15	R17000015609
1	1	1837153.0	https://www.airbnb.com/rooms/1837153	2.023030e+13	3/19/23	city scrape	Musician's Quarters	Host Chester of Musicians Quarters is offering...	NaN	https://a0.muscache.com/pictures/97205fb8-421c...	...	Entire home/apt	115	32	56	2021-11-27	0.55	1	365	7	City registration pending
2	2	2604454.0	https://www.airbnb.com/rooms/2604454	2.023030e+13	3/19/23	previous scrape	Cozy Single-Family Home near University of Chi...	Comfortable House in Hyde Park: This beautiful...	The house is essentially on the University of ...	https://a0.muscache.com/pictures/47138943/783a...	...	Entire home/apt	125	1	103	2020-02-16	1.08	1	0	0	R17000013467
3	3	3517984.0	https://www.airbnb.com/rooms/3517984	2.023030e+13	3/19/23	city scrape	Private room w/bath in urban canopy	Private bedroom w/attached private bath in a g...	Hyde Park is a strange little pocket of the wo...	https://a0.muscache.com/pictures/48801062/aa86...	...	Private room	79	1	344	2022-06-09	3.61	1	253	31	R17000015187
4	4	5297152.0	https://www.airbnb.com/rooms/5297152	2.023030e+13	3/19/23	city scrape	Fresh and Sunny Bed & Bath by UofC	Our private bed and bath offers high ceilings,...	Parks galore in Hyde Park! Walk 15 minutes eas...	https://a0.muscache.com/pictures/773dd6ff-dbb0...	...	Private room	70	7	201	2022-05-30	2.29	3	130	22	R17000014580

5 rows × 351 columns

The first five rows of the merged data frame, which matches Chicago Airbnb listings from March 2023 and June 2022 by listing id.

We see from merging the two data frames on the listing id that we do lose a sizable number of listings. We have 4,456 listings in common between both data sets. Note that the two original data sets contained 6,717 and 7,743 listings.

That means that we have only retained 66% and 58% of each original set of listings, respectively.
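The percentages come from dividing the number of matched listings by the size of each original data set. A quick check of that arithmetic, using only the counts reported above:

```python
n_matched = 4456
n_original = [6717, 7743]

# Share of each original data set that survives the merge.
retained = [n_matched / n for n in n_original]
# Number of listings from each data set that have no match.
dropped = [n - n_matched for n in n_original]

for n, share, lost in zip(n_original, retained, dropped):
    print(f"{n_matched} of {n} retained ({share:.0%}); {lost} dropped")
```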

Next, we can define our new variable of interest: the change in price between the older price and the more recent price. If the price change is positive, that indicates that the price has increased over time. If negative, then the price for the listing has decreased in the past 9 months.

We can then observe the distribution of this price change.

# price_x comes from df_popn (March 2023); price_y comes from df_june (June 2022)
df['price_change'] = df['price_x'] - df['price_y']
df['price_change'].describe()

    count    4456.000000
    mean      -34.498878
    std       143.871562
    min     -3360.000000
    25%       -45.000000
    50%        -8.000000
    75%         0.000000
    max      5007.000000
    Name: price_change, dtype: float64

Note that there are some unusual and concerning values for the price change, especially the maximum and minimum. These might warrant some additional exploration. For example, one of these listings was no longer available the last time that I checked, suggesting that the host may have tried to effectively take the listing off the market by raising the price substantially. Some of the largest price reductions appeared to be for units that may have been newly refurbished at the earlier date, with prices lowered as the units became older.

Some of this is speculation. However, all of this warrants additional exploration.
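As one possible starting point for that exploration, the sketch below flags extreme changes with the common 1.5 × IQR rule and compares the mean with and without them. The values are made up for illustration, and the IQR rule is a convention I am assuming here, not a step taken in this analysis.

```python
import pandas as pd

# Hypothetical price changes in dollars, with two extreme values mixed in.
price_change = pd.Series([-45, -20, -8, -8, 0, 0, 5, -30, -3360, 5007])

# Flag values more than 1.5 IQRs beyond the quartiles.
q1, q3 = price_change.quantile([0.25, 0.75])
iqr = q3 - q1
is_extreme = (price_change < q1 - 1.5 * iqr) | (price_change > q3 + 1.5 * iqr)

print("flagged values:", price_change[is_extreme].tolist())
print("mean, all values:", price_change.mean())
print("mean, flagged values removed:", price_change[~is_extreme].mean())
print("median, all values:", price_change.median())
```

Flagged values should be inspected rather than automatically dropped; the comparison simply shows how strongly a handful of extreme listings can pull the mean, while the median barely moves.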

Finally, using the newly created price_change variable, we can generate a new sampling distribution.

new_sampling_dist = []
for i in range(1000):
    df_sample = df.sample(50, replace = True)
    new_sampling_dist.append(df_sample['price_change'].mean())
new_sampling_dist = pd.DataFrame({'x': new_sampling_dist})

sampling_dist.describe() # not using the differences

	x
count	1000.000000
mean	-36.223480
std	46.162097
min	-234.600000
25%	-65.330000
50%	-34.980000
75%	-7.945000
max	223.000000

Summary measures for the difference in price per night of Chicago Airbnbs between June 2022 and March 2023, not matched between the two time frames.

new_sampling_dist.describe() # using the difference variable

	x
count	1000.000000
mean	-35.104860
std	20.056687
min	-156.140000
25%	-43.785000
50%	-34.100000
75%	-25.680000
max	83.140000

Summary measures for the difference in price per night of Chicago Airbnbs between June 2022 and March 2023, matched between the two time frames.

new_sampling_dist['x'].hist()
plt.title('Histogram of Average Change in Prices for Chicago Airbnbs, March 2023 - June 2022')
plt.xlabel('Possible Sample Means of Change of Prices for 50 Chicago Airbnbs')
plt.ylabel('Frequency')
plt.show()

Histogram of the sampling distribution for the paired difference in prices for each unit.

df_small = df[['id', 'listing_url', 'name_x', 'name_y', 'host_id_x', 'host_id_y', 'price_x', 'price_y', 'price_change', 'room_type_x', 'room_type_y', 'minimum_nights_x', 'minimum_nights_y']]
df_small[df_small['price_change'] < -1000]

	id	listing_url	name_x	name_y	host_id_x	host_id_y	price_x	price_y	price_change	room_type_x	room_type_y	minimum_nights_x	minimum_nights_y
2265	53412390.0	https://www.airbnb.com/rooms/53412390	Cloud9 | Up to 14 ppl | The Hudson	Group book Cloud9's 2 Penthouse 6 bedrooms for...	248760412	248760412	951	3502	-2551	Entire home/apt	Entire home/apt	3	1
2951	44115433.0	https://www.airbnb.com/rooms/44115433	The Orleans - Entire Building - With Rooftop D...	The Orleans - Entire Building - With Rooftop D...	170785489	170785489	1231	3042	-1811	Entire home/apt	Entire home/apt	2	2
2958	44161649.0	https://www.airbnb.com/rooms/44161649	2 Newly Built Luxury Condos with Private Roofd...	2 Newly Built Luxury Condos with Private Roofd...	170785489	170785489	535	1885	-1350	Entire home/apt	Entire home/apt	2	2
2959	44162119.0	https://www.airbnb.com/rooms/44162119	2 Newly Built Luxury Condos with Private Roofd...	2 Newly Built Luxury Condos with Private Roofd...	170785489	170785489	535	1885	-1350	Entire home/apt	Entire home/apt	2	2
3135	53111489.0	https://www.airbnb.com/rooms/53111489	New Construction Building in River North!	New Construction Building in River North!	170785489	170785489	1282	4642	-3360	Entire home/apt	Entire home/apt	2	2

Airbnb listings that have an unusual change in their prices between June 2022 and March 2023.

Finally, I have generated a sampling distribution for a new variable: the change in price between June 2022 and March 2023. Compared to the original distribution that didn't use the difference variable, the new sampling distribution has a much smaller standard error for the statistic of interest (20.06 compared to 46.16 in the summaries above). This is because we were able to isolate the price difference for each listing, rather than comparing averages taken over two different sets of listings.
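As a rough cross-check on the simulation, the standard error of a sample mean can be approximated by the standard deviation of the variable divided by the square root of the sample size. The sketch below applies that formula to the standard deviation of price_change reported by describe() earlier; using this formula here is my own addition, not part of the simulation.

```python
import math

sd_price_change = 143.871562  # std from df['price_change'].describe()
n = 50                        # listings per simulated sample

# Approximate standard error of the mean price change for samples of size n.
se_theory = sd_price_change / math.sqrt(n)
print(round(se_theory, 2))
```

The result is close to the standard deviation of about 20 seen in the simulated sampling distribution of paired differences.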

From this sampling distribution, I can conclude that on average, listings in Chicago became a little less expensive per night. The average reduction is around $35 per night. I can also see that the mean price change for 50 listings will typically be between $100 less per night and around $25 more per night across repeated random samples.
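A range of typical values like this can also be read off a simulated sampling distribution using percentiles. The sketch below uses synthetic values as a stand-in for new_sampling_dist['x']; the center of -35, the spread of 20, and the choice of the middle 95% are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in for 1,000 simulated sample means (NOT the real simulation).
sim_means = pd.Series(rng.normal(loc=-35, scale=20, size=1000))

# Bounds that contain the middle 95% of the simulated sample means.
lower, upper = sim_means.quantile([0.025, 0.975])
share_inside = sim_means.between(lower, upper).mean()

print("middle 95% of sample means:", round(lower, 1), "to", round(upper, 1))
print("share of simulated means inside:", share_inside)
```

With the real simulation, the same two lines would be applied to new_sampling_dist['x'] in place of sim_means.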

Above, we considered a few limitations to our analysis. These limitations mostly centered on why listings might not be present in both time frames. However, as noted above, there may also be concerns about the accuracy of the price differences between the two time frames. Some of these may stem from decisions by the host to discourage or encourage bookings. Others may be due to a listing not being comparable across the two time frames; for example, a unit may have been refurbished, updated, or otherwise changed. Such changes could alter the number of people a listing can accommodate, the number of bathrooms, or other amenities, which could explain the difference in pricing.

Overall, to fully explore the differences in units would require a more in-depth exploration of the data. For this analysis, though, we will simply conclude that it appears that Airbnbs in March 2023 tended to be less expensive than the same Airbnbs in June 2022.
