Adam Stoddard
The Airbnb marketplace is very diverse: a listing could be anything from a bedbug-ridden couch to a glamorous full-floor penthouse. How do we quantify this database?
Some of the most interesting factors to look at are text-based. Airbnb includes text descriptions of the apartments and an ‘about’ section for each host. The apartment descriptions, analyzed with both cosine similarity and topic modeling, are about what you would expect: they consist of the words people use to describe housing, such as beds, baths, location, access, nearby restaurants, subways and bars. But topic modeling on the host descriptions can be enlightening, because it lets us see how people think of themselves. Hosts fall into loose categories, such as a “professional” who “enjoys” “traveling”, or an “artist” in “Brooklyn” who spends time with “girlfriends.”
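As a rough illustration of the cosine similarity idea, here is a minimal sketch that compares texts by their raw word counts. The three listing descriptions are made up for this example, and the original analysis may have used a different weighting (TF-IDF, for instance), so treat this as one plausible version rather than the method actually used.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts, using raw word counts as vectors."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words that appear in both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical listing descriptions, invented for illustration.
listing_1 = "sunny one bedroom near subway and restaurants"
listing_2 = "quiet one bedroom near subway and bars"
listing_3 = "rustic cabin with lake view"

print(cosine_similarity(listing_1, listing_2))
print(cosine_similarity(listing_1, listing_3))
```

Descriptions that share a lot of housing vocabulary score close to 1, while descriptions with no words in common score 0.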
Host topic modeling using gensim:
0 place family friends much girlfriends school good entrepreneur ive give
1 really de going also huge always living et vous things
2 living brooklyn ny manhattan people two well park best walk
3 things people time reading make moved meeting year amazing see
4 month great place also architect kyle couple married home ny
5 live favorite years enjoy garden travel home living like life
6 travel professional easy going organized time make clean park slope
7 great work host travel neighborhood see manhattan good currently trip
8 stayed living writer good dogs editor comfortable shows owner magazine
9 great restaurants yorker home space please event traveling way easy
What does the marketplace look like?
The following histogram shows the number of bedrooms:

One-bedrooms clearly dominate, with far more units than either studios or listings with two or more bedrooms.
This set of box plots shows the price spread, as well as the frequency of outliers:

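For readers unfamiliar with how a box plot decides what counts as an outlier, here is a minimal sketch of the common 1.5 × IQR convention. Whether the plot above used exactly this rule is an assumption, and the nightly prices are invented for the example.

```python
import statistics


def box_plot_outliers(prices):
    """Return the values outside the usual 1.5 * IQR box-plot whiskers."""
    q1, _, q3 = statistics.quantiles(prices, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [p for p in prices if p < lower or p > upper]


# Hypothetical nightly prices, invented for illustration.
nightly_prices = [50, 80, 90, 100, 110, 120, 150, 1000]
outliers = box_plot_outliers(nightly_prices)
print(outliers)
```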
What drives the difference in prices?
The following is a multiple OLS regression of price on the variables below. In the results, x1 through x10 correspond to these variables in the order listed:
‘f_number_of_reviews’, ‘f_bedrooms’, ‘f_beds’, ‘f_square_feet’, ‘f_review_scores_accuracy’, ‘f_review_scores_cleanliness’, ‘f_review_scores_checkin’, ‘f_review_scores_communication’, ‘f_review_scores_location’, ‘f_review_scores_value’
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.146
Model:                            OLS   Adj. R-squared:                  0.146
Method:                 Least Squares   F-statistic:                     469.6
Date:                Mon, 31 Aug 2015   Prob (F-statistic):               0.00
Time:                        11:05:11   Log-Likelihood:            -1.8355e+05
No. Observations:               27469   AIC:                         3.671e+05
Df Residuals:                   27458   BIC:                         3.672e+05
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       -238.1189     22.414    -10.624      0.000      -282.051  -194.187
x1            -0.3410      0.057     -5.971      0.000        -0.453    -0.229
x2            70.0637      2.311     30.314      0.000        65.533    74.594
x3            33.1152      1.562     21.196      0.000        30.053    36.177
x4             0.0241      0.010      2.407      0.016         0.004     0.044
x5            10.1972      1.982      5.145      0.000         6.312    14.082
x6             6.7004      1.541      4.349      0.000         3.681     9.720
x7            -3.3756      2.343     -1.441      0.150        -7.968     1.216
x8            -0.1726      2.595     -0.067      0.947        -5.259     4.913
x9            34.1613      1.507     22.670      0.000        31.208    37.115
x10          -20.0611      2.023     -9.915      0.000       -24.027   -16.095
==============================================================================
Omnibus:                    56902.725   Durbin-Watson:                   1.888
Prob(Omnibus):                  0.000   Jarque-Bera (JB):        356453892.818
Skew:                          17.463   Prob(JB):                         0.00
Kurtosis:                     559.972   Cond. No.                     1.35e+04
==============================================================================
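To make the coefficients concrete, here is a small sketch that plugs a listing into the fitted equation. The coefficients are copied from the table above, matched to variable names in the order they were listed. The example listing is hypothetical, and two things are assumptions: that y is the nightly price in dollars, and that review scores run from 1 to 10.

```python
# Coefficients from the regression table, keyed by variable name.
COEF = {
    "const": -238.1189,
    "f_number_of_reviews": -0.3410,
    "f_bedrooms": 70.0637,
    "f_beds": 33.1152,
    "f_square_feet": 0.0241,
    "f_review_scores_accuracy": 10.1972,
    "f_review_scores_cleanliness": 6.7004,
    "f_review_scores_checkin": -3.3756,
    "f_review_scores_communication": -0.1726,
    "f_review_scores_location": 34.1613,
    "f_review_scores_value": -20.0611,
}


def predict_price(listing):
    """Intercept plus the sum of coefficient * value for each variable."""
    return COEF["const"] + sum(COEF[name] * value for name, value in listing.items())


# Hypothetical listing: a one-bedroom with perfect review scores (assumed 1-10 scale).
listing = {
    "f_number_of_reviews": 5,
    "f_bedrooms": 1,
    "f_beds": 1,
    "f_square_feet": 500,
    "f_review_scores_accuracy": 10,
    "f_review_scores_cleanliness": 10,
    "f_review_scores_checkin": 10,
    "f_review_scores_communication": 10,
    "f_review_scores_location": 10,
    "f_review_scores_value": 10,
}

base = predict_price(listing)
with_extra_bedroom = predict_price({**listing, "f_bedrooms": 2})
print(round(base, 2), round(with_extra_bedroom - base, 2))
```

Holding everything else fixed, the difference between the two predictions is just the bedroom coefficient, which is how each row of the table should be read.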
Does location have an effect?
This map makes it appear so, with the dark dots representing higher prices:

But even after adding a dummy variable for each neighborhood, the neighborhood terms show little statistical significance.
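For anyone new to dummy variables, the idea is to turn each neighborhood into its own 0/1 column, leaving one neighborhood out as the baseline the others are compared against. The sketch below shows the encoding only. The neighborhood names and the choice of baseline are made up for illustration and say nothing about how the original columns were built.

```python
def neighborhood_dummies(neighborhoods, baseline):
    """One 0/1 column per neighborhood, omitting the baseline category."""
    levels = sorted(set(neighborhoods) - {baseline})
    rows = [[1 if hood == level else 0 for level in levels] for hood in neighborhoods]
    return levels, rows


# Hypothetical neighborhoods for four listings.
hoods = ["Park Slope", "Harlem", "Park Slope", "Chelsea"]
columns, rows = neighborhood_dummies(hoods, baseline="Chelsea")

print(columns)
for row in rows:
    print(row)
```

A listing in the baseline neighborhood gets zeros in every column, so each dummy coefficient measures the price gap relative to that baseline.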
So what does this tell us?
Airbnb is a highly irregular market: even though several variables are statistically significant, the model has little explanatory power, with an R-squared of only 0.146.
Something else must explain price differentials. Pretty pictures? Host preferences?
Airbnb does not let you sort listings by price, only specify a price range. Both hosts and guests therefore have imperfect information about market prices, so matches happen for reasons that are harder to quantify.
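It may seem odd that variables can be statistically significant while the model explains so little. The simulation below is a sketch on synthetic data, not Airbnb data: a variable with a real effect on the outcome is buried in a lot of noise. With enough observations the coefficient is estimated precisely (a large t-statistic), yet most of the variation remains unexplained (a low R-squared).

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Synthetic data: a true slope of 2, swamped by noise with standard deviation 5.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + rng.gauss(0, 5) for xi in x]

# Simple one-variable OLS, computed from sums of squares.
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# R-squared: share of the variance in y that the fitted line explains.
ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

# t-statistic: the slope divided by its standard error.
std_err = math.sqrt(ss_res / (n - 2) / sxx)
t_stat = slope / std_err

print(f"slope={slope:.3f} t={t_stat:.1f} R-squared={r_squared:.3f}")
```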
If you’re interested in learning more about the intersection of data, coding and visualization, check out the Lede Program, an intensive certification program at Columbia’s School of Journalism, run in conjunction with the Department of Computer Science. Find out more on our main page. Applications open soon!