Adam Stoddard

The Airbnb marketplace is very diverse: a listing could be anything from a bedbug-ridden couch to a glamorous full-floor penthouse. How do we quantify this variety?

Some of the most interesting factors to look at are text-based. Airbnb listings include a text description of the apartment and an ‘about’ section for the host. Under both cosine similarity and topic modeling, the apartment descriptions are about what you would expect: they consist of the words people use to describe housing, such as beds, baths, location, access, and nearby restaurants, subways, and bars. Topic modeling on host descriptions is more enlightening, because it shows how people think of themselves. Some hosts group themselves into categories, such as a “professional” who “enjoys” “traveling”, or an “artist” in “Brooklyn” who spends time with “girlfriends.”
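As a rough illustration of the cosine similarity idea, here is a minimal sketch using only the Python standard library. It assumes a plain bag-of-words representation with no stop-word removal or weighting, which may differ from what was actually used for this analysis; the example texts and the helper names `tokenize` and `cosine_similarity` are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text is treated as similar to nothing.
    return dot / (norm_a * norm_b)


# Hypothetical texts, not taken from the Airbnb data.
listing_1 = "Sunny one bedroom near the subway with restaurants and bars nearby"
listing_2 = "Quiet one bedroom close to subway, restaurants and a park"
host_bio = "I am a writer and editor who loves dogs"

print(cosine_similarity(listing_1, listing_2))
print(cosine_similarity(listing_1, host_bio))
```

Two apartment descriptions share housing vocabulary and so score higher against each other than an apartment description does against a host bio, which matches the intuition that listing text is fairly uniform.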

Host topic modeling using gensim:
0 place family friends much girlfriends school good entrepreneur ive give
1 really de going also huge always living et vous things
2 living brooklyn ny manhattan people two well park best walk
3 things people time reading make moved meeting year amazing see
4 month great place also architect kyle couple married home ny
5 live favorite years enjoy garden travel home living like life
6 travel professional easy going organized time make clean park slope
7 great work host travel neighborhood see manhattan good currently trip
8 stayed living writer good dogs editor comfortable shows owner magazine
9 great restaurants yorker home space please event traveling way easy

What does the marketplace look like?

The following histogram shows the number of bedrooms:

One-bedrooms clearly dominate, with far more units than either studios or listings with two or more bedrooms.

This set of box plots shows the price spread, as well as the frequency of outliers:

What drives the difference in prices?

The following is a multiple OLS regression on the variables listed below; x1 through x10 in the results correspond to these variables in the order listed:

‘f_number_of_reviews’, ‘f_bedrooms’, ‘f_beds’, ‘f_square_feet’, ‘f_review_scores_accuracy’, ‘f_review_scores_cleanliness’, ‘f_review_scores_checkin’, ‘f_review_scores_communication’, ‘f_review_scores_location’, ‘f_review_scores_value’

OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.146
Model:                            OLS   Adj. R-squared:                  0.146
Method:                 Least Squares   F-statistic:                     469.6
Date:                Mon, 31 Aug 2015   Prob (F-statistic):               0.00
Time:                        11:05:11   Log-Likelihood:            -1.8355e+05
No. Observations:               27469   AIC:                         3.671e+05
Df Residuals:                   27458   BIC:                         3.672e+05
Df Model:                          10
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       -238.1189     22.414    -10.624      0.000      -282.051  -194.187
x1            -0.3410      0.057     -5.971      0.000        -0.453    -0.229
x2            70.0637      2.311     30.314      0.000        65.533    74.594
x3            33.1152      1.562     21.196      0.000        30.053    36.177
x4             0.0241      0.010      2.407      0.016         0.004     0.044
x5            10.1972      1.982      5.145      0.000         6.312    14.082
x6             6.7004      1.541      4.349      0.000         3.681     9.720
x7            -3.3756      2.343     -1.441      0.150        -7.968     1.216
x8            -0.1726      2.595     -0.067      0.947        -5.259     4.913
x9            34.1613      1.507     22.670      0.000        31.208    37.115
x10          -20.0611      2.023     -9.915      0.000       -24.027   -16.095
==============================================================================
Omnibus:                    56902.725   Durbin-Watson:                   1.888
Prob(Omnibus):                  0.000   Jarque-Bera (JB):        356453892.818
Skew:                          17.463   Prob(JB):                         0.00
Kurtosis:                     559.972   Cond. No.                     1.35e+04
==============================================================================
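
One way to read the coefficient table is to plug a listing into the fitted equation. The sketch below is only an illustration under stated assumptions: that x1 through x10 map to the listed variables in order, that y is the nightly price, and that the review sub-scores are on a 1 to 10 scale. The example listing is hypothetical, and the `predict_price` helper is not part of the original analysis.

```python
# Coefficients copied from the OLS results table above.
COEFFICIENTS = {
    "const": -238.1189,
    "f_number_of_reviews": -0.3410,             # x1
    "f_bedrooms": 70.0637,                      # x2
    "f_beds": 33.1152,                          # x3
    "f_square_feet": 0.0241,                    # x4
    "f_review_scores_accuracy": 10.1972,        # x5
    "f_review_scores_cleanliness": 6.7004,      # x6
    "f_review_scores_checkin": -3.3756,         # x7
    "f_review_scores_communication": -0.1726,   # x8
    "f_review_scores_location": 34.1613,        # x9
    "f_review_scores_value": -20.0611,          # x10
}


def predict_price(listing):
    """Linear prediction: the intercept plus each coefficient times its value."""
    return COEFFICIENTS["const"] + sum(
        COEFFICIENTS[name] * value for name, value in listing.items()
    )


# Hypothetical listing (assumption): one bedroom, one bed, 500 square feet,
# 10 reviews, and the top score of 10 on every review dimension.
example_listing = {
    "f_number_of_reviews": 10,
    "f_bedrooms": 1,
    "f_beds": 1,
    "f_square_feet": 500,
    "f_review_scores_accuracy": 10,
    "f_review_scores_cleanliness": 10,
    "f_review_scores_checkin": 10,
    "f_review_scores_communication": 10,
    "f_review_scores_location": 10,
    "f_review_scores_value": 10,
}

predicted = predict_price(example_listing)
print(round(predicted, 2))

# Holding everything else fixed, one more bedroom shifts the prediction
# by exactly the f_bedrooms coefficient.
with_extra_bedroom = dict(example_listing, f_bedrooms=2)
print(round(predict_price(with_extra_bedroom) - predicted, 4))
```

With an R-squared of 0.146, any single prediction from this equation should be treated as a very rough estimate.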

Does location have an effect?

This map makes it appear so, with the dark dots representing higher prices:

But even after adding a dummy variable for each neighborhood, there is little statistical significance.
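
For readers unfamiliar with dummy variables, here is a minimal sketch of the encoding in plain Python. It assumes the usual convention of dropping one reference category so the dummies are not collinear with the intercept; whether the original regression did this is not stated. The neighborhood values and the `make_dummies` helper are illustrative only.

```python
def make_dummies(values, reference=None):
    """Turn a list of category labels into 0/1 indicator columns.

    One category (the reference) gets no column of its own, so a row of
    all zeros means "this observation is in the reference category".
    """
    categories = sorted(set(values))
    if reference is None:
        reference = categories[0]
    columns = [c for c in categories if c != reference]
    rows = [[1 if value == c else 0 for c in columns] for value in values]
    return columns, rows


# Illustrative neighborhood labels, not the actual data.
neighborhoods = ["Park Slope", "Williamsburg", "Harlem", "Park Slope"]
columns, rows = make_dummies(neighborhoods)

print(columns)
for neighborhood, row in zip(neighborhoods, rows):
    print(neighborhood, row)
```

Each dummy coefficient in the regression would then measure the price difference between that neighborhood and the reference neighborhood, holding the other variables fixed.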

So what does this tell us? 

Airbnb is a highly irregular market: even though a number of variables are statistically significant, the model has little explanatory power, with an R-squared of only 0.146.

Something else must explain price differentials. Pretty pictures? Host preferences?

Airbnb does not let you sort listings by price, only specify a price range. Both hosts and guests therefore have imperfect information about market prices, so matches happen for reasons that are harder to quantify.