""
Airbnb is a vacation rental online marketplace based in San Francisco, California. Founded in 2008, the company has skyrocketed in popularity, with a private valuation of $18 billion and a plan to potentially go public within the year. The scope of this work, is to analyse data about Airbnb listings and fit a model to predict the total cost for two people staying 4 nights in an AirBnB in Singapore.
We begin with exploratory data analysis and mapping of the key insights from the dataset before running various analyses including a demand analysis over time, a superhost analysis on the price and a heatmap of the average daily prices over the past 2 years, before running a thorough regression analysis. We then analyse our final results and wrap it up in our conclusion.
The aim of this section is to gain a better understanding of our data set. To do this, we utilize a variety of tools, such as data visualizations and correlation matrices. Our dataset comes from insideairbnb.com, and represent official AirBnb data.
# Importing listings dataset
singapore <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2020-06-22/data/listings.csv.gz") %>%
clean_names()
##dplyr::glimpse(singapore)
# Importing reviews dataset
reviews <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2020-06-22/data/reviews.csv.gz") %>%
clean_names()
##glimpse(reviews)
To proceed with our analysis we will begin by selecting the specific variables of interest that will inform our regression models. On the figure below, we can see the summary statics of the selected variables and their rough distributions. By inspecting the distributions, all variables seem to be heavily skewed, which could potentially signify the existence of extreme data points, which will be dealt with in part 2.3.2.
#select the variables that will be used for analysis
singapore_listings <- singapore %>%
select(price,
cleaning_fee,
extra_people,
property_type,
room_type,
minimum_nights,
number_of_reviews,
review_scores_rating,
review_scores_location,
longitude,
latitude,
neighbourhood,
neighbourhood_group_cleansed,
zipcode,
host_is_superhost,
is_location_exact,
cancellation_policy,
host_response_time,
host_acceptance_rate,
security_deposit,
accommodates,
bedrooms,
bathrooms,
beds,
square_feet)
#calculate summary statistics and observe missing values if any
skimr::skim(singapore_listings)
Name | singapore_listings |
Number of rows | 7323 |
Number of columns | 25 |
_______________________ | |
Column type frequency: | |
character | 12 |
logical | 2 |
numeric | 11 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
price | 0 | 1.00 | 6 | 10 | 0 | 429 | 0 |
cleaning_fee | 1947 | 0.73 | 5 | 7 | 0 | 113 | 0 |
extra_people | 0 | 1.00 | 5 | 7 | 0 | 81 | 0 |
property_type | 0 | 1.00 | 3 | 22 | 0 | 26 | 0 |
room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
neighbourhood | 2 | 1.00 | 5 | 18 | 0 | 45 | 0 |
neighbourhood_group_cleansed | 0 | 1.00 | 11 | 17 | 0 | 5 | 0 |
zipcode | 818 | 0.89 | 1 | 9 | 0 | 1975 | 0 |
cancellation_policy | 0 | 1.00 | 8 | 27 | 0 | 5 | 0 |
host_response_time | 22 | 1.00 | 3 | 18 | 0 | 5 | 0 |
host_acceptance_rate | 22 | 1.00 | 2 | 4 | 0 | 79 | 0 |
security_deposit | 2217 | 0.70 | 5 | 9 | 0 | 156 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
host_is_superhost | 22 | 1 | 0.16 | FAL: 6143, TRU: 1158 |
is_location_exact | 0 | 1 | 0.80 | TRU: 5845, FAL: 1478 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
minimum_nights | 0 | 1.00 | 20.73 | 46.87 | 1.00 | 2.0 | 4.00 | 18.00 | 1000.00 | ▇▁▁▁▁ |
number_of_reviews | 0 | 1.00 | 12.46 | 31.65 | 0.00 | 0.0 | 1.00 | 8.00 | 370.00 | ▇▁▁▁▁ |
review_scores_rating | 2969 | 0.59 | 90.59 | 12.47 | 20.00 | 87.0 | 94.00 | 100.00 | 100.00 | ▁▁▁▂▇ |
review_scores_location | 2979 | 0.59 | 9.38 | 1.04 | 2.00 | 9.0 | 10.00 | 10.00 | 10.00 | ▁▁▁▁▇ |
longitude | 0 | 1.00 | 103.85 | 0.04 | 103.65 | 103.8 | 103.85 | 103.87 | 103.97 | ▁▁▃▇▁ |
latitude | 0 | 1.00 | 1.31 | 0.03 | 1.24 | 1.3 | 1.31 | 1.32 | 1.45 | ▂▇▂▁▁ |
accommodates | 0 | 1.00 | 3.25 | 2.51 | 1.00 | 2.0 | 2.00 | 4.00 | 16.00 | ▇▂▁▁▁ |
bedrooms | 12 | 1.00 | 1.31 | 0.87 | 0.00 | 1.0 | 1.00 | 2.00 | 10.00 | ▇▁▁▁▁ |
bathrooms | 3 | 1.00 | 1.54 | 1.16 | 0.00 | 1.0 | 1.00 | 2.00 | 21.00 | ▇▁▁▁▁ |
beds | 71 | 0.99 | 1.98 | 2.52 | 0.00 | 1.0 | 1.00 | 2.00 | 58.00 | ▇▁▁▁▁ |
square_feet | 7292 | 0.00 | 204.55 | 245.89 | 0.00 | 0.0 | 200.00 | 317.50 | 1001.00 | ▇▃▁▁▁ |
After having selected the variables of interest, we need to clean up the dataset to prepare it for analysis. We begin by handling the missing data, transforming price data to numerical values, visualising the distribution of our dependent variable, the price and filtering out extreme data.
singapore_listings <- singapore_listings %>%
#convert all prices to numeric values
mutate(price = parse_number(price),
cleaning_fee = parse_number(cleaning_fee),
extra_people = parse_number(extra_people),
security_deposit = parse_number(security_deposit)) %>%
#replace all NA values for cleaning fee and security deposit to the numeric price 0
mutate(cleaning_fee = case_when(is.na(cleaning_fee) ~ 0, TRUE ~ cleaning_fee)) %>%
mutate(security_deposit = case_when(is.na(security_deposit) ~ 0, TRUE ~ security_deposit))
Below is a histogram of all prices in our Singapore dataset. As we can see, the majority of accommodations are concentrated below $250 USD. In order to better visualise the distribution the price axis needs to be logarithmically transformed and this is how we are going to treat the price in the models we set up.
log_price_hist <-
ggplot(singapore_listings, aes(x= price))+
geom_point(stat = "count",colour="#FF5A5F") +
labs(x = "Price",
y = "Count", title="Most Singapore AirBnbs fall in the $100 - $400 Range",subtitle="Number of listings for each price") +
scale_x_log10()+
scale_y_log10()+
theme_bw()+
theme(panel.border = element_rect(colour = "#00A699", fill=NA, size=1))
log_price_hist
A large majority of accomodations are concentrated below $250 USD.
In this part we are visualising our dataset using boxplots, in order to identify outliers and proceed with filtering our extreme values.
# Observe extreme values
cleaning_box_1 <-
ggplot(singapore_listings, aes(y=price)) +
geom_boxplot(colour="#00A699")+
labs(y="Price ($)")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68")) +
NULL
cleaning_box_2 <-
ggplot(singapore_listings, aes(y=minimum_nights)) +
geom_boxplot(colour="#00A699")+
labs(y="Minimum Nights")+
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68")) +
NULL
cleaning_box_3 <-
ggplot(singapore_listings, aes(y=cleaning_fee)) +
geom_boxplot(colour="#00A699")+
labs(y="Cleaning Fee ($)") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68")) +
NULL
cleaning_box_4<-
ggplot(singapore_listings, aes(y=extra_people)) +
geom_boxplot(colour="#00A699")+
labs(y="Fee for extra people ($)") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68")) +
NULL
combined_box_plots <-
ggarrange(cleaning_box_1,
cleaning_box_2,
cleaning_box_3,
cleaning_box_4,
ncol = 2, nrow = 2) +
theme_bw()+
theme(panel.border = element_rect(colour = "#FF5A5F", fill=NA, size=1))
NULL
## NULL
After looking at the above shown figures, we chose to filter the following variables:
Moreover, we determined the top 5 property types and allocated all other types into an “other” category.
#Finding the most listed property types
top_property_types <- singapore_listings %>%
group_by(property_type) %>%
summarise(count = count(property_type)) %>%
top_n(5)
#Adding column for the top 5 property types and categorising the rest as 'Other'
filtered_listings <- singapore_listings %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Apartment","Condominium", "Hostel","House", "Serviced apartment") ~ property_type,
TRUE ~ "Other"
))
# Filter dataframe by minimum nights <= 4
filtered_listings <- filtered_listings %>%
filter(minimum_nights <= 4, price <= 1500, cleaning_fee <= 200, extra_people <= 100)
In this section we will begin the correlation analysis that will assist in the development of the price regression models set-up in subsequent sections. In order to identify which variables are correlated and to what extent, we have made two correlation matrices. The first visually represent the relative strength of correlation of all variables between them considering their r-squared value, and the second graph provides more information regarding the specific distributions.
#regression analysis data set created here
regression_analysis <- filtered_listings %>%
mutate(price_4_nights =
price*4 +
cleaning_fee +
extra_people,
log_4_nights = log10(price_4_nights)
)
#create correlation matrix for variables with data type "double"
correlation_matrix <- regression_analysis %>%
summarise(
cleaning_fee,
extra_people,
minimum_nights,
number_of_reviews,
review_scores_rating,
review_scores_location,
longitude,
latitude,
security_deposit,
price_4_nights,
log_4_nights)
#have a look on the correlation matrix data
glimpse(correlation_matrix)
## Rows: 3,648
## Columns: 11
## $ cleaning_fee <dbl> 70, 15, 25, 63, 0, 35, 0, 0, 36, 0, 0, 0, 80...
## $ extra_people <dbl> 35, 25, 14, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 3...
## $ minimum_nights <dbl> 1, 2, 2, 4, 2, 1, 1, 1, 3, 2, 2, 1, 1, 2, 1,...
## $ number_of_reviews <dbl> 29, 82, 29, 11, 69, 36, 0, 0, 1, 40, 70, 0, ...
## $ review_scores_rating <dbl> 82, 94, 92, 91, 94, 94, NA, NA, 100, 96, 91,...
## $ review_scores_location <dbl> 9, 9, 9, 9, 10, 9, NA, NA, 10, 10, 10, NA, 9...
## $ longitude <dbl> 104, 104, 104, 104, 104, 104, 104, 104, 104,...
## $ latitude <dbl> 1.34, 1.36, 1.34, 1.33, 1.30, 1.35, 1.38, 1....
## $ security_deposit <dbl> 279, 140, 279, 1398, 500, 0, 0, 0, 0, 500, 5...
## $ price_4_nights <dbl> 941, 264, 263, 1175, 660, 222, 820, 672, 552...
## $ log_4_nights <dbl> 2.97, 2.42, 2.42, 3.07, 2.82, 2.35, 2.91, 2....
#glance at the correlation relationship between different variables
ggcorr(correlation_matrix)+
theme_bw() +
NULL
#show the correlations more clearly
ggpairs(correlation_matrix, columns=1:11,
upper=list(continuous='cor'),
lower=list(continuous = 'points'),
diag=list(continuous='density'),
axisLabels='show') +
theme_bw()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Singapore is a vibrant touristic city-state with the largest port by throughput consistently since 2015. Below we mapped AirBnbs in singapore with respect to their geographical location, depicting their price range, the property type and the 4 main neighbourhoods. As expected, the majority of airbnb listings are located in the city-center, with higher prices predominantely in the area. Moreover, larger accommodation types such as hostels are also concentrated in the city center. On the contrary, the northern part of Singapore seems to be significantly less developed and less touristic, as the number of listings is lower.
#mapping visualisation for listings by price
pal <- colorNumeric(palette = "RdBu", domain = c(-100:1000))
map_1 <- leaflet(data = filtered_listings) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~pal(price),
fillOpacity = 0.4,
label = ~price)
map_1
#color by neighborhood
pal <- colorFactor(palette = c("#484848", "#FC642D", "#767676", "#00A699", "#FF5A5F"),
levels = c("North-East Region", "North Region", "East Region", "West Region", "Central Region"))
map_2 <- leaflet(data = filtered_listings) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~pal(neighbourhood_group_cleansed),
fillOpacity = 0.4,
label = ~neighbourhood_group_cleansed)
#color by property type
pal <- colorFactor(palette = c("#e98a15", "#a2ad59", "#1D3461", "#538083", "#a53860"),
levels = c("Apartment","Condominium", "Hostel","House", "Serviced apartment"))
map_3 <- leaflet(data = filtered_listings) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~pal(prop_type_simplified),
fillOpacity = 0.5,
label = ~prop_type_simplified)
map_2
The key insights that can be drawn from the above figures are listed below:
Correlations between 0.40 - 0.69 that show a strong relationships between variables:
correlations between 0.30 - 0.39 indicate moderate relationship between variables:
Cleaning fee seems to have a moderate relationship with several variables such as, extra people, minimum nights and security deposit, which we definitely consider when we run the regression later.
To further investigate the relationship between the key variables We have created the following scatterplots:
We started by plotting number of reviews versus price. It can be seen that having many reviews does not affect positively the price as one would expect. Ofcourse, other factors, such as the quality of reviews (e.g. negative or positive) and the overall competition of AirBns in Singapore cannot be determined, hence lowering the significance of the above relationship.
The next plot shows the price versus review scores. An interesting insight on the graph is that only after a review score of 95% does the price seem to have a positive tendency. This could potentially signify high competition of really good AirBnbs giving a pricing advantage only to hosts with extremely good reviews.
The next scatterplot shows the price versus the property type. We have simplified this data into the following variables: house, condominium, apartment, serviced apartment, hostel, and other. A general conclusion is difficult to be made, as most types of accommodations seem to hover around a similar range of prices. Nevertheless, serviced apartment prices start at a higher price level, which could signify an incentive for hosts to improve the quality of hospitality. Moreover, the price range of hostels as expected is lower than the rest, and finally, as expected, the range of prices for apartments and condominiums is quite high.
Next, we show the price versus the room type, and as expected entire homes and private rooms can reach a higher price. Nevertheless, what’s interesting is that a large part of the dataset is located on the lower end, hence having similar prices for shared, as well as private rooms.
The next scatterplot shows the price versus the region that the property is in. The regions are groups of neighborhoods, and we are using this since the data provide a large amount of distinct neighborhoods. Here we can see how prices vary by region, and as discussed in section 3, the central region has a higher price range than the rest of the regions.
The following scatterplot shows the price versus the number of bedrooms. As the number of bedrooms increases, we can see that the price also increases slightly (excluding the brief dropoff near 7.5 bedrooms). This is unsurprising as larger properties tend to be more expensive. Moreover, the observed dropoff comes as a result of one datapoint and can be thus deemed not significant.
The next scatterplot shows the price versus the number of people the property accommodates. We would expect this plot to show a strong response from price, because usually if a property accommodates more people, it would be more expensive. The lack of a strong response could indicate that there is a large amount of smaller, luxury properties that are expensive, despite not accommodating many people. Additionally, it is possible that cheaper properties that can accommodate many people may be hostels.
Finally, the last scatterplot graphically represents the relationship between Price and cleaning fee, which seem to have a positive relationship. The higher the prices the higher the cleaning fees. The data points along the horizontal axis are from hosts that do not charge for cleaning fee, however we cannot know whether they incorporate it inflating their overall price.
In this section, we have explore the data in a different way, not necessarily trying to observe the correlation between variables. On the first graph we can see the share of the different types of accommodations across the 5 areas. As we can see apartments account for the vast majority of property types, however as we go out of the city center houses become relatively more than condominiums. Finally, in the second graph we can see that the median price is significantly increased.
#get a count of types of props in each neighborhood
prop_type_df <- filtered_listings %>%
group_by(neighbourhood_group_cleansed, property_type) %>%
summarize(Freq = n()) %>%
filter(property_type %in% c("Apartment","House","Condominium","Townhouse", "Loft"))
#find total sum of properties in each neighborhood to find the % breakdown
total_prop_count <- prop_type_df %>%
group_by(neighbourhood_group_cleansed) %>%
summarize(total = sum(Freq))
#merge the dataframes so we can compute the % breakdown
property_percent <- merge(prop_type_df, total_prop_count, by="neighbourhood_group_cleansed") %>%
mutate(percentage = Freq/total)
#create bar graph that shows % of listing types by region
ggplot(property_percent,
aes(x=neighbourhood_group_cleansed,
y=percentage, fill = property_type)) +
geom_bar(position = "dodge",stat="identity") +
xlab("Region") +
ylab("Count")+
scale_fill_discrete(name = "Property Type") +
scale_y_continuous(labels = scales::percent) +
ggtitle("Which types of Listings are there in Singapore?",
subtitle = "Map showing count of listing type of neighborhood") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68")) +
xlab("") +
ylab("Listings (%)")+
theme_bw()
# find median and mean for prop type by region
summary_price_stats <- filtered_listings %>%
group_by(prop_type_simplified, neighbourhood_group_cleansed) %>%
summarise(median_price = median(price), avg_price = mean(price))
median_price_by_hood <- ggplot(summary_price_stats,
aes(x = neighbourhood_group_cleansed,
y = median_price,
colour = prop_type_simplified)) +
geom_point(size = 8) +
coord_flip() +
labs(x = "",
y = "Median Price ($)",
title = "Does region affect median rental price?",
subtitle = "The Central region of Singapore tends to have higher median rental prices for houses, apartments, and condos"
)+
scale_color_discrete(name = "Property Type") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68")) +
theme_bw()+
NULL
median_price_by_hood
An interesting question that deserves an answer is whether being a super-host allows for a higher price setting power. To answer this question we have created two boxplots. The first one plots the price vs the area, and the second plots the price vs the property type for superhosts and non-superhosts. We can see that indeed being a superhost offers some pricing flexibility, albeit small, which is reinforced for superhosts with appartments in the city center.
# Superhost analysis
superhost_df <- filtered_listings %>%
select(price, cleaning_fee, neighbourhood_group_cleansed, host_is_superhost, prop_type_simplified) %>%
mutate(price_log = log(price)) %>%
drop_na(host_is_superhost)
superhost_df %>%
ggplot(aes(x = neighbourhood_group_cleansed, y = price_log, fill = host_is_superhost))+
geom_boxplot(width=0.5,lwd=1) +
theme_bw() +
labs(title = "Being a superhost has a significant impact in the most competitive areas",
subtitle = "Distribuition of log(price) at different areas dependent on whether the host is a superhost or not",
x = "",
y = "Price"
)+
scale_fill_discrete(name = "Superhost?") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68")) +
NULL
superhost_df %>%
ggplot(aes(x = prop_type_simplified, y = price_log, fill = host_is_superhost))+
geom_boxplot(width=0.5,lwd=1) +
theme_bw() +
labs(title = "Being a superhost has a profound impact in hostels and serviced\napartments, whilst less so in all other types",
subtitle = "Distribuition of log(price) at different property types dependent on whether the host is a superhost or not",
x = "",
y = "Log Price"
)+
scale_fill_discrete(name = "Superhost?") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68")) +
NULL
The next piece of analysis we will perform is a demand for Airbnb in Singapore. We will have a look on the demand for Airbnb in Singapore over the years as well as at the seasonality of demand.
We can clearly see some interesting patterns from the graph below:
From 2010 to around 2018, Airbnb demand indicated by reviews increases faster and faster and staged a little bit in the first half of 2018 and continued increased fast until mid 2019. There are two reasons behind the drop of demand from mid 2019: a) Singapore government announced in May 2019 that short-term rentals by platforms such as Airbnb remain illegal in Singapore b) COVID-19 came in 2020 which discourages travelers
#How popular is Airbnb in Singapore?
reviewsNum <- reviews %>%
group_by(date = reviews$date) %>%
summarise(number = n())
ggplot(reviewsNum, aes(date, number)) +
geom_point(na.rm=TRUE, color = "#56B4E9", alpha=0.5) +
geom_smooth(color = "#FF5A5F")+
ggtitle("How popular is Airbnb in Singapore?",
subtitle = "Number of Reviews across years") +
labs(x = "Year", y = "Number of Reviews", title="Number of reviews by year") +
theme_bw()+
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))+
theme(panel.border = element_rect(colour = "#FF5A5F", fill=NA, size=1))
We chose 2017,2018,2019 to see the seasonality of demand since Airbnb gains stable growth in these years and 2017,2018 and 2019 give different overall demand trends.
#demand 2017
ggplot(reviewsNum[year(reviewsNum$date) == 2017,], aes(date, number)) +
geom_point(na.rm=TRUE, color = "#56B4E9", alpha=0.5) +
geom_smooth(color = "#FF5A5F")+
ggtitle("Seasonality of Airbnb in Singapore",
subtitle = "Number of Reviews across Months in 2017") +
labs(x = "Month", y = "Unique listings recieving reviews") +
theme_bw()+
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))+
theme(panel.border = element_rect(colour = "#FF5A5F", fill=NA, size=1))
#demand 2018
ggplot(reviewsNum[year(reviewsNum$date) == 2018,], aes(date, number)) +
geom_point(na.rm=TRUE, color = "#56B4E9", alpha=0.5) +
geom_smooth(color = "#FF5A5F")+
ggtitle("Seasonality in Demand",
subtitle = "Number of Reviews across Months in 2018") +
labs(x = "Month", y = "Unique listings recieving reviews") +
theme_bw()+
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))+
theme(panel.border = element_rect(colour = "#FF5A5F", fill=NA, size=1))
#demand 2019
demand2019 <- ggplot(reviewsNum[year(reviewsNum$date) == 2019,], aes(date, number)) +
geom_point(na.rm=TRUE, color = "#56B4E9", alpha=0.5) +
geom_smooth(color = "#FF5A5F")+
ggtitle("Seasonality in Demand",
subtitle = "Number of Reviews across Months in 2019") +
labs(x = "Month", y = "Unique listings recieving reviews") +
theme_bw()+
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))+
theme(panel.border = element_rect(colour = "#FF5A5F", fill=NA, size=1))
demand2019
No matter the overall demand trend is roaring(in 2017), staging(in 2018) or fluctuating(in 2019), the charts above show strong seasonality of Airbnb demand in Singapore: bottom in Apr and top in Sep.
The final part of analysis prior to building up our models, is a heatmap of Average Daily Prices from 2019 to 2020.
# # Download calendar data about AirBnB bookings in Singapore from March 2019 to July 2020 (including later dates)
# # Selecting the needed values to reduce amount of data
a <- calendar_june_2020 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2020-06-22/data/calendar.csv.gz") %>% # June 2020
select(price, date, listing_id)
b <- calendar_may_2020 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2020-05-27/data/calendar.csv.gz") %>% # May 2020
select(price, date, listing_id)
c <- calendar_april_2020 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2020-04-26/data/calendar.csv.gz") %>% # April 2020
select(price, date, listing_id)
d <- calendar_march_2020 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2020-03-21/data/calendar.csv.gz") %>% # March 2020
select(price, date, listing_id)
e <- calendar_february_2020 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2020-02-27/data/calendar.csv.gz") %>% # February 2020
select(price, date, listing_id)
f <- calendar_january_2020 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2020-01-26/data/calendar.csv.gz") %>% # January 2020
select(price, date, listing_id)
g <- calendar_december_2019 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2019-12-28/data/calendar.csv.gz") %>% # December 2019
select(price, date, listing_id)
h <- calendar_november_2019 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2019-11-26/data/calendar.csv.gz") %>% # November 2019
select(price, date, listing_id)
i <- calendar_october_2019 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2019-10-25/data/calendar.csv.gz") %>% # October 2019
select(price, date, listing_id)
j <- calendar_september_2019 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2019-09-25/data/calendar.csv.gz") %>% # September 2019
select(price, date, listing_id)
k <- calendar_august_2019 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2019-08-28/data/calendar.csv.gz") %>% # August 2019
select(price, date, listing_id)
l <- calendar_july_2019 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2019-07-21/data/calendar.csv.gz") %>% # July 2019
select(price, date, listing_id)
m <- calendar_june_2019 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2019-06-25/data/calendar.csv.gz") %>% # June 2019
select(price, date, listing_id)
n <- calendar_may_2019 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2019-05-23/data/calendar.csv.gz") %>% # May 2019
select(price, date, listing_id)
o <- calendar_april_2019 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2019-04-20/data/calendar.csv.gz") %>% # April 2019
select(price, date, listing_id)
p <- calendar_march_2019 <- vroom("http://data.insideairbnb.com/singapore/sg/singapore/2019-03-18/data/calendar.csv.gz") %>% # March 2019
select(price, date, listing_id)
# # Merging Data from all months and deleting
calendar_add_up <- bind_rows(a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p) %>% # Combine the data sets for all 16 months
filter(date < "2020-06-01" & date > "2019-05-31") %>% # Filter out one year of data to include as much "overlapping" data as possible
mutate(price = parse_number(price)) # Remove dollar writing
# # Color Setting
r2g <- c("#D61818", "#FFAE63", "#FFFFBD", "#B5E384")
# # Create
groupedCalendar <- # Since we need the value for every single day, we need to group it
calendar_add_up %>% # Taking the cleaned data from above
filter(price < 1500) %>% # Setting an upper price limit to avoid a distorted mean
group_by(date) %>% # put the data for every single day together
summarise(totalPrice = sum(price, na.rm = TRUE), # Due to problems of the heatmap handling "mean" we need to do it the old-fashioned way
totalListings = n()) %>% # Devide the sum by the count
mutate(mean = totalPrice/totalListings) # This comes into a new column of daily average prices
calendarHeat(groupedCalendar$date, # Building the graphic by date
groupedCalendar$mean, # Taking the mean or average daily price as variable
ncolors = 99, # Giving a color range from green to red in 99 shades (unsure if fully used)
color = "r2g", # Giving a color selection (Red to Green)
varname="Average Price Levels for Singapore AirBnBs") # Naming the variable accordingly to get the right header
The heatmap above gives us a very good overview on the average prices for AirBnB rentings between June 2019 and May 2020. Among the results, we can see interesting trends and manifestations of specific events:
To begin our regression, we created a price_4_nights variable using price, cleaning_fee, guests_included and extra_people. We will be using this as our target variable. This variable represents the cost of stayinig in Singapore for 4 nights.
We looked at the distribution of price during our exploratory data analysis. We found that price lacked normality so we used log10 to normalize it. Moving forward, we will use log_4_nights as our target variable.
##create column for price of 4 nights
regression_analysis <- filtered_listings %>%
mutate(price_4_nights =
price*4 +
cleaning_fee +
extra_people,
log_4_nights = log10(price_4_nights)
)
#check histogram of new variable to check for normality
hist(regression_analysis$log_4_nights)
## log of 4 nights give a much more normal data set so we will use this moving forward
# summary stats on 4 nights by property type
favstats(~log_4_nights | prop_type_simplified, data = regression_analysis)
prop_type_simplified | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
Apartment | 1.75 | 2.42 | 2.62 | 2.86 | 3.68 | 2.64 | 0.307 | 1419 | 0 |
Condominium | 1.78 | 2.54 | 2.8 | 2.96 | 3.73 | 2.76 | 0.276 | 793 | 0 |
Hostel | 1.94 | 2.1 | 2.18 | 2.54 | 3.56 | 2.34 | 0.329 | 306 | 0 |
House | 2.02 | 2.38 | 2.59 | 2.76 | 3.6 | 2.58 | 0.317 | 306 | 0 |
Other | 1.78 | 2.48 | 2.74 | 2.94 | 3.77 | 2.73 | 0.334 | 639 | 0 |
Serviced apartment | 2.56 | 2.8 | 2.96 | 3.08 | 3.78 | 2.97 | 0.217 | 185 | 0 |
We also constructed model2, which is model1 plus room_type. The addition of room_type to our model increased our R2 from .171 to .473 and each room_type variable was significantly significant at p < .001. It is fair to say that room_type is a significant predictor of cost.
In model3, we added host_is_superhost, which was significant with a coefficient of 2.03 and translates to affecting price positively by around 107USD. This indicates that superhosts have a pricing premium.
For model4, we added the region indicator (neighbourhood_group_cleansed), which we decided to leave out of our model since it did not have a significant effect. To show how location can affect price, we decided to use latitude instead in one of our later models.
Our first 4 models are shown below:
##makes first 2 models using instructions from Kostis
model1 <- lm(log_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating,
data = regression_analysis)
model2 <- lm(log_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type,
data = regression_analysis)
#model 3 adds superhost
model3 <- lm(log_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
host_is_superhost,
data = regression_analysis)
##model 4 adds neigihbourhood group (cleansed)
model4 <- lm(log_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
host_is_superhost +
neighbourhood_group_cleansed,
data = regression_analysis)
huxreg(model1, model2, model3, model4)
(1) | (2) | (3) | (4) | |
---|---|---|---|---|
(Intercept) | 2.540 *** | 2.765 *** | 2.783 *** | 2.785 *** |
(0.043) | (0.035) | (0.036) | (0.035) | |
prop_type_simplifiedCondominium | 0.095 *** | 0.025 | 0.022 | 0.021 |
(0.016) | (0.013) | (0.013) | (0.013) | |
prop_type_simplifiedHostel | -0.312 *** | -0.084 *** | -0.083 *** | -0.100 *** |
(0.021) | (0.023) | (0.023) | (0.023) | |
prop_type_simplifiedHouse | -0.107 *** | -0.008 | -0.012 | -0.009 |
(0.022) | (0.018) | (0.018) | (0.018) | |
prop_type_simplifiedOther | 0.041 * | 0.171 *** | 0.167 *** | 0.155 *** |
(0.017) | (0.014) | (0.014) | (0.014) | |
prop_type_simplifiedServiced apartment | 0.280 *** | 0.282 *** | 0.277 *** | 0.267 *** |
(0.029) | (0.025) | (0.025) | (0.025) | |
number_of_reviews | -0.000 ** | -0.001 *** | -0.001 *** | -0.001 *** |
(0.000) | (0.000) | (0.000) | (0.000) | |
review_scores_rating | 0.001 ** | 0.001 *** | 0.001 ** | 0.001 ** |
(0.000) | (0.000) | (0.000) | (0.000) | |
room_typeHotel room | -0.378 *** | -0.380 *** | -0.373 *** | |
(0.022) | (0.022) | (0.022) | ||
room_typePrivate room | -0.369 *** | -0.371 *** | -0.355 *** | |
(0.011) | (0.011) | (0.011) | ||
room_typeShared room | -0.623 *** | -0.627 *** | -0.611 *** | |
(0.025) | (0.025) | (0.025) | ||
host_is_superhostTRUE | 0.028 * | 0.031 * | ||
(0.014) | (0.014) | |||
neighbourhood_group_cleansedEast Region | -0.014 | |||
(0.019) | ||||
neighbourhood_group_cleansedNorth-East Region | -0.083 *** | |||
(0.025) | ||||
neighbourhood_group_cleansedNorth Region | -0.122 *** | |||
(0.033) | ||||
neighbourhood_group_cleansedWest Region | -0.083 *** | |||
(0.019) | ||||
N | 2433 | 2433 | 2419 | 2419 |
R2 | 0.171 | 0.473 | 0.475 | 0.483 |
logLik | -414.940 | 136.523 | 134.989 | 154.063 |
AIC | 847.880 | -249.046 | -243.978 | -274.126 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
In model6, we added latitiude because it is a good indicator of how location can affect price. We tried adding the region variable but it did not affect the model significantly. Also, since Singapore is so small, all of the longiitudes are the same so we did not have to include those.
In model7, we added security deposit, which increased our R2 slightly.
In model8, we can see that the is_location_exact indicator did not have a statistically significant effect, so moving forward we left it out of our model. We can conclude that having the exact location is not a significant indicator of price.
In model 9, we looked at the effect of cancellation policy after controlling for other variables. We found that this variable was only significant for properties that had a super strict 60 day policy, which was only on 4 properties, so we decided to exclude this from our final model.
We chose to stick with model7 after checking for collinearity and looking at the residuals, and renamed it final_model. All of our variables align with our findings from our exploratory data analysis. Property and room type are significant in terms of pricing. Additionally, superhost properties tend to be more expensive, so we include that as a predictor. Further, the size of the property can also impact price, so we have include the number of people the property accommodates as well as the number of bedrooms. In EDA, we looked at the effects of location as well. While the cleansed neighbourhood variable did not have a significant effect on our model, we decided to use latitude as a way to incorporate location. Lastly, we found that more expensive properties tended to have higher security deposits, so we added security deposit as a predictor. Our final model had an R-squared of .597, meaning that our model represents 59.7% of the variation in price.
Models 6-9, the residuals for our final model, and a check for collinearity are shown below (the third on the table is what we chose for our final model):
##model 5 adds accommodates and bedroom count
model5 <- lm(log_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
host_is_superhost +
accommodates +
bedrooms,
data = regression_analysis)
##added latiitude
model6 <- lm(log_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
host_is_superhost +
accommodates +
bedrooms +
latitude,
data = regression_analysis)
##add security deposit
model7 <- lm(log_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
host_is_superhost +
accommodates +
bedrooms +
latitude +
security_deposit,
data = regression_analysis)
# added is_location_exact
model8 <- lm(log_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
host_is_superhost +
accommodates +
bedrooms +
latitude +
security_deposit +
is_location_exact,
data = regression_analysis)
#added cancellation policy
model9 <- lm(log_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
host_is_superhost +
accommodates +
bedrooms +
latitude +
security_deposit +
is_location_exact +
cancellation_policy,
data = regression_analysis)
##summarize these models
huxreg(model5, model6, model7, model8, model9)
(1) | (2) | (3) | (4) | (5) | |
---|---|---|---|---|---|
(Intercept) | 2.469 *** | 4.662 *** | 4.663 *** | 4.605 *** | 4.590 *** |
(0.035) | (0.200) | (0.199) | (0.202) | (0.202) | |
prop_type_simplifiedCondominium | 0.019 | 0.017 | 0.017 | 0.016 | 0.017 |
(0.012) | (0.011) | (0.011) | (0.011) | (0.011) | |
prop_type_simplifiedHostel | -0.134 *** | -0.166 *** | -0.161 *** | -0.161 *** | -0.156 *** |
(0.022) | (0.022) | (0.021) | (0.021) | (0.022) | |
prop_type_simplifiedHouse | -0.021 | -0.013 | -0.016 | -0.018 | -0.017 |
(0.016) | (0.016) | (0.016) | (0.016) | (0.016) | |
prop_type_simplifiedOther | 0.169 *** | 0.146 *** | 0.147 *** | 0.146 *** | 0.146 *** |
(0.013) | (0.013) | (0.013) | (0.013) | (0.013) | |
prop_type_simplifiedServiced apartment | 0.274 *** | 0.255 *** | 0.252 *** | 0.251 *** | 0.241 *** |
(0.022) | (0.022) | (0.022) | (0.022) | (0.022) | |
number_of_reviews | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** |
(0.000) | (0.000) | (0.000) | (0.000) | (0.000) | |
review_scores_rating | 0.002 *** | 0.002 *** | 0.002 *** | 0.002 *** | 0.002 *** |
(0.000) | (0.000) | (0.000) | (0.000) | (0.000) | |
room_typeHotel room | -0.291 *** | -0.287 *** | -0.275 *** | -0.274 *** | -0.271 *** |
(0.020) | (0.020) | (0.020) | (0.020) | (0.020) | |
room_typePrivate room | -0.274 *** | -0.247 *** | -0.234 *** | -0.234 *** | -0.232 *** |
(0.011) | (0.011) | (0.011) | (0.011) | (0.011) | |
room_typeShared room | -0.534 *** | -0.505 *** | -0.491 *** | -0.490 *** | -0.488 *** |
(0.023) | (0.023) | (0.023) | (0.023) | (0.023) | |
host_is_superhostTRUE | 0.051 *** | 0.050 *** | 0.048 *** | 0.048 *** | 0.048 *** |
(0.012) | (0.012) | (0.012) | (0.012) | (0.012) | |
accommodates | 0.019 *** | 0.019 *** | 0.019 *** | 0.019 *** | 0.019 *** |
(0.002) | (0.002) | (0.002) | (0.002) | (0.002) | |
bedrooms | 0.080 *** | 0.081 *** | 0.079 *** | 0.079 *** | 0.080 *** |
(0.006) | (0.006) | (0.006) | (0.006) | (0.006) | |
latitude | -1.677 *** | -1.682 *** | -1.649 *** | -1.641 *** | |
(0.151) | (0.150) | (0.151) | (0.151) | ||
security_deposit | 0.000 *** | 0.000 *** | 0.000 *** | ||
(0.000) | (0.000) | (0.000) | |||
is_location_exactTRUE | 0.018 | 0.017 | |||
(0.010) | (0.010) | ||||
cancellation_policymoderate | -0.019 | ||||
(0.014) | |||||
cancellation_policystrict_14_with_grace_period | 0.006 | ||||
(0.012) | |||||
cancellation_policysuper_strict_30 | 0.383 | ||||
(0.202) | |||||
cancellation_policysuper_strict_60 | 0.487 * | ||||
(0.202) | |||||
N | 2415 | 2415 | 2415 | 2415 | 2415 |
R2 | 0.571 | 0.592 | 0.597 | 0.598 | 0.600 |
logLik | 377.765 | 438.219 | 453.583 | 455.147 | 462.623 |
AIC | -725.530 | -844.439 | -873.167 | -874.294 | -881.246 |
*** p < 0.001; ** p < 0.01; * p < 0.05. |
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 2.90 5 1.11
## number_of_reviews 1.04 1 1.02
## review_scores_rating 1.11 1 1.05
## room_type 3.23 3 1.22
## host_is_superhost 1.10 1 1.05
## accommodates 1.68 1 1.30
## bedrooms 1.57 1 1.25
## latitude 1.16 1 1.08
## security_deposit 1.12 1 1.06
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.66e+00 1.99e-01 23.41 < 2e-16
## prop_type_simplifiedCondominium 1.66e-02 1.13e-02 1.47 0.14
## prop_type_simplifiedHostel -1.61e-01 2.15e-02 -7.50 9.2e-14
## prop_type_simplifiedHouse -1.57e-02 1.58e-02 -0.99 0.32
## prop_type_simplifiedOther 1.47e-01 1.28e-02 11.45 < 2e-16
## prop_type_simplifiedServiced apartment 2.52e-01 2.19e-02 11.51 < 2e-16
## number_of_reviews -6.72e-04 9.65e-05 -6.97 4.2e-12
## review_scores_rating 1.91e-03 3.45e-04 5.53 3.5e-08
## room_typeHotel room -2.75e-01 1.96e-02 -14.06 < 2e-16
## room_typePrivate room -2.34e-01 1.10e-02 -21.31 < 2e-16
## room_typeShared room -4.91e-01 2.25e-02 -21.78 < 2e-16
## host_is_superhostTRUE 4.84e-02 1.21e-02 4.01 6.3e-05
## accommodates 1.89e-02 1.79e-03 10.56 < 2e-16
## bedrooms 7.93e-02 5.95e-03 13.32 < 2e-16
## latitude -1.68e+00 1.50e-01 -11.20 < 2e-16
## security_deposit 1.07e-04 1.92e-05 5.54 3.3e-08
##
## (Intercept) ***
## prop_type_simplifiedCondominium
## prop_type_simplifiedHostel ***
## prop_type_simplifiedHouse
## prop_type_simplifiedOther ***
## prop_type_simplifiedServiced apartment ***
## number_of_reviews ***
## review_scores_rating ***
## room_typeHotel room ***
## room_typePrivate room ***
## room_typeShared room ***
## host_is_superhostTRUE ***
## accommodates ***
## bedrooms ***
## latitude ***
## security_deposit ***
##
## Residual standard error: 0.201 on 2399 degrees of freedom
## (1233 observations deleted due to missingness)
## Multiple R-squared: 0.597, Adjusted R-squared: 0.595
## F-statistic: 237 on 15 and 2399 DF, p-value: <2e-16
## 2.5 % 97.5 %
## (Intercept) 4.27e+00 5.053654
## prop_type_simplifiedCondominium -5.59e-03 0.038715
## prop_type_simplifiedHostel -2.03e-01 -0.118833
## prop_type_simplifiedHouse -4.66e-02 0.015235
## prop_type_simplifiedOther 1.22e-01 0.171857
## prop_type_simplifiedServiced apartment 2.09e-01 0.294588
## number_of_reviews -8.61e-04 -0.000483
## review_scores_rating 1.23e-03 0.002581
## room_typeHotel room -3.14e-01 -0.236914
## room_typePrivate room -2.56e-01 -0.212814
## room_typeShared room -5.35e-01 -0.446643
## host_is_superhostTRUE 2.47e-02 0.072123
## accommodates 1.54e-02 0.022413
## bedrooms 6.77e-02 0.091018
## latitude -1.98e+00 -1.387339
## security_deposit 6.89e-05 0.000144
We then used our final model to predict the cost of 1 person staying in a private room in a Singapore apartment for 4 nights. We set number of reviews to 10 and the score rating to 90. Further, we elected to have a superhost for our accommodation. We have set latitude and security deposit to the averages for our dataset. We have printed the fitted value and a 95% confidence interval below. After anti-logging our results, we predicted a 302USD price with a 95% confidence interval [282,324].
## Using our model to predict price
predictors <- data.frame(prop_type_simplified = "Apartment",
number_of_reviews = 10,
review_scores_rating = 90,
room_type = "Private room",
host_is_superhost = TRUE,
accommodates = 1,
bedrooms = 1,
latitude = 1.35,
security_deposit = 93)
predict(final_model, newdata = predictors, interval="confidence")
## fit lwr upr
## 1 2.48 2.45 2.51
Using our final model, we were able to predict a price of around $302. We are 95% confident that the cost of 4 nights (including cleaning fee and extra people) in a 1 bedroom private room in a Singapore apartment falls between 282 and 324 USD. To check our results for accuracy, we looked at the AirBnB website for similar listings and found that the average price per night for all properties was 115USD, which translates to around 460USD for 4 nights. Since we are only looking for private rooms that sleep 1, it makes sense that our prediction was a bit lower.
The goal of our project was to be able to predict the priice of staying 4 nights in a Singapore AirBnB that met the following parameters: - private room - 1 bedroom - apartment - more than 10 reviews - average review rating above 90
During our Exploratory Data Analysis, we were able to identify the effects different variables had on price. Further, we were able to identify any multicollinearity before building our regression model. After building different models and identifying our final model, our R-squared approached the 60% level.
Using our final model, we predicted that 4 nights in Singapore with our specific parameters would cost around 302 USD. We had a 95% confidence interval between 282 and 324 USD.
In the future, we would need more complete data to provide the best possible model.