首页 > 其他分享 >Real estate analysis report - data from Lianjia.com

Real estate analysis report - data from Lianjia.com

时间:2023-01-14 19:44:22浏览次数:51  
标签:Real building estate 16 ## price 2e location com

Final work for lianjia data analysis

Final work for lianjia data analysis

Ningzhi Wang

2023-01-13

This is the final data analysis report of Real-estate from Lianjia.com. Documented by Ningzhi Wang.

# pre-porcessing
lj$age <- 2019 - lj$building_year
lj$price_sqm_k <- lj$price_sqm/1000
lj$price_ttl_m <- lj$price_ttl/1000000
pattern <- "(?<line>[0-9]+)"
result <- regexpr(pattern= pattern, text = lj$line, perl=TRUE)
start <- attr(result,"capture.start")
length <- attr(result,"capture.length")
name <- attr(result,"capture.name")
lj$newline <- ifelse(start > 0, 
                     substr(lj$line, start[,name],start[,name] + length[,name]-1),NA)
length <- attr(result,"capture.length")
name <- attr(result,"capture.name")
lj$newline <- ifelse(start > 0, 
                     substr(lj$line, start[,name],start[,name] + length[,name]-1),NA)

lj$height <- lj$building_height
lj$area <- lj$building_area

Variables definition numeric: 1. price_sqm_k (price pre square meter, unit: thousand RMB) 2. price_ttl (total price, unit: 10k RMB) 2. age (years between 2019 and build date) 3. building_height (the height of the building, unit: meters) 4. bedrooms (number of bedrooms) 5. has_elevator (0 denotes haven’t elevator, 1 otherwise)

factors: 1. hml (higher floor, middle floor, lower floor) 2. station (denotes the name of the nearest railway station) 3. building_location (denotes the approximate location of the building)

1 Discriptive statistics

There are 6286 observations in the dataset. The mean price sqm is 67.76k RMB, the mean total price is 6.03 million, the mean age is 23 years, the mean building height is 12.76 meters. The average number of bedrooms is approximately 2, 45% of the houses have elevators.

Table 1 Discriptive statistics
lj %>%
  # remove duplications
  unique() %>%
  # select numeric variables
  select(price_sqm_k,price_ttl,age,building_height,bedrooms,has_elevator) %>%
  # show discription
  describe()
 

2 Single variable analysis

2.1 price hist

The price sqm is approximatly normally distributed, with a median of 64126 RMB.

p<- lj %>%
    ggplot() +
    geom_histogram(aes(price_sqm),col="white",fill="lightskyblue")
p +
  labs(title = "Figure 1.1: Histogram of Price per square meter",
       subtitle = "in RMB",
       caption = "(Source: Lianjia, Shanghai)",
       x = "Price/Square",
       y = "Counts") +
  theme(plot.title = element_text(size = 15,
                                  hjust = 0.5,
                                  vjust = 0.5,
                                  angle = 0,
                                  ),
        plot.subtitle = element_text(hjust = 0),
        plot.caption = element_text(vjust = 0.5,hjust = 0.5,color = 'darkgreen')) +
  geom_vline(aes(xintercept = median(price_sqm)),color='red',linetype="dashed") +
  geom_text(aes(x = median(price_sqm)+12000,y=-10,label = median(price_sqm)),color='red',size = 2)

The total price presents a serious right deviation distribution, but most of the observations are consentrated around the median. The median ttl is 4.3 million RMB.

#### price hist
p<- lj %>%
    ggplot() +
    geom_histogram(aes(price_ttl),col="white",fill="lightskyblue")
p +
  labs(title = "Figure 1.2: Histogram of Total Price",
       subtitle = "in 10k RMB",
       caption = "(Source: Lianjia, Shanghai)",
       x = "Total Price",
       y = "Counts") +
  theme(plot.title = element_text(size = 15,
                                  hjust = 0.5,
                                  vjust = 0.5,
                                  angle = 0,
                                  ),
        plot.subtitle = element_text(hjust = 0),
        plot.caption = element_text(vjust = 0.5,hjust = 0.5,color = 'darkgreen')) +
  geom_vline(aes(xintercept = median(price_ttl)),color='red',linetype="dashed") +
  geom_text(aes(x = median(price_ttl),y=-10,label = median(price_ttl)),color='red',size = 2)

2.2 Building Age

The median age of the buildings is 23 years, with a few outliers have the age of more than 60 years and even 90 years.

p<- lj %>%
    ggplot() +
    geom_histogram(aes(age),col="white",fill="lightskyblue")
p +
  labs(title = "Figure 2: Histogram of building age",
       subtitle = "in YEAR",
       caption = "(Source: Lianjia, Shanghai)",
       x = "Year",
       y = "Counts")+
  theme(plot.title = element_text(size = 15,
                                  hjust = 0.5,
                                  vjust = 0.5,
                                  angle = 0,
  ),
  plot.subtitle = element_text(hjust = 0),
  plot.caption = element_text(vjust = 0.5,hjust = 0.5,color = 'darkgreen')) +
  geom_vline(aes(xintercept = median(age)),color='red',linetype="dashed") +
  geom_text(aes(x = median(age)+2,y=-10,label = median(age)),color='red',size = 2)

2.3 The Number of Bedrooms

Most of the observations, have less than 3 bedrooms.

lj %>%
  mutate(bedrooms = as.factor(bedrooms)) %>%
  count(bedrooms) %>%
  ggplot(aes(x = reorder(bedrooms,-n), n)) +
  geom_bar(stat="identity",col="white",fill="lightskyblue") +
  geom_text(aes(label = n),vjust=-0.5) +
  labs(title = "Figure 3: Histogram of Bedrooms",
       caption = "(Source: Lianjia, Shanghai)",
       x = "Number of Bedrooms",
       y = "Counts")+
  theme(plot.title = element_text(size = 15,
                                  hjust = 0.5,
                                  vjust = 0.5,
                                  angle = 0,
  ),
  plot.caption = element_text(vjust = 0.5,hjust = 0.5,color = 'darkgreen'))

#### 2.4 Building Height A large number of observations are of the height of 7 meters, while the others are relatively uniformly distributed between 3 and 40 meters. This might due to problems left over from history, that is, there are still a large number of low-rise houses with small property rights.

p<- lj %>%
    ggplot() +
    geom_histogram(aes(building_height),col="white",fill="lightskyblue")
p +
  labs(title = "Figure 2: Histogram of building height",
       subtitle = "in Meters",
       caption = "(Source: Lianjia, Shanghai)",
       x = "Meter",
       y = "Counts")+
  theme(plot.title = element_text(size = 15,
                                  hjust = 0.5,
                                  vjust = 0.5,
                                  angle = 0,
  ),
  plot.subtitle = element_text(hjust = 0),
  plot.caption = element_text(vjust = 0.5,hjust = 0.5,color = 'darkgreen')) +
  geom_vline(aes(xintercept = median(building_height)),color='red',linetype="dashed") +
  geom_text(aes(x = median(building_height)+2,y=-10,label = median(building_height)),color='red',size = 2)

3 Multi Variables

3.1 elevator & bedrooms

An obvious trend is that the more bedrooms an apartment have, the more likely it has elevators. The percentage of having elevators is less than half only for observations with 1 or 2 bedrooms.

nl <- table(lj$has_elevator, lj$bedrooms, dnn = c("has_elevator","bedrooms")) %>%
  prop.table(margin = 2) %>%
  round(2) %>%
  as.data.frame(responseName = "prop") %>%
  arrange(has_elevator, prop) %>%
  filter(has_elevator == 1) %>%
  pull(bedrooms) %>%
  as.character()

lj %>%
  ggplot(aes(x=fct_relevel(factor(bedrooms),nl), fill=factor(has_elevator))) + 
  geom_bar(position = "fill") +
  labs(title = "Figure 4: Bedrooms and Elevator",
       caption = "(Source: Lianjia, Shanghai)",
       x = "Bedrooms",
       y = "Have elevator (%)")+
  theme(plot.title = element_text(size = 15,
                                  hjust = 0.5,
                                  vjust = 0.5,
                                  angle = 0,
  ),
  plot.caption = element_text(vjust = 0.5,hjust = 0.5,color = 'darkgreen'))

3.2 elevator & lines

The line with the lowest proportion of elevators is the extension section of Line 5, and the line with the highest proportion is the Pujiang Line. Generally speaking, the proportion of suburban houses with elevators is small.

nl <- table(lj$has_elevator, lj$line, dnn = c("has_elevator","line")) %>%
  prop.table(margin = 2) %>%
  round(2) %>%
  as.data.frame(responseName = "prop") %>%
  arrange(has_elevator, prop) %>%
  filter(has_elevator == 1) %>%
  pull(line) %>%
  as.character()
lj %>%
  ggplot(aes(x=fct_relevel(factor(line),nl), fill=factor(has_elevator))) + 
  geom_bar(position = "fill") +
  labs(title = "Figure 5: Line and Elevator",
       caption = "(Source: Lianjia, Shanghai)",
       x = "Line",
       y = "Have elevator (%)")+
  theme(plot.title = element_text(size = 15,
                                  hjust = 0.5,
                                  vjust = 0.5,
                                  angle = 0,
  ),
  plot.caption = element_text(vjust = 0.5,hjust = 0.5,color = 'darkgreen'),
  axis.text.x = element_text(angle=30,size = 6,vjust=1,hjust=1))

3.3 building area and ttl

Obviously, the larger the building area, the higher the total price of the house. However, this phenomenon is more intuitive in houses below 400 square meters. If you only look at houses over 400 square meters, this positive correlation will obviously weaken or even disappear.

temp <- lj %>%
  filter(building_area > 400)

temp2 <- lj %>%
  filter(building_area>200 & price_ttl < 1000)

p <- ggplot() +
  geom_point(data=lj, aes(x=building_area, y=price_ttl), shape=1,alpha = 0.1) +
  geom_point(data=temp, aes(x=building_area, y=price_ttl),
             color='red',shape=24) +
  geom_point(data = temp2, aes(x=building_area, y=price_ttl),
             color="blue",shape = 21)
p +
  labs(title = "Figure 6: Building area and TTL",
       caption = "(Source: Lianjia, Shanghai)",
       x = "Building area (m2)",
       y = "Price TTL (10k)")+
  theme(plot.title = element_text(size = 15,
                                  hjust = 0.5,
                                  vjust = 0.5,
                                  angle = 0,
  ),
  plot.caption = element_text(vjust = 0.5,hjust = 0.5,color = 'darkgreen'))

3.4 Building area, TTL, elevators & Line

In all lines, the total price of houses rises with the increase of construction area. In most lines, the increase is steeper in the group with elevators, that is, the square meter price of houses with elevators is higher. However, Line 17, Line 5 and Line 6 are exceptions. The square meter price of groups with and without elevators is similar.

p <- ggplot(lj, aes(x=building_area, y=price_ttl, color = factor(has_elevator))) +
  geom_point(shape=1) +
  geom_smooth(method="lm") +
  facet_wrap( ~ line)+
  coord_cartesian(xlim = c(0,100), ylim=c(0,1000)) +
  theme(text=element_text(size=10))

p +
  labs(title = "Figure 7: Building area, TTL, elevators & Line",
       caption = "(Source: Lianjia, Shanghai)",
       x = "Building area (m2)",
       y = "Price TTL (10k)")+
  theme(plot.title = element_text(size = 15,
                                  hjust = 0.5,
                                  vjust = 0.5,
                                  angle = 0,
  ),
  plot.caption = element_text(vjust = 0.5,hjust = 0.5,color = 'darkgreen'))

3.5 number of bedrooms & sqm

When the number of bedrooms is less than 5, the more the number of bedrooms, the higher the price of square meters, which also represents the rule of most samples. When the number of bedrooms is 6-7, the price of square meters will gradually decrease. However, when the number of bedrooms is 9, the price per square meter is the highest.

boxplot(lj$price_sqm_k ~ lj$bedrooms,xlab="",ylab="")
title(main = "Figure 8: Bedrooms and SQM", xlab = "Number of bedrooms", ylab = "Price SQM")

The following figure shows the relationship between the square meter price of the house and the number of rooms. The price of most houses per square meter is less than 100k RMB. When the number of rooms is less than 3, the price per square meter is the most in the range of about 700-800k RMB.

p <-
  lj %>%
  ggplot() +
  geom_freqpoly(aes(price_sqm_k,color=as.factor(bedrooms))) +
  theme(text = element_text(size=12,face="bold")) 
p +
  labs(title = "Figure 9: distribution of SQM group by Bedrooms",
       caption = "(Source: Lianjia, Shanghai)",
       x = "Price SQM (k)",
       y = "Counts")+
  theme(plot.title = element_text(size = 15,
                                  hjust = 0.5,
                                  vjust = 0.5,
                                  angle = 0,
  ),
  plot.caption = element_text(vjust = 0.5,hjust = 0.5,color = 'darkgreen'))

3.6 hml & price

The following figure shows the distribution of the square meter price of houses with different floors in the group with and without elevators.

In most groups, the square meter price of houses is concentrated in the range of 5-7k RMB. However, for the groups without elevators, the price of high-rise buildings tends to be left, that is, the unit price of square meters is lower, while the unit price of square meters of lower floors is higher.

Strangely, in the group without elevators, the number of houses on sale in the high-rise is more, while the number of houses on sale in the lower floors is the least.

p <- lj %>%
  filter(hml %in% c("高","中","低")) %>%
  ggplot() +
  geom_freqpoly(aes(price_sqm_k,color=hml)) +
  facet_wrap(~as.factor(has_elevator)) +
  theme(text = element_text(size=12,face="bold"))
p +
  labs(title = "Figure 10: Price distribution group by Height/elevators",
       caption = "(Source: Lianjia, Shanghai)",
       x = "Price TTL (10k)",
       y = "Counts")+
  theme(plot.title = element_text(size = 15,
                                  hjust = 0.5,
                                  vjust = 0.5,
                                  angle = 0,
  ),
  plot.caption = element_text(vjust = 0.5,hjust = 0.5,color = 'darkgreen'))

#### 3.7 metro lines & sqm The price of Line 17 and Line 5 is obviously low, which may be due to the main layout in the suburbs.

boxplot(lj$price_sqm_k ~ lj$newline, xlab="Line", ylab="price SQM (k)")
title(main = "Figure 11: Subway Line and price SQM")

3.8 metro lines & bedrooms

The number of bedrooms on most lines is 2, only 3 for Line 17. This is interesting, because we have just analyzed that the unit price per square meter of Line 17 is the lowest, so the low unit price per square meter of Line 17 is probably due to the larger area.

boxplot(lj$bedrooms ~ lj$newline, xlab="Line", ylab="Bedrooms")
title(main = "Figure 12: Subway Line and Bedrooms")

3.9 decoration & elevator & area & price

We will consider the decoration style, elevator and housing area together. As shown in the figure below, the hardbound room has a higher unit price and total price than the simple decoration and blank.

Houses with elevators tend to have larger area (higher position) and higher total price (larger point size). However, in terms of unit price, this picture cannot show the obvious difference between the houses with elevators and without elevators. It can only be roughly reflected in the area with the highest unit price, which is basically the hardbound elevator apartment.

qplot(price_sqm_k,building_area,data = lj, color = factor(has_elevator),shape = factor(decoration), size = price_ttl_m, main = "Figure 13: Area & TTL & Elevator & Decoration", ylab = "Building area (m2)", xlab = "price SQM (k)")

#### 3.10 building height & hml & elevator & price It can be seen from this picture that the houses without elevators will not exceed a certain height (about 8 meters), and the houses with elevators will not be hindered by this. It can’t be seen from this picture that there is a clear relationship between the height of the house and the total price. We can see some high price houses with very low height from the picture, which may be villas. Therefore, the type of the house may also be an important factor affecting the price.

qplot(price_sqm_k,building_height,data = lj, color = factor(has_elevator),shape = factor(decoration), size = price_ttl_m, main = "Figure 13: Height & TTL & Elevator & Decoration", ylab = "Building height (m)", xlab = "price SQM (k)")

3.11 numbers based on line

Line 8 has the largest number of houses on sale, close to 800 sets. Line 17 is the least, less than 50 sets.

temp <- count(lj,line)

p <-
  temp %>%
  ggplot() +
  geom_col(aes(x = reorder(line,-n),y=n),fill = "lightskyblue",alpha = 0.5) +
  theme(axis.text.x = element_text(angle=30,size = 6,vjust=1,hjust=1))

p +
  labs(title = "Figure 14: Count across lines",
       caption = "(Source: Lianjia, Shanghai)",
       x = "Line",
       y = "Counts")+
  theme(plot.title = element_text(size = 15,
                                  hjust = 0.5,
                                  vjust = 0.5,
                                  angle = 0,
  ),
  plot.caption = element_text(vjust = 0.5,hjust = 0.5,color = 'darkgreen'))

3.12 price across lines

Next, we will divide the price into five sections, and then analyze the percentage of house prices on different subway lines in these five sections. Obviously, the houses on Line 17, Line 11, Line 5 and Pujiang Line are generally low. The distribution of housing prices on other lines is relatively uniform. This reminds us again that location is very important for house pricing.

lj <- mutate(lj,price_sqm_hl = cut(price_sqm,
                                   quantile(price_sqm,
                                            probs = seq(0,1,length.out = 6)),
                                   labels = c("lowest","below average","average","higher than average","highest"),
                                   include.lowest = TRUE))

ggplot(lj,aes(x = line,fill=price_sqm_hl)) +
  geom_bar(position = "fill") +
  scale_fill_brewer("price", type="div", palette="PiYG") +
  labs(title = "Figure 15: Line and Price SQM",
       caption = "(Source: Lianjia, Shanghai)",
       x = "Line",
       y = "Price SQM (%)")+
  theme(plot.title = element_text(size = 15,
                                  hjust = 0.5,
                                  vjust = 0.5,
                                  angle = 0,
  ),
  plot.caption = element_text(vjust = 0.5,hjust = 0.5,color = 'darkgreen'),
  axis.text.x = element_text(angle=30,size = 6,vjust=1,hjust=1))

4 Econometric Model

The square meter price of hardcover is the highest, followed by simple decoration, and then the blank and others. At the same time, the rate of elevator and the height of the house are the highest. But the housing age of the blank is the lowest. ##### Table 3 summary stats group by decoration

library(plotly)
## 
## 载入程辑包:'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
lj$area <- lj$building_area
lj$height <- lj$building_height
newlj <- lj[,c('decoration','price_sqm','price_ttl','bedrooms','has_elevator','height','age','area')]

newlj %>% 
  group_by(decoration) %>% 
  group_modify(~{
    .x %>% 
      purrr::map_dfc(mean, na.rm = TRUE)
  }) %>% ungroup() 
 
Table 4 Difference test for Blank and simple decoration

Because simple decoration has disadvantages in terms of age, elevator and number of bedrooms, although the average unit price is higher, we still need to further analyze whether there are actual differences between simple decoration and rough decoration. The following table shows that the blank is significantly lower than the simple SQM, but the total price not statistically significant. Nevertheless, the difference between simple decoration and blank is significant in most variables, indicating that there are systematic differences between the two groups, not just decoration.

newlj %>% 
  filter(decoration %in% c("毛坯","简装")) %>%
  pivot_longer(cols = -decoration, names_to = "variable", values_to = "value") %>% 
  group_nest(variable) %>% 
  mutate(t.test = map(data, ~ tidy(t.test(value ~ decoration, data = .x)))) %>% 
  unnest(t.test) %>% 
  select(-data)
 
Table 5 Difference test for Hardbound and simple decoration

The roughcast room is younger than the hardbound room, but the number of elevators is less, but the difference is not significant. The number of space and bedrooms in the roughcast room is less, the floor is lower, and the total price and unit price are lower. It can be speculated that the investment purpose of rough housing is more obvious.

newlj %>% 
  filter(decoration %in% c("毛坯","精装")) %>%
  pivot_longer(cols = -decoration, names_to = "variable", values_to = "value") %>% 
  group_nest(variable) %>% 
  mutate(t.test = map(data, ~ tidy(t.test(value ~ decoration, data = .x)))) %>% 
  unnest(t.test) %>% 
  select(-data)
 
Table 6 Difference test of floor height
newlj <- lj[,c('hml','price_sqm','price_ttl','bedrooms','has_elevator','height','age','area')]
newlj %>% 
  filter(hml %in% c("高","低")) %>%
  pivot_longer(cols = -hml, names_to = "variable", values_to = "value") %>% 
  group_nest(variable) %>% 
  mutate(t.test = map(data, ~ tidy(t.test(value ~ hml, data = .x)))) %>% 
  unnest(t.test) %>% 
  select(-data)
 
Table 7 Regression model1

The housing area, housing year and housing height are significantly related to the price. The price of the square meter will increase by 59 RMB for every square meter of the house area. The price of the square meter will increase by 625 RMB for each year the age of the house increases. The price of the square meter will increase by 959 RMB for every 1 meter increase in the height of the house. Having elevator will increase SQM by 3887 RMB. The price of slab building is 11784 RMB higher than that of tower building. At present, the R square of the model is only 0.21.

smodel <- lm(price_sqm ~ decoration + building_area + age + building_height + building_style + has_elevator, data = lj)
summary(smodel)
## 
## Call:
## lm(formula = price_sqm ~ decoration + building_area + age + building_height + 
##     building_style + has_elevator, data = lj)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69482 -12731  -2312  10208 158973 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            17297.748   1719.064  10.062  < 2e-16 ***
## decoration毛坯         -3046.074   1626.208  -1.873  0.06110 .  
## decoration简装          3496.733   1111.306   3.147  0.00166 ** 
## decoration精装         11060.496   1097.591  10.077  < 2e-16 ***
## building_area             59.349      6.353   9.342  < 2e-16 ***
## age                      625.533     27.582  22.679  < 2e-16 ***
## building_height          959.335     48.862  19.633  < 2e-16 ***
## building_style板塔结合  2229.049   3817.608   0.584  0.55932    
## building_style板楼     11783.503    887.616  13.275  < 2e-16 ***
## has_elevator            3887.464    975.884   3.984 6.86e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20910 on 6517 degrees of freedom
## Multiple R-squared:  0.2186, Adjusted R-squared:  0.2175 
## F-statistic: 202.6 on 9 and 6517 DF,  p-value: < 2.2e-16
Table 8 Regression model2

Adding direction factor does little to increase R square.

smodel <- lm(price_sqm ~ decoration + building_area + age + building_height + building_style + has_elevator + directions1, data = lj)
summary(smodel)
## 
## Call:
## lm(formula = price_sqm ~ decoration + building_area + age + building_height + 
##     building_style + has_elevator + directions1, data = lj)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71135 -12800  -2356  10243 159099 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            19100.484   2293.957   8.326  < 2e-16 ***
## decoration毛坯         -3159.947   1627.550  -1.942  0.05224 .  
## decoration简装          3402.419   1112.908   3.057  0.00224 ** 
## decoration精装         11010.689   1098.645  10.022  < 2e-16 ***
## building_area             58.559      6.378   9.181  < 2e-16 ***
## age                      626.835     27.680  22.646  < 2e-16 ***
## building_height          960.639     49.059  19.581  < 2e-16 ***
## building_style板塔结合  1830.228   3823.178   0.479  0.63215    
## building_style板楼     11748.499    923.142  12.727  < 2e-16 ***
## has_elevator            3914.002    977.196   4.005 6.26e-05 ***
## directions1东北         4964.305   4551.641   1.091  0.27546    
## directions1东南        -2713.642   2033.799  -1.334  0.18216    
## directions1北          -3922.270   2431.405  -1.613  0.10676    
## directions1南          -1651.089   1628.505  -1.014  0.31068    
## directions1暂无数据    -3935.610   3079.261  -1.278  0.20126    
## directions1西          -6753.541   2885.790  -2.340  0.01930 *  
## directions1西北         5756.020   3628.808   1.586  0.11274    
## directions1西南        -1369.858   2254.472  -0.608  0.54346    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20900 on 6509 degrees of freedom
## Multiple R-squared:  0.2205, Adjusted R-squared:  0.2185 
## F-statistic: 108.3 on 17 and 6509 DF,  p-value: < 2.2e-16
Table 9 Regression model2

In order to increase the explanatory power of the model, we have added decoration, orientation and location. At this time, the coefficient of house age is completely reversed, the house age increases by 1 year, and the square meter price decreases by 166 yuan.

When the height increase by 1m, price SQM will increase by 14 yuan, which is economically and statistically not significant.

The coefficient of the building area is still positive, with an increase of 1 square meter and an increase of 45 yuan per square meter. Meaning that smaller house is generally more cost effective.

The coefficients of building_style remain robust, The price of slab building is 10090 RMB higher than that of tower building.

In addition, the hardbound houses are 2192 yuan per square meter more expensive than the category “others”, and the price of simple decoration and rough decoration is lower than that of “other” square meters.

Another obvious phenomenon is that location plays a key role in improving the interpretation of the model. The R square of the model has changed to 0.68, that is, the model can already explain the change of 68% of dependent variables.

smodel <- lm(price_sqm ~ decoration + building_area + age + building_height + building_style + has_elevator + directions1 + building_location, data = lj)
summary(smodel)
## 
## Call:
## lm(formula = price_sqm ~ decoration + building_area + age + building_height + 
##     building_style + has_elevator + directions1 + building_location, 
##     data = lj)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -73274  -6819   -817   5320 146731 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  43645.622   2405.676  18.143  < 2e-16 ***
## decoration毛坯               -2287.401   1076.057  -2.126 0.033565 *  
## decoration简装               -1726.275    735.183  -2.348 0.018900 *  
## decoration精装                2192.532    730.383   3.002 0.002693 ** 
## building_area                   25.639      4.334   5.915 3.49e-09 ***
## age                           -166.492     21.771  -7.648 2.35e-14 ***
## building_height                 14.461     35.640   0.406 0.684943    
## building_style板塔结合        6344.943   2491.602   2.547 0.010903 *  
## building_style板楼           10090.657    607.627  16.607  < 2e-16 ***
## has_elevator                  4901.106    655.808   7.473 8.87e-14 ***
## directions1东北               8660.339   2941.382   2.944 0.003248 ** 
## directions1东南               2625.033   1317.771   1.992 0.046411 *  
## directions1北                 1849.584   1579.474   1.171 0.241638    
## directions1南                 6119.258   1063.998   5.751 9.27e-09 ***
## directions1暂无数据           4884.634   2006.346   2.435 0.014936 *  
## directions1西                 -545.125   1873.345  -0.291 0.771068    
## directions1西北               7694.710   2345.217   3.281 0.001040 ** 
## directions1西南               2533.453   1463.957   1.731 0.083580 .  
## building_location万体馆      14747.849   2549.809   5.784 7.65e-09 ***
## building_location万里        -8704.415   3387.124  -2.570 0.010197 *  
## building_location三林        -5367.893   2325.109  -2.309 0.020994 *  
## building_location上大       -14808.499   4178.112  -3.544 0.000396 ***
## building_location上海南站     -161.306   7961.377  -0.020 0.983836    
## building_location不夜城      11116.995   2622.257   4.239 2.27e-05 ***
## building_location世博          849.983   2162.927   0.393 0.694349    
## building_location世博滨江    31672.742   2762.580  11.465  < 2e-16 ***
## building_location东外滩       8525.590   2998.197   2.844 0.004475 ** 
## building_location中原        -4203.897   2283.709  -1.841 0.065695 .  
## building_location中山公园    13434.000   2209.587   6.080 1.27e-09 ***
## building_location中远两湾城   4092.138   3662.150   1.117 0.263860    
## building_location丰庄       -10792.493   3387.646  -3.186 0.001450 ** 
## building_location临平路      17463.239   2416.834   7.226 5.57e-13 ***
## building_location临港新城   -36110.405   9667.802  -3.735 0.000189 ***
## building_location九亭       -19642.851   4319.306  -4.548 5.52e-06 ***
## building_location五角场       6033.174   2530.083   2.385 0.017128 *  
## building_location五里桥      26080.914   2384.085  10.940  < 2e-16 ***
## building_location人民广场    32836.311   3482.366   9.429  < 2e-16 ***
## building_location仙霞        14796.282   3184.865   4.646 3.46e-06 ***
## building_location光新         2780.793   2853.377   0.975 0.329815    
## building_location共富       -19737.999   5792.424  -3.408 0.000660 ***
## building_location共康        -9887.460   4858.967  -2.035 0.041903 *  
## building_location凉城         1301.919   2819.896   0.462 0.644319    
## building_location北外滩      12726.121   2893.720   4.398 1.11e-05 ***
## building_location北新泾        616.627   2791.525   0.221 0.825183    
## building_location北蔡          854.339   2190.706   0.390 0.696562    
## building_location华东理工    -5185.179   7962.014  -0.651 0.514915    
## building_location南京东路    19089.525   3790.812   5.036 4.89e-07 ***
## building_location南京西路    41392.866   2422.507  17.087  < 2e-16 ***
## building_location南桥       -36118.037   3957.092  -9.127  < 2e-16 ***
## building_location南码头        179.224   2975.345   0.060 0.951969    
## building_location南翔       -15850.782   4648.957  -3.410 0.000655 ***
## building_location古北         9087.770   3353.919   2.710 0.006755 ** 
## building_location古美        -2776.383   2489.092  -1.115 0.264713    
## building_location周家嘴路     4107.279   2234.102   1.838 0.066043 .  
## building_location周浦       -31133.859  13550.700  -2.298 0.021618 *  
## building_location唐镇         9063.242   5113.240   1.773 0.076359 .  
## building_location嘉定新城   -29908.725   3068.833  -9.746  < 2e-16 ***
## building_location嘉定老城   -23139.900   6293.944  -3.677 0.000238 ***
## building_location四川北路    14501.120   2919.161   4.968 6.96e-07 ***
## building_location塘桥         5508.433   2452.735   2.246 0.024749 *  
## building_location夏阳       -28810.918   3619.509  -7.960 2.02e-15 ***
## building_location外高桥     -16748.820   3074.555  -5.448 5.30e-08 ***
## building_location大华        -8626.480   3173.302  -2.718 0.006577 ** 
## building_location大场镇     -14589.932   5108.708  -2.856 0.004305 ** 
## building_location大宁        16331.229   2707.896   6.031 1.72e-09 ***
## building_location天山        15954.265   3347.804   4.766 1.92e-06 ***
## building_location安亭       -34822.475   3030.017 -11.493  < 2e-16 ***
## building_location川沙       -18117.959   2898.129  -6.252 4.33e-10 ***
## building_location康健         6649.968   3220.813   2.065 0.038993 *  
## building_location康桥       -24045.357   5118.613  -4.698 2.69e-06 ***
## building_location建国西路    57207.771   3453.421  16.566  < 2e-16 ***
## building_location张庙       -10029.662   3860.685  -2.598 0.009401 ** 
## building_location张江        -2854.781   2991.691  -0.954 0.340000    
## building_location彭浦        -7101.773   2813.391  -2.524 0.011618 *  
## building_location徐家汇      28708.070   2095.708  13.699  < 2e-16 ***
## building_location徐汇滨江    38249.628   4511.177   8.479  < 2e-16 ***
## building_location徐泾       -20662.953   6972.463  -2.964 0.003053 ** 
## building_location御桥        -3584.904   5104.445  -0.702 0.482511    
## building_location惠南       -32794.261   4059.082  -8.079 7.74e-16 ***
## building_location打浦桥      26255.650   2447.433  10.728  < 2e-16 ***
## building_location控江路       4811.287   2439.687   1.972 0.048642 *  
## building_location斜土路      20813.991   2341.305   8.890  < 2e-16 ***
## building_location新华路      12994.942   2335.179   5.565 2.73e-08 ***
## building_location新场       -31953.284   9670.512  -3.304 0.000958 ***
## building_location新天地      64851.251   2493.307  26.010  < 2e-16 ***
## building_location新江湾城    18539.282   3968.932   4.671 3.06e-06 ***
## building_location曲阳        10301.275   2297.362   4.484 7.46e-06 ***
## building_location曹家渡      38159.951   2351.925  16.225  < 2e-16 ***
## building_location曹杨         3123.979   3066.788   1.019 0.308408    
## building_location曹路       -21647.576   3392.460  -6.381 1.88e-10 ***
## building_location月浦       -24400.822   4661.222  -5.235 1.70e-07 ***
## building_location朱家角     -32774.869  13536.258  -2.421 0.015494 *  
## building_location杨东        20253.876   6984.642   2.900 0.003747 ** 
## building_location杨行       -24155.861   3495.266  -6.911 5.28e-12 ***
## building_location松江大学城 -21587.653   3454.463  -6.249 4.39e-10 ***
## building_location松江新城   -31072.754   4314.180  -7.202 6.60e-13 ***
## building_location松江老城   -27075.233   3774.012  -7.174 8.10e-13 ***
## building_location桃浦       -16215.084   2900.538  -5.590 2.36e-08 ***
## building_location梅陇        -8139.946   3691.789  -2.205 0.027498 *  
## building_location武宁        11659.144   2545.925   4.580 4.75e-06 ***
## building_location永和        -5213.834   4064.161  -1.283 0.199581    
## building_location江宁路      30953.538   2610.616  11.857  < 2e-16 ***
## building_location江桥       -22397.254   3518.404  -6.366 2.08e-10 ***
## building_location江湾镇      -1482.512   4067.274  -0.364 0.715499    
## building_location泗泾       -26782.216   4320.518  -6.199 6.04e-10 ***
## building_location洋泾        20127.180   2669.956   7.538 5.42e-14 ***
## building_location浦江       -29007.985   2653.964 -10.930  < 2e-16 ***
## building_location淞南       -15255.249   3689.763  -4.134 3.60e-05 ***
## building_location淞宝       -16278.401   2910.645  -5.593 2.33e-08 ***
## building_location淮海中路    51061.573   3184.935  16.032  < 2e-16 ***
## building_location源深        37929.943   2334.096  16.250  < 2e-16 ***
## building_location漕河泾      -1312.652   5798.952  -0.226 0.820928    
## building_location潍坊        35280.670   2436.371  14.481  < 2e-16 ***
## building_location甘泉宜川    -2568.374   2671.962  -0.961 0.336472    
## building_location田林        17363.113   2754.129   6.304 3.09e-10 ***
## building_location白鹤       -43532.041   5815.660  -7.485 8.10e-14 ***
## building_location盈浦       -31107.232   4178.618  -7.444 1.10e-13 ***
## building_location真如         2415.493   2561.257   0.943 0.345671    
## building_location碧云        23375.377   3862.815   6.051 1.52e-09 ***
## building_location罗店       -30062.459   3624.256  -8.295  < 2e-16 ***
## building_location老西门      25559.970   2926.753   8.733  < 2e-16 ***
## building_location老闵行     -22446.600   2575.107  -8.717  < 2e-16 ***
## building_location联洋        27611.820   4484.247   6.158 7.84e-10 ***
## building_location航华       -12011.074   4469.006  -2.688 0.007215 ** 
## building_location花木        19723.082   2560.051   7.704 1.52e-14 ***
## building_location莘庄       -10104.978   2819.306  -3.584 0.000341 ***
## building_location菊园新区   -35264.641   4486.218  -7.861 4.46e-15 ***
## building_location董家渡      22133.859   3328.633   6.650 3.18e-11 ***
## building_location蓬莱公园    24326.590   2366.382  10.280  < 2e-16 ***
## building_location虹桥        11036.246   2754.459   4.007 6.23e-05 ***
## building_location衡山路      53733.537   2584.602  20.790  < 2e-16 ***
## building_location西渡       -33557.875   2971.054 -11.295  < 2e-16 ***
## building_location西藏北路    10636.501   2248.218   4.731 2.28e-06 ***
## building_location西郊        -2368.503   3953.318  -0.599 0.549116    
## building_location豫园        26666.625   3167.543   8.419  < 2e-16 ***
## building_location赵巷       -21409.297   7964.166  -2.688 0.007202 ** 
## building_location通河        -7537.221   6284.406  -1.199 0.230435    
## building_location金杨          195.189   2278.621   0.086 0.931739    
## building_location金桥        -2216.213   2107.488  -1.052 0.293028    
## building_location金汇        -4502.825   4059.868  -1.109 0.267426    
## building_location金虹桥        207.337   4183.392   0.050 0.960473    
## building_location镇宁路      30788.234   3282.787   9.379  < 2e-16 ***
## building_location长寿路      14848.587   2421.418   6.132 9.19e-10 ***
## building_location长征        -4780.398   3291.073  -1.453 0.146402    
## building_location长风         8129.731   3177.134   2.559 0.010526 *  
## building_location闸北公园     5325.018   2844.816   1.872 0.061276 .  
## building_location阳城         1314.601   4316.355   0.305 0.760709    
## building_location陆家嘴      26474.307   2382.375  11.113  < 2e-16 ***
## building_location青村       -40540.570  13521.736  -2.998 0.002726 ** 
## building_location静安寺      36882.564   3117.174  11.832  < 2e-16 ***
## building_location静安新城      -22.740   3692.119  -0.006 0.995086    
## building_location鞍山        12423.778   2246.515   5.530 3.32e-08 ***
## building_location顾村       -22040.540   3157.999  -6.979 3.27e-12 ***
## building_location颛桥       -17165.169   2941.830  -5.835 5.65e-09 ***
## building_location高东       -25833.705   6994.529  -3.693 0.000223 ***
## building_location高境       -12583.131   3770.957  -3.337 0.000852 ***
## building_location高行       -15801.077   3855.794  -4.098 4.22e-05 ***
## building_location鲁迅公园     6084.322   2633.743   2.310 0.020912 *  
## building_location黄兴公园     2845.899   2280.869   1.248 0.212177    
## building_location黄浦滨江    50144.006   4656.456  10.769  < 2e-16 ***
## building_location龙华        19648.811   2265.831   8.672  < 2e-16 ***
## building_location龙柏        -2953.139   4177.466  -0.707 0.479641    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13380 on 6365 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.6797 
## F-statistic: 87.01 on 161 and 6365 DF,  p-value: < 2.2e-16

5 Machine Learning Model

From the above analysis, we can draw a preliminary conclusion that the region is the key factor affecting the housing price, because it can explain most of the variation in SQM.

Therefore, I want to determine whether the SQM of a house in a location will be greater than the average SQM of all houses in a specific location.

I calculated a variable price_ above_ Avg, if the square meter price of the sample is greater than the average value of the region where the sample is located, then the variable is 1, and vice versa.

Although all the data are used, the ROC value of the model is still only 0.21. Therefore, from the perspective of the model, it is difficult to use the existing data to predict whether a house is higher than the average price in the region.

Compared with machine learning model, OLS model can reflect higher explanatory power.

library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 1.0.0 --
## v dials        1.1.0     v rsample      1.1.1
## v infer        1.0.4     v tune         1.0.1
## v modeldata    1.0.1     v workflows    1.1.2
## v parsnip      1.0.3     v workflowsets 1.0.0
## v recipes      1.0.4     v yardstick    1.1.0
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x psych::%+%()          masks ggplot2::%+%()
## x yardstick::accuracy() masks generics::accuracy()
## x scales::alpha()       masks psych::alpha(), ggplot2::alpha()
## x broom::bootstrap()    masks modelr::bootstrap()
## x infer::calculate()    masks generics::calculate()
## x scales::discard()     masks purrr::discard()
## x pastecs::extract()    masks tidyr::extract()
## x plotly::filter()      masks dplyr::filter(), stats::filter()
## x pastecs::first()      masks dplyr::first()
## x recipes::fixed()      masks stringr::fixed()
## x infer::generate()     masks generics::generate()
## x infer::hypothesize()  masks generics::hypothesize()
## x dplyr::lag()          masks stats::lag()
## x pastecs::last()       masks dplyr::last()
## x yardstick::mae()      masks modelr::mae()
## x yardstick::mape()     masks modelr::mape()
## x dials::prune()        masks generics::prune()
## x yardstick::rmse()     masks modelr::rmse()
## x yardstick::spec()     masks readr::spec()
## x infer::specify()      masks generics::specify()
## x recipes::step()       masks stats::step()
## x infer::visualize()    masks generics::visualize()
## * Use suppressPackageStartupMessages() to eliminate package startup messages
library(nycflights13)
library(skimr)


# create variable: price_ above_ Avg
mean_sqm <- aggregate(lj$price_sqm_k,by=list(type=lj$building_location),mean)
names(mean_sqm) <- c("building_location","mean_sqm")
newlj <- left_join(lj,mean_sqm,by="building_location")
newlj <- mutate(newlj,price_above_avg = price_sqm_k>mean_sqm)


newlj_data <-
  newlj %>%
  mutate(
    # turn price_above_avg into factor variable
    price_above_avg = factor(price_above_avg),
    directions1 = factor(directions1),
    decoration = factor(decoration),
    has_elevator = factor(has_elevator),
    building_style = factor(building_style),
    line = factor(line)
  ) %>%
  # remain required variables
  select(price_above_avg, price_ttl, building_area, building_height, age, directions1, decoration, has_elevator,
         building_style, line) %>%
  # remove missing data
  na.omit() %>%
  # for creating models, it is better to have qualitative columns
  # encoded as factors (instead of character strings)
  mutate_if(is.character, as.factor)

newlj_data %>%
  count(price_above_avg) %>%
  mutate(prop = n/sum(n))
 
# separate data into training data and test data
set.seed(222)
data_split <- initial_split(newlj_data, prop=3/4)
train_data <- training(data_split)
test_data <- testing(data_split)

newlj_rec <-
  recipe(price_above_avg ~., data=train_data)

summary(newlj_rec)
 
newlj_rec <-
  recipe(price_above_avg ~ ., data=train_data) %>%
  step_dummy(all_nominal_predictors()) 

# using logistic glm method to train the machine learning model
lr_mod <- logistic_reg() %>%
  set_engine("glm")

newlj_wflow <-
  workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(newlj_rec)

# fit the model
newlj_fit <-
  newlj_wflow %>%
  fit(data = train_data)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
newlj_fit %>%
  extract_fit_parsnip() %>%
  tidy()
 
# using the model to predict test data
predict(newlj_fit,test_data)
 
# plot the ROC curve
newlj_aug <-
  augment(newlj_fit, test_data)

newlj_aug %>%
  select(price_above_avg, price_ttl, building_area, building_height, age, directions1, decoration, has_elevator,
         building_style, line, .pred_class, .pred_TRUE)
 
newlj_aug %>%
  roc_curve(truth = price_above_avg, .pred_TRUE) %>%
  autoplot()

# calculate ROC_AUC score
newlj_aug %>%
  roc_auc(truth = price_above_avg, .pred_TRUE) 
 

6 Discussion

In this article, I use the method of single-factor and multi-factor analysis and data visualization to find that the house price is related to the house location, age, floor height, number of bedrooms, building area, elevator, decoration style and house type.

Among them, elevators and hardbound drive up prices of houses of the same size. Houses of younger age tend to be more expensive. The price of houses in the middle and higher floors is higher than that in the lower floors. The price of slab building is often higher than that of tower building.

Finally, using OLS regression, I found that there was no significant relationship between building height and SQM, and found that building location could explain most of the variations in the model. The final model explains the price by 68%.

In the machine learning model, I use the median SQM of each building location to divide the sample into two groups and use the existing variables to predict whether the price of a house will above median (station and property name are not included to improve efficiency). The predicted ROC is only 0.21. Therefore, OLS model is a better choice in this report.

标签:Real,building,estate,16,##,price,2e,location,com
From: https://www.cnblogs.com/qtdwz/p/17052417.html

相关文章