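Data Analysis - Hotel Booking Demand Analysis

Tags: data analysis, index, hotel booking, canceled, plt, import, data, sklearn

1. Background

In today's fast-paced life, people are constantly travelling between cities. In the past a traveller had to arrive in a city before booking a room, which now seems cumbersome, and many hotels have started taking reservations through online booking systems. This makes hotel management easier and more efficient. To organize and manage a hotel's large volume of resource information sensibly and efficiently, and in response to the needs of hotels and the market, we analyze hotel booking information and use the dataset to examine hotel operations, market conditions, and customer profiles.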

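2. Data Overview and Preprocessing

Dataset description

Field name	Description
hotel	Hotel
is_canceled	Whether the booking was canceled (1) or not (0)
lead_time	Number of days between the date the booking was entered and the arrival date
arrival_date_year	Year of the arrival date
adults	Number of adults

Data source

The data comes from Kaggle. The dataset contains bookings for one city hotel and one resort hotel, including the booking time, the arrival time, the number of adults, children, and babies, the number of parking spaces required, and other information.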

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import folium
# import eli5  # Feature importance evaluation

# Machine learning
from sklearn.model_selection import (train_test_split, KFold, cross_validate,
                                     cross_val_score, GridSearchCV)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier, RidgeClassifier, Perceptron
from sklearn.datasets import make_classification
from sklearn import metrics
# plot_roc_curve and plot_precision_recall_curve were removed in scikit-learn 1.2;
# RocCurveDisplay and PrecisionRecallDisplay replace them.
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score,
                             precision_recall_curve, precision_recall_fscore_support,
                             classification_report)

# Other libraries
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from imblearn.metrics import classification_report_imbalanced
# from collections import Counter
# from sklearn.model_selection import KFold, StratifiedKFold
# import warnings
# warnings.filterwarnings("ignore")

2.2 Data Preprocessing

2.2.1 Handling missing values

# The file path depends on where you saved the CSV
data_origin = pd.read_csv('C:/Users/gdh31/Desktop/常用/大数据的统计原理/hotel_bookings.csv')

data = data_origin.copy()
# Count missing values per column and show only the columns that have any
missing = data.isnull().sum(axis=0)
missing[missing != 0]

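Four columns contain missing values. We handle them as follows:

children and country: the missing values are a very small share of the total, so we fill them with the mode of each column (both columns take discrete values, so the mean is not used).
agent: the number of missing values is relatively large, so missing values are treated as a separate category, marked as 0.
company: almost every value is missing and only a few are valid, so the column is dropped.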

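# Assign the result back instead of using inplace=True on a column accessor,
# which does not reliably modify the DataFrame in recent pandas versions
data['children'] = data['children'].fillna(data['children'].mode()[0])
data['country'] = data['country'].fillna(data['country'].mode()[0])
data['agent'] = data['agent'].fillna(0)
data = data.drop(columns='company')
# (Run this only once: dropping 'company' a second time raises a KeyError)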
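The decisions above rest on how large the missing share is in each column. A minimal sketch of how that share could be checked is shown here; the `toy` frame and its values are made up for illustration and only stand in for the real `hotel_bookings.csv` data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (made-up values), not the real dataset
toy = pd.DataFrame({
    "children": [0.0, 1.0, np.nan, 0.0],
    "country": ["PRT", None, "GBR", "PRT"],
    "agent": [9.0, np.nan, np.nan, 240.0],
    "company": [np.nan, np.nan, np.nan, 40.0],
    "adults": [2, 1, 2, 2],
})

# Fraction of missing values per column, keeping only columns that have any
missing_share = toy.isnull().mean()
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

On the real data, the same two lines applied to `data_origin` would show which columns are candidates for filling and which for dropping.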

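2.2.2 Handling outliers

Reading the dataset description on Kaggle (shown beside the download) and inspecting the data closely, we find bookings with a total of 0 guests and bookings with a total stay of 0 nights. These are abnormal records and need to be filtered out.
The dataset description also states that the meal values Undefined and SC both mean that no meal was booked, so we merge them into one category.

# Drop bookings whose total number of guests is 0
zero_guest = data[data[['adults', 'children', 'babies']].sum(axis=1) == 0]
data.drop(zero_guest.index, inplace=True)
# Drop bookings whose total number of nights is 0
zero_days = data[data[['stays_in_weekend_nights', 'stays_in_week_nights']].sum(axis=1) == 0]
data.drop(zero_days.index, inplace=True)
# Merge the meal types Undefined and SC
data['meal'] = data['meal'].replace('Undefined', 'SC')

2.3 Check the basic information of the processed data

data.info()

Preprocessing is now complete. The cleaned dataset has 118,565 rows and 31 columns.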

3. Data Analysis Process and Visualization

① Hotel type

sns.countplot(x='hotel', hue='is_canceled', data=data)
plt.show()

② Room type

# Relationship between room type and cancellation
plt.figure(figsize=(6, 8))
index = 1
for room_type in ['reserved_room_type', 'assigned_room_type']:
    ax1 = plt.subplot(2, 1, index)
    index += 1
    ax2 = ax1.twinx()
    # Bars: number of bookings per room type (left axis)
    room_counts = data.groupby(room_type).size()
    ax1.bar(room_counts.index, room_counts)
    ax1.set_xlabel(room_type)
    ax1.set_ylabel('Number')
    # Line: cancellation rate per room type (right axis)
    ax2.plot(data.groupby(room_type)['is_canceled'].mean(), 'ro-')
    ax2.set_ylabel('Cancellation rate')
# Show once, after both subplots have been drawn
plt.tight_layout()
plt.show()

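Reserved and assigned room types are concentrated in types A, D, E, and F. Type A has a cancellation rate about 7-8 percentage points higher than the other three, which deserves attention.

Next we look at how a room type change (assigned room type ≠ reserved room type) affects the cancellation rate.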

# Effect of a room type change on cancellation
data['room_changed'] = data['reserved_room_type'] != data['assigned_room_type']
sns.countplot(x='room_changed', hue='is_canceled', data=data)
plt.show()

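Customers whose room type was changed are far less likely to cancel than customers whose room type was not changed. Possible reasons:

① When the room type is changed on arrival at the hotel, most customers choose to check in rather than cancel.

② When customers change the room type themselves, they prefer changing the room over cancelling, so that they can still check in as planned.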
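A count plot shows volumes, so the rate itself is easier to read from a grouped mean. The following is a small sketch of that calculation on a made-up `toy` frame (the rows are illustrative, not taken from the dataset); the column name `room_changed` is an assumption of this example.

```python
import pandas as pd

# Hypothetical bookings (made-up values)
toy = pd.DataFrame({
    "reserved_room_type": ["A", "A", "D", "A", "E", "D"],
    "assigned_room_type": ["A", "D", "D", "A", "F", "D"],
    "is_canceled":        [1,   0,   1,   0,   0,   1],
})

# Flag bookings where the assigned room differs from the reserved one
toy["room_changed"] = toy["reserved_room_type"] != toy["assigned_room_type"]

# Cancellation rate and number of bookings for each group
rate_by_change = toy.groupby("room_changed")["is_canceled"].agg(["mean", "size"])
print(rate_by_change)
```

Reporting `size` next to `mean` makes it clear how many bookings each rate is based on.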

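3.2 Customer Information Analysis

① Number of guests

# Relationship between number of guests and cancellation
plt.figure(figsize=(12, 6))
index = 0
for people in ['adults', 'children', 'babies']:
    index += 1
    # Top row: cancellation rate by head count
    plt.subplot(2, 3, index)
    plt.plot(data.groupby(people)['is_canceled'].mean(), 'ro-', ms=4)
    plt.title(people, fontsize=20)
    # Bottom row: number of bookings by head count
    plt.subplot(2, 3, index + 3)
    people_stats = data[people].value_counts()
    # x and y must be passed as keywords in seaborn 0.12 and later
    sns.barplot(x=people_stats.index, y=people_stats.values)
plt.tight_layout()
plt.show()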

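(1) Most bookings include no children or babies; single and double occupancy are the main patterns.

(2) The cancellation rate drops sharply when a baby is included.

(3) Bookings for more than 5 guests are almost all canceled. These may be abnormal orders such as fake bookings, and the hotel should watch for them.

For each hotel, we focus on the cancellation rate of the following occupancy patterns:

Single: adults = 1, children = 0, babies = 0

Couple: adults = 2, children = 0, babies = 0

Family: adults >= 2, and children > 0 or babies > 0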

# Occupancy pattern analysis
# Single
single = (data.adults == 1) & (data.children == 0) & (data.babies == 0)
# Couple
couple = (data.adults == 2) & (data.children == 0) & (data.babies == 0)
# Family: the inner parentheses are needed because & binds more tightly than |
family = (data.adults >= 2) & ((data.children > 0) | (data.babies > 0))
# Encode the pattern: 0 = Others, 1 = Single, 2 = Couple, 3 = Family
data['people_mode'] = single.astype(int) + couple.astype(int) * 2 + family.astype(int) * 3

plt.figure(figsize=(10, 6))
index = 1
for hotel_kind in ['City Hotel', 'Resort Hotel']:
    plt.subplot(1, 2, index)
    index += 1
    # Fix the category order so the tick labels below always line up
    sns.countplot(x='people_mode', hue='is_canceled', order=[0, 1, 2, 3],
                  data=data[data.hotel == hotel_kind])
    plt.xticks([0, 1, 2, 3], ['Others', 'Single', 'Couple', 'Family'])
    plt.title(hotel_kind)
plt.tight_layout()
plt.show()

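For the city hotel, the cancellation probability ranks as couple >> single ≈ family. The high cancellation rate of couples deserves attention, and the hotel could improve the services aimed at couples to reduce it.

For the resort hotel, the cancellation probability ranks as family > couple > single. The hotel could offer suitable discounts to family customers to raise their check-in rate.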
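Rankings like these are easier to verify from a table of rates than from count plots. The sketch below labels each booking with `np.select` and pivots the mean of `is_canceled` by hotel and occupancy pattern. The `toy` frame is made up, and the `people_label` column and the `has_kids` helper are names introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical bookings (made-up values)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel",
              "Resort Hotel", "Resort Hotel", "Resort Hotel"],
    "adults":      [1, 2, 2, 2, 2, 3],
    "children":    [0, 0, 0, 1, 0, 0],
    "babies":      [0, 0, 0, 0, 1, 0],
    "is_canceled": [0, 1, 0, 1, 0, 1],
})

has_kids = (toy["children"] > 0) | (toy["babies"] > 0)
conditions = [
    (toy["adults"] == 1) & ~has_kids,   # Single
    (toy["adults"] == 2) & ~has_kids,   # Couple
    (toy["adults"] >= 2) & has_kids,    # Family
]
# The first matching condition wins; anything else is labelled "Others"
toy["people_label"] = np.select(conditions, ["Single", "Couple", "Family"],
                                default="Others")

# Cancellation rate for every hotel / occupancy pattern combination
rate_table = toy.pivot_table(index="hotel", columns="people_label",
                             values="is_canceled", aggfunc="mean")
print(rate_table)
```

Combinations that do not occur in the data appear as NaN in the table rather than as 0, which avoids reading an empty cell as a zero cancellation rate.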

② Meal type

# Relationship between meal type and cancellation
plt.figure(figsize=(12, 6))
# Left: meal types of canceled bookings
plt.subplot(121)
canceled_meals = data[data['is_canceled'] == 1].meal.value_counts()
plt.pie(canceled_meals, labels=canceled_meals.index, autopct="%.2f%%")
plt.legend(loc=1)
plt.title('Canceled')
# Right: meal types of bookings that were not canceled
plt.subplot(122)
kept_meals = data[data['is_canceled'] == 0].meal.value_counts()
plt.pie(kept_meals, labels=kept_meals.index, autopct="%.2f%%")
plt.legend(loc=1)
plt.title('Uncanceled')
plt.show()

The meal type distribution differs little between canceled and uncanceled bookings.
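The same comparison can be made numerically with a row-normalized cross-tabulation, which puts the two distributions side by side. This is a sketch on made-up rows, not on the real dataset.

```python
import pandas as pd

# Hypothetical bookings (made-up values)
toy = pd.DataFrame({
    "meal":        ["BB", "BB", "HB", "SC", "BB", "HB", "BB", "SC"],
    "is_canceled": [1,    0,    1,    0,    0,    0,    1,    1],
})

# Each row sums to 1: the share of each meal type within
# uncanceled (0) and canceled (1) bookings
meal_share = pd.crosstab(toy["is_canceled"], toy["meal"], normalize="index")
print(meal_share)
```

If the two rows are close to each other, meal type carries little information about cancellation, which is what the pie charts suggest.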

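③ Parking demand

# Parking space demand
sns.countplot(x='required_car_parking_spaces', hue='hotel', data=data)
plt.show()

Most customers do not need a parking space. By comparison, resort hotel customers need parking far more often than city hotel customers.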
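Because the two hotels have different booking volumes, comparing the share of bookings that need parking is more direct than comparing bar heights. A minimal sketch, using made-up rows in a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical bookings (made-up values)
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "required_car_parking_spaces": [0, 0, 0, 1, 0, 1, 1, 0],
})

# True when at least one parking space was requested
needs_parking = toy["required_car_parking_spaces"] > 0

# The mean of a boolean column is the share of True values
parking_share = needs_parking.groupby(toy["hotel"]).mean()
print(parking_share)
```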

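④ Country/region

Customers in the dataset come from 177 countries/regions. For ease of analysis, only the 20 countries/regions with the most bookings are analyzed.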

# Cancellation rate by country
# Select the 20 countries/regions with the most bookings
countries_20 = list(
    data.groupby('country').size().sort_values(ascending=False).head(20).index)
top20 = data[data.country.isin(countries_20)]
# Share of all bookings that come from the top 20
top20.shape[0] / data.shape[0]

fig, ax1 = plt.subplots(figsize=(10, 6))
ax2 = ax1.twinx()
plt.xticks(range(20), countries_20)
# Bars: bookings per country, in the same order as countries_20
ax1.bar(range(20), top20.groupby('country').size().loc[countries_20])
ax1.set_xlabel('Country')
ax1.set_ylabel('Total Number of Bookings')
# Line: cancellation rate per country
ax2.plot(range(20),
         top20.groupby('country')['is_canceled'].mean().loc[countries_20], 'ro-')
ax2.set_ylabel('Cancellation rate')
plt.show()

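The top 20 countries/regions account for 94% of all records. Customers come mainly from European countries such as Portugal, the United Kingdom, France, and Spain. Cancellation rates differ markedly between countries; those with relatively high rates include Portugal, Italy, Brazil, China, and Russia, several of which (Brazil, China, Russia) are emerging economies.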

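⑤ Booking history

Booking history refers to whether a customer's earlier bookings were canceled. To some extent it reflects how likely the customer is to cancel the current booking.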

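# Relationship between booking history and cancellation
# Repeated guest or not
tick_label = ['New Guest', 'Repeated Guest']
sns.countplot(x='is_repeated_guest', hue='is_canceled', data=data)
plt.xticks([0, 1], tick_label)
plt.show()

# Start a new figure so the subplots do not overlap the count plot
plt.figure(figsize=(12, 6))
# Number of previous cancellations
plt.subplot(121)
plt.plot(data.groupby('previous_cancellations')['is_canceled'].mean(), 'ro')
plt.xlabel('Previous Cancellations')
# Number of previous bookings that were not canceled
plt.subplot(122)
plt.plot(data.groupby('previous_bookings_not_canceled')['is_canceled'].mean(), 'bo')
plt.ylim(0, 1)
plt.xlabel('Previous Un-Cancellations')
plt.show()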

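(1) Most bookings come from new guests, and repeated guests are far less likely to cancel than new guests.

(2) Customers who have canceled before are more likely to cancel the current booking. Customers with more than 15 previous cancellations almost never check in and could be added to the hotel's "blacklist".

(3) Customers who previously booked and checked in have comparatively good credit; customers with many completed stays (> 20) almost never cancel.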

3.3 Booking Information Analysis

① Lead time

# Distribution of lead time
plt.figure(figsize=(12, 6))
plt.subplot(121)
plt.hist(data['lead_time'], bins=50)
plt.xlabel('Lead Time')
plt.ylabel('Number')
# Effect of lead time on cancellation
plt.subplot(122)
lead_rate = data.groupby('lead_time')['is_canceled'].mean()
plt.plot(lead_rate.index, lead_rate, 'ro', markersize=2)
plt.xlabel('Lead Time')
plt.ylabel('Cancellation rate')
plt.show()

The lead time distribution shows clearly that customers tend to book close to their arrival date, and the cancellation rate rises as the lead time increases.
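A rate computed for every individual lead time is noisy where few bookings exist, so the trend can also be checked on binned lead times. The sketch below uses `pd.cut` on made-up rows; the bin edges and labels are assumptions chosen for illustration, not values from the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical bookings (made-up values)
toy = pd.DataFrame({
    "lead_time":   [0, 3, 10, 25, 40, 90, 120, 200, 300, 400],
    "is_canceled": [0, 0, 0,  0,  1,  0,  1,   1,   1,   1],
})

# Right-closed bins; -1 as the first edge keeps lead_time == 0 in the first bin
edges = [-1, 7, 30, 90, 180, 365, np.inf]
labels = ["0-7", "8-30", "31-90", "91-180", "181-365", ">365"]
toy["lead_time_bin"] = pd.cut(toy["lead_time"], bins=edges, labels=labels)

# Cancellation rate and number of bookings in each bin
lead_summary = (toy.groupby("lead_time_bin", observed=False)["is_canceled"]
                   .agg(["mean", "size"]))
print(lead_summary)
```

Keeping the `size` column in view shows which bins have too few bookings for their rate to be trusted.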

② Stay period

Length of stay

# Effect of length of stay on cancellation
data['stay_nights'] = data['stays_in_weekend_nights'] + data['stays_in_week_nights']
# The distribution is too spread out, so bin the values
# (named bins rather than bin to avoid shadowing the built-in)
bins = [0, 1, 2, 5, 10, 15, np.inf]
# The last bin is (15, inf], so it is labelled '>15'
data['stay_nights_bin'] = pd.cut(data['stay_nights'], bins,
                                 labels=['1', '2', '3-5', '6-10', '11-15', '>15'])
plt.figure(figsize=(10, 6))
plt.subplot(121)
sns.countplot(x='stay_nights_bin', hue='is_canceled',
              data=data[data['hotel'] == 'Resort Hotel'])
plt.xlabel('Stay Nights')
plt.title('Resort Hotel')
plt.subplot(122)
sns.countplot(x='stay_nights_bin', hue='is_canceled',
              data=data[data['hotel'] == 'City Hotel'])
plt.xlabel('Stay Nights')
plt.title('City Hotel')
plt.tight_layout()
plt.show()

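(1) Resort hotel stays are concentrated between 1 and 10 nights, and customers staying 1 night have the lowest cancellation probability.

(2) City hotel stays are mostly 5 nights or fewer, and customers staying 2 nights have the highest cancellation probability.

(3) Overall, resort hotel customers stay noticeably longer on average than city hotel customers, so a long-stay discount could be considered to attract guests.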
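The comparison of average stay length in point (3) can be backed by a direct calculation rather than read off the count plots. A minimal sketch on made-up rows in a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical bookings (made-up values)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel",
              "Resort Hotel", "Resort Hotel", "Resort Hotel"],
    "stays_in_weekend_nights": [0, 1, 2, 2, 2, 4],
    "stays_in_week_nights":    [2, 2, 3, 5, 3, 10],
})

# Total nights per booking
toy["stay_nights"] = toy["stays_in_weekend_nights"] + toy["stays_in_week_nights"]

# Mean and median stay per hotel; the median is less affected by very long stays
stay_stats = toy.groupby("hotel")["stay_nights"].agg(["mean", "median"])
print(stay_stats)
```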

③ Booking channel

# Effect of booking channel on cancellation rate
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
channel_rate = data.groupby('distribution_channel')['is_canceled'].mean()
# Bars: number of bookings per channel, in the same order as channel_rate
sns.countplot(x='distribution_channel', data=data, order=channel_rate.index, ax=ax1)
ax1.set_xlabel('Distribution Channel')
# Line: cancellation rate per channel
ax2.plot(channel_rate, 'ro-')
ax2.set_ylabel('Rate')
plt.show()

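(1) Bookings come mainly from travel agents and tour operators (TA/TO), direct individual bookings (Direct), and corporate bookings (Corporate).

(2) Bookings made through travel agents are canceled far more often than those from other channels, possibly because agents cancel lower-margin bookings for profit reasons.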

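5. Summary

The data analysis and mining achieved the goals we set. Combining booking volume with cancellation volume, the resort hotel sees fewer guests and a sharply higher cancellation rate in July and August, so operators should consider adjusting their pricing strategy to increase revenue. Customers should consider avoiding resort hotel bookings in August, when prices are at their peak; in September prices fall by nearly half while the climate and environment differ little, making it a good time to stay.

Through this work I discovered many shortcomings and gaps in my own knowledge, and encountered many things I had never heard of. It made me understand more clearly how important solid fundamentals and practical skills are. Because my fundamentals were not solid, I took many detours during this course project, but I think it was worth it: it filled in much of what I was missing, and in completing this project I gained a deeper understanding and command of how to apply and interpret the various kinds of visualization charts.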

Source code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import folium
# import eli5  # Feature importance evaluation

# Machine learning
from sklearn.model_selection import (train_test_split, KFold, cross_validate,
                                     cross_val_score, GridSearchCV)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier, RidgeClassifier, Perceptron
from sklearn.datasets import make_classification
from sklearn import metrics
# plot_roc_curve and plot_precision_recall_curve were removed in scikit-learn 1.2;
# RocCurveDisplay and PrecisionRecallDisplay replace them.
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score,
                             precision_recall_curve, precision_recall_fscore_support,
                             classification_report)

# Other libraries
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from imblearn.metrics import classification_report_imbalanced
# from collections import Counter
# from sklearn.model_selection import KFold, StratifiedKFold
# import warnings
# warnings.filterwarnings("ignore")

# The file path depends on where you saved the CSV
data_origin = pd.read_csv('C:/Users/gdh31/Desktop/常用/大数据的统计原理/hotel_bookings.csv')
data_origin.info()
data = data_origin.copy()
# Count missing values per column and show only the columns that have any
missing = data.isnull().sum(axis=0)
missing[missing != 0]
# Handle missing values (assign back instead of inplace=True on a column accessor)
data['children'] = data['children'].fillna(data['children'].mode()[0])
data['country'] = data['country'].fillna(data['country'].mode()[0])
data['agent'] = data['agent'].fillna(0)
data = data.drop(columns='company')
# (Run this only once: dropping 'company' a second time raises a KeyError)
data.info()
# Drop bookings whose total number of guests is 0
zero_guest = data[data[['adults', 'children', 'babies']].sum(axis=1) == 0]
data.drop(zero_guest.index, inplace=True)
# Drop bookings whose total number of nights is 0
zero_days = data[data[['stays_in_weekend_nights', 'stays_in_week_nights']].sum(axis=1) == 0]
data.drop(zero_days.index, inplace=True)
# Merge the meal types Undefined and SC
data['meal'] = data['meal'].replace('Undefined', 'SC')
data.info()
# Cancellations by hotel type
sns.countplot(x='hotel', hue='is_canceled', data=data)
plt.show()
# Relationship between room type and cancellation
plt.figure(figsize=(6, 8))
index = 1
for room_type in ['reserved_room_type', 'assigned_room_type']:
    ax1 = plt.subplot(2, 1, index)
    index += 1
    ax2 = ax1.twinx()
    # Bars: number of bookings per room type (left axis)
    room_counts = data.groupby(room_type).size()
    ax1.bar(room_counts.index, room_counts)
    ax1.set_xlabel(room_type)
    ax1.set_ylabel('Number')
    # Line: cancellation rate per room type (right axis)
    ax2.plot(data.groupby(room_type)['is_canceled'].mean(), 'ro-')
    ax2.set_ylabel('Cancellation rate')
# Show once, after both subplots have been drawn
plt.tight_layout()
plt.show()
# Effect of a room type change on cancellation
data['room_changed'] = data['reserved_room_type'] != data['assigned_room_type']
sns.countplot(x='room_changed', hue='is_canceled', data=data)
plt.show()
# Relationship between number of guests and cancellation
plt.figure(figsize=(12, 6))
index = 0
for people in ['adults', 'children', 'babies']:
    index += 1
    # Top row: cancellation rate by head count
    plt.subplot(2, 3, index)
    plt.plot(data.groupby(people)['is_canceled'].mean(),
             'ro-',
             ms=4)
    plt.title(people, fontsize=20)
    # Bottom row: number of bookings by head count
    plt.subplot(2, 3, index + 3)
    people_stats = data[people].value_counts()
    # x and y must be passed as keywords in seaborn 0.12 and later
    sns.barplot(x=people_stats.index, y=people_stats.values)
plt.tight_layout()
plt.show()
# Occupancy pattern analysis
# Single
single = (data.adults == 1) & (data.children == 0) & (data.babies == 0)
# Couple
couple = (data.adults == 2) & (data.children == 0) & (data.babies == 0)
# Family: the inner parentheses are needed because & binds more tightly than |
family = (data.adults >= 2) & ((data.children > 0) | (data.babies > 0))

# Encode the pattern: 0 = Others, 1 = Single, 2 = Couple, 3 = Family
data['people_mode'] = single.astype(int) + couple.astype(int) * 2 + family.astype(int) * 3
plt.figure(figsize=(10, 6))
index = 1
for hotel_kind in ['City Hotel', 'Resort Hotel']:
    plt.subplot(1, 2, index)
    index += 1
    # Fix the category order so the tick labels below always line up
    sns.countplot(x='people_mode',
                  hue='is_canceled',
                  order=[0, 1, 2, 3],
                  data=data[data.hotel == hotel_kind])
    plt.xticks([0, 1, 2, 3], ['Others', 'Single', 'Couple', 'Family'])
    plt.title(hotel_kind)
plt.tight_layout()
plt.show()
# Relationship between meal type and cancellation
plt.figure(figsize=(12, 6))
# Left: meal types of canceled bookings
plt.subplot(121)
canceled_meals = data[data['is_canceled'] == 1].meal.value_counts()
plt.pie(canceled_meals, labels=canceled_meals.index, autopct="%.2f%%")
plt.legend(loc=1)
plt.title('Canceled')
# Right: meal types of bookings that were not canceled
plt.subplot(122)
kept_meals = data[data['is_canceled'] == 0].meal.value_counts()
plt.pie(kept_meals, labels=kept_meals.index, autopct="%.2f%%")
plt.legend(loc=1)
plt.title('Uncanceled')
plt.show()
# Parking space demand
sns.countplot(x='required_car_parking_spaces', hue='hotel', data=data)
plt.show()
# Cancellation rate by country
# Select the 20 countries/regions with the most bookings
countries_20 = list(
    data.groupby('country').size().sort_values(ascending=False).head(20).index)
top20 = data[data.country.isin(countries_20)]
# Share of all bookings that come from the top 20
top20.shape[0] / data.shape[0]

fig, ax1 = plt.subplots(figsize=(10, 6))
ax2 = ax1.twinx()
plt.xticks(range(20), countries_20)
# Bars: bookings per country, in the same order as countries_20
ax1.bar(range(20), top20.groupby('country').size().loc[countries_20])
ax1.set_xlabel('Country')
ax1.set_ylabel('Total Number of Bookings')
# Line: cancellation rate per country
ax2.plot(
    range(20),
    top20.groupby('country')['is_canceled'].mean().loc[countries_20], 'ro-')
ax2.set_ylabel('Cancellation rate')
plt.show()
# Relationship between booking history and cancellation
# Repeated guest or not
tick_label = ['New Guest', 'Repeated Guest']
sns.countplot(x='is_repeated_guest', hue='is_canceled', data=data)
plt.xticks([0, 1], tick_label)
plt.show()

# Start a new figure so the subplots do not overlap the count plot
plt.figure(figsize=(12, 6))
# Number of previous cancellations
plt.subplot(121)
plt.plot(data.groupby('previous_cancellations')['is_canceled'].mean(),
         'ro')
plt.xlabel('Previous Cancellations')
# Number of previous bookings that were not canceled
plt.subplot(122)
plt.plot(data.groupby('previous_bookings_not_canceled')['is_canceled'].mean(),
         'bo')
plt.ylim(0, 1)
plt.xlabel('Previous Un-Cancellations')
plt.show()
# Distribution of lead time
plt.figure(figsize=(12, 6))
plt.subplot(121)
plt.hist(data['lead_time'], bins=50)
plt.xlabel('Lead Time')
plt.ylabel('Number')
# Effect of lead time on cancellation
plt.subplot(122)
lead_rate = data.groupby('lead_time')['is_canceled'].mean()
plt.plot(lead_rate.index,
         lead_rate,
         'ro',
         markersize=2)
plt.xlabel('Lead Time')
plt.ylabel('Cancellation rate')
plt.show()
# Bookings and cancellations by month
ordered_months = [
    "January", "February", "March", "April", "May", "June", "July", "August",
    "September", "October", "November", "December"
]

for hotel in ['City Hotel', 'Resort Hotel']:
    fig, ax1 = plt.subplots()
    ax2 = ax1.twinx()
    data_hotel = data[data.hotel == hotel]
    # Convert totals to per-year averages: every month is divided by 2, then
    # July and August are rescaled so that they are divided by 3 overall
    # (presumably because the data covers three Julys and Augusts but only
    # two of every other month)
    monthly = data_hotel.groupby('arrival_date_month').size() / 2
    monthly.loc[['July', 'August']] = monthly.loc[['July', 'August']] * 2 / 3
    # x and y must be passed as keywords in seaborn 0.12 and later
    sns.barplot(x=list(range(1, 13)), y=monthly[ordered_months].values, ax=ax1)
    ax2.plot(
        range(12),
        data_hotel.groupby('arrival_date_month')['is_canceled'].mean()[ordered_months].values,
        'ro-')
    ax1.set_xlabel('Month')
    ax1.set_title(hotel)
    ax2.set_ylabel('Cancellation rate')
    plt.show()

# Monthly price per person for each hotel
# Price per person (babies not counted); bookings with no adults or children become NaN
guests = data['adults'] + data['children']
data['adr_per_person'] = data['adr'] / guests.where(guests > 0)
plt.figure()
plt.plot(data[data.hotel == 'City Hotel'].groupby('arrival_date_month')
         ['adr_per_person'].mean()[ordered_months],
         label='City Hotel')
plt.plot(data[data.hotel == 'Resort Hotel'].groupby('arrival_date_month')
         ['adr_per_person'].mean()[ordered_months],
         label='Resort Hotel')
plt.xlabel('Month')
plt.ylabel('Fare')
plt.xticks(np.arange(12), range(1, 13))
plt.legend()
plt.show()
# Effect of length of stay on cancellation
data['stay_nights'] = data['stays_in_weekend_nights'] + data['stays_in_week_nights']
# The distribution is too spread out, so bin the values
# (named bins rather than bin to avoid shadowing the built-in)
bins = [0, 1, 2, 5, 10, 15, np.inf]
# The last bin is (15, inf], so it is labelled '>15'
data['stay_nights_bin'] = pd.cut(data['stay_nights'], bins,
                                 labels=['1', '2', '3-5', '6-10', '11-15', '>15'])
plt.figure(figsize=(10, 6))
plt.subplot(121)
sns.countplot(x='stay_nights_bin', hue='is_canceled',
              data=data[data['hotel'] == 'Resort Hotel'])
plt.xlabel('Stay Nights')
plt.title('Resort Hotel')
plt.subplot(122)
sns.countplot(x='stay_nights_bin', hue='is_canceled',
              data=data[data['hotel'] == 'City Hotel'])
plt.xlabel('Stay Nights')
plt.title('City Hotel')
plt.tight_layout()
plt.show()
# Effect of booking channel on cancellation rate
fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
channel_rate = data.groupby('distribution_channel')['is_canceled'].mean()
# Bars: number of bookings per channel, in the same order as channel_rate
sns.countplot(
    x='distribution_channel',
    data=data,
    order=channel_rate.index,
    ax=ax1)
ax1.set_xlabel('Distribution Channel')
# Line: cancellation rate per channel
ax2.plot(channel_rate, 'ro-')
ax2.set_ylabel('Rate')
plt.show()

From： https://www.cnblogs.com/hzk114455/p/16999726.html
