I. Background:
Telecom customer churn prediction
Telephone carriers, internet service providers, pay-TV companies, insurance firms, and alarm-monitoring services commonly use churn analysis and churn rate as key operating metrics, because retaining an existing customer costs far less than acquiring a new one. These companies typically run customer-service departments, one of whose tasks is trying to win back customers who have already churned, since in the long run a loyal customer is worth far more than a new one.
A churn model estimates each customer's risk of leaving and thereby supports churn analysis: it ranks potential churners by priority, so the group of customers likely to churn can be monitored effectively.
II. Analysis design:
The data columns are: Churn (churn label), AccountWeeks (weeks the account has been active), ContractRenewal (contract renewed), DataPlan (data plan), DataUsage (data usage), CustServCalls (customer-service calls), DayMins (daytime minutes), DayCalls (daytime calls), MonthlyCharge (monthly charge), OverageFee (overage fee), RoamMins (roaming minutes).
Three classification methods are used to predict whether a customer churns: L1-penalized (lasso) logistic regression, random forest, and KNN, and the three models' performance is compared.
III. Data analysis steps:
The dataset comes from the Kaggle telecom customer churn prediction competition.
1. Check whether the sample labels are balanced.
2. Show the correlations among the features.
3. Plot each feature's histogram and density curve.
4. Analyze outliers with the box-plot method.
5. Normalize the sample data.
6. Split the data into training features, training labels, test features, and test labels.
The full source code is given in Section V. It is organized around these model functions:
1. KNN: fit a KNN classifier on the training set and predict the test set, producing the ROC curve and confusion matrix shown in the figures, from which the classifier's accuracy, recall, AUC, F1 score, and precision are read off.
2. Random forest: fit a random forest classifier on the training set and predict the test set, producing the ROC curve, confusion matrix, and the same five metrics.
3. Lasso: fit an L1-penalized (lasso) logistic regression on the training set and predict the test set, producing the ROC curve, confusion matrix, and the same five metrics.
Cross-validation for each model:
The cross-validation function returns each fold's training and test score (cv=3) plus the mean test score, for the logistic, KNN, and RFC models.
Box plots then compare the three models' training and test scores:
Random forest scores highest on both the training and the test folds, indicating that the random forest model performs best.
IV. Exploratory analysis:
1. Show the label distribution as a pie chart and bar chart; because the labels are imbalanced, SMOTE oversampling is applied to balance them.
2. Drop weakly correlated features.
3. Show whether each feature deviates from a normal distribution.
4. Analyze outliers with the box-plot method; the detected outliers can optionally be removed.
5. Standardize the feature columns, with histograms shown before and after scaling.
V. Code:
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score,recall_score
from collections import Counter
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 20, 12
sns.set_style('whitegrid')
import warnings
warnings.filterwarnings('ignore')
from imblearn.over_sampling import SMOTE,RandomOverSampler
from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
# Load the data file
df = pd.read_csv("telecom_churn(1).csv")
df.columns
# Check memory usage
df.info(memory_usage='deep')
## Class label distribution
plt.figure(figsize=(12,4), dpi=100)
plt.subplot(1,2,1)
plt.pie(df['Churn'].value_counts(), labels=df['Churn'].value_counts().index, autopct="%.1f%%")
plt.grid()
plt.subplot(1,2,2)
m, n = df['Churn'].value_counts().index.astype(str), df['Churn'].value_counts()
plt.bar(m, n)
for a, b in zip(m, n):
    plt.text(a, b, b)
plt.show()  ## the classes are clearly imbalanced
## Correlation check - Spearman coefficient
data = df.copy()
corr = data.corr(method='spearman')  # the original called corr() (Pearson default); 'spearman' matches the stated intent
corr
#### Visualize the correlation matrix
plt.figure(figsize=(12,8))
ax = sns.heatmap(
    corr,
    annot=False,
    vmin=-1, vmax=1, center=0, square=True,
    cmap=sns.diverging_palette(20, 220, n=200)
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
)
sns.despine()
plt.savefig("corr.png")
plt.show()  # Churn's pairwise correlations with the other features are all low; with no collinear variables, no collinearity treatment is needed
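# Sketch (my addition, not in the original): rank features by absolute Spearman
# correlation with Churn -- a quick way to see which features correlate weakly
# with the label before the later feature-dropping step
corr['Churn'].drop('Churn').abs().sort_values()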
# Histogram of each feature with a fitted normal curve
from scipy.stats import norm
my_figure_hist = plt.figure(figsize=(22,22))
num_columns = list(data.columns)
for col in num_columns:
    plt.subplot(3, 4, num_columns.index(col) + 1)
    col_data = data[col]
    mu = np.mean(col_data)      # mean
    sigma = np.std(col_data)    # standard deviation
    n, bins, patches = plt.hist(col_data, rwidth=0.8, density=True, bins=10, align='mid', label=col)
    y = norm.pdf(bins, mu, sigma)  # best-fit normal density for comparison
    plt.plot(bins, y)
    plt.xlabel(col)
    plt.ylabel('DENSITY')
    plt.title(f'{col} of data')
plt.savefig('histgram.png')
plt.show()
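# Sketch (my addition): a formal D'Agostino-Pearson normality test to back up
# the visual histogram inspection above
from scipy.stats import normaltest
for col in num_columns:
    stat, pval = normaltest(data[col])
    print(col, "looks normal" if pval > 0.05 else "non-normal", round(pval, 4))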
### Outlier analysis
def remove_filers_with_boxplot(data):
    plt.figure(figsize=(16,6))
    p = data.boxplot(return_type='dict')
    plt.xticks(rotation=90, fontsize=10)
    for index, value in enumerate(data.columns):
        try:
            # collect the flier (outlier) values for this column
            fliers_value_list = p['fliers'][index].get_ydata()
        except IndexError:
            continue
        # drop every row whose value equals a flier value
        for flier in fliers_value_list:
            data = data[data.loc[:, value] != flier]
    return data
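# Alternative sketch (my addition, not the method used below): the common 1.5*IQR
# rule filters the same box-plot outliers without needing the plot object
def remove_outliers_iqr(data, k=1.5):
    for col in data.columns:
        q1, q3 = data[col].quantile(0.25), data[col].quantile(0.75)
        iqr = q3 - q1
        # keep only rows inside [q1 - k*iqr, q3 + k*iqr]
        data = data[(data[col] >= q1 - k*iqr) & (data[col] <= q3 + k*iqr)]
    return data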
# Choose whether to drop outliers -- default: keep them
drop_outlier = False
if drop_outlier:
    df3 = remove_filers_with_boxplot(data.copy())
else:
    df3 = data.copy()
# Standardize the features. This step is missing from the original listing but is
# required below (df_X, y); the reconstruction here is an assumption based on the
# "after StandardScaler" plot titles.
y = df3['Churn']
df_X = pd.DataFrame(StandardScaler().fit_transform(df3.drop(columns=['Churn'])),
                    columns=df3.columns.drop('Churn'))
print("data std:",round(df_X.std().mean(),2), "data mean:",abs(round(df_X.mean().mean(),2)))
plt.subplot(1,2,1)
df_X['DayMins'].hist(bins=20)
plt.title("DayMins after StandScaler")
plt.subplot(1,2,2)
data['DayMins'].hist(bins=20)
plt.title("DayMins before StandScaler")
plt.show()
# Random train/test split
x_train, x_test, y_train, y_test = train_test_split(df_X, y, test_size=0.2)
smote=SMOTE()
smote_X,smote_y=smote.fit_resample(x_train,y_train)
Counter(smote_y)
smo_counts = dict(Counter(smote_y))
plt.figure(figsize=(12,4),dpi=100)
plt.subplot(1,2,1)
plt.pie(y_train.value_counts(),labels=y_train.value_counts().index,autopct="%.1f%%")
plt.grid()
plt.title("before smote")
plt.subplot(1,2,2)
plt.pie(smo_counts.values(),labels=smo_counts.keys(),autopct="%.1f%%")
plt.title("after smote")
plt.grid()
# Define a function that computes the evaluation metrics
from sklearn.metrics import roc_curve, auc, precision_score, f1_score, confusion_matrix
def calculate_metrics(y_true, y_pred, modelname):
    recall = recall_score(y_true, y_pred)
    acc = accuracy_score(y_true, y_pred)
    # Note: the ROC here is built from hard class labels; probability scores
    # (predict_proba) would give a smoother, more informative curve
    fpr, tpr, thr = roc_curve(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    AUC = auc(fpr, tpr)
    print(f"{modelname} accuracy:", acc)
    print(f"{modelname} recall:", recall)
    print(f"{modelname} AUC:", AUC)
    print(f"{modelname} f1 score:", f1)
    print(f"{modelname} precision:", precision)
    mat = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(12,4))
    plt.subplot(1,2,1)
    plt.plot(fpr, tpr, 'b-', label='ROC (area = {0:.2f})'.format(AUC), lw=2)
    plt.plot([0,1], [0,1], 'k--', lw=2)  # chance diagonal
    plt.xlim([-0.05, 1.05])  # pad the axes so the curve does not sit on the frame
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'{modelname} ROC Curve')
    plt.legend(loc="lower right")
    plt.subplot(1,2,2)
    sns.heatmap(mat, fmt='d', annot=True)
    plt.show()
    return recall, acc, AUC, f1, precision
def model_predict(model):
    # fits on the original split; pass smote_X, smote_y here instead to train on the SMOTE-balanced data
    model.fit(x_train, y_train)
    pred = model.predict(x_test)
    return model, pred
# KNN
knn = KNeighborsClassifier()
knn_model, knn_pred = model_predict(knn)
knn_recall, knn_acc, knn_AUC, knn_f1, knn_precision = calculate_metrics(y_test, knn_pred, "KNN")
## RFC
rfc = RandomForestClassifier()
rfc_model, rfc_pred = model_predict(rfc)
rfc_recall, rfc_acc, rfc_AUC, rfc_f1, rfc_precision = calculate_metrics(y_test, rfc_pred, "rfc")
## logistic
# L1 penalty: lasso-style logistic regression ('saga' supports l1; raise max_iter if it does not converge)
logistic = LogisticRegression(penalty='l1', solver='saga')
logistic_model, logistic_pred = model_predict(logistic)
logistic_recall, logistic_acc, logistic_AUC, logistic_f1, logistic_precision = calculate_metrics(y_test, logistic_pred, "logistic")
## Cross-validation
from sklearn.model_selection import cross_validate
labels = []
train_scores = []
test_scores = []
def clf_score(clf, x_train, y_train, label, train_scores, test_scores, cv=3, n_jobs=-1):
    score = cross_validate(clf, x_train, y_train, scoring=None, cv=cv, n_jobs=n_jobs,
                           return_train_score=True, return_estimator=True)
    train_scores.append(score['train_score'])
    test_scores.append(score['test_score'])
    labels.append(label)
    print(np.mean(score['test_score']))
    print(pd.DataFrame({'train_score': score['train_score'],
                        'test_score': score['test_score']}))
    clfs = score['estimator']
    return clfs
# Logistic regression
log_models = clf_score(logistic, x_train, y_train, 'logistic',
                       train_scores, test_scores, n_jobs=-1)
l1 = pd.DataFrame(log_models[0].coef_, columns=x_train.columns).T
l1.columns=['model1_coef']
l2 = pd.DataFrame(log_models[1].coef_, columns=x_train.columns).T
l2.columns=['model2_coef']
l3 = pd.DataFrame(log_models[2].coef_, columns=x_train.columns).T
l3.columns=['model3_coef']
log_r = pd.concat([l1,l2,l3], axis=1)
log_r
# KNN
clf_score(knn, x_train, y_train, 'KNN', train_scores, test_scores, n_jobs=-1)
# RFC
clf_score(rfc, x_train, y_train, 'RFC', train_scores, test_scores, n_jobs=-1)
import matplotlib.cbook as cbook
train_stats = cbook.boxplot_stats(train_scores, labels=labels)
test_stats = cbook.boxplot_stats(test_scores, labels=labels)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 10))
ax[0].bxp(train_stats, showfliers=True, meanline=True, showmeans=True)
ax[0].set_title('Train Score')
ax[1].bxp(test_stats, showfliers=True, meanline=True, showmeans=True)
ax[1].set_title('Test Score')
for a in ax:
    a.set_ylim(0.8, 1.005)
plt.show()
# Compare models on the full feature set vs. a reduced one
df4 = df3[['Churn','ContractRenewal','DataPlan']]
df5 = df3.drop(columns=['ContractRenewal','DataPlan'])
X5 = StandardScaler().fit_transform(df5.iloc[:,1:])  # assumes 'Churn' is the first column of df5
x_train4, x_test4, y_train4, y_test4 = train_test_split(df_X, y, test_size=0.2)  # full feature set
# the original split df_X a second time here and never used X5; splitting X5
# matches the stated goal of dropping the weakly correlated features
x_train5, x_test5, y_train5, y_test5 = train_test_split(X5, df5['Churn'], test_size=0.2)
model4 = clf_score(logistic, x_train4, y_train4, 'logistic',
                   train_scores, test_scores, n_jobs=-1)
model5 = clf_score(logistic, x_train5, y_train5, 'logistic',
                   train_scores, test_scores, n_jobs=-1)
model6 = clf_score(knn, x_train4, y_train4, 'knn',
                   train_scores, test_scores, n_jobs=-1)
model7 = clf_score(knn, x_train5, y_train5, 'knn',
                   train_scores, test_scores, n_jobs=-1)
model8 = clf_score(rfc, x_train4, y_train4, 'rfc',
                   train_scores, test_scores, n_jobs=-1)
model9 = clf_score(rfc, x_train5, y_train5, 'rfc',
                   train_scores, test_scores, n_jobs=-1)
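# Quick summary (my addition): mean CV test score for every run recorded above
for label, scores in zip(labels, test_scores):
    print(label, "mean test score:", round(np.mean(scores), 4))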
VI. Summary:
1. All three models predict customer churn well, with mean test accuracy above 85%; random forest performs best, meeting the project goal.
2. I learned how to build machine-learning models; the data-cleaning code could be made more concise.