Preface
Series column: Deep Learning: Practical Algorithm Projects ✨︎
Covering domains such as healthcare, finance, retail, food and beverage, sports and fitness, transportation, environmental science, social media, and text and image processing, the series discusses a range of deep neural network ideas: convolutional neural networks, recurrent neural networks, generative adversarial networks, gated recurrent units, long short-term memory, natural language processing, deep reinforcement learning, large language models, and transfer learning.
Attrition is a problem every company faces, and high turnover is expensive for any company: it shows up in recruiting and training costs, lost productivity, and lower employee morale. Employee attrition analysis is a form of behavioral analysis: we study the behavior and characteristics of employees who have left the organization and compare them with current employees, which helps identify who may leave soon. By identifying the causes of attrition, a company can take steps to reduce turnover and retain valuable employees.
For attrition analysis, we need a dataset that contains both employees' attrition status and the career attributes of a particular company's staff. I found an ideal dataset for this task; you can download it from here.
1. Importing Libraries and the Dataset
First, I'll import the necessary Python libraries and the dataset:
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import chi2_contingency
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Input, Dropout, Dense, BatchNormalization
from sklearn import metrics
from sklearn.metrics import confusion_matrix, roc_curve, auc
np.random.seed(0)
① Load the dataset using the pandas .read_csv() function
data = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
print(data.head())
Age Attrition BusinessTravel DailyRate Department \
0 41 Yes Travel_Rarely 1102 Sales
1 49 No Travel_Frequently 279 Research & Development
2 37 Yes Travel_Rarely 1373 Research & Development
3 33 No Travel_Frequently 1392 Research & Development
4 27 No Travel_Rarely 591 Research & Development
DistanceFromHome Education EducationField EmployeeCount EmployeeNumber \
0 1 2 Life Sciences 1 1
1 8 1 Life Sciences 1 2
2 2 2 Other 1 4
3 3 4 Life Sciences 1 5
4 2 1 Medical 1 7
... RelationshipSatisfaction StandardHours StockOptionLevel \
0 ... 1 80 0
1 ... 4 80 1
2 ... 2 80 0
3 ... 3 80 0
4 ... 4 80 1
TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany \
0 8 0 1 6
1 10 3 3 10
2 7 3 3 0
3 8 3 3 8
4 6 3 3 2
YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 4 0 5
1 7 1 7
2 0 0 0
3 7 3 0
4 2 2 2
[5 rows x 35 columns]
② Drop redundant columns that add nothing meaningful to our analysis
data.drop(columns=["Over18", "EmployeeCount", "EmployeeNumber", "StandardHours"], inplace=True)
# check how many categories each categorical column has
cat_cols=data.select_dtypes(include=object).columns.tolist()
cat_df=pd.DataFrame(data[cat_cols].melt(var_name='column', value_name='value')
.value_counts()).rename(columns={0: 'count'}).sort_values(by=['column', 'count'])
display(cat_df)
③ Let's check whether this dataset contains any missing values:
print(data.isnull().sum())
Age 0
Attrition 0
BusinessTravel 0
DailyRate 0
Department 0
DistanceFromHome 0
Education 0
EducationField 0
EnvironmentSatisfaction 0
Gender 0
HourlyRate 0
JobInvolvement 0
JobLevel 0
JobRole 0
JobSatisfaction 0
MaritalStatus 0
MonthlyIncome 0
MonthlyRate 0
NumCompaniesWorked 0
OverTime 0
PercentSalaryHike 0
PerformanceRating 0
RelationshipSatisfaction 0
StockOptionLevel 0
TotalWorkingYears 0
TrainingTimesLastYear 0
WorkLifeBalance 0
YearsAtCompany 0
YearsInCurrentRole 0
YearsSinceLastPromotion 0
YearsWithCurrManager 0
dtype: int64
2. Exploratory Visual Analysis
2.1 Attrition Distribution
Now let's look at the distribution of employee attrition in the dataset:
fig = make_subplots(subplot_titles=("", "Employee Attrition Rates"),
specs=[[{"type": "bar"}, {"type": "pie"}]],
rows=1, cols=2)
# Bar chart
fig.add_trace(
go.Bar(x=data['Attrition'].value_counts().index,
y=data['Attrition'].value_counts().values,
marker_color=['#C2C4E2', '#D8BFD8'],
hovertemplate='Employee Attrition Counts<br>%{x}: %{y:.4}<extra></extra>',
showlegend=False), row=1, col=1)
fig.update_traces(texttemplate='%{y:.4}', textposition='outside')
attrition_rate = data['Attrition'].value_counts(normalize=True).mul(100).reset_index()
# Pie chart
fig.add_trace(
go.Pie(labels=attrition_rate['Attrition'],
values=attrition_rate['proportion'],
marker=dict(colors=['#C2C4E2', '#D8BFD8'], pattern=dict(shape=[".", "x"])),
pull=[0.2, 0],
hovertemplate='%{label}<br>Attrition Rate: %{value:.3}%<extra></extra>',
showlegend=True), row=1, col=2)
fig.update_layout(
title={'text': 'Employee Attrition Statistics', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': '', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Employee Attrition Counts', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'), width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',
)
fig.show()
2.2 Business Travel Analysis
Analyze attrition by business-travel frequency to check whether differences in travel frequency affect attrition:
fig = make_subplots(subplot_titles=("", "Distribution by Business Travel"),
specs=[[{"type": "bar"}, {"type": "pie"}]],
rows=1, cols=2)
# Bar chart
fig.add_trace(
go.Bar(
x=data['BusinessTravel'],
y=data['Attrition'].map({'Yes': 1, 'No': 0}),
marker_color=['#C2C4E2' if x == 'Travel_Rarely'
else '#D8BFD8' if x == 'Travel_Frequently'
else '#EED4E5' for x in data['BusinessTravel']],
hovertemplate='<b>Business Travel</b>: %{x} <br><b>Attrition</b>: %{y}<extra></extra>'
), row=1, col=1
)
# Pie chart
fig.add_trace(
go.Pie(labels=data['BusinessTravel'].value_counts().index,
values=data['BusinessTravel'].value_counts().values,
marker=dict(colors=['#C2C4E2', '#D8BFD8', '#EED4E5'], pattern=dict(shape=[".", "x", "+"])),
pull=[0, 0, 0.2],
hovertemplate='%{label}<br>Count: %{value:}<br>Percent: %{percent:.3p}<extra></extra>',
showlegend=True), row=1, col=2)
fig.update_traces(selector=dict(type='bar'),showlegend=False)
fig.update_layout(
title={'text': 'Attrition by Business Travel Frequency', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': 'Business Travel', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Attrition Counts', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'), width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',
)
fig.show()
Most employees in the organization travel rarely. Employees who travel frequently show the highest attrition rate, while those who do not travel show the lowest.
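The per-category rates behind this observation can be computed directly with a groupby; a minimal sketch on a tiny synthetic frame standing in for the real data (illustrative values only):

```python
import pandas as pd

# Tiny synthetic stand-in for the HR dataset (illustrative values only)
data = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely", "Travel_Rarely", "Travel_Frequently",
                       "Travel_Frequently", "Non-Travel", "Non-Travel"],
    "Attrition":      ["No", "Yes", "Yes", "Yes", "No", "No"],
})

# Share of leavers within each travel-frequency group
rates = (data.assign(left=data["Attrition"].eq("Yes"))
             .groupby("BusinessTravel")["left"].mean()
             .sort_values(ascending=False))
print(rates)
```

The same pattern works for any categorical column: replace "BusinessTravel" with "Department", "Gender", and so on.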
2.3 Attrition by Department
Let's look at the attrition proportions by department:
fig = make_subplots(subplot_titles=("", "Distribution by Department"),
specs=[[{"type": "bar"}, {"type": "pie"}]],
rows=1, cols=2)
# Bar chart
fig.add_trace(
go.Bar(
x=data['Department'],
y=data['Attrition'].map({'Yes': 1, 'No': 0}),
marker_color=['gold' if x == 'Research & Development'
else '#D8BFD8' if x == 'Sales'
else '#CDA59E' for x in data['Department']],
hovertemplate='<b>Department</b>: %{x} <br><b>Attrition</b>: %{y}<extra></extra>'
), row=1, col=1
)
# Pie chart
fig.add_trace(
go.Pie(labels=data['Department'].value_counts().index,
values=data['Department'].value_counts().values,
marker=dict(colors=['gold', '#D8BFD8', '#CDA59E'], pattern=dict(shape=[".", "x", "+"])),
pull=[0, 0, 0.2],
hovertemplate='%{label}<br>Count: %{value:}<br>Percent: %{percent:.3p}<extra></extra>',
showlegend=True), row=1, col=2)
fig.update_traces(selector=dict(type='bar'),showlegend=False)
fig.update_layout(
title={'text': 'Attrition by Department', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': 'Department', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Attrition Counts', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'), width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',
)
fig.show()
This small dashboard shows the attrition picture for the last quarter. Overall, about 16% of employees left the company. Of those who left, more than half worked in Research & Development, and only about 5% came from Human Resources.
2.4 Education Field Analysis
Now let's look at the attrition proportions by education field:
fig = make_subplots(subplot_titles=("", "Distribution by EducationField"),
specs=[[{"type": "bar"}, {"type": "pie"}]],
rows=1, cols=2)
# Bar chart
fig.add_trace(
go.Bar(
x=data['EducationField'],
y=data['Attrition'].map({'Yes': 1, 'No': 0}),
marker_color=['gold' if x == 'Life Sciences'
else '#D8BFD8' if x == 'Medical'
else '#8BA583' if x == 'Marketing'
else '#FF770F' if x == 'Technical Degree'
else '#CDA59E' if x == 'Human Resources'
else '#80D1C8' for x in data['EducationField']],
hovertemplate='<b>EducationField</b>: %{x} <br><b>Attrition</b>: %{y}<extra></extra>'
), row=1, col=1
)
# Pie chart
fig.add_trace(
go.Pie(labels=data['EducationField'].value_counts().index,
values=data['EducationField'].value_counts().values,
marker=dict(colors=['gold', '#D8BFD8', '#8BA583', '#FF770F', '#80D1C8', '#CDA59E'], pattern=dict(shape=[".", "x", "+", "-", '|', '/'])),
pull=[0, 0, 0, 0, 0.2, 0.2],
hovertemplate='%{label}<br>Count: %{value:}<br>Percent: %{percent:.3p}<extra></extra>',
showlegend=True), row=1, col=2)
fig.update_traces(selector=dict(type='bar'),showlegend=False)
fig.update_layout(
title={'text': 'Attrition by EducationField', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': 'EducationField', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Attrition Counts', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'), width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',
)
fig.show()
We can see that employees whose education field is Life Sciences account for the largest number of attrition cases.
2.5 Gender Analysis
Now let's look at the attrition proportions for male versus female employees:
fig = make_subplots(subplot_titles=("", "Distribution by Gender"),
specs=[[{"type": "bar"}, {"type": "pie"}]],
rows=1, cols=2)
# Bar chart
fig.add_trace(
go.Bar(
x=data['Gender'],
y=data['Attrition'].map({'Yes': 1, 'No': 0}),
marker_color=['#72755B' if x == 'Male'
else '#CDA59E' for x in data['Gender']],
hovertemplate='<b>Gender</b>: %{x} <br><b>Attrition</b>: %{y}<extra></extra>'
), row=1, col=1
)
# Pie chart
fig.add_trace(
go.Pie(labels=data['Gender'].value_counts().index,
values=data['Gender'].value_counts().values,
marker=dict(colors=['#72755B', '#CDA59E'], pattern=dict(shape=["x", "+"])),
pull=[0, 0, 0.2],
hovertemplate='%{label}<br>Count: %{value:}<br>Percent: %{percent:.3p}<extra></extra>',
showlegend=True), row=1, col=2)
fig.update_traces(selector=dict(type='bar'),showlegend=False)
fig.update_layout(
title={'text': 'Attrition by Gender', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': 'Gender', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Attrition Counts', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'), width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',
)
fig.show()
Attrition is higher among male employees than among female employees.
2.6 Marital Status Analysis
Next, let's look at the attrition proportions for single, married, and divorced employees:
fig = make_subplots(subplot_titles=("", "Distribution by MaritalStatus"),
specs=[[{"type": "bar"}, {"type": "pie"}]],
rows=1, cols=2)
# Bar chart
fig.add_trace(
go.Bar(
x=data['MaritalStatus'],
y=data['Attrition'].map({'Yes': 1, 'No': 0}),
marker_color=['#BCB0D7' if x == 'Single'
else '#CC5369' if x == 'Married'
else '#EED4E5' for x in data['MaritalStatus']],
hovertemplate='<b>MaritalStatus</b>: %{x} <br><b>Attrition</b>: %{y}<extra></extra>'
), row=1, col=1
)
# Pie chart
fig.add_trace(
go.Pie(labels=data['MaritalStatus'].value_counts().index,
values=data['MaritalStatus'].value_counts().values,
marker=dict(colors=['#CC5369', '#BCB0D7', '#EED4E5'], pattern=dict(shape=[".", "x", "+"])),
pull=[0, 0, 0.2],
hovertemplate='%{label}<br>Count: %{value:}<br>Percent: %{percent:.3p}<extra></extra>',
showlegend=True), row=1, col=2)
fig.update_traces(selector=dict(type='bar'),showlegend=False)
fig.update_layout(
title={'text': 'Attrition by MaritalStatus', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': 'MaritalStatus', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Attrition Counts', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'), width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',
)
fig.show()
Most employees in the organization are married. Single employees show the highest attrition rate, while divorced employees show the lowest.
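A row-normalized crosstab gives the attrition share within each marital-status group; a sketch on a tiny synthetic frame (illustrative values, not the real dataset's rates):

```python
import pandas as pd

# Tiny synthetic stand-in for the HR dataset
data = pd.DataFrame({
    "MaritalStatus": ["Single", "Single", "Married", "Married", "Divorced", "Divorced"],
    "Attrition":     ["Yes", "No", "No", "No", "No", "Yes"],
})

# normalize="index" turns each row into within-group proportions
ct = pd.crosstab(data["MaritalStatus"], data["Attrition"], normalize="index")
print(ct)
```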
2.7 Job Role Analysis
fig = make_subplots(subplot_titles=("", "Distribution by JobRole"),
specs=[[{"type": "bar"}, {"type": "pie"}]],
rows=1, cols=2)
# Bar chart
fig.add_trace(
go.Bar(
x=data['JobRole'],
y=data['Attrition'].map({'Yes': 1, 'No': 0}),
marker_color=['gold' if x == 'Sales Executive'
else '#D8BFD8' if x == 'Research Scientist'
else '#8BA583' if x == 'Laboratory Technician'
else '#FF770F' if x == 'Manufacturing Director'
else '#80D1C8' if x == 'Healthcare Representative'
else '#CDA59E' if x == 'Manager'
else '#8EA3A6' if x == 'Sales Representative'
else '#6182A8' if x == 'Research Director'
else '#CC5369' for x in data['JobRole']],
hovertemplate='<b>JobRole</b>: %{x} <br><b>Attrition</b>: %{y}<extra></extra>'
), row=1, col=1
)
# Pie chart
fig.add_trace(
go.Pie(labels=data['JobRole'].value_counts().index,
values=data['JobRole'].value_counts().values,
marker=dict(colors=['gold', '#D8BFD8', '#8BA583', '#FF770F', '#80D1C8',\
'#CDA59E', '#8EA3A6', '#6182A8', '#CC5369'],
pattern=dict(shape=[".", "x", "+", "-", '|', '/', '+','x','|'])),
pull=[0, 0, 0, 0, 0, 0, 0.2, 0, 0],
hovertemplate='%{label}<br>Count: %{value:}<br>Percent: %{percent:.3p}<extra></extra>',
showlegend=True), row=1, col=2)
fig.update_traces(selector=dict(type='bar'),showlegend=False)
fig.update_layout(
title={'text': 'Attrition by JobRole', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': 'JobRole', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Attrition Counts', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'), width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',
)
fig.show()
2.8 Monthly Income Distribution
① Distribution of monthly income by attrition status
plot_df=data.sort_values(by="Attrition")
fig=px.histogram(plot_df, x='MonthlyIncome', color='Attrition',
opacity=0.8, histnorm='density', barmode='overlay', marginal='box',
color_discrete_map={'Yes': '#C02B34','No': '#CDBBA7'})
fig.update_layout(title_text='Distribution of Monthly Income by Attrition Status',
xaxis_title='Monthly Income, $', yaxis_title='Density', width=900, height=500,
font_color='#28221D', font_family= 'Comic Sans MS',
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2', legend_traceorder='reversed')
fig.show()
The monthly-income distributions of both current and former employees are right-skewed, and former employees generally earned less: the median monthly income of employees who left is more than $2,000 lower than that of those who stayed.
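The median gap mentioned here can be checked with a one-line groupby; a sketch on a tiny synthetic frame (the real figures come from the full dataset):

```python
import pandas as pd

# Synthetic stand-in; the actual medians come from the full HR dataset
data = pd.DataFrame({
    "Attrition":     ["Yes", "Yes", "No", "No", "No"],
    "MonthlyIncome": [3000, 3200, 5000, 5200, 5400],
})

# Median income per attrition group, and the gap between them
medians = data.groupby("Attrition")["MonthlyIncome"].median()
gap = medians["No"] - medians["Yes"]
print(medians)
print("median gap:", gap)
```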
② Distribution of monthly income by work-life balance (grouped by gender)
fig=go.Figure()
colors=['#214D5C','#91ABB4']
for i, j in enumerate(data['Gender'].unique()):
    df_plot = data[data['Gender'] == j]
    fig.add_trace(go.Box(x=df_plot['WorkLifeBalance'], y=df_plot['MonthlyIncome'],
                         notched=True, line=dict(color=colors[i]), name=j))
fig.update_layout(title='Distribution of Monthly Income by Work Life Balance',
xaxis_title='Work Life Balance', boxmode='group',
font_color='#28221D', font_family='Comic Sans MS',
xaxis = dict(tickmode = 'array', tickvals = [1, 2, 3, 4],
ticktext = ['Poor', 'Fair', 'Good', 'Excellent']),
width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',)
fig.show()
③ Monthly income increases with total working years and job level
plot_df = data.copy()
plot_df['JobLevel'] = pd.Categorical(
plot_df['JobLevel']).rename_categories(
['Entry level', 'Mid level', 'Senior', 'Lead', 'Executive'])
col=['#73AF8E', '#4F909B', '#707BAD', '#A89DB7','#C99193']
fig = px.scatter(plot_df, x='TotalWorkingYears', y='MonthlyIncome',
color='JobLevel', size='MonthlyIncome',
color_discrete_sequence=col,
category_orders={'JobLevel': ['Entry level', 'Mid level', 'Senior', 'Lead', 'Executive']})
fig.update_layout(legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
title='Monthly income increases with total number of years worked and job level <br>',
xaxis_title='Total Working Years', yaxis=dict(title='Income',tickprefix='$'),
legend_title='', font_color='#28221D', font_family='Comic Sans MS', width=900, height=500,
margin=dict(l=40, r=30, b=80, t=120),plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',)
fig.show()
The scatter plot shows that monthly income is positively correlated with total working years, and an employee's income is also closely tied to their job level.
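The positive relationship can be quantified with a Pearson coefficient; a minimal sketch with illustrative numbers standing in for the real columns:

```python
import pandas as pd

# Illustrative values standing in for the real dataset's columns
data = pd.DataFrame({
    "TotalWorkingYears": [1, 3, 5, 10, 20, 30],
    "MonthlyIncome":     [2000, 2500, 4000, 6000, 12000, 19000],
})

# Series.corr computes the Pearson coefficient by default
r = data["TotalWorkingYears"].corr(data["MonthlyIncome"])
print(round(r, 2))
```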
2.9 Attrition Correlation Matrix
Use the create_annotated_heatmap function to plot the matrix and explore correlations in the data
cat_cols = []
for i in data.columns:
    if data[i].nunique() <= 5 or data[i].dtype == object:
        cat_cols.append(i)
df=data.copy()
df.drop(df[cat_cols], axis=1, inplace=True)
corr=df.corr().round(2)
x=corr.index.tolist()
y=corr.columns.tolist()
z=corr.to_numpy()
fig = ff.create_annotated_heatmap(z=z, x=x, y=y, annotation_text=z, name='',
hovertemplate="Correlation between %{x} and %{y}= %{z}",
colorscale='GnBu')
fig.update_yaxes(autorange="reversed")
fig.update_layout(title="Correlation Matrix of Employee Attrition",
font_color='#28221D',margin=dict(t=180), height=600)
fig.show()
Monthly income has a strong positive correlation of 0.77 with total working years, which matches what we saw in the scatter plot above. Years at the company is also strongly correlated with years with the current manager (0.77) and years in the current role (0.76). No pair of variables exceeds a correlation of 0.8, which suggests multicollinearity is not a serious concern here.
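The strongly correlated pairs called out here can be extracted programmatically from the correlation matrix; a minimal sketch on a tiny synthetic frame standing in for the numeric df above:

```python
import numpy as np
import pandas as pd

# Tiny synthetic numeric frame standing in for the filtered df
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # perfectly correlated with a
    "c": [5, 3, 8, 1, 9],    # weakly related to both
})

corr = df.corr().abs()
# Keep the upper triangle only, so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong = pairs[pairs > 0.7].sort_values(ascending=False)
print(strong)
```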
3. Statistical Analysis - Feature Importance
3.1 Numerical Features
Run a one-way ANOVA test to analyze the importance of each numerical feature for attrition.
num_cols = data.select_dtypes(np.number).columns  # select numerical features
new_data = data.copy()
new_data["Attrition"] = new_data['Attrition'].map({'Yes': 1, 'No': 0})  # map labels to 0/1
f_scores = {}
p_values = {}
for i in num_cols:
    # One-way ANOVA: compare the feature's values between leavers and stayers
    f_score, p_value = stats.f_oneway(new_data.loc[new_data["Attrition"] == 1, i],
                                      new_data.loc[new_data["Attrition"] == 0, i])
    f_scores[i] = f_score
    p_values[i] = p_value
Visualize the ANOVA F-score of each numerical feature.
fig = go.Figure(data=[go.Bar(
x=list(f_scores.keys()),
y=list(f_scores.values()),
marker_color=["#C2C4E2","#CDB5CD","khaki","#D1E6DC","#BDE2E2",]*5
)])
fig.update_traces(texttemplate='%{value:.0f}', textposition='outside')
fig.update_layout(title={'text': 'Anova-Test F_scores Comparison', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': '', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'F_scores', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'),
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2', width=900, height=500)
fig.show()
Compare the ANOVA F-scores with the corresponding p-values.
anova_data = pd.DataFrame({"Features": list(f_scores.keys()),
                           "F_Score": list(f_scores.values())})
anova_data["P_value"] = [format(p, '.20f') for p in list(p_values.values())]
anova_data
3.2 Categorical Features
Use the chi-square test to analyze the importance of each categorical feature for attrition
cat_cols = data.select_dtypes(include="object").columns.tolist()  # select categorical features
cat_cols.remove("Attrition")
chi2_statistic = {}
p_values = {}
# Perform chi-square test for each column
for col in cat_cols:
    contingency_table = pd.crosstab(data[col], data['Attrition'])
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    chi2_statistic[col] = chi2_stat
    p_values[col] = p_value
Visualize the chi-square statistic of each categorical feature.
fig = go.Figure(data=[go.Bar(
x=list(chi2_statistic.keys()),
y=list(chi2_statistic.values()),
marker_color=["#C2C4E2","#CDB5CD","khaki","#D1E6DC","#BDE2E2",]*5
)])
fig.update_traces(texttemplate='%{value:.1f}', textposition='outside')
fig.update_layout(title={'text': 'Chi-Square Statistic Value of each Categorical Columns', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': '', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Chi-Square Statistic', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'),
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2', width=900, height=500)
fig.show()
Compare the chi-square statistics with the corresponding p-values.
chi_data = pd.DataFrame({"Features":list(chi2_statistic.keys()),"Chi_2 Statistic":list(chi2_statistic.values())})
chi_data["P_value"] = [format(p, '.20f') for p in list(p_values.values())]
chi_data
4. Data Preprocessing
The dataset contains many features with categorical values. I'll convert these categorical variables into numerical ones.
data.columns
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
'DistanceFromHome', 'Education', 'EducationField',
'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'OverTime',
'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
'YearsSinceLastPromotion', 'YearsWithCurrManager'],
dtype='object')
cat_cols = data.select_dtypes(include=['object']).columns.tolist()
cat_cols
['Attrition',
'BusinessTravel',
'Department',
'EducationField',
'Gender',
'JobRole',
'MaritalStatus',
'OverTime']
4.1 Encoding Categorical Variables
# map Gender to 0/1
data["Gender"] = data["Gender"].map({"Female":0 ,"Male":1})
le = LabelEncoder()
data['Attrition'] = le.fit_transform(data['Attrition'])
ohe = OneHotEncoder()  # one-hot encode the discrete categorical features
encoded = ohe.fit_transform(data[['BusinessTravel',
'Department',
'EducationField',
'JobRole',
'MaritalStatus',
'OverTime']])
encoded_data = pd.DataFrame(encoded.toarray(),columns = ohe.get_feature_names_out())
data = pd.concat([data,encoded_data],axis=1)
data = data.drop(['BusinessTravel',
'Department',
'EducationField',
'JobRole',
'MaritalStatus',
'OverTime'],axis =1)
print(data.info())
Now let's use the .corr() function to look at the correlations in the data:
correlation = data.corr()
print(correlation["Attrition"].sort_values(ascending=False))
4.2 Oversampling with SMOTE
Standard ML techniques such as decision trees and logistic regression are biased toward the majority class and tend to ignore the minority class: they predict mostly the majority label, so the minority class is severely misclassified. In more technical terms, if the dataset has an imbalanced class distribution, the model is prone to negligible or very low recall on the minority class.
SMOTE (Synthetic Minority Oversampling Technique) is one of the most widely used oversampling methods for the class-imbalance problem. It balances the class distribution by generating new minority-class instances rather than merely duplicating existing ones: SMOTE synthesizes virtual training records by linear interpolation between an existing minority instance and one of its randomly chosen k nearest minority-class neighbors. After oversampling, the rebuilt data can be fed to any classification model.
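The interpolation step described above can be sketched in a few lines; this is a conceptual illustration with made-up feature vectors, not the imblearn implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two existing minority-class samples (made-up feature vectors)
x_i  = np.array([1.0, 2.0])
x_nn = np.array([3.0, 6.0])   # one of x_i's k nearest minority neighbors

# SMOTE draws a synthetic point on the line segment between them
lam = rng.random()                 # lambda in [0, 1)
x_new = x_i + lam * (x_nn - x_i)   # synthetic minority sample
print(x_new)
```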
# separating the features (X) and the target (y) from the dataset
X = data.drop(['Attrition'], axis=1)
y = data['Attrition']
# Oversampling
smote = SMOTE(random_state=42)
X_resample, y_resample = smote.fit_resample(X, y)
4.3 Feature Scaling with Standardization
Standardization of datasets is a common requirement for many machine learning estimators: they may behave badly if the individual features do not look more or less like standard normally distributed data (e.g. a Gaussian with zero mean and unit variance).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_resample)
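StandardScaler implements the column-wise z-score; a quick numpy-only check of what fit_transform computes (illustrative array, not the resampled data):

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# StandardScaler computes the column-wise z-score: (x - mean) / std
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```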
5. Attrition Prediction Model
5.1 Data Preparation (Train/Test Split)
Now let's split the data into training and test sets:
X_train, X_test,\
y_train, y_test = train_test_split(X_scaled, y_resample,
test_size=0.2, random_state=42)
5.2 Building the Neural Network
5.2.1 Defining the Network Architecture
# Define network structure
model = Sequential()
model.add(Input(shape=(X_train.shape[1],)))
model.add(Dropout(0.1))
model.add(Dense(64, activation ='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Dense(32, activation ='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Dense(1, activation ='sigmoid'))
model.compile(loss ='binary_crossentropy',
optimizer ='rmsprop', metrics =['accuracy'])
Next, let's view the model summary with the .summary() function
# Model summary
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dropout (Dropout) │ (None, 50) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense (Dense) │ (None, 64) │ 3,264 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization │ (None, 64) │ 256 │
│ (BatchNormalization) │ │ │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_1 (Dropout) │ (None, 64) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_1 (Dense) │ (None, 32) │ 2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_1 │ (None, 32) │ 128 │
│ (BatchNormalization) │ │ │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_2 (Dropout) │ (None, 32) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_2 (Dense) │ (None, 1) │ 33 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
Total params: 5,761 (22.50 KB)
Trainable params: 5,569 (21.75 KB)
Non-trainable params: 192 (768.00 B)
5.2.2 Training and Visualization (accuracy curves)
# Training Model
history = model.fit(X_train, y_train, epochs =100, verbose = 1, batch_size = 20, validation_data=(X_test, y_test))
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
history_df = pd.DataFrame(history.history)
plt.plot(history_df.loc[:, ['accuracy']], "#BDE2E2", label='Training accuracy')
plt.plot(history_df.loc[:, ['val_accuracy']], "#C2C4E2", label='Validation accuracy')
plt.title('Training and Validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
5.2.3 Training and Visualization (loss curves)
history_df = pd.DataFrame(history.history)
plt.plot(history_df.loc[:, ['loss']], "#BDE2E2", label='Training loss')
plt.plot(history_df.loc[:, ['val_loss']],"#C2C4E2", label='Validation loss')
plt.title('Training and Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(loc="best")
plt.show()
6. Model Evaluation
The evaluate() function returns a list of two values: the first is the model's loss on the dataset, the second its accuracy. If you only care about accuracy, ignore the loss value.
# evaluate the keras model
_, accuracy = model.evaluate(X_test, y_test)
print('Accuracy: %.2f' % (accuracy*100))
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.9636 - loss: 0.1376
Accuracy: 96.76
Making predictions is as simple as calling predict() on the model. The output layer uses a sigmoid activation, so the predictions are probabilities between 0 and 1.
# make probability predictions with the model
predictions = model.predict(X_test)
# round predictions
y_pred = [round(x[0]) for x in predictions]
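Beyond accuracy, precision and recall are worth checking on a churn problem; a hand-rolled sketch with illustrative labels standing in for y_test and y_pred (sklearn.metrics provides the same via precision_score, recall_score, and f1_score):

```python
# Precision, recall, and F1 computed from label lists
# (illustrative labels, not the model's actual predictions)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_hat  = [1, 0, 1, 0, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)

precision = tp / (tp + fp)   # of predicted leavers, how many actually left
recall    = tp / (tp + fn)   # of actual leavers, how many we caught
f1        = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```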
6.1 Computing the Confusion Matrix and ROC Curve Data
# confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
cf_matrix
# compute the ROC curve data for the AUC-ROC plot
fpr, tpr, thresholds = roc_curve(y_test, predictions)
roc_auc = auc(fpr, tpr)
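The thresholds array returned by roc_curve can also be used to pick an operating point, for example by maximizing Youden's J statistic (TPR - FPR); a sketch with made-up ROC points rather than the model's actual curve:

```python
import numpy as np

# Illustrative ROC points standing in for fpr, tpr, thresholds
fpr        = np.array([0.0, 0.1, 0.30, 1.0])
tpr        = np.array([0.0, 0.7, 0.85, 1.0])
thresholds = np.array([1.9, 0.8, 0.40, 0.0])

j = tpr - fpr                 # Youden's J statistic at each threshold
best = int(np.argmax(j))      # index of the best operating point
print(thresholds[best], j[best])
```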
6.2 Plotting the Confusion Matrix and AUC-ROC Curve
Next, let's plot the confusion matrix with the .Heatmap() trace and the ROC curve with the .Scatter() trace
# Mapping Confusion Matrix and AUC-ROC Curve
fig = make_subplots(
subplot_titles=("Confusion Matrix", "AUC-ROC Curve"),
rows=1, cols=2)
fig.add_trace(
go.Heatmap(
x=['Predicted Negative', 'Predicted Positive'],
y=['Actual Positive', 'Actual Negative'],
z=[cf_matrix.tolist()[1], cf_matrix.tolist()[0]],
text=[cf_matrix.tolist()[1], cf_matrix.tolist()[0]],
texttemplate='%{text}',
colorscale='cividis',
showscale=True), row=1, col=1
)
fig.add_trace(
go.Scatter(
x=fpr, y=tpr,
name='ROC curve (area = %0.2f)' % roc_auc,
line=dict(color='darkorange', width=2),
mode='lines'),row=1, col=2
)
fig.update_layout(
title=dict(text='Confusion Matrix & AUC-ROC Curve',
font=dict(size=24, family='Times New Roman')),
xaxis1=dict(title='Predicted', tickfont=dict(size=12, family='Verdana')),
yaxis1=dict(title='Actual', tickfont=dict(size=12, family='Verdana')),
xaxis2=dict(title='False Positive Rate'),
yaxis2=dict(title='True Positive Rate'),
width=900,height=500,plot_bgcolor='#F2F2F2'
)
fig.show()
From: https://blog.csdn.net/m0_63287589/article/details/139574728