Preface
Series column: Deep Learning: Practical Algorithm Projects ✨︎
Covering domains such as healthcare, finance, retail, food and beverage, sports and fitness, transportation, environmental science, social media, and text and image processing, the series discusses a range of deep neural network ideas: convolutional neural networks, recurrent neural networks, generative adversarial networks, gated recurrent units, long short-term memory, natural language processing, deep reinforcement learning, large language models, and transfer learning.
Attrition is a problem every company faces, and high turnover is expensive for any company: it shows up in recruiting and training costs, lost productivity, and lower employee morale. Employee attrition analysis is a form of behavioral analysis: we study the behavior and characteristics of employees who have left the organization and compare them with current employees, which helps identify who may leave soon. By identifying the causes of attrition, a company can take steps to reduce turnover and retain valuable employees.
For attrition analysis, we need a dataset that contains both employees' attrition status and the career attributes of a particular company's staff. I found an ideal dataset for this task; you can download it from here.
1. Importing Libraries and the Dataset
First, I'll import the necessary Python libraries and the dataset:
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import chi2_contingency
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Input, Dropout, Dense, BatchNormalization
from sklearn import metrics
from sklearn.metrics import confusion_matrix, roc_curve, auc
np.random.seed(0)
① Load the dataset using the pandas .read_csv() function
data = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
print(data.head())
Age Attrition BusinessTravel DailyRate Department \
0 41 Yes Travel_Rarely 1102 Sales
1 49 No Travel_Frequently 279 Research & Development
2 37 Yes Travel_Rarely 1373 Research & Development
3 33 No Travel_Frequently 1392 Research & Development
4 27 No Travel_Rarely 591 Research & Development
DistanceFromHome Education EducationField EmployeeCount EmployeeNumber \
0 1 2 Life Sciences 1 1
1 8 1 Life Sciences 1 2
2 2 2 Other 1 4
3 3 4 Life Sciences 1 5
4 2 1 Medical 1 7
... RelationshipSatisfaction StandardHours StockOptionLevel \
0 ... 1 80 0
1 ... 4 80 1
2 ... 2 80 0
3 ... 3 80 0
4 ... 4 80 1
TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany \
0 8 0 1 6
1 10 3 3 10
2 7 3 3 0
3 8 3 3 8
4 6 3 3 2
YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
0 4 0 5
1 7 1 7
2 0 0 0
3 7 3 0
4 2 2 2
[5 rows x 35 columns]
② Drop redundant columns that add nothing meaningful to our analysis
data.drop(columns=["Over18", "EmployeeCount", "EmployeeNumber", "StandardHours"], inplace=True)
# check how many categories each categorical column has
cat_cols=data.select_dtypes(include=object).columns.tolist()
cat_df=pd.DataFrame(data[cat_cols].melt(var_name='column', value_name='value')
.value_counts()).rename(columns={0: 'count'}).sort_values(by=['column', 'count'])
display(cat_df)
③ Let's check whether this dataset contains any missing values:
print(data.isnull().sum())
Age 0
Attrition 0
BusinessTravel 0
DailyRate 0
Department 0
DistanceFromHome 0
Education 0
EducationField 0
EnvironmentSatisfaction 0
Gender 0
HourlyRate 0
JobInvolvement 0
JobLevel 0
JobRole 0
JobSatisfaction 0
MaritalStatus 0
MonthlyIncome 0
MonthlyRate 0
NumCompaniesWorked 0
OverTime 0
PercentSalaryHike 0
PerformanceRating 0
RelationshipSatisfaction 0
StockOptionLevel 0
TotalWorkingYears 0
TrainingTimesLastYear 0
WorkLifeBalance 0
YearsAtCompany 0
YearsInCurrentRole 0
YearsSinceLastPromotion 0
YearsWithCurrManager 0
dtype: int64
2. Exploratory Visual Analysis
2.1 Attrition Distribution
Now let's look at the distribution of employee attrition in the dataset:
fig = make_subplots(subplot_titles=("", "Employee Attrition Rates"),
specs=[[{"type": "bar"}, {"type": "pie"}]],
rows=1, cols=2)
# Bar chart
fig.add_trace(
go.Bar(x=data['Attrition'].value_counts().index,
y=data['Attrition'].value_counts().values,
marker_color=['#C2C4E2', '#D8BFD8'],
hovertemplate='Employee Attrition Counts<br>%{x}: %{y:.4}<extra></extra>',
showlegend=False), row=1, col=1)
fig.update_traces(texttemplate='%{y:.4}', textposition='outside')
attrition_rate = data['Attrition'].value_counts(normalize=True).mul(100).reset_index()
# Pie chart
fig.add_trace(
go.Pie(labels=attrition_rate['Attrition'],
values=attrition_rate['proportion'],
marker=dict(colors=['#C2C4E2', '#D8BFD8'], pattern=dict(shape=[".", "x"])),
pull=[0.2, 0],
hovertemplate='%{label}<br>Attrition Rate: %{value:.3}%<extra></extra>',
showlegend=True), row=1, col=2)
fig.update_layout(
title={'text': 'Employee Attrition Statistics', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': '', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Employee Attrition Counts', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'), width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',
)
fig.show()
2.2 Business Travel Analysis
Analyze attrition by business-travel frequency to check whether differences in travel frequency affect attrition:
fig = make_subplots(subplot_titles=("", "Distribution by Business Travel"),
specs=[[{"type": "bar"}, {"type": "pie"}]],
rows=1, cols=2)
# Bar chart
fig.add_trace(
go.Bar(
x=data['BusinessTravel'],
y=data['Attrition'].map({'Yes': 1, 'No': 0}),
marker_color=['#C2C4E2' if x == 'Travel_Rarely'
else '#D8BFD8' if x == 'Travel_Frequently'
else '#EED4E5' for x in data['BusinessTravel']],
hovertemplate='<b>Business Travel</b>: %{x} <br><b>Attrition</b>: %{y}<extra></extra>'
), row=1, col=1
)
# Pie chart
fig.add_trace(
go.Pie(labels=data['BusinessTravel'].value_counts().index,
values=data['BusinessTravel'].value_counts().values,
marker=dict(colors=['#C2C4E2', '#D8BFD8', '#EED4E5'], pattern=dict(shape=[".", "x", "+"])),
pull=[0, 0, 0.2],
hovertemplate='%{label}<br>Count: %{value:}<br>Percent: %{percent:.3p}<extra></extra>',
showlegend=True), row=1, col=2)
fig.update_traces(selector=dict(type='bar'),showlegend=False)
fig.update_layout(
title={'text': 'Attrition by Business Travel Frequency', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': 'Business Travel', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Attrition Counts', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'), width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',
)
fig.show()
Most employees in the organization travel rarely. Employees who travel frequently show the highest attrition rate, while those who do not travel show the lowest.
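The per-category rates behind this observation can be computed directly with a groupby; a minimal sketch on a tiny synthetic frame standing in for the real data (illustrative values only):

```python
import pandas as pd

# Tiny synthetic stand-in for the HR dataset (illustrative values only)
data = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely", "Travel_Rarely", "Travel_Frequently",
                       "Travel_Frequently", "Non-Travel", "Non-Travel"],
    "Attrition":      ["No", "Yes", "Yes", "Yes", "No", "No"],
})

# Share of leavers within each travel-frequency group
rates = (data.assign(left=data["Attrition"].eq("Yes"))
             .groupby("BusinessTravel")["left"].mean()
             .sort_values(ascending=False))
print(rates)
```

The same pattern works for any categorical column: replace "BusinessTravel" with "Department", "Gender", and so on.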
2.3 Attrition by Department
Let's look at the attrition proportions by department:
fig = make_subplots(subplot_titles=("", "Distribution by Department"),
specs=[[{"type": "bar"}, {"type": "pie"}]],
rows=1, cols=2)
# Bar chart
fig.add_trace(
go.Bar(
x=data['Department'],
y=data['Attrition'].map({'Yes': 1, 'No': 0}),
marker_color=['gold' if x == 'Research & Development'
else '#D8BFD8' if x == 'Sales'
else '#CDA59E' for x in data['Department']],
hovertemplate='<b>Department</b>: %{x} <br><b>Attrition</b>: %{y}<extra></extra>'
), row=1, col=1
)
# Pie chart
fig.add_trace(
go.Pie(labels=data['Department'].value_counts().index,
values=data['Department'].value_counts().values,
marker=dict(colors=['gold', '#D8BFD8', '#CDA59E'], pattern=dict(shape=[".", "x", "+"])),
pull=[0, 0, 0.2],
hovertemplate='%{label}<br>Count: %{value:}<br>Percent: %{percent:.3p}<extra></extra>',
showlegend=True), row=1, col=2)
fig.update_traces(selector=dict(type='bar'),showlegend=False)
fig.update_layout(
title={'text': 'Attrition by Department', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': 'Department', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Attrition Counts', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'), width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',
)
fig.show()
This small dashboard shows the attrition picture for the last quarter. Overall, about 16% of employees left the company. Of those who left, more than half worked in Research & Development, and only about 5% came from Human Resources.
2.4 Education Field Analysis
Now let's look at the attrition proportions by education field:
fig = make_subplots(subplot_titles=("", "Distribution by EducationField"),
specs=[[{"type": "bar"}, {"type": "pie"}]],
rows=1, cols=2)
# Bar chart
fig.add_trace(
go.Bar(
x=data['EducationField'],
y=data['Attrition'].map({'Yes': 1, 'No': 0}),
marker_color=['gold' if x == 'Life Sciences'
else '#D8BFD8' if x == 'Medical'
else '#8BA583' if x == 'Marketing'
else '#FF770F' if x == 'Technical Degree'
else '#CDA59E' if x == 'Human Resources'
else '#80D1C8' for x in data['EducationField']],
hovertemplate='<b>EducationField</b>: %{x} <br><b>Attrition</b>: %{y}<extra></extra>'
), row=1, col=1
)
# Pie chart
fig.add_trace(
go.Pie(labels=data['EducationField'].value_counts().index,
values=data['EducationField'].value_counts().values,
marker=dict(colors=['gold', '#D8BFD8', '#8BA583', '#FF770F', '#80D1C8', '#CDA59E'], pattern=dict(shape=[".", "x", "+", "-", '|', '/'])),
pull=[0, 0, 0, 0, 0.2, 0.2],
hovertemplate='%{label}<br>Count: %{value:}<br>Percent: %{percent:.3p}<extra></extra>',
showlegend=True), row=1, col=2)
fig.update_traces(selector=dict(type='bar'),showlegend=False)
fig.update_layout(
title={'text': 'Attrition by EducationField', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': 'EducationField', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Attrition Counts', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'), width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',
)
fig.show()
We can see that employees whose education field is Life Sciences account for the largest number of attrition cases.
2.5 Gender Analysis
Now let's look at the attrition proportions for male versus female employees:
fig = make_subplots(subplot_titles=("", "Distribution by Gender"),
specs=[[{"type": "bar"}, {"type": "pie"}]],
rows=1, cols=2)
# Bar chart
fig.add_trace(
go.Bar(
x=data['Gender'],
y=data['Attrition'].map({'Yes': 1, 'No': 0}),
marker_color=['#72755B' if x == 'Male'
else '#CDA59E' for x in data['Gender']],
hovertemplate='<b>Gender</b>: %{x} <br><b>Attrition</b>: %{y}<extra></extra>'
), row=1, col=1
)
# Pie chart
fig.add_trace(
go.Pie(labels=data['Gender'].value_counts().index,
values=data['Gender'].value_counts().values,
marker=dict(colors=['#72755B', '#CDA59E'], pattern=dict(shape=["x", "+"])),
pull=[0, 0, 0.2],
hovertemplate='%{label}<br>Count: %{value:}<br>Percent: %{percent:.3p}<extra></extra>',
showlegend=True), row=1, col=2)
fig.update_traces(selector=dict(type='bar'),showlegend=False)
fig.update_layout(
title={'text': 'Attrition by Gender', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': 'Gender', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Attrition Counts', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'), width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',
)
fig.show()
Attrition is higher among male employees than among female employees.
2.6 Marital Status Analysis
Next, let's look at the attrition proportions for single, married, and divorced employees:
fig = make_subplots(subplot_titles=("", "Distribution by MaritalStatus"),
specs=[[{"type": "bar"}, {"type": "pie"}]],
rows=1, cols=2)
# Bar chart
fig.add_trace(
go.Bar(
x=data['MaritalStatus'],
y=data['Attrition'].map({'Yes': 1, 'No': 0}),
marker_color=['#BCB0D7' if x == 'Single'
else '#CC5369' if x == 'Married'
else '#EED4E5' for x in data['MaritalStatus']],
hovertemplate='<b>MaritalStatus</b>: %{x} <br><b>Attrition</b>: %{y}<extra></extra>'
), row=1, col=1
)
# Pie chart
fig.add_trace(
go.Pie(labels=data['MaritalStatus'].value_counts().index,
values=data['MaritalStatus'].value_counts().values,
marker=dict(colors=['#CC5369', '#BCB0D7', '#EED4E5'], pattern=dict(shape=[".", "x", "+"])),
pull=[0, 0, 0.2],
hovertemplate='%{label}<br>Count: %{value:}<br>Percent: %{percent:.3p}<extra></extra>',
showlegend=True), row=1, col=2)
fig.update_traces(selector=dict(type='bar'),showlegend=False)
fig.update_layout(
title={'text': 'Attrition by MaritalStatus', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': 'MaritalStatus', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Attrition Counts', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'), width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',
)
fig.show()
Most employees in the organization are married. Single employees show the highest attrition rate, while divorced employees show the lowest.
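A row-normalized crosstab gives the attrition share within each marital-status group; a sketch on a tiny synthetic frame (illustrative values, not the real dataset's rates):

```python
import pandas as pd

# Tiny synthetic stand-in for the HR dataset
data = pd.DataFrame({
    "MaritalStatus": ["Single", "Single", "Married", "Married", "Divorced", "Divorced"],
    "Attrition":     ["Yes", "No", "No", "No", "No", "Yes"],
})

# normalize="index" turns each row into within-group proportions
ct = pd.crosstab(data["MaritalStatus"], data["Attrition"], normalize="index")
print(ct)
```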
2.7 Job Role Analysis
fig = make_subplots(subplot_titles=("", "Distribution by JobRole"),
specs=[[{"type": "bar"}, {"type": "pie"}]],
rows=1, cols=2)
# Bar chart
fig.add_trace(
go.Bar(
x=data['JobRole'],
y=data['Attrition'].map({'Yes': 1, 'No': 0}),
marker_color=['gold' if x == 'Sales Executive'
else '#D8BFD8' if x == 'Research Scientist'
else '#8BA583' if x == 'Laboratory Technician'
else '#FF770F' if x == 'Manufacturing Director'
else '#80D1C8' if x == 'Healthcare Representative'
else '#CDA59E' if x == 'Manager'
else '#8EA3A6' if x == 'Sales Representative'
else '#6182A8' if x == 'Research Director'
else '#CC5369' for x in data['JobRole']],
hovertemplate='<b>JobRole</b>: %{x} <br><b>Attrition</b>: %{y}<extra></extra>'
), row=1, col=1
)
# Pie chart
fig.add_trace(
go.Pie(labels=data['JobRole'].value_counts().index,
values=data['JobRole'].value_counts().values,
marker=dict(colors=['gold', '#D8BFD8', '#8BA583', '#FF770F', '#80D1C8',\
'#CDA59E', '#8EA3A6', '#6182A8', '#CC5369'],
pattern=dict(shape=[".", "x", "+", "-", '|', '/', '+','x','|'])),
pull=[0, 0, 0, 0, 0, 0, 0.2, 0, 0],
hovertemplate='%{label}<br>Count: %{value:}<br>Percent: %{percent:.3p}<extra></extra>',
showlegend=True), row=1, col=2)
fig.update_traces(selector=dict(type='bar'),showlegend=False)
fig.update_layout(
title={'text': 'Attrition by JobRole', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': 'JobRole', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Attrition Counts', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'), width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',
)
fig.show()
2.8 Monthly Income Distribution
① Distribution of monthly income by attrition status
plot_df=data.sort_values(by="Attrition")
fig=px.histogram(plot_df, x='MonthlyIncome', color='Attrition',
opacity=0.8, histnorm='density', barmode='overlay', marginal='box',
color_discrete_map={'Yes': '#C02B34','No': '#CDBBA7'})
fig.update_layout(title_text='Distribution of Monthly Income by Attrition Status',
xaxis_title='Monthly Income, $', yaxis_title='Density', width=900, height=500,
font_color='#28221D', font_family= 'Comic Sans MS',
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2', legend_traceorder='reversed')
fig.show()
The monthly-income distributions of both current and former employees are right-skewed, and former employees generally earned less: the median monthly income of employees who left is more than $2,000 lower than that of those who stayed.
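The median gap mentioned here can be checked with a one-line groupby; a sketch on a tiny synthetic frame (the real figures come from the full dataset):

```python
import pandas as pd

# Synthetic stand-in; the actual medians come from the full HR dataset
data = pd.DataFrame({
    "Attrition":     ["Yes", "Yes", "No", "No", "No"],
    "MonthlyIncome": [3000, 3200, 5000, 5200, 5400],
})

# Median income per attrition group, and the gap between them
medians = data.groupby("Attrition")["MonthlyIncome"].median()
gap = medians["No"] - medians["Yes"]
print(medians)
print("median gap:", gap)
```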
② Distribution of monthly income by work-life balance (grouped by gender)
fig=go.Figure()
colors=['#214D5C','#91ABB4']
for i, j in enumerate(data['Gender'].unique()):
    df_plot = data[data['Gender'] == j]
    fig.add_trace(go.Box(x=df_plot['WorkLifeBalance'], y=df_plot['MonthlyIncome'],
                         notched=True, line=dict(color=colors[i]), name=j))
fig.update_layout(title='Distribution of Monthly Income by Work Life Balance',
xaxis_title='Work Life Balance', boxmode='group',
font_color='#28221D', font_family='Comic Sans MS',
xaxis = dict(tickmode = 'array', tickvals = [1, 2, 3, 4],
ticktext = ['Poor', 'Fair', 'Good', 'Excellent']),
width=900, height=500,
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',)
fig.show()
③ Monthly income increases with total working years and job level
plot_df = data.copy()
plot_df['JobLevel'] = pd.Categorical(
plot_df['JobLevel']).rename_categories(
['Entry level', 'Mid level', 'Senior', 'Lead', 'Executive'])
col=['#73AF8E', '#4F909B', '#707BAD', '#A89DB7','#C99193']
fig = px.scatter(plot_df, x='TotalWorkingYears', y='MonthlyIncome',
color='JobLevel', size='MonthlyIncome',
color_discrete_sequence=col,
category_orders={'JobLevel': ['Entry level', 'Mid level', 'Senior', 'Lead', 'Executive']})
fig.update_layout(legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
title='Monthly income increases with total number of years worked and job level <br>',
xaxis_title='Total Working Years', yaxis=dict(title='Income',tickprefix='$'),
legend_title='', font_color='#28221D', font_family='Comic Sans MS', width=900, height=500,
margin=dict(l=40, r=30, b=80, t=120),plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2',)
fig.show()
The scatter plot shows that monthly income is positively correlated with total working years, and an employee's income is also closely tied to their job level.
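The positive relationship can be quantified with a Pearson coefficient; a minimal sketch with illustrative numbers standing in for the real columns:

```python
import pandas as pd

# Illustrative values standing in for the real dataset's columns
data = pd.DataFrame({
    "TotalWorkingYears": [1, 3, 5, 10, 20, 30],
    "MonthlyIncome":     [2000, 2500, 4000, 6000, 12000, 19000],
})

# Series.corr computes the Pearson coefficient by default
r = data["TotalWorkingYears"].corr(data["MonthlyIncome"])
print(round(r, 2))
```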
2.9 Attrition Correlation Matrix
Use the create_annotated_heatmap function to plot the matrix and explore correlations in the data
cat_cols = []
for i in data.columns:
    if data[i].nunique() <= 5 or data[i].dtype == object:
        cat_cols.append(i)
df=data.copy()
df.drop(df[cat_cols], axis=1, inplace=True)
corr=df.corr().round(2)
x=corr.index.tolist()
y=corr.columns.tolist()
z=corr.to_numpy()
fig = ff.create_annotated_heatmap(z=z, x=x, y=y, annotation_text=z, name='',
hovertemplate="Correlation between %{x} and %{y}= %{z}",
colorscale='GnBu')
fig.update_yaxes(autorange="reversed")
fig.update_layout(title="Correlation Matrix of Employee Attrition",
font_color='#28221D',margin=dict(t=180), height=600)
fig.show()
Monthly income has a strong positive correlation of 0.77 with total working years, which matches what we saw in the scatter plot above. Years at the company is also strongly correlated with years with the current manager (0.77) and years in the current role (0.76). No pair of variables exceeds a correlation of 0.8, which suggests multicollinearity is not a serious concern here.
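The strongly correlated pairs called out here can be extracted programmatically from the correlation matrix; a minimal sketch on a tiny synthetic frame standing in for the numeric df above:

```python
import numpy as np
import pandas as pd

# Tiny synthetic numeric frame standing in for the filtered df
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # perfectly correlated with a
    "c": [5, 3, 8, 1, 9],    # weakly related to both
})

corr = df.corr().abs()
# Keep the upper triangle only, so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong = pairs[pairs > 0.7].sort_values(ascending=False)
print(strong)
```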
3. Statistical Analysis - Feature Importance
3.1 Numerical Features
Run a one-way ANOVA test to analyze the importance of each numerical feature for attrition.
num_cols = data.select_dtypes(np.number).columns  # select numerical features
new_data = data.copy()
new_data["Attrition"] = new_data['Attrition'].map({'Yes': 1, 'No': 0})  # map labels to 0/1
f_scores = {}
p_values = {}
for i in num_cols:
    # One-way ANOVA: compare the feature's values between leavers and stayers
    f_score, p_value = stats.f_oneway(new_data.loc[new_data["Attrition"] == 1, i],
                                      new_data.loc[new_data["Attrition"] == 0, i])
    f_scores[i] = f_score
    p_values[i] = p_value
Visualize the ANOVA F-score of each numerical feature.
fig = go.Figure(data=[go.Bar(
x=list(f_scores.keys()),
y=list(f_scores.values()),
marker_color=["#C2C4E2","#CDB5CD","khaki","#D1E6DC","#BDE2E2",]*5
)])
fig.update_traces(texttemplate='%{value:.0f}', textposition='outside')
fig.update_layout(title={'text': 'Anova-Test F_scores Comparison', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': '', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'F_scores', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'),
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2', width=900, height=500)
fig.show()
Compare the ANOVA F-scores with the corresponding p-values.
anova_data = pd.DataFrame({"Features": list(f_scores.keys()),
                           "F_Score": list(f_scores.values())})
anova_data["P_value"] = [format(p, '.20f') for p in list(p_values.values())]
anova_data
3.2 Categorical Features
Use the chi-square test to analyze the importance of each categorical feature for attrition
cat_cols = data.select_dtypes(include="object").columns.tolist()  # select categorical features
cat_cols.remove("Attrition")
chi2_statistic = {}
p_values = {}
# Perform chi-square test for each column
for col in cat_cols:
    contingency_table = pd.crosstab(data[col], data['Attrition'])
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    chi2_statistic[col] = chi2_stat
    p_values[col] = p_value
Visualize the chi-square statistic of each categorical feature.
fig = go.Figure(data=[go.Bar(
x=list(chi2_statistic.keys()),
y=list(chi2_statistic.values()),
marker_color=["#C2C4E2","#CDB5CD","khaki","#D1E6DC","#BDE2E2",]*5
)])
fig.update_traces(texttemplate='%{value:.1f}', textposition='outside')
fig.update_layout(title={'text': 'Chi-Square Statistic Value of each Categorical Columns', 'font_size': 24, 'font_family': 'Comic Sans MS', 'font_color': '#454545'},
xaxis_title={'text': '', 'font_size': 18, 'font_family': 'Courier New', 'font_color': '#454545'},
yaxis_title={'text': 'Chi-Square Statistic', 'font_size': 18, 'font_family': 'Lucida Console', 'font_color': '#454545'},
xaxis_tickfont=dict(color='#663300'), yaxis_tickfont=dict(color='#663300'),
plot_bgcolor='#F2F2F2', paper_bgcolor='#F2F2F2', width=900, height=500)
fig.show()
Compare the chi-square statistics with the corresponding p-values.
chi_data = pd.DataFrame({"Features":list(chi2_statistic.keys()),"Chi_2 Statistic":list(chi2_statistic.values())})
chi_data["P_value"] = [format(p, '.20f') for p in list(p_values.values())]
chi_data
4. Data Preprocessing
The dataset contains many features with categorical values. I'll convert these categorical variables into numerical ones.
data.columns
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
'DistanceFromHome', 'Education', 'EducationField',
'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement',
'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus',
'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'OverTime',
'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction',
'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear',
'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole',
'YearsSinceLastPromotion', 'YearsWithCurrManager'],
dtype='object')
cat_cols = data.select_dtypes(include=['object']).columns.tolist()
cat_cols
['Attrition',
'BusinessTravel',
'Department',
'EducationField',
'Gender',
'JobRole',
'MaritalStatus',
'OverTime']
4.1 Encoding Categorical Variables
# map Gender to 0/1
data["Gender"] = data["Gender"].map({"Female":0 ,"Male":1})
le = LabelEncoder()
data['Attrition'] = le.fit_transform(data['Attrition'])
ohe = OneHotEncoder()  # one-hot encode the discrete categorical features
encoded = ohe.fit_transform(data[['BusinessTravel',
'Department',
'EducationField',
'JobRole',
'MaritalStatus',
'OverTime']])
encoded_data = pd.DataFrame(encoded.toarray(),columns = ohe.get_feature_names_out())
data = pd.concat([data,encoded_data],axis=1)
data = data.drop(['BusinessTravel',
'Department',
'EducationField',
'JobRole',
'MaritalStatus',
'OverTime'],axis =1)
print(data.info())
Now let's use the .corr() function to look at the correlations in the data:
correlation = data.corr()
print(correlation["Attrition"].sort_values(ascending=False))
4.2 Oversampling with SMOTE
Standard ML techniques such as decision trees and logistic regression are biased toward the majority class and tend to ignore the minority class: they predict mostly the majority label, so the minority class is severely misclassified. In more technical terms, if the dataset has an imbalanced class distribution, the model is prone to negligible or very low recall on the minority class.
SMOTE (Synthetic Minority Oversampling Technique) is one of the most widely used oversampling methods for the class-imbalance problem. It balances the class distribution by generating new minority-class instances rather than merely duplicating existing ones: SMOTE synthesizes virtual training records by linear interpolation between an existing minority instance and one of its randomly chosen k nearest minority-class neighbors. After oversampling, the rebuilt data can be fed to any classification model.
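The interpolation step described above can be sketched in a few lines; this is a conceptual illustration with made-up feature vectors, not the imblearn implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two existing minority-class samples (made-up feature vectors)
x_i  = np.array([1.0, 2.0])
x_nn = np.array([3.0, 6.0])   # one of x_i's k nearest minority neighbors

# SMOTE draws a synthetic point on the line segment between them
lam = rng.random()                 # lambda in [0, 1)
x_new = x_i + lam * (x_nn - x_i)   # synthetic minority sample
print(x_new)
```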
# separating the features (X) and the target (y) from the dataset
X = data.drop(['Attrition'], axis=1)
y = data['Attrition']
# Oversampling
smote = SMOTE(random_state=42)
X_resample, y_resample = smote.fit_resample(X, y)
4.3 Feature Scaling with Standardization
Standardization of datasets is a common requirement for many machine learning estimators: they may behave badly if the individual features do not look more or less like standard normally distributed data (e.g. a Gaussian with zero mean and unit variance).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_resample)
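StandardScaler implements the column-wise z-score; a quick numpy-only check of what fit_transform computes (illustrative array, not the resampled data):

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# StandardScaler computes the column-wise z-score: (x - mean) / std
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```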
5. Attrition Prediction Model
5.1 Data Preparation (Train/Test Split)
Now let's split the data into training and test sets:
X_train, X_test,\
y_train, y_test = train_test_split(X_scaled, y_resample,
test_size=0.2, random_state=42)
5.2 Building the Neural Network
5.2.1 Defining the Network Architecture
# Define network structure
model = Sequential()
model.add(Input(shape=(X_train.shape[1],)))
model.add(Dropout(0.1))
model.add(Dense(64, activation ='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Dense(32, activation ='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Dense(1, activation ='sigmoid'))
model.compile(loss ='binary_crossentropy',
optimizer ='rmsprop', metrics =['accuracy'])
Next, let's view the model summary with the .summary() function
# Model summary
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Layer (type) ┃ Output Shape ┃ Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ dropout (Dropout) │ (None, 50) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense (Dense) │ (None, 64) │ 3,264 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization │ (None, 64) │ 256 │
│ (BatchNormalization) │ │ │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_1 (Dropout) │ (None, 64) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_1 (Dense) │ (None, 32) │ 2,080 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ batch_normalization_1 │ (None, 32) │ 128 │
│ (BatchNormalization) │ │ │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dropout_2 (Dropout) │ (None, 32) │ 0 │
├──────────────────────────────────────┼─────────────────────────────┼─────────────────┤
│ dense_2 (Dense) │ (None, 1) │ 33 │
└──────────────────────────────────────┴─────────────────────────────┴─────────────────┘
Total params: 5,761 (22.50 KB)
Trainable params: 5,569 (21.75 KB)
Non-trainable params: 192 (768.00 B)
5.2.2 Training and Visualization (accuracy curves)
# Training Model
history = model.fit(X_train, y_train, epochs =100, verbose = 1, batch_size = 20, validation_data=(X_test, y_test))
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
history_df = pd.DataFrame(history.history)
plt.plot(history_df.loc[:, ['accuracy']], "#BDE2E2", label='Training accuracy')
plt.plot(history_df.loc[:, ['val_accuracy']], "#C2C4E2", label='Validation accuracy')
plt.title('Training and Validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
5.2.3 Training and Visualization (loss curves)
history_df = pd.DataFrame(history.history)
plt.plot(history_df.loc[:, ['loss']], "#BDE2E2", label='Training loss')
plt.plot(history_df.loc[:, ['val_loss']],"#C2C4E2", label='Validation loss')
plt.title('Training and Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(loc="best")
plt.show()
6. Model Evaluation
The evaluate() function returns a list of two values: the first is the model's loss on the dataset, the second its accuracy. If you only care about accuracy, ignore the loss value.
# evaluate the keras model
_, accuracy = model.evaluate(X_test, y_test)
print('Accuracy: %.2f' % (accuracy*100))
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.9636 - loss: 0.1376
Accuracy: 96.76
Making predictions is as simple as calling predict() on the model. The output layer uses a sigmoid activation, so the predictions are probabilities between 0 and 1.
# make probability predictions with the model
predictions = model.predict(X_test)
# round predictions
y_pred = [round(x[0]) for x in predictions]
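Beyond accuracy, precision and recall are worth checking on a churn problem; a hand-rolled sketch with illustrative labels standing in for y_test and y_pred (sklearn.metrics provides the same via precision_score, recall_score, and f1_score):

```python
# Precision, recall, and F1 computed from label lists
# (illustrative labels, not the model's actual predictions)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_hat  = [1, 0, 1, 0, 0, 1, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)

precision = tp / (tp + fp)   # of predicted leavers, how many actually left
recall    = tp / (tp + fn)   # of actual leavers, how many we caught
f1        = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```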
6.1 Computing the Confusion Matrix and ROC Curve Data
# confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)
cf_matrix
# compute the ROC curve data for the AUC-ROC plot
fpr, tpr, thresholds = roc_curve(y_test, predictions)
roc_auc = auc(fpr, tpr)
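The thresholds array returned by roc_curve can also be used to pick an operating point, for example by maximizing Youden's J statistic (TPR - FPR); a sketch with made-up ROC points rather than the model's actual curve:

```python
import numpy as np

# Illustrative ROC points standing in for fpr, tpr, thresholds
fpr        = np.array([0.0, 0.1, 0.30, 1.0])
tpr        = np.array([0.0, 0.7, 0.85, 1.0])
thresholds = np.array([1.9, 0.8, 0.40, 0.0])

j = tpr - fpr                 # Youden's J statistic at each threshold
best = int(np.argmax(j))      # index of the best operating point
print(thresholds[best], j[best])
```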
6.2 Plotting the Confusion Matrix and AUC-ROC Curve
Next, let's plot the confusion matrix with the .Heatmap() trace and the ROC curve with the .Scatter() trace
# Mapping Confusion Matrix and AUC-ROC Curve
fig = make_subplots(
subplot_titles=("Confusion Matrix", "AUC-ROC Curve"),
rows=1, cols=2)
fig.add_trace(
go.Heatmap(
x=['Predicted Negative', 'Predicted Positive'],
y=['Actual Positive', 'Actual Negative'],
z=[cf_matrix.tolist()[1], cf_matrix.tolist()[0]],
text=[cf_matrix.tolist()[1], cf_matrix.tolist()[0]],
texttemplate='%{text}',
colorscale='cividis',
showscale=True), row=1, col=1
)
fig.add_trace(
go.Scatter(
x=fpr, y=tpr,
name='ROC curve (area = %0.2f)' % roc_auc,
line=dict(color='darkorange', width=2),
mode='lines'),row=1, col=2
)
fig.update_layout(
title=dict(text='Confusion Matrix & AUC-ROC Curve',
font=dict(size=24, family='Times New Roman')),
xaxis1=dict(title='Predicted', tickfont=dict(size=12, family='Verdana')),
yaxis1=dict(title='Actual', tickfont=dict(size=12, family='Verdana')),
xaxis2=dict(title='False Positive Rate'),
yaxis2=dict(title='True Positive Rate'),
width=900,height=500,plot_bgcolor='#F2F2F2'
)
fig.show()
From: https://blog.csdn.net/m0_63287589/article/details/139574728