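Machine Learning: Oversampling (the Most Detailed Guide Online)

Oversampling in Practice

Here we use a credit card loan problem to demonstrate oversampling in practice, covering the code for oversampling, model building, confusion matrices, data standardization, and cross-validation.

1. Import the required packages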

import matplotlib.pylab as plt
import numpy as np
import pandas as pd
from pylab import mpl


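def cm_plot(y, yp):
    """Plot the confusion matrix for true labels y and predicted labels yp."""
    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as plt

    cm = confusion_matrix(y, yp)
    plt.matshow(cm, cmap=plt.cm.Blues)
    plt.colorbar()
    # Write the count into each cell: row = true label, column = predicted label
    for row in range(len(cm)):
        for col in range(len(cm)):
            plt.annotate(cm[row, col], xy=(col, row), horizontalalignment='center', verticalalignment='center')
    # Axis labels only need to be set once, outside the loops
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    return plt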

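Here we import the required packages and define a helper, cm_plot, that plots a confusion matrix. The packages do the following:

pandas: data processing and analysis.
numpy: high-performance multidimensional arrays and operations on them.
matplotlib.pylab: plotting; here we use plots to show the relationships in the data.

2. Data preprocessing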
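
As a rough illustration of what cm_plot draws, the sketch below builds a confusion matrix from a handful of made-up labels (the label values are invented for this example, not taken from the credit card data) and unpacks its four cells.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration: 1 = positive class, 0 = negative class
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)
# Rows are true labels, columns are predicted labels:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN={}, FP={}, FN={}, TP={}".format(tn, fp, fn, tp))
```

The row index is the true label and the column index is the predicted label, which is why cm_plot annotates cell cm[row, col] at xy=(col, row).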

data = pd.read_csv(r"./creditcard.csv")
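data.head()  # first 5 rows by default (displayed automatically in a notebook)
# Standardize the Amount column (z-score standardization)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Double brackets select a DataFrame rather than a Series; fit_transform expects 2D input
data['Amount'] = scaler.fit_transform(data[['Amount']])
data = data.drop(['Time'], axis=1)  # drop the unused Time column
# Split off a test set; the test set keeps the original (non-oversampled) data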
from sklearn.model_selection import train_test_split

X_whole = data.drop('Class', axis=1)
y_whole = data.Class
x_train_w, x_test_w, y_train_w, y_test_w = \
    train_test_split(X_whole, y_whole, test_size=0.3, random_state=0)

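Reading the data: the dataset is read with pandas' read_csv.
Standardization: using scikit-learn, we apply z-score standardization to the Amount column, rescaling it to zero mean and unit variance so that its much larger scale does not dominate the other features. This does not bound the values to a fixed range such as (-1, 1). We also drop the Time column, which is not used here, and assign the result back to data.
Splitting the dataset: using scikit-learn, all columns of data except Class go into X_whole, and the Class column goes into y_whole. train_test_split then splits X_whole and y_whole into a training set and a test set, with a random seed of 0 and 30% of the original data held out for testing.

Note:
axis=1 refers to columns and axis=0 refers to rows.
random_state is a random seed that controls the random process (such as how samples are selected and synthesized), so that every run of the code produces the same result.

3. Oversampling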
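
A small sketch of what z-score standardization does, using made-up amounts rather than the real Amount column. It checks StandardScaler against the formula (x - mean) / std and shows that an outlier can still fall outside (-1, 1) after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up transaction amounts with one large outlier
amounts = np.array([[1.0], [2.0], [3.0], [4.0], [90.0]])

scaled = StandardScaler().fit_transform(amounts)

# Same result by hand: (x - mean) / std, using the population std (ddof=0)
manual = (amounts - amounts.mean()) / amounts.std()

print(scaled.ravel())
print("mean:", scaled.mean(), "std:", scaled.std())
print("max:", scaled.max())  # the outlier ends up above 1
```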
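
The two points in the note can be checked on a tiny made-up DataFrame (the column values below are invented stand-ins for the credit card data): axis selects columns or rows in drop, and a fixed random_state makes the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up frame standing in for the credit card data
df = pd.DataFrame({"V1": range(10), "Amount": range(10, 20), "Class": [0] * 8 + [1] * 2})

X = df.drop("Class", axis=1)    # axis=1: drop a column
rows = df.drop([0, 1], axis=0)  # axis=0: drop rows by index label

# Two splits with the same random_state select the same rows
a_train, a_test, _, _ = train_test_split(X, df["Class"], test_size=0.3, random_state=0)
b_train, b_test, _, _ = train_test_split(X, df["Class"], test_size=0.3, random_state=0)

print(X.shape, rows.shape)
print(list(a_test.index) == list(b_test.index))
```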

from imblearn.over_sampling import SMOTE

oversampler = SMOTE(random_state=0)
os_x_train,os_y_train = oversampler.fit_resample(x_train_w,y_train_w)

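# Font settings; only needed when the plot text contains Chinese characters
mpl.rcParams['font.sans-serif'] = ['Microsoft YaHei']
mpl.rcParams['axes.unicode_minus'] = False

labels_count = os_y_train.value_counts()  # number of samples with label 0 and with label 1
plt.title("Positive and negative sample counts")
plt.xlabel("Class")
plt.ylabel("Count")
labels_count.plot(kind='bar')
plt.show()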

os_x_train_w, os_x_test_w, os_y_train_w, os_y_test_w = \
    train_test_split(os_x_train, os_y_train, test_size=0.3, random_state=0)

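Oversampling: we import the SMOTE class and create an instance of it, then call fit_resample on the training features x_train_w and training labels y_train_w. The oversampled data is returned as the features os_x_train and the labels os_y_train.
Plotting: we plot the oversampled labels as a bar chart to check how many samples carry label 0 and how many carry label 1.
Splitting again: the oversampled data is split once more, with 30% held out, into a new training set and a new test set, so that the model can first be tested internally on the oversampled data.

4. Cross-validation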
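
To give a feel for how SMOTE synthesizes samples, here is a simplified sketch of the interpolation idea in plain NumPy. It is not imblearn's implementation: the function name smote_like, its parameters, and the data points are all hypothetical and exist only for this illustration.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    # Indices of the k nearest neighbours of each sample (column 0 is the sample itself)
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(len(X_min))           # pick a minority sample
        nb = X_min[rng.choice(neighbours[i])]  # pick one of its nearest neighbours
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic[j] = X_min[i] + gap * (nb - X_min[i])
    return synthetic

# Made-up minority-class points in 2D
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_like(X_min, n_new=6)
print(new_points.shape)
print(new_points)
```

Each synthetic point lies on the line segment between a real minority sample and one of its neighbours, so the new samples stay inside the region the minority class already occupies.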

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # cross-validation helper

# Use cross-validation to choose a good regularization parameter C
scores = []
c_param_range = [0.01, 0.1, 1, 10, 100]  # candidate values of C
for i in c_param_range:  # on the first iteration, C=0.01
    lr = LogisticRegression(C=i, penalty='l2', solver='lbfgs', max_iter=1000)
    score = cross_val_score(lr, os_x_train_w, os_y_train_w, cv=8, scoring='recall')
    score_mean = sum(score) / len(score)  # mean recall across the folds
    scores.append(score_mean)  # mean recall for each candidate C
    print(score_mean)

best_c = c_param_range[np.argmax(scores)]  # the C with the highest mean recall
print("Best C: {}".format(best_c))

# Fit the final model with the best C
lr = LogisticRegression(C=best_c, penalty='l2', solver='lbfgs', max_iter=1000)
lr.fit(os_x_train_w, os_y_train_w)

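Setting the parameters: C is the inverse of the regularization coefficient λ; it must be a positive float and defaults to 1.0. As with SVMs, smaller values mean stronger regularization. penalty is the type of regularization; the options include l1 and l2, and we use l2 here. solver selects the optimization algorithm used to fit the parameters; the default was liblinear in older versions of scikit-learn and has been lbfgs since version 0.22, and we set lbfgs explicitly. max_iter is the maximum number of iterations, which we set to 1000.
Cross-validation: K-fold cross-validation is used to choose the best value of C and guard against overfitting. K is set to 8 here. For each candidate C, the mean recall over the 8 folds is appended to scores.
Finding the best regularization strength: np.argmax locates the C value with the highest mean recall.
Building the model: the model is then refit with that best value.

5. Plot the confusion matrices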
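
To make the shape of the cross-validation output concrete, the sketch below runs cross_val_score for a single C on synthetic data. The make_classification data is an assumption standing in for the oversampled training split, not the credit card dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary data standing in for the oversampled training split
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

lr = LogisticRegression(C=1, max_iter=1000)
fold_recalls = cross_val_score(lr, X, y, cv=8, scoring='recall')

print(len(fold_recalls))    # one recall value per fold
print(fold_recalls.mean())  # the single number kept for this value of C
```

With cv=8 the function returns eight recall values, one per held-out fold; averaging them gives the single score that is compared across the candidate values of C.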

from sklearn import metrics

os_train_predicted = lr.predict(os_x_train_w)
print(metrics.classification_report(os_y_train_w, os_train_predicted))
cm_plot(os_y_train_w, os_train_predicted).show()

os_test_predicted = lr.predict(os_x_test_w)  # held-out split of the oversampled data
print(metrics.classification_report(os_y_test_w, os_test_predicted))
cm_plot(os_y_test_w, os_test_predicted).show()

train_predicted = lr.predict(x_train_w)
print(metrics.classification_report(y_train_w, train_predicted))
cm_plot(y_train_w, train_predicted).show()

test_predicted = lr.predict(x_test_w)
print(metrics.classification_report(y_test_w, test_predicted))
cm_plot(y_test_w, test_predicted).show()

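Plotting the confusion matrices: for each training set and test set (both oversampled and original), we print the classification report and plot the confusion matrix so the values can be compared.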
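
The recall and precision columns in classification_report come straight from the confusion matrix cells. A minimal sketch with made-up labels (invented for this example) that computes both by hand and compares them with scikit-learn's functions:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 6 negatives followed by 4 positives
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)     # share of actual positives that were found
precision = tp / (tp + fp)  # share of predicted positives that were correct

print("recall:", recall, "sklearn:", recall_score(y_true, y_pred))
print("precision:", precision, "sklearn:", precision_score(y_true, y_pred))
```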


6. Model evaluation and testing

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
recalls = []
# Probability of the positive class (label 1); computed once, outside the loop
y_predict_proba = lr.predict_proba(x_test_w)[:, 1]
for i in thresholds:
    # Predict 1 when the probability is greater than the threshold i, otherwise 0
    y_predicted = (y_predict_proba > i).astype(int)

    recall = metrics.recall_score(y_test_w, y_predicted)
    recalls.append(recall)
    print(metrics.classification_report(y_test_w, y_predicted))
    print("{} Recall metric in the testing dataset: {:.3f}".format(i, recall))

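Model prediction: we vary the probability threshold i used to assign label 1, compute the recall on the original test set for each threshold, and print the classification report and the recall.

Summary

Oversampling is an effective way to handle imbalanced datasets when training logistic regression. By increasing the number of minority-class samples, it balances the dataset and improves the model's ability to recognize the minority class. When choosing an oversampling method, however, you need to weigh its potential drawbacks and pick the one that best fits the situation at hand.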
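
The effect of the threshold can be seen on a handful of made-up probabilities (the numbers below are invented, not model output): lowering the threshold raises recall, usually at the cost of precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
proba = np.array([0.05, 0.15, 0.35, 0.45, 0.65, 0.25, 0.55, 0.85])

results = []
for t in [0.2, 0.5, 0.8]:
    y_pred = (proba > t).astype(int)  # label 1 when the probability exceeds t
    r = recall_score(y_true, y_pred)
    p = precision_score(y_true, y_pred)
    results.append((t, r, p))
    print("threshold={}: recall={:.3f}, precision={:.3f}".format(t, r, p))
```

Which threshold to use depends on how costly a missed positive is compared with a false alarm, which is why the loop reports the full classification report and not only the recall.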

From： https://blog.csdn.net/2301_77698138/article/details/141401094


Background

1. Basic concepts of oversampling

2. Common oversampling methods

3. Oversampling applied to logistic regression