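I. Dataset preparation
II. Model preparation
III. Cross-validation (k-fold cross-validation (10))
IV. Background: confusion matrix (accuracy, recall)
V. Background: thresholds and the ROC curve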
-
1. Dataset processing (loading, splitting, shuffling)
-
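The fetch_openml() function downloads datasets from the openml.org public repository, for example mnist_784. MNIST is image data: each sample is a (28, 28, 1) grayscale image.
Load it, then take the data column and the target column:
    from sklearn.datasets import fetch_openml
    mnist = fetch_openml('mnist_784')
    mnist  # display the loaded object
    X, y = mnist["data"], mnist["target"]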
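* The dataset has 70,000 samples. I use the first 60,000 as the training set and the remaining 10,000 as the test set, then shuffle the training set to break its original order. (.iloc selects rows by position, while .loc selects rows by the label values in the row index "Index".)
    X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
    import numpy as np
    # One random permutation of the positions, applied to both X and y
    shuffle_index = np.random.permutation(60000)
    X_train, y_train = X_train.iloc[shuffle_index], y_train.iloc[shuffle_index]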
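The key point of the shuffle is that features and labels are reordered with the same permutation, so every sample keeps its own label. A minimal standard-library sketch on made-up toy data (the names `toy_features`, `toy_labels`, and `perm` are illustrative and do not come from the original code):

```python
import random

# Toy stand-ins for X_train / y_train (hypothetical data, not MNIST).
toy_features = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]
toy_labels = ["a", "b", "c", "d", "e"]

# One permutation of the positions, analogous to np.random.permutation(n).
rng = random.Random(42)
perm = list(range(len(toy_labels)))
rng.shuffle(perm)

# Apply the SAME permutation to features and labels so pairs stay aligned.
shuffled_features = [toy_features[i] for i in perm]
shuffled_labels = [toy_labels[i] for i in perm]

# Every (feature row, label) pair that existed before still exists after.
pairs_before = set((tuple(f), l) for f, l in zip(toy_features, toy_labels))
pairs_after = set((tuple(f), l) for f, l in zip(shuffled_features, shuffled_labels))
print(pairs_before == pairs_after)
```

Shuffling the two sequences independently would break the pairing, which is why a single index array is generated once and reused.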
-
-
2. Model (binary classifier: is the digit a 6 or not)
- Prepare the labels (True, False)
    # The targets are digit strings, so compare against '6' in both cases
    y_train_6 = (y_train == '6')
    y_test_6 = (y_test == '6')
- Train an SGDClassifier model and predict sample 20704
    from sklearn.linear_model import SGDClassifier
    sgd_clf = SGDClassifier(max_iter=10, random_state=42)
    sgd_clf.fit(X_train, y_train_6)
    # Double brackets keep the row as a one-row DataFrame
    sgd_clf.predict(X.loc[[20704]])
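On the label preparation step: the comparison `y_train=='6'` uses a string, which suggests that the targets are stored as digit strings rather than integers. Under that assumption, comparing against the integer `6` never matches. A small sketch with made-up labels (`toy_targets` is a hypothetical name):

```python
# Toy labels in the assumed form of the MNIST target column: digit strings.
toy_targets = ["5", "6", "0", "6", "1"]

# Comparing against the string "6" marks the sixes.
is_six = [t == "6" for t in toy_targets]

# Comparing against the integer 6 never matches a string label.
is_six_wrong = [t == 6 for t in toy_targets]

print(is_six)
print(any(is_six_wrong))
```

A label vector that is entirely False gives no error, only a misleading evaluation, so it is worth checking that both the training and test labels contain some True values.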
-
3. K-fold cross-validation
-
The simplest option is cross_val_score() from sklearn, which directly returns the model's score on each of the 3 folds:
    from sklearn.model_selection import cross_val_score
    cross_val_score(sgd_clf, X_train, y_train_6, cv=3, scoring='accuracy')
-
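Meaning of the cross_val_score parameters used above: the first argument is the model to evaluate, followed by the features and the labels; cv is the number of folds and scoring is the name of the metric.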
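To make the idea of "3 folds" concrete, here is a minimal sketch of how k-fold splitting partitions sample positions so that every sample lands in the test fold exactly once. The helpers `k_fold_indices` and `k_fold_splits` are hypothetical names written for this illustration; they are not part of scikit-learn and they do no shuffling.

```python
def k_fold_indices(n_samples, k):
    """Split positions 0..n_samples-1 into k contiguous folds (no shuffling)."""
    # The first (n_samples % k) folds get one extra sample.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold_id in range(k):
        size = base + (1 if fold_id < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def k_fold_splits(n_samples, k):
    """Yield (train_positions, test_positions) for each of the k rounds."""
    folds = k_fold_indices(n_samples, k)
    for i, test_idx in enumerate(folds):
        # Everything outside the current fold is used for training.
        train_idx = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train_idx, test_idx


splits = list(k_fold_splits(10, 3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each round trains on two folds and tests on the third, which is why three scores come back for cv=3.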
-
Custom loop (KFold)
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.metrics import accuracy_score, recall_score, f1_score
    from sklearn.base import clone

    skfolds = KFold(n_splits=3, shuffle=True, random_state=42)
    accuracy_score_list, recall_score_list, f1_score_list = [], [], []
    for train_index, test_index in skfolds.split(X_train, y_train_6):
        clone_clf = clone(sgd_clf)
        # Prepare the data for this fold; split() yields positions, so use .iloc
        X_train_folds = X_train.iloc[train_index]
        y_train_folds = y_train_6.iloc[train_index]
        X_test_folds = X_train.iloc[test_index]
        y_test_folds = y_train_6.iloc[test_index]
        # Train the model
        clone_clf.fit(X_train_folds, y_train_folds)
        y_pred = clone_clf.predict(X_test_folds)
        n_correct = sum(y_pred == y_test_folds)
        # Evaluate the model
        AccuracyScore = accuracy_score(y_test_folds, y_pred)
        RecallScore = recall_score(y_test_folds, y_pred)
        F1Score = f1_score(y_test_folds, y_pred)
        # Store each metric in its list
        accuracy_score_list.append(AccuracyScore)
        recall_score_list.append(RecallScore)
        f1_score_list.append(F1Score)
        # Print accuracy, recall and F1 for this fold
        print('accuracy_score:', AccuracyScore, 'recall_score:', RecallScore, 'f1_score:', F1Score)
    # Print the mean of each metric with a rough 95% interval (mean +/- 2 std)
    print("Accuracy: %0.2f (+/- %0.2f)" % (np.average(accuracy_score_list), np.std(accuracy_score_list) * 2))
    print("Recall: %0.2f (+/- %0.2f)" % (np.average(recall_score_list), np.std(recall_score_list) * 2))
    print("F1_score: %0.2f (+/- %0.2f)" % (np.average(f1_score_list), np.std(f1_score_list) * 2))
    # Accuracy of the last fold, computed by hand
    print(n_correct / len(y_pred))
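The summary lines in the loop above report the mean of the fold scores plus or minus two standard deviations. A small standard-library sketch of that arithmetic, using made-up fold scores (`fold_scores` is hypothetical data, not a result from this experiment):

```python
import statistics

# Hypothetical per-fold accuracies, made up for illustration only.
fold_scores = [0.90, 0.92, 0.94]

mean_score = statistics.mean(fold_scores)
# pstdev is the population standard deviation, the same definition
# that np.std uses by default (ddof=0).
spread = 2 * statistics.pstdev(fold_scores)

print("Accuracy: %0.2f (+/- %0.2f)" % (mean_score, spread))
```

With only three folds this interval is a rough indicator of spread rather than a rigorous 95% confidence interval.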
-
Custom loop (StratifiedKFold)
    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import accuracy_score, recall_score, f1_score
    from sklearn.base import clone

    skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
    accuracy_score_list, recall_score_list, f1_score_list = [], [], []
    for train_index, test_index in skfolds.split(X_train, y_train_6):
        clone_clf = clone(sgd_clf)
        # Prepare the data for this fold; split() yields positions, so use .iloc
        X_train_folds = X_train.iloc[train_index]
        y_train_folds = y_train_6.iloc[train_index]
        X_test_folds = X_train.iloc[test_index]
        y_test_folds = y_train_6.iloc[test_index]
        # Train the model
        clone_clf.fit(X_train_folds, y_train_folds)
        y_pred = clone_clf.predict(X_test_folds)
        n_correct = sum(y_pred == y_test_folds)
        # Evaluate the model
        AccuracyScore = accuracy_score(y_test_folds, y_pred)
        RecallScore = recall_score(y_test_folds, y_pred)
        F1Score = f1_score(y_test_folds, y_pred)
        # Store each metric in its list
        accuracy_score_list.append(AccuracyScore)
        recall_score_list.append(RecallScore)
        f1_score_list.append(F1Score)
        # Print accuracy, recall and F1 for this fold
        print('accuracy_score:', AccuracyScore, 'recall_score:', RecallScore, 'f1_score:', F1Score)
    # Print the mean of each metric with a rough 95% interval (mean +/- 2 std)
    print("Accuracy: %0.2f (+/- %0.2f)" % (np.average(accuracy_score_list), np.std(accuracy_score_list) * 2))
    print("Recall: %0.2f (+/- %0.2f)" % (np.average(recall_score_list), np.std(recall_score_list) * 2))
    print("F1_score: %0.2f (+/- %0.2f)" % (np.average(f1_score_list), np.std(f1_score_list) * 2))
    # Accuracy of the last fold, computed by hand
    print(n_correct / len(y_pred))
-
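Note: the task is to predict whether y is 6, and only a small share of the samples are 6, so the classes are imbalanced. StratifiedKFold (a stratified cross-validation iterator) is used to deal with this imbalance.
StratifiedKFold is used like KFold, but it samples by stratum, so the training and test folds keep the same class proportions as the original dataset. The difference shows up on imbalanced data:
with StratifiedKFold(), the class proportions in each fold are roughly those of the full dataset. If the dataset has 4 classes in the ratio 2:3:3:2, each fold also has a ratio of about 2:3:3:2;
with KFold, a bad case is possible: the dataset has 5 classes and the folds happen to coincide with the classes (for example when the data is sorted by class and split without shuffling), so the first fold is all class 0, the second fold is all class 1, and so on.
The model then never sees, during training, the kind of data that appears in the test fold, so its score is very low, possibly even 0.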
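The difference between the two splitting strategies can be seen on a tiny made-up label list. The sketch below compares a plain contiguous split with a simplified stratified split that deals the positions of each class out round-robin. This is an illustration of the idea only, not scikit-learn's actual algorithm, and all names in it are hypothetical.

```python
from collections import Counter

# Toy labels sorted by class: 8 negatives then 4 positives (made-up data).
labels = [0] * 8 + [1] * 4
k = 2

# Plain contiguous split (like k-fold without shuffling).
half = len(labels) // k
plain_folds = [list(range(0, half)), list(range(half, len(labels)))]

# Simplified stratified split: deal the positions of each class round-robin.
strat_folds = [[] for _ in range(k)]
for cls in sorted(set(labels)):
    positions = [i for i, lab in enumerate(labels) if lab == cls]
    for n, pos in enumerate(positions):
        strat_folds[n % k].append(pos)


def class_counts(fold):
    """Count how many samples of each class fall in a fold."""
    return dict(Counter(labels[i] for i in fold))


plain_counts = [class_counts(f) for f in plain_folds]
strat_counts = [class_counts(f) for f in strat_folds]
print("plain:     ", plain_counts)
print("stratified:", strat_counts)
```

In the plain split the first fold contains no positives at all, while both stratified folds keep the original 2:1 ratio.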
-
-
4. Confusion matrix (accuracy, recall)
-
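TP, FP, FN, TN
TP: predicted positive, and the prediction is correct
FP: predicted positive, and the prediction is wrong
FN: predicted negative, and the prediction is wrong
TN: predicted negative, and the prediction is correct
- accuracy, precision, recall, and the F1 measure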
accuracy =(TP+TN)/(TP+FP+FN+TN)
precision = TP/(TP+FP)
recall = TP/(TP+FN)
F1=2/[(1/precision) + (1/recall)]
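The four formulas above translate directly into code. A minimal sketch with made-up counts (`binary_metrics` is a hypothetical helper, and it does not handle the zero-division case that arises when there are no predicted or actual positives):

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 / ((1 / precision) + (1 / recall))
    return accuracy, precision, recall, f1


# Made-up counts: 100 samples, 10 of them actually positive.
accuracy, precision, recall, f1 = binary_metrics(tp=8, fp=2, fn=2, tn=88)
print(accuracy, precision, recall, f1)
```

With these counts accuracy comes out at 0.96 while precision and recall are only 0.8, which shows how accuracy can look flattering on imbalanced data.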
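- confusion_matrix shows the matrix; note that its layout (rows are actual classes, columns are predicted classes) is
TN FP
FN TP
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix
    # y_train_pred is not defined in the original; assumed here to be
    # out-of-fold predictions obtained with cross_val_predict
    y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_6, cv=3)
    confusion_matrix(y_train_6, y_train_pred)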
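To see where each cell of that layout comes from, the counts can be tallied by hand. A minimal sketch on made-up boolean labels (`confusion_counts` is a hypothetical helper, not a scikit-learn function):

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for boolean labels."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual and predicted:
            tp += 1
        elif actual and not predicted:
            fn += 1
        elif not actual and predicted:
            fp += 1
        else:
            tn += 1
    # Row 0 holds the actual negatives, row 1 the actual positives.
    return [[tn, fp], [fn, tp]]


# Made-up labels and predictions for illustration.
y_true = [True, False, False, True, False, True]
y_pred = [True, False, True, False, False, True]
matrix = confusion_counts(y_true, y_pred)
print(matrix)
```

The negative class comes first because False sorts before True, so TN ends up in the top-left corner rather than TP.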
-
-
5. Thresholds and the ROC curve
-
About thresholds:
-
About the ROC curve:
TPR = TP / (TP + FN) (Recall)
FPR = FP / (FP + TN)
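A ROC curve is obtained by sweeping the decision threshold and computing (FPR, TPR) at each setting. The sketch below does this by hand for made-up scores, such as those a classifier's decision_function might return. `roc_points` is a hypothetical helper, and a sample is predicted positive when its score is greater than or equal to the threshold.

```python
def roc_points(y_true, scores):
    """Return (threshold, FPR, TPR) using each distinct score as a threshold."""
    positives = sum(1 for y in y_true if y)
    negatives = len(y_true) - positives
    points = []
    # Go from the strictest threshold to the most permissive one.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and not y)
        points.append((threshold, fp / negatives, tp / positives))
    return points


# Made-up labels and decision scores for illustration.
y_true = [False, False, True, True]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
for threshold, fpr, tpr in points:
    print("threshold=%.2f  FPR=%.2f  TPR=%.2f" % (threshold, fpr, tpr))
```

Lowering the threshold can only add predicted positives, so both TPR and FPR rise or stay the same as the threshold drops. That is the trade-off the curve visualizes.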
-