
Predicting Diabetes in the Pima Indians with the k-Nearest Neighbors Algorithm in Python: A Case Study



Preface

I. Objectives and Requirements
Understand the principle of the k-nearest neighbors (k-NN) algorithm and learn to build applications with it; a minimal sketch of the underlying idea follows.
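
The core idea of k-NN: to classify a new sample, find the k training samples closest to it (for example, by Euclidean distance) and take a majority vote of their labels. Below is a small, self-contained sketch of that idea on toy data; the function name and the toy arrays are made up for illustration, while the actual experiment below uses scikit-learn's KNeighborsClassifier.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Euclidean distance from the new sample to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k nearest training samples
    nearest = np.argsort(distances)[:k]
    # majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy data: two features, two well-separated classes
X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.1, 1.0]), k=3))  # 0
print(knn_predict(X, y, np.array([5.0, 5.0]), k=3))  # 1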

II. Main Content
Example: diabetes prediction
Task: predict diabetes in the Pima Indians
Data sources:

  1. https://www.kaggle.com/uciml/pima-indians-diabetes-database
  2. pima-indians-diabetes, in the folder for Experiment 1

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, precision_score, \
    recall_score, f1_score, cohen_kappa_score
from collections import Counter
from sklearn.metrics import roc_curve, auc

# load the Pima Indians diabetes dataset
data = pd.read_csv('./diabetes.csv')

Data Exploration

data.head()

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1

Column descriptions

  • Pregnancies: number of times pregnant
  • Glucose: plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure: diastolic blood pressure (mm Hg)
  • SkinThickness: triceps skin fold thickness (mm)
  • Insulin: 2-hour serum insulin (mu U/ml)
  • BMI: body mass index (weight in kg / (height in m)^2)
  • DiabetesPedigreeFunction: diabetes pedigree function
  • Age: age in years
  • Outcome: class label (1 = diabetic, 0 = non-diabetic)

data.describe()

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age     Outcome
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.240885    0.348958
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.760232    0.476951
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.000000    0.000000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.000000    0.000000
75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.626250   41.000000    1.000000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.000000    1.000000

data.shape
(768, 9)

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

data['Outcome'].value_counts()
0    500
1    268
Name: Outcome, dtype: int64

Summary

  • 1. The data contains no missing (null) values.
  • 2. The class distribution (500 vs. 268, roughly 65/35) is not severely imbalanced, and the data needs no further cleaning, so we go straight to k-NN.
  • 3. This is a binary classification problem.

Model Construction

Feature scaling (min-max normalization)

new_data = data.drop(['Outcome'], axis=1)    # features only, drop the label column
scale = MinMaxScaler().fit(new_data)         # fit the scaler (learn each column's min and max)
biao_data = scale.transform(new_data)        # apply the scaling
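
MinMaxScaler rescales each column to [0, 1] via (x - min) / (max - min), so that the distance computations in k-NN are not dominated by large-valued columns such as Insulin. A quick sketch of the equivalent computation by hand, reusing the variables above (the sanity check is an illustration, not part of the original experiment):

import numpy as np

col_min = new_data.min(axis=0)
col_max = new_data.max(axis=0)
manual_scaled = (new_data - col_min) / (col_max - col_min)

# should match the scaler's output
print(np.allclose(manual_scaled.values, biao_data))  # expected: True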

Train/test split

X_train, X_test, y_train, y_test = train_test_split(biao_data, data['Outcome'], test_size=0.2, random_state=123)
# model training
k = 5
clf = KNeighborsClassifier(n_neighbors=k)
clf.fit(X_train, y_train)
KNeighborsClassifier()
y_pred = clf.predict(X_test)
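
Here k = 5 is fixed by hand; a common refinement is to choose k by cross-validation on the training set. One way to sketch that with scikit-learn (the candidate range 1 to 20 is arbitrary and not part of the original experiment):

from sklearn.model_selection import cross_val_score

best_k, best_score = None, 0.0
for candidate_k in range(1, 21):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=candidate_k),
                             X_train, y_train, cv=5, scoring='accuracy')
    if scores.mean() > best_score:
        best_k, best_score = candidate_k, scores.mean()
print(best_k, best_score)  # then refit with KNeighborsClassifier(n_neighbors=best_k) if desired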

Model Evaluation

fpr, tpr, threshold = roc_curve(y_test, y_pred)
print('AUC:', auc(fpr, tpr))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 score:', f1_score(y_test, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_test, y_pred))
print('Counter:', Counter(y_pred))
AUC: 0.7634698275862069
Accuracy: 0.7987012987012987
Precision: 0.8
Recall: 0.6206896551724138
F1 score: 0.6990291262135923
Cohen's kappa: 0.5514001127607593
Counter: Counter({0: 109, 1: 45})
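
Note that roc_curve is fed the hard 0/1 predictions above, so the resulting curve has only a single interior point. Using predicted class probabilities gives a more informative curve and AUC; a sketch of that, plus a matplotlib plot (the plotting code is an illustration and was not part of the original write-up):

import matplotlib.pyplot as plt

# probability of the positive class (diabetes) for each test sample
y_score = clf.predict_proba(X_test)[:, 1]
fpr_p, tpr_p, _ = roc_curve(y_test, y_score)
print('AUC from probabilities:', auc(fpr_p, tpr_p))

plt.plot(fpr_p, tpr_p, label='k-NN (k=5)')
plt.plot([0, 1], [0, 1], linestyle='--', label='chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()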

From: https://blog.51cto.com/u_15796263/5923940
