Preface
1. Objectives and Requirements
Understand the principle of the k-nearest neighbors (k-NN) algorithm and learn how to apply it in an application development task.
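The core idea of k-NN: to classify a query point, find the k training samples closest to it (for example by Euclidean distance) and take a majority vote over their labels. Below is a minimal illustrative sketch of that idea, assuming X_train and y_train are NumPy arrays; it is not the scikit-learn implementation used later, and the function name knn_predict and its arguments are made up for illustration.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # Euclidean distance from the query point to every training sample
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # majority vote over the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]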
2. Main Content
Example: diabetes prediction
Task: predict diabetes among the Pima Indians
Data source:
- https://www.kaggle.com/uciml/pima-indians-diabetes-database
- in the Experiment 1 folder: pima-indians-diabetes
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score,precision_score, \
recall_score,f1_score,cohen_kappa_score
from collections import Counter
from sklearn.metrics import roc_curve,auc
data = pd.read_csv('./diabetes.csv')
Data Exploration
data.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
Field Descriptions
- Pregnancies: number of times pregnant
- Glucose: plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: diastolic blood pressure (mm Hg)
- SkinThickness: triceps skin fold thickness (mm)
- Insulin: 2-hour serum insulin (mu U/ml)
- BMI: body mass index (weight in kg / (height in m)^2)
- DiabetesPedigreeFunction: diabetes pedigree function
- Age: age in years
- Outcome: class label (1 = diabetic, 0 = non-diabetic)
data.describe()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
data.shape
(768, 9)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
data['Outcome'].value_counts()
0 500
1 268
Name: Outcome, dtype: int64
Summary
- 1. The data has no missing values.
- 2. The class distribution is not severely imbalanced (see the quick check below) and the data needs little cleaning; the focus here is on the k-NN algorithm itself.
- 3. This is a binary classification problem.
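A quick check of the class proportions, using value_counts with normalize=True on the same DataFrame:

# roughly 65% non-diabetic (0) vs. 35% diabetic (1)
print(data['Outcome'].value_counts(normalize=True))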
Model Construction
Data Normalization
new_data = data.drop(['Outcome'], axis=1)  # keep only the feature columns
scale = MinMaxScaler().fit(new_data)  ## learn the scaling rule (per-feature min and max)
biao_data = scale.transform(new_data)  ## apply the rule
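MinMaxScaler rescales each feature to the [0, 1] range via x' = (x - x_min) / (x_max - x_min). A quick sanity check on the scaled array (a small sketch; biao_data is the NumPy array produced above):

# after scaling, every column's minimum should be 0 and maximum should be 1
print(biao_data.min(axis=0), biao_data.max(axis=0))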
Split into Training and Test Sets
X_train,X_test,y_train,y_test = train_test_split(biao_data,data['Outcome'],test_size=0.2,random_state=123)
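Note that the scaler above was fitted on the full dataset before splitting. A common alternative, sketched below with the same variable names, is to split first and fit MinMaxScaler on the training split only, so that test-set statistics do not leak into the scaling; results may differ slightly from those reported here.

# alternative: split the raw features first, then fit the scaler on the training split only
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    new_data, data['Outcome'], test_size=0.2, random_state=123)
scale = MinMaxScaler().fit(X_train_raw)   # learn min/max from the training data only
X_train = scale.transform(X_train_raw)    # apply the rule to the training data
X_test = scale.transform(X_test_raw)      # apply the same rule to the test data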
# Model training
k = 5
clf = KNeighborsClassifier(n_neighbors=k)
clf.fit(X_train, y_train)
KNeighborsClassifier()
y_pred = clf.predict(X_test)
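The value k = 5 above is fixed by hand. A common way to choose k is cross-validation over a small grid, for example with GridSearchCV; the sketch below assumes the same X_train and y_train as above and is only one reasonable configuration.

from sklearn.model_selection import GridSearchCV

# search odd values of k to avoid ties in the binary majority vote
param_grid = {'n_neighbors': list(range(1, 22, 2))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)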
Model Evaluation
fpr,tpr,threshold = roc_curve(y_test, y_pred)
print('AUC:', auc(fpr, tpr))
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
print('F1 score:', f1_score(y_test, y_pred))
print("Cohen's Kappa:", cohen_kappa_score(y_test, y_pred))
print('Counter:', Counter(y_pred))
AUC: 0.7634698275862069
Accuracy: 0.7987012987012987
Precision: 0.8
Recall: 0.6206896551724138
F1 score: 0.6990291262135923
Cohen's Kappa: 0.5514001127607593
Counter: Counter({0: 109, 1: 45})
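Note that roc_curve above is given the hard 0/1 predictions, so the AUC is effectively computed from a single operating point. roc_curve normally expects a continuous score; a sketch using the positive-class probability from the same fitted classifier:

# use the predicted probability of class 1 as the score for the ROC curve
y_score = clf.predict_proba(X_test)[:, 1]
fpr, tpr, threshold = roc_curve(y_test, y_score)
print('AUC (probability-based):', auc(fpr, tpr))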