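Big Data Analysis: Pumpkin Seed Variety Classification
1. Topic background
Pumpkin seeds are the seeds of the pumpkin Cucurbita moschata Duch., a plant of the genus Cucurbita in the gourd family (Cucurbitaceae). The seed is slightly pointed at one end, yellowish white on the outside, faintly ridged along the edge, and covered with fine hairs. With the seed coat removed, a thin green endosperm is visible. The seeds are traditionally credited with refreshing the mind, lowering blood pressure, relieving pain, and expelling parasites. Because they contain adequate amounts of protein, fat, carbohydrates, and minerals, pumpkin seeds are commonly eaten as a confection around the world. They are also rich in unsaturated fatty acids, carotene, peroxides, and enzymes, and eating them in moderation is said to support blood flow to the brain. Pumpkin seed extract is reported to shrink schistosome bodies, cause their reproductive organs to degenerate, and reduce egg production, and to inhibit and kill the larvae. So which pumpkin seed variety is the best, and how can that be analyzed and computed?
2. Data analysis plan
After the data is downloaded from the website below, pandas, numpy, and other libraries are imported in a Python environment for data wrangling, modeling, and evaluation. The data is cleaned and checked, then visualized, and the pumpkin seed varieties are analyzed, predicted, and classified in order to identify the most successful method of classifying pumpkin seed varieties.
Dataset: https://www.muratkoklu.com/datasets/
3. Data analysis
(1) Import the libraries and clean the data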
import pandas as pd  # data handling
import numpy as np  # linear algebra
import matplotlib.pyplot as plt  # visualization
import seaborn as sns

# Read the dataset
df = pd.read_excel('C:/Users/lenovo/Desktop/Python/Pumpkin_Seeds_Dataset/Pumpkin_Seeds_Dataset.xlsx')
# Show the first five rows
df.head()

# Shape and size
print(f"shape : {df.shape}")
print(f"size : {df.size}")

# Basic summary of the dataset
# There are no missing values
df.info()

# Descriptive statistics
df.describe()
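The `df.info()` summary is what backs the "no missing values" remark. A minimal sketch of how the same check could be made explicit is shown here. The tiny frame and its column names are made up for illustration and are not the real dataset, so one missing value is planted on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; names and values are illustrative only.
toy = pd.DataFrame({
    "Area": [56276.0, 76631.0, np.nan, 66458.0],
    "Perimeter": [888.2, 1068.1, 1082.9, 992.0],
    "Class": ["A", "B", "A", "B"],
})

# Number of missing cells in each column
missing_per_column = toy.isna().sum()
# True if at least one cell anywhere in the frame is missing
has_missing = bool(toy.isna().any().any())
# Number of rows that exactly repeat an earlier row
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print(f"any missing: {has_missing}, duplicate rows: {duplicate_rows}")
```

On the real `df`, the same three expressions would confirm the claim directly instead of reading it off the `info()` table.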
(2) Select the two most important pumpkin seed varieties for study
pd.unique(df['Class'])
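Listing the unique labels shows which classes exist but not how many seeds fall in each. A small sketch of a class-balance check follows, using placeholder labels "A" and "B" rather than the dataset's real class names.

```python
import pandas as pd

# Placeholder labels standing in for the 'Class' column (hypothetical values).
labels = pd.Series(["A", "B", "A", "A", "B"], name="Class")

# Absolute number of samples per class
counts = labels.value_counts()
# Share of each class, useful for spotting imbalance before modeling
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```

With the real data this would be `df['Class'].value_counts()`. A strongly uneven split would make plain accuracy a less informative metric.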
(3) Data visualization: morphological measurements of the 2,500 pumpkin seeds from these two varieties
sns.heatmap(df.drop(columns='Class').corr())
sns.pairplot(df,hue='Class')
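The heatmap and pair plot show correlations visually. As a rough numeric complement, the most strongly correlated feature pairs can be ranked directly. The sketch below uses synthetic columns `f1`, `f2`, `f3` (hypothetical names), with `f2` deliberately built as a noisy copy of `f1`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "f1": base,
    "f2": base * 2.0 + rng.normal(scale=0.1, size=200),  # nearly a rescaled copy of f1
    "f3": rng.normal(size=200),                           # unrelated noise
})

# Absolute Pearson correlation between every pair of columns
corr = toy.corr().abs()
cols = list(corr.columns)

# Visit each unordered pair once and sort by correlation, strongest first
pairs = sorted(
    ((float(corr.loc[a, b]), a, b) for i, a in enumerate(cols) for b in cols[i + 1:]),
    reverse=True,
)

for value, a, b in pairs:
    print(f"{a} vs {b}: {value:.3f}")
```

Pairs near 1.0 carry almost the same information, which matters when polynomial features are generated later because redundant inputs multiply quickly.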
(4) Import sklearn, then select and prepare the features x and the target y
# Import libraries
# For data preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder, PolynomialFeatures
# For splitting the dataset
from sklearn.model_selection import train_test_split

# Select the features and the target
x = np.array(df.drop(columns='Class'))
y = np.array(df['Class'])

# Scale the features (only the first six columns are standardized here)
scaler = StandardScaler()
scaler.fit(x[:,:6])
x[:,:6] = scaler.transform(x[:,:6])

# Encode the target y as integers
encoder = LabelEncoder()
encoder.fit(y)
y = encoder.transform(y)
y
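After encoding, the integer codes are only meaningful if the mapping back to the class names is known. A short sketch of how `LabelEncoder` exposes that mapping is given below. The labels "A" and "B" are placeholders, not the dataset's actual class names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Placeholder class labels (hypothetical values)
raw = np.array(["B", "A", "B", "A", "A"])

encoder = LabelEncoder()
codes = encoder.fit_transform(raw)

# classes_ is sorted, and the position of each label is its integer code
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
# inverse_transform turns codes back into the original labels
decoded = encoder.inverse_transform(codes)

print(mapping)
print(codes, decoded)
```

Keeping `mapping` at hand makes it clear which row and column of a later confusion matrix belongs to which variety.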
# Split the dataset: 80% for training, 20% for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)
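In the steps above the scaler is fitted on all rows before the split, so statistics from the test rows leak into preprocessing, and the split is not seeded, so results change from run to run. One possible alternative, sketched on synthetic data from `make_classification` (a stand-in, not the pumpkin seed data), is to split first and let a pipeline fit the scaler on the training part only. The `random_state=0` and `stratify=y` settings are assumptions added for reproducibility and class balance.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 500 samples, 12 numeric features, 2 classes
X, y = make_classification(n_samples=500, n_features=12, n_informative=6, random_state=0)

# Split first; stratify keeps the class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)

# The pipeline fits the scaler on X_train only, then reuses it for X_test
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

print(f"test accuracy: {test_accuracy:.3f}")
```

This pipeline scales every column, which differs from the article's choice of scaling only the first six.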
(5) Classify the pumpkin seed varieties with logistic regression
# Import libraries
# Linear model: logistic regression
from sklearn.linear_model import LogisticRegression
# Metrics for evaluating the model
from sklearn.metrics import r2_score, mean_squared_error, confusion_matrix

# Build the model and fit it
lr = LogisticRegression(max_iter=10000)
lr.fit(x_train, y_train)
lr.score(x_test, y_test)

train_mses = []
cv_mses = []
models = []
scalers = []
accuracy = []

# Try polynomial feature expansions of degree 1 to 5
for i in range(1, 6):
    poly = PolynomialFeatures(degree=i, include_bias=False)
    X_train_mapped = poly.fit_transform(x_train)
    scaler_poly = StandardScaler()
    X_train_mapped_scaled = scaler_poly.fit_transform(X_train_mapped)
    scalers.append(scaler_poly)
    model = LogisticRegression(max_iter=10000)
    model.fit(X_train_mapped_scaled, y_train)
    models.append(model)
    yhat = model.predict(X_train_mapped_scaled)
    train_mse = mean_squared_error(y_train, yhat) / 2
    train_mses.append(train_mse)
    # Apply the same expansion and the training-set scaler to the held-out data
    X_cv_mapped = poly.transform(x_test)
    X_cv_mapped_scaled = scaler_poly.transform(X_cv_mapped)
    # Compute the MSE on the held-out data
    yhat = model.predict(X_cv_mapped_scaled)
    cv_mse = mean_squared_error(y_test, yhat) / 2
    cv_mses.append(cv_mse)
    accuracy.append(model.score(X_cv_mapped_scaled, y_test) * 100)

# Best accuracy and the polynomial degree that produced it
acc = np.argmax(accuracy)
accuracy[acc]
print(f" the accuracy is : {accuracy[acc]:.4} and the polynomial degree is : {acc+1}")
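The loop above picks the degree by its score on `x_test`, so the reported accuracy is measured on the same data that was used to choose the model. A hedged alternative is to choose the degree by k-fold cross-validation on the training data only and keep the test set for a single final check. The sketch below swaps in that technique and uses synthetic data from `make_classification`. The degree range 1 to 3 and `cv=5` are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the training portion of the data
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

degrees = [1, 2, 3]
mean_scores = []
for d in degrees:
    # Expansion, scaling and the classifier are refitted inside every fold
    pipe = make_pipeline(
        PolynomialFeatures(degree=d, include_bias=False),
        StandardScaler(),
        LogisticRegression(max_iter=10000),
    )
    scores = cross_val_score(pipe, X, y, cv=5)  # accuracy on each of 5 folds
    mean_scores.append(float(scores.mean()))

best_degree = degrees[int(np.argmax(mean_scores))]
print(f"mean CV accuracy per degree: {mean_scores}, best degree: {best_degree}")
```

The chosen pipeline would then be refitted on the full training set and scored once on the untouched test set.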
# Refit once with the best polynomial degree found above
train_mses1 = []
cv_mses1 = []
models1 = []
scalers1 = []
accuracy1 = []

poly = PolynomialFeatures(degree=acc+1, include_bias=False)
X_train_mapped = poly.fit_transform(x_train)
scaler_poly = StandardScaler()
X_train_mapped_scaled = scaler_poly.fit_transform(X_train_mapped)
scalers1.append(scaler_poly)
model = LogisticRegression(max_iter=10000)
model.fit(X_train_mapped_scaled, y_train)
models1.append(model)
yhat = model.predict(X_train_mapped_scaled)
train_mse = mean_squared_error(y_train, yhat) / 2
train_mses1.append(train_mse)
# Apply the same expansion and the training-set scaler to the held-out data
X_cv_mapped = poly.transform(x_test)
X_cv_mapped_scaled = scaler_poly.transform(X_cv_mapped)
# Compute the MSE on the held-out data
yhat = model.predict(X_cv_mapped_scaled)
cv_mse = mean_squared_error(y_test, yhat) / 2
cv_mses1.append(cv_mse)
accuracy1.append(model.score(X_cv_mapped_scaled, y_test) * 100)
y_pred_test = model.predict(X_cv_mapped_scaled)
accuracy1

# Confusion matrix
conf = confusion_matrix(y_test, y_pred_test)
models1[0].score(X_cv_mapped_scaled, y_test)

sns.heatmap(conf, annot=True, cmap='Greens')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

# Counts read off the confusion matrix heatmap of this run:
# vrai = correct predictions, faux = wrong predictions
tot = 500
vrai = 248+198
faux = 34+20
per = (vrai / tot)*100
per
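The percentage above is typed in by hand from one run's confusion matrix. A small sketch of deriving it from the matrix itself follows. The diagonal counts 248 and 198 and the error counts 34 and 20 are the ones quoted in the article, but which off-diagonal cell holds 34 and which holds 20 is an assumption. The accuracy does not depend on that placement, though the per-class recall does.

```python
import numpy as np

# Rows = true labels, columns = predicted labels.
# Placement of 20 and 34 in the off-diagonal cells is assumed.
conf = np.array([[248, 20],
                 [34, 198]])

total = int(conf.sum())        # all test samples
correct = int(np.trace(conf))  # sum of the diagonal = correct predictions
wrong = total - correct
accuracy_pct = correct / total * 100

# Share of each true class that was predicted correctly
recall_per_class = conf.diagonal() / conf.sum(axis=1)

print(f"{correct}/{total} correct -> {accuracy_pct:.1f}%")
print(recall_per_class)
```

Using `np.trace(conf) / conf.sum()` on the live `conf` array keeps the figure in step with whatever split and model a given run produced.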
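4. Summary and reflections
In recent years pumpkin seeds have been profitable. In some regions pumpkin seed production and processing has become a leading industry of considerable scale, and many farming households have prospered by growing pumpkins and processing and selling the seeds (usually white pumpkin seeds). More importantly, pumpkin seeds are said to lower blood pressure and blood sugar: they contain a large amount of easily digested protein, which can help stabilize blood sugar. They are also rich in pantothenic acid, which is said to relieve resting angina and to lower blood pressure somewhat. In addition, the vitamin B1 and vitamin E in pumpkin seeds are said to steady mood and ease insomnia, and the seeds are also said to help protect the heart and cerebral blood vessels. High-quality varieties can greatly enhance these effects, which makes it all the more important to know how to identify superior varieties.
What I gained: Through this project I found that selecting, filtering, cleaning, and processing data is the main direction of study and research in my major, and that handling data flexibly with different algorithms can make our lives more convenient and efficient. The era of big data has arrived. As a big data major, I should learn more algorithms and data processing methods, partition and integrate different types of data in different ways, and finally present them visually, so that scattered data becomes clear, intuitive, and orderly, which benefits the public and makes work and life easier.
Shortcomings: Too few pumpkin seed varieties were included in the sample, the code is too long-winded, and my understanding of the analysis algorithms is not thorough enough. I am not yet fluent in applying the algorithms, still need outside help to implement the code, and cannot design code independently. In future study I need to strengthen both my learning and practice of algorithms and the flexibility of my data analysis thinking.
5. Full code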
import pandas as pd  # data handling
import numpy as np  # linear algebra
import matplotlib.pyplot as plt  # visualization
import seaborn as sns

# Read the dataset
df = pd.read_excel('C:/Users/lenovo/Desktop/Python/Pumpkin_Seeds_Dataset/Pumpkin_Seeds_Dataset.xlsx')
# Show the first five rows
df.head()

# Shape and size
print(f"shape : {df.shape}")
print(f"size : {df.size}")

# Basic summary of the dataset
# There are no missing values
df.info()

# Descriptive statistics
df.describe()

pd.unique(df['Class'])

sns.heatmap(df.drop(columns='Class').corr())
sns.pairplot(df, hue='Class')

# Import libraries
# For data preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder, PolynomialFeatures
# For splitting the dataset
from sklearn.model_selection import train_test_split

# Select the features and the target
x = np.array(df.drop(columns='Class'))
y = np.array(df['Class'])

# Scale the features (only the first six columns are standardized here)
scaler = StandardScaler()
scaler.fit(x[:,:6])
x[:,:6] = scaler.transform(x[:,:6])

# Encode the target y as integers
encoder = LabelEncoder()
encoder.fit(y)
y = encoder.transform(y)
y

# Split the dataset: 80% for training, 20% for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)

# Import libraries
# Linear model: logistic regression
from sklearn.linear_model import LogisticRegression
# Metrics for evaluating the model
from sklearn.metrics import r2_score, mean_squared_error, confusion_matrix

# Build the model and fit it
lr = LogisticRegression(max_iter=10000)
lr.fit(x_train, y_train)
lr.score(x_test, y_test)

train_mses = []
cv_mses = []
models = []
scalers = []
accuracy = []

# Try polynomial feature expansions of degree 1 to 5
for i in range(1, 6):
    poly = PolynomialFeatures(degree=i, include_bias=False)
    X_train_mapped = poly.fit_transform(x_train)
    scaler_poly = StandardScaler()
    X_train_mapped_scaled = scaler_poly.fit_transform(X_train_mapped)
    scalers.append(scaler_poly)
    model = LogisticRegression(max_iter=10000)
    model.fit(X_train_mapped_scaled, y_train)
    models.append(model)
    yhat = model.predict(X_train_mapped_scaled)
    train_mse = mean_squared_error(y_train, yhat) / 2
    train_mses.append(train_mse)
    # Apply the same expansion and the training-set scaler to the held-out data
    X_cv_mapped = poly.transform(x_test)
    X_cv_mapped_scaled = scaler_poly.transform(X_cv_mapped)
    # Compute the MSE on the held-out data
    yhat = model.predict(X_cv_mapped_scaled)
    cv_mse = mean_squared_error(y_test, yhat) / 2
    cv_mses.append(cv_mse)
    accuracy.append(model.score(X_cv_mapped_scaled, y_test) * 100)

# Best accuracy and the polynomial degree that produced it
acc = np.argmax(accuracy)
accuracy[acc]
print(f" the accuracy is : {accuracy[acc]:.4} and the polynomial degree is : {acc+1}")

# Refit once with the best polynomial degree found above
train_mses1 = []
cv_mses1 = []
models1 = []
scalers1 = []
accuracy1 = []

poly = PolynomialFeatures(degree=acc+1, include_bias=False)
X_train_mapped = poly.fit_transform(x_train)
scaler_poly = StandardScaler()
X_train_mapped_scaled = scaler_poly.fit_transform(X_train_mapped)
scalers1.append(scaler_poly)
model = LogisticRegression(max_iter=10000)
model.fit(X_train_mapped_scaled, y_train)
models1.append(model)
yhat = model.predict(X_train_mapped_scaled)
train_mse = mean_squared_error(y_train, yhat) / 2
train_mses1.append(train_mse)
X_cv_mapped = poly.transform(x_test)
X_cv_mapped_scaled = scaler_poly.transform(X_cv_mapped)
# Compute the MSE on the held-out data
yhat = model.predict(X_cv_mapped_scaled)
cv_mse = mean_squared_error(y_test, yhat) / 2
cv_mses1.append(cv_mse)
accuracy1.append(model.score(X_cv_mapped_scaled, y_test) * 100)
y_pred_test = model.predict(X_cv_mapped_scaled)
accuracy1

# Confusion matrix
conf = confusion_matrix(y_test, y_pred_test)
models1[0].score(X_cv_mapped_scaled, y_test)

sns.heatmap(conf, annot=True, cmap='Greens')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

# Counts read off the confusion matrix heatmap of this run:
# vrai = correct predictions, faux = wrong predictions
tot = 500
vrai = 248+198
faux = 34+20
per = (vrai / tot)*100
per
From: https://www.cnblogs.com/clever1/p/17458964.html