Project 7: Ensemble Learning
Practice objectives
- Understand the principles of ensemble learning algorithms;
- Become familiar with and master the use of common ensemble learning algorithms;
- Become familiar with methods for evaluating model performance;
- Master methods for model optimization.
Practice platform
- Operating system: Windows 7 or later
- Python version: 3.8.x or later
- IDE: PyCharm or Anaconda
Practice content
The dataset file "aqi.csv" contains nationwide air quality data for 2020, covering January through September 2020. Its fields include the date, AQI, quality level, PM2.5 concentration (ppm), PM10 concentration (ppm), SO2 concentration (ppm), CO concentration (ppm), NO2 concentration (ppm), and O3_8h concentration (ppm).
The business scenario of this project is air quality analysis and prediction. Split the data into training and test sets, then build ensemble learning models to predict the AQI value and the quality level.
(1) Data understanding and preparation
- Import the Python packages required for this case;
- Explore the loaded data object with the describe() and info() methods and the shape attribute;
- Preprocess the dataset as appropriate for the actual data;
- Extract the features to be used for analysis, and split the data into training and test sets.
(2) Model building, prediction, and optimization
Task 1: Random forest
Regression model
- Build and train the model with RandomForestRegressor();
- Use the model to predict AQI values;
- Evaluate the model with the mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), mean absolute percentage error (MAPE), and the R² score (coefficient of determination);
- Tune the model with the GridSearchCV grid search function, and return the best-performing parameter combination via the best_params_ attribute;
- Retrain the model with those parameters, report the new model's MAE, MSE, RMSE, MAPE, and R² score, and compare them with the metrics before tuning;
- Output the importance of each feature via the feature_importances_ attribute, sorted by importance;
- Predict with the tuned model and output the predictions;
- Visualize the predicted values against the test values.
Classification model
- Build and train the model with RandomForestClassifier();
- Use the model to predict the air quality level;
- Evaluate the model with confusion_matrix(), accuracy_score(), precision_score(), recall_score(), and f1_score(), reporting the confusion matrix, accuracy, precision, recall, and F1 score;
- If the results are unsatisfactory, tune the model.
Task 2: Gradient boosting machine (GBM)
Regression model
- Build and train the model with GradientBoostingRegressor();
- Use the model to predict AQI values;
- Evaluate the model with the mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), mean absolute percentage error (MAPE), and the R² score (coefficient of determination);
- Tune the model with the GridSearchCV grid search function, and return the best-performing parameter combination via the best_params_ attribute;
- Retrain the model with those parameters, report the new model's MAE, MSE, RMSE, MAPE, and R² score, and compare them with the metrics before tuning;
- Output the importance of each feature via the feature_importances_ attribute, sorted by importance;
- Predict with the tuned model and output the predictions;
- Visualize the predicted values against the test values.
Classification model
- Build and train the model with GradientBoostingClassifier();
- Use the model to predict the air quality level;
- Evaluate the model with confusion_matrix(), accuracy_score(), precision_score(), recall_score(), and f1_score(), reporting the confusion matrix, accuracy, precision, recall, and F1 score;
- If the results are unsatisfactory, tune the model.
Task 3: LightGBM (light gradient boosting machine)
Regression model
- Build and train the model with LGBMRegressor();
- Use the model to predict AQI values;
- Evaluate the model and output the results;
- If the results are unsatisfactory, tune the model.
Classification model
- Build and train the model with LGBMClassifier();
- Use the model to predict the air quality level;
- Evaluate the model and output the results;
- If the results are unsatisfactory, tune the model.
(1) Data understanding and preparation
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error, r2_score, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import lightgbm as lgb
# Read the data
data = pd.read_csv('output/modified_data.csv')
# Display basic information about the data
print("Data info:")
print(data.info())
print("\nData description:")
print(data.describe())
print("\nData shape:", data.shape)
# Check for and handle missing values
if data.isnull().sum().sum() > 0:
    # Either fill missing values or drop the rows that contain them;
    # here we simply fill each numeric column with its mean
    data.fillna(data.mean(numeric_only=True), inplace=True)
# Convert the date column to datetime format
data['Date'] = pd.to_datetime(data['Date'])
print(data.head())
# Feature extraction
features = ['PM2_5_(ppm)', 'PM10_(ppm)', 'SO2_(ppm)', 'CO_(ppm)', 'NO2_(ppm)', 'O3_8h_(ppm)']
target_aqi = 'AQI'
target_quality = 'Quality_Level'
# Split into training and test sets
X = data[features]
y_aqi = data[target_aqi]
y_quality = data[target_quality]
X_train, X_test, y_aqi_train, y_aqi_test, y_quality_train, y_quality_test = train_test_split(X, y_aqi, y_quality, test_size=0.2, random_state=42)
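A single train_test_split call can split both targets at once, as above. If the quality levels were strongly imbalanced, an optional variant would be to stratify the split on that column; this is only a sketch and was not used for the results recorded below:
# Optional variant (not used below): keep the class proportions of the quality level in both splits
X_train, X_test, y_aqi_train, y_aqi_test, y_quality_train, y_quality_test = train_test_split(
    X, y_aqi, y_quality, test_size=0.2, random_state=42, stratify=y_quality)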
(2) Model building, prediction, and optimization
Task 1: Random forest
# 1. Build and train the random forest regression model
rf_reg = RandomForestRegressor(random_state=42)
rf_reg.fit(X_train, y_aqi_train)
# 2. Predict AQI values
y_aqi_pred = rf_reg.predict(X_test)
print('Random forest regression predictions of AQI:', y_aqi_pred)
# 3. Compute the evaluation metrics
mae = mean_absolute_error(y_aqi_test, y_aqi_pred)
mse = mean_squared_error(y_aqi_test, y_aqi_pred)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred)
r2 = r2_score(y_aqi_test, y_aqi_pred)
print("Random forest regression evaluation metrics:")
print(f'MAE: {mae}, \nMSE: {mse}, \nRMSE: {rmse}, \nMAPE: {mape}, \nR2_SCORE: {r2}')
Random forest regression predictions of AQI: [124.48 81.15 72.38 71.58 45.77 82.31 34.6 31.42 80.58 83.7
47.59 74.32 75.95 47.13 39.53 33.33 45.76 75.11 80.87 42.57
39.87 58.52 44.34 45.51 60.06 40.73 51.15 45.06 51.46 43.2
70.71 37.2 127.29 31.26 86.79 43.56 90.83 66. 111.21 80.26
33.47 53.14 47.4 130.66 73.89 47.37 47.58 47.16 66.56 39.78
44.36 115.1 105.81 110.77 74.06]
Random forest regression evaluation metrics:
MAE: 3.0150909090909086,
MSE: 25.965259999999997,
RMSE: 5.095611837650116,
MAPE: 0.05528859263399927,
R2_SCORE: 0.965415994168552
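For reference, the five metrics printed above can be written out explicitly; the following is a small sketch that recomputes them with NumPy (the names ending in _manual are introduced here only for illustration):
# Recompute the metrics by hand to make the formulas explicit
errors = y_aqi_test.to_numpy() - y_aqi_pred
mae_manual = np.mean(np.abs(errors))                           # mean absolute error
mse_manual = np.mean(errors ** 2)                              # mean squared error
rmse_manual = np.sqrt(mse_manual)                              # root mean squared error
mape_manual = np.mean(np.abs(errors) / y_aqi_test.to_numpy())  # mean absolute percentage error
r2_manual = 1 - np.sum(errors ** 2) / np.sum((y_aqi_test - y_aqi_test.mean()) ** 2)  # R² score
print(mae_manual, mse_manual, rmse_manual, mape_manual, r2_manual)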
# 4. Tune the model with the GridSearchCV grid search
# Define the parameter grid
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [10, 20, 30],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=rf_reg, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
# Run the grid search
grid_search.fit(X_train, y_aqi_train)
# Get the best parameter combination
best_params = grid_search.best_params_
print(f'Best parameter combination: {best_params}')
Best parameter combination: {'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
# 5. Retrain the model with the best parameters
best_rf_reg = RandomForestRegressor(**best_params, random_state=42)
best_rf_reg.fit(X_train, y_aqi_train)
# Predict and evaluate with the tuned model
y_aqi_pred_optimized = best_rf_reg.predict(X_test)
print('Tuned random forest regression predictions of AQI:', y_aqi_pred_optimized)
mae_optimized = mean_absolute_error(y_aqi_test, y_aqi_pred_optimized)
mse_optimized = mean_squared_error(y_aqi_test, y_aqi_pred_optimized)
rmse_optimized = np.sqrt(mse_optimized)
mape_optimized = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred_optimized)
r2_optimized = r2_score(y_aqi_test, y_aqi_pred_optimized)
print("Tuned random forest regression evaluation metrics:")
print(f'MAE: {mae_optimized}, \nMSE: {mse_optimized}, \nRMSE: {rmse_optimized}, \nMAPE: {mape_optimized}, \nR2_SCORE: {r2_optimized}')
Tuned random forest regression predictions of AQI: [124.48 81.15 72.38 71.58 45.77 82.31 34.6 31.42 80.58 83.7
47.59 74.32 75.95 47.13 39.53 33.33 45.76 75.11 80.87 42.57
39.87 58.52 44.34 45.51 60.06 40.73 51.15 45.06 51.46 43.2
70.71 37.2 127.29 31.26 86.79 43.56 90.83 66. 111.21 80.26
33.47 53.14 47.4 130.66 73.89 47.37 47.58 47.16 66.56 39.78
44.36 115.1 105.81 110.77 74.06]
Tuned random forest regression evaluation metrics:
MAE: 3.0150909090909086,
MSE: 25.965259999999997,
RMSE: 5.095611837650116,
MAPE: 0.05528859263399927,
R2_SCORE: 0.965415994168552
# Compare the metrics before and after tuning
print("Comparison of metrics before and after tuning:")
print(f"Before: MAE: {mae}, MSE: {mse}, RMSE: {rmse}, MAPE: {mape}, R2_SCORE: {r2}")
print(f"After: MAE: {mae_optimized}, MSE: {mse_optimized}, RMSE: {rmse_optimized}, MAPE: {mape_optimized}, R2_SCORE: {r2_optimized}")
Comparison of metrics before and after tuning:
Before: MAE: 3.0150909090909086, MSE: 25.965259999999997, RMSE: 5.095611837650116, MAPE: 0.05528859263399927, R2_SCORE: 0.965415994168552
After: MAE: 3.0150909090909086, MSE: 25.965259999999997, RMSE: 5.095611837650116, MAPE: 0.05528859263399927, R2_SCORE: 0.965415994168552
Tuning did not improve the model: the selected combination matches the default settings except for max_depth, which evidently does not constrain the trees here, so the predictions and metrics are unchanged.
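Since the selected combination is essentially the default configuration, one option before concluding that the defaults are optimal is to widen the search space. The following is a hedged sketch with illustrative values only; it was not part of the recorded run:
# Hypothetical wider grid for the random forest; the listed values are only examples
param_grid_wide = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20, 40],
    'max_features': ['sqrt', 0.5, 1.0],
    'min_samples_leaf': [1, 2, 4],
}
grid_search_wide = GridSearchCV(RandomForestRegressor(random_state=42),
                                param_grid=param_grid_wide, cv=5,
                                scoring='neg_mean_squared_error', n_jobs=-1)
grid_search_wide.fit(X_train, y_aqi_train)
print(grid_search_wide.best_params_, grid_search_wide.best_score_)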
# 6. Output the importance of each feature via the feature_importances_ attribute
importances = best_rf_reg.feature_importances_
feature_importances = pd.Series(importances, index=features).sort_values(ascending=False)
# 7. The tuned model's predictions were already printed above; print the sorted feature importances
print(feature_importances)
# 8. Visualize the predicted values against the test values
plt.figure(figsize=(10, 6))
plt.scatter(y_aqi_test, y_aqi_pred_optimized, alpha=0.5)
plt.xlabel('Actual AQI')
plt.ylabel('Predicted AQI')
plt.title('Actual vs Predicted AQI')
plt.show()
PM10_(ppm) 0.400419
PM2_5_(ppm) 0.291729
O3_8h_(ppm) 0.288429
CO_(ppm) 0.008184
NO2_(ppm) 0.007934
SO2_(ppm) 0.003305
dtype: float64
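The classification part of Task 1 (RandomForestClassifier) is not shown in the run above. The following is a minimal sketch of how it could be completed, reusing the splits from the data-preparation section; it is an illustrative addition, not output from the recorded run:
# Task 1, classification model (illustrative sketch, not part of the recorded run)
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_quality_train)
y_quality_pred_rf = rf_clf.predict(X_test)
print('Random forest classification predictions:', y_quality_pred_rf)
print('Confusion Matrix:\n', confusion_matrix(y_quality_test, y_quality_pred_rf))
print('Accuracy:', accuracy_score(y_quality_test, y_quality_pred_rf))
print('Precision:', precision_score(y_quality_test, y_quality_pred_rf, average='weighted'))
print('Recall:', recall_score(y_quality_test, y_quality_pred_rf, average='weighted'))
print('F1 Score:', f1_score(y_quality_test, y_quality_pred_rf, average='weighted'))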
Task 2: Gradient boosting machine (GBM)
Regression model
# 1. Build and train the model with GradientBoostingRegressor()
gb_reg = GradientBoostingRegressor(random_state=42)
gb_reg.fit(X_train, y_aqi_train)
# 2. Use the model to predict AQI values
y_aqi_pred = gb_reg.predict(X_test)
print('GBM regression predictions of AQI:', y_aqi_pred)
# 3. Evaluate the model
mae = mean_absolute_error(y_aqi_test, y_aqi_pred)
mse = mean_squared_error(y_aqi_test, y_aqi_pred)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred)
r2 = r2_score(y_aqi_test, y_aqi_pred)
print("Gradient Boosting Regression Model Evaluation Metrics:")
print(f'MAE: {mae}, \nMSE: {mse}, \nRMSE: {rmse}, \nMAPE: {mape}, \nR2_SCORE: {r2}')
GBM regression predictions of AQI: [122.57416247 83.36233285 73.90280417 71.61249735 45.90407098
83.09407824 35.38809475 32.1115523 81.92797541 83.40916295
48.82405535 74.28270394 74.96495747 45.69629863 39.59354642
33.09971192 45.41896268 75.52727318 81.71507209 47.02496198
41.96486507 59.76085878 45.10753769 46.1912337 59.05166283
49.05189862 53.29885368 47.58476507 46.59894793 42.17298408
70.67172663 35.57436497 130.76443134 33.12142879 85.93142525
41.04272972 88.25804535 64.42863259 112.47587802 80.12500147
32.96123373 55.09504267 50.37469809 125.99062665 75.72767345
48.10707457 51.29551088 47.94867709 70.66198919 40.51320902
40.7250176 115.95276244 114.3584965 112.04106305 74.86570745]
Gradient Boosting Regression Model Evaluation Metrics:
MAE: 3.017067274506405,
MSE: 19.567961603563685,
RMSE: 4.4235688763218874,
MAPE: 0.058486439950287586,
R2_SCORE: 0.9739367717401174
# 4. Tune the model with the GridSearchCV grid search
# Define the parameter grid
param_grid = {
'n_estimators': [100, 200, 300],
'learning_rate': [0.01, 0.1, 0.2],
'max_depth': [3, 5, 10],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=gb_reg, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
# Run the grid search
grid_search.fit(X_train, y_aqi_train)
# Get the best parameter combination
best_params = grid_search.best_params_
print(f'Best Parameters: {best_params}')
Best Parameters: {'learning_rate': 0.1, 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 300}
# 5. Retrain the model with the best parameters
best_gb_reg = GradientBoostingRegressor(**best_params, random_state=42)
best_gb_reg.fit(X_train, y_aqi_train)
# Predict with the tuned model
y_aqi_pred_optimized = best_gb_reg.predict(X_test)
print('Predictions of the tuned model:', y_aqi_pred_optimized)
# Compute the evaluation metrics of the tuned model
mae_optimized = mean_absolute_error(y_aqi_test, y_aqi_pred_optimized)
mse_optimized = mean_squared_error(y_aqi_test, y_aqi_pred_optimized)
rmse_optimized = np.sqrt(mse_optimized)
mape_optimized = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred_optimized)
r2_optimized = r2_score(y_aqi_test, y_aqi_pred_optimized)
print("Tuned gradient boosting regression evaluation metrics:")
print(f'Optimized MAE: {mae_optimized}, \nOptimized MSE: {mse_optimized}, \nOptimized RMSE: {rmse_optimized}, \nOptimized MAPE: {mape_optimized}, \nOptimized R2_SCORE: {r2_optimized}')
Predictions of the tuned model: [124.50273685 83.46470773 76.67313626 71.43908717 46.06087546
82.48270218 35.45781481 30.29347664 80.68483493 83.63975494
48.01910073 75.04391558 74.6780025 44.02048381 39.16875902
31.57326064 45.52152266 74.54621085 81.98742113 41.15229431
40.05067005 60.0349372 43.40693783 42.44777993 60.0874834
46.4533299 53.98613726 45.00781228 51.56679542 38.97574632
73.97473389 36.03646256 131.65412729 30.82872235 86.88627133
44.17166092 89.64827072 66.71578258 112.06193027 80.82544043
32.13607404 53.33558888 48.52689834 125.55765644 77.38396113
48.52990476 51.07272122 48.89955218 69.66154718 40.70715896
49.21862157 117.74301294 107.39395475 111.89285961 75.11097803]
Tuned gradient boosting regression evaluation metrics:
Optimized MAE: 2.7372135954667853,
Optimized MSE: 20.880137541908642,
Optimized RMSE: 4.569478913608054,
Optimized MAPE: 0.048316013543643156,
Optimized R2_SCORE: 0.9721890403365572
# Compare the metrics before and after tuning
print("Comparison of metrics before and after tuning:")
print(f"Before: MAE: {mae}, MSE: {mse}, RMSE: {rmse}, MAPE: {mape}, R2_SCORE: {r2}")
print(f"After: MAE: {mae_optimized}, MSE: {mse_optimized}, RMSE: {rmse_optimized}, MAPE: {mape_optimized}, R2_SCORE: {r2_optimized}")
Comparison of metrics before and after tuning:
Before: MAE: 3.017067274506405, MSE: 19.567961603563685, RMSE: 4.4235688763218874, MAPE: 0.058486439950287586, R2_SCORE: 0.9739367717401174
After: MAE: 2.7372135954667853, MSE: 20.880137541908642, RMSE: 4.569478913608054, MAPE: 0.048316013543643156, R2_SCORE: 0.9721890403365572
Tuning lowers MAE and MAPE but slightly worsens MSE, RMSE, and the R² score, which is plausible because the grid search optimized negative MSE on cross-validation folds rather than on this test set.
# 6. Output the feature importances
importances = best_gb_reg.feature_importances_
feature_importances = pd.Series(importances, index=features).sort_values(ascending=False)
print(feature_importances)
# Visualize the predicted values against the test values
plt.figure(figsize=(10, 6))
plt.scatter(y_aqi_test, y_aqi_pred_optimized, alpha=0.5)
plt.xlabel('Actual AQI')
plt.ylabel('Predicted AQI')
plt.title('Actual vs Predicted AQI')
plt.show()
PM10_(ppm) 0.422780
O3_8h_(ppm) 0.303769
PM2_5_(ppm) 0.265263
NO2_(ppm) 0.006753
SO2_(ppm) 0.001192
CO_(ppm) 0.000243
dtype: float64
Classification model
# 1. Build and train the model with GradientBoostingClassifier()
gbm_clf = GradientBoostingClassifier(random_state=42)
gbm_clf.fit(X_train, y_quality_train)
# 2. Predict the air quality level
y_quality_pred = gbm_clf.predict(X_test)
print('GBM classification predictions:', y_quality_pred)
# 3. Evaluate the model
conf_matrix = confusion_matrix(y_quality_test, y_quality_pred)
accuracy = accuracy_score(y_quality_test, y_quality_pred)
precision = precision_score(y_quality_test, y_quality_pred, average='weighted')
recall = recall_score(y_quality_test, y_quality_pred, average='weighted')
f1 = f1_score(y_quality_test, y_quality_pred, average='weighted')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Accuracy: {accuracy}, \nPrecision: {precision}, \nRecall: {recall}, \nF1 Score: {f1}')
GBM classification predictions: ['C' 'B' 'B' 'B' 'A' 'B' 'A' 'A' 'B' 'B' 'A' 'B' 'B' 'A' 'A' 'A' 'A' 'B'
'B' 'A' 'A' 'B' 'A' 'A' 'B' 'B' 'B' 'B' 'B' 'A' 'B' 'A' 'C' 'A' 'B' 'A'
'B' 'B' 'C' 'B' 'A' 'B' 'B' 'C' 'B' 'A' 'A' 'B' 'B' 'A' 'B' 'C' 'C' 'C'
'B']
Confusion Matrix:
[[20 3 0]
[ 0 25 0]
[ 0 0 7]]
Accuracy: 0.9454545454545454,
Precision: 0.9512987012987013,
Recall: 0.9454545454545454,
F1 Score: 0.9450955363197574
# 4. Tune the model
# Define the parameter grid
param_grid = {
'n_estimators': [100, 200, 300],
'learning_rate': [0.01, 0.1, 0.2],
'max_depth': [3, 5, 10],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
# Create a StratifiedKFold object (optional; a sketch of this variant follows the grid-search output below)
# stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=gbm_clf, param_grid=param_grid, cv=2, scoring='accuracy')
# Run the grid search
grid_search.fit(X_train, y_quality_train)
# Get the best parameter combination
best_params = grid_search.best_params_
print(f'Best Parameters: {best_params}')
Best Parameters: {'learning_rate': 0.1, 'max_depth': 5, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200}
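The commented-out line above hints at an explicit StratifiedKFold splitter, which would also require an extra import. The following sketch shows how that variant could be wired into the grid search; it is not part of the recorded run, and grid_search_stratified is an illustrative name:
# Hypothetical variant: explicit stratified folds for the classification grid search
from sklearn.model_selection import StratifiedKFold

stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search_stratified = GridSearchCV(estimator=gbm_clf, param_grid=param_grid,
                                      cv=stratified_kfold, scoring='accuracy')
grid_search_stratified.fit(X_train, y_quality_train)
print(grid_search_stratified.best_params_)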
# 5. Retrain the model with the best parameters
best_gbm_clf = GradientBoostingClassifier(**best_params, random_state=42)
best_gbm_clf.fit(X_train, y_quality_train)
# 6. Predict with the tuned model
y_quality_pred_optimized = best_gbm_clf.predict(X_test)
print('Air quality level predictions of the tuned model:', y_quality_pred_optimized)
# 7. Compute the evaluation metrics of the tuned model
conf_matrix_optimized = confusion_matrix(y_quality_test, y_quality_pred_optimized)
accuracy_optimized = accuracy_score(y_quality_test, y_quality_pred_optimized)
precision_optimized = precision_score(y_quality_test, y_quality_pred_optimized, average='weighted')
recall_optimized = recall_score(y_quality_test, y_quality_pred_optimized, average='weighted')
f1_optimized = f1_score(y_quality_test, y_quality_pred_optimized, average='weighted')
print(f'Optimized Confusion Matrix:\n{conf_matrix_optimized}')
print(f'Optimized Accuracy: {accuracy_optimized}, \nOptimized Precision: {precision_optimized}, \nOptimized Recall: {recall_optimized}, \nOptimized F1 Score: {f1_optimized}')
# Compare the metrics before and after tuning
print("Comparison of metrics before and after tuning:\n")
print(f"Before: Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1 Score: {f1}")
print(f"After: Accuracy: {accuracy_optimized}, Precision: {precision_optimized}, Recall: {recall_optimized}, F1 Score: {f1_optimized}")
Air quality level predictions of the tuned model: ['C' 'B' 'B' 'B' 'A' 'B' 'A' 'A' 'B' 'B' 'A' 'B' 'B' 'A' 'A' 'A' 'A' 'B'
'B' 'A' 'A' 'B' 'A' 'A' 'B' 'B' 'B' 'A' 'A' 'A' 'B' 'A' 'C' 'A' 'B' 'A'
'B' 'B' 'C' 'B' 'A' 'B' 'A' 'C' 'B' 'A' 'A' 'B' 'B' 'A' 'B' 'C' 'C' 'C'
'B']
Optimized Confusion Matrix:
[[22 1 0]
[ 1 24 0]
[ 0 0 7]]
Optimized Accuracy: 0.9636363636363636,
Optimized Precision: 0.9636363636363636,
Optimized Recall: 0.9636363636363636,
Optimized F1 Score: 0.9636363636363636
Comparison of metrics before and after tuning:
Before: Accuracy: 0.9454545454545454, Precision: 0.9512987012987013, Recall: 0.9454545454545454, F1 Score: 0.9450955363197574
After: Accuracy: 0.9636363636363636, Precision: 0.9636363636363636, Recall: 0.9636363636363636, F1 Score: 0.9636363636363636
Task 3: LightGBM
# 1. Regression model
# 1.1 Build and train the regression model with LGBMRegressor()
lgb_reg = lgb.LGBMRegressor(random_state=42)
lgb_reg.fit(X_train, y_aqi_train)
# 1.2 Predict AQI values with the model
y_aqi_pred = lgb_reg.predict(X_test)
# 1.3 Evaluate the model
mse = mean_squared_error(y_aqi_test, y_aqi_pred)
r2 = r2_score(y_aqi_test, y_aqi_pred)
print("LGBMRegressor Model Evaluation Metrics:")
print(f'Mean Squared Error (MSE): {mse}')
print(f'R^2 Score: {r2}')
LGBMRegressor Model Evaluation Metrics:
Mean Squared Error (MSE): 70.6065599506794
R^2 Score: 0.9059567406190894
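Only MSE and the R² score are reported here; for consistency with the earlier models, the remaining regression metrics could also be computed with the functions already imported, for example:
# Optional: report the full metric set used for the other regression models
mae = mean_absolute_error(y_aqi_test, y_aqi_pred)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred)
print(f'MAE: {mae}, RMSE: {rmse}, MAPE: {mape}')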
from sklearn.preprocessing import LabelEncoder
# 2. Classification model
# 2.1 Build and train the classification model with LGBMClassifier()
# Encode the class labels as integers
label_encoder = LabelEncoder()
y_quality_train_encoded = label_encoder.fit_transform(y_quality_train)
y_quality_test_encoded = label_encoder.transform(y_quality_test)
lgb_clf = lgb.LGBMClassifier(random_state=42)
lgb_clf.fit(X_train, y_quality_train_encoded)
# 2.2 Predict the air quality level with the model
y_quality_pred = lgb_clf.predict(X_test)
print('Predicted air quality levels:', y_quality_pred)
# 2.3 Evaluate the model
conf_matrix = confusion_matrix(y_quality_test_encoded, y_quality_pred)
accuracy = accuracy_score(y_quality_test_encoded, y_quality_pred)
precision = precision_score(y_quality_test_encoded, y_quality_pred, average='weighted')
recall = recall_score(y_quality_test_encoded, y_quality_pred, average='weighted')
f1 = f1_score(y_quality_test_encoded, y_quality_pred, average='weighted')
print("LGBMClassifier Model Evaluation Metrics:")
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')
Predicted air quality levels: [2 1 1 1 0 1 0 0 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 0 2 0 1 0 1
1 2 1 0 1 0 2 1 0 0 1 1 0 1 2 1 2 1]
LGBMClassifier Model Evaluation Metrics:
Confusion Matrix:
[[22 1 0]
[ 1 24 0]
[ 0 1 6]]
Accuracy: 0.9454545454545454
Precision: 0.9468531468531469
Recall: 0.9454545454545454
F1 Score: 0.9452900041135335
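Because the classifier was trained on encoded labels, its predictions are the integers 0/1/2; they can be mapped back to the original quality levels with the fitted encoder, for example:
# Map the encoded predictions back to the original quality-level labels
print('Predicted quality levels (decoded):', label_encoder.inverse_transform(y_quality_pred))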
# 3. Tune the models if the evaluation results are unsatisfactory
# 3.1 Define the parameter grids
param_grid_reg = {
'n_estimators': [100, 200, 300],
'learning_rate': [0.01, 0.1, 0.2],
'num_leaves': [31, 63, 127],
'max_depth': [-1, 5, 10],
'min_child_samples': [20, 50, 100]
}
param_grid_clf = {
'n_estimators': [100, 200, 300],
'learning_rate': [0.01, 0.1, 0.2],
'num_leaves': [31, 63, 127],
'max_depth': [-1, 5, 10],
'min_child_samples': [20, 50, 100]
}
# 3.2 Create the GridSearchCV objects
grid_search_reg = GridSearchCV(estimator=lgb_reg, param_grid=param_grid_reg, cv=5, scoring='neg_mean_squared_error')
grid_search_clf = GridSearchCV(estimator=lgb_clf, param_grid=param_grid_clf, cv=5, scoring='accuracy')
# 3.3 Run the grid searches
grid_search_reg.fit(X_train, y_aqi_train)
grid_search_clf.fit(X_train, y_quality_train_encoded)
# 3.4 Get the best parameter combinations
best_params_reg = grid_search_reg.best_params_
best_params_clf = grid_search_clf.best_params_
print(f'Best Parameters for Regression: {best_params_reg}')
print(f'Best Parameters for Classification: {best_params_clf}')
Best Parameters for Regression: {'learning_rate': 0.1, 'max_depth': 5, 'min_child_samples': 20, 'n_estimators': 100, 'num_leaves': 31}
Best Parameters for Classification: {'learning_rate': 0.1, 'max_depth': 5, 'min_child_samples': 20, 'n_estimators': 100, 'num_leaves': 31}
# 3.5 Retrain the models with the best parameters
best_lgb_reg = lgb.LGBMRegressor(**best_params_reg, random_state=42)
best_lgb_clf = lgb.LGBMClassifier(**best_params_clf, random_state=42)
best_lgb_reg.fit(X_train, y_aqi_train)
best_lgb_clf.fit(X_train, y_quality_train_encoded)
# 3.6 Predict with the tuned models
y_aqi_pred_optimized = best_lgb_reg.predict(X_test)
y_quality_pred_optimized = best_lgb_clf.predict(X_test)
print('AQI predictions of the tuned model:', y_aqi_pred_optimized)
print('Air quality level predictions of the tuned model:', y_quality_pred_optimized)
# 3.7 Evaluate the tuned models
mse_optimized = mean_squared_error(y_aqi_test, y_aqi_pred_optimized)
r2_optimized = r2_score(y_aqi_test, y_aqi_pred_optimized)
conf_matrix_optimized = confusion_matrix(y_quality_test_encoded, y_quality_pred_optimized)
accuracy_optimized = accuracy_score(y_quality_test_encoded, y_quality_pred_optimized)
precision_optimized = precision_score(y_quality_test_encoded, y_quality_pred_optimized, average='weighted')
recall_optimized = recall_score(y_quality_test_encoded, y_quality_pred_optimized, average='weighted')
f1_optimized = f1_score(y_quality_test_encoded, y_quality_pred_optimized, average='weighted')
print("Optimized LGBMRegressor Model Evaluation Metrics:")
print(f'Optimized Mean Squared Error (MSE): {mse_optimized}')
print(f'Optimized R^2 Score: {r2_optimized}')
print("Optimized LGBMClassifier Model Evaluation Metrics:")
print(f'Optimized Confusion Matrix:\n{conf_matrix_optimized}')
print(f'Optimized Accuracy: {accuracy_optimized}')
print(f'Optimized Precision: {precision_optimized}')
print(f'Optimized Recall: {recall_optimized}')
print(f'Optimized F1 Score: {f1_optimized}')
AQI predictions of the tuned model: [119.6722501 90.43625783 76.95094102 69.7134339 45.49522429
85.11016632 35.29020956 34.373013 76.12352252 86.39110431
48.60966258 71.83512479 73.98876859 44.13139587 40.82771554
34.96190592 45.33698962 73.97657317 84.40383692 48.74370587
42.31917891 54.61740284 43.89328402 50.84420449 61.99838848
44.00117867 54.84723723 47.00982841 47.98332788 49.8258541
62.28614705 36.04575205 113.75560249 34.31105093 88.98552298
45.43941569 106.8158533 62.86787307 111.01787045 82.98067324
34.80876636 65.3185259 50.05687814 115.46064086 84.07845619
49.74122766 52.93800566 54.78650467 54.40771277 40.1914266
36.17207261 107.11934225 97.1210987 100.3162011 74.79308805]
Air quality level predictions of the tuned model: [2 1 1 1 0 1 0 0 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 0 2 0 1 0 1
1 2 1 0 1 0 2 1 0 0 1 1 0 0 2 1 2 1]
Optimized LGBMRegressor Model Evaluation Metrics:
Optimized Mean Squared Error (MSE): 75.10416813845606
Optimized R^2 Score: 0.8999662245297593
Optimized LGBMClassifier Model Evaluation Metrics:
Optimized Confusion Matrix:
[[22 1 0]
[ 2 23 0]
[ 0 1 6]]
Optimized Accuracy: 0.9272727272727272
Optimized Precision: 0.9287878787878787
Optimized Recall: 0.9272727272727272
Optimized F1 Score: 0.9271536973664632
# Compare the metrics before and after tuning
print("Comparison of Evaluation Metrics Before and After Optimization:")
print(f"Regression: MSE: {mse} -> {mse_optimized}, R^2 Score: {r2} -> {r2_optimized}")
print(f"Classification: Accuracy: {accuracy} -> {accuracy_optimized}, Precision: {precision} -> {precision_optimized}, Recall: {recall} -> {recall_optimized}, F1 Score: {f1} -> {f1_optimized}")
Comparison of Evaluation Metrics Before and After Optimization:
Regression: MSE: 70.6065599506794 -> 75.10416813845606, R^2 Score: 0.9059567406190894 -> 0.8999662245297593
Classification: Accuracy: 0.9454545454545454 -> 0.9272727272727272, Precision: 0.9468531468531469 -> 0.9287878787878787, Recall: 0.9454545454545454 -> 0.9272727272727272, F1 Score: 0.9452900041135335 -> 0.9271536973664632
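On this dataset the tuned LightGBM models score slightly worse than the defaults, so the untuned configuration (or a wider grid) would be the better choice here. As a closing step, the regression models can be compared side by side; the following is a minimal sketch, assuming the fitted estimators from the sections above are still in memory:
# Collect test-set metrics for all fitted regression models into one table
models = {
    'RandomForest': rf_reg,
    'RandomForest (tuned)': best_rf_reg,
    'GBM': gb_reg,
    'GBM (tuned)': best_gb_reg,
    'LightGBM': lgb_reg,
    'LightGBM (tuned)': best_lgb_reg,
}
rows = []
for name, model in models.items():
    pred = model.predict(X_test)
    rows.append({'model': name,
                 'MAE': mean_absolute_error(y_aqi_test, pred),
                 'RMSE': np.sqrt(mean_squared_error(y_aqi_test, pred)),
                 'MAPE': mean_absolute_percentage_error(y_aqi_test, pred),
                 'R2': r2_score(y_aqi_test, pred)})
print(pd.DataFrame(rows).set_index('model').round(4))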