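Hyperparameter tuning is an important part of optimizing a machine learning model: you adjust the model's hyperparameters to improve its performance. The sections below explain how to tune hyperparameters and provide code examples.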
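1. What is a hyperparameter?
A hyperparameter is a value set before training, unlike the parameters a model learns during training (such as weights). For decision trees and random forests, hyperparameters include the maximum tree depth, the split criterion, and the number of trees.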
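To make the distinction concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier on the iris dataset. The dataset and the max_depth value are chosen for illustration only. get_params() lists the hyperparameters fixed before training, while the fitted tree structure is what the model learned during fit().

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters: chosen by us before training starts.
clf = DecisionTreeClassifier(max_depth=3, criterion="gini", random_state=0)
hyperparams = clf.get_params()
print(hyperparams["max_depth"], hyperparams["criterion"])

# Learned parameters: produced by fit(), e.g. the split structure of the tree.
clf.fit(X, y)
learned_depth = clf.get_depth()        # actual depth, never above max_depth
learned_nodes = clf.tree_.node_count   # number of nodes the fitted tree ended up with
print(learned_depth, learned_nodes)
```

max_depth only caps the depth; how deep the tree actually grows, and where it splits, is decided by the data.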
2. Tuning methods
2.1 Grid search
Grid search exhaustively evaluates every combination of values in a specified parameter grid and keeps the best one.
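The exhaustive enumeration can be sketched with the standard library alone. The grid below is a hypothetical example and iter_grid is an illustrative helper, not a scikit-learn function; it only shows how the candidate list is built, not how each candidate is scored.

```python
from itertools import product

# Hypothetical grid, for illustration only.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 10],
    "min_samples_split": [2, 5, 10],
}

def iter_grid(grid):
    """Yield one dict per combination of values in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

combinations = list(iter_grid(param_grid))
print(len(combinations))   # 2 * 4 * 3 = 24 candidates
print(combinations[0])
```

The number of candidates is the product of the list lengths, so each added hyperparameter multiplies the cost. With 5-fold cross-validation, these 24 candidates would already require 120 model fits.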
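2.2 Random search
Random search draws candidates at random from the parameter space. It can often find a near-optimal combination with less computation than an exhaustive grid.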
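A minimal sketch of the sampling step, using only the standard library. The search space and the sample_params helper are hypothetical; the point is that the number of evaluations (n_iter) is fixed in advance and does not depend on how large the space is.

```python
import random

def sample_params(rng):
    """Draw one hypothetical candidate from the search space."""
    return {
        "n_estimators": rng.randint(50, 199),          # random.randint is inclusive on both ends
        "max_depth": rng.randint(3, 9),
        "max_features": rng.choice(["sqrt", "log2"]),
    }

rng = random.Random(42)
n_iter = 10   # evaluation budget, chosen up front
candidates = [sample_params(rng) for _ in range(n_iter)]
for params in candidates[:3]:
    print(params)
```

Each candidate would then be scored with cross-validation, exactly as in grid search; only the way candidates are generated differs.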
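2.3 Bayesian optimization
Bayesian optimization uses the results of earlier evaluations to decide which parameters to try next. It often needs fewer evaluations than grid search or random search to reach a comparable result.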
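The idea of "using earlier results to guide the search" can be illustrated with a toy loop. This is a sketch under stated assumptions, not how any particular library implements it: the objective is a hypothetical one-dimensional function standing in for a cross-validation loss, the surrogate is a Gaussian process from scikit-learn, and the next point is chosen by the expected-improvement criterion over a fixed candidate grid.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    """Hypothetical stand-in for a cross-validation loss; its minimum is at x = 2."""
    return (x - 2.0) ** 2

def expected_improvement(candidates, gp, best_y):
    """Expected improvement for minimization at each candidate point."""
    mu, sigma = gp.predict(candidates.reshape(-1, 1), return_std=True)
    sigma = np.maximum(sigma, 1e-9)   # avoid division by zero at observed points
    z = (best_y - mu) / sigma
    return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt_sketch(n_init=3, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    candidates = np.linspace(0.0, 5.0, 201)
    # Start from a few random evaluations.
    xs = [float(x) for x in rng.uniform(0.0, 5.0, size=n_init)]
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        # Fit the surrogate to everything observed so far.
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True, alpha=1e-6)
        gp.fit(np.array(xs).reshape(-1, 1), np.array(ys))
        # Evaluate next where the surrogate expects the largest improvement.
        ei = expected_improvement(candidates, gp, min(ys))
        x_next = float(candidates[np.argmax(ei)])
        xs.append(x_next)
        ys.append(objective(x_next))
    best = int(np.argmin(ys))
    return xs[best], ys[best]

best_x, best_y = bayes_opt_sketch()
print(best_x, best_y)
```

Unlike grid or random search, each new evaluation here depends on all previous ones, which is why the evaluations cannot be run fully in parallel.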
3. Methods and code examples
The code examples below use scikit-learn's GridSearchCV and RandomizedSearchCV for hyperparameter tuning.
3.1 Data preparation
First, prepare the datasets.
from sklearn.datasets import load_iris, load_diabetes
from sklearn.model_selection import train_test_split
# Classification dataset
iris = load_iris()
X_clf, y_clf = iris.data, iris.target
X_clf_train, X_clf_test, y_clf_train, y_clf_test = train_test_split(X_clf, y_clf, test_size=0.2, random_state=42)
# Regression dataset (load_diabetes ships with scikit-learn; load_boston was removed in version 1.2)
diabetes = load_diabetes()
X_reg, y_reg = diabetes.data, diabetes.target
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
3.2 Grid search
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
# Grid search for the decision tree classifier
param_grid_clf = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}
grid_search_clf = GridSearchCV(DecisionTreeClassifier(), param_grid_clf, cv=5, scoring='accuracy')
grid_search_clf.fit(X_clf_train, y_clf_train)
print(f"Best parameters for DecisionTreeClassifier: {grid_search_clf.best_params_}")
print(f"Best cross-validation accuracy: {grid_search_clf.best_score_}")
# Grid search for the decision tree regressor
# ('squared_error' and 'absolute_error' are the current names of the former 'mse' and 'mae' criteria)
param_grid_reg = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error'],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}
grid_search_reg = GridSearchCV(DecisionTreeRegressor(), param_grid_reg, cv=5, scoring='neg_mean_squared_error')
grid_search_reg.fit(X_reg_train, y_reg_train)
print(f"Best parameters for DecisionTreeRegressor: {grid_search_reg.best_params_}")
# The scorer returns negated MSE (higher is better), so flip the sign for reporting
print(f"Best cross-validation MSE: {-grid_search_reg.best_score_}")
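Beyond best_params_ and best_score_, a fitted search object keeps the scores of every candidate in its cv_results_ dictionary. The following self-contained sketch, using a deliberately small grid on iris chosen for illustration, prints the top three candidates with the mean and standard deviation of their fold scores.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

results = search.cv_results_
# rank_test_score == 1 marks the best candidate; sort indices by rank to view the ranking.
order = sorted(range(len(results["params"])), key=lambda i: results["rank_test_score"][i])
for i in order[:3]:
    print(results["params"][i],
          round(results["mean_test_score"][i], 3),
          round(results["std_test_score"][i], 3))
```

Comparing the gap between the top candidates with their standard deviations helps judge whether the "best" combination is meaningfully better or just within fold-to-fold noise.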
3.3 Random search
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from scipy.stats import randint
# Random search for the random forest classifier
# ('auto' is no longer accepted for max_features; for classifiers it was equivalent to 'sqrt')
param_dist_clf = {
    'n_estimators': randint(50, 200),
    'max_features': ['sqrt', 'log2'],
    'max_depth': randint(3, 10),
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 11)
}
random_search_clf = RandomizedSearchCV(RandomForestClassifier(), param_distributions=param_dist_clf, n_iter=100, cv=5, scoring='accuracy', random_state=42)
random_search_clf.fit(X_clf_train, y_clf_train)
print(f"Best parameters for RandomForestClassifier: {random_search_clf.best_params_}")
print(f"Best cross-validation accuracy: {random_search_clf.best_score_}")
# Random search for the random forest regressor
# (1.0 means "use all features", which is what 'auto' meant for regressors)
param_dist_reg = {
    'n_estimators': randint(50, 200),
    'max_features': [1.0, 'sqrt', 'log2'],
    'max_depth': randint(3, 10),
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 11)
}
random_search_reg = RandomizedSearchCV(RandomForestRegressor(), param_distributions=param_dist_reg, n_iter=100, cv=5, scoring='neg_mean_squared_error', random_state=42)
random_search_reg.fit(X_reg_train, y_reg_train)
print(f"Best parameters for RandomForestRegressor: {random_search_reg.best_params_}")
print(f"Best cross-validation MSE: {-random_search_reg.best_score_}")
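One detail of the distributions above is easy to miss: scipy.stats.randint(low, high) excludes the upper bound, so randint(2, 11) yields integers from 2 to 10. The short check below draws samples to confirm the range.

```python
from scipy.stats import randint

dist = randint(2, 11)   # same form as the 'min_samples_split' entry
draws = dist.rvs(size=1000, random_state=0)
print(draws.min(), draws.max())   # all values fall within 2..10; 11 is never drawn
```

This also explains the lower bound of 2 for min_samples_split: scikit-learn requires at least 2 samples to split a node when the value is given as an integer.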
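4. Evaluating the tuned models
Take the model that each search refit with its best parameters (best_estimator_) and evaluate it on the held-out test set.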
from sklearn.metrics import accuracy_score, mean_squared_error
# Evaluate the best decision tree classifier
best_clf = grid_search_clf.best_estimator_
y_clf_pred = best_clf.predict(X_clf_test)
print(f"Test set accuracy with best DecisionTreeClassifier: {accuracy_score(y_clf_test, y_clf_pred)}")
# Evaluate the best decision tree regressor
best_reg = grid_search_reg.best_estimator_
y_reg_pred = best_reg.predict(X_reg_test)
print(f"Test set MSE with best DecisionTreeRegressor: {mean_squared_error(y_reg_test, y_reg_pred)}")
# Evaluate the best random forest classifier
best_clf_rf = random_search_clf.best_estimator_
y_clf_rf_pred = best_clf_rf.predict(X_clf_test)
print(f"Test set accuracy with best RandomForestClassifier: {accuracy_score(y_clf_test, y_clf_rf_pred)}")
# Evaluate the best random forest regressor
best_reg_rf = random_search_reg.best_estimator_
y_reg_rf_pred = best_reg_rf.predict(X_reg_test)
print(f"Test set MSE with best RandomForestRegressor: {mean_squared_error(y_reg_test, y_reg_rf_pred)}")
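A test score is easier to interpret next to a baseline. The self-contained sketch below compares a decision tree with default hyperparameters against a grid-searched one on the same iris split. The grid is a reduced version chosen for illustration, and the result should not be read as a general statement about how much tuning helps.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: default hyperparameters, no tuning.
baseline = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Tuned: the search only ever sees the training split.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 5, 7, 10], "min_samples_leaf": [1, 2, 5]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
tuned_acc = accuracy_score(y_test, search.best_estimator_.predict(X_test))
print(f"baseline: {baseline_acc:.3f}  tuned: {tuned_acc:.3f}")
```

On a small, easy dataset such as iris the two scores may be identical or differ only by noise; the test set has just 30 samples, so a single misclassification shifts accuracy by about 3 percentage points.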
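5. Bayesian optimization
Bayesian optimization fits a surrogate model to predict which regions of the parameter space look promising, so it can often find good parameters with fewer evaluations. The example below uses BayesSearchCV from skopt (scikit-optimize):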
from sklearn.ensemble import RandomForestRegressor
from skopt import BayesSearchCV
from skopt.space import Integer, Categorical
# Bayesian optimization over the random forest regressor's hyperparameters
param_space = {
    'n_estimators': Integer(50, 200),
    'max_features': Categorical(['sqrt', 'log2']),
    'max_depth': Integer(3, 10),
    'min_samples_split': Integer(2, 11),
    'min_samples_leaf': Integer(1, 11)
}
bayes_search = BayesSearchCV(estimator=RandomForestRegressor(), search_spaces=param_space, n_iter=32, cv=5, scoring='neg_mean_squared_error', random_state=42)
bayes_search.fit(X_reg_train, y_reg_train)
print(f"Best parameters with Bayesian Optimization: {bayes_search.best_params_}")
print(f"Best cross-validation MSE: {-bayes_search.best_score_}")
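With these methods you can tune hyperparameters systematically and find a well-performing combination that improves the model. In practice, choose the tuning method that suits the specific problem.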
From: https://blog.csdn.net/a6181816/article/details/139301795