Machine Learning: Hyperparameter Tuning

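Hyperparameter tuning is a key part of optimizing a machine learning model: it adjusts the model's hyperparameters to improve performance. This article explains the main tuning methods and gives code examples for each.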

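1. What is a hyperparameter?

A hyperparameter is a value set before training begins, as opposed to the parameters (such as weights) that the model learns during training. For decision trees and random forests, hyperparameters include the maximum tree depth, the split criterion, and the number of trees.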
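To make the distinction concrete, here is a minimal sketch that sets two hyperparameters on a decision tree before training and then reads back what the model learned during `fit`. The dataset and the specific values (`max_depth=3`, `criterion="entropy"`) are illustrative assumptions, not taken from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters: chosen by us before training starts (illustrative values).
clf = DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=0)
hyperparams = clf.get_params()
print("max_depth set before fit:", hyperparams["max_depth"])

# Learned parameters: the tree structure (split features and thresholds)
# only exists after fit() has seen the data.
clf.fit(X, y)
learned_depth = clf.get_depth()
node_count = clf.tree_.node_count
print("depth actually grown:", learned_depth)
print("number of nodes learned:", node_count)
```

The depth the tree actually grows can be smaller than `max_depth`; the hyperparameter only caps it.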

2. Tuning methods

2.1 Grid Search

Grid search exhaustively evaluates every combination of values in a specified parameter grid and keeps the best one.
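The cost of an exhaustive search is easy to underestimate, so the sketch below counts the candidates before fitting anything. It uses scikit-learn's `ParameterGrid` on the same decision-tree grid that appears in section 3.2; the `cv=5` fold count is likewise borrowed from that section.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

# ParameterGrid yields every combination (the Cartesian product) of the lists.
candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)   # 2 * 4 * 3 * 3
n_fits = n_candidates * 5        # each candidate is trained once per CV fold

print("candidates:", n_candidates)
print("model fits with cv=5:", n_fits)
print("first candidate:", candidates[0])
```

Adding one more value to any list multiplies the total, which is why grid search is usually reserved for small grids.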

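2.2 Random Search

Random search evaluates a fixed number of candidates sampled at random from the parameter space. It often finds a near-optimal combination with far less computation than an exhaustive grid.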
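As a rough illustration of the sampling step on its own, the sketch below draws candidates with scikit-learn's `ParameterSampler`. The distributions and the budget of 10 draws are assumptions chosen for the example.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_distributions = {
    "n_estimators": randint(50, 200),   # integers in [50, 199]; upper bound is exclusive
    "max_depth": randint(3, 10),        # integers in [3, 9]
    "max_features": ["sqrt", "log2"],   # a list is sampled uniformly
}

# Draw a fixed budget of 10 candidates, however large the space is.
samples = list(ParameterSampler(param_distributions, n_iter=10, random_state=42))

for params in samples[:3]:
    print(params)
```

The budget (`n_iter`) is set independently of the size of the space, which is the main practical difference from grid search.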

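2.3 Bayesian Optimization

Bayesian optimization uses the results of earlier evaluations to decide which parameters to try next, so it often needs fewer evaluations than grid search or random search.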
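The idea of "using earlier results to guide the search" can be sketched in a few lines. The toy loop below is not a production optimizer: it assumes a hypothetical one-dimensional `objective` standing in for a cross-validation loss, a Gaussian process as the surrogate model, and a lower-confidence-bound rule (with an arbitrary factor of 2.0) for picking the next point.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    # Hypothetical stand-in for a cross-validation loss; minimum at x = 2.
    return (x - 2.0) ** 2

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 5.0, 101)   # the 1-D search space

# Start from a few random evaluations.
tried = list(rng.choice(len(candidates), size=3, replace=False))
losses = [objective(candidates[i]) for i in tried]

for _ in range(10):
    # Surrogate model: a Gaussian process fitted to all results so far.
    gp = GaussianProcessRegressor(normalize_y=True, alpha=1e-6)
    gp.fit(candidates[tried].reshape(-1, 1), np.array(losses))
    mean, std = gp.predict(candidates.reshape(-1, 1), return_std=True)

    # Acquisition: the lower confidence bound trades off low predicted loss
    # (exploitation) against high uncertainty (exploration).
    lcb = mean - 2.0 * std
    lcb[tried] = np.inf                   # never re-evaluate a point
    next_idx = int(np.argmin(lcb))

    tried.append(next_idx)
    losses.append(objective(candidates[next_idx]))

best_x = candidates[tried[int(np.argmin(losses))]]
print("evaluations:", len(tried), "best x found:", best_x)
```

Each iteration spends one evaluation where the surrogate expects the most benefit, instead of sampling blindly.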

3. Methods and code examples

The code examples below use scikit-learn's GridSearchCV and RandomizedSearchCV for hyperparameter tuning.

3.1 Data preparation

First, prepare the datasets.

from sklearn.datasets import load_iris, load_diabetes
from sklearn.model_selection import train_test_split

# Classification dataset
iris = load_iris()
X_clf, y_clf = iris.data, iris.target
X_clf_train, X_clf_test, y_clf_train, y_clf_test = train_test_split(X_clf, y_clf, test_size=0.2, random_state=42)

# Regression dataset (load_diabetes ships with scikit-learn; load_boston is unavailable since version 1.2)
diabetes = load_diabetes()
X_reg, y_reg = diabetes.data, diabetes.target
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

3.2 Grid Search

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Grid search for the decision tree classifier
param_grid_clf = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}

grid_search_clf = GridSearchCV(DecisionTreeClassifier(), param_grid_clf, cv=5, scoring='accuracy')
grid_search_clf.fit(X_clf_train, y_clf_train)

print(f"Best parameters for DecisionTreeClassifier: {grid_search_clf.best_params_}")
print(f"Best cross-validation accuracy: {grid_search_clf.best_score_}")

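# Grid search for the decision tree regressor
# ('squared_error' and 'absolute_error' are the current names of 'mse' and 'mae')
param_grid_reg = {
    'criterion': ['squared_error', 'friedman_mse', 'absolute_error'],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}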

grid_search_reg = GridSearchCV(DecisionTreeRegressor(), param_grid_reg, cv=5, scoring='neg_mean_squared_error')
grid_search_reg.fit(X_reg_train, y_reg_train)

print(f"Best parameters for DecisionTreeRegressor: {grid_search_reg.best_params_}")
print(f"Best cross-validation MSE: {-grid_search_reg.best_score_}")
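Beyond `best_params_` and `best_score_`, a fitted search object keeps per-candidate scores in `cv_results_`, which helps judge whether the winner is clearly better or just marginally ahead. The sketch below is self-contained and uses a deliberately small, illustrative grid so the full ranking is easy to read.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A small grid (illustrative values) so every candidate can be printed.
small_grid = {"max_depth": [2, 3, 5], "criterion": ["gini", "entropy"]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), small_grid, cv=5, scoring="accuracy")
search.fit(X, y)

results = search.cv_results_
order = np.argsort(results["rank_test_score"])   # best-ranked candidate first

for i in order:
    print(f"rank {results['rank_test_score'][i]}: "
          f"mean={results['mean_test_score'][i]:.3f} "
          f"std={results['std_test_score'][i]:.3f} "
          f"params={results['params'][i]}")
```

If several candidates score within one standard deviation of each other, the simpler one (for example, the shallower tree) is often a reasonable choice.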

3.3 Random Search

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from scipy.stats import randint

# Random search for the random forest classifier
# (randint's upper bound is exclusive, e.g. randint(3, 10) samples 3..9)
param_dist_clf = {
    'n_estimators': randint(50, 200),
    'max_features': ['sqrt', 'log2', None],  # 'auto' is no longer accepted; for classifiers it meant 'sqrt'
    'max_depth': randint(3, 10),
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 11)
}

random_search_clf = RandomizedSearchCV(RandomForestClassifier(), param_distributions=param_dist_clf, n_iter=100, cv=5, scoring='accuracy', random_state=42)
random_search_clf.fit(X_clf_train, y_clf_train)

print(f"Best parameters for RandomForestClassifier: {random_search_clf.best_params_}")
print(f"Best cross-validation accuracy: {random_search_clf.best_score_}")

# Random search for the random forest regressor
param_dist_reg = {
    'n_estimators': randint(50, 200),
    'max_features': [1.0, 'sqrt', 'log2'],  # 'auto' is no longer accepted; for regressors it meant all features (1.0)
    'max_depth': randint(3, 10),
    'min_samples_split': randint(2, 11),
    'min_samples_leaf': randint(1, 11)
}

random_search_reg = RandomizedSearchCV(RandomForestRegressor(), param_distributions=param_dist_reg, n_iter=100, cv=5, scoring='neg_mean_squared_error', random_state=42)
random_search_reg.fit(X_reg_train, y_reg_train)

print(f"Best parameters for RandomForestRegressor: {random_search_reg.best_params_}")
print(f"Best cross-validation MSE: {-random_search_reg.best_score_}")
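Random search is particularly convenient for continuous hyperparameters, where a grid would have to pick a handful of fixed values. The sketch below samples `ccp_alpha` (the cost-complexity pruning strength) from a log-uniform distribution; the range `1e-4` to `1e-1`, the small budget, and the choice of this parameter are assumptions made for illustration.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

param_dist = {
    "n_estimators": randint(20, 60),
    # Continuous, log-scaled range: each order of magnitude is equally likely.
    "ccp_alpha": loguniform(1e-4, 1e-1),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=8,            # small budget to keep the sketch fast
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

best_alpha = search.best_params_["ccp_alpha"]
print("best params:", search.best_params_)
print("best CV accuracy:", search.best_score_)
```

A log-uniform distribution suits parameters whose useful values span several orders of magnitude; a plain uniform distribution would rarely sample the small end of the range.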

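4. Evaluating the tuned models

Take the model trained with the best parameters and evaluate it on the held-out test set.

from sklearn.metrics import accuracy_score, mean_squared_error

# Evaluate the best decision tree classifier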
best_clf = grid_search_clf.best_estimator_
y_clf_pred = best_clf.predict(X_clf_test)
print(f"Test set accuracy with best DecisionTreeClassifier: {accuracy_score(y_clf_test, y_clf_pred)}")

# Evaluate the best decision tree regressor
best_reg = grid_search_reg.best_estimator_
y_reg_pred = best_reg.predict(X_reg_test)
print(f"Test set MSE with best DecisionTreeRegressor: {mean_squared_error(y_reg_test, y_reg_pred)}")

# Evaluate the best random forest classifier
best_clf_rf = random_search_clf.best_estimator_
y_clf_rf_pred = best_clf_rf.predict(X_clf_test)
print(f"Test set accuracy with best RandomForestClassifier: {accuracy_score(y_clf_test, y_clf_rf_pred)}")

# Evaluate the best random forest regressor
best_reg_rf = random_search_reg.best_estimator_
y_reg_rf_pred = best_reg_rf.predict(X_reg_test)
print(f"Test set MSE with best RandomForestRegressor: {mean_squared_error(y_reg_test, y_reg_rf_pred)}")
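A test-set score is easier to interpret next to an untuned baseline. The self-contained sketch below compares a default decision tree with a tuned one on the same split, and also checks that calling `predict` on the search object is the same as calling it on `best_estimator_` (with the default `refit=True`, the best model is retrained on the whole training set). The small `max_depth` grid is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: default hyperparameters, trained on the same training split.
baseline = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Tuned model: the search only ever sees the training split.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": [2, 3, 5, None]}, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
tuned_acc = accuracy_score(y_test, search.predict(X_test))

# The search object delegates predict() to the refitted best estimator.
same_predictions = np.array_equal(search.predict(X_test),
                                  search.best_estimator_.predict(X_test))

print(f"baseline test accuracy: {baseline_acc:.3f}")
print(f"tuned test accuracy:    {tuned_acc:.3f}")
print("search.predict matches best_estimator_.predict:", same_predictions)
```

On a small, easy dataset the tuned model may not beat the baseline; the comparison is what shows whether tuning was worth the cost.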

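5. Bayesian Optimization

Bayesian optimization fits a model to the results of the candidates evaluated so far and uses it to predict which region of the parameter space is most promising, so it can reach good parameters in fewer evaluations. The following example uses skopt (the scikit-optimize package):

# Requires: pip install scikit-optimize
from sklearn.ensemble import RandomForestRegressor
from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical

# Bayesian optimization
# (skopt Integer bounds are inclusive on both ends, unlike scipy's randint)
param_space = {
    'n_estimators': Integer(50, 200),
    'max_features': Categorical(['sqrt', 'log2']),  # 'auto' is no longer accepted by scikit-learn
    'max_depth': Integer(3, 10),
    'min_samples_split': Integer(2, 11),
    'min_samples_leaf': Integer(1, 11)
}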

bayes_search = BayesSearchCV(estimator=RandomForestRegressor(), search_spaces=param_space, n_iter=32, cv=5, scoring='neg_mean_squared_error', random_state=42)
bayes_search.fit(X_reg_train, y_reg_train)

print(f"Best parameters with Bayesian Optimization: {bayes_search.best_params_}")
print(f"Best cross-validation MSE: {-bayes_search.best_score_}")

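With these methods you can tune hyperparameters systematically and find a well-performing parameter combination, improving the model's performance. In practice, choose the tuning method that fits the problem at hand.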

From: https://blog.csdn.net/a6181816/article/details/139301795