scikit-learn (sklearn) Basic Tutorial

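scikit-learn is a Python library for machine learning and data mining. It is built on NumPy and SciPy, is commonly used alongside matplotlib for plotting, and provides simple, efficient tools for data analysis and modeling.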

Installation

scikit-learn depends on NumPy and SciPy. pip installs both automatically if they are missing; matplotlib is only needed if you want to plot results.

Install scikit-learn with pip:

pip install scikit-learn
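
A quick way to check that the installation worked is to import the package and print its version. This is a minimal sketch; the variable name `installed_version` is only illustrative.

```python
import sklearn

# Importing the package and reading its version confirms the install is usable
installed_version = sklearn.__version__
print(f"scikit-learn version: {installed_version}")
```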

Loading Datasets

scikit-learn ships with several commonly used datasets that can be loaded directly:

from sklearn.datasets import load_iris
from sklearn.datasets import load_digits

# Load the iris dataset
iris = load_iris()
print(iris.data.shape)  # (150, 4)

# Load the handwritten digits dataset
digits = load_digits()
print(digits.data.shape)  # (1797, 64)
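
Besides `data`, the loaded object carries the labels and descriptive metadata. The sketch below inspects a few of those attributes on the iris dataset; the variable `class_counts` is an illustrative name, not part of the tutorial.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Names of the four measured features and of the three species
print(iris.feature_names)
print(iris.target_names)

# iris.target holds integer labels (0, 1, 2); count the samples in each class
class_counts = np.bincount(iris.target)
print(class_counts)
```

The iris dataset is balanced, so each of the three classes contributes the same number of samples.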

Data Preprocessing

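Data preprocessing is an important step in a machine learning workflow. scikit-learn provides a range of preprocessing tools, including standardization, normalization, and missing-value handling.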

Standardization

Rescale each feature to a mean of 0 and a standard deviation of 1:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(iris.data)
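
To see what the scaler actually did, it can help to look at the statistics it learned and to check the result. This is a small sketch that repeats the steps above so it runs on its own; `column_means` and `column_stds` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
scaler = StandardScaler()
scaled_data = scaler.fit_transform(iris.data)

# Per-feature mean and scale (standard deviation) learned during fit
print(scaler.mean_)
print(scaler.scale_)

# After scaling, every column should have mean ~0 and standard deviation ~1
column_means = scaled_data.mean(axis=0)
column_stds = scaled_data.std(axis=0)
print(np.allclose(column_means, 0.0), np.allclose(column_stds, 1.0))
```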

Normalization

Rescale each feature to the range [0, 1]:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(iris.data)
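
With its default settings, MinMaxScaler applies (x - min) / (max - min) to each column. The sketch below reproduces that by hand with NumPy and compares the two results; `col_min`, `col_max`, `manual`, and `matches` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
normalized_data = MinMaxScaler().fit_transform(iris.data)

# The same transform by hand: (x - min) / (max - min), column by column
col_min = iris.data.min(axis=0)
col_max = iris.data.max(axis=0)
manual = (iris.data - col_min) / (col_max - col_min)

matches = np.allclose(normalized_data, manual)
print(matches)
print(normalized_data.min(axis=0), normalized_data.max(axis=0))
```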

Model Training and Prediction

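scikit-learn implements many machine learning algorithms, such as linear regression, decision trees, and support vector machines. The following example shows how to train a model and use it to make predictions.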

Training the Model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Split the dataset into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Create a logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

Making Predictions

# Use the trained model to predict labels for the test set
predictions = model.predict(X_test)
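
`predict` returns integer class labels. A classifier such as LogisticRegression can also report per-class probabilities through `predict_proba`. The sketch below shows both; it sets `max_iter=1000`, which is an assumption added here to give the solver room to converge and is not part of the tutorial's code.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# max_iter=1000 is an assumption for this sketch, not the tutorial's setting
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# One row per test sample, one column per class; each row sums to 1
probabilities = model.predict_proba(X_test)
print(probabilities.shape)

# Map the predicted integer label of the first test sample back to a species name
first_label = model.predict(X_test[:1])[0]
print(iris.target_names[first_label])
```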

Model Evaluation

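Evaluating a model is how you confirm that it actually works. scikit-learn provides many evaluation metrics, including accuracy, precision, recall, and F1 score.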

from sklearn.metrics import accuracy_score, classification_report

# Compute accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")

# Print the classification report (per-class precision, recall, and F1 score)
report = classification_report(y_test, predictions)
print(report)
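
A confusion matrix complements the metrics above by showing which classes get mistaken for which. This is a minimal sketch that retrains the same kind of model so it runs on its own; `max_iter=1000` and the name `matrix` are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# max_iter=1000 is an assumption for this sketch
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predictions = model.predict(X_test)

# Rows are true classes, columns are predicted classes;
# off-diagonal entries are misclassifications
matrix = confusion_matrix(y_test, predictions)
print(matrix)
```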

Hyperparameter Tuning

Hyperparameter tuning is an important step in improving model performance. scikit-learn provides grid search (GridSearchCV) and randomized search (RandomizedSearchCV) to automate it.

from sklearn.model_selection import GridSearchCV

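# Define the parameter grid
# ('liblinear' is left out: recent scikit-learn releases deprecate it for multiclass problems such as iris)
param_grid = {
    'C': [0.1, 1, 10, 100],
    'solver': ['lbfgs', 'newton-cg']
}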

# Tune hyperparameters with grid search and 5-fold cross-validation
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
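
The text mentions RandomizedSearchCV but only shows grid search. Here is a hedged sketch of the randomized variant: instead of trying every grid point, it samples a fixed number of candidates from a distribution. The log-uniform range for `C`, `n_iter=10`, and `max_iter=1000` are all assumptions chosen for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Sample C on a log scale between 0.01 and 100 (range is an assumption)
param_distributions = {"C": loguniform(1e-2, 1e2)}

random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=10,        # number of sampled candidates
    cv=5,             # 5-fold cross-validation, as in the grid search above
    random_state=0,   # makes the sampled candidates reproducible
)
random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_}")
```

Randomized search tends to pay off when the grid would be large, since its cost is set by `n_iter` rather than by the number of grid combinations.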

Example: Iris Classification

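The following is a complete iris classification example that covers the full workflow: data loading, preprocessing, model training, evaluation, and hyperparameter tuning.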

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load the data
iris = load_iris()
X = iris.data
y = iris.target

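# Split first so the test set stays unseen during preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Preprocessing: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)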

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")

report = classification_report(y_test, predictions)
print(report)

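# Hyperparameter tuning
# ('liblinear' is left out: recent scikit-learn releases deprecate it for multiclass problems such as iris)
param_grid = {
    'C': [0.1, 1, 10, 100],
    'solver': ['lbfgs', 'newton-cg']
}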

grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
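
One possible refinement of the workflow above is to wrap the scaler and the classifier in a Pipeline, so that the scaler is refit on the training portion of every cross-validation fold during the search. This is a sketch under stated assumptions, not the tutorial's own method: the step names `"scaler"` and `"clf"`, the `max_iter=1000` setting, and the variable `test_accuracy` are all illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Step names "scaler" and "clf" are arbitrary labels chosen for this sketch
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"clf__C": [0.1, 1, 10, 100]}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")

# score() uses the best pipeline refit on the full training set
test_accuracy = grid_search.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy}")
```

Because the fitted search object behaves like an estimator, `grid_search.predict(X_test)` can be used directly once tuning is done.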

Conclusion

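scikit-learn offers a rich set of features covering the whole machine learning workflow, from data preprocessing and model training to evaluation and hyperparameter tuning. Hopefully this tutorial helps you get started with scikit-learn quickly.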

Source: https://blog.csdn.net/2401_85342379/article/details/139720182
