Alibaba Cloud Tianchi: Data Mining for Beginners - Used Car Price Prediction

Tags: loss, beginner, nn, self, torch, train, data mining, Tianchi, data

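The competition data is available on the official site:

Data Mining for Beginners - Used Car Price Prediction, Learning Competition, Tianchi Competitions, Alibaba Cloud Tianchi (aliyun.com)

Since this is my first encounter with deep learning with neural networks, I walk through the tutorial code cell by cell here.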

1.1 Import the Python libraries for data processing, model training, data loading, and visualization

%matplotlib inline
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader
import pandas as pd
import zipfile  
import re
import numpy as np
import torch
from torch import nn
from matplotlib_inline import backend_inline
import matplotlib.pyplot as plt  
import matplotlib.image as mpimg  
from IPython import display

The libraries, grouped by purpose:

# Enable inline matplotlib display in Jupyter Notebook: %matplotlib inline

# Dataset splitting: from sklearn.model_selection import train_test_split

# PyTorch data loading utilities: from torch.utils.data import TensorDataset, DataLoader

# Data processing:
import pandas as pd
import zipfile
import re
import numpy as np

# PyTorch core:
import torch
import torch.nn as nn

# Plotting:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# IPython, for displaying images or HTML in Jupyter Notebook:
from IPython import display

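1.2 Set the random seeds

Fixing the random seeds before training makes the experimental results reproducible. Seeding only the CPU or GPU generator may not be enough, because other sources of randomness in PyTorch, such as shuffled data loading, can also change the results.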

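import torch

# Make PyTorch's random number generator produce the same sequence on every run
torch.manual_seed(42)

# If a GPU is used, seed the GPU generators as well
if torch.cuda.is_available():
    torch.cuda.manual_seed(42)
    torch.cuda.manual_seed_all(42)  # covers the case of multiple GPUs

# Turn off cuDNN autotuning and request deterministic cuDNN algorithms
# (these flags affect GPU kernels, not the order in which data is loaded)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# If numpy is used, seed numpy too
import numpy as np
np.random.seed(42)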
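The data-loading randomness mentioned above can be pinned down separately. The following is a minimal sketch, assuming a shuffled DataLoader is the source of randomness; seed_worker, make_loader, and the toy dataset are hypothetical names introduced only for illustration.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id):
    # Derive a per-worker seed from the seed PyTorch assigned to this worker
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_loader(seed):
    # A dedicated generator controls the shuffling order of this loader
    g = torch.Generator()
    g.manual_seed(seed)
    ds = TensorDataset(torch.arange(10, dtype=torch.float32).unsqueeze(1))
    return DataLoader(ds, batch_size=5, shuffle=True,
                      generator=g, worker_init_fn=seed_worker)


# Two loaders built with the same seed yield the same batch order
first = [batch[0].flatten().tolist() for batch in make_loader(42)]
second = [batch[0].flatten().tolist() for batch in make_loader(42)]
```

Passing the same generator and worker_init_fn arguments to the DataLoader calls in 1.10 should give the same effect there.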

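1.3 Unzip the data archives

Suppose you have a ZIP file named example.zip and want to extract its contents into an unzipped_files folder under the current working directory. The unzip_file function below can be used as follows.

The code has three steps: define the function, set its arguments, and call it.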

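# Define the unzip function (if it is not defined yet)
def unzip_file(zip_filepath, dest_path):
    with zipfile.ZipFile(zip_filepath, 'r') as zip_ref:
        zip_ref.extractall(dest_path)


# Set the arguments
zip_filepath = 'example.zip'  # path to the ZIP file
dest_path = 'unzipped_files'  # destination folder for the extracted files

# Call the function (example.zip is a placeholder; this raises FileNotFoundError if it does not exist)
unzip_file(zip_filepath, dest_path)

Then unzip the competition archives:

# Unzip the .zip archives
unzip_file('used_car_train_20200313.zip','./')
unzip_file('used_car_testB_20200421.zip','./')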
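To try unzip_file without the competition archives, the sketch below builds a tiny archive in a temporary directory and extracts it again. The file names demo.zip and demo.csv and the row inside are made up for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_file(zip_filepath, dest_path):
    with zipfile.ZipFile(zip_filepath, 'r') as zip_ref:
        zip_ref.extractall(dest_path)


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive = tmp / 'demo.zip'  # hypothetical archive name
    # Create a one-file archive to extract
    with zipfile.ZipFile(archive, 'w') as zf:
        zf.writestr('demo.csv', 'SaleID price\n0 1850\n')
    # Listing the members first shows what extractall will write
    with zipfile.ZipFile(archive, 'r') as zf:
        members = zf.namelist()
    unzip_file(archive, tmp / 'out')
    extracted_text = (tmp / 'out' / 'demo.csv').read_text()
```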

1.4 Read the data

Use pandas' read_csv to read the two space-separated CSV files, then save them again under new file names.

test_data = pd.read_csv('used_car_testB_20200421.csv', sep=' ')
train_data = pd.read_csv('used_car_train_20200313.csv', sep=' ')
test_data.to_csv('used_car_testB.csv')
train_data.to_csv('used_car_train.csv')
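Two details of this step are easy to miss: sep=' ' is needed because the source files are space-separated, and to_csv writes the row index as an extra first column unless index=False is passed. A small sketch with made-up rows, using in-memory buffers instead of files:

```python
import io

import pandas as pd

raw = "SaleID name price\n0 736 1850\n1 2262 3600\n"  # made-up rows
df = pd.read_csv(io.StringIO(raw), sep=' ')

with_index = io.StringIO()
df.to_csv(with_index)  # default: the row index becomes an unnamed first column
without_index = io.StringIO()
df.to_csv(without_index, index=False)  # only the data columns are written

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
```

If the re-saved files are read back later, index=False avoids an extra unnamed column.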

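1.5 Concatenate the data

Concatenate the train_data and test_data DataFrames along the default axis of pd.concat (axis=0, so the rows of test_data are appended below those of train_data).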

data = pd.concat([train_data, test_data])
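One side effect of row-wise concatenation is worth seeing on a toy example: each frame keeps its own index labels, so labels repeat, and columns missing from one frame are filled with NaN. The frames and values below are made up; ignore_index=True is one way to get a fresh index.

```python
import pandas as pd

train_toy = pd.DataFrame({'SaleID': [0, 1], 'price': [1850.0, 3600.0]})
test_toy = pd.DataFrame({'SaleID': [150000, 150001]})  # no price column

# Default behaviour: index labels of both frames are kept as they are
stacked = pd.concat([train_toy, test_toy])
# ignore_index=True renumbers the rows 0..n-1
stacked_clean = pd.concat([train_toy, test_toy], ignore_index=True)
```

The missing price values also give a simple way to tell test rows from training rows later.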

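1.6 Value replacement and type conversion

# Replace the '-' placeholder with NaN (assuming notRepairedDamage is the column that contains it)
data['notRepairedDamage'] = data['notRepairedDamage'].replace('-', np.nan)

# Convert the dtype of the notRepairedDamage column
data['notRepairedDamage'] = pd.to_numeric(data['notRepairedDamage'], errors='coerce').astype('float32')
# Note: pd.to_numeric with errors='coerce' turns values that cannot be converted into NaN

# Cap the values of the power column at 600
data.loc[data['power'] > 600, 'power'] = 600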
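The effect of the two operations can be checked on a toy frame. This sketch uses made-up values and Series.clip as an equivalent way to cap power at 600.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'notRepairedDamage': ['0.0', '-', '1.0'],
                    'power': [75, 1200, 600]})

# '-' cannot be parsed as a number, so errors='coerce' turns it into NaN
toy['notRepairedDamage'] = pd.to_numeric(
    toy['notRepairedDamage'], errors='coerce').astype('float32')

# clip(upper=600) caps values above 600 and leaves the rest unchanged
toy['power'] = toy['power'].clip(upper=600)
```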

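1.7 Data preprocessing

`1.7.1` cate_cols

The cate_cols list holds the vehicle brand (brand), model (model), body type (bodyType), fuel type (fuelType), gearbox type (gearbox), seller type (seller), and whether there is unrepaired damage (notRepairedDamage). These features are text or category labels rather than quantities, so most machine learning algorithms cannot use them directly; they must first be encoded as numbers. Common encodings include One-Hot Encoding, Label Encoding, and Target Encoding.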
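To make the difference between two of these encodings concrete, here is a small sketch on a made-up fuelType column. pd.factorize stands in for label encoding; it assigns integer codes in order of first appearance.

```python
import pandas as pd

fuel = pd.Series(['gas', 'diesel', 'gas', 'electric'], name='fuelType')  # made-up values

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(fuel, prefix='fuelType', dtype=int)

# Label encoding: one integer code per category
codes, categories = pd.factorize(fuel)
```

One-hot encoding does not impose an order on the categories, at the cost of one column per category; label encoding stays compact but introduces an arbitrary numeric order.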

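`1.7.2` num_cols

The num_cols list holds the registration date (regDate), the creation date (creatDate; the spelling looks like a typo for createDate, but the column name has to be used exactly as it appears in the data), engine power (power), mileage (kilometer), and a series of features whose names start with v_ (possibly performance indicators or scores of the vehicle). These features are numeric and can be fed to most machine learning algorithms directly. The date features (regDate and creatDate), however, may need further processing, such as conversion to the number of days or months since a fixed date, to better reflect their effect on the target variable.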
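The two lists are described above but not written out in code. A possible definition, reconstructed from the description, is sketched below; the range v_0 to v_14 and the YYYYMMDD integer date format are assumptions, and car_age_days is a hypothetical feature name used to illustrate the date conversion.

```python
import pandas as pd

cate_cols = ['model', 'brand', 'bodyType', 'fuelType',
             'gearbox', 'seller', 'notRepairedDamage']
# Assumption: the v_ features are named v_0 ... v_14
num_cols = ['regDate', 'creatDate', 'power', 'kilometer'] + [f'v_{i}' for i in range(15)]

# Assumption: dates are stored as integers such as 20040402 (YYYYMMDD)
toy = pd.DataFrame({'regDate': [20040402, 20120103],
                    'creatDate': [20160404, 20160309]})
reg = pd.to_datetime(toy['regDate'].astype(str), format='%Y%m%d', errors='coerce')
created = pd.to_datetime(toy['creatDate'].astype(str), format='%Y%m%d', errors='coerce')

# Days between registration and listing, as one example of a derived date feature
toy['car_age_days'] = (created - reg).dt.days
```

With errors='coerce', dates that do not parse become NaT instead of raising an error.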

1.8 Define the One-Hot encoding function

The function below encodes the columns one at a time in a loop; a more efficient approach is to let pd.get_dummies handle several columns in a single call.

import pandas as pd

def oneHotEncode(df, colNames):
    # Check that the input is a DataFrame
    if not isinstance(df, pd.DataFrame):
        raise ValueError("df must be a pandas DataFrame")

    # Check that colNames is a list and that every name is a column of the DataFrame
    if not isinstance(colNames, list) or not all(col in df.columns for col in colNames):
        raise ValueError("colNames must be a list of column names that exist in df")

    # Drop the original columns, then append the dummy columns of each one with pd.concat
    result = df.drop(colNames, axis=1)
    for col in colNames:
        # dtype='float32' keeps the dummies numeric (recent pandas versions default to bool)
        dummies = pd.get_dummies(df[col], prefix=col, dtype='float32')
        result = pd.concat([result, dummies], axis=1)

    return result

# Example usage
# Assume df is a DataFrame with categorical columns such as 'model' and 'brand'
# new_df = oneHotEncode(df, ['model', 'brand'])
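The single-call alternative mentioned above might look like the following sketch. It relies on the columns parameter of pd.get_dummies, which drops the listed columns and appends their dummy columns; the toy frame is made up.

```python
import pandas as pd

toy = pd.DataFrame({'brand': [1, 2, 1],
                    'gearbox': [0.0, 1.0, 0.0],
                    'power': [75, 110, 90]})

# Encode both categorical columns in one call; other columns are kept as they are
encoded = pd.get_dummies(toy, columns=['brand', 'gearbox'], dtype='float32')
```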

1.9 Process the data

Handle the discrete data (categorical features), handle the continuous data (numeric features), and finally drop columns that are probably irrelevant.

# Discrete data
for col in cate_cols:
    data[col] = data[col].fillna('-1')
data = oneHotEncode(data, cate_cols)

# Continuous data
for col in num_cols:
    data[col] = data[col].fillna(0)
    data[col] = (data[col]-data[col].min()) / (data[col].max()-data[col].min())

# (Possibly) irrelevant columns
data.drop(['name', 'regionCode'], axis=1, inplace=True)

data.columns

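For the discrete data (categorical features), every missing value (NaN) is filled with the string '-1', and the custom oneHotEncode function then one-hot encodes the features. Filling with '-1' makes "missing" a category of its own, which is a workable choice: sometimes the fact that a value is missing is itself informative.

For the continuous data (numeric features), every missing value is filled with 0 and each feature is rescaled to the [0, 1] interval, a method known as Min-Max Scaling. Filling with 0 is not always appropriate, especially when 0 has a real meaning for the feature. Min-max scaling helps some machine learning algorithms (distance-based ones, for example), but it is sensitive to extreme values: new data outside the original range falls outside [0, 1].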
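The scaling formula is x' = (x - min) / (max - min). A short sketch with made-up power values shows both the rescaling and the sensitivity to values outside the original range:

```python
import pandas as pd

power = pd.Series([0.0, 150.0, 600.0], name='power')  # made-up values
lo, hi = power.min(), power.max()

# Min-max scaling maps the observed minimum to 0 and the observed maximum to 1
scaled = (power - lo) / (hi - lo)

# A new value above the observed maximum lands outside [0, 1]
new_value = 900.0
new_scaled = (new_value - lo) / (hi - lo)
```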

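The name and regionCode columns are dropped on the assumption that they contribute little to the model. Dropping such columns can reduce noise and redundancy in the dataset.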
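The following steps use X, y, X_result, and X_id, which the code shown so far does not define. A minimal sketch of how they might be derived is given below. It assumes that test rows are the ones whose price is missing after concatenation and that SaleID is the identifier column; split_features and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def split_features(data, target='price', id_col='SaleID'):
    """Hypothetical helper: separate labelled rows from rows to predict."""
    data = data.reset_index(drop=True)
    is_test = data[target].isna()
    feature_cols = [c for c in data.columns if c not in (target, id_col)]
    X = data.loc[~is_test, feature_cols].to_numpy(dtype=np.float32)
    # Keep y as a column of shape (n, 1) so it matches the network output
    y = data.loc[~is_test, [target]].to_numpy(dtype=np.float32)
    X_result = data.loc[is_test, feature_cols].to_numpy(dtype=np.float32)
    X_id = data.loc[is_test, id_col]
    return X, y, X_result, X_id


toy = pd.DataFrame({'SaleID': [0, 1, 2, 3],
                    'power': [0.1, 0.5, 0.3, 0.9],
                    'price': [1000.0, 2500.0, np.nan, np.nan]})
X, y, X_result, X_id = split_features(toy)
```

Keeping y two-dimensional matters later: subtracting a label vector of shape (n,) from predictions of shape (n, 1) broadcasts to an (n, n) matrix and gives a wrong loss.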

1.10 Packaging and batch loading

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=512) 
X_train, X_test, y_train, y_test=torch.Tensor(X_train), torch.Tensor(X_test), torch.Tensor(y_train), torch.Tensor(y_test)
# TensorDataset is the PyTorch class that packs samples and labels into a single dataset
train_dataset = TensorDataset(X_train, y_train)
test_dataset = TensorDataset(X_test, y_test)
# DataLoader is an iterable that loads a TensorDataset in batches
train_iter = DataLoader(train_dataset, batch_size=512, shuffle=True,num_workers=3)
test_iter = DataLoader(test_dataset, batch_size=512, shuffle=False,num_workers=3)
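What a DataLoader yields is easiest to see on random toy tensors. The sketch below uses 10 samples and a batch size of 4, so the last batch is smaller than the others:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

features = torch.randn(10, 4)  # 10 toy samples with 4 features each
labels = torch.randn(10, 1)    # one target value per sample

loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=False)

# Each iteration yields one (features, labels) pair of tensors
batch_shapes = [(tuple(xb.shape), tuple(yb.shape)) for xb, yb in loader]
```

len(loader) is the number of batches, not the number of samples.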

1.11 Check the shapes of `X_train` and `y_train`

X_train.shape,y_train.shape

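1.12 Build the neural network model

The model stacks several nn.Linear layers that apply linear transformations to the input, nn.BatchNorm1d layers that normalize the features of each mini-batch, and nn.ReLU activations that add non-linearity. The size of the first layer, 334, has to match the number of feature columns reported by X_train.shape. A weight initialization function init_weights is also defined; it initializes the weights of the linear layers with Xavier uniform initialization (also known as Glorot initialization).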

net = nn.Sequential(
            nn.BatchNorm1d(334),
            nn.Linear(334, 568),
            nn.BatchNorm1d(568),
            nn.ReLU(),
            nn.Linear(568, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Linear(256,256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Linear(256,256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Linear(256,128),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Linear(128,1))

def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)

net.apply(init_weights);
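Because the input width 334 depends on how many columns the one-hot encoding produces, hard-coding it is fragile. One way around this, sketched here with a hypothetical build_mlp helper, is to read the width from the data; the hidden sizes repeat the ones used above.

```python
import torch
from torch import nn


def build_mlp(in_features, hidden=(568, 256, 256, 256, 256, 128)):
    """Hypothetical helper: same layer pattern as net, with a configurable input width."""
    layers = [nn.BatchNorm1d(in_features)]
    prev = in_features
    for width in hidden:
        layers += [nn.Linear(prev, width), nn.BatchNorm1d(width), nn.ReLU()]
        prev = width
    layers.append(nn.Linear(prev, 1))  # single output: the predicted price
    return nn.Sequential(*layers)


def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)


toy_X = torch.randn(8, 20)           # 8 toy samples with 20 features
toy_net = build_mlp(toy_X.shape[1])  # input width taken from the data
toy_net.apply(init_weights)
toy_out = toy_net(toy_X)
```

With the real data, build_mlp(X_train.shape[1]) would replace the hard-coded 334.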

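1.13 Use the GPU

The function below tries to get the GPU with index i. If that GPU exists, it returns the corresponding torch.device object; if it does not (for example, i exceeds the number of GPUs in the system), the function falls back to the CPU and returns the CPU torch.device object.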

def try_gpu(i=0):
    if torch.cuda.device_count() >= i + 1:
        return torch.device(f'cuda:{i}')
    return torch.device('cpu')

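1.14 Set up figure display

Set the figure display backend to the SVG format. In Jupyter Notebook or similar environments, SVG output keeps figures sharp at any zoom level, which is useful in data visualization and machine learning projects.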

def use_svg_display():
    backend_inline.set_matplotlib_formats('svg')

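1.15 Accumulate into a fixed-length array

__init__ creates an array of n zeros, add adds the given values to the corresponding elements, reset sets every element back to 0, and the special method __getitem__ allows the elements to be read by index.

class Accumulator:
    """Accumulate sums over n variables"""
    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def reset(self):
        # Reset every element of self.data to 0.0
        self.data = [0.0] * len(self.data)

    def __getitem__(self, idx):
        # __getitem__ is the special method that makes instances support indexing:
        # for an Accumulator instance acc, acc[i] returns the i-th element of self.data
        return self.data[idx]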
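A short usage sketch, with made-up numbers, shows how the class keeps two running totals at once, here an error sum and a sample count:

```python
class Accumulator:
    """Accumulate sums over n variables"""
    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def reset(self):
        self.data = [0.0] * len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


acc = Accumulator(2)  # slot 0: error sum, slot 1: number of samples
acc.add(3.0, 10)      # first batch: total error 3.0 over 10 samples
acc.add(1.5, 5)       # second batch: total error 1.5 over 5 samples
mean_error = acc[0] / acc[1]

acc.reset()           # both slots are back to 0.0
```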

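1.16 Evaluate the model by the mean absolute difference between its outputs and the true labels

def evaluate_accuracy(net, device,loss,data_iter):
    net.eval()
    metric = Accumulator(2)  # sum of absolute errors, number of predictions
    with torch.no_grad():
        for X, y in data_iter:
            X=X.to(device)
            y=y.to(device)
            # y.numel() is the number of predictions in this batch
            metric.add(abs(net(X)-y).sum().item(), y.numel())
            # each batch's absolute-error sum and prediction count are added to the accumulator
    return metric[0] / metric[1]  # mean absolute error (lower is better), not a classification accuracy
# .item() converts a single-element tensor to a Python scalar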
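The value returned here is the mean absolute error, MAE = (1/n) * sum(|y_hat_i - y_i|). The sketch below checks the computation on three made-up predictions and also shows a shape pitfall: if the labels have shape (n,) while the predictions have shape (n, 1), the subtraction broadcasts to an (n, n) matrix.

```python
import torch

y_hat = torch.tensor([[1.0], [2.0], [4.0]])  # predictions, shape (3, 1)
y_col = torch.tensor([[1.5], [2.0], [3.0]])  # labels, shape (3, 1)
y_flat = y_col.squeeze(1)                    # the same labels, shape (3,)

# Matching shapes: element-wise differences, then the mean
mae_ok = (y_hat - y_col).abs().sum().item() / y_col.numel()

# Mismatched shapes: broadcasting produces every pairwise difference
bad_diff = y_hat - y_flat
```

Checking y.shape against net(X).shape once before training is a cheap way to catch this.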

1.17 Train the model

# Accumulator is the class defined in 1.15

def train_epoch_ch3(net, device, train_iter, loss, updater):
    """Train the model for one epoch"""
    net.train()  # put the model in training mode
    metric = Accumulator(2)  # sum of per-sample losses, number of samples
    for X, y in train_iter:
        X, y = X.to(device), y.to(device)
        y_hat = net(X)
        l = loss(y_hat, y)  # compute the (batch-mean) loss

        # Backpropagation and parameter update
        updater.zero_grad()
        l.backward()
        updater.step()

        # Accumulate the loss, weighted by batch size so a smaller last batch is not over-counted
        metric.add(l.item() * y.numel(), y.numel())

    # Return the average loss per sample
    return metric[0] / metric[1]

# Note: for a classification problem (not this regression task), accuracy could be computed like this
# correct = 0
# total = 0
# for X, y in train_iter:
#     X, y = X.to(device), y.to(device)
#     y_hat = net(X)
#     _, predicted = torch.max(y_hat, 1)
#     correct += (predicted == y).sum().item()
#     total += y.size(0)
# accuracy = correct / total
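The zero_grad, backward, step cycle can be tried end to end on synthetic data. This is a self-contained sketch, not the competition setup: the data come from a made-up linear rule, the model is a single nn.Linear layer, and run_epoch is a hypothetical stand-in for train_epoch_ch3.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X = torch.randn(256, 3)
true_w = torch.tensor([[2.0], [-1.0], [0.5]])  # made-up ground-truth weights
y = X @ true_w + 0.01 * torch.randn(256, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


def run_epoch(model, loader, loss_fn, optimizer):
    model.train()
    total, count = 0.0, 0
    for xb, yb in loader:
        optimizer.zero_grad()                 # clear gradients from the previous step
        batch_loss = loss_fn(model(xb), yb)
        batch_loss.backward()                 # compute gradients
        optimizer.step()                      # update the parameters
        total += batch_loss.item() * yb.numel()
        count += yb.numel()
    return total / count                      # average loss per sample


history = [run_epoch(model, loader, loss_fn, optimizer) for _ in range(5)]
```

The per-epoch losses in history should shrink as the weights approach true_w.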

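1.18 Configure a Matplotlib plot

set_axes is a helper that sets the axis labels, ranges, scales, legend, and grid lines of a Matplotlib plot. It takes a Matplotlib Axes object (usually returned by plt.subplots() or a similar function) together with the parameters that customize the plot's appearance.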

def set_axes(axes,xlabel,ylabel,xlim,ylim,xscale,yscale,legend):
    axes.set_xlabel(xlabel)
    axes.set_ylabel(ylabel)
    axes.set_xscale(xscale)
    axes.set_yscale(yscale)
    axes.set_xlim(xlim)
    axes.set_ylim(ylim)
    if legend:
        axes.legend(legend)
    axes.grid()

1.19 Plot the data points

class Animator:
    """Plot data incrementally as an animation"""
    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
                 ylim=None, xscale='linear', yscale='linear',
                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
                 figsize=(7, 5)):
        # Plot multiple lines incrementally
        if legend is None:
            legend = []
        use_svg_display()
        self.fig, self.axes = plt.subplots(nrows, ncols, figsize=figsize)
        if nrows * ncols == 1:
            self.axes = [self.axes, ]
        # Use a lambda to capture the arguments
        self.config_axes = lambda: set_axes(
            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
        self.X, self.Y, self.fmts = None, None, fmts

    def add(self, x, y):
        # Add multiple data points to the plot
        if not hasattr(y, "__len__"):
            y = [y]
        n = len(y)
        if not hasattr(x, "__len__"):
            x = [x] * n
        if not self.X:
            self.X = [[] for _ in range(n)]
        if not self.Y:
            self.Y = [[] for _ in range(n)]
        for i, (a, b) in enumerate(zip(x, y)):
            if a is not None and b is not None:
                self.X[i].append(a)
                self.Y[i].append(b)
        self.axes[0].cla()
        for x, y, fmt in zip(self.X, self.Y, self.fmts):
            self.axes[0].plot(x, y, fmt)
        self.config_axes()
        display.display(self.fig)
        display.clear_output(wait=True)

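1.19.1 __init__:

The __init__ method creates the figure and its subplots (if there are several) and initializes the internal variables (such as X and Y) that store the data points to be plotted. It also defines the config_axes lambda, which re-applies the axis settings after new data points have been added.

1.19.2 add:

Adds new data points to the plot. A scalar y is wrapped in a one-element list, and a scalar x is repeated so that there is one x per value in y. The new points are appended to the internally stored X and Y lists. The current subplot is then cleared, all lines are redrawn from the stored points, and the axes are reconfigured (by calling the config_axes lambda). Finally, the figure is redisplayed through Jupyter Notebook's display system, which produces the animation effect.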

1.20 Train the neural network model

def train_ch3(net, device, train_iter, test_iter, loss, num_epochs, updater):
    net.to(device)
    # Animator (defined in 1.19) plots the loss values as they arrive
    animator = Animator(xlabel='epoch', xlim=[1, num_epochs],
                        legend=['train loss', 'test loss'])

    for epoch in range(num_epochs):
        train_loss = train_epoch_ch3(net, device, train_iter, loss, updater)
        # evaluate_loss (defined below) computes the loss on the test split
        test_loss = evaluate_loss(net, device, loss, test_iter)
        # Add the training and test losses to the animator
        animator.add(epoch + 1, (train_loss, test_loss))

    # Return the training and test losses of the last epoch (adjust as needed)
    return train_loss, test_loss

# Note: train_epoch_ch3 and evaluate_loss must both be defined before train_ch3 is called,
# and must accept the arguments that train_ch3 passes to them.

# One possible evaluate_loss (adapt it to your setup):
def evaluate_loss(net, device, loss, data_iter):
    net.eval()  # evaluation mode, so BatchNorm uses its running statistics
    total_loss = 0.0
    num_samples = 0
    with torch.no_grad():
        for X, y in data_iter:
            X, y = X.to(device), y.to(device)
            l = loss(net(X), y)
            total_loss += l.item() * y.size(0)  # batch-mean loss times batch size
            num_samples += y.size(0)
    return total_loss / num_samples  # average loss per sample

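1.21 Prediction

predict_ch3 runs the trained network net on the input data X_result and returns the predictions y_hat. Its definition is not shown here, and the call below is only meaningful once the training in 1.22 has been run.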

pred=predict_ch3(net,X_result)
pred=pred.to('cpu')
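Since predict_ch3 is called but never defined in the code shown, here is a minimal sketch of what it might look like, based only on the description above. The device argument, the conversion with torch.as_tensor, and the toy network are assumptions made for illustration.

```python
import torch
from torch import nn


def predict_ch3(net, X_result, device=None):
    """Hypothetical helper: run net on X_result in eval mode, without gradients."""
    if device is None:
        # Assumption: predict on whatever device the model parameters live on
        device = next(net.parameters()).device
    net.eval()  # BatchNorm uses its running statistics instead of batch statistics
    with torch.no_grad():
        X_result = torch.as_tensor(X_result, dtype=torch.float32).to(device)
        return net(X_result)


# Toy check with a small untrained network
toy_net = nn.Sequential(nn.BatchNorm1d(4), nn.Linear(4, 1))
toy_pred = predict_ch3(toy_net, torch.randn(5, 4))
```

Because the forward pass runs under torch.no_grad(), the returned tensor carries no gradient history.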

1.22 Training

With the learning rate (lr), number of epochs (num_epochs), loss function (loss), optimizer (trainer), and device (device) set, call train_ch3 to train the network net.

# Training
lr, num_epochs =  0.01, 150
loss = nn.MSELoss()
trainer = torch.optim.Adam(net.parameters(), lr=lr)
device=try_gpu()

train_loss,test_loss=train_ch3(net, device,train_iter, test_iter, loss, num_epochs, trainer)

1.23 Move the results

Call predict_ch3 to obtain the predictions pred, then move them from whatever device they are on (such as the GPU) to the CPU.

pred=predict_ch3(net,X_result)
pred=pred.to('cpu')

1.24 Convert the tensor to a NumPy array

pred=pred.detach().numpy() # detach from the autograd graph, then convert the tensor to a NumPy array

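1.25 Build the result DataFrame

Convert pred (now a NumPy array) to a pandas DataFrame with a single column named 'price'. Then reset the index of X_id, discarding the old index, so that its rows line up with those of the predictions.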

res=pd.DataFrame(pred, columns=['price']) 
X_id=X_id.reset_index(drop=True)

1.26 Save the output

Merge the IDs and the predicted prices into a single DataFrame and save it as a CSV file.

submission = pd.concat([X_id, res['price']], axis=1)
submission.to_csv('submission.csv',index=False)
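The index reset in 1.25 matters because pd.concat with axis=1 aligns rows by index label, not by position. A sketch with made-up IDs and prices shows what happens with and without it:

```python
import numpy as np
import pandas as pd

# Made-up IDs that kept index labels 7..9 from an earlier filtering step
X_id = pd.Series([150000, 150001, 150002], index=[7, 8, 9], name='SaleID')
res = pd.DataFrame(np.array([[1200.0], [3400.0], [560.0]]), columns=['price'])

# Without the reset, the labels do not match and the rows are not paired up
misaligned = pd.concat([X_id, res['price']], axis=1)

# After the reset, both objects share labels 0..2 and line up row by row
aligned = pd.concat([X_id.reset_index(drop=True), res['price']], axis=1)
```

The misaligned frame has six half-empty rows, while the aligned one has one complete row per ID.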

From： https://blog.csdn.net/huodaqi666/article/details/140746190
