Notes and code for P8 of the Bilibili course 《PyTorch深度学习实践》 by uploader 刘二大人 (video link).
The diabetes dataset can be downloaded from the netdisk link in the video's comment section (reposted from the comments):
- Link: https://pan.baidu.com/s/1cUI518pgLWY1oCn2DkjuEQ?pwd=kif1  Extraction code: kif1
- Or download it here: 【免费】b站的up主刘二大人的《PyTorch深度学习实践》所需数据资源-CSDN文库
The Titanic dataset:
- Kaggle: https://www.kaggle.com/c/titanic/data
- Baidu netdisk link (extraction code: 2024)
I. How to Use DataLoader
Overview: DataLoader is the PyTorch utility that wraps a dataset into mini-batches and exposes an iterator over them; it is commonly used to read data batch by batch during training. It greatly simplifies data loading and supports multi-process loading, random shuffling, and batching.
Basic usage (see the full code and the homework below for details):
from torch.utils.data import DataLoader

# Assume a custom Dataset (e.g. DiabetesDataset) has already been built as train_dataset
train_loader = DataLoader(dataset=train_dataset, batch_size=32, shuffle=True)

for batch_data, batch_labels in train_loader:
    pass  # training step goes here
Main parameters:
- dataset: the Dataset object that defines the data (e.g. your custom TitanicDataset).
- batch_size: number of samples per batch; default 1. Usually chosen based on GPU memory and training needs.
- shuffle: whether to reshuffle the data at the start of every epoch. For the training set this is usually set to True to improve generalization.
- num_workers: number of worker subprocesses used for loading; default 0, meaning data is loaded in the main process. Increasing it can speed up loading, especially when preprocessing is slow.
- drop_last: if True, drop the final batch when it has fewer than batch_size samples; default False.
- pin_memory: if True, load data into pinned (page-locked) memory, which speeds up transfer to the GPU; commonly used for GPU training.
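As a quick illustration of how batch_size and drop_last interact, the number of batches per epoch can be computed by hand (pure-Python arithmetic, no actual DataLoader needed; the sample counts are made up for the example):

```python
import math

def num_batches(n_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields per epoch."""
    if drop_last:
        return n_samples // batch_size       # incomplete final batch is discarded
    return math.ceil(n_samples / batch_size) # incomplete final batch is kept

print(num_batches(100, 32))                   # 4 batches: 32 + 32 + 32 + 4
print(num_batches(100, 32, drop_last=True))   # 3 batches; the last 4 samples are dropped
```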
II. Code from the Slides
Diabetes prediction code:
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# Custom dataset class
class DiabetesDataset(Dataset):
    def __init__(self, filepath):
        xy = np.loadtxt(filepath, delimiter=',', dtype=np.float32)
        self.len = xy.shape[0]
        self.x_data = torch.from_numpy(xy[:, :-1])   # all columns except the last are features
        self.y_data = torch.from_numpy(xy[:, [-1]])  # last column is the label, kept 2-D

    def __getitem__(self, index):
        # Return the feature/label pair at the given index
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.len

# Instantiate the dataset (forward slash so the path is not mangled by backslash escapes)
dataset = DiabetesDataset('data/diabetes.csv.gz')

# Build the data loader
train_loader = DataLoader(dataset=dataset,
                          batch_size=32,   # samples per batch
                          shuffle=True,    # reshuffle every epoch
                          num_workers=2)   # loader subprocesses
# Model class
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear1 = torch.nn.Linear(8, 6)
        self.linear2 = torch.nn.Linear(6, 4)
        self.linear3 = torch.nn.Linear(4, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.sigmoid(self.linear1(x))
        x = self.sigmoid(self.linear2(x))
        x = self.sigmoid(self.linear3(x))
        return x

model = Model()

# Loss function and optimizer
criterion = torch.nn.BCELoss(reduction='mean')  # binary cross-entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # stochastic gradient descent
if __name__ == '__main__':
    # Training loop
    for epoch in range(100):
        for i, data in enumerate(train_loader, 0):
            # 1. Prepare the data
            inputs, labels = data
            # 2. Forward pass
            y_pred = model(inputs)
            loss = criterion(y_pred, labels)
            print(epoch, i, loss.item())
            # 3. Backward pass
            optimizer.zero_grad()
            loss.backward()
            # 4. Update the parameters
            optimizer.step()
Output:
.............
99 17 0.6633598804473877
99 18 0.74329674243927
99 19 0.6236274242401123
99 20 0.6034884452819824
99 21 0.7631409764289856
99 22 0.6634839773178101
99 23 0.646409809589386
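The batch indices in the output line up with the dataset size: assuming the diabetes file has 759 rows (an assumption; this is its usual size in the course materials), a batch size of 32 gives 24 batches per epoch, indexed 0 through 23 — which is why each epoch's last printed line ends with index 23:

```python
import math

n_samples, batch_size = 759, 32  # 759 rows is an assumption about diabetes.csv.gz
batches = math.ceil(n_samples / batch_size)
print(batches)       # 24 batches per epoch
print(batches - 1)   # 23: the last batch index printed above
```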
Another dataset (MNIST):
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision import datasets

train_dataset = datasets.MNIST(root='../dataset/mnist',
                               train=True,
                               transform=transforms.ToTensor(),
                               download=True)
test_dataset = datasets.MNIST(root='../dataset/mnist',
                              train=False,
                              transform=transforms.ToTensor(),
                              download=True)

train_loader = DataLoader(dataset=train_dataset,
                          batch_size=32,
                          shuffle=True)
test_loader = DataLoader(dataset=test_dataset,
                         batch_size=32,
                         shuffle=False)

if __name__ == '__main__':
    for batch_idx, (inputs, target) in enumerate(train_loader):
        pass  # training step goes here
III. Homework
Dataset on Kaggle: https://www.kaggle.com/c/titanic/data
Or via the Baidu netdisk link (extraction code: 2024)
1. Titanic survival prediction code:
a. Build the dataset class, preprocess the data, create the train/test splits, and create the DataLoaders
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
import re

# Dataset class
class TitanicDataset(Dataset):
    def __init__(self, x_data, y_data):
        self.x_data = torch.tensor(x_data.values, dtype=torch.float32)
        self.y_data = torch.tensor(y_data.values, dtype=torch.float32)
        self.len = len(self.x_data)

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.len

data = pd.read_csv('/kaggle/input/titanic/train.csv')
# Add a Ticket_len column: reduce each ticket number to its count of digits
data['Ticket_len'] = data['Ticket'].apply(lambda x: len(re.sub(r'\D', '', x)))
# Drop non-numeric columns we do not need; Cabin has too many missing values
data = data.drop(columns=['Name', 'Ticket', 'Cabin', 'PassengerId'])
# Handle categorical / non-numeric data
data = pd.get_dummies(data)       # one-hot encode the categorical columns
data = data.fillna(data.mean())   # fill missing values with the column mean
data = data.astype(float)         # cast everything to float (mainly the booleans from get_dummies)
# print(data.dtypes)

# Split off x_data and y_data (Survived is the first column)
y_data = data.iloc[:, 0]
x_data = data.drop(data.columns[0], axis=1)
# Split into train/test with an 80:20 ratio
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=42)

# Wrap the splits in the dataset class
train_dataset = TitanicDataset(x_train, y_train)
test_dataset = TitanicDataset(x_test, y_test)

# Create the DataLoaders
train_loader = DataLoader(dataset=train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=32, shuffle=False)

print(f'Training set size: {len(train_dataset)}')  # 712
print(f'Test set size: {len(test_dataset)}')       # 179
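To see what the get_dummies / fillna / astype chain above actually does, here is a minimal sketch on a toy frame (the two columns are invented for the example; they only mimic Titanic's Age and Sex):

```python
import pandas as pd

# A toy frame mimicking the kinds of columns that need encoding and filling
df = pd.DataFrame({'Age': [22.0, None, 30.0],
                   'Sex': ['male', 'female', 'male']})
df = pd.get_dummies(df)    # Sex -> Sex_female / Sex_male indicator columns
df = df.fillna(df.mean())  # the missing Age becomes the column mean, 26.0
df = df.astype(float)      # booleans from get_dummies become 0.0 / 1.0

print(list(df.columns))    # ['Age', 'Sex_female', 'Sex_male']
print(df['Age'].tolist())  # [22.0, 26.0, 30.0]
```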
b. Build the model class:
# Model class
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear1 = torch.nn.Linear(x_train.shape[1], 6)  # input size = number of feature columns (11 here)
        self.linear2 = torch.nn.Linear(6, 4)
        self.linear3 = torch.nn.Linear(4, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.sigmoid(self.linear1(x))
        x = self.sigmoid(self.linear2(x))
        x = self.sigmoid(self.linear3(x))
        return x
c. Define the loss function and optimizer, then train the model:
model = Model()
# Loss function and optimizer
criterion = torch.nn.BCELoss(reduction='mean')
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
num_epochs = 10000  # matches the output below

# Training loop
for epoch in range(num_epochs):
    model.train()  # training mode
    running_loss = 0.0
    for i, (x_batch, y_batch) in enumerate(train_loader):  # named to avoid shadowing x_train/y_train
        optimizer.zero_grad()
        outputs = model(x_batch)
        loss = criterion(outputs.squeeze(), y_batch)  # squeeze() makes the output shape match the labels
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f'Epoch[{epoch + 1}], Loss: {running_loss / len(train_loader)}')
Output:
Epoch[9989], Loss: 0.38880317496216815
Epoch[9990], Loss: 0.3819621872642766
Epoch[9991], Loss: 0.3719022695137107
Epoch[9992], Loss: 0.3825868627299433
Epoch[9993], Loss: 0.37470011153946753
Epoch[9994], Loss: 0.37995268404483795
Epoch[9995], Loss: 0.42972452096317126
Epoch[9996], Loss: 0.39383051512034045
Epoch[9997], Loss: 0.3703241633332294
Epoch[9998], Loss: 0.3689967789079832
Epoch[9999], Loss: 0.3822028377781744
Epoch[10000], Loss: 0.3745391517877579
d. Test the model:
# Evaluation function
def test_model(test_loader, model):
    model.eval()  # evaluation mode
    test_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():  # disable gradient tracking to speed up inference
        for x_test_batch, y_test_batch in test_loader:
            outputs = model(x_test_batch).squeeze()  # predicted probabilities
            loss = criterion(outputs, y_test_batch)
            test_loss += loss.item()
            # Turn the continuous probabilities into 0/1 class predictions
            predicted = (outputs > 0.5).float()
            total += y_test_batch.size(0)
            correct += (predicted == y_test_batch).sum().item()
    avg_loss = test_loss / len(test_loader)
    accuracy = correct / total
    print(f'Test Loss: {avg_loss:.4f}, Test Accuracy: {accuracy:.4f}')

# Evaluate on the test set
test_model(test_loader, model)
Output:
Test Loss: 0.4919, Test Accuracy: 0.7709
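The accuracy computed inside test_model is simply the fraction of thresholded predictions that match the labels. A tiny pure-Python sketch with made-up probabilities:

```python
# Hypothetical sigmoid outputs for 5 passengers, plus their true labels
probs  = [0.91, 0.23, 0.57, 0.44, 0.08]
labels = [1.0, 0.0, 1.0, 1.0, 0.0]

# Same rule as in the test loop: probability > 0.5 -> class 1, else class 0
preds = [1.0 if p > 0.5 else 0.0 for p in probs]
correct = sum(p == y for p, y in zip(preds, labels))
print(correct / len(labels))  # 0.8 — 4 of the 5 predictions match
```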
2. Improvement ①
Change only the model class (everything else stays the same); this improves the result.
# Model class
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear1 = torch.nn.Linear(x_train.shape[1], 16)  # input size = number of feature columns (11 here)
        print(x_train.shape)  # debug print: confirm the input width
        self.linear2 = torch.nn.Linear(16, 12)
        self.linear3 = torch.nn.Linear(12, 6)
        self.linear4 = torch.nn.Linear(6, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.sigmoid(self.linear1(x))
        x = self.sigmoid(self.linear2(x))
        x = self.sigmoid(self.linear3(x))
        x = self.sigmoid(self.linear4(x))
        return x
Training output:
Epoch[9986], Loss: 0.37727273741494055
Epoch[9987], Loss: 0.39805085827475006
Epoch[9988], Loss: 0.36478128834911017
Epoch[9989], Loss: 0.3576905150776324
Epoch[9990], Loss: 0.36630746268707776
Epoch[9991], Loss: 0.3829510918130045
Epoch[9992], Loss: 0.37961867969969043
Epoch[9993], Loss: 0.35922896019790485
Epoch[9994], Loss: 0.36333370662253833
Epoch[9995], Loss: 0.3758453253818595
Epoch[9996], Loss: 0.361093489372212
Epoch[9997], Loss: 0.36214717898679816
Epoch[9998], Loss: 0.36393981394560443
Epoch[9999], Loss: 0.35134082157974655
Epoch[10000], Loss: 0.36573962996835296
Test output:
Test Loss: 0.4821, Test Accuracy: 0.7877
3. Improvement ②
On top of Improvement ①, switch the optimizer to Adam: the training loss drops much faster, but test accuracy does not improve.
Training output:
Epoch[1990], Loss: 0.2565524513306825
Epoch[1991], Loss: 0.2779191806912422
Epoch[1992], Loss: 0.2523664009311925
Epoch[1993], Loss: 0.2586914858092432
Epoch[1994], Loss: 0.25613301895234897
Epoch[1995], Loss: 0.25247146837089374
Epoch[1996], Loss: 0.2560588529576426
Epoch[1997], Loss: 0.2710036752016648
Epoch[1998], Loss: 0.2821883316273275
Epoch[1999], Loss: 0.2581835415052331
Epoch[2000], Loss: 0.25669303180082986
Test output:
Test Loss: 0.5428, Test Accuracy: 0.7765
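For reference, the only change Improvement ② requires is the optimizer construction; the training loop itself is untouched. A minimal sketch — the Linear layer is a stand-in for the real Titanic model, and the learning rate of 0.001 (Adam's default) is an assumption, since the post does not state the value used:

```python
import torch

model = torch.nn.Linear(11, 1)  # stand-in for the Titanic Model class above
# Adam maintains per-parameter adaptive learning rates, which is why the
# training loss falls faster here than with plain SGD
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# The train step is unchanged: zero_grad -> forward -> backward -> step
x = torch.randn(8, 11)
y = torch.rand(8, 1)
loss = torch.nn.functional.binary_cross_entropy(torch.sigmoid(model(x)), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```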
If you have suggestions or questions, leave them in the comments or send me a private message!
From: https://blog.csdn.net/weixin_46046293/article/details/143066952