
Troubleshooting a PyTorch multi-GPU DDP training hang


Background

Single-machine multi-GPU model training, accelerated with DistributedDataParallel: whenever more than one GPU is requested, training hangs, with GPU 0 pinned at 100% utilization and no further progress.

Investigation

Monitoring with nvtop showed that GPU 0 was assigned as many processes as nproc_per_node, instead of the expected one process per GPU.
The DDP-related part of the code is shown below:

model = MyNet(config).cuda()
model = torch.nn.parallel.DistributedDataParallel(model, 
                                                      device_ids=[config.LOCAL_RANK], 
                                                      output_device=config.LOCAL_RANK, 
                                                      broadcast_buffers=False,
                                                      find_unused_parameters=True)

Logging showed that the model was placed on cuda:0 in every process, which also explains why training only worked with nproc_per_node=1.
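As a minimal sanity check of this kind (assuming the processes are launched with torchrun, which sets the LOCAL_RANK environment variable; model refers to the MyNet instance from the snippet above), the placement can be logged per process like this:

import os

# Log which CUDA device this process's model parameters actually live on.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
param_device = next(model.parameters()).device
print(f"LOCAL_RANK={local_rank}, model parameters on {param_device}")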

The official example

According to the official DDP documentation, there are two main ways to set up the multi-process part:

  • manually spawning the processes with torch.multiprocessing (a minimal sketch of this approach follows after this list)
  • automatic initialization via torch.distributed.run / torchrun
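For completeness, here is a minimal sketch of the manual-spawn approach (hypothetical single-node setup; the demo function name and the MASTER_ADDR/MASTER_PORT values are placeholders, not from the original post):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def demo(rank, world_size):
    # Each spawned process sets up its own process group.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)  # bind this process to one GPU
    # ... build the model, wrap it with DDP, run the training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(demo, args=(world_size,), nprocs=world_size, join=True)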

Here the latter is used, with the official DDP example elastic_ddp.py:

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim

from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.")

    # create model and move it to GPU with id rank
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()

if __name__ == "__main__":
    demo_basic()

Run output

$ torchrun --nproc_per_node=8 elastic_ddp.py
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Start running basic DDP example on rank 7.
Start running basic DDP example on rank 4.
Start running basic DDP example on rank 2.
Start running basic DDP example on rank 6.
Start running basic DDP example on rank 1.
Start running basic DDP example on rank 3.
Start running basic DDP example on rank 0.
Start running basic DDP example on rank 5.

Solution

Comparing against the official example, the problem was almost certainly in the CUDA device assignment.
The main function was modified as follows:

dist.init_process_group("nccl")
rank = dist.get_rank()
print(f"Start running basic DDP example on rank {rank}.")

# create model and move it to GPU with id rank
device_id = rank % torch.cuda.device_count()
model = MyNet(config).to(device_id)
ddp_model = DDP(model, broadcast_buffers=False, find_unused_parameters=True)

The official documentation states that in multi-GPU mode the device_ids and output_device arguments must be left at their default value of None:

device_ids (list of int or torch.device):
CUDA devices. 1) For single-device modules, device_ids can contain exactly one device id, which represents the only CUDA device where the input module corresponding to this process resides. Alternatively, device_ids can also be None. 2) For multi-device modules and CPU modules, device_ids must be None.
When device_ids is None for both cases, both the input data for the forward pass and the actual module must be placed on the correct device. (default: None)
output_device (int or torch.device):
Device location of output for single-device CUDA modules. For multi-device modules and CPU modules, it must be None, and the module itself dictates the output location. (default: device_ids[0] for single-device modules)

After obtaining the correct device_id this way, the log confirmed that the model was placed on a different CUDA device in each process; distributed training then started, and the problem turned into:

"RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1"

So the model still had parameters and inputs that were not on the same card. Since the data feed converts numpy arrays loaded from pickle files on the fly, the fix was to pass the device_id argument through the forward function of every layer, so that the placement of the converted CUDA tensor is pinned explicitly:

def forward(self, input, device):
    # Convert the numpy batch and place it explicitly on this process's GPU.
    input = torch.from_numpy(input).float().cuda(device, non_blocking=True)

The shorthand input.cuda() would instead use the current CUDA device, which defaults to cuda:0, causing the error above.
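For illustration, a small sketch of the difference (assuming a machine with at least two GPUs; device_id = 1 is just a placeholder for the per-process value):

import torch

# With no argument, .cuda() copies to the *current* CUDA device, cuda:0 by default.
x = torch.zeros(3).cuda()
print(x.device)                      # cuda:0

# Passing the device explicitly pins the copy to this process's GPU.
device_id = 1                        # placeholder; in DDP this comes from the rank
y = torch.zeros(3).cuda(device_id)
print(y.device)                      # cuda:1

# Another common pattern (not used in this post) is to set the current device
# once per process, so that bare .cuda() calls resolve to the right GPU.
torch.cuda.set_device(device_id)
print(torch.zeros(3).cuda().device)  # cuda:1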

References

Getting Started with Distributed Data Parallel — PyTorch Tutorials 1.12.1+cu102 documentation
python - Stuck at this error "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu" - Stack Overflow

