Learning DeepSpeed: multi-node all_reduce

This article demonstrates how to run a multi-node torch.distributed.all_reduce with DeepSpeed. In an all_reduce, every rank contributes a tensor and, after the collective completes, every rank holds the element-wise reduction (a sum in this demo) of all contributions.

I. Install nvidia-docker

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
apt-get update
apt-get install -y nvidia-docker2
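
To confirm the runtime is working, you can optionally run nvidia-smi inside a throwaway CUDA container (the image tag below is only an example; any CUDA base image will do):

nvidia-docker run --rm nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi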

II. Build the container

1. Create the container

cd /mnt
docker stop pytorch_cuda
docker rm pytorch_cuda
nvidia-docker run -ti -e NVIDIA_VISIBLE_DEVICES=all --privileged --net=host -v $PWD:/home \
            -w /home --name pytorch_cuda  ubuntu:22.04 /bin/bash
docker start pytorch_cuda
docker exec -ti pytorch_cuda /bin/bash

2. Switch the apt sources to the Huawei Cloud mirror

sed -i "s@http://.*archive.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list
sed -i "s@http://.*security.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list
apt update

3. Install dependencies

apt install gcc g++ vim git wget curl unzip make -y
apt install -y pkg-config  
apt install -y python3.10
apt install -y python3.10-dev
apt install -y python3-pip
apt install -y libsystemd*
apt install -y libabsl-dev
apt install -y libopencv-dev
apt install -y psmisc
apt install -y openssh-server
apt install -y gdb
apt install -y pciutils
apt install -y nfs-common
apt install -y openmpi-bin openmpi-doc libopenmpi-dev
apt install -y pdsh 

4. Install CUDA 12.1 (required for compiling DeepSpeed ops)

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda-repo-ubuntu2204-12-1-local_12.1.1-530.30.02-1_amd64.deb
cp cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
dpkg -i cuda-repo-ubuntu2204-12-1-local_12.1.1-530.30.02-1_amd64.deb
cp /var/cuda-repo-ubuntu2204-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
apt-get update
apt-get -y install cuda-toolkit-12
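
The .deb installer normally places the toolkit under /usr/local/cuda-12.1. If nvcc is not already on PATH, DeepSpeed's op builder will not find it, so a sketch of the usual environment setup (assuming that default prefix) is:

export PATH=/usr/local/cuda-12.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH
nvcc --version   # should report release 12.1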

5. Set the ssh port and root password (the sshd port inside the container is changed to avoid clashing with the host's sshd)

sed -i 's/^#\?PermitRootLogin.*$/PermitRootLogin yes/' /etc/ssh/sshd_config
sed -i 's/^#\?Port .*$/Port 2223/' /etc/ssh/sshd_config
export passwd=Hello123 && printf "${passwd}\n${passwd}\n"  | passwd root

6. Run the sshd service

cat >/usr/bin/run.sh <<EOF
#!/bin/bash
mkdir  -p /run/sshd
source ~/.bashrc
/usr/sbin/sshd -D
EOF
chmod  777 /usr/bin/run.sh
nohup /usr/bin/run.sh &

7. Install PyTorch

pip install torch==2.2.2 -i https://pypi.tuna.tsinghua.edu.cn/simple
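
A quick sanity check that this wheel can see the GPU (it should print the version and True):

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"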

8. Test NCCL

cd /home
cat > reduce_demo.py <<EOF
import os
import torch
import argparse
import torch.distributed as dist
from torch.distributed import ReduceOp
import time
 
dist.init_process_group(backend='nccl')

local_rank = int(os.environ["LOCAL_RANK"])
world_size = torch.distributed.get_world_size()
rank = torch.distributed.get_rank()
 
torch.cuda.set_device(local_rank)
 
device = torch.device("cuda",local_rank)
 
dist.barrier()
tensor = (torch.ones(world_size, dtype=torch.int64) * rank).to(device)
print(tensor)
dist.barrier()
 
dist.all_reduce(tensor, op=ReduceOp.SUM)
dist.barrier()
time.sleep(2)
 
print("reduce result:",tensor)
 
dist.destroy_process_group()
EOF
torchrun --nnodes=1 --nproc_per_node=1 reduce_demo.py
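
With one node and one process the sum only covers rank 0, so both prints should show something like:

tensor([0], device='cuda:0')
reduce result: tensor([0], device='cuda:0')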

9. Install DeepSpeed

pip install deepspeed
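
DeepSpeed ships a ds_report utility that summarizes the detected torch/CUDA versions and which ops can be JIT-compiled; running it is a cheap way to confirm the install:

ds_report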

10. Exit the container

exit

11. Push the image

docker commit pytorch_cuda harbor.hi20240217.com/public/pytorch_cuda:v1.0
docker login harbor.hi20240217.com
docker push harbor.hi20240217.com/public/pytorch_cuda:v1.0

III. Multi-node environment deployment (run on every host)

1. Create the container

cd /mnt
docker pull harbor.hi20240217.com/public/pytorch_cuda:v1.0
docker run -ti --privileged --net=host -v $PWD:/home \
            -w /home --rm harbor.hi20240217.com/public/pytorch_cuda:v1.0 /bin/bash

2. Generate ssh keys

rm -rf ~/.ssh/*
ssh-keygen
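
If you prefer to skip the interactive prompts, a non-interactive variant (default key path, empty passphrase) looks like:

ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa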

3. Create the ssh host list

tee ~/.ssh/config <<-'EOF'
Host worker_1
        User  root
        Hostname 192.168.1.100
        port 2223
        IdentityFile ~/.ssh/id_rsa
Host worker_2
        User  root
        Hostname 192.168.1.101
        port 2223
        IdentityFile ~/.ssh/id_rsa        
EOF

4. Start sshd

nohup /usr/bin/run.sh &

5. Copy the public key to each worker

ssh-copy-id worker_1
ssh-copy-id worker_2
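
Passwordless access, both through the ssh aliases and through pdsh (which the DeepSpeed launcher uses), can be verified with, for example:

ssh worker_1 hostname
ssh worker_2 hostname
pdsh -R ssh -w worker_1,worker_2 hostname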

6. Prepare the test code

cd /home
cat > reduce_demo.py <<EOF
import os
import torch
import argparse
import torch.distributed as dist
from torch.distributed import ReduceOp
import time
 
dist.init_process_group(backend='nccl')

local_rank = int(os.environ["LOCAL_RANK"])
world_size = torch.distributed.get_world_size()
rank = torch.distributed.get_rank()
 
torch.cuda.set_device(local_rank)
 
device = torch.device("cuda",local_rank)
 
dist.barrier()
tensor = (torch.ones(world_size, dtype=torch.int64) * rank).to(device)
print(tensor)
dist.barrier()
 
dist.all_reduce(tensor, op=ReduceOp.SUM)
dist.barrier()
time.sleep(2)
 
print("reduce result:",tensor)
 
dist.destroy_process_group()
EOF

IV. Run the test program (on any one of the hosts)

1. Create the host file (slots tells DeepSpeed how many GPU processes to launch on each host; with slots=1 on two hosts the job runs with a world size of 2)

cd /home
tee hostfile <<-'EOF'
worker_1 slots=1
worker_2 slots=1
EOF

2. Run the program (NCCL_SOCKET_IFNAME must name the NIC that carries the 192.168.1.x addresses, enp5s0 in this setup, and NCCL_IB_DISABLE=1 forces NCCL onto plain TCP sockets)

export NCCL_DEBUG=info
export NCCL_SOCKET_IFNAME=enp5s0
export NCCL_IB_DISABLE=1
deepspeed --hostfile ./hostfile reduce_demo.py

3. Run log

[2024-04-04 15:32:08,327] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-04 15:32:08,838] [INFO] [runner.py:463:main] Using IP address of 192.168.1.100 for node worker_1
[2024-04-04 15:32:08,838] [INFO] [multinode_runner.py:80:get_cmd] Running on the following workers: worker_1,worker_2
[2024-04-04 15:32:08,839] [INFO] [runner.py:568:main] cmd = pdsh -S -f 1024 -w worker_1,worker_2 export NCCL_SOCKET_IFNAME=enp5s0; export NCCL_DEBUG=info; export NCCL_IB_DISABLE=1; export PYTHONPATH=/home;  cd /home; /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJ3b3JrZXJfMSI6IFswXSwgIndvcmtlcl8yIjogWzBdfQ== --node_rank=%n --master_addr=192.168.1.100 --master_port=29500 reduce_demo.py
worker_2: [2024-04-04 15:32:09,956] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
worker_2: [2024-04-04 15:32:10,111] [INFO] [launch.py:138:main] 1 NCCL_SOCKET_IFNAME=enp5s0
worker_2: [2024-04-04 15:32:10,111] [INFO] [launch.py:138:main] 1 NCCL_DEBUG=info
worker_2: [2024-04-04 15:32:10,111] [INFO] [launch.py:138:main] 1 NCCL_IB_DISABLE=1
worker_2: [2024-04-04 15:32:10,111] [INFO] [launch.py:145:main] WORLD INFO DICT: {'worker_1': [0], 'worker_2': [0]}
worker_2: [2024-04-04 15:32:10,111] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
worker_2: [2024-04-04 15:32:10,111] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'worker_1': [0], 'worker_2': [1]})
worker_2: [2024-04-04 15:32:10,111] [INFO] [launch.py:163:main] dist_world_size=2
worker_2: [2024-04-04 15:32:10,111] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
worker_2: [2024-04-04 15:32:10,112] [INFO] [launch.py:253:main] process 4296 spawned with command: ['/usr/bin/python3', '-u', 'reduce_demo.py', '--local_rank=0']
worker_1: [2024-04-04 15:32:11,037] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
worker_1: [2024-04-04 15:32:11,314] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=enp5s0
worker_1: [2024-04-04 15:32:11,314] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=info
worker_1: [2024-04-04 15:32:11,314] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=1
worker_1: [2024-04-04 15:32:11,314] [INFO] [launch.py:145:main] WORLD INFO DICT: {'worker_1': [0], 'worker_2': [0]}
worker_1: [2024-04-04 15:32:11,314] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
worker_1: [2024-04-04 15:32:11,314] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'worker_1': [0], 'worker_2': [1]})
worker_1: [2024-04-04 15:32:11,314] [INFO] [launch.py:163:main] dist_world_size=2
worker_1: [2024-04-04 15:32:11,314] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
worker_1: [2024-04-04 15:32:11,315] [INFO] [launch.py:253:main] process 3956 spawned with command: ['/usr/bin/python3', '-u', 'reduce_demo.py', '--local_rank=0']
worker_1: test-X99:3956:3956 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp5s0
worker_1: test-X99:3956:3956 [0] NCCL INFO Bootstrap : Using enp5s0:192.168.1.100<0>
worker_1: test-X99:3956:3956 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
worker_1: test-X99:3956:3956 [0] NCCL INFO cudaDriverVersion 12020
worker_1: NCCL version 2.19.3+cuda12.3
worker_2: work:4296:4296 [0] NCCL INFO cudaDriverVersion 12020
worker_2: work:4296:4296 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp5s0
worker_2: work:4296:4296 [0] NCCL INFO Bootstrap : Using enp5s0:192.168.1.101<0>
worker_2: work:4296:4296 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
worker_2: work:4296:4316 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
worker_2: work:4296:4316 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp5s0
worker_2: work:4296:4316 [0] NCCL INFO NET/Socket : Using [0]enp5s0:192.168.1.101<0>
worker_2: work:4296:4316 [0] NCCL INFO Using non-device net plugin version 0
worker_2: work:4296:4316 [0] NCCL INFO Using network Socket
worker_1: test-X99:3956:4018 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
worker_1: test-X99:3956:4018 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp5s0
worker_1: test-X99:3956:4018 [0] NCCL INFO NET/Socket : Using [0]enp5s0:192.168.1.100<0>
worker_1: test-X99:3956:4018 [0] NCCL INFO Using non-device net plugin version 0
worker_1: test-X99:3956:4018 [0] NCCL INFO Using network Socket
worker_2: work:4296:4316 [0] NCCL INFO comm 0x559a38e9b2a0 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 9000 commId 0x7b537e17c819b798 - Init START
worker_1: test-X99:3956:4018 [0] NCCL INFO comm 0x557ee0bc6880 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 3000 commId 0x7b537e17c819b798 - Init START
worker_2: work:4296:4316 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
worker_2: work:4296:4316 [0] NCCL INFO P2P Chunksize set to 131072
worker_1: test-X99:3956:4018 [0] NCCL INFO Channel 00/02 :    0   1
worker_1: test-X99:3956:4018 [0] NCCL INFO Channel 01/02 :    0   1
worker_1: test-X99:3956:4018 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
worker_1: test-X99:3956:4018 [0] NCCL INFO P2P Chunksize set to 131072
worker_2: work:4296:4316 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/Socket/0
worker_2: work:4296:4316 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/Socket/0
worker_2: work:4296:4316 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/Socket/0
worker_2: work:4296:4316 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/Socket/0
worker_1: test-X99:3956:4018 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/Socket/0
worker_1: test-X99:3956:4018 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/Socket/0
worker_1: test-X99:3956:4018 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/Socket/0
worker_1: test-X99:3956:4018 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/Socket/0
worker_2: work:4296:4316 [0] NCCL INFO Connected all rings
worker_2: work:4296:4316 [0] NCCL INFO Connected all trees
worker_2: work:4296:4316 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
worker_2: work:4296:4316 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
worker_1: test-X99:3956:4018 [0] NCCL INFO Connected all rings
worker_1: test-X99:3956:4018 [0] NCCL INFO Connected all trees
worker_1: test-X99:3956:4018 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
worker_1: test-X99:3956:4018 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
worker_2: work:4296:4316 [0] NCCL INFO comm 0x559a38e9b2a0 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 9000 commId 0x7b537e17c819b798 - Init COMPLETE
worker_1: test-X99:3956:4018 [0] NCCL INFO comm 0x557ee0bc6880 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 3000 commId 0x7b537e17c819b798 - Init COMPLETE
worker_2: tensor([1, 1], device='cuda:0')
worker_1: tensor([0, 0], device='cuda:0')
worker_1: reduce result: tensor([1, 1], device='cuda:0')
worker_2: reduce result: tensor([1, 1], device='cuda:0')
worker_1: test-X99:3956:4019 [0] NCCL INFO [Service thread] Connection closed by localRank 0
worker_2: work:4296:4317 [0] NCCL INFO [Service thread] Connection closed by localRank 0
worker_2: work:4296:4296 [0] NCCL INFO comm 0x559a38e9b2a0 rank 1 nranks 2 cudaDev 0 busId 9000 - Abort COMPLETE
worker_1: test-X99:3956:3956 [0] NCCL INFO comm 0x557ee0bc6880 rank 0 nranks 2 cudaDev 0 busId 3000 - Abort COMPLETE
worker_2: [2024-04-04 15:32:16,118] [INFO] [launch.py:348:main] Process 4296 exits successfully.
worker_1: [2024-04-04 15:32:16,320] [INFO] [launch.py:348:main] Process 3956 exits successfully.
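
In this run rank 0 contributes tensor([0, 0]) and rank 1 contributes tensor([1, 1]); their element-wise sum is tensor([1, 1]), which both workers print as the reduce result, confirming that the all_reduce ran across the two nodes.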

From: https://blog.csdn.net/m0_61864577/article/details/137376285
