The master port defaults to 29500; if it is already in use, just switch to another one:
torchrun --master_port 61234 --nproc_per_node $gpu_num train.py ...
Shell script (the GPU list is passed as the first argument, e.g. 0,1,3):
export CUDA_VISIBLE_DEVICES=$1  # GPU list from the first argument, e.g. 0,1,3
gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F ',' '{print NF}')  # count comma-separated entries
torchrun --master_port 61234 --nproc_per_node $gpu_num hf_train.py ...
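Instead of hard-coding a port like 61234, the script can pick a free port at launch time. A minimal sketch (not from the original post), assuming python3 is available on the machine: bind a socket to port 0 so the OS assigns an unused port, then hand that port to torchrun.

```shell
# Ask the OS for a currently free TCP port by binding to port 0
free_port=$(python3 -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
echo "using master port $free_port"
# torchrun --master_port $free_port --nproc_per_node $gpu_num hf_train.py ...
```

Note there is a small race window between closing the probe socket and torchrun binding the port, but in practice this reliably avoids collisions with other jobs on a shared machine.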
From: https://www.cnblogs.com/wangbingbing/p/16934995.html