./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 268435456 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 60407 on gpu003 device 0 [0x26] NVIDIA A800-SXM4-80GB
# Rank 1 Group 0 Pid 60407 on gpu003 device 1 [0x2c] NVIDIA A800-SXM4-80GB
gpu003:60407:60407 [0] NCCL INFO Bootstrap : Using ibs13:10.110.10.7<0>
gpu003:60407:60407 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gpu003:60407:60407 [1] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.1+cuda11.8
gpu003:60407:60420 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_4:1/IB [3]mlx5_5:1/IB [RO]; OOB ibs13:10.110.10.7<0>
gpu003:60407:60420 [0] NCCL INFO Using network IB
gpu003:60407:60421 [1] NCCL INFO Using network IB
gpu003:60407:60420 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
gpu003:60407:60421 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
gpu003:60407:60420 [0] NCCL INFO Channel 00/16 : 0 1
gpu003:60407:60420 [0] NCCL INFO Channel 01/16 : 0 1
gpu003:60407:60420 [0] NCCL INFO Channel 02/16 : 0 1
gpu003:60407:60420 [0] NCCL INFO Channel 03/16 : 0 1
gpu003:60407:60420 [0] NCCL INFO Channel 04/16 : 0 1
gpu003:60407:60420 [0] NCCL INFO Channel 05/16 : 0 1
gpu003:60407:60420 [0] NCCL INFO Channel 06/16 : 0 1
gpu003:60407:60420 [0] NCCL INFO Channel 07/16 : 0 1
gpu003:60407:60420 [0] NCCL INFO Channel 08/16 : 0 1
gpu003:60407:60420 [0] NCCL INFO Channel 09/16 : 0 1
gpu003:60407:60420 [0] NCCL INFO Channel 10/16 : 0 1
gpu003:60407:60420 [0] NCCL INFO Channel 11/16 : 0 1
gpu003:60407:60420 [0] NCCL INFO Channel 12/16 : 0 1
gpu003:60407:60420 [0] NCCL INFO Channel 13/16 : 0 1
gpu003:60407:60421 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] -1/-1/-1->1->0 [7] -1/-1/-1->1->0 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0
gpu003:60407:60420 [0] NCCL INFO Channel 14/16 : 0 1
gpu003:60407:60420 [0] NCCL INFO Channel 15/16 : 0 1
gpu003:60407:60420 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1
gpu003:60407:60420 [0] NCCL INFO Channel 00/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 00/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 01/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Channel 01/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 02/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Channel 02/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 03/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Channel 03/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 04/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Channel 04/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 05/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Channel 05/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 06/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Channel 06/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 07/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Channel 07/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 08/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Channel 08/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 09/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Channel 09/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 10/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Channel 10/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 11/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Channel 11/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 12/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Channel 12/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 13/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Channel 13/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 14/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Channel 14/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60421 [1] NCCL INFO Channel 15/0 : 1[2c000] -> 0[26000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Channel 15/0 : 0[26000] -> 1[2c000] via P2P/direct pointer/read
gpu003:60407:60420 [0] NCCL INFO Connected all rings
gpu003:60407:60421 [1] NCCL INFO Connected all rings
gpu003:60407:60420 [0] NCCL INFO Connected all trees
gpu003:60407:60421 [1] NCCL INFO Connected all trees
gpu003:60407:60421 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpu003:60407:60421 [1] NCCL INFO 16 coll channels, 16 p2p channels, 16 p2p channels per peer
gpu003:60407:60420 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpu003:60407:60420 [0] NCCL INFO 16 coll channels, 16 p2p channels, 16 p2p channels per peer
gpu003:60407:60421 [1] NCCL INFO comm 0x55e4b15b8870 rank 1 nranks 2 cudaDev 1 busId 2c000 - Init COMPLETE
gpu003:60407:60420 [0] NCCL INFO comm 0x55e4b15b5de0 rank 0 nranks 2 cudaDev 0 busId 26000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 11.72 0.00 0.00 0 12.04 0.00 0.00 0
16 4 float sum -1 11.75 0.00 0.00 0 11.97 0.00 0.00 0
32 8 float sum -1 11.99 0.00 0.00 0 12.00 0.00 0.00 0
64 16 float sum -1 12.05 0.01 0.01 0 11.62 0.01 0.01 0
128 32 float sum -1 11.69 0.01 0.01 0 11.82 0.01 0.01 0
256 64 float sum -1 11.90 0.02 0.02 0 11.87 0.02 0.02 0
512 128 float sum -1 13.60 0.04 0.04 0 12.54 0.04 0.04 0
1024 256 float sum -1 12.86 0.08 0.08 0 12.65 0.08 0.08 0
2048 512 float sum -1 12.77 0.16 0.16 0 12.48 0.16 0.16 0
4096 1024 float sum -1 12.76 0.32 0.32 0 12.10 0.34 0.34 0
8192 2048 float sum -1 13.60 0.60 0.60 0 13.14 0.62 0.62 0
16384 4096 float sum -1 15.09 1.09 1.09 0 15.05 1.09 1.09 0
32768 8192 float sum -1 15.31 2.14 2.14 0 14.81 2.21 2.21 0
65536 16384 float sum -1 15.60 4.20 4.20 0 15.85 4.13 4.13 0
131072 32768 float sum -1 16.32 8.03 8.03 0 15.72 8.34 8.34 0
262144 65536 float sum -1 18.66 14.05 14.05 0 18.29 14.34 14.34 0
524288 131072 float sum -1 23.79 22.03 22.03 0 22.67 23.13 23.13 0
1048576 262144 float sum -1 43.77 23.95 23.95 0 37.99 27.60 27.60 0
2097152 524288 float sum -1 49.24 42.59 42.59 0 49.09 42.72 42.72 0
4194304 1048576 float sum -1 66.89 62.70 62.70 0 65.71 63.83 63.83 0
8388608 2097152 float sum -1 94.55 88.72 88.72 0 94.20 89.05 89.05 0
16777216 4194304 float sum -1 161.0 104.18 104.18 0 158.6 105.80 105.80 0
33554432 8388608 float sum -1 288.9 116.15 116.15 0 287.7 116.61 116.61 0
67108864 16777216 float sum -1 533.9 125.70 125.70 0 533.5 125.80 125.80 0
134217728 33554432 float sum -1 1034.7 129.72 129.72 0 1034.1 129.80 129.80 0
268435456 67108864 float sum -1 2010.3 133.53 133.53 0 2011.1 133.47 133.47 0
gpu003:60407:60407 [1] NCCL INFO comm 0x55e4b15b5de0 rank 0 nranks 2 cudaDev 0 busId 26000 - Destroy COMPLETE
gpu003:60407:60407 [1] NCCL INFO comm 0x55e4b15b8870 rank 1 nranks 2 cudaDev 1 busId 2c000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 34.0238
#
The command you ran and its output mean the following:

Command explanation

./build/all_reduce_perf -b 8 -e 256M -f 2 -g 2

- ./build/all_reduce_perf: runs the NCCL all-reduce performance benchmark.
- -b 8: sets the minimum message size to 8 bytes.
- -e 256M: sets the maximum message size to 256 MB.
- -f 2: sets the step factor to 2 (each step multiplies the message size by this factor).
- -g 2: runs the test on 2 GPUs.
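The -b, -e, and -f flags together define a geometric sweep of message sizes. A minimal sketch (plain Python, not part of nccl-tests) that reproduces the list of sizes this run iterates over:

```python
def sweep_sizes(min_bytes, max_bytes, factor):
    """Reproduce the size sweep implied by -b/-e/-f:
    start at min_bytes and multiply by factor until max_bytes is exceeded."""
    sizes = []
    s = min_bytes
    while s <= max_bytes:
        sizes.append(s)
        s *= factor
    return sizes

# -b 8 -e 256M -f 2
sizes = sweep_sizes(8, 256 * 1024 * 1024, 2)
print(len(sizes), sizes[0], sizes[-1])  # 26 sizes, from 8 B up to 268435456 B
```

This matches the table below: 26 rows, starting at 8 bytes and ending at 268,435,456 bytes (256 MB).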
Output explanation

- nThread 1 nGpus 2: 1 thread driving 2 GPUs.
- Rank 0 and Rank 1: process information for the two GPU ranks.
- NVIDIA A800-SXM4-80GB: the GPU model.
- NCCL version 2.15.1+cuda11.8: the NCCL and CUDA versions.
- Using network IB: InfiniBand is used as the network transport.
- Connected all rings / trees: NCCL successfully established all ring and tree communication channels.
- Performance table: results for each message size, including:
  - size: message size in bytes.
  - time: execution time in microseconds.
  - algbw: algorithm bandwidth in GB/s.
  - busbw: bus bandwidth in GB/s.
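algbw and busbw are related by a correction factor that depends on the collective and the rank count; for all-reduce it is 2*(n-1)/n. A small sketch checking the last table row (assuming, as nccl-tests does, that 1 GB/s means 1e9 bytes/s):

```python
def allreduce_bw(size_bytes, time_us, nranks):
    """Compute algorithm and bus bandwidth for an all-reduce measurement.
    algbw = bytes / time; busbw scales algbw by the all-reduce factor 2*(n-1)/n."""
    algbw = size_bytes / (time_us * 1e-6) / 1e9   # GB/s, with 1 GB = 1e9 bytes
    busbw = algbw * 2 * (nranks - 1) / nranks
    return algbw, busbw

# Last row of the table: 268435456 bytes in 2010.3 us on 2 ranks
algbw, busbw = allreduce_bw(268435456, 2010.3, 2)
print(round(algbw, 2), round(busbw, 2))  # 133.53 133.53, matching the table
```

With only 2 ranks the factor 2*(n-1)/n equals 1, which is why algbw and busbw are identical in every row of this run.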
Performance results

- The table covers message sizes from 8 bytes up to 268,435,456 bytes (256 MB).
- Avg bus bandwidth: the average bus bandwidth is 34.0238 GB/s, summarizing data-transfer efficiency over the whole sweep.
Summary

The output shows a successful NCCL performance test under the given settings, with good performance. If you have further questions or want a detailed analysis of any part, let me know!
From: https://blog.csdn.net/sinat_37574187/article/details/140604699