Learning Roadmap:
Section 1: Intro to Parallel Programming & MUSA
- Deep Learning Ecosystem (Moore Threads domestic GPU MUSA parallel programming notes - 2024/11/20)
- Ubuntu + Driver + Toolkit + conda + pytorch + torch_musa environment setup (Moore Threads domestic GPU MUSA parallel programming notes - 2024/11/24 - CSDN blog)
- C/C++ Review (Moore Threads domestic GPU MUSA parallel programming notes - 2024/11/22 - CSDN blog)
- GPU Intro (Moore Threads domestic GPU MUSA parallel programming notes - 2024/11/25 - CSDN blog)
- GPU Hardware Architecture (Moore Threads domestic GPU MUSA parallel programming notes - 2024/11/26 - CSDN blog)
- Write First Kernels (here) (2024/11/27 - Thread Hierarchy)
- MUSA API
- Faster Matrix Multiplication
- Triton
- Pytorch Extensions (Moore Threads domestic GPU MUSA parallel programming notes - 2024/11/21 - CSDN blog)
- MNIST Multilayer Perceptron
Section 2: Parallel Programming & MUSA in Depth
- Analyzing Parallel Program Performance on a Quad-Core CPU
- Scheduling Task Graphs on a Multi-Core CPU
- A Simple Renderer in MUSA
- Optimizing DNN Performance on DNN Accelerator Hardware
- llm.c
Ref:
https://www.youtube.com/watch?v=86FAWCzIe_4&t=1012s
https://www.youtube.com/watch?v=V1tINV2-9p4
https://gfxcourses.stanford.edu/cs149/fall24
First MUSA Program: Counting Threads
Ref: 2024/11/27 Thread Hierarchy | High-Performance Computing with GPUs
Last time we left off at the thread hierarchy on the GPU. This time we will write a MUSA-based kernel and, in the process, make concepts such as ThreadId, BlockId, and BlockDim concrete.
Code address
Code
#include <stdio.h>

__global__ void whoami(void) {
    int block_id =
        blockIdx.x +                            // apartment number on this floor (points across)
        blockIdx.y * gridDim.x +                // floor number in this building (rows high)
        blockIdx.z * gridDim.x * gridDim.y;     // building number in this city (planes deep)

    int block_offset =
        block_id *                              // times our apartment number
        blockDim.x * blockDim.y * blockDim.z;   // total threads per block (people per apartment)

    int thread_offset =
        threadIdx.x +                           // same flattening, but for the thread inside its block
        threadIdx.y * blockDim.x +
        threadIdx.z * blockDim.x * blockDim.y;

    int id = block_offset + thread_offset;      // global person id in the entire apartment complex

    printf("%04d | Block(%d %d %d) = %3d | Thread(%d %d %d) = %3d\n",
           id,
           blockIdx.x, blockIdx.y, blockIdx.z, block_id,
           threadIdx.x, threadIdx.y, threadIdx.z, thread_offset);
    // printf("blockIdx.x: %d, blockIdx.y: %d, blockIdx.z: %d, threadIdx.x: %d, threadIdx.y: %d, threadIdx.z: %d\n", blockIdx.x, blockIdx.y, blockIdx.z, threadIdx.x, threadIdx.y, threadIdx.z);
}

int main(int argc, char **argv) {
    const int b_x = 2, b_y = 3, b_z = 4;
    const int t_x = 4, t_y = 4, t_z = 4;    // the max warp size is 32, so we will
                                            // get 2 warps of 32 threads per block

    int blocks_per_grid = b_x * b_y * b_z;
    int threads_per_block = t_x * t_y * t_z;

    printf("%d blocks/grid\n", blocks_per_grid);
    printf("%d threads/block\n", threads_per_block);
    printf("%d total threads\n", blocks_per_grid * threads_per_block);

    dim3 blocksPerGrid(b_x, b_y, b_z);      // 3D grid of shape 2*3*4 = 24 blocks
    dim3 threadsPerBlock(t_x, t_y, t_z);    // 3D blocks of shape 4*4*4 = 64 threads each

    whoami<<<blocksPerGrid, threadsPerBlock>>>();
    musaDeviceSynchronize();

    return 0;
}
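One thing the program above does not do is check for errors: a bad launch configuration or a runtime failure would pass silently. Below is a minimal sketch of an error-checked launch. It assumes the MUSA runtime mirrors the CUDA runtime's error-handling names (musaError_t, musaGetLastError, musaGetErrorString, musaSuccess); these names are assumptions, not something shown in this post.
// Hypothetical error-checked variant of the launch in main().
// Assumes CUDA-style error APIs with a musa prefix: musaError_t,
// musaGetLastError(), musaGetErrorString(), musaSuccess
// (assumptions, not verified against the MUSA headers here).
int launch_whoami(dim3 blocksPerGrid, dim3 threadsPerBlock) {
    whoami<<<blocksPerGrid, threadsPerBlock>>>();

    musaError_t err = musaGetLastError();      // errors from the launch itself
    if (err != musaSuccess) {
        fprintf(stderr, "launch failed: %s\n", musaGetErrorString(err));
        return 1;
    }

    err = musaDeviceSynchronize();             // errors raised while the kernel runs
    if (err != musaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", musaGetErrorString(err));
        return 1;
    }
    return 0;
}
Under those assumptions, main() would simply call launch_whoami(blocksPerGrid, threadsPerBlock) instead of launching the kernel directly.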
Compilation
mcc 01_indexing.mu -o indexing -mtgpu -O2 -lmusart
./indexing
With the commands above we compile 01_indexing.mu and then run it. -O2 is the standard optimization flag (same as in gcc), and -lmusart links against the MUSA runtime library.
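If the compile succeeds but you want to confirm that the runtime library and the GPU are actually reachable, a tiny device-query program is a handy sanity check. This is only a sketch under the assumption that the MUSA runtime mirrors CUDA's device-query API (musaGetDeviceCount, musaGetDeviceProperties, musaDeviceProp); those names are assumptions, not taken from this post.
// Toolchain sanity check (sketch). Assumes CUDA-style device-query APIs with a
// musa prefix -- musaGetDeviceCount, musaGetDeviceProperties, musaDeviceProp --
// which are assumptions here, not verified against the MUSA headers.
#include <stdio.h>

int main(void) {
    int count = 0;
    if (musaGetDeviceCount(&count) != musaSuccess || count == 0) {
        fprintf(stderr, "no MUSA device found\n");
        return 1;
    }

    musaDeviceProp prop;
    musaGetDeviceProperties(&prop, 0);
    printf("device 0: %s, warpSize = %d\n", prop.name, prop.warpSize);
    return 0;
}
It can be built the same way as the indexing example, reusing the flags above (e.g. mcc query.mu -o query -mtgpu -lmusart, where query.mu is whatever you name the file).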
GridDim
In this code we define a 2*3*4 grid, which contains 24 blocks.
BlockDim
Each block has dimensions 4*4*4, so each block contains 64 threads.
ThreadNum
From blocks per grid and threads per block we get the total number of threads: 24*64 = 1536.
ThreadId
As shown in the code, we compute the block offset from each block's coordinates in the grid and the thread offset from each thread's coordinates within its block; adding the two gives the thread's globally unique id.
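To make the formula concrete, the snippet below reproduces the same arithmetic on the host for one thread of this launch (grid 2*3*4, block 4*4*4). The coordinates (1,2,3) and (2,1,0) are arbitrary values chosen for illustration, not something printed in the post.
// Host-side recomputation of the indexing formula for one example thread.
// Launch shape from the program above: gridDim = (2,3,4), blockDim = (4,4,4).
#include <stdio.h>

int main(void) {
    const int gx = 2, gy = 3;             // gridDim.x, gridDim.y (gridDim.z is not needed)
    const int bx = 4, by = 4, bz = 4;     // blockDim.x, blockDim.y, blockDim.z
    const int Bx = 1, By = 2, Bz = 3;     // example blockIdx
    const int tx = 2, ty = 1, tz = 0;     // example threadIdx

    int block_id      = Bx + By * gx + Bz * gx * gy;   // 1 + 4 + 18 = 23
    int block_offset  = block_id * (bx * by * bz);     // 23 * 64    = 1472
    int thread_offset = tx + ty * bx + tz * bx * by;   // 2 + 4 + 0  = 6
    int id            = block_offset + thread_offset;  // 1478

    printf("block_id=%d block_offset=%d thread_offset=%d id=%d\n",
           block_id, block_offset, thread_offset, id);
    return 0;
}
The largest block_id is 23, matching the 24 blocks in the grid, and id runs from 0 to 1535, matching the 1536 total threads computed above.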
Output
The output matches our expectations.
From: https://blog.csdn.net/weixin_47469677/article/details/144111698