CUDA Programming (3): Basic Introduction



Chapter 3 Basic framework of simple CUDA programs

3.1 An example: adding up two arrays

We consider a simple task: adding up two arrays of the same length (the same number of elements). We first write a C++ program, add.cpp, that solves this problem. It can be compiled using g++ (or cl.exe):

g++ add.cpp

Running the executable, we will see the following message on the screen:

No errors

which indicates that the calculations have been done as expected. The reader should be able to understand this program without difficulty; otherwise, they should first gain sufficient knowledge of C++ programming.
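
The listing of add.cpp is not reproduced in this post. The following is a minimal sketch of what such a program could look like, assuming the arrays are filled with constants and the sums are checked element by element (all names and constants here are illustrative, not taken from the original listing):

#include <cmath>
#include <cstdio>
#include <cstdlib>

const double EPSILON = 1.0e-15; // tolerance used when checking the results (assumed)

// add up the arrays x and y element by element, storing the sums in z
void add(const double *x, const double *y, double *z, const int N)
{
    for (int n = 0; n < N; ++n)
    {
        z[n] = x[n] + y[n];
    }
}

// check every element of z against the expected value c
void check(const double *z, const double c, const int N)
{
    bool has_error = false;
    for (int n = 0; n < N; ++n)
    {
        if (std::fabs(z[n] - c) > EPSILON)
        {
            has_error = true;
        }
    }
    std::printf("%s\n", has_error ? "Has errors" : "No errors");
}

int main()
{
    const int N = 100000000;          // number of elements in each array
    const int M = sizeof(double) * N; // number of bytes in each array
    double *x = (double *) malloc(M);
    double *y = (double *) malloc(M);
    double *z = (double *) malloc(M);

    for (int n = 0; n < N; ++n)
    {
        x[n] = 1.23;
        y[n] = 2.34;
    }

    add(x, y, z, N);
    check(z, 1.23 + 2.34, N);

    free(x);
    free(y);
    free(z);
    return 0;
}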

3.2 Basic framework of simple CUDA programs

For a simple CUDA program written in a single source file, the basic framework is as follows:

header inclusion
const or macro definition
declarations of C++ functions and CUDA kernels

int main()
{
allocate host and device memory
initialize data in host memory
transfer data from host to device
launch (call) kernel to do calculations in the device
transfer data from device to host
free host and device memory
}

definitions of C++ functions and CUDA kernels

We first write a CUDA program add1.cu which does the same calculations as the C++ program add.cpp. This CUDA program can be compiled as follows:

$ nvcc -arch=sm_75 add1.cu

Executing the executable will produce the same output as the C++ program:

No errors

We will describe the CUDA program add1.cu in detail in the following sections.
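
The full listing of add1.cu is likewise not reproduced in this post. The following sketch, consistent with the fragments discussed in the sections below (the constants and the check() function are assumed, as in the add.cpp sketch above), fills in the framework of Section 3.2:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

const double EPSILON = 1.0e-15; // tolerance used when checking the results (assumed)

void __global__ add(const double *x, const double *y, double *z);
void check(const double *z, const double c, const int N);

int main(void)
{
    // allocate host memory and initialize the input arrays
    const int N = 100000000;
    const int M = sizeof(double) * N;
    double *h_x = (double *) malloc(M);
    double *h_y = (double *) malloc(M);
    double *h_z = (double *) malloc(M);
    for (int n = 0; n < N; ++n)
    {
        h_x[n] = 1.23;
        h_y[n] = 2.34;
    }

    // allocate device memory
    double *d_x, *d_y, *d_z;
    cudaMalloc((void **)&d_x, M);
    cudaMalloc((void **)&d_y, M);
    cudaMalloc((void **)&d_z, M);

    // transfer data from host to device
    cudaMemcpy(d_x, h_x, M, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, M, cudaMemcpyHostToDevice);

    // launch the kernel with an execution configuration
    const int block_size = 128;
    const int grid_size = N / block_size;
    add<<<grid_size, block_size>>>(d_x, d_y, d_z);

    // transfer the results from device to host and check them
    cudaMemcpy(h_z, d_z, M, cudaMemcpyDeviceToHost);
    check(h_z, 1.23 + 2.34, N);

    // free host and device memory
    free(h_x);
    free(h_y);
    free(h_z);
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_z);
    return 0;
}

void __global__ add(const double *x, const double *y, double *z)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    z[n] = x[n] + y[n];
}

void check(const double *z, const double c, const int N)
{
    bool has_error = false;
    for (int n = 0; n < N; ++n)
    {
        if (fabs(z[n] - c) > EPSILON)
        {
            has_error = true;
        }
    }
    printf("%s\n", has_error ? "Has errors" : "No errors");
}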

3.2.1 Memory allocation on the device

In our CUDA program, we defined three pointers

double *d_x, *d_y, *d_z;

and used the cudaMalloc() function to allocate memory on the device. This is a CUDA runtime API function. Every CUDA runtime API function begins with cuda. Here is the online manual for all the CUDA runtime functions: https://docs.nvidia.com/cuda/cuda-runtime-api.

The prototype of cudaMalloc() is:

cudaError_t cudaMalloc(void **address, size_t size);

Here, address is the address of the pointer (hence a double pointer), size is the number of bytes to be allocated, and the return value (of type cudaError_t) indicates whether an error occurred during the call; it equals cudaSuccess when there is no error. We will ignore this return value in this chapter and discuss it in the next chapter. In the CUDA program, we have used this function to allocate memory for the three pointers:

cudaMalloc((void **)&d_x, M);
cudaMalloc((void **)&d_y, M);
cudaMalloc((void **)&d_z, M);

Here, M is sizeof(double) * N, where N is the number of elements in an array, and sizeof(double) is the memory size (number of bytes) of a double-precision floating-point number. The type conversion (void **) can be omitted, i.e., we can change the above lines to:

cudaMalloc(&d_x, M);
cudaMalloc(&d_y, M);
cudaMalloc(&d_z, M);

The reason for using a pointer to a pointer as the first parameter of this function is that we need to change the value of the pointer itself, rather than the value in the memory it points to.
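
A brief illustration of this point, using hypothetical host-side helpers rather than CUDA functions:

#include <stdlib.h>

// Receives the pointer by value: the assignment changes only the local copy,
// so the caller's pointer stays unchanged (and the allocated memory leaks).
void alloc_wrong(double *p, size_t size)
{
    p = (double *) malloc(size);
}

// Receives the address of the pointer: writing through it updates the
// caller's pointer, which is what cudaMalloc() needs to do.
void alloc_right(double **p, size_t size)
{
    *p = (double *) malloc(size);
}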

Memory allocated by cudaMalloc() needs to be freed by using the cudaFree() function:

cudaError_t cudaFree(void* address);

Note that the argument here is a pointer, not a double pointer.
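
Accordingly, near the end of main() in the add1.cu sketch above, the three device arrays are freed as follows:

cudaFree(d_x);
cudaFree(d_y);
cudaFree(d_z);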

3.2.2 Data transfer between host and device

We can transfer (copy) data from the host to the device after allocating the device memory; see the two cudaMemcpy calls that follow the cudaMalloc calls in add1.cu. Here we used the CUDA runtime API function cudaMemcpy() with the following prototype:

cudaError_t cudaMemcpy(
    void *dst,
    const void *src,
    size_t count,
    enum cudaMemcpyKind kind);

Here, dst is the address of the destination (transferred to), src is the address of the source (transferred from), count is the number of bytes to be transferred, and kind indicates the direction of the data transfer. The possible values of the enum parameter kind are cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, and cudaMemcpyDefault. The meanings of the first four are obvious; the last one means that the transfer direction is inferred automatically from the pointers dst and src. This automatic inference requires a 64-bit host system that supports unified virtual addressing. Therefore, one can also use cudaMemcpyDefault in the two host-to-device transfers.

After calling the kernel, we use the cudaMemcpy function to transfer data from the device to the host, where the last parameter should be cudaMemcpyDeviceToHost or cudaMemcpyDefault.
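
In the add1.cu sketch above, these transfers read:

cudaMemcpy(d_x, h_x, M, cudaMemcpyHostToDevice);
cudaMemcpy(d_y, h_y, M, cudaMemcpyHostToDevice);
// ... kernel launch ...
cudaMemcpy(h_z, d_z, M, cudaMemcpyDeviceToHost);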

In the add2wrong.cu program, the author intentionally changed cudaMemcpyHostToDevice to cudaMemcpyDeviceToHost. The reader can try to compile and run it to see what happens.

3.2.3 Correspondence between data and threads in CUDA kernel

Before launching the kernel, add1.cu defines its execution configuration: a block size of 128 and a grid size of 10^8/128 = 781250.
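
In code, this corresponds to the following lines from the sketch above:

const int block_size = 128;
const int grid_size = N / block_size;
add<<<grid_size, block_size>>>(d_x, d_y, d_z);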

Now we examine the add functions in add.cpp and add1.cu. We see that there is a for loop in the host function, but not in the kernel. In the host add function, we need to loop over each element of the arrays, and thus need a for loop. In the device add function (the kernel), this for loop is gone. This is because the kernel is executed by many threads, and each thread does the same calculation but with different data, in the so-called "single-instruction-multiple-threads" (SIMT) way. In the kernel, we define a one-dimensional index n in the following way:

const int n = blockDim.x * blockIdx.x + threadIdx.x;

This provides a correspondence between the data index n and the thread indices blockIdx and threadIdx. With this n defined, we can simply use it to access the data stored in the arrays:

z[n] = x[n] + y[n];

We stress again that although each thread executes the same statement, the value of n is different for different threads.
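
Putting the pieces together, the whole kernel (as in the add1.cu sketch above) reads:

void __global__ add(const double *x, const double *y, double *z)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    z[n] = x[n] + y[n];
}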

3.2.4 Some requirements for kernels

Kernels are the most important aspect of CUDA programming. Here we list a few general requirements for CUDA kernels:

  • A kernel must return void.
  • A kernel must be decorated with __global__.
  • The name of a kernel function can be overloaded.
  • The number of parameters of a kernel must be fixed.
  • We can pass values of ordinary types to a kernel; such parameters are visible to every thread. We will learn in Chapter 6 that these parameters are read through the constant cache.
  • Pointers passed to a kernel must point to device memory, unless unified memory is used (to be discussed in Chapter 12).
  • Kernels cannot be class member functions. Usually, one wraps a kernel launch within a class member function, as sketched after this list.
  • A kernel cannot call another kernel, unless dynamic parallelism is used; we will not touch this topic in this book.
  • One must provide an execution configuration when launching a kernel.
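
A minimal sketch of the wrapping idiom mentioned in the list (the class and function names here are hypothetical, not from the book):

void __global__ add(const double *x, const double *y, double *z, const int N)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    if (n < N) // the reason for this guard is explained in Section 3.2.5
    {
        z[n] = x[n] + y[n];
    }
}

class ArrayAdder
{
public:
    // The kernel itself cannot be a member function,
    // but a member function is free to launch it.
    void add_arrays(const double *d_x, const double *d_y, double *d_z, const int N) const
    {
        const int block_size = 128;
        const int grid_size = (N + block_size - 1) / block_size;
        add<<<grid_size, block_size>>>(d_x, d_y, d_z, N);
    }
};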

3.2.5 The necessity of if statements in most kernels

The kernel in add1.cu does not use the parameter N. When N is divisible by blockDim.x, this is OK. Otherwise, we will be in trouble. To show this, we change the value of N from 10^8 to 10^8 + 1. If we want to have enough threads for the task while still using one thread per array element, the grid size should be 10^8/128 + 1 = 781250 + 1 = 781251. In general, when the number of elements is not divisible by the block size, the grid size can be calculated in one of the following ways:

int grid_size = (N - 1) / block_size + 1;
int grid_size = (N + block_size - 1) / block_size;

They are both equivalent to the following statement:

int grid_size = (N % block_size == 0)
              ? (N / block_size)
              : (N / block_size + 1);

Because now the number of threads (10^8 + 128) exceeds the number of elements (10^8 + 1), we must use an if statement to avoid manipulating invalid addresses:

const int n = blockDim.x * blockIdx.x + threadIdx.x;
if (n < N)
{
z[n] = x[n] + y[n];
}

It can be equivalently written as:

const int n = blockDim.x * blockIdx.x + threadIdx.x;
if (n >= N) return;
z[n] = x[n] + y[n];

See the program add3if.cu for the whole code.
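
The changes in add3if.cu relative to add1.cu can be sketched as follows (the exact listing is not reproduced here):

// the kernel now receives the number of elements N
void __global__ add(const double *x, const double *y, double *z, const int N)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    if (n < N)
    {
        z[n] = x[n] + y[n];
    }
}

// in main(): round the grid size up and pass N to the kernel
const int N = 100000000 + 1;
const int block_size = 128;
const int grid_size = (N + block_size - 1) / block_size; // = 781251
add<<<grid_size, block_size>>>(d_x, d_y, d_z, N);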

3.3 User-defined device functions

Kernels can call functions that do not have execution configurations; such functions are called device functions. They are called within kernels and executed on the device. To distinguish the various functions in a CUDA program, some execution space specifiers are introduced:

  • Functions decorated with __global__ are kernels, which are called from the host and executed on the device.
  • Functions decorated with __device__ are device functions, which are called from kernels (or from other device functions) and executed on the device.
  • Functions decorated with __host__ are ordinary host functions, which are called from the host and executed on the host. This specifier can usually be omitted, but it is often used together with __device__ to indicate that a function is simultaneously a host function and a device function; the compiler then generates both versions.
  • It is apparent that __device__ cannot be used together with __global__.
  • It is apparent that __host__ cannot be used together with __global__.
  • __noinline__ and __forceinline__ can be used for a device function to suggest that the compiler treat it as a non-inlined or an inlined function, respectively.

The program add4device.cu demonstrates the definition and use of device functions, using different styles of returning values.
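
A sketch of what add4device.cu may contain; three styles of returning a value from a device function are shown, and the function names are illustrative:

// style 1: return the result by value
double __device__ add_value(const double x, const double y)
{
    return x + y;
}

// style 2: return the result through a pointer parameter
void __device__ add_pointer(const double x, const double y, double *z)
{
    *z = x + y;
}

// style 3: return the result through a reference parameter
void __device__ add_reference(const double x, const double y, double &z)
{
    z = x + y;
}

// a kernel calling the device functions
void __global__ add(const double *x, const double *y, double *z, const int N)
{
    const int n = blockDim.x * blockIdx.x + threadIdx.x;
    if (n < N)
    {
        z[n] = add_value(x[n], y[n]);
        // equivalently: add_pointer(x[n], y[n], &z[n]);
        // or:           add_reference(x[n], y[n], z[n]);
    }
}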


From: https://blog.51cto.com/u_15202985/6022896
