2.1 Data parallelism

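Image processing is a typical example: the same computation is applied to every pixel, and each pixel can be processed independently of the others.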
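As a rough illustration of a per-pixel image computation, the sketch below converts an RGB image to grayscale in plain C. The function name rgbToGray, the interleaved 3-bytes-per-pixel layout, and the integer weights 21/72/7 are assumptions chosen for this example, not something the notes specify. The point is only that iteration i touches pixel i and nothing else, which is what makes the loop data parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Convert an interleaved RGB image (3 bytes per pixel) to grayscale.
 * Each iteration reads only pixel i and writes only gray[i], so the
 * iterations do not depend on one another. */
void rgbToGray(const unsigned char *rgb, unsigned char *gray, size_t numPixels) {
    assert(rgb != NULL && gray != NULL);
    for (size_t i = 0; i < numPixels; ++i) {
        unsigned int r = rgb[3 * i];
        unsigned int g = rgb[3 * i + 1];
        unsigned int b = rgb[3 * i + 2];
        /* Assumed luminance weights in percent (21 + 72 + 7 = 100). */
        gray[i] = (unsigned char)((21u * r + 72u * g + 7u * b) / 100u);
    }
}
```

Because no iteration uses the result of another, each one could in principle be handed to its own GPU thread.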

2.2 CUDA C program structure

CUDA C code consists of two parts: host code (CPU) and device code (GPU).

CUDA program execution flow: Host ⇒ Device ⇒ wait for the device to finish ⇒ Host

2.3 A vector addition kernel

Device-side variable names usually carry the suffix _d.

Host-side variable names usually carry the suffix _h.

The sequential version of the code is as follows:

// Compute vector sum C_h = A_h + B_h
void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
	for (int i = 0; i < n; ++i) {
		C_h[i] = A_h[i] + B_h[i];
	}
} 
int main() { 
	// Memory allocation for arrays A, B, and C
	// I/O to read A and B, N elements each
	... 
	vecAdd(A, B, C, N); 
}

Restructured for parallel execution, the function looks like this:

void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // Part 1: Allocate device memory for A, B, and C
    // Copy A and B to device memory
    ...

    // Part 2: Call kernel - to launch a grid of threads
    // to perform the actual vector addition
    ...

    // Part 3: Copy C from the device memory
    // Free device vectors
}

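In other words: allocate device memory ⇒ copy host memory to device memory ⇒ launch the kernel ⇒ copy the result back from device memory ⇒ free the memory (the _d buffers inside vecAdd, the _h buffers by whoever allocated them)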

2.4 Device global memory and data transfer

What is traditionally called "memory" is the host's main memory. The GPU's counterpart is called global memory, commonly known as video memory. The two names distinguish the two separate memories.

Transferring data from host to device therefore means copying it from host main memory to device global memory.

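cudaMalloc(param_1, size):

　　Plays much the same role as malloc(), but allocates device global memory.

　　The first parameter is the address of a pointer variable that will receive the address of the allocated memory.

　　The second parameter is the size of the allocation in bytes (type size_t).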

Note:

The address of the pointer variable must be cast to void**.

Taking a void** lets cudaMalloc() write the address of the allocated memory into the provided pointer variable regardless of its type.
cudaMalloc() differs from malloc(): malloc() takes a single parameter and returns a pointer to the allocated object, whereas cudaMalloc() hands the pointer back through its first parameter and uses its return value to report an error code.
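The calling convention described above can be tried out in plain C without a GPU. The sketch below is only an analog: allocBuffer and AllocStatus are hypothetical names invented for this example (they are not part of CUDA), and the memory comes from ordinary malloc(). It shows why the caller passes the address of its pointer: the return value is reserved for the status, so the pointer has to come back through the parameter.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical status codes standing in for cudaError_t. */
typedef enum { ALLOC_OK = 0, ALLOC_FAILED = 1 } AllocStatus;

/* Hypothetical allocator shaped like cudaMalloc(): the allocated address
 * is written through ptr, and the return value carries only the status. */
AllocStatus allocBuffer(void **ptr, size_t size) {
    assert(ptr != NULL); /* caller must pass the address of a pointer variable */
    *ptr = malloc(size);
    return (*ptr != NULL) ? ALLOC_OK : ALLOC_FAILED;
}
```

A caller would write `float *A; allocBuffer((void **) &A, n * sizeof(float));` and later release the buffer with `free(A)`, mirroring the cudaMalloc()/cudaFree() pair.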

cudaFree(param_1): works like free(); pass the pointer value itself (not its address) as the argument.

Example of using cudaMalloc() and cudaFree():

float * A_d;
size_t size = n * sizeof(float);
cudaMalloc((void **) &A_d, size);
...
cudaFree(A_d);

cudaMemcpy():

　　The first parameter is a pointer to the destination location for the data object to be copied.

　　The second parameter points to the source location.

　　The third parameter specifies the number of bytes to be copied.

　　The fourth parameter indicates the types of memory involved in the copy: from host to host, from host to device, from device to host, from device to device.

Complete version of the vecAdd code

void vecAdd(float* A_h, float* B_h, float* C_h, int n) { 
    int size = n * sizeof(float); 
    float *A_d, *B_d, *C_d; 

    cudaMalloc((void **) &A_d, size); 
    cudaMalloc((void **) &B_d, size); 
    cudaMalloc((void **) &C_d, size); 

    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice); 
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice); 

    // Kernel invocation code – to be shown later 
    ... 

    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost); 

    cudaFree(A_d); 
    cudaFree(B_d); 
    cudaFree(C_d); 
}

Error Checking

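You mostly have to write the checks yourself. For example, the check below reports an error when the requested allocation exceeds the available memory:

cudaError_t err = cudaMalloc((void **) &A_d, size);
if (err != cudaSuccess) {
    printf("%s in %s at line %d\n", cudaGetErrorString(err),
           __FILE__, __LINE__);
    exit(EXIT_FAILURE);
}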
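Writing this check after every API call gets repetitive, so it is commonly wrapped in a macro. The sketch below shows the pattern in plain C so it can be compiled without the CUDA toolkit. Status, statusString, reportIfError, and CHECK are all hypothetical names made up for this example; in real CUDA code the status type would be cudaError_t and the message would come from cudaGetErrorString(). Unlike the snippet above, this version returns a flag instead of calling exit(), which is a design choice of the sketch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical status type standing in for cudaError_t. */
typedef enum { STATUS_OK = 0, STATUS_OUT_OF_MEMORY = 1 } Status;

/* Hypothetical counterpart of cudaGetErrorString(). */
const char *statusString(Status s) {
    return (s == STATUS_OK) ? "no error" : "out of memory";
}

/* Print the failure with its source location.
 * Returns 1 if s is an error, 0 otherwise. */
int reportIfError(Status s, const char *file, int line) {
    assert(file != NULL);
    if (s != STATUS_OK) {
        fprintf(stderr, "%s in %s at line %d\n", statusString(s), file, line);
        return 1;
    }
    return 0;
}

/* Wrap any call that returns a Status so the check is written only once.
 * __FILE__ and __LINE__ expand at the call site, not inside the helper. */
#define CHECK(call) reportIfError((call), __FILE__, __LINE__)
```

A call site then reads `if (CHECK(someCall())) { /* clean up and bail out */ }`, where someCall is any function returning a Status.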

From： https://www.cnblogs.com/Eternal-LL/p/17583412.html

