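This post gives an example of a reduction algorithm that sums an array. The example sums 20,000,000 (twenty million) integers. The computation is split into two steps. In the first step, the GPU performs a reduction sum within each work-group and writes each work-group's result to an output array. In the second step, the CPU sums that output array. In actual runs, the GPU version turned out to be less efficient than summing directly on the CPU. The code below was run with VS2015 and OpenCL3; the CPU is an AMD A4-9125, and the GPU is the integrated graphics built into that CPU.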
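Before the OpenCL version, the two-step structure can be modelled on the CPU alone. The following is only a minimal sketch for illustration: the function names partialSums and twoStepSum are hypothetical (they do not appear in the original code), and it assumes, as the example does, that the input length is an exact multiple of the work-group size.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Step 1 (done by the GPU in this post): one partial sum per work-group.
// groupSize plays the role of the local work size. Assumption: the input
// length is an exact multiple of groupSize; any remainder would be ignored.
std::vector<int> partialSums(const std::vector<int>& input, std::size_t groupSize)
{
    std::vector<int> result(input.size() / groupSize, 0);
    for (std::size_t group = 0; group < result.size(); ++group) {
        auto first = input.begin() + static_cast<std::ptrdiff_t>(group * groupSize);
        auto last = first + static_cast<std::ptrdiff_t>(groupSize);
        result[group] = std::accumulate(first, last, 0);
    }
    return result;
}

// Step 2 (done by the CPU in this post): add up the per-group results.
int twoStepSum(const std::vector<int>& input, std::size_t groupSize)
{
    const std::vector<int> partial = partialSums(input, groupSize);
    return std::accumulate(partial.begin(), partial.end(), 0);
}
```

With 20,000,000 inputs and a group size of 200, step 1 yields 100,000 partial sums, which matches the size of the output array used in the real program.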
The example needs no separate header file; the complete .cpp file follows:
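// Assumed setup (not stated in the original post): the OpenCL C++ bindings
// (CL/opencl.hpp; older SDKs ship CL/cl2.hpp instead) and OpenCV's core module,
// which is where getTickCount(), getTickFrequency() and int64 appear to come from.
#define CL_HPP_ENABLE_EXCEPTIONS          // build() only throws when this is defined
#define CL_HPP_TARGET_OPENCL_VERSION 300
#include <CL/opencl.hpp>
#include <opencv2/core.hpp>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>

using namespace std;
using namespace cv;

const string kernel = R"(
kernel void reduceSum(global int* num, global int* result, int nCount)
{
    unsigned int lid = get_local_id(0);
    unsigned int bid = get_group_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int localSize = get_local_size(0);

    // Local scratch space; the work-group size must not exceed 1024.
    // nCount is not used: there is no bounds check, so the global size
    // must be an exact multiple of the work-group size.
    local int pData[1024];
    pData[lid] = num[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction: halve the number of active elements each round.
    unsigned int halfSize = localSize;
    while (halfSize > 1) {
        unsigned int odd = halfSize & 1;
        halfSize >>= 1;
        unsigned int lastIndex = halfSize - 1;
        if (lid <= lastIndex) {
            pData[lid] += pData[lid + halfSize];
        }
        // Odd count: the last active work-item also adds the leftover element.
        if (lid == lastIndex && odd == 1) {
            pData[lid] += pData[lid + halfSize + 1];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Work-item 0 writes this work-group's sum.
    if (lid == 0) {
        result[bid] = pData[0];
    }
})";

int main()
{
    cl::Program program(kernel);
    try {
        program.build("-cl-std=CL3.0");
    }
    catch (...) {
        // Print the build log of every device, then stop.
        cl_int buildErr = CL_SUCCESS;
        auto buildInfo = program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(&buildErr);
        for (auto &entry : buildInfo) {
            std::cerr << entry.second << std::endl << std::endl;
        }
        return 1;
    }

    // GPU version: per-work-group sums on the device, final sum on the CPU.
    int64 t1, t2;
    t1 = getTickCount();
    auto kernelFunc = cl::KernelFunctor<cl::Buffer, cl::Buffer, int>(program, "reduceSum");
    vector<int> array1(20000000, 1);
    vector<int> temp(100000, 0);          // 20000000 / 200 work-groups
    array1[32] = 100;
    cl::Buffer inputBuffer(array1.begin(), array1.end(), true);
    cl::Buffer outputBuffer(temp.begin(), temp.end(), false);
    kernelFunc(cl::EnqueueArgs(cl::NDRange(20000000), cl::NDRange(200)),
               inputBuffer, outputBuffer, 20000000);
    cl::copy(outputBuffer, temp.begin(), temp.end());
    int sum1 = std::accumulate(temp.begin(), temp.end(), 0);
    t2 = getTickCount();
    cout << (t2 - t1) / getTickFrequency() * 1000 << "ms" << endl;

    // CPU version: sum the same data directly.
    vector<int> array2(20000000, 1);
    array2[32] = 100;
    t1 = getTickCount();
    int sum2 = std::accumulate(array2.begin(), array2.end(), 0);
    t2 = getTickCount();
    cout << (t2 - t1) / getTickFrequency() * 1000 << "ms" << endl;

    // Print both sums so the results can be compared and the CPU loop
    // cannot be optimized away as unused.
    cout << "GPU sum = " << sum1 << ", CPU sum = " << sum2 << endl;

    // Wait for input so the console window stays open.
    int c;
    cin >> c;
    return 0;
}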
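The timing calls in the listing above (getTickCount and getTickFrequency) look like OpenCV's. If OpenCV is not available, the same measurement can be sketched with the standard library alone. The helper names elapsedMs and timeCpuSum below are hypothetical and are not part of the original program.

```cpp
#include <chrono>
#include <numeric>
#include <vector>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in milliseconds, measured with a monotonic clock.
template <typename Work>
double elapsedMs(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Example use: time only the CPU summation, not the construction of the vector.
// The sum is returned through sumOut so the work cannot be discarded as unused.
double timeCpuSum(const std::vector<int>& values, int& sumOut)
{
    return elapsedMs([&] { sumOut = std::accumulate(values.begin(), values.end(), 0); });
}
```

Wrapping only the statements of interest in the lambda makes it explicit which work each reported time includes.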
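The kernel in the code above allows for the number of elements still being reduced to become odd partway through the reduction. The variable odd records whether the current count is odd, and if it is, the kernel handles that case specially by adding in the one extra element at the end. Tested in a Release build, the GPU is far slower than the CPU. Note that the GPU timing in the listing also covers allocating the vectors, creating the buffers from the host data, and copying the results back, whereas the CPU timing covers only the call to std::accumulate. My skills fall short here: I do not know how to optimize such a simple operation so that the GPU beats the CPU, so please bear with me.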
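To make the odd-count handling easier to follow, here is a sequential C++ model of the kernel's in-group loop. It is a sketch for illustration only: the name reduceGroup is hypothetical, the inner for-loop stands in for the work-items that are active in one round, and the empty-input guard is an addition that the kernel itself does not need.

```cpp
#include <cstddef>
#include <vector>

// Sequential model of the in-group reduction. Taking `data` by value
// mimics copying the group's slice of the input into local memory.
int reduceGroup(std::vector<int> data)
{
    if (data.empty()) {
        return 0;  // guard added for this sketch only
    }
    std::size_t halfSize = data.size();
    while (halfSize > 1) {
        const bool odd = (halfSize & 1) != 0;
        halfSize >>= 1;
        // Each "work-item" lid < halfSize adds its partner from the upper half.
        for (std::size_t lid = 0; lid < halfSize; ++lid) {
            data[lid] += data[lid + halfSize];
        }
        // Odd count: the element at index 2 * halfSize has no partner,
        // so the last active slot absorbs it.
        if (odd) {
            data[halfSize - 1] += data[2 * halfSize];
        }
    }
    return data[0];
}
```

For a work-group of 200 elements the active count goes 200, 100, 50, 25, 12, 6, 3, 1, so the odd branch is taken twice, at 25 and at 3.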
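In addition, while debugging this code in a Qt application, the Qt program sometimes could not be closed, even though the data were correct and there was no infinite loop. I do not yet know why. When debugged in a console program, it exits normally.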
From: https://www.cnblogs.com/mengxiangdu/p/17908342.html