
CUDA memories


Global

There's a large amount of global memory. It's slower to access than other memory like shared memory and registers. All running threads can read and write global memory, and so can the CPU. The functions cudaMalloc, cudaFree, cudaMemcpy and cudaMemset all deal with global memory. This is the main memory store of the GPU; every byte is addressable. It is persistent across kernel calls. For cards with compute capability 2.0 or higher, it is cached.
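
A minimal sketch of how those calls fit together; the kernel, sizes and names below are made up purely for illustration:

```
#include <cuda_runtime.h>
#include <cstdio>

// Trivial kernel: each thread updates one element of a global-memory array.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *d_data = nullptr;

    cudaMalloc(&d_data, n * sizeof(int));     // allocate global memory
    cudaMemset(d_data, 0, n * sizeof(int));   // clear it from the host

    addOne<<<(n + 255) / 256, 256>>>(d_data, n);

    int first = -1;
    // Global memory persists across kernel calls, so the host can read results back.
    cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost);
    printf("first element = %d\n", first);

    cudaFree(d_data);                         // release global memory
    return 0;
}
```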

Local

This is also part of the main memory of the GPU, so it's generally slow. Local memory is used automatically by NVCC when we run out of registers or when registers cannot be used; this is called register spilling. It happens if there are too many variables per thread to keep in registers, or if kernels use structures. Also, arrays that aren't indexed with constants use local memory, since registers don't have addresses and an addressable memory space must be used. The scope of local memory is per thread. Local memory is cached in an L1 and then an L2 cache, so register spilling may not mean a dramatic performance decrease on compute capability 2.0 and up.
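
A hypothetical kernel that typically ends up using local memory, because its per-thread array is indexed with a value only known at run time; compiling with nvcc -Xptxas -v reports how many registers and how much local memory each kernel uses:

```
// counts[] is private to each thread. Because it is indexed dynamically,
// the compiler generally cannot keep it in registers (registers have no
// addresses), so it is usually placed in cached local memory instead.
__global__ void bucketize(const int *in, int *out, int n) {
    int counts[64];
    for (int b = 0; b < 64; ++b) counts[b] = 0;   // constant indices alone could stay in registers

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int bucket = in[i] & 63;   // runtime index forces an addressable memory space
        counts[bucket] += 1;
        out[i] = counts[bucket];
    }
}
```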

Caches L1 and L2

On compute capability 2.0 and up, there's an L1 cache per multiprocessor. There's also an L2 cache which is shared between all multiprocessors. Global and local memory use these.

The L1 is very fast, at shared memory speeds. The L1 and shared memory are actually the same bytes; they can be configured as 48 KB of shared memory and 16 KB of L1, or as 16 KB of shared memory and 48 KB of L1. All global memory accesses go through the L2 cache, including those by the CPU. You can turn caching on and off with a compiler option. Texture and constant memory have their own separate caches.
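
The 48 KB / 16 KB split described above is chosen through the runtime API; a sketch, where myKernel is just a placeholder (the compiler option for turning the L1 off for global loads is typically -Xptxas -dlcm=cg):

```
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* ... */ }

int main() {
    // Device-wide preference: 48 KB of L1 and 16 KB of shared memory.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

    // Per-kernel override: give myKernel 48 KB of shared memory and 16 KB of L1.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    float *d = nullptr;
    cudaMalloc(&d, 256 * sizeof(float));
    myKernel<<<1, 256>>>(d);
    cudaFree(d);
    return 0;
}
```
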
Constant

This memory is also part of the GPU's main memory. It has its own cache, not related to the L1 and L2 of global memory. All threads have access to the same constant memory, but they can only read it; they can't write to it. The CPU (host) sets the values in constant memory before launching the kernel. It is very fast (register and shared memory speeds) as long as all running threads in a warp read exactly the same address. It's small: there's only 64 KB of constant memory, and all running threads share it. In graphics programming, this memory holds constants like the model, view and projection matrices.
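
A minimal sketch of constant memory in the graphics-style use mentioned above; the matrix name and contents are made up for illustration:

```
#include <cuda_runtime.h>

// Constant memory is declared at file scope; 64 KB is available in total.
__constant__ float projMatrix[16];

__global__ void transform(const float4 *in, float4 *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 v = in[i];
    float4 r;
    // Every thread of the warp reads the same constant address at each step,
    // which is the fast broadcast case for the constant cache.
    r.x = projMatrix[0]  * v.x + projMatrix[1]  * v.y + projMatrix[2]  * v.z + projMatrix[3]  * v.w;
    r.y = projMatrix[4]  * v.x + projMatrix[5]  * v.y + projMatrix[6]  * v.z + projMatrix[7]  * v.w;
    r.z = projMatrix[8]  * v.x + projMatrix[9]  * v.y + projMatrix[10] * v.z + projMatrix[11] * v.w;
    r.w = projMatrix[12] * v.x + projMatrix[13] * v.y + projMatrix[14] * v.z + projMatrix[15] * v.w;
    out[i] = r;
}

int main() {
    float identity[16] = {1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1};
    // The host fills constant memory before launching the kernel; the GPU can only read it.
    cudaMemcpyToSymbol(projMatrix, identity, sizeof(identity));
    // ... allocate in/out with cudaMalloc and launch transform<<<grid, block>>>(...) here ...
    return 0;
}
```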

Texture

Texture memory also resides in device memory. It has its own cache. It is read-only to the GPU; the CPU sets it up. Texture memory has many extra addressing tricks because it is designed for indexing (called texture fetching) and interpolating pixels in a 2D image. The texture cache has a lower bandwidth than global memory's L1, so it might be better to stick with the L1.

Shared

Shared memory is very fast (register speeds). It is shared between the threads of each block. Bank conflicts can slow access down. It's fastest when all threads read from different banks, or when all threads of a warp read exactly the same value. Successive dwords (4 bytes) reside in different banks. There are 16 banks on compute capability 1.x and 32 on 2.0. Shared memory is used to enable fast communication between the threads of a block.
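
A common shared-memory pattern, sketched here as a block-wide sum purely for illustration: each block stages values in a __shared__ tile, synchronizes, and then the threads of the block read each other's values:

```
#define BLOCK 256   // launch blockSum with BLOCK threads per block

__global__ void blockSum(const float *in, float *blockSums, int n) {
    // One tile per block, visible to every thread of that block.
    __shared__ float tile[BLOCK];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();   // make all writes to the tile visible to the whole block

    // Tree reduction; consecutive threads touch consecutive dwords,
    // which fall in different banks, so there are no bank conflicts.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];   // one result per block back to global memory
}
```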

 

Register

Registers are the fastest memory on the GPU. The variables we declare in a kernel will use registers unless we run out or they can't be stored in registers; then local memory is used. Register scope is per thread. Unlike a CPU, a GPU has thousands of registers. Carefully getting by with a few registers instead of using 50 per thread can easily double the number of concurrent blocks the GPU can execute, and therefore increase performance substantially.
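
Register pressure can be capped per kernel so more blocks fit on each multiprocessor; a sketch, where the bounds of 128 threads and 8 resident blocks are arbitrary example values. The same thing can be done globally with nvcc --maxrregcount=N, and -Xptxas -v shows how many registers each kernel actually uses:

```
// Ask the compiler to keep register usage low enough that at least 8 blocks
// of 128 threads can be resident per multiprocessor; anything that no longer
// fits in registers spills to (cached) local memory.
__global__ void __launch_bounds__(128, 8)
scale(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}
```
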
Summary

From: https://www.cnblogs.com/yuxiaolan/p/17707012.html
