Symptom
While running a program, I hit the error torch.cuda.OutOfMemoryError: CUDA out of memory.
Since the model is far smaller than the GPU's memory, I checked the GPU status with:
$ nvidia-smi
# or refresh automatically every two seconds, highlighting changes
$ watch -n 2 -d nvidia-smi
The output showed high memory usage but low GPU utilization, with no process listed as the owner:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:12:00.0 Off | N/A |
| 38% 28C P8 20W / 350W | 12120MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
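The same check can be scripted instead of read off the table by eye. The following is a minimal sketch, assuming the CSV query mode of nvidia-smi; the function name flag_idle_gpus and both thresholds (30% memory, 5% utilization) are hypothetical choices for illustration, not values from the original post.

```shell
# flag_idle_gpus (hypothetical helper): reads lines of the form
#   index, memory.used, memory.total, utilization.gpu
# (MiB and percent, without units) on stdin and prints every GPU whose
# memory is substantially occupied while its utilization is near zero.
# $1 = minimum memory usage in percent (assumed default: 30)
# $2 = maximum utilization in percent  (assumed default: 5)
flag_idle_gpus() {
    awk -F', *' -v mem_pct="${1:-30}" -v util_pct="${2:-5}" '
        $3 > 0 && ($2 / $3) * 100 >= mem_pct && $4 <= util_pct {
            printf "GPU %s: %d/%d MiB used, %d%% utilization\n", $1, $2, $3, $4
        }'
}

# Typical use on a machine with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu \
#              --format=csv,noheader,nounits | flag_idle_gpus
```

A GPU reported by this filter is only a candidate: a job that is loading data or waiting between batches can look the same for a short time.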
I suspected that a program that had exited abnormally was still holding GPU memory, so I decided to free it manually.
Solution
Use the fuser tool to find the processes that hold the GPU device files open. If it is not installed, install it with:
# Ubuntu 20.04
$ apt-get install psmisc
Then query the NVIDIA device files:
$ fuser -v /dev/nvidia*
USER PID ACCESS COMMAND
/dev/nvidia5: root kernel mount /dev/nvidia5
root 47329 F...m pt_main_thread
/dev/nvidiactl: root kernel mount /dev/nvidiactl
root 47329 F...m pt_main_thread
/dev/nvidia-uvm: root kernel mount /dev/nvidia-uvm
root 47329 F...m pt_main_thread
/dev/nvidia-uvm-tools:
root kernel mount /dev/nvidia-uvm-tools
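Before killing anything, it can help to confirm what each PID actually is, since the same process shows up once per device file. Below is a minimal sketch, assuming the documented fuser behavior of writing only PIDs to stdout and everything else to stderr; the helper names unique_pids and describe_pids are hypothetical.

```shell
# unique_pids (hypothetical helper): reads whitespace-separated PIDs on
# stdin, possibly repeated, and prints each distinct PID once per line.
unique_pids() {
    tr -s ' \t' '\n' | grep -E '^[0-9]+$' | sort -un
}

# describe_pids (hypothetical helper): for each PID on stdin, prints its
# owner, elapsed running time, and full command line.
describe_pids() {
    while read -r pid; do
        ps -o pid=,user=,etime=,args= -p "$pid"
    done
}

# Typical use (stderr is discarded because it carries the file names):
#   fuser /dev/nvidia* 2>/dev/null | unique_pids | describe_pids
```

For the output above, this would reduce the three entries for PID 47329 to a single line describing pt_main_thread.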
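Kill the corresponding process with kill -9 <PID> (here, PID 47329).
Running nvidia-smi again shows that the memory has been released: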
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:12:00.0 Off | N/A |
| 38% 28C P8 20W / 350W | 1MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
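The following one-liner performs the steps above in one go. Note that it kills every process that holds an NVIDIA device file open, not just the stale one: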
fuser -v /dev/nvidia* |awk '{for(i=1;i<=NF;i++)print "kill -9 " $i;}' | sh
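A gentler variant is to ask each process to exit first and force-kill only if it does not. This is a minimal sketch of that idea rather than part of the original procedure; the function name terminate_pid and the 5-second default timeout are hypothetical.

```shell
# terminate_pid (hypothetical helper): sends SIGTERM to PID $1, waits up
# to $2 seconds (assumed default: 5) for it to exit, then sends SIGKILL.
# Note: pid and timeout are plain shell variables, not function-local.
terminate_pid() {
    pid=$1
    timeout=${2:-5}
    # If the signal cannot be delivered, the process is already gone
    # (or belongs to another user); there is nothing more to do.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -0 "$pid" 2>/dev/null || return 0
    kill -KILL "$pid" 2>/dev/null                # last resort
}

# Typical use:
#   for p in $(fuser /dev/nvidia* 2>/dev/null); do terminate_pid "$p"; done
```

SIGTERM gives a training script the chance to flush checkpoints or logs, which kill -9 does not.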