项目场景 [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 999: unknown error ; GPU=24 :
需要升级之前老的程序,之前的cuda 是10.2
问题描述:
环境
cuda 11.2 (之前是10.2)
onnxruntime-gpu 1.10
python 3.9.7
启动程序的时候
Traceback (most recent call last):
File "/home/aiuser/cover/liheng-foggun/app.py", line 15, in <module>
model = DetectMultiBackend(weights=config.paddle.model_file)
File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/aiuser/cover/liheng-foggun/models/yolo.py", line 37, in __init__
self.session = onnxruntime.InferenceSession(weights, providers=['CUDAExecutionProvider'])
File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 335, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "/home/aiuser/miniconda3/envs/cover/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 379, in _create_inference_session
sess.initialize_session(providers, provider_options, disabled_optimizers)
RuntimeError: /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:122 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE =
cudaError; bool THRW = true] /onnxruntime_src/onnxruntime/core/providers/cuda/cuda_call.cc:116 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*
, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 999: unknown error ; GPU=24 ; hostname=aiserver-sl-01 ; expr=cudaSetDevice(info_.device_id);
原因分析:
1.刚开始以为是onnxruntime-gpu 版本问题 升级到了 1.12 还是报错
2.网上又说是不兼容的问题
3.试试重装下驱动,卸载了11.2 的时候 通过nvidia-smi 发现之前10.2的驱动还存在
4.是因为之前的驱动没有卸载干净
解决方案:
1.卸载10.2
sudo /usr/local/cuda-10.2/bin/cuda-uninstaller
2.安装新驱动
#离线安装 515.57
sudo ./NVIDIA-Linux-x86_64-515.57.run -no-x-check -no-nouveau-check
VIDIA-Linux-x86_64-515.57.run -no-x-check -no-nouveau-check