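Preface
I. Installing the edgeai-torchvision environment
The first thing to understand is that torch is installed into the virtual environment first and torchvision afterwards, and that torchvision is built and installed from source, because "the standard torchvision will not support all the features in this repository." My system CUDA version is 11.7, but edgeai-torchvision currently supports only up to CUDA 11.3, so I installed the PyTorch and torchvision versions that go with CUDA 11.3. According to setup.sh, that means PyTorch 1.10.0 and torchvision 0.11.0; the other dependencies only need to be at versions that work with these.
However, this produces an error: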
RuntimeError: Detected that PyTorch and torchvision were compiled with different CUDA versions. PyTorch has CUDA Version=11.3 and torchvision has CUDA Version=11.7. Please reinstall the torchvision that matches your PyTorch install.
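I tried several approaches and all of them failed. After reading the setup.py code closely, I realized that when torchvision is installed from source it does not link against the CUDA in the virtual environment; it links against the system CUDA version instead.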
edgeai-torchvision/torchvision/extension.py
def _check_cuda_version():
    """
    Make sure that CUDA versions match between the pytorch install and torchvision install
    """
    if not _HAS_OPS:
        return -1
    import torch
    _version = torch.ops.torchvision._cuda_version()
    if _version != -1 and torch.version.cuda is not None:
        tv_version = str(_version)
        if int(tv_version) < 10000:
            tv_major = int(tv_version[0])
            tv_minor = int(tv_version[2])
        else:
            tv_major = int(tv_version[0:2])
            tv_minor = int(tv_version[3])
        t_version = torch.version.cuda
        t_version = t_version.split('.')
        t_major = int(t_version[0])
        t_minor = int(t_version[1])
        if t_major != tv_major or t_minor != tv_minor:
            raise RuntimeError("Detected that PyTorch and torchvision were compiled with different CUDA versions. "
                               "PyTorch has CUDA Version={}.{} and torchvision has CUDA Version={}.{}. "
                               "Please reinstall the torchvision that matches your PyTorch install."
                               .format(t_major, t_minor, tv_major, tv_minor))
    return _version
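To make the string slicing in this check easier to follow, here is a minimal sketch that isolates just the decoding step. It assumes the integer follows the usual CUDA encoding of major * 1000 + minor * 10 (so 11.7 is 11070); the function name decode_cuda_version is hypothetical and not part of torchvision.

```python
def decode_cuda_version(raw_version: int) -> tuple:
    """Split an integer CUDA version into (major, minor).

    Mirrors the slicing in the check above. Assumes the
    major * 1000 + minor * 10 encoding, e.g. 11070 for CUDA 11.7.
    """
    text = str(raw_version)
    if raw_version < 10000:
        # Single-digit major: "9020" -> major "9", minor "2"
        return int(text[0]), int(text[2])
    # Two-digit major: "11070" -> major "11", minor "7"
    return int(text[0:2]), int(text[3])


# The two versions involved in the error message above.
print(decode_cuda_version(11070))  # torchvision build: (11, 7)
print(decode_cuda_version(11030))  # PyTorch build: (11, 3)
```

With these two inputs the minor versions differ, which is exactly the condition that raises the RuntimeError shown earlier.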
/home/xxx/miniconda3/envs/edgeaitv/lib/python3.8/site-packages/torch/utils/cpp_extension.py
def _check_cuda_version(self):
    if CUDA_HOME:
        nvcc = os.path.join(CUDA_HOME, 'bin', 'nvcc')
        cuda_version_str = subprocess.check_output([nvcc, '--version']).strip().decode(*SUBPROCESS_DECODE_ARGS)
        cuda_version = re.search(r'release (\d+[.]\d+)', cuda_version_str)
        if cuda_version is not None:
            cuda_str_version = cuda_version.group(1)
            cuda_ver = packaging.version.parse(cuda_str_version)
            torch_cuda_version = packaging.version.parse(torch.version.cuda)
            if cuda_ver != torch_cuda_version:
                # major/minor attributes are only available in setuptools>=49.6.0
                if getattr(cuda_ver, "major", float("nan")) != getattr(torch_cuda_version, "major", float("nan")):
                    raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
                warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
    else:
        raise RuntimeError(CUDA_NOT_FOUND_MESSAGE)
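The source code around these failures shows the main cause: when torchvision is built and installed from source, the CUDA version is read from nvcc under CUDA_HOME, so the CUDA version in the virtual environment has to match the system CUDA version. The system is currently on CUDA 11.7, but building edgeai-torchvision needs CUDA 11.3, and it has to come from the system installation. I therefore had to install CUDA 11.3 as well, in a way that makes it easy to switch back to CUDA 11.7 later. For the installation procedure, see the post "[Hardware/software environment and tool installation] NVIDIA driver / CUDA version relationship and CUDA installation".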
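Before starting a long source build, it can help to compare the nvcc release against the PyTorch CUDA version by hand. The sketch below is one way to do that, reusing the same "release X.Y" pattern as the cpp_extension.py code above. The helper names and the sample nvcc strings are illustrative assumptions; on a real machine the text would come from running nvcc --version under CUDA_HOME, and the second argument from torch.version.cuda.

```python
import re


def parse_nvcc_release(nvcc_output: str):
    """Return (major, minor) parsed from `nvcc --version` text, or None."""
    match = re.search(r'release (\d+)[.](\d+)', nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def nvcc_matches_torch(nvcc_output: str, torch_cuda: str) -> bool:
    """True when nvcc and PyTorch agree on CUDA major and minor version."""
    torch_major, torch_minor = (int(part) for part in torch_cuda.split('.')[:2])
    return parse_nvcc_release(nvcc_output) == (torch_major, torch_minor)


# Illustrative nvcc output lines (not captured from a real machine).
SYSTEM_117 = "Cuda compilation tools, release 11.7, V11.7.64"
SYSTEM_113 = "Cuda compilation tools, release 11.3, V11.3.109"

print(nvcc_matches_torch(SYSTEM_117, "11.3"))  # mismatch, as in this post
print(nvcc_matches_torch(SYSTEM_113, "11.3"))  # match after switching toolkits
```

Note that this sketch is stricter than the PyTorch check quoted above, which only raises on a major version mismatch and warns on a minor one; the torchvision check at import time is the one that rejects a minor mismatch.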
Error 1:
raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'. `np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
This problem is tied to the NumPy version. Either fix the offending code or install a pinned NumPy version:
1) numpy.int was deprecated in NumPy 1.20 and was removed in NumPy 1.24. You can change it to numpy.int_, or just int.
2) pip3 install numpy==1.19
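If you take the first route and edit the code rather than pinning NumPy, you first need to find where the removed alias is used. The following is a rough sketch of such a scan, not a complete linter: the helper name find_np_int_uses and the sample source are made up for illustration, and it only recognizes the prefixes np and numpy.

```python
import re

# Matches np.int / numpy.int but not np.int32, np.int64 or np.int_,
# because \b requires a non-word character right after "int".
DEPRECATED_INT = re.compile(r'\b(?:np|numpy)\.int\b')


def find_np_int_uses(source: str):
    """Return 1-based line numbers that use the removed np.int alias."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if DEPRECATED_INT.search(line)
    ]


# Hypothetical source text to scan.
SAMPLE = """\
a = np.zeros(3, dtype=np.int)
b = np.zeros(3, dtype=np.int32)
c = numpy.int(4.2)
d = np.int_(7)
"""

print(find_np_int_uses(SAMPLE))  # [1, 3]
```

Each reported line can then be changed to use int (or an explicit width such as np.int64), as the error message recommends.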
Error 2:
packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
    LooseVersion = distutils.version.LooseVersion
AttributeError: module 'distutils' has no attribute 'version'
This is a setuptools version problem: the installed setuptools is too new. Pin an older version:
# Use pip here, not conda uninstall setuptools. When conda uninstalls a package it resolves every package related to it and removes them all, so answering y means the whole environment has to be set up again.
pip3 uninstall setuptools
pip3 install setuptools==59.5.0
II. Testing the environment
1. Image classification
Run the script file directly:
sh run_edgeailite_classification.sh
Alternatively, run the command directly:
python ./references/edgeailite/scripts/train_classification_main.py --dataset_name cifar100_classification --model_name mobilenetv2_tv_x1 --data_path ./data/datasets/cifar100_classification --img_resize 32 --img_crop 32 --rand_scale 0.5 1.0
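Error:
edgeai-torchvision/references/edgeailite/engine/train_classification.py", line 695, in validate
    progress_bar.set_postfix(Epoch='{}'.format(status_str))
TypeError: set_postfix() missing 1 required positional argument: 'postfix'
The cause is that the source calls the function incorrectly. Change the call to pass the string positionally:
progress_bar.set_postfix('Epoch={}'.format(status_str))
The script trains first, then runs quantization training starting from the trained model, and finally validates to estimate the accuracy of the quantized result. Working through this gave me a basic understanding of the implementation logic and overall flow of the classification pipeline.
Each stage produces three files: the trained PyTorch model file, the converted ONNX model file, and the TorchScript model file.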
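Returning to the set_postfix error above, the sketch below reproduces the same kind of TypeError with a small stand-in class. MinimalProgressBar is hypothetical: its signature (one required positional postfix plus arbitrary keyword arguments) is an assumption chosen because it yields this error message, and it is not copied from the edgeai-torchvision source.

```python
class MinimalProgressBar:
    """Hypothetical stand-in for the progress bar used by the training script."""

    def __init__(self):
        self.postfix = ''

    def set_postfix(self, postfix, **kwargs):
        # `postfix` is required and positional; extra keywords are ignored.
        self.postfix = postfix


bar = MinimalProgressBar()
status_str = '3/10'

keyword_call_error = None
try:
    # Original style of call: only a keyword argument, so `postfix` is missing.
    bar.set_postfix(Epoch='{}'.format(status_str))
except TypeError as err:
    keyword_call_error = str(err)

# Fixed call: the whole label is passed as the positional string.
bar.set_postfix('Epoch={}'.format(status_str))

print(keyword_call_error)
print(bar.postfix)  # Epoch=3/10
```

The keyword-only call fails because Epoch is swallowed by the keyword catch-all and nothing is left to fill postfix; formatting the label into one string sidesteps that.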
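2. Semantic segmentation
Adjust the configuration parameters to match your hardware and software environment, then run the script file: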
sh run_edgeailite_segmentation.sh
Error 1:
edgeai-torchvision/torchvision/edgeailite/xvision/datasets/cityscapes_plus.py", line 519, in cityscapes_segmentation
    train_split = CityscapesDataLoader(dataset_config, root, split_name, gt, transforms=transforms[0],
TypeError: __init__() got an unexpected keyword argument 'annotation_prefix'
For background on keyword arguments, see the CSDN post "python *args和**kwargs详解" (惊瑟的博客). Replacing the failing line with a call that does not pass the annotation_prefix argument (check earlier versions of the code) solves the problem.
Use
train_dataset, val_dataset = xvision.datasets.__dict__[args.dataset_name](args.dataset_config, args.data_path, split=split_arg, transforms=transforms)
in place of the original
train_dataset, val_dataset = xvision.datasets.__dict__[args.dataset_name](args.dataset_config, args.data_path, split=split_arg, transforms=transforms, annotation_prefix=args.annotation_prefix)
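The fix above simply deletes the extra keyword. As an alternative that is not what this post did, a caller can drop keywords the target does not accept by inspecting its signature. The sketch below shows that idea with the standard inspect module; call_with_supported_kwargs and old_dataset_factory are hypothetical names used only for illustration.

```python
import inspect


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call func, silently dropping keyword arguments it does not declare."""
    params = inspect.signature(func).parameters
    accepts_any_keyword = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not accepts_any_keyword:
        kwargs = {key: value for key, value in kwargs.items() if key in params}
    return func(*args, **kwargs)


def old_dataset_factory(dataset_config, root, split=None, transforms=None):
    """Hypothetical older factory that knows nothing about annotation_prefix."""
    return dataset_config, root, split, transforms


result = call_with_supported_kwargs(
    old_dataset_factory,
    'cfg', './data',
    split='train', transforms=None,
    annotation_prefix='gt',  # dropped: not in the signature
)
print(result)  # ('cfg', './data', 'train', None)
```

Silently dropping arguments can hide real mistakes, so for a one-off mismatch like this one, editing the call as shown above is the simpler choice.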
Error 2:
AttributeError: module 'PIL.Image' has no attribute 'ANTIALIAS'
Cause: see the CSDN post "AttributeError: module 'PIL.Image' has no attribute 'ANTIALIAS'" (软件测试大叔的博客).
The ANTIALIAS constant was removed in Pillow 10.0.0. Use PIL.Image.LANCZOS or PIL.Image.Resampling.LANCZOS instead. (This is exactly the same algorithm that ANTIALIAS referred to; it just can no longer be accessed under the name ANTIALIAS.) Alternatively, downgrade to an older Pillow version:
python -c "import PIL; print(PIL.__version__)"
pip uninstall -y Pillow
pip install Pillow==9.5.0
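If you would rather patch the code than downgrade Pillow, a small compatibility helper can pick the right constant on either side of the change. This is a minimal sketch: pick_lanczos is a hypothetical helper, and the two SimpleNamespace objects are fakes standing in for PIL.Image so the example runs without Pillow installed.

```python
from types import SimpleNamespace


def pick_lanczos(image_module):
    """Return the LANCZOS resampling filter from a PIL.Image-like module.

    Prefers image_module.Resampling.LANCZOS when the Resampling enum
    exists, and falls back to image_module.LANCZOS otherwise.
    """
    resampling = getattr(image_module, 'Resampling', None)
    if resampling is not None:
        return resampling.LANCZOS
    return image_module.LANCZOS


# Fakes standing in for PIL.Image in newer and older Pillow releases.
newer_pillow = SimpleNamespace(Resampling=SimpleNamespace(LANCZOS=1), LANCZOS=1)
older_pillow = SimpleNamespace(LANCZOS=1, ANTIALIAS=1)

print(pick_lanczos(newer_pillow), pick_lanczos(older_pillow))
```

In real code you would pass the actual module, for example img.resize(size, pick_lanczos(PIL.Image)), wherever the source currently uses PIL.Image.ANTIALIAS.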
III. Design tasks
References
1. Version relationships between torch, torchvision, and CUDA for installation.
End
Tags: edgeai, torchvision, tv, int, version, CUDA, hardware and software, cuda From: https://www.cnblogs.com/happyamyhope/p/17635633.html