[Hardware/Software Environment and Tool Installation] Using edgeai-torchvision

Tags: edgeai torchvision tv int version CUDA hardware/software cuda

Preface

I. Installing the edgeai-torchvision environment

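The first thing to understand is the install order: torch goes into the virtual environment first, torchvision second. torchvision is built and installed from source, because "the standard torchvision will not support all the features in this repository." My system CUDA version is 11.7, but edgeai-torchvision currently supports only up to CUDA 11.3, so I installed the PyTorch and torchvision versions built for CUDA 11.3. According to setup.sh, these are PyTorch 1.10.0 and torchvision 0.11.0. The other dependencies only need to be at versions that work with them.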

However, this fails with the following error:

RuntimeError: Detected that PyTorch and torchvision were compiled with different CUDA versions. PyTorch has CUDA Version=11.3 and torchvision has CUDA Version=11.7. Please reinstall the torchvision that matches your PyTorch install.

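I tried several approaches and all of them failed. After reading the setup.py code closely, I realized that building torchvision from source does not link against the CUDA toolkit in the virtual environment but against the system CUDA version.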

edgeai-torchvision/torchvision/extension.py

def _check_cuda_version():
    """
    Make sure that CUDA versions match between the pytorch install and torchvision install
    """
    if not _HAS_OPS:
        return -1
    import torch
    _version = torch.ops.torchvision._cuda_version()
    if _version != -1 and torch.version.cuda is not None:
        tv_version = str(_version)
        if int(tv_version) < 10000:
            tv_major = int(tv_version[0])
            tv_minor = int(tv_version[2])
        else:
            tv_major = int(tv_version[0:2])
            tv_minor = int(tv_version[3])
        t_version = torch.version.cuda
        t_version = t_version.split('.')
        t_major = int(t_version[0])
        t_minor = int(t_version[1])
        if t_major != tv_major or t_minor != tv_minor:
            raise RuntimeError("Detected that PyTorch and torchvision were compiled with different CUDA versions. "
                               "PyTorch has CUDA Version={}.{} and torchvision has CUDA Version={}.{}. "
                               "Please reinstall the torchvision that matches your PyTorch install."
                               .format(t_major, t_minor, tv_major, tv_minor))
    return _version
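The integer returned by torch.ops.torchvision._cuda_version() is decoded above by slicing its string form. As a minimal sketch, assuming the integer follows the CUDA_VERSION convention of major * 1000 + minor * 10 (so CUDA 11.3 is 11030), the same decoding can be done arithmetically. The helper name decode_cuda_version is hypothetical and is not part of torchvision.

```python
def decode_cuda_version(encoded):
    """Split an encoded CUDA version such as 11030 into (major, minor).

    Assumes the CUDA_VERSION convention: major * 1000 + minor * 10.
    """
    major = encoded // 1000
    minor = (encoded % 1000) // 10
    return major, minor


# 11030 is what a torchvision built against CUDA 11.3 would report,
# 11070 what a build against the system CUDA 11.7 would report.
for encoded in (9020, 11030, 11070):
    print(encoded, decode_cuda_version(encoded))
```

With the error in this post, torchvision reported the equivalent of 11.7 while torch.version.cuda was "11.3", which is exactly the mismatch the check rejects.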

/home/xxx/miniconda3/envs/edgeaitv/lib/python3.8/site-packages/torch/utils/cpp_extension.py

def _check_cuda_version(self):
        if CUDA_HOME:
            nvcc = os.path.join(CUDA_HOME, 'bin', 'nvcc')
            cuda_version_str = subprocess.check_output([nvcc, '--version']).strip().decode(*SUBPROCESS_DECODE_ARGS)
            cuda_version = re.search(r'release (\d+[.]\d+)', cuda_version_str)
            if cuda_version is not None:
                cuda_str_version = cuda_version.group(1)
                cuda_ver = packaging.version.parse(cuda_str_version)
                torch_cuda_version = packaging.version.parse(torch.version.cuda)
                if cuda_ver != torch_cuda_version:
                    # major/minor attributes are only available in setuptools>=49.6.0
                    if getattr(cuda_ver, "major", float("nan")) != getattr(torch_cuda_version, "major", float("nan")):
                        raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))
                    warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))

        else:
            raise RuntimeError(CUDA_NOT_FOUND_MESSAGE)

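The source code at these failure points shows the main cause: when torchvision is built from source, the CUDA version is taken from nvcc under CUDA_HOME, so the CUDA version used by the virtual environment must match the system CUDA version. The system is currently on CUDA 11.7, while building edgeai-torchvision needs CUDA 11.3, and that version must come from the system. CUDA 11.3 therefore has to be installed system-wide, in a way that makes it easy to switch back to CUDA 11.7 later. For the installation steps, see the post "[Hardware/Software Environment and Tool Installation] nvidia driver/CUDA version relationship and CUDA installation".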
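The build-time check quoted from cpp_extension.py raises an error only when the major versions differ. A minor mismatch such as 11.7 versus 11.3 only produces a warning, which is why the build succeeds and the failure appears later in torchvision's own check at import time. The sketch below mirrors that logic on plain strings so it runs without torch or nvcc. The function names and the sample nvcc text are illustrative assumptions, not PyTorch APIs.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)[.](\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def classify_mismatch(nvcc_output, torch_cuda):
    """Return 'match', 'warning', 'error' or 'unknown' for a toolkit/torch pair."""
    nvcc = parse_nvcc_release(nvcc_output)
    if nvcc is None:
        return "unknown"
    torch_major, torch_minor = (int(part) for part in torch_cuda.split(".")[:2])
    if nvcc == (torch_major, torch_minor):
        return "match"
    if nvcc[0] != torch_major:
        return "error"    # major mismatch: the build is refused
    return "warning"      # minor mismatch: the build continues with a warning


# Illustrative nvcc output for a system toolkit at CUDA 11.7.
sample = "Cuda compilation tools, release 11.7, V11.7.64"
print(classify_mismatch(sample, "11.3"))
```

On a real machine the first argument would be the output of running nvcc from CUDA_HOME, and the second would be torch.version.cuda.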

Error 1:

    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.

This problem depends on the numpy version. There are two ways to fix it:

1) numpy.int was deprecated in NumPy 1.20 and was removed in NumPy 1.24.
   You can change it to numpy.int_, or just int.

2) Install a numpy version that still has the alias: pip3 install numpy==1.19
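As a quick way to tell which side of that boundary an environment is on, the sketch below checks a version string against the 1.24 removal. The helper name has_np_int_alias is hypothetical. It takes a plain string so it can be tried without numpy installed.

```python
def has_np_int_alias(numpy_version):
    """Return True if this NumPy version still provides the np.int alias.

    The alias was removed in NumPy 1.24, so anything older still has it.
    """
    major, minor = (int(part) for part in numpy_version.split(".")[:2])
    return (major, minor) < (1, 24)


for version in ("1.19.0", "1.23.5", "1.24.0"):
    print(version, has_np_int_alias(version))
```

In a real environment the argument would be numpy.__version__.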

Error 2:

packages/torch/utils/tensorboard/__init__.py", line 4, in <module>
    LooseVersion = distutils.version.LooseVersion
AttributeError: module 'distutils' has no attribute 'version'

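This is a setuptools version problem: the installed version is too new. See the posts "setuptools版本 AttributeError: module 'distutils' has no attribute 'version' 解决方案" and "AttributeError: module 'distutils' has no attribute 'version'".

# Use pip here, not `conda uninstall setuptools`: when conda uninstalls a package it resolves the related packages and removes all of them, so answering y means the whole environment has to be set up again.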
pip3 uninstall setuptools
pip3 install setuptools==59.5.0

II. Testing the environment

1. Image classification

Run the script file directly:

sh run_edgeailite_classification.sh

Alternatively, run the command directly:

python ./references/edgeailite/scripts/train_classification_main.py --dataset_name cifar100_classification --model_name mobilenetv2_tv_x1 --data_path ./data/datasets/cifar100_classification --img_resize 32 --img_crop 32 --rand_scale 0.5 1.0

Error:

edgeai-torchvision/references/edgeailite/engine/train_classification.py", line 695, in validate
    progress_bar.set_postfix(Epoch='{}'.format(status_str))
TypeError: set_postfix() missing 1 required positional argument: 'postfix'

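The cause is that the function is called incorrectly in the source. Passing the string as a positional argument fixes it: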

progress_bar.set_postfix('Epoch={}'.format(status_str))
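The error message suggests that this set_postfix takes a required positional parameter named postfix and also accepts extra keyword arguments. The sketch below reproduces that behavior with a hypothetical stand-in class. The real progress bar class in edgeai-torchvision is not shown in this post, so the signature here is an assumption inferred from the traceback.

```python
class ProgressBar:
    """Hypothetical stand-in for the progress bar used by the training script."""

    def __init__(self):
        self.postfix = ""

    def set_postfix(self, postfix, **kwargs):
        # 'postfix' is required; extra keyword arguments are accepted but unused.
        self.postfix = postfix


progress_bar = ProgressBar()
status_str = "3/10"

# Original call: Epoch=... lands in **kwargs, so 'postfix' is never supplied.
try:
    progress_bar.set_postfix(Epoch="{}".format(status_str))
    keyword_call_error = None
except TypeError as exc:
    keyword_call_error = str(exc)

# Fixed call: the formatted string is passed as the positional argument.
progress_bar.set_postfix("Epoch={}".format(status_str))

print(keyword_call_error)
print(progress_bar.postfix)
```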

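The script trains first, then runs quantization-aware training starting from the trained model, and finally validates to estimate the accuracy of the quantized result. This is enough to get a basic understanding of the logic and workflow of the classification pipeline.

Each stage produces three files: the trained PyTorch model file, the converted ONNX model file, and the TorchScript model file.

2. Semantic segmentation

Adjust the configuration parameters to match your hardware and software environment, then run the script file: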

sh run_edgeailite_segmentation.sh

Error 1:

edgeai-torchvision/torchvision/edgeailite/xvision/datasets/cityscapes_plus.py", line 519, in cityscapes_segmentation
    train_split = CityscapesDataLoader(dataset_config, root, split_name, gt, transforms=transforms[0],
TypeError: __init__() got an unexpected keyword argument 'annotation_prefix'

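Background on keyword arguments: the CSDN post "python *args和**kwargs详解" (a detailed explanation of *args and **kwargs in Python, from 惊瑟的博客). Replacing the failing line with one that does not pass the annotation_prefix argument (check an earlier version of the code) solves the problem.

Use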

train_dataset, val_dataset = xvision.datasets.__dict__[args.dataset_name](args.dataset_config, args.data_path, split=split_arg, transforms=transforms)

in place of the original

train_dataset, val_dataset = xvision.datasets.__dict__[args.dataset_name](args.dataset_config, args.data_path, split=split_arg, transforms=transforms, annotation_prefix=args.annotation_prefix)
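An alternative to editing the call by hand is to drop keyword arguments that the target function does not declare. This is a minimal sketch using inspect.signature, not the approach taken in this post. The helper call_with_supported_kwargs and the stand-in function old_style_dataset are hypothetical names used only for illustration.

```python
import inspect


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call func, silently dropping keyword arguments it does not accept."""
    params = inspect.signature(func).parameters
    accepts_any = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_any:
        # func declares **kwargs, so everything can be passed through.
        return func(*args, **kwargs)
    supported = {k: v for k, v in kwargs.items() if k in params}
    return func(*args, **supported)


def old_style_dataset(dataset_config, root, split=None, transforms=None):
    """Stand-in for a dataset factory that predates annotation_prefix."""
    return (dataset_config, root, split, transforms)


result = call_with_supported_kwargs(
    old_style_dataset, "cfg", "./data",
    split="train", transforms=None, annotation_prefix="instanceIds",
)
print(result)
```

Dropping arguments silently can hide real configuration mistakes, so removing the argument explicitly, as done above, is the more transparent fix.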

Error 2:

AttributeError: module 'PIL.Image' has no attribute 'ANTIALIAS'

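Cause: see the CSDN post "AttributeError: module 'PIL.Image' has no attribute 'ANTIALIAS'" (软件测试大叔的博客).

The ANTIALIAS constant was removed in Pillow 10.0.0. Use PIL.Image.LANCZOS or PIL.Image.Resampling.LANCZOS instead. (This is exactly the same algorithm that ANTIALIAS referred to; it just can no longer be accessed under the name ANTIALIAS.) Alternatively, downgrade Pillow to an older version. To check the installed version and downgrade:

import PIL
print(PIL.__version__)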

pip uninstall -y Pillow
pip install Pillow==9.5.0
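If downgrading is not an option, the filter can be looked up under whichever name the installed Pillow provides. The sketch below takes the PIL.Image module as a parameter so that it can be exercised without Pillow installed. The helper name resolve_lanczos and the SimpleNamespace stand-ins are illustrative assumptions.

```python
from types import SimpleNamespace


def resolve_lanczos(image_module):
    """Return the LANCZOS resampling filter under whichever name exists."""
    resampling = getattr(image_module, "Resampling", None)
    if resampling is not None and hasattr(resampling, "LANCZOS"):
        return resampling.LANCZOS      # newer Pillow: Image.Resampling.LANCZOS
    if hasattr(image_module, "LANCZOS"):
        return image_module.LANCZOS    # Image.LANCZOS
    return image_module.ANTIALIAS      # very old Pillow: only the legacy alias


# Stand-ins for the PIL.Image module in a new and an old Pillow.
new_style = SimpleNamespace(Resampling=SimpleNamespace(LANCZOS=1), LANCZOS=1)
old_style = SimpleNamespace(ANTIALIAS=1)

print(resolve_lanczos(new_style), resolve_lanczos(old_style))

# Intended use with Pillow installed:
#   from PIL import Image
#   resized = img.resize((w, h), resolve_lanczos(Image))
```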

III. Design tasks

References

1. Version relationships between torch, torchvision, and CUDA

2. github_edgeai-torchvision

3. github_torchvision

End

From: https://www.cnblogs.com/happyamyhope/p/17635633.html
