51c~TensorRT~Collection 1

My original post: https://blog.51cto.com/whaosoft/11744302

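I. TensorRT-LLM: Deployment Best Practices

A detailed walkthrough of deployment practice with TensorRT-LLM (Large Language Model).

A brief re-introduction to TRT-LLM

TensorRT-LLM was already introduced in earlier posts, so I won't repeat that here.

Here is a summary of what TensorRT-LLM does and where it fits:

Figure: trt-llm features and architecture

Like vllm, lmdeploy, and sglang[6], TRT-LLM provides inference support for large models, covering:

Model definitions: predefined model architectures
Runtime scheduling (inflight batching, kv cache reuse)
Kernels (MMHA, FMHA)
Quantization (FP8, INT8, INT4, kv cache)

Let's go through them one by one: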

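Model structure

The model structure is the predefined network definition for llama or other large models; you can reuse it directly.

Once the model is assembled, TensorRT generates the kernels for you. Unlike small models, which go through ONNX, trt-llm builds on an improved TensorRT Python API that is easier to use, easier to build networks with, and more flexible. To be honest, though, it is still somewhat harder than building a model in vllm.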

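Kernel optimization

For large models, optimizing kernels alone is not enough. With small models, the first instinct when optimizing is to tune kernels, but for large models the runtime and scheduling matter just as much.

Kernel optimization directly improves model performance and lowers latency, while the runtime (scheduling) raises overall throughput.

The two kernels used most in trt-llm are MMHA (MaskedMultiheadAttention) and FMHA (FusedMultiheadAttention). Both are fused multi-head attention kernels: FMHA handles the context phase, and MMHA, which is also fused, handles the generation phase. Both were borrowed from FasterTransformer and have gone through many revisions since.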

Runtime scheduling

Runtime scheduling matters too. Beyond basic inflight batching, kv cache optimization is currently the more important topic.
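
To make the throughput benefit of inflight batching concrete, here is a toy scheduler simulation. It is only a sketch under simplifying assumptions: each request costs one step per generated token, the batch has a fixed number of slots, and prefill cost is ignored. The function names are hypothetical and are not part of the TRT-LLM API.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def inflight_batching_steps(output_lengths, batch_size):
    """A finished request frees its slot, which is refilled on the next step."""
    pending = deque(output_lengths)
    active = []  # remaining tokens for each running request
    steps = 0
    while pending or active:
        # Fill any free slots before running the next decoding step.
        while pending and len(active) < batch_size:
            active.append(pending.popleft())
        steps += 1
        # Every active request emits one token; drop the ones that are done.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [8, 2, 2, 2, 8, 2, 2, 2]
print(static_batching_steps(lengths, batch_size=4))
print(inflight_batching_steps(lengths, batch_size=4))
```

With these lengths and four slots, the static schedule needs 16 steps and the inflight schedule needs 10, because short requests no longer wait for the longest request in their batch.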

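Demand for long context keeps growing, so the kv cache takes up more and more memory. That creates a need for kv cache compression, meaning quantization and low-rank methods. Once the context gets very long, the kv cache can need more GPU memory than the model weights themselves, which is why kv cache compression, for example INT4, matters.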
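
A quick back-of-the-envelope calculation shows how the kv cache can outgrow the weights. The sketch below uses the standard size estimate (2 tensors, K and V, per layer per token) and assumes a Llama-2-7B-like shape purely for illustration; the 32k sequence length is a hypothetical workload, not a figure from the talk.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem):
    """Size of the KV cache: K and V (factor 2) for every layer and token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed shape: 32 layers, 32 KV heads, head_dim 128 (no GQA).
fp16_cache = kv_cache_bytes(32, 32, 128, seq_len=32768,
                            batch_size=1, bytes_per_elem=2)
int4_cache = kv_cache_bytes(32, 32, 128, seq_len=32768,
                            batch_size=1, bytes_per_elem=0.5)
fp16_weights = 7e9 * 2  # roughly 7B parameters at 2 bytes each

print(f"FP16 kv cache: {fp16_cache / GIB:.1f} GiB")
print(f"INT4 kv cache: {int4_cache / GIB:.1f} GiB")
print(f"FP16 weights:  {fp16_weights / GIB:.1f} GiB")
```

Under these assumptions a single 32k-token sequence needs 16 GiB of FP16 kv cache, more than the roughly 13 GiB of FP16 weights, while an INT4 cache needs 4 GiB.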

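Changes to the runtime also affect how kernels are implemented; kernels have to be modified to cooperate with the runtime. TP (tensor parallelism) and PP (pipeline parallelism) are also standard and can be considered part of the runtime. TensorRT itself only supports a single GPU; trt-llm adds multi-GPU support, implemented through TensorRT plugins.

Quantization

Finally, quantization. trt-llm supports quite a few quantization schemes, but I'll skip them for now. Quantization deserves its own write-up, and the Open Day had a separate talk dedicated to it.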

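About open source

A common complaint about TRT is that it is not fully open source, so there is not much you can modify.

Figure: what is open and what is only half open in trt-llm

The reason is that many kernels carry hardware-specific optimizations. If they were released, the low-level hardware design logic could be inferred from the code.

Take the FMHA code as an example: within a single configuration you can see many implementations targeting different sm architectures:

There are distinct implementations for different sizes and different GPU architectures, tuned in fine detail and pushed to the limit. In short, the hardware-specific code is not open source, and the rest is.

Also worth mentioning is the runtime code. GptSession was at least open source, but after the switch to Executor it became closed source. If you want a new feature you have to file a request with NVIDIA; you cannot modify it yourself: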

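End-to-end workflow

Here is the end-to-end development flow with trt-llm.

First, the Python API. Compared with the raw TensorRT Python API, TRT-LLM adds another layer on top that follows PyTorch style as closely as possible, which makes new networks easy to build.

Note, however, that the Python API only builds the network. The result is just a TensorRT network definition (similar to the intermediate IR in onnx2trt) and cannot run directly, unlike in PyTorch. You have to build an engine first. Actual execution still happens in C++, as with TensorRT.

TRT-LLM also provides a High Level API; the relationship is similar to that between torch and huggingface:

The flow is still the same: convert the weight format, build the trt-llm network, build the engine, and then run it.

Back to the end-to-end workflow, which consists of the steps mentioned above:

First, convert the checkpoint, turning the original weights into the trt-llm format.
Then load the converted weights into the predefined network structure.
Finally, build the network that now has both its structure and its weights. This differs from PyTorch: PyTorch uses dynamic graphs while trt-llm uses a static graph, so it is less convenient. You must define the network first and then build it to get the final engine, which is the optimized computation graph.

The compiled engine can be tested in Python first. Once it passes, deploy it in C++ and finally serve it through triton.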
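
The three steps map onto three command lines. The sketch below only assembles and prints them; it does not run anything. The script names and flags follow the llama example in the TensorRT-LLM repository as I remember it from mid-2024 releases, so treat them as assumptions and check each tool's --help for your version. All paths are hypothetical.

```python
import shlex


def workflow_commands(hf_model_dir, ckpt_dir, engine_dir, dtype="float16"):
    """Return the convert -> build -> run commands as argument lists."""
    convert = ["python", "convert_checkpoint.py",
               "--model_dir", hf_model_dir,
               "--output_dir", ckpt_dir,
               "--dtype", dtype]
    build = ["trtllm-build",
             "--checkpoint_dir", ckpt_dir,
             "--output_dir", engine_dir,
             "--gemm_plugin", dtype]
    run = ["python", "run.py",
           "--engine_dir", engine_dir,
           "--tokenizer_dir", hf_model_dir,
           "--max_output_len", "50",
           "--input_text", "Hello"]
    return [convert, build, run]


# Hypothetical directories, for illustration only.
for cmd in workflow_commands("./llama-7b-hf", "./tllm_ckpt", "./tllm_engine"):
    print(shlex.join(cmd))
```

Keeping the commands as argument lists makes it easy to hand them to subprocess.run later without shell-quoting problems.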

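Installation

The Open Day also briefly covered installation, probably because installing TRT-LLM really does have a lot of pitfalls...

The first option is to compile the trt-llm source yourself inside docker, which is what I usually do. The upside is that you can modify the source and don't have to worry about the environment; the downside is that it needs a good network connection (if you know, you know).

The second option is to install directly with pip. I recommend this when you already have an environment that runs trt-llm and just want to update the version. It also works if you have a clean environment with trt-llm's dependencies (for example an image pulled from NGC, or the image built in the first option): just pip install inside it.

If you pip install in some other development environment, you may hit incompatibilities with local libraries (for example a self-compiled torch or a different gcc version), and some symbols may not be found, so a clean environment is best.

The third option is to use an image provided on NGC.

NVIDIA NGC (NVIDIA GPU Cloud) is a hub for GPU-optimized software for deep learning, machine learning, and high-performance computing (HPC). It offers containers, pretrained models, model scripts, and industry solutions that help data scientists, developers, and researchers build solutions faster. The images we use for quick development usually come from here.

The NGC images come with both trtllm-triton-backend and trt-llm preinstalled, so the system environment trt-llm needs is already there. Although trt-llm is precompiled in the image, you can still compile other versions yourself later, so it is quite flexible.

The talk closed this part with a summary of the pros and cons of each installation method: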

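Converting weights (checkpoint)

Previously every model in TRT-LLM had its own convert script, which was messy and hard to maintain, so TRT-LLM has now unified the convert interface:

Unifying the checkpoint conversion step brings many benefits:

After conversion, the weights have to be loaded into a model, which requires a model definition. trt-llm ships definitions for a number of popular models and supports them out of the box:

The fourth step, once the weights are in trt-llm format, is the build. One detail: many build parameters affect performance. The official defaults generally work well, but you will still need to tune them for your own requirements, whether the concern is speed or accuracy:

Once the engine is built, you can run it. I suggest testing on the Python side with run.py first. You can then use the other .py scripts or gptManagerBenchmark to evaluate accuracy or performance:

MMLU and other public LLM benchmarks can be used to check the accuracy of the built trt-llm model. Typically you run the PyTorch model and the trt-llm model and compare the two.

TRT-LLM also provides a benchmark tool. gptManagerBenchmark is a precompiled executable dedicated to performance testing, and it can also measure overall throughput with inflight batching enabled: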

How to debug

For debugging there are two loggers, which you enable through an environment variable or a command-line argument.

The loggers provide a lot of useful information for debugging:

Python side: controlled by --log_level in the Python examples (defined in tensorrt_llm/logger.py)
C++ side: less visible and mostly used by developers. Controlled by the TLLM_LOG_LEVEL environment variable (defined in cpp/include/tensorrt_llm/common/logger.h). It can print all function calls at the C++ level, which helps trace the code and locate where an error occurs.
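
As a small illustration of the C++ side, the sketch below prepares an environment with TLLM_LOG_LEVEL set, ready to pass to a child process. The helper name is hypothetical, and the level string "DEBUG" is an assumption about the accepted values; check logger.h in your version for the exact names.

```python
import os


def env_with_cpp_logging(level="INFO", base_env=None):
    """Copy an environment and set TLLM_LOG_LEVEL for the C++ logger."""
    env = dict(os.environ if base_env is None else base_env)
    env["TLLM_LOG_LEVEL"] = level
    return env


# Start from an empty environment here so the example is deterministic.
env = env_with_cpp_logging("DEBUG", base_env={})
print(env)
```

The resulting dictionary can be passed as the env argument of subprocess.run when launching run.py or a benchmark binary, so the verbose logging applies only to that one process.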

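Compilation

The talk also covered speeding up compilation. After modifying some source files, recompiling can take a long time.

For example, if you change a .h header that many C++ files include, all of those C++ files will in principle be recompiled. On top of that, trt-llm has many kernels to compile, so the build takes a long time.

The official recommendations include:

When testing on a particular GPU, you don't need to compile for every CUDA architecture. Compile only for the architecture of the card you are using.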
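
A minimal sketch of building for a single architecture: it turns a compute capability such as (8, 0) into the "80-real" string and assembles the build command. The --cuda_architectures flag of scripts/build_wheel.py is taken from the TensorRT-LLM build docs as I recall them, so verify it against your checkout; the helper names are hypothetical.

```python
def cuda_arch_flag(major, minor):
    """Format a compute capability, e.g. (8, 0) -> '80-real'."""
    return f"{major}{minor}-real"


def build_wheel_command(major, minor):
    """Assemble a build command restricted to one CUDA architecture."""
    return ["python3", "./scripts/build_wheel.py",
            "--cuda_architectures", cuda_arch_flag(major, minor)]


# A100 is compute capability 8.0, H100 is 9.0.
print(" ".join(build_wheel_command(8, 0)))
print(" ".join(build_wheel_command(9, 0)))
```

In practice the (major, minor) pair can come from torch.cuda.get_device_capability() on the machine you are building for.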

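Tracking down issues

Options that may affect accuracy:

A model trained in BF16 but run in FP16 (or vice versa): on small models the impact may be minor, but on large models it can still cause accuracy problems.
context_fmha vs context_fmha_fp32_acc: the default accumulates in fp16. If you hit accuracy problems, try fp32_acc, at some cost in speed.
Disabling gemm_plugin: we used to enable it by default, mainly because it speeds up the build. TRT-10 has since improved build speed and added FP32 accumulation, so you can also try TensorRT's built-in gemm in search of better performance. Both are worth trying.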
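
The BF16/FP16 mismatch comes down to range versus precision, which a few lines of standard-library Python can show. This is a sketch: FP16 uses the struct module's half-precision format, and BF16 is emulated by truncating a float32 to its top 16 bits, which ignores the rounding real hardware performs.

```python
import math
import struct


def to_fp16(x):
    """Round-trip x through IEEE half precision; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


def to_bf16(x):
    """Emulate bfloat16 by keeping only the top 16 bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


for value in (70000.0, 1.001):
    print(value, "fp16:", to_fp16(value), "bf16:", to_bf16(value))
```

70000.0 overflows FP16 (largest finite value 65504) but survives in BF16 as 69632.0, while 1.001 keeps more digits in FP16 (1.0009765625) than in BF16 (1.0). Activations that were fine in one format during training can therefore overflow or lose precision in the other at inference time.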

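How to add a new model

TRT-LLM adopted the new Python API precisely to make later changes and new models easier.

Because TRT-LLM grew out of TRT, networks are built through the TensorRT Python API. trt-llm improves on that API, though it may still be harder than PyTorch. trt-llm already provides examples; for instance, you can adapt the llama implementation to other models.
The official process for adding a new model is as follows:

Concretely, you follow the llama implementation, along with the basic LLM classes and the layers that implement the internals:

Besides building the model, you also need to implement weight conversion, from the huggingface weight format to the trt-llm weight format: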
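
To give a feel for what weight conversion involves, here is a toy sketch that renames a few huggingface llama keys and fuses the separate q/k/v projections into one matrix. Matrices are plain nested lists so the example runs without dependencies. The target key names only illustrate the trt-llm checkpoint naming style and are assumptions; the real convert script also handles dtype casting, tensor-parallel splitting, and more.

```python
def convert_llama_layer(hf_weights, layer_idx):
    """Map one HF llama layer to trt-llm style names (illustrative only)."""
    src = f"model.layers.{layer_idx}"
    dst = f"transformer.layers.{layer_idx}"

    q = hf_weights[f"{src}.self_attn.q_proj.weight"]
    k = hf_weights[f"{src}.self_attn.k_proj.weight"]
    v = hf_weights[f"{src}.self_attn.v_proj.weight"]

    return {
        # Linear weights are [out_features, in_features], so fusing QKV
        # means concatenating the rows of q, k, and v.
        f"{dst}.attention.qkv.weight": q + k + v,
        f"{dst}.attention.dense.weight":
            hf_weights[f"{src}.self_attn.o_proj.weight"],
        f"{dst}.input_layernorm.weight":
            hf_weights[f"{src}.input_layernorm.weight"],
    }


# Tiny fake checkpoint with 2x2 matrices, for illustration.
fake_hf = {
    "model.layers.0.self_attn.q_proj.weight": [[1, 0], [0, 1]],
    "model.layers.0.self_attn.k_proj.weight": [[2, 0], [0, 2]],
    "model.layers.0.self_attn.v_proj.weight": [[3, 0], [0, 3]],
    "model.layers.0.self_attn.o_proj.weight": [[4, 0], [0, 4]],
    "model.layers.0.input_layernorm.weight": [1, 1],
}
converted = convert_llama_layer(fake_hf, 0)
print(sorted(converted))
```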

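If the official examples lack an implementation for some layer in your model, but that layer can be expressed with the layer interfaces already provided, you can build it with the official Python API:

If the provided layer and functional interfaces don't cover what you need either, your only option is to write a kernel yourself, which is more involved. Following the approach used in TRT, you first define a kernel plugin, write the kernel implementation, and then reference it from outside:

That concludes the TRT-LLM best practices session from the NVIDIA AI Open Day.

Reference

https://www.bilibili.com/video/BV1aT42167mk/?spm_id_from=pageDriver&vd_source=eec038509607175d58cdfe2e824e8ba2[7]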

References

[1] NVIDIA AI Open Day, Summer 2024: https://space.bilibili.com/1320140761/channel/collectiondetail?sid=3446369

[2] TRT-LLM Deployment Best Practices: https://www.bilibili.com/video/BV1MS421d7Jm/

[3] TRT-LLM Deployment Best Practices: https://www.bilibili.com/video/BV1MS421d7Jm/

[4] A first look at TensorRT-LLM (1): running llama on the latest commit, and triton-tensorrt-llm-backend: https://ai.oldpan.me/t/topic/260

[5] A first look at TensorRT-LLM (2): a brief look at its structure, for clearer usage: https://ai.oldpan.me/t/topic/203

[7] https://www.bilibili.com/video/BV1aT42167mk/?spm_id_from=pageDriver&vd_source=eec038509607175d58cdfe2e824e8ba2

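II. Hand-Building TensorRT Networks That Balance Flexibility, Performance, and Debuggability

Anyone who has used TensorRT has probably come across trtexec[1], which quickly and conveniently converts an ONNX model into a TensorRT engine:

./trtexec --onnx=model.onnx

How does it work? That brings in another library, onnx-tensorrt[2], which parses the ONNX model and converts each ONNX op into a TensorRT op to build the engine. onnx-tensorrt is the core of trtexec's model conversion.

Without onnx-tensorrt[3], how would we use TensorRT to accelerate a model?

Fortunately, TensorRT provides an official API[4] for building networks, so you can assemble a network much as you would in PyTorch. The TensorRTx[5] library, for example, contains many TensorRT networks built directly with the API: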

nvinfer1::IHostMemory* buildEngineYolov8n(nvinfer1::IBuilder* builder,
                                          nvinfer1::IBuilderConfig* config, nvinfer1::DataType dt, const std::string& wts_path) {
    std::map<std::string, nvinfer1::Weights> weightMap = loadWeights(wts_path);
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0U);

    /*******************************************************************************************************
    ******************************************  YOLOV8 INPUT  **********************************************
    *******************************************************************************************************/
    nvinfer1::ITensor* data = network->addInput(kInputTensorName, dt, nvinfer1::Dims3{3, kInputH, kInputW});
    assert(data);

    /*******************************************************************************************************
    *****************************************  YOLOV8 BACKBONE  ********************************************
    *******************************************************************************************************/
    nvinfer1::IElementWiseLayer* conv0 = convBnSiLU(network, weightMap, *data, 16, 3, 2, 1, "model.0");
    nvinfer1::IElementWiseLayer* conv1 = convBnSiLU(network, weightMap, *conv0->getOutput(0), 32, 3, 2, 1, "model.1");
    nvinfer1::IElementWiseLayer* conv2 = C2F(network, weightMap, *conv1->getOutput(0), 32, 32, 1, true, 0.5, "model.2");
    nvinfer1::IElementWiseLayer* conv3 = convBnSiLU(network, weightMap, *conv2->getOutput(0), 64, 3, 2, 1, "model.3");
    nvinfer1::IElementWiseLayer* conv4 = C2F(network, weightMap, *conv3->getOutput(0), 64, 64, 2, true, 0.5, "model.4");
    nvinfer1::IElementWiseLayer* conv5 = convBnSiLU(network, weightMap, *conv4->getOutput(0), 128, 3, 2, 1, "model.5");
    nvinfer1::IElementWiseLayer* conv6 = C2F(network, weightMap, *conv5->getOutput(0), 128, 128, 2, true, 0.5, "model.6");
    nvinfer1::IElementWiseLayer* conv7 = convBnSiLU(network, weightMap, *conv6->getOutput(0), 256, 3, 2, 1, "model.7");
    nvinfer1::IElementWiseLayer* conv8 = C2F(network, weightMap, *conv7->getOutput(0), 256, 256, 1, true, 0.5, "model.8");
    nvinfer1::IElementWiseLayer* conv9 = SPPF(network, weightMap, *conv8->getOutput(0), 256, 256, 5, "model.9");
...
}

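Compared with onnx-tensorrt[6], building a network this way has these advantages:

You control every layer precisely and can avoid redundant ONNX structures that hurt performance, so in theory a TensorRT network built through the API performs better after the build. (It depends, of course: for most models, onnx2trt plus TensorRT now performs almost the same as a pure-API build.)
It is easier to modify an individual layer of the TensorRT network later, and to add plugins.

The drawbacks are obvious: building the network is time-consuming, you need to know the TensorRT API well, and you may hit countless pitfalls while getting started. In the same time, onnx2trt would have converted the model with one command, so in that sense the API route is less nimble than onnx2trt.

Still, you can't rely on ONNX blindly. If the network contains unsupported operators, or is unusual in some way, the conversion simply fails. Look at the issues in the onnx-tensorrt repository: as late as 2023 there were still all kinds of op problems:

Also, when a model is very large (yes, I mean LLMs) and has a great many layers, ONNX becomes awkward. It's not that the model can't be exported, but with a large ONNX file it is hard to inspect the network structure and locate problems. You always have to go through ONNX as the IR, and ONNX has many small pitfalls. You can get the job done in the end, but the process is always painful (grunt work, if you know, you know).

So is there a better way, one that balances flexibility and performance?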

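A better way, v1

Some of you have probably used TensorRT conversion tools like torch2trt[7]. They walk through your PyTorch network and convert each op into the corresponding TensorRT op as they go. Once the network is assembled, it can be built into a TensorRT engine: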

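import torch
from torchvision.models.segmentation import deeplabv3_resnet50
from torch2trt_dynamic import torch2trt_dynamic

# Build an FP16 model and a matching dummy input on the GPU.
model = deeplabv3_resnet50().cuda().eval().half()
data = torch.randn((1, 3, 224, 224)).cuda().half()

print('Running torch2trt...')
model_trt = torch2trt_dynamic(
    model, [data], fp16_mode=True, max_workspace_size=1 << 25)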

Take the converter below as an example. When the traversal reaches the torch.nn.functional.leaky_relu op, this conversion function runs and adds the corresponding op to the TensorRT network: ctx.network.add_activation(input_trt, trt.ActivationType.LEAKY_RELU).

@tensorrt_converter('torch.nn.functional.leaky_relu')
@tensorrt_converter('torch.nn.functional.leaky_relu_')
def convert_leaky_relu(ctx):
    input = get_arg(ctx, 'input', pos=0, default=None)
    negative_slope = get_arg(ctx, 'negative_slope', pos=1, default=0.01)
    output = ctx.method_return

    input_trt = trt_(ctx.network, input)
    layer = ctx.network.add_activation(input_trt,
                                       trt.ActivationType.LEAKY_RELU)
    layer.alpha = negative_slope

    output._trt = layer.get_output(0)
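
The decorator in the converter above is essentially a registry keyed by op name. The following is a toy sketch of that pattern, not torch2trt's actual code: the registry, the dispatch helper, and the string it returns in place of a real TensorRT layer are all hypothetical.

```python
CONVERTERS = {}  # op name -> conversion function


def tensorrt_converter(op_name):
    """Register the decorated function as the converter for op_name."""
    def register(fn):
        CONVERTERS[op_name] = fn
        return fn
    return register


@tensorrt_converter("torch.nn.functional.leaky_relu")
@tensorrt_converter("torch.nn.functional.leaky_relu_")
def convert_leaky_relu(ctx):
    # A real converter would call ctx.network.add_activation(...);
    # here we just describe the layer that would be added.
    return f"add_activation(LEAKY_RELU, alpha={ctx['negative_slope']})"


def convert(op_name, ctx):
    """Dispatch to the registered converter while walking the model."""
    if op_name not in CONVERTERS:
        raise NotImplementedError(f"no converter for {op_name}")
    return CONVERTERS[op_name](ctx)


print(convert("torch.nn.functional.leaky_relu", {"negative_slope": 0.2}))
```

Supporting a new op then means writing one more decorated function, which is why this style of tool is easy to extend.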

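The advantage of this approach is that modifying the network is simple, because you convert straight from the PyTorch model instead of going through ONNX. You can modify a network via ONNX too, but you still have to pass through ONNX as the IR, and some ops change on the way from PyTorch to ONNX, which makes problems hard to locate when they appear.

Also, when debugging you can easily choose which tensors are outputs (find the spot in the network, cut out that sub-model, and convert it on its own), which makes problems easy to localize. With ONNX, you first have to find the layer correspondence between PyTorch and ONNX and then set the output in the onnx2trt script. TensorRT does provide debugging tools such as Polygraphy[8], but in practice they are less convenient than editing the PyTorch network directly.

The later trtorch, also known as torch-TensorRT[9], works on much the same principle as torch2trt: it walks the torch network and converts it layer by layer into TensorRT ops:

A better way, v2

The v1 approach is more direct than onnx2trt because the conversion happens inside the PyTorch model. However, all we get is the built TensorRT engine; the construction of the intermediate TensorRT network is hidden. When a problem shows up later and we want to debug further, we still lack a global view of the network. Being able to debug a network built directly with the TensorRT API would be better and more intuitive: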

class Centernet_dla34(object):
    def __init__(self, weights) -> None:
        super().__init__()
        self.weights = weights
        self.levels = [1, 1, 1, 2, 2, 1]
        self.channels = [16, 32, 64, 128, 256, 512]
        self.down_ratio = 4
        self.last_level = 5
        self.engine = self.build_engine()

    def add_batchnorm_2d(self, input_tensor, parent):
        gamma = self.weights[parent + '.weight'].numpy()
        beta = self.weights[parent + '.bias'].numpy()
        mean = self.weights[parent + '.running_mean'].numpy()
        var = self.weights[parent + '.running_var'].numpy()
        eps = 1e-5

        scale = gamma / np.sqrt(var + eps)
        shift = beta - mean * gamma / np.sqrt(var + eps)
        power = np.ones_like(scale)

        return self.network.add_scale(input=input_tensor.get_output(0), mode=trt.ScaleMode.CHANNEL, shift=shift, scale=scale, power=power)
...
    def populate_network(self):
        # Configure the network layers based on the self.weights provided.
        input_tensor = self.network.add_input(
            name=ModelData.INPUT_NAME, dtype=ModelData.DTYPE, shape=ModelData.INPUT_SHAPE)

        y = self.add_base(input_tensor, 'module.base')

        first_level = int(np.log2(self.down_ratio))
        last_level = self.last_level
        dla_up = self.add_dla_up(y, first_level, 'module.dla_up')
        ida_up = self.add_ida_up(dla_up[:last_level-first_level], self.channels[first_level], [
                                 2 ** i for i in range(last_level - first_level)], 0, 'module.ida_up')

        hm = self.add_head(ida_up[-1], 80, 'module.hm')
        wh = self.add_head(ida_up[-1], 2, 'module.wh')
        reg = self.add_head(ida_up[-1], 2, 'module.reg')

        hm.get_output(0).name = 'hm'
        wh.get_output(0).name = 'wh'
        reg.get_output(0).name = 'reg'
        self.network.mark_output(tensor=hm.get_output(0))
        self.network.mark_output(tensor=wh.get_output(0))
        self.network.mark_output(tensor=reg.get_output(0))
...
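
The add_batchnorm_2d helper above works because inference-time batch norm is an affine transform per channel, so it folds into a scale layer computing (x * scale + shift) ** power with power = 1. The scalar sketch below checks that identity numerically; the sample values are arbitrary.

```python
import math


def fold_batchnorm(gamma, beta, mean, var, eps=1e-5):
    """Fold BN statistics into the scale and shift of a scale layer."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - mean * scale
    return scale, shift


def batchnorm(x, gamma, beta, mean, var, eps=1e-5):
    """Reference inference-time batch norm for a single value."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


scale, shift = fold_batchnorm(gamma=1.5, beta=0.2, mean=0.4, var=2.0)
x = 3.0
folded = x * scale + shift
reference = batchnorm(x, gamma=1.5, beta=0.2, mean=0.4, var=2.0)
print(folded, reference)
```

The two printed values agree up to floating-point rounding, which is the property the per-channel numpy version in add_batchnorm_2d relies on.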

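As mentioned above, though, building a network this way takes a lot of time and effort. Is there a somewhat more automated way?

Those who have used fx[10] may remember its to_folder method: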

model = centernet().cuda()  
dummy_input = torch.randn(1, 3, 1024, 1024).cuda()  
res_origin = model(dummy_input)  
  
from torch.fx import symbolic_trace  
m = symbolic_trace(model.fx_model.cpu())  
m.to_folder("fx_debug","centernet_res50")

It writes out the fx-traced network as source code:

class centernet_res50(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = torch.load(r'fx_debug/backbone.pt') # Module(ResNet-50 backbone; per-layer repr abbreviated): (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False), (bn1): BatchNorm2d(64), (relu): ReLU(inplace=True), (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False), (layer1): 3 bottleneck blocks (64 -> 256), (layer2): 4 bottleneck blocks (256 -> 512), (layer3): 6 bottleneck blocks (512 -> 1024), (layer4): 3 bottleneck blocks (1024 -> 2048)
        self.upsampler = torch.load(r'fx_debug/upsampler.pt') # Module(   (deconv_layers): Module(     (0): ConvTranspose2d(2048, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)     (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)     (2): ReLU(inplace=True)     (3): ConvTranspose2d(256, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)     (4): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)     (5): ReLU(inplace=True)     (6): ConvTranspose2d(256, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)     (7): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)     (8): ReLU(inplace=True)   ) )
        self.head = torch.load(r'fx_debug/head.pt') # Module(   (hm): Module(     (0): Conv2d(256, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))     (1): ReLU(inplace=True)     (2): Conv2d(64, 3, kernel_size=(1, 1), stride=(1, 1))   )   (wh): Module(     (0): Conv2d(256, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))     (1): ReLU(inplace=True)     (2): Conv2d(64, 2, kernel_size=(1, 1), stride=(1, 1))   )   (reg): Module(     (0): Conv2d(256, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))     (1): ReLU(inplace=True)     (2): Conv2d(64, 2, kernel_size=(1, 1), stride=(1, 1))   ) )
        self.load_state_dict(torch.load(r'fx_debug/state_dict.pt'))

    def forward(self, input):
        input_1 = input
        backbone_conv1 = self.backbone.conv1(input_1);  input_1 = None
        backbone_bn1 = self.backbone.bn1(backbone_conv1);  backbone_conv1 = None
        backbone_relu = self.backbone.relu(backbone_bn1);  backbone_bn1 = None
        backbone_maxpool = self.backbone.maxpool(backbone_relu);  backbone_relu = None
        ...
        head_reg_1 = getattr(self.head.reg, "1")(head_reg_0);  head_reg_0 = None
        head_reg_2 = getattr(self.head.reg, "2")(head_reg_1);  head_reg_1 = None
        return (head_hm_2, head_wh_2, head_reg_2)
        
if __name__ == '__main__':

    model = centernet_res50()
    dummy_input = torch.randn(1, 3, 1024, 1024)
    output = model(dummy_input)

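This way we can export the traced model straight to a .py file and read the network structure off it naturally. What we get here is a PyTorch model.

If a PyTorch model can be generated, can we instead generate a network built directly with the TensorRT API?

First, we implement a PyTorch-like network interface on top of the TensorRT API: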

class Downsample2D(Module):

    def __init__(self,
                 channels,
                 use_conv=False,
                 out_channels=None,
                 padding=1) -> None:
        super().__init__()
        self.channels = channels
        self.out_channels = out_channels or channels
        self.use_conv = use_conv
        self.padding = padding
        stride = (2, 2)

        if use_conv:
            self.conv = Conv2d(self.channels,
                               self.out_channels, (3, 3),
                               stride=stride,
                               padding=(padding, padding))
        else:
            assert self.channels == self.out_channels
            self.conv = AvgPool2d(kernel_size=stride, stride=stride)

    def forward(self, hidden_states):
        assert not hidden_states.is_dynamic()
        batch, channels, _, _ = hidden_states.size()
        assert channels == self.channels

        hidden_states = self.conv(hidden_states)

        return hidden_states

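Doesn't it look a lot like a PyTorch network? The Module it inherits from, however, is a separate implementation modeled on nn.Module. I'll skip the details for now. The Conv2d member used here also looks much like the PyTorch version: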

class Conv2d(Module):

    def __init__(
            self,
            in_channels: int,
            out_channels: int,
            kernel_size: Tuple[int, int],
            stride: Tuple[int, int] = (1, 1),
            padding: Tuple[int, int] = (0, 0),
            dilation: Tuple[int, int] = (1, 1),
            groups: int = 1,
            bias: bool = True,
            padding_mode: str = 'zeros',  # TODO: refine this type
            dtype=None) -> None:
        super().__init__()
        if groups <= 0:
            raise ValueError('groups must be a positive integer')
        if in_channels % groups != 0:
            raise ValueError('in_channels must be divisible by groups')
        if out_channels % groups != 0:
            raise ValueError('out_channels must be divisible by groups')

        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        self.dilation = dilation
        self.groups = groups
        self.padding_mode = padding_mode

        self.weight = Parameter(shape=(out_channels, in_channels // groups,
                                       *kernel_size),
                                dtype=dtype)
        if bias:
            self.bias = Parameter(shape=(out_channels, ), dtype=dtype)
        else:
            self.register_parameter('bias', None)

    def forward(self, input):
        return conv2d(input, self.weight.value,
                      None if self.bias is None else self.bias.value,
                      self.stride, self.padding, self.dilation, self.groups)

Now let's look at the core implementation, conv2d(input, self.weight.value, ...):

def conv2d(input: Tensor,  
           weight: Tensor,  
           bias: Optional[Tensor] = None,  
           stride: Tuple[int, int] = (1, 1),  
           padding: Tuple[int, int] = (0, 0),  
           dilation: Tuple[int, int] = (1, 1),  
           groups: int = 1) -> Tensor:  
  
    assert not input.is_dynamic()  
  
    ndim = input.ndim()  
    if ndim == 3:  
        input = expand_dims(input, 0)  
  
    noutput = weight.size()[0]  
    kernel_size = (weight.size()[-2], weight.size()[-1])  
  
    is_weight_constant = (weight.producer is not None  
                          and weight.producer.type == trt.LayerType.CONSTANT)  
    weight = weight.producer.weights if is_weight_constant else trt.Weights()  
  
    if bias is not None:  
        is_bias_constant = (bias.producer is not None  
                            and bias.producer.type == trt.LayerType.CONSTANT)  
        bias = bias.producer.weights if is_bias_constant else trt.Weights()  
  
    layer = default_trtnet().add_convolution_nd(input.trt_tensor, noutput,  
                                                kernel_size, weight, bias)  
    layer.stride_nd = stride  
    layer.padding_nd = padding  
    layer.dilation = dilation  
    layer.num_groups = groups  
  
    if not is_weight_constant:  
        layer.set_input(1, weight.trt_tensor)  
    if bias is not None and not is_bias_constant:  
        layer.set_input(2, bias.trt_tensor)  
  
    output = _create_tensor(layer.get_output(0), layer)  
  
    if ndim == 3:  
        return output.view(  
            concat([output.size(1),  
                    output.size(2),  
                    output.size(3)]))  
  
    return output

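As you can see, the core of conv2d is simply building the convolution layer with the TensorRT API.

Having seen this, consider: if a traced network could be rebuilt directly with a PyTorch-like TensorRT API and then emitted as code, wouldn't that amount to generating a network definition written against the TensorRT API?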

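Afterword

This is only meant to start the conversation, and many details have not been covered. I have used TensorRT-like tools from other companies that, after converting a model, can directly generate a network file built with that inference backend's API (either C++ or Python). The weights and parameters are included, and for a quantized model the quantization parameters can be included too, so there is a lot you can do with it. With this approach, the network that the inference framework is about to optimize is visible at a glance, which makes modification and debugging much easier.

This is just a brief discussion. As for implementation details, I (Lao Pan) will keep writing follow-up articles, and feel free to leave a comment if you have ideas.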


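III. TensorRT-LLM | A Framework Dedicated to Large-Model Deployment

TensorRT-LLM is a high-performance deep learning inference optimization library from NVIDIA, focused on improving the inference speed and efficiency of large language models (LLMs) on NVIDIA GPUs. If you cannot avoid NVIDIA chips, this inference library is well worth getting to know.

Project link: https://github.com/NVIDIA/TensorRT-LLM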

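1. Advantages of TensorRT-LLM

TensorRT-LLM (TensorRT for Large Language Models) aims to remove the performance bottlenecks that large language models face in real-world use. By providing optimization tools and techniques designed specifically for LLM inference, it can significantly speed up inference, reduce latency, and optimize memory usage.

2. Core features of TensorRT-LLM

1) Easy-to-use Python API

TensorRT-LLM provides a concise Python API that lets users define large language models and build TensorRT engines containing advanced optimizations.
The API is designed to resemble PyTorch, so developers with PyTorch experience can migrate and integrate easily.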

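2) Model optimization

TensorRT-LLM supports several precision and quantization options (such as FP16 and INT8), so users can pick a configuration that balances performance and accuracy for their needs.
Through optimizations such as layer fusion, kernel selection and precision adjustment, TensorRT-LLM can significantly speed up model inference.

3) Memory management

TensorRT-LLM optimizes memory usage and lowers the memory footprint through smart memory allocation and paged attention.

4) Multi-threaded parallelism and hardware acceleration

Supports multi-threaded parallel processing to increase processing speed.
Makes full use of the compute power of NVIDIA GPUs to accelerate model inference.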

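5) Dynamic batching

TensorRT-LLM supports dynamic batching, which optimizes text generation by processing multiple requests at the same time, reducing waiting time and improving GPU utilization.

6) Multi-GPU and multi-node inference

Supports distributed inference across multiple GPUs or multiple nodes, which raises throughput and reduces overall inference time.

7) FP8 support

With TensorRT-LLM on NVIDIA H100 GPUs, model weights can easily be converted to the new FP8 format, and models are compiled automatically to take advantage of optimized FP8 kernels. This is made possible by the NVIDIA Hopper architecture and requires no changes to model code.

8) Support for recent GPUs

TensorRT-LLM supports GPUs based on the NVIDIA Hopper, NVIDIA Ada Lovelace, NVIDIA Ampere, NVIDIA Turing and NVIDIA Volta architectures.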

3. Models that TensorRT-LLM can deploy

1) LLM series

2) Multimodal large models

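4. Quantization

INT8 SmoothQuant (W8A8)

The SmoothQuant technique is introduced in https://arxiv.org/abs/2211.10438. It is a method for running inference with INT8 for both activations and weights while preserving the network's accuracy (on downstream tasks). As the paper describes, the model's weights must be preprocessed. TensorRT-LLM includes scripts that prepare a model to run with the SmoothQuant method.

Examples of how to enable SmoothQuant for GPT, GPT-J and LLaMA can be found in the examples/quantization folder of the release.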
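To make the weight preprocessing step more concrete, here is a minimal pure-Python sketch of the smoothing idea from the SmoothQuant paper: a per-input-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) is divided out of the activations and multiplied into the weights, so X·W stays the same while activation outliers shrink. The function names and toy matrices are illustrative assumptions and are not taken from TensorRT-LLM's scripts.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def smooth_scales(x, w, alpha=0.5):
    """Per-input-channel factors s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).

    Assumes every weight row has at least one nonzero entry.
    """
    scales = []
    for j in range(len(w)):
        act_max = max(abs(row[j]) for row in x)  # column j of the activations
        w_max = max(abs(v) for v in w[j])        # row j of the weights
        scales.append((act_max ** alpha) / (w_max ** (1.0 - alpha)))
    return scales


def apply_smoothing(x, w, scales):
    """Return (X / s, s * W); mathematically X @ W is unchanged."""
    x_s = [[v / scales[j] for j, v in enumerate(row)] for row in x]
    w_s = [[v * scales[j] for v in w[j]] for j in range(len(w))]
    return x_s, w_s


# Toy data: activation channel 1 carries an outlier.
X = [[0.5, 40.0], [-1.0, -60.0]]
W = [[0.2, -0.4], [0.01, 0.03]]

s = smooth_scales(X, W)
X_s, W_s = apply_smoothing(X, W, s)
ref = matmul(X, W)
out = matmul(X_s, W_s)
max_act_before = max(abs(v) for row in X for v in row)
max_act_after = max(abs(v) for row in X_s for v in row)
print(s, max_act_before, max_act_after)
```

After smoothing, the activation range is much narrower, which is what makes per-tensor INT8 quantization of activations workable; the difficulty has been shifted into the weights, which are easier to quantize.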

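INT4 and INT8 weight-only quantization (W4A16 and W8A16)

INT4 and INT8 weight-only quantization quantizes the model's weights and dequantizes them on the fly inside the linear layers (Matmuls). Activations are encoded in floating point (FP16 or BF16). To use INT4/INT8 weight-only quantization, the user must determine the scaling factors used to quantize and dequantize the model weights.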
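The following is a minimal sketch of the weight-only idea in pure Python, assuming symmetric INT8 quantization with one scale per output channel. It only illustrates the arithmetic (quantize once, dequantize inside the matmul); the function names are hypothetical and this is not how the TensorRT-LLM kernels are written.

```python
def quantize_per_channel(weight):
    """Symmetric INT8 quantization with one scale per output channel (row)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def weight_only_linear(x, q_weight, scales):
    """y = x @ W^T, dequantizing INT8 weights on the fly; x stays floating point."""
    out = []
    for q_row, scale in zip(q_weight, scales):
        out.append(sum(xi * (q * scale) for xi, q in zip(x, q_row)))
    return out


W = [[0.10, -0.52, 0.33], [1.20, 0.05, -0.70]]  # (out_features, in_features)
x = [1.0, 2.0, -1.0]

q_w, w_scales = quantize_per_channel(W)
y_ref = [sum(xi * wi for xi, wi in zip(x, row)) for row in W]
y_q = weight_only_linear(x, q_w, w_scales)
print(q_w, y_ref, y_q)
```

Only the weights are stored as integers; the scaling factors are exactly the values the user has to supply, and the activations never leave floating point.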

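GPTQ and AWQ (W4A16)

The GPTQ and AWQ techniques are introduced in https://arxiv.org/abs/2210.17323 and https://arxiv.org/abs/2306.00978 respectively. TensorRT-LLM supports the GPTQ and AWQ methods in linear layers using per-group scaling factors and zero offsets. For details, see the WeightOnlyGroupwiseQuantMatmulPlugin plugin and the corresponding weight_only_groupwise_quant_matmul Python function.

The code includes examples that apply GPTQ to GPT-NeoX and LLaMA-v2, as well as an example that uses AWQ with GPT-J. These examples are experimental implementations and may be improved in future releases.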

FP8 (Hopper)

TensorRT-LLM includes FP8 implementations for GPT-NeMo, GPT-J and LLaMA. These examples can be found in examples/quantization.

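5. Hardware and software supported by TensorRT-LLM

6. Application scenarios for TensorRT-LLM

TensorRT-LLM has proven useful in many areas, including but not limited to:

Online customer service: real-time dialogue generation for seamless AI-assisted service.
Search engines: using models to enrich queries and return more precise search results.
Automatic code completion: integrating models into IDEs to help developers complete code.
Content creation platforms: automatically generating article summaries or suggestions to make creators more productive.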

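IV. FX2TRT

This one is the official torch-to-TRT converter; until now the usual choice was wangxinyu's.

The torch-tensorrt repository has moved under PyTorch and been renamed pytorch/TensorRT
PyTorch has moved the fx2trt branch out of its main repository and into the pytorch/TensorRT repository

The official FX2TRT User Guide (https://github.com/pytorch/TensorRT/blob/master/docsrc/tutorials/getting_started_with_fx_path.rst)

The fx2trt code from the PyTorch repository has been moved into pytorch/TensorRT and is now one part of it: the FX Frontend. pytorch/TensorRT is the former Torch-TensorRT library. Everything is now unified: besides converting TorchScript models to TensorRT models, it can also convert FX models to TensorRT models.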

Pytorch/TensorRT

This library is different from NVIDIA's official TensorRT repository. It is PyTorch's own TensorRT repository (https://github.com/pytorch/TensorRT), and its short description reads:

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT

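Its predecessor is TRTorch, also known as Torch-TensorRT; I wrote an answer about it earlier (https://www.zhihu.com/question/436143525/answer/2267845251). The library's main purpose is to bring TensorRT acceleration to TorchScript models seamlessly. It accelerates models within the TorchScript ecosystem, which is the closest one to PyTorch, and makes full use of excellent tools such as TensorRT and TVM. There is no need to split the model into several parts by hand, because the TorchScript runtime stitches them together, which suits some models very well.

That is not the focus of this article, though. What matters here is that fx2trt has moved into this repository; PyTorch apparently wants to consolidate its TensorRT-related libraries in one place, which is a good thing. I only use fx2trt here, so the following commands are all that is needed: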

git clone https://github.com/pytorch/TensorRT.git
cd TensorRT/py
python3 setup.py install --fx-only

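Looking at the code structure of the FX part, it is basically unchanged; it has simply been pulled out on its own.

fx2trt exists to work with FX and convert an FX-traced model to TensorRT, in roughly four steps:

Trace the model first
Split the traced model into the parts that TRT supports and the parts it does not
Convert the supported parts of the model to TRT
Obtain a new nn.Module in which each supported subgraph is embedded as a TRT engine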
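The four steps above can be sketched with a toy "graph" that is just a list of op names. This is only an illustration of the split-then-lower idea; SUPPORTED_OPS, split_by_support and lower are hypothetical names, and the real fx2trt splitter works on FX graphs rather than flat lists.

```python
SUPPORTED_OPS = {"conv", "relu", "add"}  # hypothetical set of ops the backend can convert


def split_by_support(ops):
    """Group consecutive ops into (is_supported, [ops]) segments, preserving order."""
    segments = []
    for op in ops:
        ok = op in SUPPORTED_OPS
        if segments and segments[-1][0] == ok:
            segments[-1][1].append(op)
        else:
            segments.append((ok, [op]))
    return segments


def lower(ops):
    """Replace each supported segment with a single 'engine' placeholder node."""
    lowered = []
    for ok, seg in split_by_support(ops):
        if ok:
            lowered.append("trt_engine(" + "+".join(seg) + ")")
        else:
            lowered.extend(seg)  # unsupported ops keep running in the framework
    return lowered


traced = ["conv", "relu", "custom_nms", "conv", "add", "relu"]
result = lower(traced)
print(result)
```

The unsupported op stays where it was, and each run of supported ops collapses into one engine node, which mirrors how the final module mixes TRT subgraphs with ordinary PyTorch code.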

An example

Let's take a quick look at the official sample code: TensorRT/examples/fx/lower_example.py contains a resnet18 example. First we get the resnet18 model, nothing special:

model = torchvision.models.resnet18(pretrained=True)

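The model is then compiled with the compile function. Internally, compile just invokes a Lowerer class, which builds the fx2trt pipeline according to the config. Later on torch_tensorrt will unify this interface and dispatch compilation depending on whether the model is fx or ts (TorchScript); here we only cover fx:

# module is the nn.Module obtained from torchvision.models.resnet18(pretrained=True)
lowered_module = compile(
    module,
    input,  # input = [torch.rand(128, 3, 224, 224)]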
    max_batch_size=conf.batch_size,
    lower_precision=LowerPrecision.FP16 if conf.fp16 else LowerPrecision.FP32,
)

# compile calls Lowerer, a helper class that builds the fx2trt pipeline
def compile(
    module: nn.Module,
    input,
    max_batch_size: int = 2048,
    max_workspace_size=1 << 25,
    explicit_batch_dimension=False,
    lower_precision=LowerPrecision.FP16,
    verbose_log=False,
    timing_cache_prefix="",
    save_timing_cache=False,
    cuda_graph_batch_size=-1,
    dynamic_batch=True,
    is_aten=False,
) -> nn.Module:
    lower_setting = LowerSetting(
        max_batch_size=max_batch_size,
        max_workspace_size=max_workspace_size,
        explicit_batch_dimension=explicit_batch_dimension,
        lower_precision=lower_precision,
        verbose_log=verbose_log,
        timing_cache_prefix=timing_cache_prefix,
        save_timing_cache=save_timing_cache,
        cuda_graph_batch_size=cuda_graph_batch_size,
        dynamic_batch=dynamic_batch,
        is_aten=is_aten,
    )
    lowerer = Lowerer.create(lower_setting=lower_setting)
    return lowerer(module, input)

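When Lowerer.create is called, it builds the pipeline from the lower_setting passed in. The parameters are easy to understand:

The conversion precision, FP16 or FP32
Sample inputs used for tracing and for later testing
Other common TensorRT parameters, such as the workspace size

The pipeline lives in the pass manager. As the previous article put it, FX is an AI compiler, and compilers have the concept of a pass, meaning an optimization applied to the code. Passes in FX are the same idea, except that they optimize the model. They are roughly the following: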

# The passes in the pipeline
def build_trt_lower_pipeline(
        self, input: Input, additional_input: Optional[Input] = None
    ) -> PassManager:
        self._input = input
        self._additional_input = additional_input
        passes = []

        passes.append(self._default_replace_mutable_op_pass())
        passes.append(self._const_fold_pass())
        passes.append(self.graph_optimization_pass())
        passes.append(self._split_pass())
        passes.append(self._trt_lower_pass())

        pm = PassManager.build_from_passlist(passes)
        return pm

These passes are simply FX transforms, which the previous article also covered:

Your transform will take in an torch.nn.Module, acquire a Graph from it, do some modifications, and return a new torch.nn.Module. You should think of the torch.nn.Module that your FX transform returns as identical to a regular torch.nn.Module – you can pass it to another FX transform, you can pass it to TorchScript, or you can run it. Ensuring that the inputs and outputs of your FX transform are a torch.nn.Module will allow for composability.

Take the replace_mutable_op function as an example: it modifies the incoming torch.fx.GraphModule, calls recompile() to rebuild the GraphModule, and returns a torch.fx.GraphModule:

def replace_mutable_op(module: torch.fx.GraphModule) -> torch.fx.GraphModule:
    if not isinstance(module, torch.fx.GraphModule):
        return module

    # Before any lowering pass, replace mutable ops like torch.fill_
    # Because fx cannot deal with inplace ops
    for n in module.graph.nodes:
        # TODO: add more mutable ops
        if (n.op == "call_method" and n.target == "fill_") or (
            n.op == "call_function" and n.target == torch.fill_
        ):
            # Replace mutable op only if the modified variable
            # is used by the rest of the graph
            # only through this op
            if set(n.args[0].users.keys()) == {n}:
                with module.graph.inserting_after(n):

                    # TODO: move this outside?
                    def fill_with_mul_zero_and_add(*args):
                        return args[0].mul(0.0).add(args[1])

                    new_node = module.graph.create_node(
                        "call_function", fill_with_mul_zero_and_add, args=n.args
                    )
                    n.replace_all_uses_with(new_node)
                    module.graph.erase_node(n)
    module.recompile()
    return module
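The pass-pipeline idea can be sketched without FX at all: every pass takes a graph and returns a graph, so passes compose freely. In this minimal sketch the graph is a list of (op, target, value) tuples, and all names (replace_fill_pass, fold_mul_by_one_pass, build_pipeline) are hypothetical stand-ins rather than the fx2trt PassManager API.

```python
from functools import reduce


def replace_fill_pass(graph):
    """Rewrite in-place ('fill_', x, value) nodes as out-of-place mul/add nodes."""
    out = []
    for node in graph:
        if node[0] == "fill_":
            _, target, value = node
            out.append(("mul", target, 0.0))
            out.append(("add", target, value))
        else:
            out.append(node)
    return out


def fold_mul_by_one_pass(graph):
    """Drop multiplications by 1.0, a trivial stand-in for constant folding."""
    return [n for n in graph if not (n[0] == "mul" and n[2] == 1.0)]


def build_pipeline(passes):
    """Compose passes left to right into one graph -> graph callable."""
    return lambda graph: reduce(lambda g, p: p(g), passes, graph)


graph = [("mul", "x", 1.0), ("fill_", "x", 3.0), ("relu", "x", None)]
pipeline = build_pipeline([replace_fill_pass, fold_mul_by_one_pass])
lowered = pipeline(graph)
print(lowered)
```

Because each pass has the same input and output type, adding or reordering passes only means editing the list handed to build_pipeline, which is the composability the FX documentation quote refers to.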

In short, the compiled model already contains the TRT engine, so it can be run and benchmarked directly:

lowered_module = compile(
    module,
    input,
    max_batch_size=conf.batch_size,
    lower_precision=LowerPrecision.FP16 if conf.fp16 else LowerPrecision.FP32,
)
time = benchmark_torch_function(conf.batch_iter, lambda: lowered_module(*input))

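The benchmark result is as expected: the TRT model is much faster than the original PyTorch model, especially in FP16, where even a small model like resnet18 reaches about 4.4x the QPS (18056 vs. 4082):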

== Start benchmark iterations
== End benchmark iterations
== Benchmark Result for: Configuration(batch_iter=50, batch_size=128, name='CUDA Eager', trt=False, jit=False, fp16=False, accuracy_rtol=-1)
BS: 128, Time per iter: 31.35ms, QPS: 4082.42, Accuracy: None (rtol=-1)
== Benchmark Result for: Configuration(batch_iter=50, batch_size=128, name='TRT FP32 Eager', trt=True, jit=False, fp16=False, accuracy_rtol=0.001)
BS: 128, Time per iter: 21.53ms, QPS: 5944.90, Accuracy: None (rtol=0.001)
== Benchmark Result for: Configuration(batch_iter=50, batch_size=128, name='TRT FP16 Eager', trt=True, jit=False, fp16=True, accuracy_rtol=0.01)
BS: 128, Time per iter: 7.09ms, QPS: 18056.38, Accuracy: None (rtol=0.01)
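As a quick sanity check on these numbers, the speedups can be computed directly from the log. The figures below are copied from the benchmark output above; the only assumption is that QPS is batch size divided by time per iteration, which matches the reported values up to rounding.

```python
# Figures copied from the benchmark log above (batch size 128).
results = {
    "CUDA Eager": {"ms_per_iter": 31.35, "qps": 4082.42},
    "TRT FP32 Eager": {"ms_per_iter": 21.53, "qps": 5944.90},
    "TRT FP16 Eager": {"ms_per_iter": 7.09, "qps": 18056.38},
}
BATCH_SIZE = 128

baseline = results["CUDA Eager"]["qps"]
speedups = {name: r["qps"] / baseline for name, r in results.items()}

# QPS recomputed from the rounded per-iteration time, as a consistency check.
recomputed = {
    name: BATCH_SIZE / (r["ms_per_iter"] / 1000.0) for name, r in results.items()
}

for name in results:
    print(f"{name}: {speedups[name]:.2f}x, recomputed QPS ~ {recomputed[name]:.0f}")
```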

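Runtime environment

That was a brief introduction to Torch-TensorRT; now for the main part. The first FX article was written quite a while ago, and so was the second (yes, I procrastinate a lot), so when writing this third one (2022-10-29) I refreshed the FX test environment to keep the content accurate. I pulled the latest code for torch, torchvision and torch-tensorrt and built them all from source: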

torch                   1.14.0a0+gita0c2a7f /root/code/pytorch                                                        
torch-tensorrt          1.3.0a0+5a7ac8f3    
torch-tensorrt-fx2trt   0.1                 /usr/local/lib/python3.8/dist-packages/torch_tensorrt_fx2trt-0.1-py3.8.egg
torchvision             0.14.0a0+d0d7058    /root/code/vision

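FX is evolving quickly, yet as of version 1.14 it is still in beta. The good news is that after updating to the latest environment, my earlier code ran with only minor changes (no more than 2 lines). This suggests FX's backward compatibility is in good shape, so you can use it with confidence.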

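Test model

I could no longer find the model I used before, so I needed a new one to compare the accuracy of FP32 (PyTorch) against INT8 after quantization (PyTorch FX and TensorRT).

When I ran fx2trt last year I used the resnet50 version of CenterNet, and I modified the upsample layers at the end of CenterNet so that their input and output channel counts are equal: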

# in_channels and out_channels must be equal
nn.ConvTranspose2d(
    in_channels=planes,
    out_channels=planes,
    kernel_size=kernel,
    stride=2,
    padding=padding,
    output_padding=output_padding,
    bias=self.deconv_with_bias)

# groups must be 1
up = nn.ConvTranspose2d(
    out_dim, out_dim, f * 2, stride=f, padding=f // 2,
    output_padding=0, groups=1, bias=False)

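Why do it this way? Because TensorRT has a bug when quantizing deconvolutions (transposed convolutions): only deconvolutions that satisfy certain conditions are parsed correctly (without quantization there is no problem). Judging by the feedback in the issues, most deconvolution problems should be resolved around version 8.5 (there really are a lot of deconvolution problems). Related issue links:

So there was no way around training a model myself. I use CenterNet with a resnet50 backbone; apart from changing the channel counts of the deconvolutions at the end of the model, everything matches the official version. I trained a two-class model on my own dataset that detects people and hands.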

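FX2TRT

With the model ready, let's get to the point!

As mentioned above, the FX interface changed slightly in the new version: the prepare_fx parameter that configured the backend in the previous article is now named backend_config, and conversion is wrapped in a new function, convert_to_reference_fx, which moves the is_reference parameter inside, so convert_fx is no longer used: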

def convert_to_reference_fx(
    graph_module: GraphModule,
    convert_custom_config: Union[ConvertCustomConfig, Dict[str, Any], None] = None,
    _remove_qconfig: bool = True,
    qconfig_mapping: Union[QConfigMapping, Dict[str, Any], None] = None,
    backend_config: Union[BackendConfig, Dict[str, Any], None] = None,
) -> torch.nn.Module:
    torch._C._log_api_usage_once("quantization_api.quantize_fx.convert_to_reference_fx")
    return _convert_fx(
        graph_module,
        is_reference=True,
        convert_custom_config=convert_custom_config,
        _remove_qconfig=_remove_qconfig,
        qconfig_mapping=qconfig_mapping,
        backend_config=backend_config,
    )

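Nothing else has changed.

After passing the model through prepare_fx and convert_to_reference_fx, we obtain the final reference quantized model. The model produced by convert_to_reference_fx is in fact simulated quantization. It contains no INT8 operators at all, only Q and DQ operations plus regular FP32 operators, together with the scale and offset obtained from calibration, which are used to simulate the model's quantization error. This is what the model actually executes: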

def forward(self, input):
    input_1 = input
    # First fetch the quantization parameters scale and zero-point
    backbone_conv1_input_scale_0 = self.backbone_conv1_input_scale_0
    backbone_conv1_input_zero_point_0 = self.backbone_conv1_input_zero_point_0
    # Then quantize the input
    quantize_per_tensor = torch.quantize_per_tensor(input_1, backbone_conv1_input_scale_0, backbone_conv1_input_zero_point_0, torch.qint8);
    input_1 = backbone_conv1_input_scale_0 = backbone_conv1_input_zero_point_0 = None
    # Then dequantize it again
    dequantize = quantize_per_tensor.dequantize();  quantize_per_tensor = None
    # The FP32 operator actually receives the dequantized input
    backbone_conv1 = self.backbone.conv1(dequantize);  dequantize = None
    ...
    dequantize_80 = quantize_per_tensor_83.dequantize();  quantize_per_tensor_83 = None
    head_angle_2 = getattr(self.head.angle, "2")(dequantize_80);  dequantize_80 = None
    head_angle_2_output_scale_0 = self.head_angle_2_output_scale_0
    head_angle_2_output_zero_point_0 = self.head_angle_2_output_zero_point_0
    quantize_per_tensor_84 = torch.quantize_per_tensor(head_angle_2, head_angle_2_output_scale_0, head_angle_2_output_zero_point_0, torch.qint8);  head_angle_2 = head_angle_2_output_scale_0 = head_angle_2_output_zero_point_0 = None
    dequantize_81 = quantize_per_tensor_78.dequantize();  quantize_per_tensor_78 = None
    dequantize_82 = quantize_per_tensor_80.dequantize();  quantize_per_tensor_80 = None
    dequantize_83 = quantize_per_tensor_82.dequantize();  quantize_per_tensor_82 = None
    dequantize_84 = quantize_per_tensor_84.dequantize();  quantize_per_tensor_84 = None
    return {'hm': dequantize_81, 'wh': dequantize_82, 'reg': dequantize_83, 'angle': dequantize_84}

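The type of this model is GraphModule, which is similar to nn.Module and has a corresponding forward function. We can run this model directly in PyTorch to test its accuracy. Note, however, that this only tests the accuracy of the simulated quantized model, i.e. whether the scale and offset obtained from calibration are sound. After conversion to TensorRT the accuracy may differ slightly, since we do not know how some operators are implemented inside the actual inference framework. A quick look at the structure diagram of the model above:

Here, backbone_conv1_input_scale_0 and backbone_conv1_input_zero_point_0 are the scale and offset learned during calibration. The weight layers need no calibration; their parameters can be computed directly (see the previous article for details), so I won't repeat that here.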
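The Q followed by DQ pattern can be written out in a few lines of plain Python. This is a sketch of the arithmetic only, using the usual affine formulas q = clamp(round(x / scale) + zero_point) and x_hat = (q - zero_point) * scale; the scale and zero-point values below are made up for illustration and are not the calibrated values of the model above.

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an INT8 code: q = clamp(round(x / scale) + zero_point)."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an INT8 code back to float: x_hat = (q - zero_point) * scale."""
    return (q - zero_point) * scale


def fake_quant(values, scale, zero_point):
    """Q followed by DQ: the output stays float but carries the quantization error."""
    return [dequantize(quantize(v, scale, zero_point), scale, zero_point) for v in values]


scale, zero_point = 0.05, 0          # hypothetical calibration results
inputs = [0.02, 0.51, -1.26, 10.0]   # the last value exceeds the INT8 range and saturates
simulated = fake_quant(inputs, scale, zero_point)
errors = [abs(a - b) for a, b in zip(inputs, simulated)]
print(simulated, errors)
```

Values inside the representable range are off by at most half a quantization step, while values beyond it are clipped, which is why a poorly calibrated scale shows up as a large error on outputs with a wide dynamic range.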

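Here I compared the accuracy of the quantized FX model (sim-INT8) against the original FX model (FP32). Since CenterNet has three outputs:

I did a simple accuracy calculation for each of the three outputs: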

original_fx_model.cuda()
res_fp32 = original_fx_model(data)
res_int8 = quantized_fx(data)
for i in range(len(res_fp32)):
    print(torch.max(torch.abs(res_fp32[i] -  res_int8[i])))

Crude but simple. The gap looks rather large: the maximum error of wh is as high as 26:

tensor(1.5916, device='cuda:0', grad_fn=<MaxBackward1>)
tensor(26.1865, device='cuda:0', grad_fn=<MaxBackward1>)
tensor(0.1195, device='cuda:0', grad_fn=<MaxBackward1>)

However, if we compute the cosine similarity of each output, every one of them is close to 1:

torch_cosine_similarity:  tensor(1.0000)
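The two metrics can disagree like this because maximum absolute error grows with the magnitude of the values, while cosine similarity only looks at direction. A minimal pure-Python sketch, with made-up numbers standing in for a large-magnitude output such as wh:

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical outputs: large magnitudes, small relative error.
fp32_out = [512.0, 300.0, 40.0, 8.0]
int8_out = [537.0, 290.0, 42.0, 7.0]

diff = max_abs_diff(fp32_out, int8_out)
cos = cosine_similarity(fp32_out, int8_out)
print(diff, cos)
```

A maximum error of 25 on values in the hundreds is a deviation of only a few percent, so the vectors still point in almost the same direction. Neither number alone says whether detection accuracy is preserved, which is why the final check is the mAP.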

Take a guess: did the final mAP drop?

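acc_tracer

Next, acc_tracer is needed to convert the reference model into an acc version of the model.

Acc Tracer is inherited from FX symbolic tracer. Performs tracing and arg normalization specialized for accelerator lowering.

The main job of acc is to convert the reference-version ops in PyTorch into the corresponding acc-ops. The individual steps it performs are listed further down, right before the graph dump taken before tracing.

At run time, the job of TRTInterpreter (introduced in its own section below) is to walk the nodes in the graph and convert them one by one with the registered converters. The neat part is that TRTInterpreter inherits from torch.fx.Interpreter and overrides some of its methods, among them run_node, call_module, call_function and call_method.

The acc_op_map code lives mainly in TensorRT/py/torch_tensorrt/fx/tracer/acc_tracer/acc_ops.py. Here is a small excerpt: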

@register_acc_op_properties(AccOpProperty.pointwise, AccOpProperty.unary)
@register_acc_op_mapping(op_and_target=("call_function", nn.functional.relu))
@register_acc_op_mapping(
    op_and_target=("call_function", torch.relu),
    arg_replacement_tuples=[("input", "input")],
)
@register_acc_op_mapping(
    op_and_target=("call_method", "relu"),
    arg_replacement_tuples=[("input", "input")],
)
@register_acc_op
def relu(*, input, inplace=False):
    return nn.functional.relu(input=input, inplace=inplace)

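As you can see, nn.functional.relu, torch.relu and the call_method form of relu are all eventually converted to acc_op.relu.

Otherwise we might have to write three copies of converter code for the three cases, which would be tedious and redundant.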
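The registration mechanism can be imitated with a small decorator-based registry. This is a simplified sketch of the idea only: ACC_OP_MAP, register_mapping and the stand-in relu functions are hypothetical, and the real register_acc_op_mapping also handles argument renaming and other details.

```python
ACC_OP_MAP = {}  # (node op kind, original target) -> normalized acc op


def register_mapping(op_and_target):
    """Decorator that maps one original (kind, target) pair to the decorated acc op."""
    def decorator(acc_op):
        ACC_OP_MAP[op_and_target] = acc_op
        return acc_op
    return decorator


def functional_relu(x):   # stand-in for nn.functional.relu
    return max(x, 0.0)


def torch_relu(x):        # stand-in for torch.relu
    return max(x, 0.0)


@register_mapping(("call_function", functional_relu))
@register_mapping(("call_function", torch_relu))
@register_mapping(("call_method", "relu"))
def acc_relu(*, input):
    """The single normalized op; keyword-only arguments mirror the acc-op style."""
    return max(input, 0.0)


def normalize(kind, target):
    """Look up the acc op for a node; the three relu spellings collapse into one."""
    return ACC_OP_MAP[(kind, target)]


targets = [
    ("call_function", functional_relu),
    ("call_function", torch_relu),
    ("call_method", "relu"),
]
normalized = {normalize(kind, target) for kind, target in targets}
print(len(normalized), acc_relu(input=-2.0))
```

With every spelling funneled into one op, the backend needs exactly one converter per acc-op.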

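Once we have the acc version of the model, the acc-ops need to be converted to TRT one by one. At this point the trace step is complete (the acc_trace process actually has many details, which I will skip for reasons of space and may cover separately later).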

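TRTInterpreter

TRTInterpreter inherits from torch.fx.Interpreter.

An Interpreter executes an FX graph Node-by-Node. This pattern can be useful for many things, including writing code transformations as well as analysis passes.

Interpreter was also covered in the first article. An interpreter is an elegant way to loop over the nodes of a Graph and execute them, while accomplishing other tasks along the way. Many things can be built on it, such as replacing an operation in the model or profiling model performance. Here we use TRTInterpreter to convert acc_ops to TRT ops. First, initialize the interpreter object with the usual parameters. I am converting with dynamic shapes here, so I specify the min, opt and max sizes and set explicit_batch_dimension to True: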

interp = TRTInterpreter(
    quantized_fx,
    [InputTensorSpec(torch.Size([1,3,-1,-1]), torch.float,
                    shape_ranges=[((1, 3, 128, 128), (1, 3, 768, 768), (1, 3, 1024, 1024))], has_batch_dim=True)],
    explicit_batch_dimension=True, explicit_precision=True,
    logger_level=trt.Logger.VERBOSE
    )

Then it can be executed. When calling run, pass in the target precision and the workspace size:

res = interp.run(lower_precision=LowerPrecision.INT8, strict_type_constraints=True, max_workspace_size=4096000000)

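The steps the acc tracer performs, mentioned earlier, are:

First, convert all un-traceable parts of the model to be traced into traceable ones
Then strip all assertion and exception wrappers
Clean up the model and remove dead code
Normalize the args/kwargs of every node in the graph: move args that qualify into kwargs and make default values explicit

The model graph before tracing: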

graph():
    %input_1 : [#users=1] = placeholder[target=input]
    %backbone_base_base_layer_0_input_scale_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_input_scale_0]
    %backbone_base_base_layer_0_input_zero_point_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_input_zero_point_0]
    %quantize_per_tensor : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%input_1, %backbone_base_base_layer_0_input_scale_0, %backbone_base_base_layer_0_input_zero_point_0, torch.qint8), kwargs = {})
    %dequantize : [#users=1] = call_method[target=dequantize](args = (%quantize_per_tensor,), kwargs = {})
    %backbone_base_base_layer_0_0_weight : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.weight]
    %backbone_base_base_layer_0_0_weight_scale : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.weight_scale]
    %backbone_base_base_layer_0_0_weight_zero_point : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.weight_zero_point]
    %quantize_per_channel : [#users=1] = call_function[target=torch.quantize_per_channel](args = (%backbone_base_base_layer_0_0_weight, %backbone_base_base_layer_0_0_weight_scale, %backbone_base_base_layer_0_0_weight_zero_point, 0, torch.qint8), kwargs = {})
    %dequantize_1 : [#users=1] = call_method[target=dequantize](args = (%quantize_per_channel,), kwargs = {})
    %backbone_base_base_layer_0_0_bias : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.bias]
    %conv2d : [#users=1] = call_function[target=torch.conv2d](args = (%dequantize, %dequantize_1, %backbone_base_base_layer_0_0_bias, (1, 1), (3, 3), (1, 1), 1), kwargs = {})
    %relu : [#users=1] = call_function[target=torch.nn.functional.relu](args = (%conv2d,), kwargs = {inplace: True})
    %backbone_base_base_layer_0_scale_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_scale_0]
    %backbone_base_base_layer_0_zero_point_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_zero_point_0]
    %quantize_per_tensor_1 : [#users=1] = call_function[target=torch.quantize_per_tensor](args = (%relu, %backbone_base_base_layer_0_scale_0, %backbone_base_base_layer_0_zero_point_0, torch.qint8), kwargs = {})
 ...

The model graph after tracing:

graph():
    %input_1 : [#users=1] = placeholder[target=input]
    %backbone_base_base_layer_0_input_scale_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_input_scale_0]
    %backbone_base_base_layer_0_input_zero_point_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_input_zero_point_0]
    %quantize_per_tensor_92 : [#users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.quantize_per_tensor](args = (), kwargs = {input: %input_1, acc_out_ty: (None, torch.qint8, None, None, None, None, {scale: %backbone_base_base_layer_0_input_scale_0, zero_point: %backbone_base_base_layer_0_input_zero_point_0})})
    %dequantize_153 : [#users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.dequantize](args = (), kwargs = {input: %quantize_per_tensor_92})
    %backbone_base_base_layer_0_0_weight : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.weight]
    %backbone_base_base_layer_0_0_weight_scale : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.weight_scale]
    %backbone_base_base_layer_0_0_weight_zero_point : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.weight_zero_point]
    %quantize_per_channel_61 : [#users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.quantize_per_channel](args = (), kwargs = {input: %backbone_base_base_layer_0_0_weight, acc_out_ty: (None, torch.qint8, None, None, None, None, {scale: %backbone_base_base_layer_0_0_weight_scale, zero_point: %backbone_base_base_layer_0_0_weight_zero_point, axis: 0})})
    %dequantize_154 : [#users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.dequantize](args = (), kwargs = {input: %quantize_per_channel_61})
    %backbone_base_base_layer_0_0_bias : [#users=1] = get_attr[target=backbone.base.base_layer.0.0.bias]
    %conv2d_55 : [#users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.conv2d](args = (), kwargs = {input: %dequantize_153, weight: %dequantize_154, bias: %backbone_base_base_layer_0_0_bias, stride: (1, 1), padding: (3, 3), dilation: (1, 1), groups: 1})
    %relu_48 : [#users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.relu](args = (), kwargs = {input: %conv2d_55, inplace: True})
    %backbone_base_base_layer_0_scale_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_scale_0]
    %backbone_base_base_layer_0_zero_point_0 : [#users=1] = get_attr[target=backbone_base_base_layer_0_zero_point_0]
    %quantize_per_tensor_93 : [#users=1] = call_function[target=torch_tensorrt.fx.tracer.acc_tracer.acc_ops.quantize_per_tensor](args = (), kwargs = {input: %relu_48, acc_out_ty: (None, torch.qint8, None, None, None, None, {scale: %backbone_base_base_layer_0_scale_0, zero_point: %backbone_base_base_layer_0_zero_point_0})})

As you can see, the original dequantize has been converted into torch_tensorrt.fx.tracer.acc_tracer.acc_ops.dequantize. Why do this? There are two reasons:

Ops with the same function (PyTorch ops and builtin ops), e.g. torch.add, builtin.add and torch.Tensor.add, can all be converted into a single acc.add
Move args and kwargs into kwargs only for converting simplicity
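The second point, moving everything into kwargs and making defaults explicit, can be illustrated with Python's inspect module. This is a sketch under the assumption of a conv2d-like signature; normalize_to_kwargs is a hypothetical helper and is not the acc tracer's actual implementation.

```python
import inspect


def conv2d(input, weight, bias=None, stride=(1, 1), padding=(0, 0),
           dilation=(1, 1), groups=1):
    """Stand-in with a conv2d-like signature; the body is irrelevant here."""
    return None


def normalize_to_kwargs(fn, args, kwargs):
    """Bind positional args to parameter names and make default values explicit."""
    bound = inspect.signature(fn).bind(*args, **kwargs)
    bound.apply_defaults()
    return dict(bound.arguments)


# Positional call, as in the pre-trace graph: bias, stride, padding, dilation, groups.
normalized = normalize_to_kwargs(conv2d, ("x", "w", "b", (1, 1), (3, 3), (1, 1), 1), {})
# Partially specified call: the missing parameters are filled with their defaults.
partial = normalize_to_kwargs(conv2d, ("x", "w"), {"padding": (3, 3)})
print(normalized)
print(partial)
```

After normalization a converter can read kwargs["padding"] or kwargs["groups"] without caring how the original call was written, which is how the conv2d node appears in the post-trace graph.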

As for the run function, the node traversal itself happens in the parent class Interpreter:

# torch/fx/interpreter.py
for node in self.module.graph.nodes:
    if node in self.env:
        # Short circuit if we have this value. This could
        # be used, for example, for partial evaluation
        # where the caller has pre-populated `env` with
        # values for a subset of the program.
        continue

    try:
        self.env[node] = self.run_node(node)
    except Exception as e:
        msg = f"While executing {node.format_node()}"
        msg = '{}\n\n{}'.format(e.args[0], msg) if e.args else str(msg)
        msg += f"\nOriginal traceback:\n{node.stack_trace}"
        e.args = (msg,) + e.args[1:]
        if isinstance(e, KeyError):
            raise RuntimeError(*e.args)
        raise

    if self.garbage_collect_values:
        for to_delete in self.user_to_last_uses.get(node, []):
            del self.env[to_delete]

    if node.op == 'output':
        output_val = self.env[node]
        return self.module.graph.process_outputs(output_val) if enable_io_processing else output_val

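Because run_node is overridden, however, the method in the subclass TRTInterpreter is called (we can use the same approach later to implement our own interpreters for other purposes). Depending on the node type, it ends up calling a different node method, for example call_module, call_function or call_method, which correspond to three kinds of FX IR node; each of these functions looks up the conversion op in CONVERTERS: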

def call_module(self, target, args, kwargs):
    assert isinstance(target, str)
    submod = self.fetch_attr(target)
    submod_type = getattr(submod, "_base_class_origin", type(submod))
    converter = CONVERTERS.get(submod_type)

    if not converter:
        raise RuntimeError(
            f"Conversion of module of type {submod_type} not currently supported!"
        )

    assert self._cur_node_name is not None
    return converter(self.network, submod, args, kwargs, self._cur_node_name)

def call_function(self, target, args, kwargs):
    converter = CONVERTERS.get(target)
    if not converter:
        raise RuntimeError(
            f"Conversion of function {torch.typename(target)} not currently supported!"
        )

    assert self._cur_node_name is not None
    return converter(self.network, target, args, kwargs, self._cur_node_name)

def call_method(self, target, args, kwargs):
    assert isinstance(target, str)
    converter = CONVERTERS.get(target)

    if not converter:
        raise RuntimeError(
            f"Conversion of method {target} not currently supported!"
        )

    assert self._cur_node_name is not None
    return converter(self.network, target, args, kwargs, self._cur_node_name)
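The overall pattern, a base class that owns the node loop and a subclass that overrides run_node to emit backend layers through a converter registry, can be sketched in plain Python. Everything here (ToyInterpreter, ToyBackendInterpreter, the converter decorator and the string "layers") is a hypothetical stand-in meant to show the control flow, not the torch.fx or TensorRT API.

```python
CONVERTERS = {}  # target -> converter function (hypothetical registry)


def converter(target):
    """Register a converter for one target, similar in spirit to @tensorrt_converter."""
    def decorator(fn):
        CONVERTERS[target] = fn
        return fn
    return decorator


@converter("conv2d")
def convert_conv(network, kwargs, name):
    network.append(f"ConvolutionLayer[{name}](groups={kwargs['groups']})")


@converter("relu")
def convert_relu(network, kwargs, name):
    network.append(f"ActivationLayer[{name}]")


class ToyInterpreter:
    """Base class: owns the node loop and dispatches through run_node."""

    def __init__(self, nodes):
        self.nodes = nodes

    def run(self):
        for node in self.nodes:
            self.run_node(node)

    def run_node(self, node):
        raise NotImplementedError


class ToyBackendInterpreter(ToyInterpreter):
    """Subclass: instead of executing a node, emit a backend layer for it."""

    def __init__(self, nodes):
        super().__init__(nodes)
        self.network = []

    def run_node(self, node):
        name, target, kwargs = node
        convert = CONVERTERS.get(target)
        if convert is None:
            raise RuntimeError(f"Conversion of {target} not currently supported!")
        convert(self.network, kwargs, name)


interp = ToyBackendInterpreter([
    ("conv2d_55", "conv2d", {"groups": 1}),
    ("relu_48", "relu", {}),
])
interp.run()
print(interp.network)
```

The loop in the base class never changes; supporting a new op only means registering one more converter.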

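The conversion ops are registered in TensorRT/py/torch_tensorrt/fx/converters/acc_ops_converters.py. Taking convolution as an example: each acc-op has a corresponding converter, and each converter function calls the TRT API to build the network: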

@tensorrt_converter(acc_ops.conv3d)
@tensorrt_converter(acc_ops.conv2d)
def acc_ops_convnd(
    network: TRTNetwork,
    target: Target,
    args: Tuple[Argument, ...],
    kwargs: Dict[str, Argument],
    name: str,
) -> Union[TRTTensor, Sequence[TRTTensor]]:
    input_val = kwargs["input"]

    if not isinstance(input_val, TRTTensor):
        raise RuntimeError(
            f"Conv received input {input_val} that is not part "
            "of the TensorRT region!"
        )

    if has_dynamic_shape(input_val.shape):
        assert input_val.shape[1] != -1, "Channel dim can't be dynamic for convolution."

    # for now we'll assume bias is constant Tensor or None,
    # and bias being ITensor is not supported in TensorRT api
    # right now
    if kwargs["bias"] is not None and not isinstance(kwargs["bias"], torch.Tensor):
        raise RuntimeError(
            f"linear {name} has bias of type {type(kwargs['bias'])}, Expect Optional[Tenosr]"
        )
    bias = to_numpy(kwargs["bias"])  # type: ignore[arg-type]

    if network.has_explicit_precision:
        weight = get_trt_tensor(network, kwargs["weight"], f"{name}_weight")
        weight_shape = tuple(kwargs["weight"].shape)  # type: ignore[union-attr]
        # will need to use uninitialized weight and set it later to support
        # ITensor weights
        dummy_weight = trt.Weights()
        layer = network.add_convolution_nd(
            input=input_val,
            num_output_maps=weight.shape[0],
            kernel_shape=weight.shape[2:],
            kernel=dummy_weight,
            bias=bias,
        )

        layer.set_input(1, weight)
    else:
        if not isinstance(kwargs["weight"], torch.Tensor):
            raise RuntimeError(
                f"linear {name} has weight of type {type(kwargs['weight'])}, Expect Optional[Tenosr]"
            )
        weight = to_numpy(kwargs["weight"])
        layer = network.add_convolution_nd(
            input=input_val,
            num_output_maps=weight.shape[0],
            kernel_shape=weight.shape[2:],
            kernel=weight,
            bias=bias,
        )

    set_layer_name(layer, target, name)
    layer.stride_nd = kwargs["stride"]
    layer.padding_nd = kwargs["padding"]
    layer.dilation_nd = kwargs["dilation"]
    if kwargs["groups"] is not None:
        layer.num_groups = kwargs["groups"]

    return layer.get_output(0)

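Once the network is built, set a few build parameters and the engine can be built:

engine = self.builder.build_engine(self.network, builder_config)

After the build, pass the engine to TRTModule, and trt_mod can be called directly to verify accuracy: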

engine, input_names, output_names = res.engine, res.input_names, res.output_names
trt_mod = TRTModule(engine, input_names, output_names)

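I verified the accuracy of this model here. There are two classes and a little over 40,000 training images; 512 images were used for calibration, the score threshold for evaluation is 0.1 and the NMS threshold is 0.2.

Metrics before quantization: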

|   AP   |  AP50  |  AP60  |  APs   |  APm   |  APl   |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 62.745 | 95.430 | 76.175 | 54.004 | 66.575 | 63.692 |

Metrics after quantization:

|   AP   |  AP50  |  AP60  |  APs   |  APm   |  APl   |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 60.340 | 95.410 | 70.561 | 50.154 | 64.969 | 62.009 |

Metrics after quantization and conversion to TensorRT:

|   AP   |  AP50  |  AP60  |  APs   |  APm   |  APl   |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 60.355 | 95.404 | 70.412 | 50.615 | 64.763 | 61.322 |

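So AP dropped by about 2.4 points (62.7 to 60.4), but AP50 barely moved (95.43 to 95.40), which is acceptable. Now for speed: on a 3080 GPU one frame takes 3.8ms, somewhat faster than the 4.8ms of FP16, but still not fast enough.

As a quick comparison I ran TRT's implicit quantization (implicit mode): convert the CenterNet model to ONNX first, then use trtexec with INT8 forced (accuracy is ignored here and no calibration images are passed in; the point is only to measure INT8 speed). The result is just 3.1ms.

That is a sizeable gap. Evidently some layers were not fully optimized when FX converted the model to TRT. So let's compare the network structures of the two engines, starting with the engine built in implicit mode: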

[03/07/2022-11:34:20] [I]                                      Conv_101 + Add_103 + Relu_104       16.09           0.0215      0.7
[03/07/2022-11:34:20] [I]                                                Conv_105 + Relu_106       14.89           0.0199      0.6
[03/07/2022-11:34:20] [I]                                                Conv_107 + Relu_108       20.96           0.0280      0.9
[03/07/2022-11:34:20] [I]                                      Conv_109 + Add_110 + Relu_111       15.18           0.0203      0.6
[03/07/2022-11:34:20] [I]                                                Conv_112 + Relu_113       14.31           0.0191      0.6
[03/07/2022-11:34:20] [I]                                                Conv_114 + Relu_115       20.82           0.0278      0.9
[03/07/2022-11:34:20] [I]                                      Conv_116 + Add_117 + Relu_118       15.16           0.0202      0.6
[03/07/2022-11:34:20] [I]                                                           Conv_119       40.61           0.0542      1.7
[03/07/2022-11:34:20] [I]              ConvTranspose_120 + BatchNormalization_121 + Relu_122       31.20           0.0416      1.3
[03/07/2022-11:34:20] [I]              ConvTranspose_123 + BatchNormalization_124 + Relu_125      110.56           0.1476      4.7
[03/07/2022-11:34:20] [I]              ConvTranspose_126 + BatchNormalization_127 + Relu_128      509.55           0.6803     21.7
[03/07/2022-11:34:20] [I]  Conv_129 + Relu_130 || Conv_132 + Relu_133 || Conv_135 + Relu_136      197.13           0.2632      8.4
[03/07/2022-11:34:20] [I]               Reformatting CopyNode for Input Tensor 0 to Conv_131       13.22           0.0177      0.6
[03/07/2022-11:34:20] [I]                                                           Conv_131       12.35           0.0165      0.5
[03/07/2022-11:34:20] [I]               Reformatting CopyNode for Input Tensor 0 to Conv_134       13.12           0.0175      0.6
[03/07/2022-11:34:20] [I]                                                           Conv_134       12.14           0.0162      0.5
[03/07/2022-11:34:20] [I]               Reformatting CopyNode for Input Tensor 0 to Conv_137       13.07           0.0175      0.6
[03/07/2022-11:34:20] [I]                                                           Conv_137       11.99           0.0160      0.5
[03/07/2022-11:34:20] [I]                                                              Total     2352.92           3.1414    100.0

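As you can see, everything that should be fused has been fused, in particular Conv_116 + Add_117 + Relu_118, ConvTranspose_120 + BatchNormalization_121 + Relu_122, and Conv_129 + Relu_130 || Conv_132 + Relu_133 || Conv_135 + Relu_136, all of which are fusions that bring large speedups. The figure below was drawn from the log produced while the TRT engine was being built:

Now let's look at the network structure of the TRT model produced by the FX conversion: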

[03/03/2022-14:46:31] [I]                                                                                                   add_29 + relu_97        8.90           0.0137      0.4
[03/03/2022-14:46:31] [I]   quantize_per_channel_110_input + (Unnamed Layer* 592) [Constant]_output_per_channel_quant + conv2d_107 + relu_98       12.88           0.0199      0.5
[03/03/2022-14:46:31] [I]   quantize_per_channel_111_input + (Unnamed Layer* 603) [Constant]_output_per_channel_quant + conv2d_108 + relu_99       19.11           0.0295      0.8
[03/03/2022-14:46:31] [I]             quantize_per_channel_112_input + (Unnamed Layer* 614) [Constant]_output_per_channel_quant + conv2d_109       12.09           0.0187      0.5
[03/03/2022-14:46:31] [I]                                                                                                  add_30 + relu_100        8.84           0.0136      0.4
[03/03/2022-14:46:31] [I]  quantize_per_channel_113_input + (Unnamed Layer* 630) [Constant]_output_per_channel_quant + conv2d_110 + relu_101       12.61           0.0195      0.5
[03/03/2022-14:46:31] [I]  quantize_per_channel_114_input + (Unnamed Layer* 641) [Constant]_output_per_channel_quant + conv2d_111 + relu_102       18.68           0.0288      0.8
[03/03/2022-14:46:31] [I]             quantize_per_channel_115_input + (Unnamed Layer* 652) [Constant]_output_per_channel_quant + conv2d_112       12.11           0.0187      0.5
[03/03/2022-14:46:31] [I]                                                                                                  add_31 + relu_103        8.84           0.0136      0.4
[03/03/2022-14:46:31] [I]             quantize_per_channel_116_input + (Unnamed Layer* 668) [Constant]_output_per_channel_quant + conv2d_113       37.40           0.0577      1.5
[03/03/2022-14:46:31] [I]     quantize_per_channel_117_input + (Unnamed Layer* 678) [Constant]_output_per_channel_quant + conv_transpose2d_3       30.68           0.0474      1.2
[03/03/2022-14:46:31] [I]                                                                                                      PWN(relu_104)        4.73           0.0073      0.2
[03/03/2022-14:46:31] [I]     quantize_per_channel_118_input + (Unnamed Layer* 693) [Constant]_output_per_channel_quant + conv_transpose2d_4      102.36           0.1580      4.2
[03/03/2022-14:46:31] [I]                                                                                                      PWN(relu_105)       10.18           0.0157      0.4
[03/03/2022-14:46:31] [I]     quantize_per_channel_119_input + (Unnamed Layer* 708) [Constant]_output_per_channel_quant + conv_transpose2d_5      447.84           0.6911     18.2
[03/03/2022-14:46:31] [I]                                                                                                      PWN(relu_106)       34.68           0.0535      1.4
[03/03/2022-14:46:31] [I]  quantize_per_channel_120_input + (Unnamed Layer* 723) [Constant]_output_per_channel_quant + conv2d_114 + relu_107       65.06           0.1004      2.6
[03/03/2022-14:46:31] [I]  quantize_per_channel_122_input + (Unnamed Layer* 742) [Constant]_output_per_channel_quant + conv2d_116 + relu_108       64.46           0.0995      2.6
[03/03/2022-14:46:31] [I]  quantize_per_channel_124_input + (Unnamed Layer* 761) [Constant]_output_per_channel_quant + conv2d_118 + relu_109       64.35           0.0993      2.6
[03/03/2022-14:46:31] [I]             quantize_per_channel_121_input + (Unnamed Layer* 734) [Constant]_output_per_channel_quant + conv2d_115       11.23           0.0173      0.5
[03/03/2022-14:46:31] [I]             quantize_per_channel_123_input + (Unnamed Layer* 753) [Constant]_output_per_channel_quant + conv2d_117       11.16           0.0172      0.5
[03/03/2022-14:46:31] [I]             quantize_per_channel_125_input + (Unnamed Layer* 772) [Constant]_output_per_channel_quant + conv2d_119       11.20           0.0173      0.5
[03/03/2022-14:46:31] [I]                        Reformatting CopyNode for Input Tensor 0 to (Unnamed Layer* 741) [Quantize]_output_.dequant        6.92           0.0107      0.3
[03/03/2022-14:46:31] [I]                                                                    (Unnamed Layer* 741) [Quantize]_output_.dequant        4.45           0.0069      0.2
[03/03/2022-14:46:31] [I]                        Reformatting CopyNode for Input Tensor 0 to (Unnamed Layer* 760) [Quantize]_output_.dequant        6.34           0.0098      0.3
[03/03/2022-14:46:31] [I]                                                                    (Unnamed Layer* 760) [Quantize]_output_.dequant        4.56           0.0070      0.2
[03/03/2022-14:46:31] [I]                        Reformatting CopyNode for Input Tensor 0 to (Unnamed Layer* 779) [Quantize]_output_.dequant        6.00           0.0093      0.2
[03/03/2022-14:46:31] [I]                                                                    (Unnamed Layer* 779) [Quantize]_output_.dequant        4.35           0.0067      0.2
[03/03/2022-14:46:31] [I]                                                                                                              Total     2464.87           3.8038    100.0

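Here the Conv_116 + Add_117 + Relu_118 fusion is missing, and so are the later ConvTranspose_120 + BatchNormalization_121 + Relu_122 and Conv_129 + Relu_130 || Conv_132 + Relu_133 || Conv_135 + Relu_136 fusions. This part costs roughly 0.6 ms more (3.80 ms versus 3.14 ms in total):

Why does this happen? A closer look at the FX model structure shows an extra Q/DQ pair at this point. For TensorRT, Q/DQ nodes placed in the wrong position keep the builder from fully optimizing the quantized graph.

Ideally the graph should look like this: the BN layer is followed directly by Add with no Q/DQ in between, so TRT fuses conv + bn + add and the following relu into Conv_116 + Add_117 + Relu_118:

One more point: older versions of FX did not fuse the BN layer that follows a transposed convolution during the fuse step (covered in part two), which also interferes with the later quantization and leaves the graph only partly optimized. Once both issues are fixed, TRT optimizes the graph normally.

How do we remove the extra Q/DQ ops in bulk? The interpreter introduced earlier is enough: during propagate, rewrite the args of each add node to point at the correct node. There are 17 such nodes in total, so they are patched in one pass: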

def propagate(self, *args):
    args_iter = iter(args)
    env : Dict[str, Node] = {}

    def load_arg(a):
        return fx.graph.map_arg(a, lambda n: env[n.name])

    def fetch_attr(target : str):
        target_atoms = target.split('.')
        attr_itr = self.mod
        for i, atom in enumerate(target_atoms):
            if not hasattr(attr_itr, atom):
                raise RuntimeError(f"Node referenced nonexistant target {'.'.join(target_atoms[:i])}")
            attr_itr = getattr(attr_itr, atom)
        return attr_itr

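    for node in self.graph.nodes:
        # Rewire each add node listed in change_list so that its first input
        # comes straight from the correct upstream node, bypassing the extra Q/DQ.
        if "add" in node.name and node.name in self.change_list:
            node.args = (self.change_list[node.name], node.args[1])
    # The bypassed nodes now have no users, so remove them from the graph.
    self.mod.graph.eliminate_dead_code()
    # Regenerate the module's code from the updated graph.
    self.mod.recompile()

    return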

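That does it: after the change there is no Q/DQ between the add and the preceding conv layer (the BN has been absorbed into the conv):

Likewise, the transposed convolution is now merged with its BN layer:

Convert the modified FX model through TensorRT once more and benchmark it again:

# Before modifying the network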
=== Performance summary ===
Throughput: 260.926 qps
Latency: min = 4.91473 ms, max = 5.23787 ms, mean = 4.97783 ms, median = 4.97583 ms, percentile(99%) = 5.22012 ms
End-to-End Host Latency: min = 4.98529 ms, max = 8.08485 ms, mean = 7.56827 ms, median = 7.58014 ms, percentile(99%) = 8.06438 ms
Enqueue Time: min = 0.375031 ms, max = 0.717957 ms, mean = 0.394493 ms, median = 0.391724 ms, percentile(99%) = 0.470032 ms
H2D Latency: min = 1.03088 ms, max = 1.09827 ms, mean = 1.03257 ms, median = 1.03235 ms, percentile(99%) = 1.03613 ms
GPU Compute Time: min = 3.75397 ms, max = 4.07245 ms, mean = 3.81574 ms, median = 3.81421 ms, percentile(99%) = 4.05913 ms
D2H Latency: min = 0.125977 ms, max = 0.153076 ms, mean = 0.129512 ms, median = 0.129333 ms, percentile(99%) = 0.131836 ms
Total Host Walltime: 3.01235 s
Total GPU Compute Time: 2.99917 s
Explanations of the performance metrics are printed in the verbose logs.

# After modifying the network
=== Performance summary ===
Throughput: 305.313 qps
Latency: min = 4.35956 ms, max = 4.64665 ms, mean = 4.41392 ms, median = 4.40918 ms, percentile(99%) = 4.62846 ms
End-to-End Host Latency: min = 4.401 ms, max = 6.90311 ms, mean = 6.43806 ms, median = 6.43774 ms, percentile(99%) = 6.88329 ms
Enqueue Time: min = 0.320801 ms, max = 0.559082 ms, mean = 0.334164 ms, median = 0.330078 ms, percentile(99%) = 0.486328 ms
H2D Latency: min = 1.03186 ms, max = 1.03824 ms, mean = 1.03327 ms, median = 1.0332 ms, percentile(99%) = 1.03638 ms
GPU Compute Time: min = 3.20001 ms, max = 3.48364 ms, mean = 3.25109 ms, median = 3.24609 ms, percentile(99%) = 3.46623 ms
D2H Latency: min = 0.126404 ms, max = 0.13208 ms, mean = 0.129566 ms, median = 0.129395 ms, percentile(99%) = 0.13147 ms
Total Host Walltime: 3.01003 s
Total GPU Compute Time: 2.98775 s
Explanations of the performance metrics are printed in the verbose logs.

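GPU compute time dropped from 3.8 ms to 3.2 ms, a gain of about 0.6 ms, and QPS rose by about 17% (260.9 to 305.3), with accuracy unchanged. The TensorRT log now shows that everything that should be fused is fused correctly.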
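The figures above can be checked directly against the two trtexec summaries. Below is a small sketch of the arithmetic, using the mean values copied from the logs; the variable names are my own.

```python
# Mean values copied from the two trtexec performance summaries above.
before = {"qps": 260.926, "gpu_ms": 3.81574}
after = {"qps": 305.313, "gpu_ms": 3.25109}

# Absolute reduction in mean GPU compute time, in milliseconds.
saved_ms = before["gpu_ms"] - after["gpu_ms"]

# Relative throughput gain, in percent.
qps_gain_pct = (after["qps"] / before["qps"] - 1.0) * 100.0

print(f"GPU compute time saved: {saved_ms:.2f} ms")
print(f"Throughput gain: {qps_gain_pct:.1f} %")
```

The saving is a little under 0.6 ms and the throughput gain is close to 17%.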

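What puzzles me is that the current 3.2 ms is still 0.1 ms slower than the 3.1 ms of the engine quantized directly by trtexec in implicit mode above. So I tried quantizing this model with trtexec plus calibration data, and the time went back to 3.2 ms. I do not know the reason yet; if you do, please leave a comment.

At this point we have successfully post-training quantized a model with FX and converted it to TensorRT, with accuracy and speed roughly as expected.

The graph must match how TensorRT builds a network

If the exported model has wrong nodes, dangling nodes (a node whose output is neither an input to any layer nor a model output), or wrongly referenced nodes (a node that reads an attribute which does not exist, for example backbone_base_fc_bias = self.backbone.base.fc.bias where fc is a ConvRelu2D), TRT reports an error at build time: Error Code 4: Internal Error ([DECONVOLUTION]-[acc_ops.conv_transpose2d]-[conv_transpose2d_3]: Missing Dequantization layer - 2nd input to a weighted-layer must include exactly one DQ layer.). It may also be a TensorRT bug: the FX network with modified nodes works fine on TensorRT 8.2 and later, but on TensorRT 8.0.1.6 it builds a baffling model (in the structure shown below, INT8 and FP32 nodes are mixed up):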

Layer(CaskConvolution): quantize_per_channel_106_input + 492output_per_channel_quant + conv2d_103 + relu_95, Tactic: 805889586762897346, 489output[Int8(1,1024,-26,-29)] -> 500output[Int8(1,512,-26,-29)]
Layer(CaskConvolution): quantize_per_channel_109_input + 520output_per_channel_quant + conv2d_106, Tactic: 7738495016763012180, 489output[Int8(1,1024,-26,-29)] -> 527output[Int8(1,2048,-50,-51)]
Layer(CaskConvolution): quantize_per_channel_107_input + 503output_per_channel_quant + conv2d_104 + relu_96, Tactic: 6781129591847482048, 500output[Int8(1,512,-26,-29)] -> 511output[Int8(1,512,-50,-51)]
Layer(CaskConvolution): quantize_per_channel_108_input + 514output_per_channel_quant + conv2d_105 + add_29 + relu_97, Tactic: 8234775147403903473, 511output[Int8(1,512,-50,-51)], 527output[Int8(1,2048,-50,-51)] -> 533output[Int8(1,2048,-50,-51)]
Layer(CudnnConvolution): quantize_per_channel_110_input + 536output_per_channel_quant + 538output_.dequant + conv2d_107 + relu_98, Tactic: 1, 535output[Float(1,2048,-50,-51)] -> 542Activation]_output[Float(1,512,-50,-51)]
Layer(Scale): 542Activation]_output_per_tensor_quant, Tactic: 0, 542Activation]_output[Float(1,512,-50,-51)] -> 544output[Int8(1,512,-50,-51)]
Layer(CaskConvolution): quantize_per_channel_111_input + 547output_per_channel_quant + conv2d_108 + relu_99, Tactic: 7438984192263206338, 544output[Int8(1,512,-50,-51)] -> 555output[Int8(1,512,-50,-51)]
Layer(CaskConvolution): quantize_per_channel_112_input + 558output_per_channel_quant + conv2d_109 + add_30 + relu_100, Tactic: 8234775147403903473, 555output[Int8(1,512,-50,-51)], 533output[Int8(1,2048,-50,-51)] -> 567output[Int8(1,2048,-50,-51)]
Layer(CudnnConvolution): quantize_per_channel_113_input + 570output_per_channel_quant + 572output_.dequant + conv2d_110 + relu_101, Tactic: 1, 569output[Float(1,2048,-50,-51)] -> 576Activation]_output[Float(1,512,-50,-51)]
Layer(Scale): 576Activation]_output_per_tensor_quant, Tactic: 0, 576Activation]_output[Float(1,512,-50,-51)] -> 578output[Int8(1,512,-50,-51)]
Layer(CaskConvolution): quantize_per_channel_114_input + 581output_per_channel_quant + conv2d_111 + relu_102, Tactic: 7438984192263206338, 578output[Int8(1,512,-50,-51)] -> 589output[Int8(1,512,-50,-51)]
Layer(CaskConvolution): quantize_per_channel_115_input + 592output_per_channel_quant + conv2d_112 + add_31 + relu_103, Tactic: 8234775147403903473, 589output[Int8(1,512,-50,-51)], 567output[Int8(1,2048,-50,-51)] -> 601output[Int8(1,2048,-50,-51)]

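The engine does build, but it is very slow and accuracy is completely lost, which made debugging harder.

Another way to do FX2TRT

TensorRT supports explicit quantization (explicit mode) and implicit quantization (implicit mode). What we used above is explicit quantization, where Q/DQ nodes explicitly mark what should be quantized. We can also go from FX to TensorRT with implicit quantization. In that case we do not convert the reference version of the model: instead of simulated quantization, the operators themselves are INT8, i.e. quantized_fx = convert_fx(model.fx_model).

PyTorch has INT8 operators on the CPU side. The model actually calls the torch.nn.quantized.modules.conv.Conv2d operator, and the following converter is invoked during conversion to TRT: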

@tensorrt_converter(torch.nn.quantized.modules.conv.Conv2d)
def quantized_conv2d(network, submod, args, kwargs, layer_name):
    input_val = args[0]

    if not isinstance(input_val, trt.tensorrt.ITensor):
        raise RuntimeError(
            f"Quantized Conv2d received input {input_val} that is not part "
            "of the TensorRT region!"
        )

    return common_conv(
        network,
        submod,
        dimension=2,
        input_val=input_val,
        layer_name=layer_name,
        is_quantized=True,
    )

During conversion we pass in the scale and zero_point of each layer's activations, but the weights are still calibrated internally by TensorRT:

if is_quantized:
    # Assume the dtype of activation is torch.quint8
    mark_as_int8_layer(
        layer, get_dyn_range(mod.scale, mod.zero_point, torch.quint8)
    )
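To make the role of scale and zero_point concrete: under affine quantization an integer q stands for the real value (q - zero_point) * scale, so a layer's dynamic range follows from the integer limits of the dtype. The body of get_dyn_range is not shown here, so the sketch below is only my assumption of what such a helper computes; dyn_range_from_qparams and the sample values are hypothetical.

```python
def dyn_range_from_qparams(scale, zero_point, qmin, qmax):
    """Return the (min, max) real range covered by affine quantization.

    An integer q represents (q - zero_point) * scale, so the endpoints of
    the range are the real values of qmin and qmax.
    """
    return (qmin - zero_point) * scale, (qmax - zero_point) * scale


# torch.quint8 stores integers in [0, 255]; scale and zero_point are made up.
lo, hi = dyn_range_from_qparams(scale=0.05, zero_point=128, qmin=0, qmax=255)
print(lo, hi)
```

With a zero_point in the middle of the integer range, the resulting dynamic range is nearly symmetric around zero, which is the shape TensorRT's per-tensor INT8 ranges expect.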

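I will not demonstrate this path here.

A quick word on the TRTModule class

In FX2TRT the final engine is managed by this class. It wraps the engine so that calling an instance works just like calling an ordinary nn.Module, which is very convenient.

The details of TRTModule are worth reading in the code: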

# torch_tensorrt/fx/trt_module.py

class TRTModule(torch.nn.Module):
    def __init__(
        self, engine=None, input_names=None, output_names=None, cuda_graph_batch_size=-1
    ):
        super(TRTModule, self).__init__()
        self._register_state_dict_hook(TRTModule._on_state_dict)
        self.engine = engine
        self.input_names = input_names
        self.output_names = output_names
        self.cuda_graph_batch_size = cuda_graph_batch_size
        self.initialized = False

        if engine:
            self._initialize()

    def _initialize(self):
        self.initialized = True
        self.context = self.engine.create_execution_context()

        # Indices of inputs/outputs in the trt engine bindings, in the order
        # as they are in the original PyTorch model.
        self.input_binding_indices_in_order: Sequence[int] = [
            self.engine.get_binding_index(name) for name in self.input_names
        ]
        self.output_binding_indices_in_order: Sequence[int] = [
            self.engine.get_binding_index(name) for name in self.output_names
        ]
        primary_input_outputs = set()
        primary_input_outputs.update(self.input_binding_indices_in_order)
        primary_input_outputs.update(self.output_binding_indices_in_order)
        self.hidden_output_binding_indices_in_order: Sequence[int] = []
        self.hidden_output_names: Sequence[str] = []
        for i in range(
            self.engine.num_bindings // self.engine.num_optimization_profiles
        ):
            if i not in primary_input_outputs:
                self.hidden_output_binding_indices_in_order.append(i)
                self.hidden_output_names.append(self.engine.get_binding_name(i))

        assert (self.engine.num_bindings // self.engine.num_optimization_profiles) == (
            len(self.input_names)
            + len(self.output_names)
            + len(self.hidden_output_names)
        )

        self.input_dtypes: Sequence[torch.dtype] = [
            torch_dtype_from_trt(self.engine.get_binding_dtype(idx))
            for idx in self.input_binding_indices_in_order
        ]
        self.input_shapes: Sequence[Sequence[int]] = [
            tuple(self.engine.get_binding_shape(idx))
            for idx in self.input_binding_indices_in_order
        ]
        self.output_dtypes: Sequence[torch.dtype] = [
            torch_dtype_from_trt(self.engine.get_binding_dtype(idx))
            for idx in self.output_binding_indices_in_order
        ]
        self.output_shapes = [
            tuple(self.engine.get_binding_shape(idx))
            if self.engine.has_implicit_batch_dimension
            else tuple()
            for idx in self.output_binding_indices_in_order
        ]
        self.hidden_output_dtypes: Sequence[torch.dtype] = [
            torch_dtype_from_trt(self.engine.get_binding_dtype(idx))
            for idx in self.hidden_output_binding_indices_in_order
        ]
        self.hidden_output_shapes = [
            tuple(self.engine.get_binding_shape(idx))
            if self.engine.has_implicit_batch_dimension
            else tuple()
            for idx in self.hidden_output_binding_indices_in_order
        ]

    def _check_initialized(self):
        if not self.initialized:
            raise RuntimeError("TRTModule is not initialized.")

    def _on_state_dict(self, state_dict, prefix, local_metadata):
        self._check_initialized()
        state_dict[prefix + "engine"] = bytearray(self.engine.serialize())
        state_dict[prefix + "input_names"] = self.input_names
        state_dict[prefix + "output_names"] = self.output_names
        state_dict[prefix + "cuda_graph_batch_size"] = self.cuda_graph_batch_size

    def _load_from_state_dict(
        self,
        state_dict,
        prefix,
        local_metadata,
        strict,
        missing_keys,
        unexpected_keys,
        error_msgs,
    ):
        engine_bytes = state_dict[prefix + "engine"]

        logger = trt.Logger()
        runtime = trt.Runtime(logger)
        self.engine = runtime.deserialize_cuda_engine(engine_bytes)

        self.input_names = state_dict[prefix + "input_names"]
        self.output_names = state_dict[prefix + "output_names"]
        self._initialize()

    def __getstate__(self):
        state = self.__dict__.copy()
        state["engine"] = bytearray(self.engine.serialize())
        state.pop("context", None)
        return state

    def __setstate__(self, state):
        logger = trt.Logger()
        runtime = trt.Runtime(logger)
        state["engine"] = runtime.deserialize_cuda_engine(state["engine"])
        self.__dict__.update(state)
        if self.engine:
            self.context = self.engine.create_execution_context()

    def forward(self, *inputs):
        with torch.autograd.profiler.record_function("TRTModule:Forward"):
            self._check_initialized()

            with torch.autograd.profiler.record_function("TRTModule:ProcessInputs"):
                assert len(inputs) == len(
                    self.input_names
                ), f"Wrong number of inputs, expect {len(self.input_names)} get {len(inputs)}."

                # This is only used when the trt engine is using implicit batch dim.
                batch_size = inputs[0].shape[0]
                contiguous_inputs: List[torch.Tensor] = [i.contiguous() for i in inputs]
                bindings: List[Any] = [None] * (
                    len(self.input_names)
                    + len(self.output_names)
                    + len(self.hidden_output_names)
                )

                for i, input_name in enumerate(self.input_names):
                    assert inputs[
                        i
                    ].is_cuda, f"{i}th input({input_name}) is not on cuda device."
                    assert (
                        inputs[i].dtype == self.input_dtypes[i]
                    ), f"Dtype mismatch for {i}th input({input_name}). Expect {self.input_dtypes[i]}, got {inputs[i].dtype}."

                    idx = self.input_binding_indices_in_order[i]
                    bindings[idx] = contiguous_inputs[i].data_ptr()

                    if not self.engine.has_implicit_batch_dimension:
                        self.context.set_binding_shape(
                            idx, tuple(contiguous_inputs[i].shape)
                        )
                    else:
                        assert inputs[i].size()[1:] == self.input_shapes[i], (
                            f"Shape mismatch for {i}th input({input_name}). "
                            f"Expect {self.input_shapes[i]}, got {inputs[i].size()[1:]}."
                        )

            with torch.autograd.profiler.record_function("TRTModule:ProcessOutputs"):
                # create output tensors
                outputs: List[torch.Tensor] = []

                for i, idx in enumerate(self.output_binding_indices_in_order):
                    if self.engine.has_implicit_batch_dimension:
                        shape = (batch_size,) + self.output_shapes[i]
                    else:
                        shape = tuple(self.context.get_binding_shape(idx))

                    output = torch.empty(  # type: ignore[call-overload]
                        size=shape,
                        dtype=self.output_dtypes[i],
                        device=torch.cuda.current_device(),
                    )
                    outputs.append(output)
                    bindings[idx] = output.data_ptr()

                for i, idx in enumerate(self.hidden_output_binding_indices_in_order):
                    if self.engine.has_implicit_batch_dimension:
                        shape = (batch_size,) + self.hidden_output_shapes[i]
                    else:
                        shape = tuple(self.context.get_binding_shape(idx))

                    output = torch.empty(  # type: ignore[call-overload]
                        size=shape,
                        dtype=self.hidden_output_dtypes[i],
                        device=torch.cuda.current_device(),
                    )
                    bindings[idx] = output.data_ptr()

            with torch.autograd.profiler.record_function("TRTModule:TensorRTRuntime"):
                if self.engine.has_implicit_batch_dimension:
                    self.context.execute_async(
                        batch_size, bindings, torch.cuda.current_stream().cuda_stream
                    )
                else:
                    self.context.execute_async_v2(
                        bindings, torch.cuda.current_stream().cuda_stream
                    )

            if len(outputs) == 1:
                return outputs[0]

            return tuple(outputs)

    def enable_profiling(self, profiler: "trt.IProfiler" = None):
        """
        Enable TensorRT profiling. After calling this function, TensorRT will report
        time spent on each layer in stdout for each forward run.
        """
        self._check_initialized()

        if not self.context.profiler:
            self.context.profiler = trt.Profiler() if profiler is None else profiler

    def disable_profiling(self):
        """
        Disable TensorRT profiling.
        """
        self._check_initialized()

        torch.cuda.synchronize()
        del self.context
        self.context = self.engine.create_execution_context()

    def get_layer_info(self) -> str:
        """
        Get layer info of the engine. Only support for TRT > 8.2.
        """
        inspector = self.engine.create_engine_inspector()
        return inspector.get_engine_information(trt.LayerInformationFormat.JSON)

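The first place I saw TRTModule was torch2trt, another PyTorch-to-TensorRT conversion tool that is just as easy to use.

Part 5: TensorRT Basics

1. Why can inference latency get longer when trtexec builds an engine with FP16 inference or INT8 quantization?

A: Possible reasons:

a. Quantization can introduce extra compute operations and internal reshapes. For a small model the extra compute adds little latency, but reshapes involve memory operations, and these are the main cause of the longer latency. Where we can, the remedy is to keep TensorRT from inserting these extra operations, but reshapes that TensorRT generates internally are out of our control.
b. In addition, TensorRT picks kernels through kernel auto-tuning, so the kernel it selects is not guaranteed to be the most efficient one.

2. What is Myelin?

A: It is a TensorRT-internal component responsible for graph compilation and the execution backend.

3. What is the difference between constant cache and constant memory?

A: They are two different concepts. The cache sits closer to the compute units, so it is faster. Constant cache is a concept from earlier GPU generations, such as the SM block of the early Fermi architecture (left figure); the SM of the current Ampere architecture is shown in the right figure.

4. With identical CUDA, cuDNN and TensorRT versions, can an engine built on another machine be run directly on my own machine?

A: TensorRT optimizes differently for different GPU architectures, so an engine moved to another platform may be incompatible.

5. A question: for fairly large models, trtexec takes too long to search on edge devices. Are there any good methods or tricks? (Searching for the best implementation of each layer while building the engine takes too long.)

A: An engine has to be built on the device that will run inference, and edge devices may be somewhat slower. However, if you build an engine for the same model several times, or change only a few layers while most stay untouched (for example while debugging or testing), you can save the best tactics TensorRT found for each layer on the first build, that is, the kernels and optimization choices. On later builds, TensorRT reads the saved tactics and skips the search for layers it has already profiled. This is the timing cache, and it can be specified both on the trtexec command line and through the TensorRT API. See https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#opt-builder-perf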

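TensorRT model deployment optimization

1. After deploying a model, what tools can be used to analyze inference performance?

A: Use the Nsight tools. They capture the running time of each kernel in the model, and we optimize based on what they show.

2. How are throughput and latency related in neural network inference?

A: Throughput describes how much computation a hardware device completes per unit of time; latency describes how long one model inference takes. Latency splits into latency from computation and latency from data transfer (including synchronization). nsys and Nsight Compute can quantify the latency of each stage.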
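As a rough illustration of the relationship, assume inferences run back to back on a single stream with no overlap between requests. Throughput is then bounded by batch size divided by latency. The helper name below is hypothetical, and the sample latency is the mean GPU compute time of 3.25109 ms from the trtexec summary earlier in this collection.

```python
def throughput_qps(latency_ms, batch_size=1):
    """Upper bound on queries per second when inferences run back to back."""
    return batch_size * 1000.0 / latency_ms


# Mean GPU compute time of the optimized engine reported by trtexec earlier.
print(f"{throughput_qps(3.25109):.1f} qps")
```

The result lands close to the 305.313 qps that trtexec reported for the same run. Overlapping transfers with compute or using several streams can push throughput higher without lowering the latency of a single request.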

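3. What quantization (calibration) methods does TensorRT offer?

A: The default and recommended calibration algorithm in TRT is entropy, but it depends on the case: minmax or percentile sometimes works better. The choice should take the characteristics of the op into account.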
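To show why the choice matters, here is a minimal sketch comparing minmax and percentile range selection on made-up activation values with one outlier. It only illustrates the idea: TensorRT's calibrators work on histograms, entropy calibration is omitted because it needs a KL-divergence search, and the function names and the nearest-rank percentile are my own assumptions.

```python
import math


def minmax_amax(values):
    """Minmax calibration: the largest absolute value seen."""
    return max(abs(v) for v in values)


def percentile_amax(values, pct):
    """Percentile calibration (nearest rank): the absolute value below
    which pct percent of the samples fall."""
    ordered = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def int8_scale(amax):
    """Symmetric INT8 scale for a given absolute maximum."""
    return amax / 127.0


# 99 well-behaved activations plus a single outlier (made-up data).
acts = [i / 100.0 for i in range(1, 100)] + [50.0]

print("minmax scale:    ", int8_scale(minmax_amax(acts)))
print("percentile scale:", int8_scale(percentile_amax(acts, 99)))
```

With minmax, the outlier stretches the range so far that almost all values collapse into a handful of integer levels. The 99th-percentile range clips the outlier and keeps resolution for the rest.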

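4. Exporting a model to an FP32 TRT engine shows no obvious accuracy loss, but exporting to FP16 loses a lot. What are the likely causes?

A: The most prominent possibilities are that some sensitive layers were quantized, which costs a lot of accuracy, or that the weight distribution is not concentrated, so the chosen dynamic range maps many nonzero weights to 0. Accuracy can also drop when the method for computing the scale (minmax, entropy, percentile) is not chosen per op.

5. The ONNX model gives correct inference results, but the results after TensorRT quantization are wrong. What are the likely causes?

A: First confirm that the pre-processing (for example, every input pre-processing step must match the one used when training the PyTorch model) and post-processing are identical for both models. Once quantization is confirmed as the cause, possible reasons are:

a. the wrong calibrator algorithm was chosen;
b. too little data was used for calibration;
c. sensitive layers of the network were quantized;
d. the scale computation chosen for some operators does not suit those ops.

6. With TensorRT PTQ, models calibrated with different batch sizes end up with different accuracy. Is that normal?

A: Yes, it is normal, because calibration is computed per tensor. On each step, if the histogram's maximum needs to be updated, PTQ doubles the histogram range. Reference: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#enable_int8_c

Memory limits aside, a larger batch_size is recommended: each batch then contains a richer set of samples and accuracy after calibration is better. The right size has to be found by experiment (start from a large batch size and reduce it step by step). Note that the larger the batch_size, the longer calibration takes.

7. A question about aligned memory access: with the L1 cache, access granularity is 128 B, so the aligned first address should be an even multiple of 128 B. Shouldn't that be 0 B, 256 B, 512 B, ...?

A: "Even multiple" here describes the address itself: it must divide evenly by the granularity, i.e. be a whole multiple of 128 B. It does not mean an even-numbered multiple of 128 B. For a more official explanation see: https://www.nvidia.com/content/PDF/sc_2010/CUDA_Tutorial/SC10_Fundamental_Optimizations.pdf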
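Under that reading, checking alignment is a single modulo operation. The helper below is a hypothetical illustration, not CUDA API:

```python
def is_aligned(address, granularity=128):
    """True when address is a whole multiple of the access granularity."""
    return address % granularity == 0


# 128 and 384 are odd multiples of 128 B but still aligned; 64 is not.
for addr in (0, 128, 256, 384, 64):
    print(addr, is_aligned(addr))
```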

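8. The same model converts successfully on a 3090 GPU but fails on an RTX4000. How can this be fixed? (See the figure below for the exact error message.)

A: The message points to an SM-related error, so check whether the nvcc compiler options in the makefile or CMakeLists.txt are set correctly.

9. How do I analyze inference performance with Nsight or the CUDA runtime API?

A: Nsight shows kernel names (from which you can infer whether a kernel runs on CUDA cores or Tensor Cores, and in FP16 or INT8) and lets you inspect memory traffic.

10. How can data exchange between GPU and CPU, and memory allocation and release, be minimized?

A: Copies between CPU and GPU are slow, and frequent allocation and release of memory during inference hurts performance badly. Allocate the maximum memory needed before running inference and reuse it, which cuts the number of allocations and releases. For CPU-GPU copies, restructure the code to reduce the number of copies, or find a way to hide the time they take.

11. If QAT keeps quantization error as small as possible, can we skip sensitive-layer analysis and quantize the whole network to INT8?

A: Not recommended. In practice, quantizing sensitive layers to INT8 costs a lot of accuracy, so sensitive-layer analysis is still needed.

12. After quantizing to INT8, inference is slower than FP16. Is that normal?

A: It can happen. The cause may be TensorRT's kernel auto-tuning: it runs all the candidate strategies, finds that quantization drags in a pile of extra operations that make things slower, and falls back to CUDA cores instead of Tensor Cores. When the network parameters and model architecture are poorly designed, TRT adds extra processing, which makes INT8 inference slower than FP16. This is visible when you visualize the engine with the trt-engine-explorer tool.

13. A question: at inference time, batchsize=4 takes nearly 4 times as long as batchsize=1. Is that reasonable? Is there a way to bring multi-batch inference time close to single-batch, for example with more GPU memory?

A: There are many possible reasons. A single-batch inference may already saturate the GPU, so batchsize=4 looks more parallel but may actually run serially. Profile the inference in Nsight Systems and check hardware utilization.

14. What about when the device is fixed? Are there parameters to set, or a way to add streams? I tried setting workspace to the maximum and got only a slight improvement.

A: Workspace size has little to do with performance. Workspace is used while the engine is built, when TensorRT selects tactics: a larger workspace makes more tactics available. Unless it is extremely small, the effect is usually minor. Try quantization options such as fp16 and int8, try CUDA Graphs to hide kernel launch overhead, raise builderOptimizationLevel, and so on. Parameter tuning alone only goes so far; also check whether the model architecture has redundancy.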

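Part 6: Accelerating PyTorch Eager Mode Quantization with TensorRT

This article explains how to use TensorRT to accelerate a quantized model produced by the PyTorch eager mode quantization interface, with detailed steps for running eager mode quantization in PyTorch, exporting the ONNX model, fixing the ONNX model graph, and building and verifying the TensorRT engine.

from https://leimao.github.io/blog/PyTorch-Eager-Mode-Quantization-TensorRT-Acceleration/

As of PyTorch 2.3.0, PyTorch provides three quantization interfaces: eager mode quantization, FX graph mode quantization, and PyTorch 2 Export quantization.

Because the latest PyTorch 2 Export quantization interface prevents the quantized PyTorch model from being exported to ONNX, TensorRT cannot be used to accelerate inference unless you develop a custom PyTorch FX graph quantization backend (such as fx2trt[1]).

Both the eager mode and FX graph mode quantization interfaces, by contrast, support exporting the quantized PyTorch model to ONNX, which TensorRT can then optimize and accelerate. Although the FX graph mode interface is more flexible and powerful, in some use cases the eager mode interface is still unavoidable.

In this article I show how to use TensorRT to accelerate a quantized model produced by the PyTorch eager mode quantization interface. The same approach also applies to quantized models produced by the FX graph mode interface, because both kinds of quantized model can be exported to ONNX.

PyTorch Eager Mode Quantization TensorRT Acceleration

Accelerating a quantized model from the PyTorch eager mode quantization interface with TensorRT involves the following three steps:

1. Run eager mode quantization on the floating-point PyTorch model in PyTorch and export the quantized PyTorch model to ONNX.
2. Fix the quantized ONNX model graph so that the TensorRT parser can parse it.
3. Build the quantized ONNX model into a TensorRT engine, profile its performance, and verify its accuracy.

The source code for this article is on GitHub[2].

TensorRT INT8 Quantization Requirements

TensorRT INT8 explicit quantization[3] requires per-channel symmetric quantization for weights and per-tensor symmetric quantization for activations. So when running post-training static quantization calibration or quantization-aware training in PyTorch, the quantization configuration must satisfy these TensorRT INT8 requirements.
The eager mode quantization used here differs slightly from the PyTorch static quantization tutorial[4] I published earlier: it uses per-channel symmetric quantization for weights and per-tensor symmetric quantization for activations, and it sets the PyTorch quantization backend to qnnpack instead of fbgemm. The reason is that fbgemm does not support INT8 symmetric quantized inference well, which would prevent exporting the model to ONNX via tracing.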

torch.backends.quantized.engine = 'qnnpack'  

per_tensor_activation_observer_range_neg_128_to_127 = torch.ao.quantization.MinMaxObserver.with_args(  
    dtype=torch.qint8,  
    qscheme=torch.per_tensor_symmetric,  
    quant_min=-128,  
    quant_max=127,  
    eps=2**-12)  
per_channel_weight_observer_range_neg_128_to_127 = torch.ao.quantization.PerChannelMinMaxObserver.with_args(  
    dtype=torch.qint8,  
    qscheme=torch.per_channel_symmetric,  
    quant_min=-128,  
    quant_max=127,  
    eps=2**-12)  
quantization_config = torch.ao.quantization.QConfig(  
    activation=per_tensor_activation_observer_range_neg_128_to_127,  
    weight=per_channel_weight_observer_range_neg_128_to_127)  

quantized_model.qconfig = quantization_config
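What the two observer settings mean numerically can be sketched in plain Python. Symmetric quantization fixes the zero point at 0 and needs only a scale. Per-tensor uses one scale for the whole tensor, while per-channel uses one scale per output channel. The sketch assumes the common amax / 127 convention; the exact scale formula of a given observer may differ slightly, and the function names and data are hypothetical.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest absolute value onto qmax (zero point 0)."""
    amax = max(abs(v) for v in values)
    return amax / qmax if amax > 0 else 1.0


def quantize(values, scale, qmin=-128, qmax=127):
    """Round to the nearest integer level and clamp to the INT8 range."""
    return [min(qmax, max(qmin, round(v / scale))) for v in values]


# Per-tensor: one scale shared by the whole activation tensor.
activation = [-1.0, 0.3, 0.5, 2.0]
act_scale = symmetric_scale(activation)
act_q = quantize(activation, act_scale)

# Per-channel: one scale per output channel (row) of the weight.
weight = [[0.1, -0.2], [4.0, -1.0]]
w_scales = [symmetric_scale(channel) for channel in weight]
w_q = [quantize(channel, s) for channel, s in zip(weight, w_scales)]

print(act_scale, act_q)
print(w_scales, w_q)
```

Per-channel scales keep the small-valued first channel from being crushed by the large values in the second, which is the reason for using them on weights.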

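Fixing the Quantized ONNX Model Graph

The quantized ONNX model exported from PyTorch eager mode quantization contains some errors and problems that keep the TensorRT parser from parsing it correctly. We therefore need to fix the quantized ONNX model before building the TensorRT engine.

Specifically, a Cast node is inserted between the QuantizeLinear and DequantizeLinear nodes of the quantized ONNX graph, and its data type is uint8 rather than int8, even though we explicitly set the activation quantization config to torch.qint8. These incorrect Cast nodes need to be removed.

In addition, the floating-point bias of the Conv nodes cannot be parsed by the TensorRT parser. It has to be computed and added to the quantized ONNX graph as a constant tensor.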
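The Cast removal amounts to a small graph rewrite: drop each Cast that is fed by a QuantizeLinear and reconnect its consumers to the QuantizeLinear output. The sketch below shows the idea on a toy graph of plain dictionaries. It is a hypothetical stand-in, not the onnx API and not the code from the referenced repository, and it leaves out the scale and zero-point inputs that real Q/DQ nodes carry.

```python
def remove_casts_after_quantize(nodes):
    """Drop Cast nodes whose input is produced by a QuantizeLinear node.

    Each node is a dict with "name", "op", "inputs" and "output" keys.
    Consumers of a removed Cast are rewired to the Cast's input tensor.
    """
    producer_of = {n["output"]: n for n in nodes}
    bypass = {}  # Cast output tensor -> tensor that should replace it
    for n in nodes:
        if n["op"] != "Cast":
            continue
        producer = producer_of.get(n["inputs"][0])
        if producer is not None and producer["op"] == "QuantizeLinear":
            bypass[n["output"]] = n["inputs"][0]

    kept = []
    for n in nodes:
        if n["op"] == "Cast" and n["output"] in bypass:
            continue  # the Cast itself is removed
        kept.append({**n, "inputs": [bypass.get(i, i) for i in n["inputs"]]})
    return kept


graph = [
    {"name": "q0", "op": "QuantizeLinear", "inputs": ["x"], "output": "x_q"},
    {"name": "cast0", "op": "Cast", "inputs": ["x_q"], "output": "x_q_u8"},
    {"name": "dq0", "op": "DequantizeLinear", "inputs": ["x_q_u8"], "output": "x_dq"},
]
fixed = remove_casts_after_quantize(graph)
print([n["op"] for n in fixed], fixed[-1]["inputs"])
```

On a real model the same rewrite would be applied to the ONNX protobuf graph, where the Q/DQ zero-point data types may need to be made consistent as well.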

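Before the fix, the quantized ONNX model graph looks like this:

After the fix, the quantized ONNX model graph looks like this:

In some places the quantized ONNX graph can still be optimized further for best TensorRT performance, for example by fusing the Add of a skip connection with the preceding convolution, which means removing the QuantizeLinear and DequantizeLinear nodes between Conv and Add.

This is tricky to implement with PyTorch eager mode quantization, so it is not covered here. For the best quantized TensorRT engine performance, see the TensorRT Q/DQ placement recommendations[5] and consider using the NVIDIA PyTorch quantization toolkit[6].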

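Building and Verifying the Quantized TensorRT Engine

On the CIFAR10 test set, the floating-point PyTorch ResNet18 model and the INT8 quantized PyTorch ResNet18 model reach accuracies of 0.854 and 0.852 respectively. The quantized TensorRT engine reaches 0.851 on the same test set, in line with the quantized PyTorch model.

With batch size 1 and a 32 x 32 input image, the FP16 and INT8 ResNet18 TensorRT engines have inference latencies of 0.208177 ms and 0.17584 ms respectively. Even though the small batch size and small image size lead to low math utilization, the INT8 quantized ResNet18 engine is still about 1.2 times as fast as the FP16 engine. With a larger batch size and image size, the latency improvement would be more pronounced.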
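The speedup follows directly from the two latencies quoted above. A quick check, with variable names of my own:

```python
# Latencies quoted above for batch size 1 and a 32 x 32 input, in ms.
fp16_ms = 0.208177
int8_ms = 0.17584

speedup = fp16_ms / int8_ms
print(f"INT8 speedup over FP16: {speedup:.2f}x")
```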

References

• PyTorch Eager Mode Quantization TensorRT Acceleration - GitHub[3]

• PyTorch Quantization[4]

• PyTorch Static Quantization[5]

Reference links

[1] fx2trt: https://pytorch.org/TensorRT/_modules/torch_tensorrt/fx/fx2trt.html

[2] GitHub: https://github.com/leimao/PyTorch-Eager-Mode-Quantization-TensorRT-Acceleration

[3] PyTorch Eager Mode Quantization TensorRT Acceleration - GitHub: https://github.com/leimao/PyTorch-Eager-Mode-Quantization-TensorRT-Acceleration

[4] PyTorch Quantization: https://pytorch.org/docs/stable/quantization.html

[5] PyTorch Static Quantization: https://leimao.github.io/blog/PyTorch-Static-Quantization/

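Part 7: Segment Anything (SAM) in C++ with TensorRT

The project, named SPEED-SAM-C++-TENSORRT, is a high-performance implementation of the Segment Anything Model (SAM) that uses NVIDIA TensorRT for efficient inference and CUDA to optimize GPU utilization. The sections below describe how the implementation works, how to compile it, and its key components.

https://github.com/hamdiboukamcha/SPEED-SAM-C-TENSORRT

Project structure and overview

The project's directory structure is as follows:

The project is designed for fast image segmentation: it segments an image based on selected points or a bounding box. It uses TensorRT to build an optimized inference engine from ONNX models, enabling efficient deep learning inference on NVIDIA GPUs.

Key components

Building the TensorRT engine:

The EngineTRT::build() method parses the ONNX model and builds the TensorRT engine. If dynamic shape support is needed (for varying input sizes), it configures an optimization profile.

The EngineTRT::saveEngine() function saves the serialized engine to a file so it can be loaded quickly in later sessions.

Image segmentation process:

The SpeedSam class coordinates the segmentation process. First, the image is encoded with the pre-trained encoder model. Then the decoder model generates a mask from the input points or bounding box.

Two main segmentation methods are provided:

Point-based segmentation (segmentWithPoint): the user clicks on the image to specify the point to segment.

Bounding-box segmentation (segmentBbox): the user draws a bounding box to define the region to segment.

Memory optimization:

The implementation uses CUDA streams to overlap data transfer with computation, which reduces latency.

FP16 precision is used wherever possible to speed up inference without a significant loss of accuracy.

Compilation and setup

Prerequisites:

CUDA: NVIDIA's parallel computing platform.

TensorRT: a library for high-performance deep learning inference.

OpenCV: for image processing.

C++17: required for compilation.

Build steps:

Clone the repository and change into the project directory: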

git clone https://github.com/hamdiboukamcha/SPEED-SAM-C-TENSORRT.git
cd SPEED-SAM-C-TENSORRT

Create a build directory and compile with CMake:

mkdir build && cd build
cmake ..
make -j$(nproc)

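Note: if necessary, update CMakeLists.txt with the correct paths for TensorRT and OpenCV.

Running the application

Point-based segmentation:

The program displays the image and waits for the user to select a point. Once a point is selected, the model segments the image at that location and overlays the mask.

Bounding-box segmentation:

The user draws a bounding box around the region of interest. The model then segments the image within that region and displays the result.

Output:

The segmented image is saved to the specified output path, and performance metrics such as inference time are logged.

Performance and optimization

Inference speed: the image encoder and mask decoder achieve fast inference, with the whole pipeline finishing in about 12 ms.
Memory management: efficient use of GPU memory and CUDA streams minimizes latency and maximizes throughput.
Precision settings: the FP16 option lets you trade accuracy for speed.

Conclusion

The SPEED-SAM-C++-TENSORRT project demonstrates an optimized implementation of the Segment Anything Model using TensorRT and CUDA. By building on NVIDIA's libraries, the project reaches real-time segmentation performance, which makes it suitable for applications that need fast and accurate image analysis. Its flexible code structure and thorough logging also make it a useful resource for developers doing deep learning inference on GPUs.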

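Part 8: Accelerating YOLO Model Inference with TensorRT

TensorRT is a machine learning framework released by NVIDIA for running inference on its hardware. It is highly optimized for NVIDIA GPUs and is probably the fastest way to run a model on them today.

Installation

Download the TensorRT package that matches your CUDA version. In my case the CUDA version is 12.1, so I use the corresponding package.

You can download TensorRT from https://developer.nvidia.com/nvidia-tensorrt-8x-download.

After downloading, extract the archive. Then copy all the .dll files from the lib folder of the extracted TensorRT folder.

Paste them into the bin folder of the CUDA directory.

Next, go to the python folder inside the extracted TensorRT folder and find the file that corresponds to your Python version.

In my case I am using Python 3.10.

You can check your Python version by running python --version in your Python environment.

Then pip install that file.

Then find the graphsurgeon folder in the extracted TensorRT folder and pip install the file inside it in the same way.

Next, pip install the file in the uff folder in the same way.

Then repeat the pip install once more, this time for the file in the onnx_graphsurgeon folder.

Once that is done, we can test inference speed. For the test we use a pretrained YOLOv8 model, which can be downloaded from https://github.com/ultralytics/ultralytics?tab=readme-ov-file.

Inference

First run the pip install shown below, then convert the downloaded .pt model file to TensorRT format.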

#!pip install ultralytics


from ultralytics import YOLO


model= YOLO('yolov8n.pt')
model.export(format='engine',device=0)

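This exports the PyTorch model from .pt format to .engine format. We will compare the .pt model and the .engine model on the same data and measure their inference speed.

Let's run inference on the image below.

PyTorch .pt model performance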

from ultralytics import YOLO


model= YOLO('yolov8n.pt',task='detect')
result= model.predict('japan-2014616_1280.jpg', save=True)

Inference time of the PyTorch .pt model: 99.5 ms

TensorRT model

from ultralytics import YOLO


model= YOLO('yolov8n.engine',task='detect')
result= model.predict('japan-2014616_1280.jpg', save=True)

Inference time of the TensorRT model: 2.0 ms. Inference time drops sharply, to roughly one fiftieth of the PyTorch figure.
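The two numbers above each come from a single predict call. To compare the models more evenly, one option is to run a few warm-up calls and then average several timed runs. The helper below is a hypothetical sketch using only the standard library; the stand-in workload would be replaced by a call such as model.predict on the test image.

```python
import time


def mean_latency_ms(fn, warmup=3, runs=10):
    """Average wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # untimed calls, so one-time setup cost is excluded
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) * 1000.0 / runs


# Stand-in workload; replace with e.g. lambda: model.predict(image_path).
latency = mean_latency_ms(lambda: sum(range(10_000)))
print(f"{latency:.3f} ms per call")
```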

From： https://blog.csdn.net/weixin_49587977/article/details/144183590