A collection of TensorRT study notes, gathered here for easy reference later. (Most of the content is excerpted from online resources.)
1. Introduction to TensorRT
Installation: https://docs.nvidia.com/deeplearning/sdk/tensorrt-install-guide/index.html
TensorRT Python documentation: https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/index.html
TensorRT C++ documentation: https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/index.html
The TensorRT workflow is roughly as follows:
- Build a tensorrt.INetworkDefinition. You can either parse a model such as ONNX with a parser, or assemble the network by hand with the TensorRT Network API (tensorrt.INetworkDefinition). tensorrt.Builder can create an empty tensorrt.INetworkDefinition.
- Use tensorrt.Builder to build a tensorrt.ICudaEngine from the tensorrt.INetworkDefinition created in the previous step. While building the engine, the builder accepts optimization parameters such as batch_size and workspace.
- Use the tensorrt.ICudaEngine to create a tensorrt.IExecutionContext, then run inference with the IExecutionContext.
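1.1 TensorRT API
This section briefly covers several important classes in the TensorRT C++ API: ILogger, IBuilder, INetworkDefinition, ICudaEngine, IRuntime, IParser
ILogger class
The logging class of TensorRT. It mainly consists of an enum that defines the log levels and a pure virtual log function that prints the messages:
- Enum type: the larger the value, the less severe the log level
enum class Severity : int32_t { kINTERNAL_ERROR = 0 , kERROR = 1 , kWARNING = 2 , kINFO = 3 , kVERBOSE = 4 }
- log virtual function: derived classes must implement it
virtual void log (Severity severity, AsciiChar const *msg) noexcept=0
In practice you usually derive from ILogger and implement log, as follows: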
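// TensorRT includes
#include <NvInfer.h>
#include <NvInferRuntime.h>
#include <stdio.h>
class TRTLogger : public nvinfer1::ILogger{
public:
virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override{
// kVERBOSE is the largest value, so this condition lets every message through
if(severity <= Severity::kVERBOSE){
// Severity is a scoped enum, so cast it before passing it to printf
printf("%d: %s\n", static_cast<int>(severity), msg);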
}
}
};
TRTLogger logger;
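The severity ordering described above can be tried out without the TensorRT headers. The following is a minimal sketch, assuming a local Severity enum that mirrors the values listed earlier; severity_name and should_log are hypothetical helpers, not part of the TensorRT API.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in that mirrors the values of nvinfer1::ILogger::Severity,
// so the sketch compiles without the TensorRT headers (an assumption).
enum class Severity : int32_t { kINTERNAL_ERROR = 0, kERROR = 1, kWARNING = 2, kINFO = 3, kVERBOSE = 4 };

// Hypothetical helper: human-readable name for a severity level.
inline const char* severity_name(Severity s) {
    switch (s) {
        case Severity::kINTERNAL_ERROR: return "INTERNAL_ERROR";
        case Severity::kERROR:          return "ERROR";
        case Severity::kWARNING:        return "WARNING";
        case Severity::kINFO:           return "INFO";
        case Severity::kVERBOSE:        return "VERBOSE";
    }
    return "UNKNOWN";
}

// Hypothetical helper: true when a message of severity `s` should be printed
// for the given threshold. Smaller values are more severe, so "at least as
// severe as the threshold" means "numerically less than or equal to it".
inline bool should_log(Severity s, Severity threshold) {
    return static_cast<int32_t>(s) <= static_cast<int32_t>(threshold);
}
```

With a threshold of kINFO, for instance, errors and warnings pass the filter while kVERBOSE messages are dropped; a threshold of kVERBOSE lets everything through.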
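IBuilder class
Converts a network into an engine according to the configured optimization parameters.
Creating an IBuilder object
This class cannot be subclassed. An instance is normally created with createInferBuilder(), which is defined as follows: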
inline IBuilder* createInferBuilder(ILogger& logger) noexcept
{
return static_cast<IBuilder*>(createInferBuilder_INTERNAL(&logger, NV_TENSORRT_VERSION));
}
Usage example:
#include <NvInfer.h>
#include <NvInferRuntime.h>
class TRTLogger : public nvinfer1::ILogger{
public:
virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override{
if(severity <= Severity::kVERBOSE){
printf("%d: %s\n", static_cast<int>(severity), msg);
}
}
};
TRTLogger logger;
nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
Using member methods
Several IBuilder member methods are commonly used:
nvinfer1::IBuilder::createBuilderConfig()
nvinfer1::IBuilder::buildEngineWithConfig()
nvinfer1::IBuilder::createNetworkV2()
nvinfer1::IBuilder::createOptimizationProfile()
createBuilderConfig()
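The function is defined as follows:
nvinfer1::IBuilderConfig * nvinfer1::IBuilder::createBuilderConfig()
The function returns an IBuilderConfig instance, which holds the parameters the builder uses while creating an engine. The most important setting is setMaxWorkspaceSize() (deprecated since TensorRT 8.4 in favor of setMemoryPoolLimit() with MemoryPoolType::kWORKSPACE). It is used as follows:
nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
printf("Workspace Size = %.2f MB\n", (1 << 28) / 1024.0f / 1024.0f); // 256 MiB
config->setMaxWorkspaceSize(1 << 28);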
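The workspace size is given in bytes, and 1 << 28 is simply 256 MiB. A small sketch of that arithmetic follows; mib_to_bytes and bytes_to_mib are hypothetical helper names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: converts a size in MiB to bytes (1 MiB = 1024 * 1024 bytes).
// Working in std::size_t avoids the signed-int overflow of an expression such as 1 << 31.
constexpr std::size_t mib_to_bytes(std::size_t mib) {
    return mib << 20;
}

// Hypothetical helper: converts a byte count back to MiB for printing.
constexpr double bytes_to_mib(std::size_t bytes) {
    return static_cast<double>(bytes) / 1024.0 / 1024.0;
}
```

Under these assumptions, config->setMaxWorkspaceSize(mib_to_bytes(256)) would request the same 256 MiB as the literal 1 << 28.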
buildEngineWithConfig()
The function is defined as follows (since TensorRT 8.0 it is superseded by IBuilder::buildSerializedNetwork()):
TRT_DEPRECATED nvinfer1::ICudaEngine * nvinfer1::IBuilder::buildEngineWithConfig( INetworkDefinition& network, IBuilderConfig& config)
Given the network structure (INetworkDefinition) and the configuration (IBuilderConfig), the function creates an engine. The same network can therefore yield different engines under different configurations. Usage:
nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);
config->setMaxWorkspaceSize(1 << 28);
builder->setMaxBatchSize(1); // batchSize = 1 at inference (has no effect on explicit-batch networks)
nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
createNetworkV2()
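The function is defined as follows:
nvinfer1::INetworkDefinition * nvinfer1::IBuilder::createNetworkV2( NetworkDefinitionCreationFlags flags)
The function creates an empty INetworkDefinition instance. The flags parameter is a bitmask whose bit positions are given by the NetworkDefinitionCreationFlag enum, defined as follows: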
enum class NetworkDefinitionCreationFlag : int32_t
{
//! Dynamic shape support requires that the kEXPLICIT_BATCH flag is set.
//! With dynamic shapes, any of the input dimensions can vary at run-time,
//! and there are no implicit dimensions in the network specification. This is specified by using the
//! wildcard dimension value -1.
kEXPLICIT_BATCH = 0, //!< Mark the network to be an explicit batch network
//! Setting the network to be an explicit precision network has the following implications:
//! 1) Precision of all input tensors to the network have to be specified with ITensor::setType() function
//! 2) Precision of all layer output tensors in the network have to be specified using ILayer::setOutputType()
//! function
//! 3) The builder will not quantize the weights of any layer including those running in lower precision(INT8). It
//! will
//! simply cast the weights into the required precision.
//! 4) Dynamic ranges must not be provided to run the network in int8 mode. Dynamic ranges of each tensor in the
//! explicit
//! precision network is [-127,127].
//! 5) Quantizing and dequantizing activation values between higher (FP32) and lower (INT8) precision
//! will be performed using explicit Scale layers with input/output precision set appropriately.
kEXPLICIT_PRECISION TRT_DEPRECATED_ENUM = 1, //! <-- Deprecated, used for backward compatibility
};
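The enum value is a bit position rather than the mask itself, so the flag is passed as 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH), which equals 1. A network created with kEXPLICIT_BATCH supports both fixed-size and dynamic-size inputs. In summary:
- nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1): explicit batch, supports both static and dynamic inputs
- nvinfer1::INetworkDefinition* network = builder->createNetworkV2(0): implicit batch, static inputs only (dynamic shapes require kEXPLICIT_BATCH)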
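Because the enum stores bit positions, the integer passed to createNetworkV2() is obtained by shifting. The sketch below shows that conversion with a local enum that mirrors the one quoted above; flag_mask and has_flag are hypothetical helpers, not TensorRT functions.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in mirroring nvinfer1::NetworkDefinitionCreationFlag, so the
// sketch compiles without the TensorRT headers (an assumption).
enum class NetworkDefinitionCreationFlag : int32_t { kEXPLICIT_BATCH = 0, kEXPLICIT_PRECISION = 1 };

// Hypothetical helper: turns an enum value (a bit position) into its bitmask.
constexpr uint32_t flag_mask(NetworkDefinitionCreationFlag f) {
    return 1U << static_cast<uint32_t>(f);
}

// Hypothetical helper: tests whether a flags bitmask has the given flag set.
constexpr bool has_flag(uint32_t flags, NetworkDefinitionCreationFlag f) {
    return (flags & flag_mask(f)) != 0;
}
```

Under these assumptions flag_mask(kEXPLICIT_BATCH) is 1, which is why createNetworkV2(1) selects explicit batch while createNetworkV2(0) sets no flag at all.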
createOptimizationProfile()
The function is defined as follows:
nvinfer1::IOptimizationProfile * nvinfer1::IBuilder::createOptimizationProfile()
Creates an IOptimizationProfile instance. When the network has inputs with dynamic dimensions, the IOptimizationProfile is used to specify the minimum, maximum, and optimal input dimensions. Example (input, num_input, and maxBatchSize come from the surrounding network-building code):
auto profile = builder->createOptimizationProfile();
nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
// Smallest allowed input: 1 x 1 x 3 x 3
profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4(1, num_input, 3, 3));
// Input size the builder should optimize for
profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4(1, num_input, 3, 3));
// Largest allowed input: 10 x 1 x 5 x 5
profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4(maxBatchSize, num_input, 5, 5));
config->addOptimizationProfile(profile);
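A profile defines a per-dimension range, and a shape is acceptable at inference time only when every dimension lies inside it. The following sketch expresses that check; Shape4 is a hypothetical stand-in for nvinfer1::Dims4 and shape_in_profile is an illustrative helper, not a TensorRT call.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for nvinfer1::Dims4 (N, C, H, W).
using Shape4 = std::array<int, 4>;

// Hypothetical helper: true when every dimension of `shape` lies within
// the closed range [min_dims[i], max_dims[i]] of the profile.
inline bool shape_in_profile(const Shape4& shape, const Shape4& min_dims, const Shape4& max_dims) {
    for (std::size_t i = 0; i < shape.size(); ++i) {
        if (shape[i] < min_dims[i] || shape[i] > max_dims[i]) {
            return false;
        }
    }
    return true;
}
```

With kMIN = (1, 1, 3, 3) and kMAX = (10, 1, 5, 5) as in the example, a 2 x 1 x 3 x 3 input passes this check, while 11 x 1 x 3 x 3 or 1 x 1 x 6 x 6 does not.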
INetworkDefinition
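Defines a network: its inputs, its layers, and its outputs. This class cannot be subclassed; an instance is normally created as follows:
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);
The network structure can be defined in several ways: with the C++ API, or by loading an ONNX file directly.
- Defining a network structure with the C++ API: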
nvinfer1::Weights make_weights(float* ptr, int n){
nvinfer1::Weights w;
w.count = n;
w.type = nvinfer1::DataType::kFLOAT;
w.values = ptr;
return w;
}
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);
const int num_input = 3; // in_channel
const int num_output = 2; // out_channel
float layer1_weight_values[] = {1.0, 2.0, 0.5, 0.1, 0.2, 0.5}; // first 3 are the RGB weights of w1, last 3 those of w2
float layer1_bias_values[] = {0.3, 0.8};
// Add the input layer to the network by giving its name, data type, and full dimensions
nvinfer1::ITensor* input = network->addInput("image", nvinfer1::DataType::kFLOAT, nvinfer1::Dims4(1, num_input, 1, 1));
nvinfer1::Weights layer1_weight = make_weights(layer1_weight_values, 6);
nvinfer1::Weights layer1_bias = make_weights(layer1_bias_values, 2);
// Add the fully connected layer; note that input is dereferenced
auto layer1 = network->addFullyConnected(*input, num_output, layer1_weight, layer1_bias);
// Add the activation layer
auto prob = network->addActivation(*layer1->getOutput(0), nvinfer1::ActivationType::kSIGMOID); // a more explicit form is *(layer1->getOutput(0)), which dereferences the pointer returned by getOutput
// Mark prob as the network output
network->markOutput(*prob->getOutput(0));
- Parsing an ONNX file with the parser:
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);
// The ONNX parser fills the network with what it parses, much like adding layers by hand with the add* methods
nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, logger);
if(!parser->parseFromFile("demo.onnx", 1)){
printf("Failed to parse demo.onnx\n");
}
Using member functions
Some commonly used INetworkDefinition functions are worth knowing:
- int32_t nvinfer1::INetworkDefinition::getNbInputs() const: number of network inputs
- ITensor* nvinfer1::INetworkDefinition::getInput(int32_t index) const: pointer to the index-th network input
- int32_t nvinfer1::INetworkDefinition::getNbLayers() const: number of layers in the network
- ILayer* nvinfer1::INetworkDefinition::getLayer(int32_t index) const: pointer to the index-th layer
- int32_t nvinfer1::INetworkDefinition::getNbOutputs() const: number of network outputs
- ITensor* nvinfer1::INetworkDefinition::getOutput(int32_t index) const: pointer to the index-th network output
- bool nvinfer1::INetworkDefinition::hasImplicitBatchDimension() const: whether the network was created with an implicit batch dimension (that is, without kEXPLICIT_BATCH; this is not the same thing as dynamic shapes)
- void nvinfer1::INetworkDefinition::markOutput(ITensor& tensor): marks a tensor as a network output
ICudaEngine
Declared in the header NvInferRuntime.h.
An engine is generated from a network and used to run inference. This class cannot be subclassed either; an instance is normally created through builder->buildEngineWithConfig() or runtime->deserializeCudaEngine(), as follows:
// Create the engine through IBuilder
nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);
nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
// Deserialize the engine through IRuntime (an engine created by IBuilder is serialized with engine->serialize() and saved)
auto engine_data = load_file("engine.trtmodel");
nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(engine_data.data(), engine_data.size());
Commonly used member functions
Some commonly used ICudaEngine member functions are worth knowing:
- TRT_DEPRECATED int32_t nvinfer1::ICudaEngine::getNbBindings() const: total number of engine inputs and outputs; superseded by getNbIOTensors() since TensorRT 8.5. (Note: If the engine has been built for K profiles, the first getNbBindings() / K bindings are used by profile number 0, the following getNbBindings() / K bindings are used by profile number 1 etc.)
- char const* nvinfer1::ICudaEngine::getBindingName(int32_t bindingIndex) const: name of the bindingIndex-th binding (a binding is an engine input or output)
- int32_t nvinfer1::ICudaEngine::getBindingIndex(char const* name) const: bindingIndex of the binding called name
- Dims nvinfer1::ICudaEngine::getBindingDimensions(int32_t bindingIndex) const: dimensions of the bindingIndex-th binding
- DataType nvinfer1::ICudaEngine::getBindingDataType(int32_t bindingIndex) const: data type of the bindingIndex-th binding
- int32_t nvinfer1::ICudaEngine::getMaxBatchSize() const: maximum batch_size the engine allows
- IHostMemory * nvinfer1::ICudaEngine::serialize() const: serializes the engine so that it can be saved as a binary file. Usage:
nvinfer1::IHostMemory* model_data = engine->serialize();
FILE* f = fopen("engine.trtmodel", "wb");
fwrite(model_data->data(), 1, model_data->size(), f);
fclose(f);
- IExecutionContext* nvinfer1::ICudaEngine::createExecutionContext(): creates an execution context used to run inference. Usage:
nvinfer1::IExecutionContext* execution_context = engine->createExecutionContext();
cudaStream_t stream = nullptr;
cudaStreamCreate(&stream);
float input_data_host[] = {1, 2, 3};
float* input_data_device = nullptr;
float output_data_host[2];
float* output_data_device = nullptr;
cudaMalloc(&input_data_device, sizeof(input_data_host));
cudaMalloc(&output_data_device, sizeof(output_data_host));
cudaMemcpyAsync(input_data_device, input_data_host, sizeof(input_data_host), cudaMemcpyHostToDevice, stream);
// An array of pointers gives the GPU addresses of the input and the output
float* bindings[] = {input_data_device, output_data_device};
bool success = execution_context->enqueueV2((void**)bindings, stream, nullptr);
cudaMemcpyAsync(output_data_host, output_data_device, sizeof(output_data_host), cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);
IRuntime
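Mainly used to deserialize a serialized engine. This class cannot be subclassed either; an instance is normally created as follows:
nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger); // an ILogger object must be passed in
Deserialization is done with deserializeCudaEngine(), defined as follows:
ICudaEngine* nvinfer1::IRuntime::deserializeCudaEngine(void const* blob,std::size_t size)
- blob: pointer to the memory that holds the serialized engine
- size: size in bytes of the memory that holds the serialized engine
Usage: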
auto engine_data = load_file("engine.trtmodel");
nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(engine_data.data(), engine_data.size());
IParser
Declared in the header NvOnnxParser.h. It parses the network in an ONNX file into an INetworkDefinition object. It is created as follows:
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);
nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, logger);
The ONNX file is parsed mainly through the parseFromFile() member function, defined as follows:
virtual bool nvonnxparser::IParser::parseFromFile(const char* onnxModelFile,
int verbosity)
- onnxModelFile: path to the ONNX file
- verbosity: log level to print
Usage:
if(!parser->parseFromFile("demo.onnx", 1)){
printf("Failed to parse demo.onnx\n");
}
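1.2 TensorRT hello-world
The code below creates a network with a single fully connected layer, builds an engine from it, serializes the engine, and saves it to a file: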
// tensorRT include
#include <NvInfer.h>
#include <NvInferRuntime.h>
// cuda include
#include <cuda_runtime.h>
// system include
#include <stdio.h>
class TRTLogger : public nvinfer1::ILogger{
public:
virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override{
if(severity <= Severity::kVERBOSE){
printf("%d: %s\n", static_cast<int>(severity), msg);
}
}
};
nvinfer1::Weights make_weights(float* ptr, int n){
nvinfer1::Weights w;
w.count = n; // The number of weights in the array.
w.type = nvinfer1::DataType::kFLOAT;
w.values = ptr;
return w;
}
int main(){
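// This code implements about the simplest possible neural network, see figure/simple_fully_connected_net.png
TRTLogger logger; // a logger is required; it captures warnings, info messages, and so on
// ----------------------------- 1. Define the builder, config, and network -----------------------------
// These are the basic components that are always needed.
// Intuitively: a builder builds the network; the network has a structure, and that structure can be built under different configurations
nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
// Create a build configuration that tells TensorRT how to optimize the model; the generated model only runs under that specific configuration
nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
// Create the network definition. createNetworkV2(1) selects explicit batch; with newer TensorRT (>= 7.0), 0 (implicit batch) is discouraged.
// From here on, always use createNetworkV2(1) rather than createNetworkV2(0) or createNetwork
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);
// Build a model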
/*
Network definition:
image
|
linear (fully connected) input = 3, output = 2, bias = True w=[[1.0, 2.0, 0.5], [0.1, 0.2, 0.5]], b=[0.3, 0.8]
|
sigmoid
|
prob
*/
// ----------------------------- 2. Basic information about the input, model structure, and output -----------------------------
const int num_input = 3; // in_channel
const int num_output = 2; // out_channel
float layer1_weight_values[] = {1.0, 2.0, 0.5, 0.1, 0.2, 0.5}; // first 3 are the RGB weights of w1, last 3 those of w2
float layer1_bias_values[] = {0.3, 0.8};
// Add the input layer to the network by giving its name, data type, and full dimensions
nvinfer1::ITensor* input = network->addInput("image", nvinfer1::DataType::kFLOAT, nvinfer1::Dims4(1, num_input, 1, 1));
nvinfer1::Weights layer1_weight = make_weights(layer1_weight_values, 6);
nvinfer1::Weights layer1_bias = make_weights(layer1_bias_values, 2);
// Add the fully connected layer
auto layer1 = network->addFullyConnected(*input, num_output, layer1_weight, layer1_bias); // note that input is dereferenced
// Add the activation layer
auto prob = network->addActivation(*layer1->getOutput(0), nvinfer1::ActivationType::kSIGMOID); // a more explicit form is *(layer1->getOutput(0)), which dereferences the pointer returned by getOutput
// Mark prob as the network output
network->markOutput(*prob->getOutput(0));
printf("Workspace Size = %.2f MB\n", (1 << 28) / 1024.0f / 1024.0f); // 256 MiB
config->setMaxWorkspaceSize(1 << 28);
builder->setMaxBatchSize(1); // batchSize = 1 at inference (has no effect on explicit-batch networks)
// ----------------------------- 3. Build the engine -----------------------------
// buildCudaEngine is deprecated as of TensorRT 7.1.0; use buildEngineWithConfig instead
nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
if(engine == nullptr){
printf("Build engine failed.\n");
network->destroy();
config->destroy();
builder->destroy();
return -1;
}
// ----------------------------- 4. Serialize the model and store it -----------------------------
// Serialize the model and save it to a file
nvinfer1::IHostMemory* model_data = engine->serialize();
FILE* f = fopen("engine.trtmodel", "wb");
fwrite(model_data->data(), 1, model_data->size(), f);
fclose(f);
// Release objects in the reverse order of their creation
model_data->destroy();
engine->destroy();
network->destroy();
config->destroy();
builder->destroy();
printf("Done.\n");
return 0;
}
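Key takeaways:
- Always use createNetworkV2 with the flag set to 1 (explicit batch). createNetwork is deprecated, and implicit batch is reportedly discouraged by NVIDIA (to be confirmed). This choice determines whether inference uses enqueue or enqueueV2.
- Remember to release pointers such as builder and config, otherwise memory leaks; release them with ptr->destroy().
- markOutput marks an output node of the model: each call adds one output, and each addInput call adds one input. This matches the bindings used at inference time.
- workspaceSize is the size of the workspace. When a layer needs extra scratch memory it does not allocate it itself; for the sake of memory reuse it asks TensorRT for workspace memory instead.
- Always remember that a saved model only fits the TensorRT version and the device it was built with, and is only guaranteed to be optimal under that configuration. Running it on a different device sometimes works, but it is not optimal and not recommended.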
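One way to make the release step harder to forget is to hand each pointer to a smart pointer whose deleter calls destroy(). The following is a minimal sketch of that idea, assuming objects that expose a destroy() method as in the code above; DestroyDeleter, TrtUniquePtr, and MockBuilder are hypothetical names, and MockBuilder only stands in for a TensorRT object so the sketch compiles without the TensorRT headers.

```cpp
#include <cassert>
#include <memory>

// Deleter that releases an object by calling ptr->destroy(),
// the release style used in these notes.
struct DestroyDeleter {
    template <typename T>
    void operator()(T* ptr) const {
        if (ptr != nullptr) {
            ptr->destroy();
        }
    }
};

// Hypothetical alias: a unique_ptr that calls destroy() when it goes out of scope.
template <typename T>
using TrtUniquePtr = std::unique_ptr<T, DestroyDeleter>;

// Hypothetical mock standing in for a TensorRT object such as a builder.
// Its destroy() only counts calls so that the behavior can be observed.
struct MockBuilder {
    int* destroy_count;
    void destroy() { ++(*destroy_count); }
};
```

With real TensorRT objects the usage would look like TrtUniquePtr<nvinfer1::IBuilder> builder(nvinfer1::createInferBuilder(logger)); objects declared this way are released in reverse order of declaration, including on early returns. In TensorRT 8.0 and later, destroy() is deprecated in favor of plain delete, in which case an ordinary std::unique_ptr is enough.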
1.3 TensorRT inference
The code below loads a serialized engine file, creates the engine and an execution context, and runs inference:
// tensorRT include
#include <NvInfer.h>
#include <NvInferRuntime.h>
// cuda include
#include <cuda_runtime.h>
// system include
#include <stdio.h>
#include <math.h>
#include <iostream>
#include <fstream>
#include <vector>
using namespace std;
class TRTLogger : public nvinfer1::ILogger{
public:
virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override{
if(severity <= Severity::kINFO){
printf("%d: %s\n", static_cast<int>(severity), msg);
}
}
} logger;
nvinfer1::Weights make_weights(float* ptr, int n){
nvinfer1::Weights w;
w.count = n;
w.type = nvinfer1::DataType::kFLOAT;
w.values = ptr;
return w;
}
bool build_model(){
TRTLogger logger;
// These are the basic components that are always needed
nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);
// Build a model
/*
Network definition:
image
|
linear (fully connected) input = 3, output = 2, bias = True w=[[1.0, 2.0, 0.5], [0.1, 0.2, 0.5]], b=[0.3, 0.8]
|
sigmoid
|
prob
*/
const int num_input = 3;
const int num_output = 2;
float layer1_weight_values[] = {1.0, 2.0, 0.5, 0.1, 0.2, 0.5};
float layer1_bias_values[] = {0.3, 0.8};
nvinfer1::ITensor* input = network->addInput("image", nvinfer1::DataType::kFLOAT, nvinfer1::Dims4(1, num_input, 1, 1));
nvinfer1::Weights layer1_weight = make_weights(layer1_weight_values, 6);
nvinfer1::Weights layer1_bias = make_weights(layer1_bias_values, 2);
auto layer1 = network->addFullyConnected(*input, num_output, layer1_weight, layer1_bias);
auto prob = network->addActivation(*layer1->getOutput(0), nvinfer1::ActivationType::kSIGMOID);
// Mark prob as the network output
network->markOutput(*prob->getOutput(0));
printf("Workspace Size = %.2f MB\n", (1 << 28) / 1024.0f / 1024.0f);
config->setMaxWorkspaceSize(1 << 28);
builder->setMaxBatchSize(1);
nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
if(engine == nullptr){
printf("Build engine failed.\n");
return false;
}
// Serialize the model and save it to a file
nvinfer1::IHostMemory* model_data = engine->serialize();
FILE* f = fopen("engine.trtmodel", "wb");
fwrite(model_data->data(), 1, model_data->size(), f);
fclose(f);
// Release objects in the reverse order of their creation
model_data->destroy();
engine->destroy();
network->destroy();
config->destroy();
builder->destroy();
printf("Done.\n");
return true;
}
vector<unsigned char> load_file(const string& file){
ifstream in(file, ios::in | ios::binary);
if (!in.is_open())
return {};
in.seekg(0, ios::end);
size_t length = in.tellg();
std::vector<uint8_t> data;
if (length > 0){
in.seekg(0, ios::beg);
data.resize(length);
in.read((char*)&data[0], length);
//in.read((char*)data.data(), length);
}
in.close();
return data;
}
void inference(){
// ------------------------------ 1. Prepare and load the model ----------------------------
TRTLogger logger;
auto engine_data = load_file("engine.trtmodel");
// Before running inference, create an instance of the runtime interface. Like the builder, the runtime needs a logger:
nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
// Once the model has been read into engine_data, deserialize it to obtain the engine
nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(engine_data.data(), engine_data.size());
if(engine == nullptr){
printf("Deserialize cuda engine failed.\n");
runtime->destroy();
return;
}
nvinfer1::IExecutionContext* execution_context = engine->createExecutionContext();
cudaStream_t stream = nullptr;
// Create a CUDA stream so that inference for this batch runs independently
cudaStreamCreate(&stream);
/*
Network definition:
image
|
linear (fully connected) input = 3, output = 2, bias = True w=[[1.0, 2.0, 0.5], [0.1, 0.2, 0.5]], b=[0.3, 0.8]
|
sigmoid
|
prob
*/
// ------------------------------ 2. Prepare the input data and copy it to the GPU ----------------------------
float input_data_host[] = {1, 2, 3};
float* input_data_device = nullptr;
float output_data_host[2];
float* output_data_device = nullptr;
cudaMalloc(&input_data_device, sizeof(input_data_host));
cudaMalloc(&output_data_device, sizeof(output_data_host));
cudaMemcpyAsync(input_data_device, input_data_host, sizeof(input_data_host), cudaMemcpyHostToDevice, stream);
// An array of pointers gives the GPU addresses of the input and the output
float* bindings[] = {input_data_device, output_data_device};
// ------------------------------ 3. Run inference and copy the result back to the CPU ----------------------------
bool success = execution_context->enqueueV2((void**)bindings, stream, nullptr);
cudaMemcpyAsync(output_data_host, output_data_device, sizeof(output_data_host), cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);
printf("output_data_host = %f, %f\n", output_data_host[0], output_data_host[1]);
// ------------------------------ 4. Release memory ----------------------------
printf("Clean memory\n");
cudaStreamDestroy(stream);
cudaFree(input_data_device);
cudaFree(output_data_device);
execution_context->destroy();
engine->destroy();
runtime->destroy();
// ------------------------------ 5. Verify by computing the result by hand ----------------------------
const int num_input = 3;
const int num_output = 2;
float layer1_weight_values[] = {1.0, 2.0, 0.5, 0.1, 0.2, 0.5};
float layer1_bias_values[] = {0.3, 0.8};
printf("Manually computed result for verification:\n");
for(int io = 0; io < num_output; ++io){
float output_host = layer1_bias_values[io];
for(int ii = 0; ii < num_input; ++ii){
output_host += layer1_weight_values[io * num_input + ii] * input_data_host[ii];
}
// sigmoid
float prob = 1 / (1 + exp(-output_host));
printf("output_prob[%d] = %f\n", io, prob);
}
}
int main(){
if(!build_model()){
return -1;
}
inference();
return 0;
}
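Key takeaways:
- bindings are TensorRT's description of the input and output tensors: bindings = input tensors + output tensors. For example, with input a and outputs b, c, d, bindings = [a, b, c, d], so bindings[0] = a and bindings[2] = c, and engine->getBindingDimensions(0) returns the dimensions of the input.
- enqueueV2 is asynchronous: it places the work on the stream's queue to wait for execution. The bindings passed in are pointers to the tensors (note: device pointers). Their shapes correspond to the input and output shapes specified at build time (only the fully static case is shown here).
- createExecutionContext can be called more than once, so one engine may have several execution contexts; this is only noted in passing here and not explored further.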
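Each entry in bindings needs a device buffer large enough for its tensor, that is, the product of its dimensions times the element size. The following sketch shows that calculation; volume and binding_bytes are hypothetical helpers, and plain std::vector<int> stands in for nvinfer1::Dims so that the code compiles without TensorRT.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of elements in a tensor with the given dimensions.
// Dynamic (-1) dimensions must be resolved before calling this.
inline std::size_t volume(const std::vector<int>& dims) {
    std::size_t n = 1;
    for (int d : dims) {
        assert(d >= 0);
        n *= static_cast<std::size_t>(d);
    }
    return n;
}

// Hypothetical helper: bytes needed for the device buffer of one binding.
inline std::size_t binding_bytes(const std::vector<int>& dims, std::size_t element_size) {
    return volume(dims) * element_size;
}
```

For the 1 x 3 x 1 x 1 float input of the example this gives 3 * sizeof(float), the same value as sizeof(input_data_host) passed to cudaMalloc above; assuming the output binding has dimensions 1 x 2 x 1 x 1, it needs 2 * sizeof(float).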
1.4 Dynamic shape
Making the input shape dynamic takes two main steps:
- When building the engine, set the minimum, optimal, and maximum input dimensions through a profile:
// One profile must set the dimensions of every dynamic input; several profiles are only needed to cover several shape ranges
auto profile = builder->createOptimizationProfile();
// Smallest allowed input: 1 x 1 x 3 x 3
profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4(1, num_input, 3, 3));
profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4(1, num_input, 3, 3));
// Largest allowed input: 10 x 1 x 5 x 5
// if networkDims.d[i] != -1, then minDims.d[i] == optDims.d[i] == maxDims.d[i] == networkDims.d[i]
profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4(maxBatchSize, num_input, 5, 5));
config->addOptimizationProfile(profile);
- At inference time, set the shape of the current input through the context:
execution_context->setBindingDimensions(0, nvinfer1::Dims4(ib, 1, ih, iw));
float* bindings[] = {input_data_device, output_data_device};
bool success = execution_context->enqueueV2((void**)bindings, stream, nullptr);
The code below builds a network with a single convolution layer and makes the network input dynamic when building the engine, so that it accepts input sizes between (1, 1, 3, 3) and (10, 1, 5, 5):
// tensorRT include
#include <NvInfer.h>
#include <NvInferRuntime.h>
// cuda include
#include <cuda_runtime.h>
// system include
#include <stdio.h>
#include <math.h>
#include <iostream>
#include <fstream> // provides ifstream and the ios flags used below
#include <vector>
using namespace std;
class TRTLogger : public nvinfer1::ILogger{
public:
virtual void log(Severity severity, nvinfer1::AsciiChar const* msg) noexcept override{
if(severity <= Severity::kINFO){
printf("%d: %s\n", static_cast<int>(severity), msg);
}
}
} logger;
nvinfer1::Weights make_weights(float* ptr, int n){
nvinfer1::Weights w;
w.count = n;
w.type = nvinfer1::DataType::kFLOAT;
w.values = ptr;
return w;
}
bool build_model(){
TRTLogger logger;
// ----------------------------- 1. Define the builder, config, and network -----------------------------
nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
nvinfer1::INetworkDefinition* network = builder->createNetworkV2(1);
// Build a model
/*
Network definition:
image
|
conv(3x3, pad=1) input = 1, output = 1, bias = True w=[[1.0, 2.0, 3.1], [0.1, 0.1, 0.1], [0.2, 0.2, 0.2]], b=0.0
|
relu
|
prob
*/
// ----------------------------- 2. Basic information about the input, model structure, and output -----------------------------
const int num_input = 1;
const int num_output = 1;
float layer1_weight_values[] = {
1.0, 2.0, 3.1,
0.1, 0.1, 0.1,
0.2, 0.2, 0.2
}; // row-major
float layer1_bias_values[] = {0.0};
// To use dynamic shapes, the dynamic dimensions of the NetworkDefinition input must be declared as -1; in_channel stays fixed
nvinfer1::ITensor* input = network->addInput("image", nvinfer1::DataType::kFLOAT, nvinfer1::Dims4(-1, num_input, -1, -1));
nvinfer1::Weights layer1_weight = make_weights(layer1_weight_values, 9);
nvinfer1::Weights layer1_bias = make_weights(layer1_bias_values, 1);
auto layer1 = network->addConvolution(*input, num_output, nvinfer1::DimsHW(3, 3), layer1_weight, layer1_bias);
layer1->setPadding(nvinfer1::DimsHW(1, 1));
auto prob = network->addActivation(*layer1->getOutput(0), nvinfer1::ActivationType::kRELU); // *(layer1->getOutput(0))
// Mark prob as the network output
network->markOutput(*prob->getOutput(0));
int maxBatchSize = 10;
printf("Workspace Size = %.2f MB\n", (1 << 28) / 1024.0f / 1024.0f);
// Configure the scratch memory, used for temporary storage by layer implementations and for intermediate activations
config->setMaxWorkspaceSize(1 << 28);
// --------------------------------- 2.1 About the profile ----------------------------------
// One profile must set the dimensions of every dynamic input; several profiles are only needed to cover several shape ranges
auto profile = builder->createOptimizationProfile();
// Smallest allowed input: 1 x 1 x 3 x 3
profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMIN, nvinfer1::Dims4(1, num_input, 3, 3));
profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kOPT, nvinfer1::Dims4(1, num_input, 3, 3));
// Largest allowed input: 10 x 1 x 5 x 5
// if networkDims.d[i] != -1, then minDims.d[i] == optDims.d[i] == maxDims.d[i] == networkDims.d[i]
profile->setDimensions(input->getName(), nvinfer1::OptProfileSelector::kMAX, nvinfer1::Dims4(maxBatchSize, num_input, 5, 5));
config->addOptimizationProfile(profile);
nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
if(engine == nullptr){
printf("Build engine failed.\n");
return false;
}
// -------------------------- 3. Serialization ----------------------------------
// Serialize the model and save it to a file
nvinfer1::IHostMemory* model_data = engine->serialize();
FILE* f = fopen("engine.trtmodel", "wb");
fwrite(model_data->data(), 1, model_data->size(), f);
fclose(f);
// Release objects in the reverse order of their creation
model_data->destroy();
engine->destroy();
network->destroy();
config->destroy();
builder->destroy();
printf("Done.\n");
return true;
}
vector<unsigned char> load_file(const string& file){
ifstream in(file, ios::in | ios::binary);
if (!in.is_open())
return {};
in.seekg(0, ios::end);
size_t length = in.tellg();
std::vector<uint8_t> data;
if (length > 0){
in.seekg(0, ios::beg);
data.resize(length);
in.read((char*)&data[0], length);
}
in.close();
return data;
}
void inference(){
// ------------------------------- 1. Load the model and deserialize it -------------------------------
TRTLogger logger;
auto engine_data = load_file("engine.trtmodel");
nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
nvinfer1::ICudaEngine* engine = runtime->deserializeCudaEngine(engine_data.data(), engine_data.size());
if(engine == nullptr){
printf("Deserialize cuda engine failed.\n");
runtime->destroy();
return;
}
nvinfer1::IExecutionContext* execution_context = engine->createExecutionContext();
cudaStream_t stream = nullptr;
cudaStreamCreate(&stream);
/*
Network definition:
image
|
conv(3x3, pad=1) input = 1, output = 1, bias = True w=[[1.0, 2.0, 3.1], [0.1, 0.1, 0.1], [0.2, 0.2, 0.2]], b=0.0
|
relu
|
prob
*/
// ------------------------------- 2. Input and output -------------------------------
float input_data_host[] = {
// batch 0
1, 1, 1,
1, 1, 1,
1, 1, 1,
// batch 1
-1, 1, 1,
1, 0, 1,
1, 1, -1
};
float* input_data_device = nullptr;
// A 3x3 input gives a 3x3 output
const int ib = 2;
const int iw = 3;
const int ih = 3;
float output_data_host[ib * iw * ih]; // constant size: standard C++ has no variable-length arrays
float* output_data_device = nullptr;
cudaMalloc(&input_data_device, sizeof(input_data_host));
cudaMalloc(&output_data_device, sizeof(output_data_host));
cudaMemcpyAsync(input_data_device, input_data_host, sizeof(input_data_host), cudaMemcpyHostToDevice, stream);
// ------------------------------- 3. Inference -------------------------------
// Tell the context the actual input dimensions used for this inference
execution_context->setBindingDimensions(0, nvinfer1::Dims4(ib, 1, ih, iw));
float* bindings[] = {input_data_device, output_data_device};
bool success = execution_context->enqueueV2((void**)bindings, stream, nullptr);
cudaMemcpyAsync(output_data_host, output_data_device, sizeof(output_data_host), cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);
// ------------------------------- 4. Print the results -------------------------------
for(int b = 0; b < ib; ++b){
printf("batch %d. output_data_host = \n", b);
for(int i = 0; i < iw * ih; ++i){
printf("%f, ", output_data_host[b * iw * ih + i]);
if((i + 1) % iw == 0)
printf("\n");
}
}
printf("Clean memory\n");
cudaStreamDestroy(stream);
cudaFree(input_data_device);
cudaFree(output_data_device);
execution_context->destroy();
engine->destroy();
runtime->destroy();
}
int main(){
if(!build_model()){
return -1;
}
inference();
return 0;
}
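Key takeaways:
- Dynamic shape means the allowed range [L, H] is fixed at build time; inference then accepts any shape with L <= shape <= H.
- An OptimizationProfile is a configuration object that specifies the range over which the input shape may vary; do not be misled by the word "optimization".
- If a dimension of the ONNX input is -1, that dimension is dynamic; otherwise it is fixed, and minDims, optDims, and maxDims must all be identical for a fixed dimension.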
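To check the engine output of the convolution example by hand, a CPU reference can be written in a few lines. The sketch below assumes a single input and output channel, zero padding of 1, and convolution computed as cross-correlation (the usual deep-learning convention); conv3x3_relu is a hypothetical helper, not a TensorRT function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for a single-channel 3x3 convolution (cross-correlation)
// with zero padding 1, followed by ReLU. `in` is one h x w image in
// row-major order and `k` is the 3x3 kernel in row-major order.
inline std::vector<float> conv3x3_relu(const std::vector<float>& in, int h, int w,
                                       const float (&k)[9], float bias) {
    assert(in.size() == static_cast<std::size_t>(h) * static_cast<std::size_t>(w));
    std::vector<float> out(in.size(), 0.0f);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float acc = bias;
            for (int ky = 0; ky < 3; ++ky) {
                for (int kx = 0; kx < 3; ++kx) {
                    const int iy = y + ky - 1;
                    const int ix = x + kx - 1;
                    // Positions outside the image contribute zero (padding)
                    if (iy < 0 || iy >= h || ix < 0 || ix >= w) {
                        continue;
                    }
                    acc += k[ky * 3 + kx] * in[iy * w + ix];
                }
            }
            out[y * w + x] = acc > 0.0f ? acc : 0.0f; // ReLU
        }
    }
    return out;
}
```

For batch 0 of the example (a 3x3 image of ones) and the weights {1.0, 2.0, 3.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2}, the center output is the sum of all nine weights, 7.0, and the top-left corner only sees the lower-right 2x2 part of the kernel, 0.6. Each image of a batch is processed independently, so the helper can be called once per batch element.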