TVM: TVM/VTA Code Generation Flow

Tags: VTA, code generation, target, relay, ir, module, TVM, build, pass

Reference links

https://chhzh123.github.io/blogs/2020-03-26-tvm-flow/

https://krantz-xrf.github.io/2019/10/24/tvm-workflow.html

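This post walks through TVM's code generation flow, that is, what happens after relay.build or tvm.build is called, by digging into the TVM source code. (The version used here is still TVM v0.6.)

First, the difference between the two build functions: tvm.build targets a single operator (see the Tensor Expression post), while relay.build compiles a whole model (see the GCN optimization post). Relay eventually calls tvm::build as well to do the code generation.

relay.build

A model is typically compiled with the following two statements.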

# Build with Relay

with relay.build_config(opt_level=0):

    graph, lib, params = relay.build(func, target, params=params)

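Tracing details

A brief note on how to trace the code. One way is to jump to definitions in VS Code by Alt-clicking a function name. For a more visual picture, pycallgraph (install it with pip first) can draw the call graph. The code below wraps the same GCN compilation code as before.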

from pycallgraph import PyCallGraph

from pycallgraph.output import GraphvizOutput

from pycallgraph import Config

graphviz = GraphvizOutput()

graphviz.output_file = 'relay_callgraph.png'

config = Config(max_depth=5)

with PyCallGraph(output=graphviz, config=config):

    # Build with Relay

    with relay.build_config(opt_level=0):

        graph, lib, params = relay.build(func, target, params=params)

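The generated call graph is shown in the figure below.

To keep the recursion from going too deep, the maximum depth is set to 5, yet the resulting graph is still large. Zoomed in, it still shows:

The call relationships between functions, e.g. tvm.relay.build_module.build -> tvm.relay.build_module.BuildModule.build
The FFI packed calls, i.e. the functions where C++ and Python call into each other
The dark nodes (long execution time), which are the core steps, i.e. the critical path
The call count of each node, e.g. tvm.build_module.lower is called 14 times, matching the 14 Relay operators; see the Relay IR computation graph visualization post.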
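If pycallgraph is not available, the same kind of caller/callee counts can be collected with the standard library alone. The following is only a minimal sketch built on sys.setprofile; the functions build and lower are hypothetical stand-ins for the real TVM entry points, and the depth limit mimics max_depth=5 above.

```python
import sys
from collections import Counter


def trace_calls(fn, max_depth=5):
    """Run fn() and count (caller, callee) edges between Python functions."""
    edges = Counter()
    stack = []  # names of the Python functions currently executing

    def profiler(frame, event, arg):
        if event == "call":
            name = frame.f_code.co_name
            # Record the edge only while the caller is within the depth limit.
            if stack and len(stack) < max_depth:
                edges[(stack[-1], name)] += 1
            stack.append(name)
        elif event == "return" and stack:
            stack.pop()

    sys.setprofile(profiler)
    try:
        fn()
    finally:
        sys.setprofile(None)
    return edges


# Hypothetical stand-ins for relay.build and the per-operator lower() calls.
def lower():
    pass


def build():
    for _ in range(14):
        lower()


for (caller, callee), count in trace_calls(build).items():
    print(f"{caller} -> {callee}: {count}")
```

Only Python-level calls are counted here; calls that cross the FFI into C++ show up as C function events, which this sketch ignores.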

Tracing relay.build leads to python/tvm/relay/build_module.py (relay/__init__.py imports the build function directly into the relay namespace, which is why the build_module level is skipped). The build function there is a module-level helper function in build_module.

def build(mod, target=None, target_host=None, params=None):

    # do something

    if isinstance(autotvm.DispatchContext.current, autotvm.FallbackContext):

        tophub_context = autotvm.tophub.context(list(target.values()))

    else:

        tophub_context = autotvm.util.EmptyContext()

    with tophub_context:

        bld_mod = BuildModule()

        graph_json, mod, params = bld_mod.build(func, target, target_host, params)

    return graph_json, mod, params

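It first checks whether AutoTVM has pre-tuned parameter records and constructs tophub_context accordingly. Inside that context it creates a BuildModule and then calls BuildModule.build; the handles that build uses are set up in BuildModule.__init__.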

class BuildModule(object):

    """Build a Relay function to run on TVM graph runtime. This class is used

    to expose the `RelayBuildModule` APIs implemented in C++.

"""

    def __init__(self):

        self.mod = _build_module._BuildModule()

        self._get_graph_json = self.mod["get_graph_json"]

        self._get_module = self.mod["get_module"]

        self._build = self.mod["build"]

        self._optimize = self.mod["optimize"]

        self._set_params_func = self.mod["set_params"]

        self._get_params_func = self.mod["get_params"]

    def build(self, func, target=None, target_host=None, params=None):

        target = _update_target(target)

        # Setup the params.

        if params:

            self._set_params(params)

        # Build the function

        self._build(func, target, target_host)

        # Get artifacts

        graph_json = self.get_json()

        mod = self.get_module()

        params = self.get_params()

        return graph_json, mod, params
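The pattern in __init__ (look functions up by name once, keep the handles, call them later) can be pictured with a toy model. This is only a sketch: ToyRelayBuildModule and ToyBuildModule are hypothetical classes written for illustration, not TVM APIs, and the "graph JSON" they produce is a placeholder string.

```python
class ToyRelayBuildModule:
    """Hypothetical stand-in for the C++ RelayBuildModule."""

    def __init__(self):
        self._graph_json = None

    def get_function(self, name):
        # Return a callable chosen by its string name.
        if name == "build":
            return self._build
        if name == "get_graph_json":
            return lambda: self._graph_json
        raise KeyError(name)

    def _build(self, func, target, target_host):
        self._graph_json = '{"func": "%s", "target": "%s"}' % (func, target)

    def __getitem__(self, name):
        # Lets callers write mod["name"], as the code above does.
        return self.get_function(name)


class ToyBuildModule:
    """Python-side wrapper that caches the handles once in __init__."""

    def __init__(self):
        self.mod = ToyRelayBuildModule()
        self._build = self.mod["build"]
        self._get_graph_json = self.mod["get_graph_json"]

    def build(self, func, target=None, target_host=None):
        self._build(func, target, target_host)
        return self._get_graph_json()


print(ToyBuildModule().build("main", target="llvm"))
```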

_build_module._BuildModule() is in turn bound to a C++ function through the FFI in python/tvm/relay/_build_module.py (the call goes through tvm._ffi._ctypes.function.Function.__call__).

from tvm._ffi.function import _init_api

_init_api("relay.build_module", __name__)

The corresponding C++ function is in src/relay/backend/build_module.cc.

runtime::Module RelayBuildCreate() {

  auto exec = make_object<RelayBuildModule>();

  return runtime::Module(exec);

}

TVM_REGISTER_GLOBAL("relay.build_module._BuildModule")

.set_body([](TVMArgs args, TVMRetValue* rv) {

  *rv = RelayBuildCreate();

});
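The registration above boils down to a global table that maps a string name to a callable. A rough Python sketch of that idea follows; _REGISTRY, register_global and get_global are hypothetical names chosen for illustration, and the returned dictionary is a placeholder for the real module object.

```python
_REGISTRY = {}  # hypothetical stand-in for the global function table


def register_global(name):
    """Decorator playing the role of TVM_REGISTER_GLOBAL(name).set_body(...)."""
    def decorator(fn):
        if name in _REGISTRY:
            raise ValueError("global function %s is already registered" % name)
        _REGISTRY[name] = fn
        return fn
    return decorator


def get_global(name):
    """Look a function up by its string name; None if it is not registered."""
    return _REGISTRY.get(name)


@register_global("relay.build_module._BuildModule")
def _create_build_module():
    return {"kind": "RelayBuildModule"}  # placeholder object


factory = get_global("relay.build_module._BuildModule")
print(factory())
```

The same lookup-by-name mechanism appears again later in this post, where the code generator is selected through a name of the form "codegen.build_" + mode.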

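That is, a RelayBuildModule is registered for Python to call. Since we mainly use its build function, we look for the matching function inside RelayBuildModule. Here TVM adds one more layer of wrapping with PackedFunc, as shown below.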

PackedFunc GetFunction(const std::string& name,

                         const ObjectPtr<Object>& sptr_to_self) final {

      // ...

      if (name == "build") {

      return PackedFunc([sptr_to_self, this](TVMArgs args, TVMRetValue* rv) {

        CHECK_EQ(args.num_args, 3);

        this->Build(args[0], args[1], args[2]);

});

      // ...

So the call goes to this->Build, which in turn leads to BuildRelay.

  void BuildRelay(

      Function func,

      const std::unordered_map<std::string, tvm::runtime::NDArray>& params) {

    // Optimize input Relay Function and returns Relay Module

    relay::Module relay_module = Optimize(func, targets_, params);

    // Get the updated function.

    func = relay_module->Lookup("main");

    // Generate code for the updated function.

    graph_codegen_ = std::unique_ptr<GraphCodegen>(new GraphCodegen());

    graph_codegen_->Init(nullptr, targets_);

    graph_codegen_->Codegen(func);

    ret_.graph_json = graph_codegen_->GetJSON();

    ret_.params = graph_codegen_->GetParams();

    auto lowered_funcs = graph_codegen_->GetLoweredFunc();

    if (lowered_funcs.size() == 0) {

      LOG(WARNING) << "no lowered funcs exist in the compiled module";

    } else {

      ret_.mod = tvm::build(

        lowered_funcs,

        target_host_,

        BuildConfig::Current());

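After all these jumps we finally reach the core of build. TVM does its work in three steps:

Optimization
Graph generation
Backend code generation

Optimization

First comes Optimize. The optimizations here are mainly device-independent, graph-level optimizations on tensor computations. (These passes are all implemented in C++; in the earlier NNVM they apparently were still invoked from Python.)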

  relay::Module Optimize(

      Function func,

      const TargetsMap& targets,

      const std::unordered_map<std::string, runtime::NDArray>& params) {

    // BindParamsByName(func, params)

    // Perform Module->Module optimizations.

    relay::Module relay_module = relay::ModuleNode::FromExpr(func);

    Array<Pass> pass_seqs;

    // Run all dialect legalization passes.

    // ...

    pass_seqs.push_back(transform::SimplifyInference());

    // ... (definition of fskip elided)

    pass_seqs.push_back(transform::EliminateCommonSubexpr(fskip));

    pass_seqs.push_back(transform::CombineParallelConv2D(3));

    pass_seqs.push_back(transform::CombineParallelDense(3));

    pass_seqs.push_back(transform::FoldConstant());

    pass_seqs.push_back(transform::FoldScaleAxis());

    pass_seqs.push_back(transform::CanonicalizeCast());

    pass_seqs.push_back(transform::CanonicalizeOps());

    // ...AlterOpLayout

    pass_seqs.push_back(transform::FoldConstant());

    // Create a sequential pass and perform optimizations.

    transform::Pass seq = transform::Sequential(pass_seqs);

    // ... judge & do

    relay_module = seq(relay_module);

    // Handle heterogeneous compilation.

    transform::PassContext pass_ctx = PassContext::Current();

    if (targets_.size() > 1) {

      relay_module =

          RunDeviceAnnotationPass(relay_module, pass_ctx->fallback_device);

    // Fuse the operations if it is needed.

    relay_module = transform::FuseOps()(relay_module);

    relay_module = transform::InferType()(relay_module);

    CHECK(relay_module.defined());

    return relay_module;
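The way Optimize chains Module-to-Module passes can be illustrated on a toy IR. The sketch below is not Relay: expressions are nested tuples such as ("add", lhs, rhs), variables are strings, and fold_constant and simplify_add_zero are hypothetical passes that only imitate the spirit of FoldConstant and a simplification pass.

```python
def fold_constant(expr):
    """Fold ("add", int, int) into a single int, bottom-up."""
    if not isinstance(expr, tuple):
        return expr
    op, lhs, rhs = expr[0], fold_constant(expr[1]), fold_constant(expr[2])
    if op == "add" and isinstance(lhs, int) and isinstance(rhs, int):
        return lhs + rhs
    return (op, lhs, rhs)


def simplify_add_zero(expr):
    """Rewrite x + 0 and 0 + x to x, bottom-up."""
    if not isinstance(expr, tuple):
        return expr
    op, lhs, rhs = expr[0], simplify_add_zero(expr[1]), simplify_add_zero(expr[2])
    if op == "add" and rhs == 0:
        return lhs
    if op == "add" and lhs == 0:
        return rhs
    return (op, lhs, rhs)


def sequential(passes):
    """Compose passes so that they run in list order."""
    def run(expr):
        for p in passes:
            expr = p(expr)
        return expr
    return run


pipeline = sequential([fold_constant, simplify_add_zero, fold_constant])
expr = ("add", "x", ("add", ("add", 1, -1), 0))
print(pipeline(expr))
```

Running the same pass more than once, as the real pipeline does with FoldConstant, is cheap in this model because each pass is just a function from expression to expression.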

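Graph generation

This step corresponds to the GraphCodegen class. In the same way, it calls relay.build_module._GraphRuntimeCodegen from src/relay/backend/build_module.cc (again through the FFI) and then jumps to src/relay/backend/graph_runtime_codegen.cc, where the function has been registered with TVM_REGISTER_GLOBAL; that is, GraphRuntimeCodegenModule creates the corresponding Object.

So graph_codegen_->Codegen is actually a PackedFunc whose body is GraphRuntimeCodegen.Codegen. It traverses the relay::Function func and generates the computation graph.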
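To make "traverse the function and generate the graph" concrete, here is a small sketch that walks a nested-tuple expression in post-order and emits a JSON node list. The field names ("op", "name", "inputs", "heads") are loosely modeled on the graph runtime JSON and heavily simplified; treat the exact layout as an assumption rather than the real format.

```python
import json


def codegen_graph(expr):
    """Post-order walk over a nested-tuple expression, one node per value."""
    nodes, memo = [], {}

    def visit(e):
        if e in memo:              # reuse nodes for repeated subexpressions
            return memo[e]
        if isinstance(e, str):     # a variable becomes an input node
            node = {"op": "null", "name": e, "inputs": []}
        else:                      # (op_name, arg0, arg1, ...)
            inputs = [visit(arg) for arg in e[1:]]
            node = {"op": "tvm_op", "name": e[0], "inputs": inputs}
        nodes.append(node)
        memo[e] = len(nodes) - 1
        return memo[e]

    head = visit(expr)
    return json.dumps({"nodes": nodes, "heads": [head]})


func = ("relu", ("add", ("dense", "data", "weight"), "bias"))
print(codegen_graph(func))
```

Because arguments are visited before the operator that consumes them, every entry in "inputs" refers to a node that already exists in the list.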

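Backend code generation

Once Relay has the lowered functions, the last step is to hand them to tvm::build for code generation. This goes to the build function in src/codegen/build_module.cc (note that it has several overloads) and then to the core build. This build supports heterogeneous compilation: it is enough to partition inputs by target device.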

// Build for heterogeneous execution.

runtime::Module build(const Map<Target, Array<LoweredFunc>>& inputs,

                      const Target& target_host,

                      const BuildConfig& config) {

  Array<LoweredFunc> fhost_all;

  std::vector<runtime::Module> device_modules;

  Target target_host_val = target_host;

  if (!target_host.defined()) {

    for (const auto& it : inputs) {

      if (it.first->device_type == kDLCPU) {

        target_host_val = it.first;

        break;

  if (!target_host_val.defined()) {

    target_host_val = DefaultTargetHost(target_host_val);

  for (const auto& it : inputs) {

    auto host_dev_funcs =

        split_dev_host_funcs(it.second, it.first, target_host_val, config);

    auto& fhost = host_dev_funcs[0];

    auto& fdevice = host_dev_funcs[1];

    // Get the module for a certain target.

    runtime::Module mdev = DeviceBuild(fdevice, it.first);

    for (const auto& it : fhost) {

      fhost_all.push_back(it);

    device_modules.push_back(mdev);

  runtime::Module mhost = codegen::Build(fhost_all, target_host_val->str());

  // Import all modules

  for (const auto& it : device_modules) {

    if (it.operator->()) {

      mhost.Import(it);

  return mhost;
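The structure of the heterogeneous build (split each target's functions, collect all host functions, build one device module per target, import them into the host module) can be sketched in a few lines. Everything here is a toy model: functions are plain strings, modules are dictionaries, and the rule that names starting with "kernel_" are device code is a hypothetical convention for this example only.

```python
def split_dev_host(funcs):
    """Hypothetical split: names starting with 'kernel_' run on the device."""
    fhost = [f for f in funcs if not f.startswith("kernel_")]
    fdevice = [f for f in funcs if f.startswith("kernel_")]
    return fhost, fdevice


def build_heterogeneous(inputs, target_host="llvm"):
    """inputs maps a target name to the functions lowered for that target."""
    fhost_all, device_modules = [], []
    for target, funcs in inputs.items():
        fhost, fdevice = split_dev_host(funcs)
        fhost_all.extend(fhost)
        if fdevice:  # only targets with device code produce a device module
            device_modules.append({"target": target, "funcs": fdevice})
    # One host module for all targets; device modules are imported into it.
    return {"target": target_host, "funcs": fhost_all, "imports": device_modules}


mhost = build_heterogeneous({
    "cuda": ["fused_conv", "kernel_fused_conv"],
    "llvm": ["fused_add"],
})
print(mhost)
```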

The most important line is mhost = codegen::Build, which finally invokes the code generation module (src/codegen/codegen.cc).

runtime::Module Build(const Array<LoweredFunc>& funcs,

                      const std::string& target) {

  // do something

  std::string build_f_name = "codegen.build_" + mode;

  // the build function.

  const PackedFunc* bf = runtime::Registry::Get(build_f_name);

  runtime::Module m = transformed_funcs.empty() ?

                      (*bf)(funcs, target) :

                      (*bf)(transformed_funcs, target);

  return m;

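Taking LLVM IR generation as an example, codegen.build_llvm is registered in src/codegen/llvm/llvm_module.cc and calls LLVMModuleNode->Init in the same file. From there the flow enters the CodeGenLLVM class in src/codegen/llvm/codegen_llvm.cc, which generates the code.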

tvm.build

Compiling an operator with tvm.build looks like this; the example comes from the Tensor Expression post.

s = tvm.create_schedule(C.op)

tgt = "llvm" # "cuda"

fadd = tvm.build(s,[A,B,C],target=tgt,name="myadd")

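Calling tvm.build first goes to python/tvm/build_module.py, where the build function does two main things:

Lower the high-level code
Generate backend code

Code transformation

Lowering the high-level code corresponds to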

flist = lower(inputs,args,name=name,binds=binds)

The lower function is also in python/tvm/build_module.py. It plays a role similar to Optimize in relay.build, but the optimizations here are operator-level and mainly concern loop transformations.

def lower(sch,

          args,

          name="default_function",

          binds=None,

          simple_mode=False):

    # initialization

    # Phase 0

    if isinstance(sch, schedule.Schedule):

        stmt = form_body(sch)

    for f in lower_phase0:

        stmt = f(stmt)

    compact = ir_pass.VerifyCompactBuffer(stmt)

    binds, arg_list = get_binds(args, compact, binds)

    # Phase 1

    stmt = ir_pass.RewriteForTensorCore(stmt, sch, binds)

    stmt = ir_pass.StorageFlatten(stmt, binds, 64, cfg.instrument_bound_checkers)

    stmt = ir_pass.CanonicalSimplify(stmt)

    for f in lower_phase1:

        stmt = f(stmt)

    # Phase 2

    if not simple_mode:

        stmt = ir_pass.LoopPartition(stmt, cfg.partition_const_loop)

    if cfg.disable_vectorize:

        stmt = ir_pass.SkipVectorize(stmt)

    else:

        stmt = ir_pass.VectorizeLoop(stmt)

    stmt = ir_pass.InjectVirtualThread(stmt)

    stmt = ir_pass.InjectDoubleBuffer(stmt, cfg.double_buffer_split_loop)

    stmt = ir_pass.StorageRewrite(stmt)

    stmt = ir_pass.UnrollLoop(

        stmt,

        cfg.auto_unroll_max_step,

        cfg.auto_unroll_max_depth,

        cfg.auto_unroll_max_extent,

        cfg.unroll_explicit)

    for f in lower_phase2:

        stmt = f(stmt)

    # Phase 3

    stmt = ir_pass.Simplify(stmt)

    stmt = ir_pass.RemoveNoOp(stmt)

    if not cfg.disable_select_rewriting:

        stmt = ir_pass.RewriteUnsafeSelect(stmt)

    for f in lower_phase3:

        stmt = f(stmt)

    # Instrument BoundCheckers

    if cfg.instrument_bound_checkers:

        stmt = ir_pass.InstrumentBoundCheckers(stmt)

    if simple_mode:

        return stmt

    return ir_pass.MakeAPI(stmt, name, arg_list, 0, cfg.restricted_func)

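The passes themselves are exposed in src/api/api_pass.cc and registered under tvm.ir_pass (the C++ code is already inside the tvm namespace, so search for ir_pass directly to find the corresponding API).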
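The lower_phase0 to lower_phase3 lists in the code above are where user-supplied passes slot in between the built-in ones. The sketch below imitates that phase structure with a toy "statement" that is just a list recording which passes ran. How the custom passes are collected is elided in the listing ("# initialization"), so the (phase, function) pair format used here is an assumption.

```python
def lower(stmt, custom_passes=(), simple_mode=False):
    """Run built-in passes phase by phase, then the custom passes of that phase."""
    builtin = {
        0: [],
        1: [lambda s: s + ["StorageFlatten"], lambda s: s + ["CanonicalSimplify"]],
        2: [lambda s: s if simple_mode else s + ["LoopPartition"],
            lambda s: s + ["VectorizeLoop"]],
        3: [lambda s: s + ["Simplify"], lambda s: s + ["RemoveNoOp"]],
    }
    for phase in range(4):
        # Assumed format: custom passes arrive as (phase, function) pairs.
        custom = [f for (p, f) in custom_passes if p == phase]
        for f in builtin[phase] + custom:
            stmt = f(stmt)
    return stmt


def my_pass(stmt):
    """Hypothetical user pass; a real one would rewrite the IR statement."""
    return stmt + ["my_pass"]


print(lower([], custom_passes=[(1, my_pass)]))
```

In this model a phase-1 custom pass runs after StorageFlatten and CanonicalSimplify but before anything in phase 2, which matches the ordering visible in the listing.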

Code generation

After lowering comes backend code generation, which corresponds to this line in build:

mhost = codegen.build_module(fhost_all, str(target_host))

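By the same mechanism, this goes to tvm/codegen.py, which initializes the tvm.codegen API codegen._Build and calls through the FFI into src/api/api_codegen.cc, and finally reaches tvm::Build in src/codegen/codegen.cc. From there on, backend code generation is the same as for relay.build.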

References

TVM Codebase Walkthrough by Example, https://docs.tvm.ai/dev/codebase_walkthrough.html
A Brief Look at Relay, TVM's Graph Compiler, an article by 郑思泽 on Zhihu, https://zhuanlan.zhihu.com/p/91283238
谢睿峰, TVM/VTA Code Generation Flow, https://krantz-xrf.github.io/2019/10/24/tvm-workflow.html
https://discuss.tvm.ai/t/relationship-between-tvm-build-and-relay-build/4166

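TVM/VTA Code Generation Flow

I have recently read a lot of the TVM/VTA backend code generation code. Here is a summary of what I learned, for anyone who may need it.

About TVM/VTA

TVM is a framework for describing deep learning computations. Operators (inputs, outputs, computation and so on) are described in Python, which forms an abstract syntax tree (AST). Inside TVM this is converted to an intermediate representation (IR) and finally to machine code for the target platform, so that the operators can be used to build more complex neural networks.

VTA (Versatile Tensor Accelerator) is an extension of the TVM framework; it can be thought of simply as a low-level hardware implementation for deep neural networks.

This post is an overview of how TVM generates backend machine code from the IR.

I do not have VTA hardware at hand, so I used tsim, the simulator provided by TVM. Everything described below is for tsim; for other hardware backends the general idea should be the same, but the details will certainly differ.

Code sample analyzed

The analysis uses the test code from the official tutorial; see Get Started with VTA.

The flow described below starts at the call to vta.build.

Code generation

vta.build first determines that the host side (the platform the main program runs on) uses llvm for code generation (target_host='llvm'); it then forwards the call directly to tvm.build.

Normalizing the input

Because Python is dynamically typed, the type of a function argument is not fixed. TVM accepts four kinds of input to tvm.build: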

Schedule
LoweredFunc
[LoweredFunc]
{target: [LoweredFunc]}

The conversion goes through the following steps:

$$\mathrm{Schedule} \xrightarrow{\mathrm{lower}} \mathrm{LoweredFunc} \xrightarrow{[\ast]} [\mathrm{LoweredFunc}] \xrightarrow{\{\mathrm{target}:\,\ast\}} \{\mathrm{target} : [\mathrm{LoweredFunc}]\} \tag{1}$$

In the end, all inputs are normalized into the following form:

target_flist = {'ext_dev': [LoweredFunc]}
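The chain of conversions can be written as a short normalization function. This is a sketch under stated assumptions: Schedule and LoweredFunc are empty stand-in classes, lower is a placeholder, and normalize_inputs is a hypothetical name; the real tvm.build does more work and more checking.

```python
class Schedule:          # hypothetical stand-in for the real Schedule type
    pass


class LoweredFunc:       # hypothetical stand-in for the real LoweredFunc type
    def __init__(self, name):
        self.name = name


def lower(sch, name="default_function"):
    """Placeholder for the real lowering step."""
    return LoweredFunc(name)


def normalize_inputs(inputs, target="llvm"):
    """Bring any accepted input form to {target: [LoweredFunc]}."""
    if isinstance(inputs, Schedule):
        inputs = lower(inputs)                 # Schedule -> LoweredFunc
    if isinstance(inputs, LoweredFunc):
        inputs = [inputs]                      # LoweredFunc -> [LoweredFunc]
    if isinstance(inputs, (list, tuple)):
        inputs = {target: list(inputs)}        # [LoweredFunc] -> {target: [...]}
    if not isinstance(inputs, dict):
        raise ValueError("unsupported input type: %r" % type(inputs))
    return inputs


target_flist = normalize_inputs(Schedule(), target="ext_dev")
print({t: [f.name for f in fl] for t, fl in target_flist.items()})
```

Each `if` handles one arrow of the chain and falls through to the next, so an input that is already a dictionary passes through untouched.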

Lowering the code representation

This part corresponds to the function lower. The overall flow is shown in the figure below:

[Figure: flow of lower. Labels recovered from the diagram:
input: sch
form_body: normalize, schedule.InferBound, schedule.ScheduleOps, ir_pass.InjectPrefetch
phase 0: custom passes
phase 1: ir_pass.StorageFlatten, ir_pass.CanonicalSimplify, custom passes
phase 2: ir_pass.LoopPartition (if not simple_mode), ir_pass.SkipVectorize (if disable_vectorize) or ir_pass.VectorizeLoop (otherwise), ir_pass.InjectVirtualThread, ir_pass.InjectDoubleBuffer, ir_pass.StorageRewrite, ir_pass.UnrollLoop, custom passes
phase 3: ir_pass.Simplify, ir_pass.LowerStorageAccessInfo, ir_pass.RemoveNoOp, ir_pass.RewriteUnsafeSelect (if not disable_select_rewriting), custom passes
ir_pass.InstrumentBoundCheckers (if instrument_bound_checkers)
ir_pass.MakeAPI (skipped if simple_mode)
output: preprocessed result]

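Taking our test code as an example, the code changes after each pass as listed in the table below:

Step	Pass	Changed?	Change
0			Initial state
1.1	`StorageFlatten`		`realize` -> `allocate`; indices flattened from multi-dimensional to one-dimensional
1.2	`CanonicalSimplify`		Two nested `for` loops -> `VTAStoreBuffer2D`
1.3	(external pass)
1.4	(external pass)		Some new attributes added
1.5	(external pass)		Some attributes removed
1.6	(external pass)		Buffer allocation moved out of the `produce` block
1.7	(external pass)		Synchronization attributes added
1.8	(external pass)		Allocations of `A`, `B` and `C` merged into the allocation of `A`
1.9	(external pass)
2.1	`LoopPartition`
2.2	`VectorizeLoop`
2.3	`InjectVirtualThread`
2.4	`InjectDoubleBuffer`
2.5	`StorageRewrite`
2.6	`UnrollLoop`		Loops turned into `VTAUopLoopBegin`, `VTAUopPush` and `VTAUopLoopEnd`
2.7	(external pass)
3.1	`Simplify`		Buffer allocation removed entirely
3.2	`LowerStorageAccessInfo`
3.3	`RemoveNoOp`
3.4	`RewriteUnsafeSelect`
3.5	(external pass)
3.6	(external pass)
4			Final state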

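Entries marked "(external pass)" are passes registered from C++; they cannot be seen when tracing in Python.

The intermediate result files for this part can be downloaded from the original post.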

Checking each entry of `target_flist`

For every pair ⟨target, flist⟩ ∈ target_flist:

Function name check: report an error if two functions have the same name
Verify that target is a str or a _target.Target
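The two checks above are simple enough to write out. In this sketch Target is a hypothetical stand-in for _target.Target, functions are represented only by their names, and check_target_flist is an illustrative name rather than a TVM function.

```python
class Target:            # hypothetical stand-in for _target.Target
    def __init__(self, name):
        self.name = name


def check_target_flist(target_flist):
    """Reject duplicate function names and targets of the wrong type."""
    seen = set()
    for target, flist in target_flist.items():
        if not isinstance(target, (str, Target)):
            raise ValueError("target must be a str or a Target, got %r" % type(target))
        for name in flist:   # functions are represented by their names here
            if name in seen:
                raise ValueError("duplicate function name: %s" % name)
            seen.add(name)


check_target_flist({"ext_dev": ["vadd"], "llvm": ["helper"]})   # passes silently

try:
    check_target_flist({"ext_dev": ["vadd"], "llvm": ["vadd"]})
except ValueError as err:
    print(err)
```

Note that the name check spans all targets: the same name under two different targets is still a clash, because all host functions end up in one module.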

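Next, flist is passed to the function _build_for_device.

Generating target code for a specific device

The overall idea is this: split flist into host code (fhost) and device code (mdev), then generate machine code for each. The device-side module is imported into the host module, and the final result is the host module mhost.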

$$\mathrm{flist} \xrightarrow{\mathrm{\_build\_for\_device}} \left\{ \begin{array}{l} \mathrm{fhost} \xrightarrow{\mathrm{codegen.build\_module}} \\ \mathrm{mdev} \xrightarrow{\mathrm{import\_module}} \end{array} \right\} \to \mathrm{mhost} \tag{2}$$

The figure below shows the multi-pass processing of the IR:

[Figure: flow of _build_for_device. Labels recovered from the diagram:
input: flist
common: ir_pass.VerifyMemory, ir_pass.ThreadSync ('global', 'shared', 'warp'), ir_pass.LowerThreadAllreduce, ir_pass.SplitHostDevice
host branch: ir_pass.BindDeviceType (12), ir_pass.LowerTVMBuiltin, ir_pass.LowerIntrin ('llvm'), ir_pass.CombineContextCall -> fhost
device branch: ir_pass.LowerWarpMemory, ir_pass.LowerIntrin ('ext_dev'); if the result is empty, mdev is None, otherwise codegen.build_module ('ext_dev') -> mdev
final: codegen.build_module ('llvm') on fhost -> mhost, then import_module; mhost is the resulting module]

Loading the generated target code

The corresponding Python code is:

remote.upload("vadd.o")

f = remote.load_module("vadd.o")

Flow overview

[Figure: flow of loading a module. Labels recovered from the diagram:
remote.upload: load binary
remote.load_module with a LocalSession (TVM Python): _load_module -> module.load -> _cc.create_shared -> _linux_compile -> _LoadFromFile
remote.load_module with an RPCSession: rpc._LoadRemoteModule -> RPCSession::HandlePackedCall -> RPCModuleLoad -> tvm.rpc.server.load_module
both paths: Module::LoadFromFile -> module.loadfile_so -> DSOModuleNode::Load, or module.loadfile_vta-tsim -> DPIModuleNode::Load -> DPIModule::Init]

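The code below covers two cases: tsim and real hardware.

The analysis lists only the critical path that is actually executed; with the comments in the code it should be easy to follow. Indentation inside a code block indicates nested function calls.

The code shown is a mix of Python and C++. Every snippet has comments, so the two are easy to tell apart (Python uses #, C++ uses //).

Using the simulator

In this case the remote device remote is a LocalSession.

The key step here is to assemble a command line and call the system compiler g++ to link the object file (.o) into a shared library.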

# ..., then, in LocalSession.load_module
#   with path = "vadd.o"
_load_module(self._temp.relpath(path))
# _load_module is module.load
# in module.load, with path = (full path for "vadd.o"), fmt = ""
if path.endswith(".o"): # true
    _cc.create_shared(path + ".so", path)
        # in create_shared
        #   with output  = "vadd.o.so"
        #        objects = "vadd.o"
        #        options = None
        #        cc      = "g++"
        if sys.platform == "darwin" or sys.platform.startswith("linux"):
            _linux_compile(output, objects, options, cc)
                # in _linux_compile
                #   with output      = "vadd.o.so"
                #        objects     = "vadd.o"
                #        options     = None
                #        compile_cmd = "g++"
                cmd = [compile_cmd] # cmd: g++
                if output.endswith(".so"): # true
                    cmd += ["-shared", "-fPIC"]
                    if sys.platform == "darwin": # true
                        cmd += ["-undefined", "dynamic_lookup"]
                else: # false, ...
                # cmd: g++ -shared -fPIC -undefined dynamic_lookup
                cmd += ["-o", output]
                # cmd: g++ -shared -fPIC -undefined dynamic_lookup -o vadd.o.so
                if isinstance(objects, str): # true
                    cmd += [objects]
                else: # false, ...
                # cmd: g++ -shared -fPIC -undefined dynamic_lookup -o vadd.o.so vadd.o
                if options: # false, ...
                # run cmd
                proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
                (out, _) = proc.communicate()
                if proc.returncode != 0:
                    msg = "Compilation error:\n"
                    msg += py_str(out)
                    raise RuntimeError(msg)
        # back in create_shared
        else: # false, ...
    # back in module.load
    path += ".so"
else: # false
    # ...
return _LoadFromFile(path, fmt)
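The command assembly traced above can be isolated into a pure function, which makes it easy to see what gets passed to the compiler without running it. This is a sketch: link_command is a hypothetical name, and the platform is taken as a parameter here instead of being read from sys.platform so that the result does not depend on the machine running the example.

```python
def link_command(output, objects, options=None, cc="g++", platform="linux"):
    """Assemble the argument list that links object files into a shared library."""
    cmd = [cc]
    if output.endswith(".so"):
        cmd += ["-shared", "-fPIC"]
        if platform == "darwin":
            # On macOS, leave undefined symbols to be resolved at load time.
            cmd += ["-undefined", "dynamic_lookup"]
    cmd += ["-o", output]
    cmd += [objects] if isinstance(objects, str) else list(objects)
    if options:
        cmd += list(options)
    return cmd


print(" ".join(link_command("vadd.o.so", "vadd.o", platform="darwin")))
```

With the arguments from the trace, the printed command is the same one shown in the trace's final `# cmd:` comment.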

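_LoadFromFile is a C++ function wrapped on the Python side; its C++ counterpart is Module::LoadFromFile, see Finally loading the shared library.

Using a real VTA device

Since I have no real device, the execution flow here is the result of ~~static analysis~~ staring at the code and guessing.

In this case the remote device remote is an RPCSession.

Before LoadRemoteModule runs, there should also be some extra steps that turn the .o object file into a shared library.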

// in "rpc._LoadRemoteModule"

sess->CallRemote(RPCCode::kModuleLoad, args[1]);

// in RPCSession::HandlePackedCall

switch (code_)

    // ...

    case RPCCode::kModuleLoad: CallHandler(RPCModuleLoad); break;

    // ...

// in RPCModuleLoad

fsys_load_ = runtime::Registry::Get("tvm.rpc.server.load_module");

/* ... */ (*fsys_load_)(file_name);

// in "tvm.rpc.server.load_module"

// Below is Objective-C++:

//   - not quite familiar

//   - might misinterpret

// in "tvm.rpc.server.load_module", with name = "vadd.o.so"s

std::string fmt = GetFileFormat(name, "");

        // in tvm::runtime::GetFileFormat

        //   with file_name = "vadd.o.so"s

        //        format    = ""s

        if (format.length() == 0) { // true

            // ...

            size_t pos = file_name.find_last_of("."); // 6

            if (pos != std::string::npos) { // true

                return file_name.substr(pos + 1, file_name.length() - pos - 1);

                // "vadd.o.so"s.substr(from: 7, length: 2) = "so"s

            }

            // ...

        } // ...

// fmt = "so"s

// ... converting `name` to `path`

//     not quite sure because of use of Obj-C++

NSString* path = [base stringByAppendingPathComponent:

                         [NSString stringWithUTF8String:name.c_str()]];

// ... and again back to `name`

name = [path UTF8String];

// finally! loading from file?

// - no! yet another propagation

/* ... */ Module::LoadFromFile(name, fmt);

Finally loading the shared library

The core of this part comes down to calling dlopen (on POSIX systems) or LoadLibraryW (on Windows).

// in Module::LoadFromFile

//   with file_name = ... ("vadd.o.so" with full path)

//        format    = "so"s

std::string fmt = GetFileFormat(file_name, format);

// "so"s.length() != 0, should just return "so"s

// fmt = "so"s

if (fmt == "dll" || fmt == "dylib" || fmt == "dso") { /* ... */ } // false

std::string load_f_name = "module.loadfile_" + fmt;

// load_f_name = "module.loadfile_so"s

f = Registry::Get(load_f_name);

/* ... */ (*f)(file_name, format);

// in "module.loadfile_so"

n = std::make_shared<DSOModuleNode>();

n->Init(args[0]);

// in DSOModuleNode::Init

DSOModuleNode::Load(name); // propagate to LoadLibraryW/dlopen

// ...

InitContextFunctions([this](const char* fname) { return GetSymbol(fname); });

// Load the imported modules

const char* dev_mblob = GetSymbol(runtime::symbol::tvm_dev_mblob);

if (dev_mblob != nullptr) { /* ... */ }
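The two string manipulations in this path (deriving the format from the file name, then building the loader's registry name) are easy to check in isolation. The sketch below mirrors only the branches shown in the traces; the branch for "dll", "dylib" and "dso" is left out because its body is elided above, and loader_name is a hypothetical helper name.

```python
def get_file_format(file_name, fmt=""):
    """Return fmt if given, otherwise the text after the last dot of file_name."""
    if fmt:
        return fmt
    pos = file_name.rfind(".")
    return file_name[pos + 1:] if pos != -1 else ""


def loader_name(file_name, fmt=""):
    """Registry name of the loader that would be looked up for this file."""
    return "module.loadfile_" + get_file_format(file_name, fmt)


print(get_file_format("vadd.o.so"))
print(loader_name("vadd.o.so"))
```

For "vadd.o.so" the last dot is at index 6, so the format is the two characters starting at index 7.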


Source: https://www.cnblogs.com/wujianming-110117/p/16887871.html
