
IREE compilation pipeline (2): buildGlobalOptimizationPassPipeline

Posted: 2024-08-03

buildGlobalOptimizationPassPipeline

  • IREE::Util::createSimplifyGlobalAccessesPass
    This pass mainly does the following:

    • Hoists loads of immutable global tensors to the beginning of their block, and safely sinks stores to global tensors to the end of the block.
    • Performs the following simplifications:
      • Load after store: replace the load with the store's source. For example,
      store %0, @p
      %1 = load @p
      return %1
      
      becomes
      store %0, @p
      return %0
      
      • Store after store: eliminate the earlier store.
      store %0, @p
      store %1, @p
      
      becomes
      store %1, @p
      
      • Load after load: eliminate the later load.
      %0 = load @p
      %1 = load @p
      return %1
      
      becomes
      %0 = load @p
      return %0
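
    The three simplifications can be sketched as a single forward scan over a straight-line sequence of load/store ops. This is a hypothetical toy model in Python, not IREE's actual implementation (which also has to respect side effects and mutability when hoisting and sinking):

```python
def simplify(ops):
    """ops: list of ("store", value, sym) or ("load", result, sym)."""
    out = []
    avail = {}       # sym -> SSA value known to hold the global's value
    last_store = {}  # sym -> index in `out` of the latest store to sym
    remap = {}       # load result -> replacement value
    for kind, v, sym in ops:
        if kind == "store":
            if sym in last_store:            # store after store:
                out[last_store[sym]] = None  # kill the earlier store
            last_store[sym] = len(out)
            out.append(("store", v, sym))
            avail[sym] = v
        else:                                # load
            if sym in avail:                 # load after store / load after load:
                remap[v] = avail[sym]        # forward the value, drop this load
            else:
                avail[sym] = v
                out.append(("load", v, sym))
    return [op for op in out if op is not None], remap

# store %0, @p ; %1 = load @p  ==>  store %0, @p ; %1 replaced by %0
assert simplify([("store", "%0", "@p"), ("load", "%1", "@p")]) == \
    ([("store", "%0", "@p")], {"%1": "%0"})
```
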
      
  • IREE::Util::createApplyPatternsPass
    Applies the canonicalization patterns defined in the IREE::Util dialect ODS, and simplifies block arguments and branch operands.

    • Block argument simplification
    br ^bb1(%0, %0 : index, index)
    ^bb1(%arg0: index, %arg1: index):
      ...
    

    Identical arguments are folded, simplifying this to

    br ^bb1(%0 : index)
    ^bb1(%arg0: index):  // %arg1 remapped to %arg0
      ...
    
    • Branch operand elimination
    func.func @foo(%arg0: index) {
      br ^bb1(%arg0 : index)
      ^bb1(%0: index):
        ...
    }
    

    After the operand is eliminated,

    func.func @foo(%arg0: index) {
      br ^bb1
      ^bb1:  // %0 remapped to %arg0
        ...
    }
    
  • IREE::Util::createFoldGlobalsPass
    This pass further optimizes loads and stores of global tensors, mainly:

    • Inline constant stores. For example,
    util.global mutable @a : i32
    func.func @fool {
      %c5 = arith.constant 5 : i32
      util.global.store %c5, @a : i32
      return
    }
    

    becomes

    util.global @a = 5 : i32
    
    • Inline constant loads. For example,
    util.global @a = 5 : i32
    func.func @fool {
      %1 = util.global.load @a : i32
      ...
    }
    

    becomes

    func.func @fool {
      %1 = arith.constant 5 : i32
      ...
    }
    
    • Rename global tensors that form chains.
    • If a mutable global tensor is only stored to in the initializer, make it immutable.
    • Delete global tensors that are never loaded.
    • Merge immutable global tensors that have the same initial value.
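
    The first two rewrites can be modeled with a small Python sketch (a hypothetical toy, with globals as dicts and a function body as an op list; the real pass additionally tracks which stores happen in initializers):

```python
def fold_globals(globals_, body):
    """globals_: {name: {"mutable": bool, "init": constant-or-None}}
    body: list of ("store", const, g) / ("load", ssa, g) ops."""
    # Inline constant stores: a mutable global whose only store writes a
    # constant becomes an immutable global initialized to that constant.
    for g, info in globals_.items():
        stores = [op for op in body if op[0] == "store" and op[2] == g]
        if info["mutable"] and len(stores) == 1:
            info["mutable"] = False
            info["init"] = stores[0][1]
            body.remove(stores[0])
    # Inline constant loads: a load of an immutable global becomes a constant.
    body[:] = [("constant", op[1], globals_[op[2]]["init"])
               if op[0] == "load" and not globals_[op[2]]["mutable"] else op
               for op in body]
    return globals_, body

# util.global mutable @a ; store 5, @a ; %1 = load @a
g = {"@a": {"mutable": True, "init": None}}
ops = [("store", 5, "@a"), ("load", "%1", "@a")]
fold_globals(g, ops)
assert g["@a"] == {"mutable": False, "init": 5}   # util.global @a = 5
assert ops == [("constant", "%1", 5)]             # %1 = arith.constant 5
```
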
  • IREE::Flow::createTensorPadToTensorInsertSlicePass
    Converts tensor.pad into linalg.fill + tensor.insert_slice.

    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x1xf32>
      %padded = tensor.pad %0 low[1, 2] high[3, 4] {
      ^bb0(%arg1: index, %arg2: index):
        tensor.yield %cst : f32
      } : tensor<1x1xf32> to tensor<5x7xf32>
      %1 = hal.tensor.export %padded : tensor<5x7xf32> -> !hal.buffer_view
      return %1 : !hal.buffer_view
    }
    

    becomes

    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x1xf32>
      %1 = tensor.empty() : tensor<5x7xf32>
      %2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<5x7xf32>) -> tensor<5x7xf32>
      %inserted_slice = tensor.insert_slice %0 into %2[1, 2] [1, 1] [1, 1] : tensor<1x1xf32> into tensor<5x7xf32>
      %3 = hal.tensor.export %inserted_slice : tensor<5x7xf32> -> !hal.buffer_view
      return %3 : !hal.buffer_view
    }
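
    In NumPy terms the rewrite is: allocate the padded shape, fill it with the padding value, then insert the original tensor at the low offsets. A minimal sketch with the shapes from the IR above (NumPy assumed):

```python
import numpy as np

x = np.full((1, 1), 5.0, dtype=np.float32)        # tensor<1x1xf32>

# tensor.pad %0 low[1, 2] high[3, 4], padding value 0.0
padded_ref = np.pad(x, ((1, 3), (2, 4)), constant_values=0.0)

# Rewritten: tensor.empty + linalg.fill + tensor.insert_slice
filled = np.full((5, 7), 0.0, dtype=np.float32)   # fill with the pad value
filled[1:2, 2:3] = x                              # insert at offsets [1, 2]

assert np.array_equal(filled, padded_ref)
```
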
    
  • mlir::createConvertElementwiseToLinalgPass
    Converts elementwise ops (ops with the Elementwise trait) into linalg generic ops so that later elementwise fusion is easier. Ops in the arith dialect and math dialect are Elementwise, so in practice this pass lowers arith and math dialect ops into the linalg dialect.

    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
      %1 = arith.addf %0, %0 : tensor<2x3xf32>
      %2 = hal.tensor.export %1 : tensor<2x3xf32> -> !hal.buffer_view
      return %2 : !hal.buffer_view
    }
    

    becomes

    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
      %1 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%0, %0 : tensor<2x3xf32>, tensor<2x3xf32>) outs(%0 : tensor<2x3xf32>) {
      ^bb0(%in: f32, %in_0: f32, %out: f32):
        %3 = arith.addf %in, %in_0 : f32
        linalg.yield %3 : f32
      } -> tensor<2x3xf32>
      %2 = hal.tensor.export %1 : tensor<2x3xf32> -> !hal.buffer_view
      return %2 : !hal.buffer_view
    }
    
  • mlir::createLinalgFoldUnitExtentDimsPass
    Folds away dimensions (and the corresponding loops) of extent 1.

    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x3xf32>
      %1 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%0 : tensor<1x3xf32>) outs(%0 : tensor<1x3xf32>) {
      ^bb0(%in: f32, %out: f32):
        %3 = arith.addf %in, %in : f32
        linalg.yield %3 : f32
      } -> tensor<1x3xf32>
      %2 = hal.tensor.export %1 : tensor<1x3xf32> -> !hal.buffer_view
      return %2 : !hal.buffer_view
    }
    

    becomes

    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<1x3xf32>
      %collapsed = tensor.collapse_shape %0 [[0, 1]] : tensor<1x3xf32> into tensor<3xf32>
      %collapsed_0 = tensor.collapse_shape %0 [[0, 1]] : tensor<1x3xf32> into tensor<3xf32>
      %1 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%collapsed : tensor<3xf32>) outs(%collapsed_0 : tensor<3xf32>) {
      ^bb0(%in: f32, %out: f32):
        %3 = arith.addf %in, %in : f32
        linalg.yield %3 : f32
      } -> tensor<3xf32>
      %expanded = tensor.expand_shape %1 [[0, 1]] : tensor<3xf32> into tensor<1x3xf32>
      %2 = hal.tensor.export %expanded : tensor<1x3xf32> -> !hal.buffer_view
      return %2 : !hal.buffer_view
    }
    

    The linalg.generic is reduced from a 2-level loop nest to a single loop.
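
    The same rewrite in NumPy terms (collapse the unit dimension, compute in 1-D, expand back):

```python
import numpy as np

x = np.random.rand(1, 3).astype(np.float32)  # tensor<1x3xf32>
ref = x + x                                  # original 2-D elementwise add

collapsed = x.reshape(3)                     # tensor.collapse_shape [[0, 1]]
y = collapsed + collapsed                    # 1-D linalg.generic
expanded = y.reshape(1, 3)                   # tensor.expand_shape [[0, 1]]

assert np.array_equal(expanded, ref)
```
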

  • createInterchangeGenericOpsPass
    Loop interchange: moves reduction loop dimensions innermost, and the corresponding parallel loop dimensions outward.

    // sum(%arg0: tensor<2x3xf32>, 0) -> tensor<3xf32>
    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
      %1 = tensor.empty() : tensor<3xf32>
      %2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<3xf32>) -> tensor<3xf32>
      %3 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>], iterator_types = ["reduction", "parallel"]} ins(%0 : tensor<2x3xf32>) outs(%2 : tensor<3xf32>) {
      ^bb0(%in: f32, %out: f32):
        %5 = arith.addf %in, %out : f32
        linalg.yield %5 : f32
      } -> tensor<3xf32>
      %4 = hal.tensor.export %3 : tensor<3xf32> -> !hal.buffer_view
      return %4 : !hal.buffer_view
    }
    

    After the loop interchange this becomes

    func.func @foo(%arg0: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x3xf32>
      %1 = tensor.empty() : tensor<3xf32>
      %2 = linalg.fill ins(%cst : f32) outs(%1 : tensor<3xf32>) -> tensor<3xf32>
      %3 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d1, d0)>, affine_map<(d0, d1) -> (d0)>], iterator_types = ["parallel", "reduction"]} ins(%0 : tensor<2x3xf32>) outs(%2 : tensor<3xf32>) {
      ^bb0(%in: f32, %out: f32):
        %5 = arith.addf %in, %out : f32
        linalg.yield %5 : f32
      } -> tensor<3xf32>
      %4 = hal.tensor.export %3 : tensor<3xf32> -> !hal.buffer_view
      return %4 : !hal.buffer_view
    }
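
    The effect is easiest to see as scalar loop nests. A Python sketch of the iteration spaces before and after the interchange (only the loop order changes, not the result):

```python
import numpy as np

a = np.random.rand(2, 3).astype(np.float32)   # tensor<2x3xf32>

# Before: iterator_types = ["reduction", "parallel"], reduction outermost.
out1 = np.zeros(3, dtype=np.float32)
for d0 in range(2):          # reduction
    for d1 in range(3):      # parallel
        out1[d1] += a[d0, d1]

# After: iterator_types = ["parallel", "reduction"], reduction innermost:
# each output element is fully accumulated before moving to the next.
out2 = np.zeros(3, dtype=np.float32)
for d0 in range(3):          # parallel
    for d1 in range(2):      # reduction; input map (d0, d1) -> (d1, d0)
        out2[d0] += a[d1, d0]

assert np.allclose(out1, out2)
assert np.allclose(out1, a.sum(axis=0))
```
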
    
  • memref::createResolveShapedTypeResultDimsPass

  • mlir::createCanonicalizerPass

  • mlir::createCSEPass

  • createFusionOfTensorOpsPass
    Mainly performs elementwise op fusion; it also converts tensor.expand_shape into linalg generic ops to enable further fusion.

    Conditions for elementwise fusion:

    • Both the producer and the consumer are linalg generic ops with tensor semantics.
    • The producer has exactly one user.
    • All of the producer's iteration dimensions are parallel, and the consumer's indexing map must have the same loop nest depth as the producer.
    • The indexing map of the producer's result must be a permutation, i.e. each element of the result is stored exactly once (the output is pointwise).
    • The consumer may have reduction iteration dimensions, but after fusion the input indexing maps must cover every iteration dimension; otherwise the loop bound of a missing dimension could not be determined.
    // reduce(mul(arg0, arg1), 0)
    // for (int d0 = 0; d0 < n; ++d0) {
    //   temp[d0] = arg0[d0] * arg1[d0];
    // }
    // result = 0;
    // for (int d0 = 0; d0 < n; ++d0) {
    //   result += temp[d0];
    // }
    func.func @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2xf32>
      %2 = tensor.empty() : tensor<2xf32>
      %3 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%0, %1 : tensor<2xf32>, tensor<2xf32>) outs(%2 : tensor<2xf32>) {
      ^bb0(%in: f32, %in_0: f32, %out: f32):
        %8 = arith.mulf %in, %in_0 : f32
        linalg.yield %8 : f32
      } -> tensor<2xf32>
      %4 = tensor.empty() : tensor<f32>
      %5 = linalg.fill ins(%cst : f32) outs(%4 : tensor<f32>) -> tensor<f32>
      %6 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>], iterator_types = ["reduction"]} ins(%3 : tensor<2xf32>) outs(%5 : tensor<f32>) {
      ^bb0(%in: f32, %out: f32):
        %8 = arith.addf %in, %out : f32
        linalg.yield %8 : f32
      } -> tensor<f32>
      %7 = hal.tensor.export %6 : tensor<f32> -> !hal.buffer_view
      return %7 : !hal.buffer_view
    }
    

    After fusing the mul and the reduce, this becomes

    // result = 0;
    // for (int d0 = 0; d0 < n; ++d0) {
    //   result += arg0[d0] * arg1[d0];
    // }
    func.func @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<2xf32>
      %2 = tensor.empty() : tensor<f32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<f32>) -> tensor<f32>
      %4 = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>, affine_map<(d0) -> ()>], iterator_types = ["reduction"]} ins(%0, %1 : tensor<2xf32>, tensor<2xf32>) outs(%3 : tensor<f32>) {
      ^bb0(%in: f32, %in_0: f32, %out: f32):
        %6 = arith.mulf %in, %in_0 : f32
        %7 = arith.addf %6, %out : f32
        linalg.yield %7 : f32
      } -> tensor<f32>
      %5 = hal.tensor.export %4 : tensor<f32> -> !hal.buffer_view
      return %5 : !hal.buffer_view
    }
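
    A runnable version of the loop sketches in the comments above (plain Python loops over the tensor<2xf32> shapes):

```python
import numpy as np

n = 2
arg0 = np.random.rand(n).astype(np.float32)
arg1 = np.random.rand(n).astype(np.float32)

# Unfused: a parallel loop materializes temp, a reduction loop consumes it.
temp = np.empty(n, dtype=np.float32)
for d0 in range(n):
    temp[d0] = arg0[d0] * arg1[d0]
unfused = np.float32(0.0)
for d0 in range(n):
    unfused += temp[d0]

# Fused: the producer body is inlined into the reduction loop; temp is gone.
fused = np.float32(0.0)
for d0 in range(n):
    fused += arg0[d0] * arg1[d0]

assert np.allclose(fused, unfused)
assert np.allclose(fused, np.dot(arg0, arg1))
```
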
    
  • mlir::createLinalgDetensorizePass
    Converts 0-D tensors into their underlying element type.

  • mlir::createCanonicalizerPass

  • mlir::createCSEPass

  • createSplitReductionPass
    Splits the single reduction of matmul and topk into two reduction steps (for matmul: a batch matmul followed by an add). Disabled by default; enable it with --iree-flow-split-matmul-reduction>=2.

    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
      %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
      %4 = linalg.matmul ins(%0, %1 : tensor<128x256xf32>, tensor<256x256xf32>) outs(%3 : tensor<128x256xf32>) -> tensor<128x256xf32>
      %5 = hal.tensor.export %4 : tensor<128x256xf32> -> !hal.buffer_view
      return %5 : !hal.buffer_view
    }
    

    With --iree-flow-split-matmul-reduction=2 this becomes

    func.func @test(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<128x256xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<256x256xf32>
      %2 = linalg.init_tensor [128, 256] : tensor<128x256xf32>
      %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x256xf32>) -> tensor<128x256xf32>
      %4 = tensor.expand_shape %0 [[0], [1, 2]] : tensor<128x256xf32> into tensor<128x2x128xf32>
      %5 = tensor.expand_shape %1 [[0, 1], [2]] : tensor<256x256xf32> into tensor<2x128x256xf32>
      %6 = linalg.init_tensor [2, 128, 256] : tensor<2x128x256xf32>
      %7 = linalg.fill ins(%cst : f32) outs(%6 : tensor<2x128x256xf32>) -> tensor<2x128x256xf32>
      %8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d1, d0, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel", "reduction"]} ins(%4, %5 : tensor<128x2x128xf32>, tensor<2x128x256xf32>) outs(%7 : tensor<2x128x256xf32>) attrs =  {__internal_linalg_transform__ = "SPLIT", linalg.memoized_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]} {
      ^bb0(%arg2: f32, %arg3: f32, %arg4: f32):
        %11 = arith.mulf %arg2, %arg3 : f32
        %12 = arith.addf %arg4, %11 : f32
        linalg.yield %12 : f32
      } -> tensor<2x128x256xf32>
      %9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d1, d2)>], iterator_types = ["reduction", "parallel", "parallel"]} ins(%8 : tensor<2x128x256xf32>) outs(%3 : tensor<128x256xf32>) attrs =  {__internal_linalg_transform__ = "SPLIT"} {
      ^bb0(%arg2: f32, %arg3: f32):
        %11 = arith.addf %arg2, %arg3 : f32
        linalg.yield %11 : f32
      } -> tensor<128x256xf32>
      %10 = hal.tensor.export %9 : tensor<128x256xf32> -> !hal.buffer_view
      return %10 : !hal.buffer_view
    }
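
    The split can be checked numerically. A NumPy sketch with smaller shapes than the 128x256 example above (same structure: expand K into G groups, a batch-matmul-like partial reduction, then a reduction over the groups):

```python
import numpy as np

M, K, N, G = 4, 8, 5, 2                      # K is split into G chunks
a = np.random.rand(M, K).astype(np.float32)
b = np.random.rand(K, N).astype(np.float32)

a4 = a.reshape(M, G, K // G)                 # tensor.expand_shape [[0], [1, 2]]
b4 = b.reshape(G, K // G, N)                 # tensor.expand_shape [[0, 1], [2]]

# First linalg.generic: partial reduction over each K chunk, batched over G.
partial = np.einsum("mgk,gkn->gmn", a4, b4)

# Second linalg.generic: reduce over the G partial results.
result = partial.sum(axis=0)

assert np.allclose(result, a @ b, atol=1e-5)
```
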
    
  • createInterchangeGenericOpsPass
    Loop interchange: moves reduction loop dimensions innermost, and the corresponding parallel loop dimensions outward.

  • createInterchangeTransposeGenericOpsPass
    When an input's indexing map is a permutation, interchanges loop dimensions so that the input's indexing map becomes the identity, making the input's memory accesses as contiguous as possible.

  • createDispatchWithTransformDialect
    Schedules and dispatches ops according to a transform dialect module, which must be loaded separately; this transformation is off by default. The transform dialect defines a set of scheduling rules that guide transformations of the target IR, such as loop unrolling and tiling.

  • createFormDispatchRegionsPass
    Takes a linalg op that contains a reduction loop, or a named linalg op, as the center (root), merges producers and consumers into it according to certain rules, and carves out dispatch region subgraphs. A dispatch region is the atomic execution unit in IREE: inside a dispatch region the input and output memory can be reused directly, avoiding internal allocations; memory allocation happens only at dispatch region boundaries, and synchronization between dispatch regions is inserted automatically.

    func.func @predict(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x10xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<10x5xf32>
      %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<5xf32>
      %3 = tensor.empty() : tensor<2x5xf32>
      %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<2x5xf32>) -> tensor<2x5xf32>
      %5 = linalg.matmul ins(%0, %1 : tensor<2x10xf32>, tensor<10x5xf32>) outs(%4 : tensor<2x5xf32>) -> tensor<2x5xf32>
      %6 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%5, %2 : tensor<2x5xf32>, tensor<5xf32>) outs(%3 : tensor<2x5xf32>) {
      ^bb0(%in: f32, %in_0: f32, %out: f32):
        %8 = arith.addf %in, %in_0 : f32
        linalg.yield %8 : f32
      } -> tensor<2x5xf32>
      %7 = hal.tensor.export %6 : tensor<2x5xf32> -> !hal.buffer_view
      return %7 : !hal.buffer_view
    }
    

    becomes

    func.func @predict(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
      %cst = arith.constant 0.000000e+00 : f32
      %0 = hal.tensor.import %arg0 : !hal.buffer_view -> tensor<2x10xf32>
      %1 = hal.tensor.import %arg1 : !hal.buffer_view -> tensor<10x5xf32>
      %2 = hal.tensor.import %arg2 : !hal.buffer_view -> tensor<5xf32>
      %3 = tensor.empty() : tensor<2x5xf32>
      %4 = linalg.fill ins(%cst : f32) outs(%3 : tensor<2x5xf32>) -> tensor<2x5xf32>
      %c1 = arith.constant 1 : index
      %c0 = arith.constant 0 : index
      %c2 = arith.constant 2 : index
      %c1_0 = arith.constant 1 : index
      %5 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0, %c2, %c1_0]
      %c0_1 = arith.constant 0 : index
      %c5 = arith.constant 5 : index
      %c1_2 = arith.constant 1 : index
      %6 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1, %c5, %c1_2]
      %7 = flow.dispatch.region[%5, %6] -> (tensor<2x5xf32>) {
        %9 = linalg.matmul ins(%0, %1 : tensor<2x10xf32>, tensor<10x5xf32>) outs(%4 : tensor<2x5xf32>) -> tensor<2x5xf32>
        %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%9, %2 : tensor<2x5xf32>, tensor<5xf32>) outs(%3 : tensor<2x5xf32>) {
        ^bb0(%in: f32, %in_3: f32, %out: f32):
          %11 = arith.addf %in, %in_3 : f32
          linalg.yield %11 : f32
        } -> tensor<2x5xf32>
        flow.return %10 : tensor<2x5xf32>
      } count(%arg3: index, %arg4: index) -> (index, index, index) {
        %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4
        flow.return %x, %y, %z : index, index, index
      }
      %8 = hal.tensor.export %7 : tensor<2x5xf32> -> !hal.buffer_view
      return %8 : !hal.buffer_view
    }
    
  • createFormDispatchWorkgroupsPass
    dispatch region转换成dispatch work group的形式,并将 cloneable 的 op(比如tensor.filltensor.empty等)拷贝到 work group 中。如果在linalg层做了tiling,该 pass 也会把tiling引入的tensor.extract_slicetensor.insert_slice尽可能转换成flow.tensor.slice和flow.tensor.update,转换不了的后续再转换成flow.dispatch.tensor.loadflow.dispatch.tensor.store

  • createCaptureDispatchDynamicDimsPass
    Because dynamically shaped tensors in the arguments of flow.dispatch.workgroups have been replaced by !flow.dispatch.tensor plus the corresponding dynamic-dimension index values, this pass captures those dynamic-dimension indices and inserts flow.dispatch.tie_shape to bind them to the !flow.dispatch.tensor arguments.

  • mlir::createCanonicalizerPass

  • createCSEPass

  • createInitializeEmptyTensorsPass
    If a tensor.empty op has a user that is not a linalg or IREE LinalgExt op, converts that tensor.empty op into a flow.tensor.empty or flow.tensor.splat op.

  • IREE::Flow::createOutlineDispatchRegionsPass
    Outlines each dispatch region into a flow.executable plus a flow.dispatch op.

  • IREE::Util::createStripDebugOpsPass
    Removes DebugOnly ops.

  • mlir::createCanonicalizerPass

  • IREE::Flow::createDeduplicateExecutablesPass
    Removes duplicate flow.executables.

  • IREE::Flow::createInjectDispatchTracingPass
    Injects ops that trace the inputs and outputs of dispatch functions at runtime. Disabled by default.

  • IREE::Flow::createCleanupTensorShapesPass
    Removes flow.tensor.tie_shape ops and checks that the module no longer contains the shape-query ops tensor.dim and tensor.rank.

  • mlir::createCanonicalizerPass

  • mlir::createCSEPass

  • mlir::createCanonicalizerPass

  • mlir::createCSEPass

  • mlir::createSymbolDCEPass

To be continued…

From: https://blog.csdn.net/qq_38342510/article/details/140759512
