Hive执行计划详解

标签：Statistics stats Hive 详解 col0 Operator 执行 type Stage

什么是Hive SQL执行计划

Hive SQL执行计划描绘了SQL实际执行的整体轮廓，即SQL转化为对应计算引擎的执行逻辑；毫无疑问，这一块对于Hive SQL的优化是非常重要的。

Hive SQL早期是基于规则的方式生成执行计划，在Hive 0.14及之后，集成了Apache Calcite，使得Hive能够基于成本代价来生成执行计划。

Hive目前提供的是预估的执行计划，而非真实的执行计划（SQL实际执行完后才能获得的计划）。

查看Hive SQl执行计划

explain：查看基本信息
explain extended：查看扩展信息
explain dependency：查看SQL数据输入依赖的信息
explain authorization：查看SQL操作相关权限的信息
explain vectorization：查看向量化的描述信息

# 看这个前请先理解sql语句
STAGE DEPENDENCIES: # 作业依赖关系
  Stage-1 is a root stage # 根stage
  Stage-0 depends on stages: Stage-1 # stage-0依赖stage-1

STAGE PLANS:
  Stage: Stage-1  # Stage-1详细任务
    Map Reduce  	# 表示当前引擎使用的是 MapReduce
      Map Operator Tree: # Map阶段操作信息
          TableScan  		# 对下面alias声明的结果集进行 表扫描操作
            alias: student  	# alias声明student表
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE # 对当前阶段的统计信息，如当前处理的行和数据量（都是预估值）
            Filter Operator	# 表示在Tablescan的结果集上进行过滤
              predicate: ((s_age < 30) and (s_name like '%red%')) (type: boolean) # 过滤条件
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              Select Operator # 表示在过滤后的结果集上进行投影
                expressions: s_age (type: int) # 需要投影的列
                outputColumnNames: s_age	# 输出的列名
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                Group By Operator # 在投影后结果集上进行分组聚合
                  aggregations: count() # 分组聚合使用的算法
                  keys: s_age (type: int) # 分组的列
                  mode: hash # 采用hash
                  outputColumnNames: _col0, _col1 # 输出的列名
                  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                  Reduce Output Operator # Map端聚合
                    key expressions: _col0 (type: int) # Map端输出的key 
                    sort order: +  # +表示正序,-表示逆序
                    Map-reduce partition columns: _col0 (type: int) # 分区字段
                    Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                    value expressions: _col1 (type: bigint) # Map端输出的value
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0) # 对Map value的第一个值进行聚合算法
          keys: KEY._col0 (type: int) # key是Map Key的第一个值
          mode: mergepartial
          outputColumnNames: _col0, _col1 # 输出列名
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          File Output Operator # 对上面的结果集进行文件输出
            compressed: false # 不压缩
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            table: 
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat # 输入文件类型
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat # 输出文件类型
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe # 序列化、反序列化方式

  Stage: Stage-0
    Fetch Operator 
      limit: -1
      Processor Tree:
        ListSink

可以看到有两部分内容：STAGE DEPENDENCIES（作业的依赖关系图）、STAGE PLANS（每个作业的详细信息）。以及标注的详细解释。

执行计划归类

只有Map阶段的类型

select-from-where型：简单的SQL执行计划，不包含列操作、条件过滤、UDF、聚合、连接等操作的SQL。

由于不需要经过聚合，所以只有Map阶段操作，如果文件大小控制合适的话，可以完全发挥任务本地化执行的优点，也就是不需要跨节点，非常高效！

hive> explain
    > select s_age,s_name
    > from student
    > where s_age=10;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: student
          Statistics: ...
          Filter Operator
            predicate: (s_age = 10) (type: boolean)
            Statistics: ...
            Select Operator
              expressions: 10 (type: int), s_name (type: int)
              outputColumnNames: _col0, _col1
              Statistics: ...
              ListSink

select-func(col)-from-where-func(col) / select-operation-from-where-operation型：只带普通函数（除UDTF、UDAF、窗口函数）

同样只有Map阶段，非常高效！

hive> explain
    > select case when s_age>20 then 'over 20' when s_age=20 then 'equal 20' else 'other' end age
    > from student
    > where s_name is not null;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: student
          Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
          Filter Operator
            predicate: s_name is not null (type: boolean)
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: CASE WHEN ((s_age > 20)) THEN ('over 20') WHEN ((s_age = 20)) THEN ('equal 20') ELSE ('other') END (type: string)
              outputColumnNames: _col0
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              ListSink

Map+Reduce类型

select-aggr_func-from-where-groupby类型：带聚合函数的SQL。这类SQL可以分为如下几类：在Reduce阶段聚合的SQL执行计划、在Map和Reduce都有聚合的SQL计划、高级分组聚合的执行计划。

hive中可以通过配置hive.map.aggr来设定是否开启Combine。

高级分组聚合指的是聚合时使用：grouping sets、cube、rollup（使用高级分组聚合需要确保Map端reduce开启）。使用高级分组聚合可以将union需要多次的作业塞到一个作业中，可以减少多作业在磁盘和网络IO中的消耗，是一种优化。但是需要注意，这种聚合会造成数据极速膨胀；如果基表的数据量很大，容易导致Map或者Reduce任务因为硬件资源不足而崩溃，Hive中可以使用hive.new.job.grouping.set.cardinality配置这个问题。

带窗口函数的SQL执行计划

hive> explain
    > select s_name,row_number() over(partition by s_age order by s_score) rk
    > from student;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: student
            Statistics: ...
            Reduce Output Operator
              key expressions: s_age (type: int), s_score (type: int)
              sort order: ++
              Map-reduce partition columns: s_age (type: int)
              Statistics: ...
              value expressions: s_name (type: string)
      Execution mode: vectorized
      Reduce Operator Tree:
        Select Operator
          expressions: VALUE._col0 (type: string), KEY.reducesinkkey0 (type: int), KEY.reducesinkkey1 (type: int)
          outputColumnNames: _col0, _col1, _col2
          Statistics: ...
          PTF Operator # 窗口函数分析操作
            Function definitions: 
                Input definition
                  input alias: ptf_0
                  output shape: _col0: string, _col1: int, _col2: int
                  type: WINDOWING
                Windowing table definition
                  input alias: ptf_1
                  name: windowingtablefunction
                  order by: _col2 ASC NULLS FIRST # 窗口函数排序列
                  partition by: _col1	# 窗口函数分区列
                  raw input shape: 
                  window functions:
                      window function definition
                        alias: row_number_window_0
                        name: row_number # 窗口函数的方法
                        window function: GenericUDAFRowNumberEvaluator
                        window frame: ROWS PRECEDING(MAX)~FOLLOWING(MAX) # 表示当前窗口函数上下边界
                        isPivotResult: true
            Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: _col0 (type: string), row_number_window_0 (type: int)
              outputColumnNames: _col0, _col1
              Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

表连接的SQL计划

Hive表连接类型可以分为以下六种：

inner join
full outer join
left outer join
right outer join
left semi join：返回左表中与右表的匹配记录
cross join：返回两表连接字段的笛卡尔积

以下举例inner join

hive> explain
    > select a.s_no
    > from a a inner join b b on a.s_no=b.s_no;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan # 扫描表a
            alias: a
            Statistics: ...
            Filter Operator
              predicate: s_no is not null (type: boolean) # inner join 内置过滤
              Statistics: ...
              Select Operator
                expressions: s_no (type: int)
                outputColumnNames: _col0
                Statistics: ...
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: ...
          TableScan # 扫描表b
            alias: b
            Statistics: ...
            Filter Operator
              predicate: s_no is not null (type: boolean)
              Statistics: ...
              Select Operator
                expressions: s_no (type: int)
                outputColumnNames: _col0
                Statistics: ...
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: ...
      Reduce Operator Tree:
        Join Operator # 表连接操作
          condition map:
               Inner Join 0 to 1  # inner join 0 1 ;0 1指代的信息在下面
          keys:
            0 _col0 (type: int) # 0表示map阶段一个表的数据集 _col0表示连接字段
            1 _col0 (type: int) # 1表示map阶段另一个表的数据集 _col0表示连接字段
          outputColumnNames: _col0
          Statistics: ...
          File Output Operator
            compressed: false
            Statistics: ...
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

标签：Statistics,stats,Hive,详解,col0,Operator,执行,type,Stage
From： https://www.cnblogs.com/nangk/p/17649685.html