Applicable scenarios
Before execution, Hive compiles a SQL statement into MapReduce jobs, so it helps to understand what that translation looks like. The explain keyword shows the concrete execution plan. From the plan we can see how the SQL is turned into the execution logic of the underlying engine, which makes targeted optimization possible (for example, some SQL that looks equivalent is in fact not).
The usage is simply to put the explain keyword in front of the query:
explain select * from ads.ads_grow_ad_material_statistic_df where date = 20220609;
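Depending on the Hive version, explain also accepts a few modifiers that show more or different information; for example (the exact set available varies by version):
explain extended select * from ads.ads_grow_ad_material_statistic_df where date = 20220609;       -- more physical detail (paths, table properties)
explain dependency select * from ads.ads_grow_ad_material_statistic_df where date = 20220609;     -- the tables and partitions the query reads
explain authorization select * from ads.ads_grow_ad_material_statistic_df where date = 20220609;  -- the privileges the query needs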
Background
A Hive query is converted into a directed acyclic graph (DAG) made up of one or more stages, with dependencies between the stages. More complex queries usually produce more stages, and more stages usually mean the job takes longer to finish.
By default Hive executes one stage at a time. A stage can be a MapReduce job, but it can also be a sampling stage, a merge stage, or a limit stage.
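Independent stages can be allowed to run concurrently instead of one after another. A sketch of the relevant session settings (standard Hive properties; tune the values to your cluster):
set hive.exec.parallel=true;             -- let independent stages run in parallel
set hive.exec.parallel.thread.number=8;  -- at most 8 stages at the same time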
Let's walk through a simple query as an example:
-- revenue summed by app_id and plan_id
explain select app_id, plan_id, sum(pay_mnt)
from ads.ads_grow_ad_material_statistic_df
where date = 20220609
and material_id = 'ALL'
group by app_id, plan_id;
The explain output is as follows:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: ads_grow_ad_material_statistic_df
            Statistics: Num rows: 262845 Data size: 7096850 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (material_id = 'ALL') (type: boolean)
              Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: app_id (type: bigint), plan_id (type: bigint), pay_mnt (type: bigint)
                outputColumnNames: app_id, plan_id, pay_mnt
                Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: sum(pay_mnt)
                  keys: app_id (type: bigint), plan_id (type: bigint)
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: bigint), _col1 (type: bigint)
                    sort order: ++
                    Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint)
                    Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col2 (type: bigint)
      Reduce Operator Tree:
        Group By Operator
          aggregations: sum(VALUE._col0)
          keys: KEY._col0 (type: bigint), KEY._col1 (type: bigint)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 65711 Data size: 1774205 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 65711 Data size: 1774205 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
Let's break the output down, starting from the outermost level. It has two major parts:
- STAGE DEPENDENCIES: the dependencies between the stages
- STAGE PLANS: the execution plan of each stage
The first part, STAGE DEPENDENCIES, lists two stages: Stage-1 is the root stage, i.e. the stage everything starts from, while Stage-0 depends on Stage-1 and only runs after Stage-1 has finished.
The second part, STAGE PLANS, shows that Stage-1 contains most of the processing for this job and launches a MapReduce job. Inside it there is a Map Reduce block, and an MR execution plan is split into two parts:
- Map Operator Tree: the execution plan tree on the map side
- Reduce Operator Tree: the execution plan tree on the reduce side
These two trees contain the operators of the SQL statement:
| Key | Description | Detail |
| --- | --- | --- |
| TableScan | The first operation on the map side is always loading the table, i.e. the table scan; it corresponds to the FROM clause | The TableScan takes the table as input and produces an output containing only the columns the query needs |
| alias | Table name / alias | Which table is being read |
| Statistics | Statistics estimated for this operator, such as the number of rows and data size | |
| Select Operator | Projection operation; common properties: | |
| expressions | The column names and types being selected | |
| outputColumnNames | The names of the output columns | |
| Group By Operator | Grouping / aggregation operation | Triggered by GROUP BY; its output columns get generated names such as _col0, temporary names assigned by rule |
| aggregations | The aggregate functions being computed | |
| mode | Aggregation mode, e.g. hash (partial aggregation on the map side) or mergepartial (merging partial results on the reduce side) | |
| keys | The grouping key columns | |
| Reduce Output Operator | Operator that ships data to the reduce phase; common properties: | |
| sort order | Sort direction of the key columns, one symbol per key: + for ascending, - for descending | |
| Filter Operator | Filter operation | Triggered by WHERE or HAVING |
| predicate | The filter condition, e.g. for "where id >= 1" it shows (id >= 1) | |
| Map Join Operator | Join operation; common properties: | |
| condition map | The join type, e.g. Inner Join 0 to 1, Left Outer Join 0 to 2 | |
| File Output Operator | File output operation | |
| compressed | Whether the output is compressed | |
| table | Table information, including input/output file formats and the SerDe | |
| Fetch Operator | The operation where the client fetches the result | Usually in Stage-0 |
| limit | -1 means no limit on the number of rows; any other value is the row limit | Usually in Stage-0 |
Examples
1. When joining, do we need to filter out NULL values ourselves?
Run the following query in the Hive client:
SELECT game_dict.package_name
,game_dict.game_name
,temp.business
FROM
(
SELECT package_name
,business
FROM ods.ods_grow_ad_plan_info
WHERE date = 20220609
AND deli_busi = 3
) temp
JOIN
(
SELECT package_name
,game_name
FROM gamesdk.game_info_all
WHERE date = 20220609
) game_dict
ON temp.package_name = game_dict.package_name
LIMIT 10;
The question: will this join filter out rows where package_name is NULL?
Let's verify with explain (only part of the output is shown):
TableScan
  alias: ods_grow_ad_plan_info
  Statistics: Num rows: 2593 Data size: 300879 Basic stats: COMPLETE Column stats: NONE
  Filter Operator
    predicate: ((deli_busi = 3) and package_name is not null) (type: boolean)
    Statistics: Num rows: 648 Data size: 75190 Basic stats: COMPLETE Column stats: NONE
    Select Operator
      expressions: package_name (type: string), business (type: bigint)
      outputColumnNames: _col0, _col1
      Statistics: Num rows: 648 Data size: 75190 Basic stats: COMPLETE Column stats: NONE
      HashTable Sink Operator
        keys:
          0 _col0 (type: string)
          1 _col0 (type: string)
...
In the output we can see predicate: package_name is not null, which means that for an (inner) join Hive automatically filters out rows whose join key is NULL. LEFT/RIGHT JOIN and FULL JOIN, however, do not filter NULL values automatically.
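So when the query is written as an outer join and NULL keys are not wanted, the filter has to be added by hand. A minimal sketch based on the query above (the LEFT JOIN variant is only for illustration):
SELECT game_dict.package_name
      ,game_dict.game_name
      ,temp.business
FROM
(
    SELECT package_name
          ,business
    FROM ods.ods_grow_ad_plan_info
    WHERE date = 20220609
    AND deli_busi = 3
    AND package_name IS NOT NULL  -- explicit filter; a LEFT JOIN will not add this for us
) temp
LEFT JOIN
(
    SELECT package_name
          ,game_name
    FROM gamesdk.game_info_all
    WHERE date = 20220609
) game_dict
ON temp.package_name = game_dict.package_name
LIMIT 10;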
2. After a GROUP BY, do we still need to sort explicitly?
SELECT app_id
,plan_id
,SUM(pay_mnt)
FROM ads.ads_grow_ad_material_statistic_df
WHERE date = 20220609
AND material_id = 'ALL'
GROUP BY app_id
,plan_id;
Question: after grouping by app_id and plan_id, do we still need to sort the result?
Let's look directly at the explain output (only part of it is shown):
TableScan
  alias: ads_grow_ad_material_statistic_df
  Statistics: Num rows: 262845 Data size: 7096850 Basic stats: COMPLETE Column stats: NONE
  Filter Operator
    predicate: (material_id = 'ALL') (type: boolean)
    Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE
    Select Operator
      expressions: app_id (type: bigint), plan_id (type: bigint), pay_mnt (type: bigint)
      outputColumnNames: app_id, plan_id, pay_mnt
      Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE
      Group By Operator
        aggregations: sum(pay_mnt)
        keys: app_id (type: bigint), plan_id (type: bigint)
        mode: hash
        outputColumnNames: _col0, _col1, _col2
        Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE
        Reduce Output Operator
          key expressions: _col0 (type: bigint), _col1 (type: bigint)
          sort order: ++
          Map-reduce partition columns: _col0 (type: bigint), _col1 (type: bigint)
          Statistics: Num rows: 131422 Data size: 3548411 Basic stats: COMPLETE Column stats: NONE
          value expressions: _col2 (type: bigint)
...
For the grouping, look at the Group By Operator: keys: app_id (type: bigint), plan_id (type: bigint) shows that the rows are grouped by app_id and plan_id. Further down, sort order: ++ shows that the data is sorted in ascending order on both key columns.
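In other words, within each reducer the GROUP BY keys already come out sorted, so an extra sort on the same keys is redundant when a single reducer is used. If a globally ordered result across multiple reducers is required, an explicit ORDER BY is still needed; a sketch:
SELECT app_id
      ,plan_id
      ,SUM(pay_mnt) AS pay_mnt
FROM ads.ads_grow_ad_material_statistic_df
WHERE date = 20220609
AND material_id = 'ALL'
GROUP BY app_id
        ,plan_id
ORDER BY app_id, plan_id;  -- ORDER BY funnels the final sort through a single reducer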
3. Which is more efficient, a subquery or a join?
Consider these two SQL statements:
-- SQL1
select a.id, b.id
from (
select id
from ods.ods_grow_ad_plan_info
where date = 20220609
) a
join (
select id, caid
from ods.ods_grow_ad_create_info
where date = 20220609
) b
on a.id = b.caid;
-- SQL2
SELECT a.id
FROM
(
SELECT id
FROM ods.ods_grow_ad_plan_info
WHERE date = 20220609
) a
WHERE a.id IN ( SELECT id FROM ods.ods_grow_ad_create_info WHERE date = 20220609);
Both statements return the same result, but which one runs more efficiently, and where does the difference come from?
explain output of the first statement:
Map Reduce Local Work
  Alias -> Map Local Tables:
    a:ods_grow_ad_plan_info
      Fetch Operator
        limit: -1
  Alias -> Map Local Operator Tree:
    a:ods_grow_ad_plan_info
      TableScan
        alias: ods_grow_ad_plan_info
        Statistics: Num rows: 37609 Data size: 300879 Basic stats: COMPLETE Column stats: NONE
        Filter Operator
          predicate: id is not null (type: boolean)
          Statistics: Num rows: 18805 Data size: 150443 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: bigint)
            outputColumnNames: _col0
            Statistics: Num rows: 18805 Data size: 150443 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 _col0 (type: bigint)
                1 _col1 (type: bigint)
...
explain output of the second statement:
Map Reduce Local Work
  Alias -> Map Local Tables:
    sq_1:ods_grow_ad_create_info
      Fetch Operator
        limit: -1
  Alias -> Map Local Operator Tree:
    sq_1:ods_grow_ad_create_info
      TableScan
        alias: ods_grow_ad_create_info
        Statistics: Num rows: 242672 Data size: 1941378 Basic stats: COMPLETE Column stats: NONE
        Filter Operator
          predicate: id is not null (type: boolean)
          Statistics: Num rows: 121336 Data size: 970689 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: id (type: bigint)
            outputColumnNames: _col0
            Statistics: Num rows: 121336 Data size: 970689 Basic stats: COMPLETE Column stats: NONE
            Group By Operator
              keys: _col0 (type: bigint)
              mode: hash
              outputColumnNames: _col0
              Statistics: Num rows: 121336 Data size: 970689 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 _col0 (type: bigint)
                  1 _col0 (type: bigint)
...
Compared with the first statement, the second one has an extra Group By Operator: the IN subquery goes through an additional hash aggregation (mode: hash) that deduplicates the keys. The subquery can therefore be replaced with a join to improve performance.
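Following that suggestion, a sketch of SQL2 rewritten with a join. Note that a plain JOIN can duplicate rows when the right side contains repeated ids; LEFT SEMI JOIN (or a DISTINCT on the right side) preserves the semantics of IN. Whether it actually runs faster depends on your data and statistics:
-- SQL2 rewritten: LEFT SEMI JOIN behaves like IN / EXISTS
SELECT a.id
FROM
(
    SELECT id
    FROM ods.ods_grow_ad_plan_info
    WHERE date = 20220609
) a
LEFT SEMI JOIN
(
    SELECT id
    FROM ods.ods_grow_ad_create_info
    WHERE date = 20220609
) b
ON a.id = b.id;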
4. Interview question: compare the efficiency of two SQL statements
-- SQL1
SELECT
a.id,
b.user_name
FROM
test1 a
JOIN test2 b ON a.id = b.id
WHERE
a.id > 2;
-- SQL2
SELECT
a.id,
b.user_name
FROM
(SELECT * FROM test1 WHERE id > 2) a
JOIN test2 b ON a.id = b.id;
Both statements return the same result, but which one runs faster?
- Maybe the first one, because the second contains a subquery and subqueries can hurt performance;
- Or maybe the second one, because filtering first means fewer rows take part in the join, so it should run faster.
Running explain on both (the full output is not shown here) reveals that, apart from the table aliases, the two execution plans are completely identical: in both cases the WHERE filter is applied first and the join is performed afterwards. Hive optimizes this for us automatically, so the two statements have the same efficiency.
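This behaviour comes from Hive's predicate pushdown. To see the difference it makes, the optimization can be switched off for a session and the plans compared (hive.optimize.ppd is a standard Hive setting, enabled by default; shown here only as an experiment):
set hive.optimize.ppd=false;  -- disable predicate pushdown, for comparison only
explain
SELECT a.id, b.user_name
FROM test1 a
JOIN test2 b ON a.id = b.id
WHERE a.id > 2;
set hive.optimize.ppd=true;   -- restore the default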
Being able to read an explain plan helps us optimize Hive SQL and gives us a better grip on our SQL in general: checking stage dependencies, diagnosing data skew, general Hive tuning, and so on. Give it a try yourself.