首页 > 其他分享 >Hive调优 【行列过滤 及谓词下推】

Hive调优 【行列过滤 及谓词下推】

时间:2023-01-02 23:12:55浏览次数:37  
标签:stats url 下推 Hive 调优 join where id select

1. 列处理 :
说明 : 避免使用 select *,有分区时,要指定分区

2. 行处理 :
关联原则 : 在关联操作时,能先where的尽量先where,减少数据集
3. hive底层优化策略-谓词下推(predicate pushdown) :
什么是谓词下推 ?
1. 将过滤表达式尽可能移动至靠近数据源的位置,以使真正执行时能直接跳过无关的数据
2. 在map完成对数据的过滤

参数设置 :
-- 是否开启 谓词下推(默认开启)
set hive.optimize.ppd=true;

什么情况下会触发 谓词下推?
-- 触发条件
1. 对于 inner join , full outer join :
1. 条件写在on后面,还是where后面,都会触发 谓词下推
2. 对于 A表 left outer join B表 :
1. A表条件写在on后面 且 B表条件 写在where后面
2. A表条件,B表条件 都写在where后面
3. 对于 A表 right outer join B表 :
1. B表条件写在on后面 且 A表条件 写在where后面
2. A表条件,B表条件 都写在where后面
4. 当 条件中 有 不确定函数时,不会触发
不确定函数 : unix_timestamp、rand

4. (谓词下推)触发条件测试
-- 表结构
 
-- 表结构
A表 : log_orc B表 : log_parquet

字段名称                 字段类型
track_time              string
url                     string
session_id              string
referer                 string
ip                      string
end_user_id             string
city_id                 string
 
4.1. inner join (自动添加 key != nul)
 
1. inner join (自动添加 key != nul)
-- 测试1 : AB 条件在where 后面 (触发)
explain
select A.*,B.*
from log_orc as A
inner join log_parquet as B
on A.url = B.url
where  A.ip = '100'
and B.city_id = '10';

TableScan
            alias: a
            Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((ip = '100') and url is not null) (type: boolean)
TableScan
            alias: b
            Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((city_id = '10') and url is not null) (type: boolean)


-- 测试2 : AB 条件在on 后面 (触发)
explain
select A.*,B.*
from log_orc as A
inner join log_parquet as B
on A.url = B.url
and  A.ip = '100'
and B.city_id = '10';

TableScan
            alias: a
            Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((ip = '100') and url is not null) (type: boolean)
TableScan
            alias: b
            Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((city_id = '10') and url is not null) (type: boolean)

-- 测试3 : inner join 自动空key过滤
with t1 as (
select 'huawei' as channel ,1 as score
union all
select 'oppo' as channel ,2 as score
union all
select null as channel ,3 as score
),
t2 as (
select 'huawei' as channel ,'2021-08-01' as credit_date
union all
select 'oppo' as channel ,'2021-08-01' as credit_date
union all
select null as channel ,'2021-08-03' as credit_date
union all
select 'iphone' as channel ,'2021-08-03' as credit_date
)
select t1.*,t2.*
from t1
inner join t2
on t1.channel = t2.channel;
-- 结果
channel score   channel credit_date
huawei  1       huawei  2021-08-01
oppo    2       oppo    2021-08-01
 
2. left outer join
 
2. left outer join
-- 测试1 : AB 条件在where 后面 (触发)
explain
select A.*,B.*
from log_orc as A
left outer join log_parquet as B
on A.url = B.url
where  A.ip = '100'
and B.city_id = '10';

          TableScan
            alias: b
            Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((city_id = '10') and url is not null) (type: boolean)

          TableScan
            alias: a
            Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((ip = '100') and url is not null) (type: boolean)

-- 测试2 : AB 条件在on 后面 (B条件触发,A条件不触发)

explain
select A.*,B.*
from log_orc as A
left outer join log_parquet as B
on A.url = B.url
and A.ip = '10'
and B.city_id = '100';

          TableScan
            alias: a
            Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
            Select Operator

          TableScan
            alias: b
            Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (city_id = '10') (type: boolean)


-- 测试3 : A 条件在on 后面,B 条件在 where 后面 (都会触发)

explain
select A.*,B.*
from log_orc as A
left outer join log_parquet as B
on A.url = B.url
and A.ip = '10'
where B.city_id = '100';


 TableScan
            alias: a
            Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((ip = '10') and url is not null) (type: boolean)

 TableScan
            alias: b
            Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((city_id = '100') and url is not null) (type: boolean)
 
3. 不触发条件测试(存在不确定函数)
 
3. 不触发条件测试(存在不确定函数)
explain
select A.*,B.*
from log_orc as A
inner join log_parquet as B
on A.url = B.url
where  A.ip = '100'
and A.city_id = rand()
-- and A.city_id = '1'
;

          TableScan
            alias: a
            Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((ip = '100') and url is not null) (type: boolean)
 

 

参考 :
https://blog.csdn.net/strongyoung88/article/details/81156271
https://blog.csdn.net/baichoufei90/article/details/85264100



标签:stats,url,下推,Hive,调优,join,where,id,select
From: https://www.cnblogs.com/jira/p/17020805.html

相关文章

  • [Hive排序]--4种排序方式介绍
    一、官方文档​​Home-ApacheHive-ApacheSoftwareFoundation​​​​LanguageManual-ApacheHive-ApacheSoftwareFoundation​​​​LanguageManualSortBy-......
  • Elasticsearch高级调优方法论之——根治慢查询!
    Elasticsearch最少必要知识实战教程直播回放1、引言Elasticsearch是非常灵活且功能丰富的搜索引擎,它提供了许多不同查询数据的方法。在实战业务场景中,经常会出现远远低......
  • 性能测试-微服务性能压测监控和调优【重点】【杭州多测师_王sir】【杭州多测师】
     本文主要内容一、何为压力测试1.1、大白话解释性能压测是什么:就是考察当前软件和硬件环境下,系统所能承受的最大负荷,并帮助找出系统的瓶颈所在。性能压测的......
  • 关于CLR GC调优的一些问题
    前言在很多的场合我都遇到过一些群友提这样的一些问题:为什么Java有GC调优而CLR没有听说过有GC调优呢?到底是Java的JVMGC比较强还是C#使用的.NETCLR的GC比较强呢?其实业内已......
  • 一次SQL调优 聊一聊 SQLSERVER 数据页
    一:背景1.讲故事最近给一位朋友做​​SQL慢语句​​优化,花了些时间调优,遗憾的是SQLSERVER非源码公开,玩起来不是那么顺利,不过从这次经历中我觉得明年的一个重大任务就是......
  • 面试官问我JVM调优,我忍不住了,哈哈哈哈
    面试官:今天要不来聊聊JVM调优相关的吧?面试官:你曾经在生产环境下有过调优JVM的经历吗?候选者:没有面试官:…候选者:嗯…是这样的,我们一般优化系统的思路是这样的候选者:1.一般来......
  • HiveSQL——打折日期交叉问题
    需求描述如下为平台商品促销数据:字段为品牌,打折开始日期,打折结束日期createtabletemp_date_link(brand_idvarchar(10),start_datestring,end_dat......
  • mysql调优
    一、insert调优1、插入多条数据时最好批量插入(但一般不超过一千条)2、手动提交事务,多条语句一起提交starttransaction;insertintotablevalues(),(),();......
  • 一次SQL调优 聊一聊 SQLSERVER 数据页
    一:背景1.讲故事最近给一位朋友做SQL慢语句优化,花了些时间调优,遗憾的是SQLSERVER非源码公开,玩起来不是那么顺利,不过从这次经历中我觉得明年的一个重大任务就是好好研......
  • Hive篇---Hive与Hbase整合
    =========================================================声明:由于不同平台阅读格式不一致(尤其源码部分),所以获取更多阅读体验!!个人网站地址:​​http://www.lhworldblog.......