1. 列处理 :
说明 : 避免使用 select *,有分区时,要指定分区
2. 行处理 :
关联原则 : 在关联操作时,能先where的尽量先where,减少数据集
3. hive底层优化策略-谓词下推(predicate pushdown) :
什么是谓词下推 ?
1. 将过滤表达式尽可能移动至靠近数据源的位置,以使真正执行时能直接跳过无关的数据
2. 在map完成对数据的过滤
参数设置 :
-- 是否开启 谓词下推(默认开启)
set hive.optimize.ppd=true;
什么情况下会触发 谓词下推?
-- 触发条件
1. 对于 inner join , full outer join :
1. 条件写在on后面,还是where后面,都会触发 谓词下推
2. 对于 A表 left outer join B表 :
1. A表条件写在on后面 且 B表条件 写在where后面
2. A表条件,B表条件 都写在where后面
3. 对于 A表 right outer join B表 :
1. B表条件写在on后面 且 A表条件 写在where后面
2. A表条件,B表条件 都写在where后面
4. 当 条件中 有 不确定函数时,不会触发
不确定函数 : unix_timestamp、rand
4. (谓词下推)触发条件测试
-- 表结构
-- 表结构 A表 : log_orc B表 : log_parquet 字段名称 字段类型 track_time string url string session_id string referer string ip string end_user_id string city_id string
4.1. inner join (自动添加 key != nul)
1. inner join (自动添加 key != nul) -- 测试1 : AB 条件在where 后面 (触发) explain select A.*,B.* from log_orc as A inner join log_parquet as B on A.url = B.url where A.ip = '100' and B.city_id = '10'; TableScan alias: a Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((ip = '100') and url is not null) (type: boolean) TableScan alias: b Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((city_id = '10') and url is not null) (type: boolean) -- 测试2 : AB 条件在on 后面 (触发) explain select A.*,B.* from log_orc as A inner join log_parquet as B on A.url = B.url and A.ip = '100' and B.city_id = '10'; TableScan alias: a Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((ip = '100') and url is not null) (type: boolean) TableScan alias: b Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((city_id = '10') and url is not null) (type: boolean) -- 测试3 : inner join 自动空key过滤 with t1 as ( select 'huawei' as channel ,1 as score union all select 'oppo' as channel ,2 as score union all select null as channel ,3 as score ), t2 as ( select 'huawei' as channel ,'2021-08-01' as credit_date union all select 'oppo' as channel ,'2021-08-01' as credit_date union all select null as channel ,'2021-08-03' as credit_date union all select 'iphone' as channel ,'2021-08-03' as credit_date ) select t1.*,t2.* from t1 inner join t2 on t1.channel = t2.channel; -- 结果 channel score channel credit_date huawei 1 huawei 2021-08-01 oppo 2 oppo 2021-08-01
2. left outer join
2. left outer join -- 测试1 : AB 条件在where 后面 (触发) explain select A.*,B.* from log_orc as A left outer join log_parquet as B on A.url = B.url where A.ip = '100' and B.city_id = '10'; TableScan alias: b Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((city_id = '10') and url is not null) (type: boolean) TableScan alias: a Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((ip = '100') and url is not null) (type: boolean) -- 测试2 : AB 条件在on 后面 (B条件触发,A条件不触发) explain select A.*,B.* from log_orc as A left outer join log_parquet as B on A.url = B.url and A.ip = '10' and B.city_id = '100'; TableScan alias: a Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE Select Operator TableScan alias: b Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (city_id = '10') (type: boolean) -- 测试3 : A 条件在on 后面,B 条件在 where 后面 (都会触发) explain select A.*,B.* from log_orc as A left outer join log_parquet as B on A.url = B.url and A.ip = '10' where B.city_id = '100'; TableScan alias: a Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((ip = '10') and url is not null) (type: boolean) TableScan alias: b Statistics: Num rows: 100000 Data size: 700000 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((city_id = '100') and url is not null) (type: boolean)
3. 不触发条件测试(存在不确定函数)
3. 不触发条件测试(存在不确定函数) explain select A.*,B.* from log_orc as A inner join log_parquet as B on A.url = B.url where A.ip = '100' and A.city_id = rand() -- and A.city_id = '1' ; TableScan alias: a Statistics: Num rows: 100000 Data size: 76900000 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((ip = '100') and url is not null) (type: boolean)
参考 :
https://blog.csdn.net/strongyoung88/article/details/81156271
https://blog.csdn.net/baichoufei90/article/details/85264100
标签:stats,url,下推,Hive,调优,join,where,id,select From: https://www.cnblogs.com/jira/p/17020805.html