窗口函数
窗口函数可以更加灵活地对一定范围内的数据进行操作和分析,它能够为每行数据划分一个窗口,然后对窗口范围内的数据进行计算,最后将计算结果返回给该行数据;
举个例子,区别于 GroupBy,GroupBy 对分组范围内的数据进行聚合统计,得到当前分组的一条结果;窗口函数对每一条数据处理,展开一个窗口范围,然后分析(聚合、筛选)得到一条结果;GroupBy 结果等于分组数、窗口函数结果等于数据总数。
基础语法
窗口函数语法主要分为窗口、函数两部分。其中窗口用于定义计算范围,函数用于定义计算逻辑。
函数:一般为聚合函数,比如 max()、min()、sum()、count()、avg()
,因为窗口往往涉及到多条数据,需要对这些数据聚合再执行相关操作。
窗口:窗口范围的定义分为两种类型,一种是基于行的,一种是基于值的。
窗口函数基本语法:
select order_id, order_date, amount, 函数(amount) over (窗口范围) total_amount from order_info;
基于行的窗口函数
select xxx, xxx, ..., func(xxx) over (order by [column] rows between xxx and xxx) from table_name;
窗口范围是指实际计算(分派到不同节点 MapReduce 任务相关数据)时的数据,所以为了能明确数据行的范围,需要对数据按字段排序。
其中窗口的起始和结束范围有以下几种情况:
- 起始为
unbounded preceding
、[num] preceding
;- 结束为
[num] preceding
、current now
、[num] following
、unbounded following
;
- 结束为
- 起始为
current now
;- 结束为
current now
、[num] following
、unbounded following
;
- 结束为
- 起始为
[num] following
;- 结束为
[num] following
、unbounded following
;
- 结束为
基于值的窗口函数
select xxx, xxx, ...., func(xxx) over (order by [column] range between xxx and xxx) from table_name;
区别于基于行的窗口函数,基于值的 order by [column]
并没有实际意义,只是为了声明基于哪个字段进行窗口函数的划分。
当窗口范围划分时包含了数字,order by [column]
字段必须为数字类型,否则窗口函数无效。
静态窗口 Partition By
select xxx, xxx, ..., func(xxx) over (partition by [column] order by [column] rows between xxx and xxx) from table_name;
在划分窗口范围时,可以使用 PartitionBy 实现与 GroupBy 相同的分组划分,且使用聚合函数。这种窗口的大小,在每条数据进行处理时,已经等于数据的分组结果了,所以窗口为静态的。
注意 over(xxx)
中的内容也可以不写,其中:
partitionBy
省略不写,表示不分区;orderBy
省略不写,表示不排序;rows|range between xxx and xxx
省略不写,表示使用其默认值;默认值为range between unbounded preceding and current row
、rows between unbounded preceding and unbounded following
;
Hive 中的窗口函数 API
SUM、AVG、MIN、MAX
注意使用时 SQL 语句加和不加 order by
的区别,加了 order by
表示分组内从起点到当前行的 PV 累积;不加表示分组内所有值的累加求和。
select cookieid, createtime, pv,
sum(pv) over(partition by cookieid order by createTime) as pv1
from user_pv;
select cookieid,createtime,pv,
sum(pv) over(partition by cookieid ) as pv1
from user_pv;
ROW_NUMBER、RANK、DENSE_RANK、NTILE
ROW_NUMBER
:表示从1开始,按照顺序生成分组内记录的序列;
select cookieid, createtime, pv,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn
FROM user_pv;
RANK、DENSE_RANK
:RANK
:生成数据项在分组中的排名,排名相等会在名次中留下空位;DENSE_RANK
:生成数据项在分组中的排名,排名相等会在名次中不会留下空位;NTILE
:将有序集合平均分配到指定数量的桶中,将桶号分配给每一行,若不能平均分配,则优先分配编号较小的桶,并且各个桶中能放的行数最多相差1;使用时可以根据桶号,选取前或者后 n分之几的数据。
SELECT cookieid, createtime, pv,
NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1,
NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2,
NTILE(4) OVER(ORDER BY createtime) AS rn3,
FROM user_pv
ORDER BY cookieid, createtime;
LAG、LEAD、FIRST_VALUE、LAST_VALUE
LAG(col, n, DEFAULT)
:统计窗口内往上第 n 行值,其中第一个参数为列名,第二个为 n 行,第三个为默认值;
SELECT cookieid, createtime, url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAG(createtime, 1, '1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
LAG(createtime, 2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM user_url;
LEAD(col, n, DEFAULT)
:统计窗口内往下第 n 行值。
SELECT cookieid, createtime, url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY create_time) AS rn,
LEAD(createtime, 1, '1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM user_url;
FIRST_VALUE
:去分组内排序后,截止到当前行的第一个值;
SELECT cookieid, createtime, url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
FIRST_VALE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM user_url;
LAST_VALUE
:分组排序后,截止当前行的最后一个值;
SELECT cookieid, createtime, url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM user_url;
如果想要取分组内最后一个值,应当通过 FIRST_VALUE
倒序排序:
SELECT cookieid, createtime, url,
ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM user_url ORDER BY cookieid,createtime;
CUME_DIST
- 小于等于当前值的行数/分组内总行数 的比例值,Order 默认顺序。
SELECT dept, userid, sal,
CUME_DIST() OVER(ORDER BY sal) AS rn1,
CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM staff_salary;
GROUPING SETS、GORUPING__ID、CUBE、ROLLUP
GROUPING SETS
:将不同维度的 GROUP BY 结果集进行 UNION ALL;
窗口函数实际用例
获取第几高的数据值
+----+--------+
| Id | Salary |
+----+--------+
| 1 | 100 |
| 2 | 200 |
| 3 | 300 |
+----+--------+
获取第二高的薪水信息:
SELECT * FROM (
SELECT Salary, row_number() over(order by Salary desc) rk FROM Employee
) t WHERE t.rk = 2;
连续出现的数字
+----+-----+
| Id | Num |
+----+-----+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 2 |
| 5 | 1 |
| 6 | 2 |
| 7 | 2 |
+----+-----+
获取连续出现至少三次的数字:
SELECT DISTINCT `Num` as ConsecutiveNums FROM (
SELECT Num,
lead(Num, 1, null) over(order by id) n2,
lead(Num, 2, null) over(order by id) n3
FROM logs
) t1 WHERE t1.n2 = Num and t1.n3 = Num;
统计当月累计下单总和
按照下单时间分组,
sum()
函数累计计算总金额。
create table order_info
(
order_id string, --订单id
user_id string, -- 用户id
user_name string, -- 用户姓名
order_date string, -- 下单日期
order_amount int -- 订单金额
);
SELECT
order_id,
user_id,
user_name,
order_date,
order_amount,
sum(order_amount) over(partition by substring(order_date, 1, 7) order by order_date rows between unbounded preceding and current row) sum_so_far
FROM order_info;
统计用户每次下单距离上次下单的天数
按照用户id 分组,用户下单时间排序,通过
lag
窗口查询上一行的信息。
SELECT
order_id,
user_id,
user_name,
order_date,
order_amount,
IFNULL (datediff(order_date, last_order_date), 0) diff
FROM
(
SELECT
order_id,
user_id,
user_name,
order_date,
order_amount,
lag(order_date, 1, null) over(partition by user_id order by order_date) last_order_date
FROM order_info
) t1
统计所有下单记录及每个用户的每个下单记录所在月份的首/末次下单日期
SELECT
order_id,
user_id,
user_name,
order_date,
order_amount,
first_value(order_date) over(partition by user_id, substring(order_date, 1, 7) order by order_date) first_date,
last_value(order_date) over(partition by user_id, substring(order_date, 1, 7) order by order_date) last_date
FROM
order_info;
连续 N 天登录
-- Accounts 表
+----+-----------+
| id | name |
+----+-----------+
| 1 | Winston |
| 7 | Jonathan |
+----+-----------+
-- 表 logins
+----+-------------+
| id | login_date |
+----+-------------+
| 7 | 2020-05-30 |
| 1 | 2020-05-30 |
| 7 | 2020-05-31 |
| 7 | 2020-06-01 |
| 7 | 2020-06-02 |
| 7 | 2020-06-02 |
| 7 | 2020-06-03 |
| 1 | 2020-06-07 |
| 7 | 2020-06-10 |
+----+-------------+
获取至少连续5天登录的用户信息,结果按 id 排序,思路如下:
- 去重:可能单人一天登录多次;
- 排序:对登录日期排序,然后建立窗口函数
row_number()
标记序号; - 差值:计算登录日志与排序间的差值,找到连续登录的记录,可以通过
date_sub(xxx, n)
获取前 n 天的日期值; - 过滤:最终结果按照 id 分组后,每个组内应当有多行值具有相同的日期值,通过
groupby
按照日期分组后having
过滤掉总次数小于5的记录;
SELECT DISTINCT b.id, sum(ac.name)
FROM
(SELECT id, login_date,
DATE_SUB(login_date, ROW_NUMBER() OVER(PARTITION BY id ORDER BY login_date)) AS ddate
FROM(SELECT DISTINCT id, login_date FROM Logins) a) b
INNER JOIN Accounts ac
ON b.id = ac.id
GROUP BY b.id, b.ddate
HAVING COUNT(b.id) >= 5;
中级SQL
环境准备
-
用户信息表
-
商品信息表
-
订单信息表
-
订单明细表
-
登录明细表
-
商品价格变更明细表
-
配送信息表
-
好友关系表
-
收藏信息表
DROP TABLE IF EXISTS user_login_detail;
CREATE TABLE user_login_detail
(
`user_id` string comment '用户id',
`ip_address` string comment 'ip地址',
`login_ts` string comment '登录时间',
`logout_ts` string comment '登出时间'
) COMMENT '用户登录明细表'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
需求
- 查询累计销量排名第二的商品
SELECT sku_id
FROM (
SELECT sku_id //过滤排名第二的
FROM (
SELECT sku_id, //最内层查询各商品销量的排名——dense_rank
order_num,
dense_rank() over(order by order_num desc) rk
FROM (
SELECT sku_id,
sum(sku_num) order_num
FROM order_detail
GROUP BY sku_id
) t1
) t2
WHERE rk = 2
) t3
RIGHT JOIN //没有第二名返回 null
(
SELECT 1
) t4
ON 1= 1;
- 查询至少连续三天下单的用户
思路:(1)首先根据订单表按照用户id、下单日期分组去重;(2)统计每一行行数,根据用户id分组、下单日期排序,并用当前日期减去当前行号得到flag字段,如果有一批连续的日期,flag的结果应该相同;(3)按照用户id分组,统计flag的总数是否大于等于3;
SELECT distinct user_id
FROM (
SELECT user_id, // 按照flag字段分组并统计总行数大于3的个数
FROM (
SELECT user_id, // 统计每一行日期减去行号的结果值
create_date,
date_sub(create_date, row_number() over(partition by user_id order by create_date)) flag
FROM (
SELECT user_id, // 分组去重
create_date
FROM order_info
GROUP BY user_id, create_date
) t1
) t2
GROUP BY user_id, flag
HAVING count(flag) >= 3
)t3;
- 查询各品类销售商品的种数及销量最高的商品
思路:(1)先根据订单明细表查询各商品的售卖总数;(2)统计各商品所属品类的商品种数、各品类中按照商品售卖总数排序获取售卖第一的商品并展示。
SELECT category_id,
category_name,
sku_id,
name,
order_num,
sku_cnt
FROM (
SELECT od.sku_id,
sku.name
sku.category_id,
cate.category_name,
order_num
rank() over(partition by sku.category_id order by order_num desc) rk, // 按品类分组,给每个品类下商品按售卖数排序编号
count(distinct od.sku_id) over(partition by sku.category_id) sku_cnt // 统计每个品类下商品的种类个数
FROM (
SELECT sku_id, // 先查询订单明细表获取各种商品的售卖总数
sum(sku_num) order_num
FROM order_detail
GROUP BY sku_id
) od
LEFT JOIN sku_info sku ON od.sku_id = sku.id
LEFT JOIN category_info cate ON sku.category_id = cate.id
) t1
WHERE rk = 1;
- 查询用户累计消费金额及VIP等级
详细需求:
从订单信息表(order_info)中统计每个用户截止其每个下单日期的累积消费金额,以及每个用户在其每个下单日期的VIP等级。
用户vip等级根据累积消费金额计算,计算规则如下:
设累积消费总额为X,
若0=<X<10000,则vip等级为普通会员
若10000<=X<30000,则vip等级为青铜会员
若30000<=X<50000,则vip等级为白银会员
若50000<=X<80000,则vip为黄金会员
若80000<=X<100000,则vip等级为白金会员
若X>=100000,则vip等级为钻石会员
SELECT user_id,
create_date,
sum_so_far
case
when sum_so_far >= 10000 then '钻石会员'
when sum_so_far >= 80000 then '白金会员'
when sum_so_far >= 50000 then '黄金会员'
when sum_so_far >= 30000 then '白银会员'
when sum_so_far >= 10000 then '青铜会员'
when sum_so_far >= 0 then '普通会员'
end vip_level
FROM (
SELECT user_id,
create_date, // 截止当前日期累计消费总金额
sum(total_amount_per_day) over(partition by user_id order by create_date) sum_so_far
FROM (
SELECT user_id, // 按用户id、创建时间分组,统计每个订单金额总数
create_date,
sum(total_amount) total_amount_per_day
FROM order_info
GROUP BY user_id, create_date
) t1
) t2
- 查询首次下单后第二天连续下单用户比率
需求:从 order_info 表中查询首次下单后第二天仍然连续下单用户的比例。
思路:(1)类似于TOPN问题,分组后排序然后获取前 N 名数据即可;(2)然后通过count()
函数获取满足条件的数据行数与总行数的比例求出最终比例值。
SELECT
count(if(datediff(second_date, first_date) = 1, 1, 0)) / count(*)
FROM (
SELECT user_id, // 划分两次日期
min(create_date) first_date,
max(create_date) second_date
FROM (
SELECT user_id, // 创建日期排序,求首次和第二次下单日期
create_date,
rank() over(partition by user_id order by create_date) rk
FROM (
SELECT user_id, // 按用户id、创建时间分组
create_date
FROM order_info
GROUP BY user_id, create_date
) t1
) t2
WHERE rk <= 2
GROUP BY user_id
)
- 求商品销售年份、销售数量、销售金额
需求:从订单明细表(order_detail)统计每个商品销售首年的年份,销售数量和销售总额。
SELECT
year(create_date),
sum(sku_num),
sum(price * sku_num)
FROM (
SELECT sku_id,
create_date,
sku_num,
price,
rank() over(partition by sku_id order by year(create_date)) rk // 根据年份排序给每个订单标识序号
FROM order_detail
GROUP BY sku_id, create_date
) t1
WHERE rk = 1
GROUP BY sku_id, year(create_date);
- 筛选去年总销量小于 100 的商品
需求:从 order_detail 中筛选去年(2021年)总销量小于 100 的商品及销量,同时不考虑上架小于一个月的商品。
SELECT
t1.sku_id,
name,
order_num
FROM (
SELECT sku_id, // 过滤总销量大于 100 的商品
sum(sku_num) order_num
FROM order_detail
WHERE year(create_date) = '2021' AND sku_id in ( // 子查询筛选上架大于一个月的商品
SELECT sku_id
FROM sku_info
WHERE datediff('2022-01-10', from_date) > 30
)
GROUP BY sku_id
having sum(sku_num) < 100
) t1 left join sku_info t2 ON t1.sku_id = t2.sku_id;
- 查询每日新用户数
需求:从用户登录明细表查询每天新增的用户数。
SELECT
login_ts,
count(distinct t1.user_id) new_user_count
FROM (
SELECT user_id,
login_ts,
lag(login_ts, 1) over(partition by user_id order by login_ts) last_login_ts
FROM user_login_detail
) t1
WHERE datediff(login_ts, IFNLL(last_login_ts, 0)) > 1
ORDER BY login_ts;
- 统计每个商品的销量最高的日期
需求:从 order_detail 中统计每种商品销售件数最多的日期及当日销量,如果有同一商品多日销量并列情况,取其中最小日期。
SELECT sku_id,
create_date,
sum_num
FROM (
SELECT sku_id,
create_date,
sum_num
FROM (
SELECT sku_id, // 每种商品按销售额倒序排序、按照创建日期升序排序
create_date,
sum_num,
row_number() over(partition by sku_id order by sum_num desc, create_date asc) rk
FROM (
SELECT sku_id, // 查询每种商品每天的销售总额
create_date,
sum(sku_num) sum_num
FROM order_detail
GROUP BY sku_id, create_date
) t1
) t2
)
WHERE rk = 1;
- 查询销售件数高于品类平均数的商品
从 order_detail 中查询累计销售件数高于其所属品类平均数的商品。
SELECT
sku_id,
name,
sum_num,
cate_avg_num
FROM (
SELECT
od.sku_id, // 统计同一品类中商品销售数平均值
category_id,
name,
sum_num,
avg(sum_num) over(partition by category_id) cate_avg_num
FROM (
SELECT sku_id, // 统计每种商品的售卖总数
sum(sku_num) sum_num
FROM order_detail
GROUP BY sku_id
) od LEFT JOIN ( // 联合商品品类
SELECT sku_id,
category_id,
FROM sku_info
) sku ON od.sku_id = sku.sku_id
) t1
WHERE sum_num > cate_avg_num; // 最外层过滤
- 用户注册、登录、下单综合统计
需求:从用户 user_login_detail、order_info 中查询每个用户的注册日期、总登录次数以及在 2021 年的登录次数、订单数和订单总额。
SELECT login.user_id,
register_date,
login_count_2021,
order_count_2021,
total_amount_2021
FROM (
SELECT user_id, // 最早登录日期、总登录次数、2021年总登录次数
min(date_format(login_ts, 'yyy-MM-dd')),
count(1) total_login_count,
count(if(year(login_ts) = '2021', 1, null)) login_count_2021
FROM user_login_detail
GROUP BY user_id
) login JOIN (
SELECT user_id, // 2021 年订单数、订单总额
count(distinct(order_id)) order_count_2021,
sum(total_amount) total_amount_2021
FROM order_info
WHERE year(create_date) = '2021'
GROUP BY user_id
) ON login.user_id = order.user_id;
- 即时订单比例
需求:订单配送中,如果期望配送日期和下单日期相同,称为即时订单,如果期望配送日期和下单日期不同,称为计划订单。请从配送信息表(delivery_info)中求出每个用户的首单(用户的第一个订单)中即时订单的比例,保留两位小数,以小数形式显示。
SELECT round(sum(if(order_date = custom_date, 1, 0)) / count(*), 2) percentage
FROM (
SELECT delivery_id,
user_id,
order_date,
custom_date,
row_number() over(partition by user_id order by order_date) rk
FROM delivery_info
) t1
WHERE rk = 1;
- 向用户推荐朋友收藏产品
需求:现需要请向所有用户推荐其朋友收藏但是用户自己未收藏的商品,请从好友关系表(friendship_info)和收藏表(favor_info)中查询出应向哪位用户推荐哪些商品。
SELECT
distinct t1.user_id,
friend_favor.sku_id
FROM (
SELECT user1_id user_id,
user2_id friend_id
FROM friendship_info
) t1
LEFT JOIN favor_info friend_favor ON t1.friend_id = friend_favor.user_id // 查找用户和好友的收藏商品
LEFT JOIN favor_info user_favor ON t1.user_id = user_favor.user_id
AND friend_favor.sku_id = user_favor.sku_id // 联合起来结果为空,说明好友收藏但是用户未收藏
WHERE user_favor.sku_id is null;
- 查询所有用户的连续登录两天及以上的日期区间
需求:从登录明细表(user_login_detail)中查询出,所有用户的连续登录两天及以上的日期区间,以登录时间(login_ts)为准。
SELECT user_id,
min(login_date) start_date,
max(login_date) end_date
FROM (
SELECT user_id,
login_date,
date_sub(login_date, rk) flag // 每行日期值减去行号,如果连续登录会出现多行值相同
FROM (
SELECT
user_id, // 记录按照登录时间升序排序,并且记录每行行号
login_date,
row_number() over(partition by user_id order by login_date) rk
FROM (
SELECT user_id, // 查询用户id、登录时间,并按照用户id分组
date_format(login_ts, 'yyyy-MM-dd') login_date,
FROM user_login_detail
GROUP BY user_id, date_format(login_ts, 'yyyy-MM-dd')
) t1
) t2
) t3
GROUP BY user_id, flag
having count(*) >= 2;
- 男性和女性每日的购物金额统计
需求:从订单信息表(order_info)和用户信息表(user_info)中,分别统计每天男性和女性用户的订单总金额,如果当天男性或者女性没有购物,则统计结果为0。
SELECT create_date,
sum(if(gender = '男', total_amount, 0)) total_amount_male,
sum(if(gender = '女', total_amount, 0)) total_amount_female
FROM order_info oi
LEFT JOIN user_info ui ON oi.user_id = ui.user_id
GROUP BY create_date;
- 订单金额趋势分析
需求:查询截止每天的最近3天内的订单金额总和以及订单金额日平均值,保留两位小数,四舍五入。
标签:sku,窗口,函数,id,user,date,hive05,order,SELECT
From: https://www.cnblogs.com/istitches/p/18348542