整理自数据分析常考面试题100题
1. 连续日期
1、为日期排序
row_number() over (partition by use_id order by date) as rank
2、求日期和排序的差值
3、求 diff 出现最多的次数 max(count(diff))
例1:
- 每个用户一周活跃天数 -
SELECT user_id,
COUNT(1) AS active_day
FROM (SELECT user_id, date
FROM table
WHERE TIMESTAMPDIFF(DAY,date,CURRENT_DATE) <= 7
GROUP BY user_id, date) a -- 这里还要加GROUP BY date是为了去重 --
GROUP BY user_id;
- 每个用户一周内最大连续活跃天数 -
WITH recent_day AS(
SELECT user_id, date
FROM table
WHERE TIMESTAMPDIFF(DAY, date, CURRENT_DATE) <= 7
GROUP BY user_id, date
),
diff_table AS(
SELECT user_id,
ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY date) AS rank,
DATE_SUB(date, ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY transdt)) AS gap
-- 连续活跃的话,gap会一样 --
FROM recent_day
),
gap_cnt AS(
SELECT user_id, count(gap) AS count
FROM diff_table
GROUP BY user_id, gap
)
SELECT user_id, MAX(count)
FROM gap_cnt
GROUP BY user_id
* TIMESTAMPDIFF(unit, start, end) ; DATEDIFF(end, start)
例2:
- 截至当前,每个用户已经连续签到的天数 -
SELECT user_id,
DATEDIFF(CURRENT_DATE, no_sign_in) AS consecutive_days
FROM (SELECT user_id, MAX(date) AS no_sign_in
FROM table
WHERE is_sign_in = 0
GROUP BY user_id
)
思路:找到最近的“中断”点:找到最近一次没有签到的日期,那么在这之后的所有天数都可以被视为连续签到的天数。
2.时间间隔问题
1. 为日期排序
row number() over(partition by id order by date) as rank
2、错位相减,来实现“相邻” 要求, 求日期和排序的差值( diff)
3、根据题目要求,求出相应指标· max(时间差)· count(*) · min(时间差)
- 每个用户相邻两次浏览时间之差小于 3min 的次数 -
WITH rank_table AS(
SELECT user_id, user_time,
ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY user_time) AS rk
FROM table
),
misaligned_sub AS(
SELECT user_id, a.user_time, b.user_time
FROM rank_table a
JOIN rank_table b ON a.user_id = b.user_id AND a.rk = b.rk - 1
)
SELECT user_id, COUNT(1) AS times
FROM misaligned_sub
WHERE TIMESTAMPDIFF(MINUTE, a.user_time, b.user_time) <= 3
GROUP BY user_id
延申:同一个用户购买同一个商品相邻两次的时间间隔
-同一个用户购买同一个商品相邻两次的时间间隔-
WITH ranked_purchases AS (
SELECT user_id,
product_id,
purchase_time,
ROW_NUMBER() OVER (PARTITION BY user_id, product_id ORDER BY purchase_time) AS rk
FROM purchases_table
)
SELECT a.user_id,
a.product_id,
a.purchase_time AS current_purchase_time,
b.purchase_time AS next_purchase_time,
TIMESTAMPDIFF(MINUTE, b.purchase_time, a.purchase_time) AS time_interval_minutes
FROM ranked_purchases a
JOIN ranked_purchases b
ON a.user_id = b.user_id
AND a.product_id = b.product_id
AND a.rk = b.rk - 1;
3. 求累加
SELECT user_id,
date,
amount,
SUM(amount) OVER (PARTITION BY user_id ORDER BY date) AS cumulative_amount
FROM table
ROWS BETWEEN
: 基于行的位置来定义窗口帧,精确匹配行。RANGE BETWEEN
: 基于值范围来定义窗口帧,可能包含具有相同排序值的多行。range表示的是 具体的值,比这个值小n的行,比这个值大n的行- sum(close) range between 100 preceding and 200 following: 如当前行的 close 字段值是 200,那么这个窗口大小的定义就会选择分区中 close 字段值落在 100 至 400 区间的记录(行)。
unbounded preceding 前面所有行
unbounded following 后面所有行
current row 当前行
n following 后面n行
n preceding 前面n行
4. LEAD and LAG
lag :用于统计窗口内往上第 n 行值
lead :用于统计窗口内往下第 n 行值
lag 和 lead 有三个参数,第一个参数是列名,第二个参数是偏移量,第三个参数是超出记录窗口时的默认值
例1:有三列user,time和url,获取用户在某个页面停留时长
SELECT userid, url,
UNIX_TIMESTAMP(LEAD(time,1) OVER(PARTITION BY userid ORDER BY time)) - UNIX_TIMESTAMP(time) AS period
FROM user_log
例2:寻找至少连续出现 3 次的数字
SELECT DISTINCT num
FROM (SELECT
num,
LAG(num,1) OVER() AS lag1,
LAG(num,2) OVER() AS lag2
FROM table
) a
WHERE num = lag1 AND num = lag2
5.行/列转换
多行转一行:
SELECT Product,
CONCAT_WS(',',COLLECT_SET(Supplier)) AS Supplier
FROM mytable
GROUP BY Product
*collect_set 函数,有两个作用,第一个是去重,去除 group by 后的重复元素,第二个是形成一个集合,将 group by 后属于同一组的集合起来成为一个集合。
一行转多行:
SELECT Product, Supplier
FROM mytable
LATERAL VIEW EXPLODE(SPLIT(Supplier,',')) tb1 AS Supplier
6.去除最高最低的平均
计算除去部门最高工资,和最低工资的平均工资(字节跳动面试)
SELECT a.Department, AVG(a.Salary) AS avg_salary
FROM (
SELECT
Department,
Salary,
RANK() OVER(PARTITION BY Department ORDER BY Salary ASC) AS asc_rk,
RANK() OVER(PARTITION BY Department ORDER BY Salary DESC) AS desc_rk
FROM Salary_Info
) a
WHERE a.asc_rk > 1 -- 去除最低的薪资
AND a.desc_rk > 1 -- 去除最高的薪资
GROUP BY a.Department;
核心是使用窗口函数 RANK 分别对工资 salary 进行升序和降序排列,就获得了该分组内最低和最高的工资,过滤掉这 2 条记录
rank()排序相同时会重复,总数不变,即会出现1、1、3这样的排序结果;
dense_rank()排序相同时会重复,总数会减少,即会出现1、1、2这样的排序结果;
标签:SQL,HIVE,user,time,table,高频,id,SELECT,rk From: https://blog.csdn.net/m0_63190465/article/details/141529972