需求描述
给定多个时间段,每个时间段分为开始时间、结束时间,将相互重叠的多个时间段合并为一个区间。
--数据:id、开始时间、结束时间 1001 16 18 1002 17 19 1003 29 32 1004 30 31 1005 57 58 1006 44 44 1007 56 57 1008 12 15
合并后结果如下;
flag start_time end_time 1 12 15 2 16 19 3 29 32 4 44 44 5 56 58
需求分析
解题要点:如何判断哪些区间是要合并的?其实换个角度就是哪些区间是交叉的,哪些是重复的?判断思路:如果将起始时间,和结束时间进行排序,当前行的起始时间小于等于上一行的结束时间,那么日期就存在交叉,存在重复的数据。根据该条件我们可以设置断点,然后用经典的思路sum() over()来获取分组id,问题便得到解决。第一步:按照起始时间和结束时间进行降序排序,获取上一行的结束时间,目的是为了比较
with t as ( select 1001 as id, 16 as start_time, 18 as end_time union all select 1002 as id, 17 as start_time, 19 as end_time union all select 1003 as id, 29 as start_time, 32 as end_time union all select 1004 as id, 30 as start_time, 31 as end_time union all select 1005 as id, 57 as start_time, 58 as end_time union all select 1006 as id, 44 as start_time, 44 as end_time union all select 1007 as id, 56 as start_time, 57 as end_time union all select 1008 as id, 12 as start_time, 15 as end_time) select *, lag(end_time, 1, end_time) over (order by start_time ,end_time) as last_end_time from t; 1008 12 15 15 1001 16 18 15 1002 17 19 18 1003 29 32 19 1004 30 31 32 1006 44 44 31 1007 56 57 44 1005 57 58 57
第二步:根据lag_end_time进行判断,当当前行的start_time <= lag_end_time时候设置标记值0,否则为1(经典的按条件变化后的分组思路,这里一定是满足条件的时候置为0,不满足条件的时候置为1)
select t1.id, t1.start_time, t1.end_time, t1.last_end_time, if(start_time <= last_end_time, 0, 1) flag --条件成立的时候为 0,不成立的时候为 1 from (select *, lag(end_time, 1, end_time) over (order by start_time ,end_time) as last_end_time from t) t1 1008 12 15 15 0 1001 16 18 15 1 1002 17 19 18 0 1003 29 32 19 1 1004 30 31 32 0 1006 44 44 31 1 1007 56 57 44 1 1005 57 58 57 0
第三步:按照sum() over()的方法获取分组id
with t as ( select 1001 as id, 16 as start_time, 18 as end_time union all select 1002 as id, 17 as start_time, 19 as end_time union all select 1003 as id, 29 as start_time, 32 as end_time union all select 1004 as id, 30 as start_time, 31 as end_time union all select 1005 as id, 57 as start_time, 58 as end_time union all select 1006 as id, 44 as start_time, 44 as end_time union all select 1007 as id, 56 as start_time, 57 as end_time union all select 1008 as id, 12 as start_time, 15 as end_time) select id, start_time, end_time, last_end_time, flag, sum(flag) over (order by start_time,end_time) grep_flag from (select t1.id, t1.start_time, t1.end_time, t1.last_end_time, if(start_time <= last_end_time, 0, 1) flag --条件成立的时候为 0,不成立的时候为 1 from (select *, lag(end_time, 1, end_time) over (order by start_time ,end_time) as last_end_time from t) t1) t 1008 12 15 15 0 0 1001 16 18 15 1 1 1002 17 19 18 0 1 1003 29 32 19 1 2 1004 30 31 32 0 2 1006 44 44 31 1 3 1007 56 57 44 1 4 1005 57 58 57 0 4
第四步:在分组里获取,最小值及最大值,最小值即为起始点,最大值即为结束点,分组id即为id
select grep_flag + 1 as id, min(start_time) start_time, max(end_time) end_time from (select id, start_time, end_time, last_end_time, flag, sum(flag) over (order by start_time,end_time) grep_flag from (select t1.id, t1.start_time, t1.end_time, t1.last_end_time, if(start_time <= last_end_time, 0, 1) flag --条件成立的时候为 0,不成立的时候为 1 from (select *, lag(end_time, 1, end_time) over (order by start_time ,end_time) as last_end_time from t) t1) t) t group by grep_flag + 1 order by 1; 1 12 15 2 16 19 3 29 32 4 44 44 5 56 58标签:end,union,合并,Hive,start,time,id,select,刷题 From: https://www.cnblogs.com/wdh01/p/17156142.html