首页 > 其他分享 >Hive 刷题——区间合并

Hive 刷题——区间合并

时间:2023-02-26 09:46:45浏览次数:31  
标签:end union 合并 Hive start time id select 刷题

需求描述

给定多个时间段,每个时间段分为开始时间、结束时间,将相互重叠的多个时间段合并为一个区间。

--数据:id、开始时间、结束时间
1001 16 18
1002 17 19
1003 29 32
1004 30 31
1005 57 58
1006 44 44
1007 56 57
1008 12 15

合并后结果如下;

flag    start_time    end_time
1    12    15
2    16    19
3    29    32
4    44    44
5    56    58

需求分析

解题要点:如何判断哪些区间是要合并的?其实换个角度就是哪些区间是交叉的,哪些是重复的?判断思路:如果将起始时间,和结束时间进行排序,当前行的起始时间小于等于上一行的结束时间,那么日期就存在交叉,存在重复的数据。根据该条件我们可以设置断点,然后用经典的思路sum() over()来获取分组id,问题便得到解决。第一步:按照起始时间和结束时间进行降序排序,获取上一行的结束时间,目的是为了比较 

with t as (
    select 1001 as id, 16 as start_time, 18 as end_time
    union all
    select 1002 as id, 17 as start_time, 19 as end_time
    union all
    select 1003 as id, 29 as start_time, 32 as end_time
    union all
    select 1004 as id, 30 as start_time, 31 as end_time
    union all
    select 1005 as id, 57 as start_time, 58 as end_time
    union all
    select 1006 as id, 44 as start_time, 44 as end_time
    union all
    select 1007 as id, 56 as start_time, 57 as end_time
    union all
    select 1008 as id, 12 as start_time, 15 as end_time)
select *, lag(end_time, 1, end_time) over (order by start_time ,end_time) as last_end_time  from t;
1008    12    15    15
1001    16    18    15
1002    17    19    18
1003    29    32    19
1004    30    31    32
1006    44    44    31
1007    56    57    44
1005    57    58    57

第二步:根据lag_end_time进行判断,当当前行的start_time <=  lag_end_time时候设置标记值0,否则为1(经典的按条件变化后的分组思路,这里一定是满足条件的时候置为0,不满足条件的时候置为1)

select t1.id,
       t1.start_time,
       t1.end_time,
       t1.last_end_time,
       if(start_time <= last_end_time, 0, 1) flag --条件成立的时候为 0,不成立的时候为 1
from (select *, lag(end_time, 1, end_time) over (order by start_time ,end_time) as last_end_time
      from t) t1
1008    12    15    15    0
1001    16    18    15    1
1002    17    19    18    0
1003    29    32    19    1
1004    30    31    32    0
1006    44    44    31    1
1007    56    57    44    1
1005    57    58    57    0

 第三步:按照sum() over()的方法获取分组id

with t as (
    select 1001 as id, 16 as start_time, 18 as end_time
    union all
    select 1002 as id, 17 as start_time, 19 as end_time
    union all
    select 1003 as id, 29 as start_time, 32 as end_time
    union all
    select 1004 as id, 30 as start_time, 31 as end_time
    union all
    select 1005 as id, 57 as start_time, 58 as end_time
    union all
    select 1006 as id, 44 as start_time, 44 as end_time
    union all
    select 1007 as id, 56 as start_time, 57 as end_time
    union all
    select 1008 as id, 12 as start_time, 15 as end_time)
select id, start_time, end_time, last_end_time, flag, sum(flag) over (order by start_time,end_time) grep_flag
      from (select t1.id,
                   t1.start_time,
                   t1.end_time,
                   t1.last_end_time,
                   if(start_time <= last_end_time, 0, 1) flag --条件成立的时候为 0,不成立的时候为 1
            from (select *, lag(end_time, 1, end_time) over (order by start_time ,end_time) as last_end_time
                  from t) t1) t
1008    12    15    15    0    0
1001    16    18    15    1    1
1002    17    19    18    0    1
1003    29    32    19    1    2
1004    30    31    32    0    2
1006    44    44    31    1    3
1007    56    57    44    1    4
1005    57    58    57    0    4

第四步:在分组里获取,最小值及最大值,最小值即为起始点,最大值即为结束点,分组id即为id

select grep_flag + 1 as id, min(start_time) start_time, max(end_time) end_time
from (select id, start_time, end_time, last_end_time, flag, sum(flag) over (order by start_time,end_time) grep_flag
      from (select t1.id,
                   t1.start_time,
                   t1.end_time,
                   t1.last_end_time,
                   if(start_time <= last_end_time, 0, 1) flag --条件成立的时候为 0,不成立的时候为 1
            from (select *, lag(end_time, 1, end_time) over (order by start_time ,end_time) as last_end_time
                  from t) t1) t) t
group by grep_flag + 1
order by 1;
1    12    15
2    16    19
3    29    32
4    44    44
5    56    58

标签:end,union,合并,Hive,start,time,id,select,刷题
From: https://www.cnblogs.com/wdh01/p/17156142.html

相关文章

  • 刷题疑问
    1.K个链表合并,新建的节点怎么样能不使得内存泄漏;以及在使用priority_queue的时候,compare二元谓词、仿函数怎么使用来?template<classT,classContainer=vector<T>,......
  • 【Hive 元数据和真实数据-TAB_COL_STATS记录错误问题】
    MySql的元数据以及HDFS上数据的关系 元数据在DBS和TBLS上(SD_ID)1首先通过hive创建一个表table_test在hdfs的存储路径会生成相应的表TBLS也会更新内容进行记录2通......
  • 刷刷刷 Day 36 | 56. 合并区间
    56.合并区间LeetCode题目要求以数组intervals表示若干个区间的集合,其中单个区间为intervals[i]=[starti,endi]。请你合并所有重叠的区间,并返回 一个不重叠的区......
  • 2023/02/24刷题
    B.MaximumProduct链接B.MaximumProduct这个题因为只取5个数字,所以我们直接枚举5个数字的全部情况取出最大值就可以了#include<iostream>#include<algorithm>#......
  • hive:regexp_replace函数
    语法regexp_replace(subject,pattern,str)subject,pattern,str都为字符串subject为被替换的字符串pattern为正则表达式str需要替换的字符串实例SELECTaa......
  • Hive:COALESCE函数
    语法COALESCE(Tv1,Tv2,…)返回参数中的第一个非空值;如果所有值都为NULL,那么返回NULL。查询非空值selectcoalesce(a,b,c);如果a为null,则选择b;如果b为null,则选......
  • PHP使用ZipArchive批量打包压缩文件并下载
    PHP使用ZipArchive批量打包压缩文件,并下载。使用php自带的ZipArchive类,可以压缩或解压文件。首先需要确定已经安装了zip扩展,如果没有安装,请先安装,下载:http://pecl.php.net......
  • 如何实现把多个git仓库合并为一个,并保留提交记录?
    近期,我们中台git项目在进行monorepo改造。由多个各自独立的单一仓库改造成一个monorepo仓库。这其中就涉及到git仓库的合并。好办是好办,不过呢,要保留各仓库的提交记录的话,......
  • 算法刷题-数组排序(图算法、算法高阶)、螺旋矩阵(数组、矩阵)、分发糖果(贪心、数组)
    数组排序(图算法、算法高阶)编写一个JavaApplication程序,将随机生成的无序数组使用冒泡排序,将这个混乱的数组变成一个从小到大排列的有序的数组并输出。classdemo_sort......
  • LeetCode56. 合并区间(/)
    原题解题目约束题解classSolution{public:vector<vector<int>>merge(vector<vector<int>>&intervals){if(intervals.size()==0){......