Lifecycle Management Tool
When a data warehouse accumulates too many tables, lifecycle management becomes a serious problem: deleting expired partitions by hand, table by table, simply does not scale. The fix is a script that runs on a daily schedule and drops expired data. It solves the maintenance problem and cuts storage costs at the same time.
Solution
Hive lets you attach custom key/value pairs to a table via TBLPROPERTIES, so we can store the lifecycle (the retention period in days) there. The property can be written directly in the CREATE TABLE statement, or changed later with ALTER TABLE:
CREATE TABLE table_name (col1 string, col2 int) PARTITIONED BY (dt string)
TBLPROPERTIES ('lifecycle'='129', 'transient_lastDdlTime'='1682577175');
ALTER TABLE table_name SET TBLPROPERTIES ('lifecycle'='30');
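To confirm the property is in place, Hive can read it back with SHOW TBLPROPERTIES (a quick check against the table above):

-- Prints 30 if the ALTER above succeeded.
SHOW TBLPROPERTIES table_name('lifecycle');
-- Or list every property, including transient_lastDdlTime:
SHOW TBLPROPERTIES table_name;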
Hive metastore tables
Sample of the SDS table
The storage descriptor table: it records the storage ID (SD_ID) and LOCATION of every table and partition. The other columns have been dropped here to keep the output readable.
select * from SDS limit 10;
+-------+----------------------------------------------+
| SD_ID | LOCATION |
+-------+----------------------------------------------+
| 81 | file:/opt/hive/hive/data/table_1 |
| 82 | file:/opt/hive/hive/data/table_1/dt=20220101 |
| 83 | file:/opt/hive/hive/data/table_1/dt=20220102 |
| 84 | file:/opt/hive/hive/data/table_1/dt=20220103 |
| 85 | file:/opt/hive/hive/data/table_1/dt=20220104 |
| 86 | file:/opt/hive/hive/data/table_1/dt=20220105 |
| 87 | file:/opt/hive/hive/data/table_1/dt=20220106 |
| 88 | file:/opt/hive/hive/data/table_1/dt=20220107 |
| 89 | file:/opt/hive/hive/data/table_1/dt=20220108 |
| 90 | file:/opt/hive/hive/data/table_1/dt=20220109 |
+-------+----------------------------------------------+
Sample of the PARTITIONS table
The partition table: it records the partition ID, partition name, owning table ID, and storage ID. The other columns have been dropped here to keep the output readable.
select * from PARTITIONS limit 12;
+---------+-------------+-------+--------+
| PART_ID | PART_NAME | SD_ID | TBL_ID |
+---------+-------------+-------+--------+
| 1 | dt=20230401 | 6 | 1 |
| 2 | dt=20230401 | 7 | 2 |
| 3 | dt=20230401 | 8 | 3 |
| 4 | dt=20230402 | 9 | 1 |
| 5 | dt=20230402 | 10 | 2 |
| 6 | dt=20230402 | 11 | 3 |
| 7 | dt=20230403 | 12 | 1 |
| 8 | dt=20230403 | 13 | 2 |
| 9 | dt=20230403 | 14 | 3 |
| 10 | dt=20230404 | 15 | 1 |
| 11 | dt=20230404 | 16 | 2 |
| 12 | dt=20230404 | 17 | 3 |
+---------+-------------+-------+--------+
Sample of the TABLE_PARAMS table
The table-parameter table: it records the table ID, the custom parameter key, and the parameter value; this is where the lifecycle property set above ends up. The other columns have been dropped here to keep the output readable.
select * from TABLE_PARAMS limit 10;
+--------+-----------------------+-------------+
| TBL_ID | PARAM_KEY | PARAM_VALUE |
+--------+-----------------------+-------------+
| 1 | lifecycle | 30 |
| 2 | lifecycle | 30 |
| 3 | lifecycle | 30 |
| 4 | lifecycle | 30 |
+--------+-----------------------+-------------+
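This lifecycle value is what the cleanup job turns into a date cutoff. In MySQL the conversion is a single expression (a small sketch of the expression the final script embeds):

-- With a 30-day lifecycle, run on 2023-05-01 this yields: dt=20230401
select CONCAT('dt=', REPLACE(DATE_SUB(CURDATE(), INTERVAL 30 DAY), '-', ''));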
Sample of the TBLS table
The table description table: it records the table ID, database ID, and table name. The other columns have been dropped here to keep the output readable.
select * from TBLS limit 10;
+--------+------+----------+
| TBL_ID |DB_ID | TBL_NAME |
+--------+------+----------+
| 6 | 1 | table_1 |
| 11 | 1 | table_2 |
| 16 | 1 | table_3 |
| 21 | 1 | table_4 |
| 26 | 1 | table_5 |
| 31 | 1 | table_6 |
| 36 | 1 | table_7 |
| 41 | 1 | table_8 |
| 46 | 1 | table_9 |
| 51 | 1 | table_10 |
+--------+------+----------+
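With the tables above (plus DBS, not shown here, which maps DB_ID to the database name in the standard metastore schema) the expired partitions can be listed in a single join. A read-only dry run against the metastore MySQL, worth running before automating anything; it relies on dt=YYYYMMDD partition names comparing lexicographically in the same order as the dates:

-- List every partition that is older than its table's lifecycle cutoff (read-only).
select d.NAME as db_name, t.TBL_NAME, p.PART_NAME
from TBLS t
inner join DBS d on t.DB_ID = d.DB_ID
inner join TABLE_PARAMS tp on t.TBL_ID = tp.TBL_ID and tp.PARAM_KEY = 'lifecycle'
inner join PARTITIONS p on t.TBL_ID = p.TBL_ID
where p.PART_NAME < CONCAT('dt=', REPLACE(DATE_SUB(CURDATE(), INTERVAL CAST(tp.PARAM_VALUE AS SIGNED) DAY), '-', ''));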
Final script
#!/bin/bash
# Walk every database in the Hive metastore and drop partitions older than each table's lifecycle.
IFS=$'\n'
mysql -u'root' -p'123456' -D'hive' -N -e "select DB_ID from TBLS group by DB_ID" > /root/db.csv
for dbindex in $(cat /root/db.csv)
do
  echo "db '$dbindex' start"
  # For every table carrying a 'lifecycle' property, emit one statement of the form
  #   alter table <db>.<table> drop partition (dt=<CURDATE minus lifecycle days>);
  # The b.PART_NAME>=c.PART_NAME join keeps only tables that actually have expired partitions,
  # and the DBS join qualifies the table name so hive -e does not need a "use <db>".
  mysql -u'root' -p'123456' -D'hive' -N -e "select CONCAT('alter table ',e.NAME,'.',TBL_NAME,' drop partition (',b.PART_NAME,');') from (select TBL_ID,TBL_NAME,DB_ID from TBLS where DB_ID in ($dbindex))a inner join (select TBL_ID,CONCAT('dt=',REPLACE(DATE_SUB(CURDATE(), INTERVAL CAST(PARAM_VALUE AS SIGNED) DAY),'-','')) as PART_NAME from TABLE_PARAMS where PARAM_KEY='lifecycle')b on a.TBL_ID=b.TBL_ID inner join (select PART_NAME,TBL_ID from PARTITIONS)c on b.TBL_ID=c.TBL_ID and b.PART_NAME>=c.PART_NAME inner join (select DB_ID,NAME from DBS)e on a.DB_ID=e.DB_ID group by e.NAME,TBL_NAME,b.PART_NAME" > /root/result.csv
  # Rewrite dt= to dt< so a single statement drops every partition older than the cutoff.
  sed -i 's/dt=/dt</g' /root/result.csv
  # Pack 100 statements per line so each hive invocation amortizes its startup cost.
  awk '{printf "%s%s", (NR%100==1 && NR!=1?RS:""), $0} END {print ""}' /root/result.csv > /root/data.csv
  # cat, not head: head would silently process only the first 10 batches.
  for tableindex in $(cat /root/data.csv)
  do
    hive -e "$tableindex"
  done
  echo "db '$dbindex' end"
done
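The intro above calls for a daily schedule, which a crontab entry covers (a sketch; /root/partition_lifecycle.sh is a hypothetical name for the script above):

# Hypothetical crontab entry: run the cleanup at 02:00 every night and append a log.
0 2 * * * /bin/bash /root/partition_lifecycle.sh >> /root/partition_lifecycle.log 2>&1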
The solution above has the following drawbacks:
1. It assumes the metastore is MySQL. Managed Hive services usually differ: on AWS, Hive is backed by the Glue Data Catalog, while Alibaba Cloud uses RDS, so the metadata may have to be read a different way.
2. The final drop statements are executed through the hive CLI; this could be upgraded to Spark, as sketched below.
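Swapping the executor is a one-line change, since spark-sql also accepts an inline statement with -e (a sketch, assuming a Spark installation configured against the same metastore):

# Same batching loop as in the final script, executed by Spark instead of the Hive CLI.
for tableindex in $(cat /root/data.csv)
do
  spark-sql -e "$tableindex"
done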
Points to watch:
1. If the warehouse contains especially large tables (with a huge number of partitions), handle those tables separately in each run.
2. The cluster should have the trash enabled so that accidentally deleted data can be recovered; see the sketch after this list.
3. Roll the script out in three stages: one table first, then one database, then the whole warehouse.
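For point 2, HDFS trash is controlled by fs.trash.interval (in minutes; 0 means disabled). A quick check, and what a restore looks like; the paths below are only examples modeled on the table_1 sample above:

# A non-zero value means dropped data is moved to the trash instead of being deleted outright.
hdfs getconf -confKey fs.trash.interval
# Dropped partitions land under the executing user's trash and can be moved back, e.g.:
hadoop fs -mv /user/hive/.Trash/Current/opt/hive/hive/data/table_1/dt=20220101 /opt/hive/hive/data/table_1/dt=20220101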