hive和mysql的笔记-反向overwrite,避免重复插入数据,量表关联更新
1反向overwrite 操作->解决错误的插入数据的问题。
描述:
例如
学生表:student<id,name,createtime>
在学生表中插入数据,首先我们在23号插入一条数据,然后又插入一条24号的数据。发现差错了。此时可以查询24号之前的数据然后在 overwrite 到原表中
2. where not exists 避免重复插入SQL语句
避免重复插入SQL语句:
insert into TABLE2
select *
from TABLE1
where not exists (select 1 from TABLE2 where TABLE2.id=TABLE1.id and TABLE2.NAME = TABLE1.NAME)
- SQL数据写入经常遇到重复插入数据导致数据样本变大,对测试和磁盘造成压力,浪费资源,故需要修改写入语句
-
原语句:
0. 建表 create table IF NOT EXISTS dbgen_version ( dv_version varchar(10), dv_create_date date, dv_create_time varchar(10), dv_cmdline_args varchar(10) ) WITH (format = 'ORC') 1. 清空表(如果有数据) trino:hivetest> delete from hive.hivetest.dbgen_version; DELETE Query 20210225_072035_00059_zvank, FINISHED, 1 node Splits: 1 total, 1 done (100.00%) 0.24 [0 rows, 0B] [0 rows/s, 0B/s] 查看语句: trino:hivetest> select * from hive.hivetest.dbgen_version; dv_version | dv_create_date | dv_create_time | dv_cmdline_args ------------+----------------+----------------+----------------- (0 rows) Query 20210225_072059_00060_zvank, FINISHED, 1 node Splits: 17 total, 17 done (100.00%) 0.21 [0 rows, 0B] [0 rows/s, 0B/s]
-
插入并查询语句
trino:hivetest> INSERT INTO hive.hivetest.dbgen_version select * from hive.hive.dbgen_version; INSERT: 1 row Query 20210225_072113_00061_zvank, FINISHED, 3 nodes Splits: 69 total, 69 done (100.00%) 0.53 [1 rows, 67B] [1 rows/s, 127B/s] 查看语句 trino:hivetest> select * from hive.hivetest.dbgen_version; dv_version | dv_create_date | dv_create_time | dv_cmdline_args ------------+----------------+----------------+----------------- 2.0.0 | 2021-02-02 | 18:32:49 | --scale 10 (1 row) Query 20210225_072152_00062_zvank, FINISHED, 2 nodes Splits: 17 total, 17 done (100.00%) 0.21 [1 rows, 0B] [4 rows/s, 0B/s]
-
插入where not exists限定语句
trino:hivetest> insert into hive.hivetest.dbgen_version -> select * -> from hive.hive.dbgen_version TABLE1 -> where not exists (select 1 from hive.hivetest.dbgen_version TABLE2 where TABLE2.dv_version=TABLE1.dv_version and TABLE2.dv_create_date = TABLE1.dv_create_date); INSERT: 0 rows 查看语句 Query 20210225_072210_00063_zvank, FINISHED, 3 nodes Splits: 198 total, 198 done (100.00%) 0.44 [3 rows, 176B] [6 rows/s, 400B/s] trino:hivetest> select * from hive.hivetest.dbgen_version; dv_version | dv_create_date | dv_create_time | dv_cmdline_args ------------+----------------+----------------+----------------- 2.0.0 | 2021-02-02 | 18:32:49 | --scale 10 (1 row) 查看语句 Query 20210225_072216_00064_zvank, FINISHED, 1 node Splits: 17 total, 17 done (100.00%) 0.21 [1 rows, 0B] [4 rows/s, 0B/s]
-
插入一般的语句
trino:hivetest> INSERT INTO hive.hivetest.dbgen_version select * from hive.hive.dbgen_version; INSERT: 1 row 查看语句 Query 20210225_072237_00065_zvank, FINISHED, 3 nodes Splits: 69 total, 69 done (100.00%) 0.60 [1 rows, 67B] [1 rows/s, 111B/s] 重复插入!!! trino:hivetest> select * from hive.hivetest.dbgen_version; dv_version | dv_create_date | dv_create_time | dv_cmdline_args ------------+----------------+----------------+----------------- 2.0.0 | 2021-02-02 | 18:32:49 | --scale 10 2.0.0 | 2021-02-02 | 18:32:49 | --scale 10 (2 rows) Query 20210225_072245_00066_zvank, FINISHED, 3 nodes Splits: 18 total, 18 done (100.00%) 0.21 [2 rows, 0B] [9 rows/s, 0B/s]
原因
where not exists (select 1 from TABLE2 where TABLE2.id=TABLE1.id and TABLE2.NAME = TABLE1.NAME)限定了两个条件相等情况下不予插入。如果仅限定一个条件,对于有重复数据的情况可能会写入数据不成功,需要检查确认。