Bucketed Tables, Transactional Tables, and Views

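Hive Bucketed Tables

A bucketed table (bucket table) is a table type designed to optimize queries.

At the storage level, a bucketed table splits its data files into several parts (a fixed number of smaller files).
When bucketing, you specify the column whose value determines which bucket each row goes to.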

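Bucketing rule: bucket number = hash_function(bucketing_column) mod num_buckets. Rows with the same bucket number go into the same bucket.
hash_function depends on the type of the bucketing column (bucketing_column):
- For int, hash_function(int) == int.
- For other types such as bigint, string, or complex types, hash_function is less straightforward: it is some number derived from the value, such as its hashcode.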
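A rough way to see the rule in action is to compute a bucket number by hand with Hive's built-in hash() and pmod() functions. This is only a sketch: it assumes the table itheima.t_usa_covid19 with a state column and 5 buckets (as in the example later in this post), and the hash Hive uses internally when writing bucket files may differ by version, so the numbers need not match the actual file layout.

```sql
-- Illustration only: derive a bucket number from the bucketing column.
-- pmod keeps the result non-negative even when hash() is negative.
select distinct
    state,
    hash(state)          as hash_value,
    pmod(hash(state), 5) as bucket_no   -- 5 = num_buckets
from itheima.t_usa_covid19;
```

Every row with the same state value gets the same bucket_no, which is why equal keys always end up together.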

Syntax

-- Bucketing syntax
CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS

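CLUSTERED BY (col_name) specifies the column to bucket on.
INTO N BUCKETS specifies how many buckets (parts) the data is split into, where N is the bucket count.
Note: the bucketing column must be an existing column of the table.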

Example

Split the data into 5 buckets by state. The CREATE TABLE statement is:

CREATE TABLE itheima.t_usa_covid19_bucket(
count_date string,
county string,
state string,
fips int,
cases int,
deaths int)
CLUSTERED BY(state) INTO 5 BUCKETS;

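When creating a bucketed table, you can also specify how rows are sorted within each bucket:

-- the table name t_usa_covid19_bucket_sort is chosen here so it does not clash with the other tables in this post
CREATE TABLE itheima.t_usa_covid19_bucket_sort(
count_date string,
county string,
state string,
fips int,
cases int,
deaths int)
CLUSTERED BY (state)
sorted by(cases desc) INTO 5 BUCKETS;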

Loading data into a bucketed table

--step1: enable bucketing enforcement (no longer needed as of Hive 2.0)
set hive.enforce.bucketing=true;
--step2: load the source data into a regular Hive table
drop table if exists itheima.t_usa_covid19;

CREATE TABLE itheima.t_usa_covid19(
count_date string,
county string,
state string,
fips int,
cases int,
deaths int)
row format delimited fields terminated by ',';

--upload the source data to the HDFS path of table t_usa_covid19 (run from a shell, not from Hive)
hadoop fs -put us-covid19-counties.dat /user/hive/warehouse/itheima.db/t_usa_covid19
--step3: load the data into the bucketed table with insert + select (the inserted rows come from the query that follows)
insert into itheima.t_usa_covid19_bucket select * from itheima.t_usa_covid19;

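Looking at the underlying storage of t_usa_covid19_bucket on HDFS shows that the data has been split into 5 parts.
The result also shows that rows with the same value in the bucketing column always land in the same bucket.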
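One way to check this, sketched under the assumption that the table sits in the default warehouse path used above, is to list the table directory from the Hive CLI and then filter on the bucketing column:

```sql
-- List the files behind the bucketed table; one file per bucket is expected.
dfs -ls /user/hive/warehouse/itheima.db/t_usa_covid19_bucket;

-- Filter on the bucketing column; 'New York' is just a sample value.
select * from itheima.t_usa_covid19_bucket where state = 'New York';
```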

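Benefits

Queries that filter on the bucketing column avoid a full table scan.

Joins run more efficiently in MapReduce, with fewer row combinations to compare, when the tables are bucketed on the join column (for example, id).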
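As a hedged sketch of the join case: the two tables below are hypothetical (they do not appear elsewhere in this post), both bucketed on the join column id with the same bucket count. hive.optimize.bucketmapjoin is a real Hive setting, but whether the optimizer actually chooses a bucket map join depends on the Hive version, the execution engine, and table sizes.

```sql
-- Hypothetical tables, both bucketed on the join column id.
create table emp_bucket(id int, name string)
clustered by (id) into 4 buckets;

create table emp_detail_bucket(id int, address string)
clustered by (id) into 4 buckets;

-- Allow Hive to join matching buckets instead of comparing whole tables.
set hive.optimize.bucketmapjoin = true;

select e.id, e.name, d.address
from emp_bucket e
join emp_detail_bucket d on e.id = d.id;
```

Because equal id values hash to the same bucket number in both tables, each bucket of one table only needs to be matched against the corresponding bucket of the other.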

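Bucketed tables allow efficient sampling

When the data set is very large and processing all of it is impractical, sampling becomes especially important. Sampling lets you estimate and infer properties of the whole from the sampled data. It is an economical and effective working and research method widely used in scientific experiments, quality inspection, and social surveys.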
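A minimal sketch of bucket sampling with Hive's TABLESAMPLE clause, assuming the 5-bucket table t_usa_covid19_bucket (bucketed on state) from the example above:

```sql
-- Read roughly one fifth of the data: bucket 1 out of 5, hashed on state.
select *
from itheima.t_usa_covid19_bucket
tablesample(bucket 1 out of 5 on state);
```

When the sampling column matches the bucketing column, Hive can read just the relevant bucket files rather than scanning the whole table.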

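Hive Transactional Tables

Hive originally did not support modifying data.

The reason: Hive's core goal is to map existing structured data files to tables and provide SQL-based analysis over them. It is an analysis-oriented tool.

Support for modifications was added later, but with limitations:

BEGIN, COMMIT, and ROLLBACK are not yet supported. All language operations are auto-committed.
Only the ORC file format is supported (STORED AS ORC).
Transactions are turned off by default and must be enabled through configuration parameters.
The table must be bucketed to use transactional features.
The table property transactional must be set to true.
External tables cannot be ACID tables, and reading from or writing to an ACID table from a non-ACID session is not allowed.

Without changing any configuration, if you run UPDATE, DELETE, and INSERT directly, only INSERT succeeds (under the hood, INSERT simply writes the data into a new file).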

--Step1: create a regular table
create table student(
    num int,
    name string,
    sex string,
    age int,
    dept string)
row format delimited 
fields terminated by ',';

--Step2: load data into the regular table (run from a shell, not from Hive)
hadoop fs -put students.txt /user/hive/warehouse/itheima.db/student
--Step3: try an update (fails, because student is not a transactional table)
update student
set age = 66
where num = 95001;

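Enabling transactions and creating a transactional table

--1. Transaction settings (use set for the current session only, or put them in hive-site.xml)
set hive.support.concurrency = true; --whether Hive supports concurrency
set hive.enforce.bucketing = true; --whether to enforce bucketing (no longer needed as of Hive 2.0)
set hive.exec.dynamic.partition.mode = nonstrict; --dynamic partition mode: nonstrict
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; --transaction manager that supports ACID operations

set hive.compactor.initiator.on = true; --whether to run the initiator and cleaner threads on this metastore instance

set hive.compactor.worker.threads = 1; --number of compactor worker threads to run on this metastore instance
--2. Create a Hive transactional table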
create table trans_student(
    id int,
    name String,
    age int)
    clustered by (id) into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');

insert update delete

--3. Run insert, update, and delete against the transactional table
insert into trans_student values(1,"allen",18);

update trans_student
set age = 20
where id = 1;

delete from trans_student where id =1;

select * from trans_student;
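To inspect what these statements did, a few standard Hive commands can help. This is a sketch that assumes the trans_student table above and the compactor settings shown earlier. Whether and when a compaction actually runs depends on the metastore configuration.

```sql
-- Confirm the table is transactional (look for transactional=true in Table Parameters).
describe formatted trans_student;

-- List open and aborted transactions known to the metastore.
show transactions;

-- Each insert/update/delete writes delta files; request a major compaction to merge them.
alter table trans_student compact 'major';

-- Check the state of compaction requests.
show compactions;
```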

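Hive Views

A Hive view is a virtual table: it stores only its definition, not any data.
Views exist to simplify operations. They do not cache records and do not improve query performance.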

-- Syntax
--t_usa_covid19 is a real base table in Hive
select * from itheima.t_usa_covid19;
--1. Create a view
create view v_usa_covid19 as select count_date, county,state,deaths from t_usa_covid19 limit 5;
--Can a view be created from an existing view? Yes.
create view v_usa_covid19_from_view as select * from v_usa_covid19 limit 2;
--2. List existing views
show tables;
show views; --supported since Hive 2.2.0
--3. Query a view
select * from v_usa_covid19;

--Can data be inserted into a view?
--No. It fails with SemanticException:A view cannot be used as target table for LOAD or INSERT
insert into v_usa_covid19 select count_date,county,state,deaths from t_usa_covid19;
--4. Show the view definition
show create table v_usa_covid19;
--5. Drop a view
drop view v_usa_covid19_from_view;
--6. Change view properties
alter view v_usa_covid19 set TBLPROPERTIES ('comment' = 'This is a view');
--7. Change the view definition
alter view v_usa_covid19 as  select county,deaths from t_usa_covid19 limit 2;
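To confirm that a view stores only a definition, one option is to inspect its metadata. A small sketch, assuming the view v_usa_covid19 created above still exists:

```sql
-- For a view, the Table Type field reads VIRTUAL_VIEW and the output
-- includes the view's query text rather than a data location with files.
describe formatted v_usa_covid19;
```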

Benefits of views

Expose only specific columns of the underlying table to users, protecting data privacy.

--Restricting data access through a view keeps information from being queried freely:
create table userinfo(firstname string, lastname string, ssn string, password string);
create view safer_user_info as select firstname, lastname from userinfo;
--A where clause can also restrict access, e.g. an employee view that exposes only employees from one department:
create table employee(firstname string, lastname string, ssn string, password string, department string);
create view techops_employee as select firstname, lastname, ssn from employee where department = 'java';

Reduce query complexity and simplify query statements.

--Use a view to simplify a nested query
from (
    select * from people join cart
    on (cart.people_id = people.id) where firstname = 'john'
      )a select a.lastname where a.id = 3;

--Turn the nested subquery into a view
create view shorter_join as 
select * from people join cart
on (cart.people_id = people.id) where firstname = 'john';

--Query through the view
select lastname from shorter_join where id = 3;

From： https://www.cnblogs.com/catch-autumn/p/16814669.html
