
Changes on HDFS When a Hudi Table Is Created


The Spark SQL statement used to create the Hudi table:

CREATE TABLE t71 (
    ds BIGINT,
    ut STRING,
    pk BIGINT,
    f0 BIGINT,
    f1 BIGINT,
    f2 BIGINT,
    f3 BIGINT,
    f4 BIGINT
) USING hudi
PARTITIONED BY (ds)
TBLPROPERTIES ( -- an OPTIONS clause can also be used here (https://hudi.apache.org/docs/table_management)
  type = 'mor',
  primaryKey = 'pk',
  preCombineField = 'ut',
  hoodie.index.type = 'BUCKET',
  hoodie.bucket.index.num.buckets = '2',
  hoodie.compaction.payload.class = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
  hoodie.datasource.write.payload.class = 'org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload',
  hoodie.archive.merge.enable = 'true',
  hoodie.datasource.write.operation = 'upsert'
);
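
As the comment above notes, the table configuration can also be passed through an OPTIONS clause instead of TBLPROPERTIES. A minimal sketch of that form, with a hypothetical table name t71_opts and only the core properties:

CREATE TABLE t71_opts (
    ds BIGINT,
    ut STRING,
    pk BIGINT,
    f0 BIGINT
) USING hudi
OPTIONS (
  type = 'mor',
  primaryKey = 'pk',
  preCombineField = 'ut'
)
PARTITIONED BY (ds);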

After CREATE TABLE is executed, a subdirectory and files are created under the table path:

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71
Found 1 items
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie
Found 5 items
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.schema
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.temp
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/archived
-rw-r--r--   3 zhangsan dfsusers       1501 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/hoodie.properties

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/archived
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.schema
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux
Found 1 items
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.temp
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap
Found 2 items
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.fileids
drwxr-xr-x   - zhangsan dfsusers          0 2023-05-31 11:09 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.partitions

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.partitions
[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/.aux/.bootstrap/.fileids

[/home/zhangsan]$ sh hadoop.sh fs -cat hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/.hoodie/hoodie.properties
#Properties saved on 2023-05-31T03:09:25.601Z
#Wed May 31 11:09:25 CST 2023
hoodie.table.precombine.field=ut
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.partition.fields=ds
hoodie.bucket.index.num.buckets=2
hoodie.table.type=MERGE_ON_READ
hoodie.archivelog.folder=archived
hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload
hoodie.table.version=5
hoodie.timeline.layout.version=1
hoodie.table.recordkey.fields=pk
hoodie.database.name=test
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.table.name=t71
hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.datasource.write.hive_style_partitioning=true
hoodie.table.create.schema={"type"\:"record","name"\:"t71_record","namespace"\:"hoodie.t71","fields"\:[{"name"\:"_hoodie_commit_time","type"\:["string","null"]},{"name"\:"_hoodie_commit_seqno","type"\:["string","null"]},{"name"\:"_hoodie_record_key","type"\:["string","null"]},{"name"\:"_hoodie_partition_path","type"\:["string","null"]},{"name"\:"_hoodie_file_name","type"\:["string","null"]},{"name"\:"ut","type"\:["string","null"]},{"name"\:"pk","type"\:["long","null"]},{"name"\:"f0","type"\:["long","null"]},{"name"\:"f1","type"\:["long","null"]},{"name"\:"f2","type"\:["long","null"]},{"name"\:"f3","type"\:["long","null"]},{"name"\:"f4","type"\:["long","null"]},{"name"\:"ds","type"\:["long","null"]}]}
hoodie.index.type=BUCKET
hoodie.table.checksum=3938074607

After DROP TABLE is executed, the table directory, e.g. hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71, is removed.
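
A minimal sketch of that statement (not executed in this walkthrough, since the sections below keep inserting into t71):

DROP TABLE t71;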

For a partitioned table, subdirectories named after the partition values are created under the table directory when data is inserted, for example:

insert into t71 (ds,ut,pk,f0) values (20230101,CURRENT_TIMESTAMP,1102,1);

The statement above creates a subdirectory named "ds=20230101" on HDFS:

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101
Found 2 items
-rw-r--r--   3 zhangsan dfsusers         96 2023-05-31 11:29 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.hoodie_partition_metadata
-rw-r--r--   3 zhangsan dfsusers     435756 2023-05-31 11:29 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/00000001-6776-4b80-915b-ad6bdff96948-0_1-21-19_20230531112913107.parquet

Run three consecutive INSERTs, checking the result with a SELECT after each one:

insert into t71 (ds,ut,pk,f0) values (20230101,CURRENT_TIMESTAMP,1102,1);
select * from t71 where pk=1102;
insert into t71 (ds,ut,pk,f1) values (20230101,CURRENT_TIMESTAMP,1102,2);
select * from t71 where pk=1102;
insert into t71 (ds,ut,pk,f2) values (20230101,CURRENT_TIMESTAMP,1102,3);
select * from t71 where pk=1102;

[/home/zhangsan]$ sh hadoop.sh fs -ls hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101
Found 3 items
-rw-r--r--   3 zhangsan dfsusers       1048 2023-05-31 14:26 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_20230531141236926.log.1_1-8-6
-rw-r--r--   3 zhangsan dfsusers       2096 2023-05-31 14:31 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_20230531141236926.log.1_1-8-6
-rw-r--r--   3 zhangsan dfsusers         96 2023-05-31 14:13 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.hoodie_partition_metadata
-rw-r--r--   3 zhangsan dfsusers     435757 2023-05-31 14:13 hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_1-21-17_20230531141236926.parquet

The listing above shows the ".log." file twice, as it looked after the second insert and after the third insert; the two listings are merged here to make the comparison easier.

The first insert into a partition always produces a ".parquet" file rather than a ".log." file. The ".parquet" files are the columnar base files and exist in both COW and MOR tables, but a COW table rewrites the whole base file on every insert, whereas a MOR table does not (a COW sketch for comparison follows after the metadata dump below). The ".log." files are row-oriented incremental log files and exist only in MOR tables. The file .hoodie_partition_metadata holds the partition metadata:

[/home/zhangsan]$ sh hadoop.sh fs -cat hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/.hoodie_partition_metadata
#partition metadata
#Wed May 31 11:29:49 CST 2023
commitTime=20230531112913107
partitionDepth=1
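
Back to the COW/MOR contrast above: a copy-on-write version of the same table could be created by changing only the type property. This is a minimal sketch with a hypothetical table name t71_cow, not a statement run in this session; with such a table, every write rewrites the ".parquet" base file of the affected file group and no ".log." files appear.

CREATE TABLE t71_cow (
    ds BIGINT,
    ut STRING,
    pk BIGINT,
    f0 BIGINT,
    f1 BIGINT,
    f2 BIGINT,
    f3 BIGINT,
    f4 BIGINT
) USING hudi
PARTITIONED BY (ds)
TBLPROPERTIES (
  type = 'cow',  -- copy-on-write: commits rewrite base files instead of appending logs
  primaryKey = 'pk',
  preCombineField = 'ut'
);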

The ".parquet" file

Opening the ".parquet" file with the online tool https://parquet-viewer-online.com/result shows content identical to the SELECT result:

_hoodie_commit_time     20230531141236926
_hoodie_commit_seqno    20230531141236926_1_0
_hoodie_record_key      1102
_hoodie_partition_path  ds=20230101
_hoodie_file_name       00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_1-21-17_20230531141236926.parquet
ut                      2023-05-31 14:12:37.126
pk                      1102
f0                      1
f1                      null
f2                      null
f3                      null
f4                      null
ds                      20230101
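
The same contents can also be checked without an external tool by pointing Spark SQL directly at the base file. A minimal sketch, using the file path listed above (the file name will of course differ in any other environment):

SELECT * FROM parquet.`hdfs://hadoop-cluster-01/user/zhangsan/warehouse/test.db/t71/ds=20230101/00000001-b6d7-4eaa-8004-ac7d0626bf8d-0_1-21-17_20230531141236926.parquet`;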

The ".log." file

After the second insert, a ".log." file was created:

#HUDI#      
              4{"type":"record","name":"t71_record","namespace":"hoodie.t71","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"ut","type":"string"},{"name":"pk","type":"long"},{"name":"f0","type":["null","long"],"default":null},{"name":"f1","type":["null","long"],"default":null},{"name":"f2","type":["null","long"],"default":null},{"name":"f3","type":["null","long"],"default":null},{"name":"f4","type":["null","long"],"default":null},{"name":"ds","type":"long"}]}       20230531142614512       •         ‰"20230531142614512*20230531142614512_1_11102ds=20230101L00000001-b6d7-4eaa-8004-ac7d0626bf8d-0.2023-05-31 14:26:14.761œ    ª¿¥          

After the third insert, the ".log." file was updated:

#HUDI#      
              4{"type":"record","name":"t71_record","namespace":"hoodie.t71","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"ut","type":"string"},{"name":"pk","type":"long"},{"name":"f0","type":["null","long"],"default":null},{"name":"f1","type":["null","long"],"default":null},{"name":"f2","type":["null","long"],"default":null},{"name":"f3","type":["null","long"],"default":null},{"name":"f4","type":["null","long"],"default":null},{"name":"ds","type":"long"}]}       20230531142614512       •         ‰"20230531142614512*20230531142614512_1_11102ds=20230101L00000001-b6d7-4eaa-8004-ac7d0626bf8d-0.2023-05-31 14:26:14.761œ    ª¿¥          #HUDI#      
              4{"type":"record","name":"t71_record","namespace":"hoodie.t71","fields":[{"name":"_hoodie_commit_time","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_commit_seqno","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_record_key","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_partition_path","type":["null","string"],"doc":"","default":null},{"name":"_hoodie_file_name","type":["null","string"],"doc":"","default":null},{"name":"ut","type":"string"},{"name":"pk","type":"long"},{"name":"f0","type":["null","long"],"default":null},{"name":"f1","type":["null","long"],"default":null},{"name":"f2","type":["null","long"],"default":null},{"name":"f3","type":["null","long"],"default":null},{"name":"f4","type":["null","long"],"default":null},{"name":"ds","type":"long"}]}       20230531143136695       •         ‰"20230531143136695*20230531143136695_1_11102ds=20230101L00000001-b6d7-4eaa-8004-ac7d0626bf8d-0.2023-05-31 14:31:36.801œ    ª¿¥          

Here pk=1102 has two records in the log, with ut values 2023-05-31 14:26:14.761 and 2023-05-31 14:31:36.801. When reading with OverwriteNonDefaultsWithLatestAvroPayload, only the 2023-05-31 14:31:36.801 record is returned: the record with the greater preCombineField value wins, and this logic is implemented in HoodieRecordPayload::preCombine.

Related source code

// OverwriteNonDefaultsWithLatestAvroPayload does not override the preCombine method of OverwriteWithLatestAvroPayload
public class OverwriteNonDefaultsWithLatestAvroPayload extends OverwriteWithLatestAvroPayload {
}

public class OverwriteWithLatestAvroPayload extends BaseAvroPayload
    implements HoodieRecordPayload<OverwriteWithLatestAvroPayload> {

  @Override
  public OverwriteWithLatestAvroPayload preCombine(OverwriteWithLatestAvroPayload oldValue) {
    if (oldValue.recordBytes.length == 0) {
      // use natural order for delete record
      return this;
    }
    if (oldValue.orderingVal.compareTo(orderingVal) > 0) {
      // pick the payload with greatest ordering value
      return oldValue;
    } else {
      return this;
    }
  }
}

If PartialUpdateAvroPayload is used instead, the records are merged field by field rather than the latest record simply winning, so the partially written columns (here f0, f1 and f2) all survive in the read result:

_hoodie_commit_time     20230531164701237
_hoodie_commit_seqno    20230531164701237_0_1
_hoodie_record_key      1006
_hoodie_partition_path  ds=20230101
_hoodie_file_name       00000000-ad06-474e-a7ac-0580f60307e1-0
ut                      2023-05-31 16:47:02.337
pk                      1006
f0                      1
f1                      2
f2                      3
f3                      NULL
f4                      NULL
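
The payload class comes from the table properties set at creation time, so switching to PartialUpdateAvroPayload means creating the table with the corresponding payload properties. A minimal sketch modeled on the t71 statement above, with a hypothetical table name t72 and assuming the class lives in org.apache.hudi.common.model like the other payload classes:

CREATE TABLE t72 (
    ds BIGINT,
    ut STRING,
    pk BIGINT,
    f0 BIGINT,
    f1 BIGINT,
    f2 BIGINT,
    f3 BIGINT,
    f4 BIGINT
) USING hudi
PARTITIONED BY (ds)
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'pk',
  preCombineField = 'ut',
  hoodie.index.type = 'BUCKET',
  hoodie.bucket.index.num.buckets = '2',
  -- only the payload class differs from the t71 definition
  hoodie.compaction.payload.class = 'org.apache.hudi.common.model.PartialUpdateAvroPayload',
  hoodie.datasource.write.payload.class = 'org.apache.hudi.common.model.PartialUpdateAvroPayload',
  hoodie.datasource.write.operation = 'upsert'
);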

From: https://www.cnblogs.com/aquester/p/17446828.html

    先上编辑器单元的代码:uniteditlink;interfaceusesWindows,Messages,SysUtils,Classes,Graphics,Controls,Forms,Dialogs,StdCtrls,VirtualTrees;typetcomboeditlink=class(TInterfacedObject,IVTEditLink)privateFedit:TComboBox;itemstrs:......