1. presto:
1. array_position(applist_install, 'com.funtomic.matchmasters') > 0
2. split('joyit_daily_mas_cleaner','_')[3]—>presto 从1开始计位(spark从0开始计位)
3. 日期diff:
1. DATE_DIFF('day', DATE_PARSE('20220301', '%Y%m%d'), DATE_PARSE('20220313', '%Y%m%d'))
2. date_diff('day',to_date(datepart,'yyyy-mm-dd'),current_date)=1
4. 日期转化:
+8小时 :format_datetime((cast(create_time as timestamp) + interval '8' hour),'yyyyMMdd') ='20230927'
每月最后一天 :last_day_of_month(to_date(dt,'yyyymmdd')
split_part(try_cast(from_unixtime((try_cast(event_time AS double)+8*60*60*1000)/1000) AS varchar),':',1)
1. 删除分区 :
1. ALTER TABLE table_name DROP IF EXISTS PARTITION(year = 2015, month = 10, day = 1);
2. 修改表名 alter table name rename to new name
3. 修改字段名:Alter table table_name change column 原字段名称 现字段名称 数据类型
4. 新增字段名 alter table detail_flow_test add columns(original_union_id string);
5. 查询是否字符是否包含 select strpos(addr,'北京')
6. json解析 select json_extract(feild,'$.name') as feild_name
7. 去重列转行:array_sort(array_distinct(array_agg(字段名)))
8. 展示存储地址 :show create table 表名
****2. spark
1. 动态分区要放在静态分区之后写入
2. MSCK REPAIR TABLE 修复分区
3. collect_set :将多个值汇总到一行
4. 建iceberg表: using iceberg partitioned by (dt)
5. array_contains(applist_install, 'com.funtomic.matchmasters')
6. 每月最后一天 last_day(to_date('20230727','yyyyMMdd'))
1. 建表(csv格式的)
CREATE EXTERNAL TABLE analyst.huanglu_test
(
`nation` string COMMENT 'from deserializer',
`beyla_id` string COMMENT 'from deserializer',
`push_launch_pv` bigint COMMENT 'from deserializer',
`other_launch_pv` bigint COMMENT 'from deserializer',
`extra1` string COMMENT 'from deserializer',
`extra2` string COMMENT 'from deserializer',
`extra3` string COMMENT 'from deserializer'
)
PARTITIONED BY (`dt` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',','quoteChar' = '\"','escapeChar' = '\\')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://da.results.prod.us-east-1/huanglu/temp/test_20200529改一改这里' ---3 这行要修改
TBLPROPERTIES ('has_encrypted_data'='false','skip.header.line.count'='1')
1. 建表(mongodb json格式的)
CREATE EXTERNAL TABLE game_ue1.ods_gameet_pp_feedback(
`_id` STRING,
u STRING,
app STRING,
uc STRING,
i STRING,
obj STRING,
cont STRING,
imgs STRING,
star double,
biz STRING,
ct BIGINT
) PARTITIONED BY (datepart STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('serialization.format' = '1')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://game.data.us-east-1/data_analyst/server_ods/ods_gameet_pp_feedback'
1. 时间戳转化:
select from_unixtime(unix_timestamp('20230515' ,'yyyyMMdd'), 'yyyy-MM-dd')
cast(create_time as timestamp) + interval '8' hour 增加8小时
select from_unixtime(cast ('1668009600000' AS bigint)/1000+ 86060,'yyyy-MM-dd')
from_unixtime(unix_timestamp(cast(create_time as timestamp) + interval '8' hour ,'yyyyMMdd'),'yyyyMMdd')
1. 动态分区
1. 注意迁移数据的时候需要把分区字段放在最后select出来
set hive.exec.dynamic.partition=true
set hive.exec.dynamic.partition.mode=nonstrict
3. 累计求和
sum(order_pv)OVER (partition by pp_user_id,item_id,item_type,stage
ORDER BY datepart ASC)
**3 tableau**
1. 最大年月:{MAX(DATETRUNC('day',date))}
2. 星期几:DATENAME('weekday', [YourDateField])
3. 根据不同类型日期范围截断:
1. DATETRUNC('date_part', date_expression, [start_of_week])
1. 'date_part' 是指定截断到的时间单位,例如 'year'(年)、'quarter'(季)、'month'(月)、'week'(周)、'day'(日)等。
4. 日期格式:DATEPARSE('yyyy-MM-dd', '2023-01-15')
5. 指定动态分区刷新:--conf spark.sql.sources.partitionOverwriteMode=dynamic
6. spark的参数不需要加set:--conf hive.exec.dynamic.partition.mode=nonstrict
(动态分区严格模式下至少需要一个静态列,所以这时候如果使用动态分区需要增加这个参数)
标签:COMMENT,STRING,sql,date,deserializer,day,string,函数
From: https://www.cnblogs.com/iris0928/p/18387916