标签：toDate 00 01 expr time 补齐 Clickhouse FILL

1 背景引入

背景需求: 缺失数据的补齐

要聚合查询一些数据，按照时间 15 分钟聚合数据量或根据用户提供的入参startTime/endTime动态筛选指定的若干月份聚合数据。但是由于某些时间段（如：某些月份）数据缺失，一些时段值没有，前端显示就会有问题，需要数据接口开发者自己插值解决
有一种实现方式: 在查询出库后，在服务端使用代码补全这些数据。

但是觉得这样有点 LOW，应该有更好的方法

经过一番调研，发现 Clickhouse 的 WITH FILL 正符合需求

2 WITH FILL 简介

WITH FILL 修饰符配合 ORDER BY 使用，就是为了按照指定顺序填值————有值跳过，无值插入
可以在 ORDER BY expr 之后用可选的 FROM expr、TO expr和 STEP expr 参数来设置 WITH FILL 修饰符

from、to确定补值的范围，step是步进。若没有设置，则使用默认值

针对DateTime类型，默认步进单位是 : 1秒

针对Date类型，默认步进单位是 : 1天

针对数字类型，默认步进单位是 : 1.0

所有expr列的缺失值将被顺序填充，而其他列将被填充为默认值

使用以下语法填充多列，在ORDER BY部分的每个字段名称后添加带有可选参数的WITH FILL修饰符。

ORDER BY expr 
[WITH FILL] 
[FROM const_expr] [TO const_expr] 
[STEP const_numeric_expr], ... exprN 

[WITH FILL] 
[FROM expr] [TO expr] 
[STEP numeric_expr]

WITH FILL可以针对单字段使用，也可以多字段使用，但是仅适用于具有数字（所有类型的浮点，小数，整数）或日期/日期时间类型的字段
当未定义 FROM const_expr 填充顺序时，则使用 ORDER BY 中的最小expr字段值
当定义了 STEP const_numeric_expr 时，不同类型（就3种）表示不一样：

对于数字类型，const_numeric_expr 为具体数值；

当作为日期类型(Date)，const_numeric_expr 为多少天；

当作为日期时间(DateTime)类型时，const_numeric_expr 为多少秒。

如果省略了 STEP const_numeric_expr，则：填充顺序使用1.0表示数字类型，1 day表示日期类型，1 second 表示日期时间类型。

3 应用场景

CASE: 单字段、基于数字、按月份逐一补0

with tmp_device_active_days as ( -- 设备每月活跃日数据集 (模拟从数据库查出的初始数据)
	-- 动态计算指定月份的总天数 : toDayOfMonth(subtractDays(addMonths(toStartOfMonth( monthFirstDay ), 1), 1) )  as totalMonthDays
	select 'XXXX001' as deviceId, toDate('2023-09-01') as monthFirstDay , 30 as totalMonthDays , 17 as realActiveMonthDays
	union all
	select 'XXXX001' as deviceId, toDate('2023-10-01') as monthFirstDay , 31 as totalMonthDays , 14 as realActiveMonthDays
	union all
	select 'XXXX002' as deviceId, toDate('2024-09-01') as monthFirstDay , 30 as totalMonthDays , 1 as realActiveMonthDays
	union all
	select 'XXXX003' as deviceId, toDate('2023-06-01') as monthFirstDay , 30 as totalMonthDays , 2 as realActiveMonthDays
	union all
	select 'XXXX004' as deviceId, toDate('2023-08-01') as monthFirstDay , 31 as totalMonthDays , 5 as realActiveMonthDays
	union all
	select 'XXXX005' as deviceId, toDate('2023-05-01') as monthFirstDay , 31 as totalMonthDays , 8 as realActiveMonthDays
	union all
	select 'XXXX006' as deviceId, toDate('2023-12-01') as monthFirstDay , 31 as totalMonthDays , 17 as realActiveMonthDays
	union all
	select 'XXXX007' as deviceId, toDate('2024-08-01') as monthFirstDay , 31 as totalMonthDays , 2 as realActiveMonthDays
	union all
	select 'XXXX008' as deviceId, toDate('2024-07-01') as monthFirstDay , 30 as totalMonthDays , 10 as realActiveMonthDays
	union all
	select 'XXXX009' as deviceId, toDate('2024-02-01') as monthFirstDay , 30 as totalMonthDays , 10 as realActiveMonthDays
)
select
    formatDateTime( toStartOfMonth( addMonths( toDate('1970-01-01') , relativeMonth ) ) , '%Y%m' , 'Asia/Shanghai' ) as `date`
    , toUInt64(0) as newDevices
    , activeDevices
FROM (
    SELECT
        relativeMonth
        , count(deviceId) as activeDevices
    FROM (
        SELECT
            *
            , dateDiff('month', toDate('1970-01-01'), monthFirstDay ) as relativeMonth
        FROM tmp_device_active_days
        where 1 = 1 and realActiveMonthDays > 5 -- 每月活跃天数 > 5 天的设备 -- realActiveMonthDays = totalMonthDays 
    ) y
    group by relativeMonth 
    
    order by relativeMonth ASC 
    WITH FILL
    -- dateDiff 计算结果是 数字（距 1970-01-01 的月份数，以此实现对缺数据月份的自动补齐）
    FROM dateDiff('month', toDate('1970-01-01') , toDate( toDateTime64('{{startTime}}', 3 ,'{{timeZone}}') ) )  TO dateDiff('month', toDate('1970-01-01'), toDate( toDateTime64('{{endTime}}', 3 ,'{{timeZone}}') ) )
    STEP 1
)

用户传入的动态参数

startTime , 时间戳、单位: 毫秒(13位)。如: 1684857600000

endTime , 时间戳、单位: 毫秒(13位)。如 : 1795462399999

timeZone , 时区，如："Asia/Shanghai"

output : 不加 WITH FILL 时

--    order by relativeMonth ASC 
--    WITH FILL
--    
--    FROM dateDiff('month', toDate('1970-01-01') , toDate( toDateTime64('1684857600000', 3 ,'Asia/Shanghai') ) )  TO dateDiff('month', toDate('1970-01-01'), toDate( toDateTime64('1795462399999', 3 ,'Asia/Shanghai') ) )
--    STEP 1

date  |newDevices|activeDevices|
------+----------+-------------+
202407|         0|            1|
202402|         0|            1|
202305|         0|            1|
202312|         0|            1|
202310|         0|            1|
202309|         0|            1|

output : 加 WITH FILL 时

date  |newDevices|activeDevices|
------+----------+-------------+
202305|         0|            1|
202306|         0|            0|
202307|         0|            0|
202308|         0|            0|
202309|         0|            1|
202310|         0|            1|
202311|         0|            0|
202312|         0|            1|
202401|         0|            0|
202402|         0|            1|
202403|         0|            0|
202404|         0|            0|
202405|         0|            0|
202406|         0|            0|
202407|         0|            1|
202408|         0|            0|
202409|         0|            0|
202410|         0|            0|
202411|         0|            0|
202412|         0|            0|
202501|         0|            0|
202502|         0|            0|
202503|         0|            0|
202504|         0|            0|
202505|         0|            0|
202506|         0|            0|
202507|         0|            0|
202508|         0|            0|
202509|         0|            0|
202510|         0|            0|
202511|         0|            0|
202512|         0|            0|
202601|         0|            0|
202602|         0|            0|
202603|         0|            0|
202604|         0|            0|
202605|         0|            0|
202606|         0|            0|
202607|         0|            0|
202608|         0|            0|
202609|         0|            0|
202610|         0|            0|

CASE : 单字段、基于 DateTime 、按N分钟间隔时段补0

此处是对表passing_vehicle 进行数据统计，把一天按照15分钟间隔，获取96条过车数据
toStartOfInterval(time_stamp, INTERVAL 15 minute) as time_stamp2是对时间戳进行聚合，按照15分钟间隔
GROUP by time_stamp2,approach是按照15分钟和进口聚合
一开始的SQL是 GROUP by time_stamp2,approach order by time_stamp2,approach，查询条件也没有and approach = 'SB'，结果插值补全有问题，多字段聚合排序，只补值一个字段，得到的不是自己的想要的结果

SELECT 
	toStartOfInterval(time_stamp, INTERVAL 15 minute) as time_stamp2 ,
	approach,lane_nbr, 
	count() as totalVolume
from passing_vehicle pv 
WHERE 
	intersection_number = 1687001
	and time_stamp > '2023-05-08 00:00:00'
	and time_stamp < '2023-05-09 00:00:00'
	and approach = 'SB'
	and status = 1
GROUP by time_stamp2
order by time_stamp2
WITH FILL 
FROM toDateTime('2023-05-08 00:00:00')  TO toDateTime('2023-05-09 00:00:00') 
STEP 15*60

现在这条SQL能正常运行和补值，它查询的是，某路口南进口（SB）的15分钟过车流量统计
FROM toDateTime('2023-05-08 00:00:00') TO toDateTime('2023-05-09 00:00:00') 是时间范围，这个和我的查询时间是对应的，这个范围内补全。注意这个类型，一定要是时间日期，即使用toDateTime函数，将字符串转成日期时间类型
STEP 15*60，是将插值补全，步进为15分钟，日期时间类型，默认是1s，15分钟乘以对应秒数即可

注：如果是要按照一天（或n天）去做时间聚合统计，就需要使用 toDate 函数，相应步进为n天

CASE : 多字段补全

进阶，按照进口查询多次，多个字段补全

SELECT 
	toStartOfInterval(time_stamp, INTERVAL 15 minute) as time_stamp2,
	lane_nbr,
	sum(status) as totalVolume
from passing_vehicle pv 
WHERE 1 = 1 
	and intersection_number = 1687001
	and time_stamp >= '2023-05-08 00:00:00'
	and time_stamp <= '2023-05-08 01:59:59'
	and approach = 'NB'
GROUP by time_stamp2, lane_nbr 
order by time_stamp2 
WITH FILL 
	FROM toDateTime('2023-05-08 00:00:00') TO toDateTime('2023-05-08 01:59:59') 
	STEP 15 * 60,

-- 注意：这里的6，是传参进来的，因为是5车道，lane_nbr需要补全1-5（最大是5，范围要是6才行）
lane_nbr 
WITH FILL 
	FROM 1 TO 6 
	STEP 1

这条SQL查询的是，某路口北进口（NB）各个车道15分钟过车数量统计
GROUP by time_stamp2,lane_nbr，聚合条件是15分钟，和车道编号
time_stamp2 WITH FILL与上面一样，按照15分钟补全
lane_nbr WITH FILL FROM 1 TO 6 STEP 1是按照车道编号补全，注意to的值要比最大值大1
这2个 WITH FILL 一起使用，就会把时间和车道编号都补全，得到我们想要的结果
对于非数值/日期/日期时间类型的字段，如果是枚举类型字符串类型字段，也可以使用补全，将该字段的枚举值，存成连续新增的数值即可

X 参考文献

clickhouse

https://clickhouse.tech/docs/en/sql-reference/statements/select/

标签：toDate,00,01,expr,time,补齐,Clickhouse,FILL
From： https://www.cnblogs.com/johnnyzen/p/18441359

[Clickhouse] Clickhouse 特性 : WITH FILL(缺失段数据补齐)

1 背景引入

背景需求: 缺失数据的补齐

2 WITH FILL 简介

3 应用场景

CASE: 单字段、基于数字、按月份逐一补0

CASE : 单字段、基于 DateTime 、按N分钟间隔时段补0

CASE : 多字段补全

X 参考文献

相关文章

赞助商

阅读排行