Filling gaps in time-series data
When working with time-series data, there can be gaps in the data due to missing data or inactivity.
Typically, we don't want those gaps to exist when we query the data. In this case, the WITH FILL
clause can come in handy.
This guide discusses how to use WITH FILL
to fill gaps in your time-series data.
Setup
Imagine we've got the following table that stores metadata on images generated by a GenAI image service:
CREATE TABLE images
(
`id` String,
`timestamp` DateTime64(3),
`height` Int64,
`width` Int64,
`size` Int64
)
ENGINE = MergeTree
ORDER BY (size, height, width);
Let's import some records:
INSERT INTO images VALUES (1088619203512250448, '2023-03-24 00:24:03.684', 1536, 1536, 2207289);
INSERT INTO images VALUES (1088619204040736859, '2023-03-24 00:24:03.810', 1024, 1024, 1928974);
INSERT INTO images VALUES (1088619204749561989, '2023-03-24 00:24:03.979', 1024, 1024, 1275619);
INSERT INTO images VALUES (1088619206431477862, '2023-03-24 00:24:04.380', 2048, 2048, 5985703);
INSERT INTO images VALUES (1088619206905434213, '2023-03-24 00:24:04.493', 1024, 1024, 1558455);
INSERT INTO images VALUES (1088619208524431510, '2023-03-24 00:24:04.879', 1024, 1024, 1494869);
INSERT INTO images VALUES (1088619208425437515, '2023-03-24 00:24:05.160', 1024, 1024, 1538451);
Querying by bucket
We're going to explore the images created between 00:24:03
and 00:24:04
on the 24th March 2023, so let's create some parameters for those points in time:
SET param_start = '2023-03-24 00:24:03',
param_end = '2023-03-24 00:24:04';
Next, we'll write a query that groups the data into 100ms buckets and returns the count of images created in that bucket:
SELECT
toStartOfInterval(timestamp, toIntervalMillisecond(100)) AS bucket,
count() AS count
FROM MidJourney.images
WHERE (timestamp >= {start:String}) AND (timestamp <= {end:String})
GROUP BY ALL
ORDER BY bucket ASC