Introduction
If the data you are analyzing in ClickHouse grows over time, you may want to move, remove, or summarize older data on a schedule. Typically, your options here will depend on your data retention requirements and on whether query SLAs vary with the age of the data. For example, while storing all dimensions at the highest available granularity is usually required for the latest data, it may be acceptable to store historical data at a lower level of detail.
Managing the data lifecycle can help optimize storage as well as improve query performance. ClickHouse has a simple but powerful data lifecycle management tool configured with the TTL clause of DDL statements.
In this blog post, we explore the TTL clause and how it can be used to solve a number of data management tasks.
Automatically delete expired data
Sometimes, it no longer makes sense to store older data, and it can be removed from ClickHouse. This is usually called a retention policy. Data removal is done automatically in the background based on conditions in a TTL statement.
Suppose we have the events table, and we would like to delete all records older than one month:
CREATE TABLE events
(
`event` String,
`time` DateTime,
`value` UInt64
)
ENGINE = MergeTree
ORDER BY (event, time)
TTL time + INTERVAL 1 MONTH DELETE
We add the TTL clause and use the DELETE operation with time + INTERVAL 1 MONTH. This will remove a record when its time column value is more than one month in the past.
Note that record removal is an asynchronous background process, so outdated records can remain available for some time.
Let's try to insert a couple of records, including an outdated one, into the table:
INSERT INTO events VALUES('error', now() - interval 2 month, 123), ('error', now(), 123);
We'll find both records available in the table right after the insert:
SELECT * FROM events
┌─event─┬────────────────time─┬─value─┐
│ error │ 2022-11-24 09:34:44 │ 123 │
│ error │ 2023-01-24 09:34:44 │ 123 │
└───────┴─────────────────────┴───────┘
The outdated record will be removed in the background by a merge, which is scheduled periodically. Some time later:
SELECT * FROM events
┌─event─┬────────────────time─┬─value─┐
│ error │ 2023-01-24 09:34:44 │ 123 │
└───────┴─────────────────────┴───────┘
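If we don't want to wait for a scheduled merge, we can trigger TTL processing manually by forcing an unscheduled merge. This is a heavyweight operation, so it should be used sparingly:
OPTIMIZE TABLE events FINAL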
Managing background removal
Background removal happens every 4 hours by default and can be controlled using the merge_with_ttl_timeout table setting:
CREATE TABLE events
...
TTL time + INTERVAL 1 MONTH DELETE
SETTINGS merge_with_ttl_timeout = 1200
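This setting can also be adjusted on an existing table; a minimal sketch, assuming the events table from above:
ALTER TABLE events MODIFY SETTING merge_with_ttl_timeout = 3600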
Do not use values smaller than 300 seconds for this setting, as this can create I/O overhead and impact cluster performance. After some time, we can find that the outdated record no longer exists in our table:
SELECT * FROM events
┌─event─┬────────────────time─┬─value─┐
│ error │ 2023-01-24 09:34:44 │ 123 │
└───────┴─────────────────────┴───────┘
Filter which rows to delete
Let's assume we would like to delete only a specific type of record (e.g., rows where the event column value is 'error'). We can additionally specify this in the WHERE clause of the TTL statement:
CREATE TABLE events
(
`event` String,
`time` DateTime,
`value` UInt64
)
ENGINE = MergeTree
ORDER BY (event, time)
TTL time + INTERVAL 1 MONTH DELETE WHERE event = 'error'
Now, only outdated rows with event = 'error' will be removed:
INSERT INTO events VALUES('not_error', now() - interval 2 month, 123), ('error', now(), 123)
We can be sure the not_error record will not be removed (though it's older than one month):
SELECT * FROM events
┌─event─────┬────────────────time─┬─value─┐
│ error │ 2023-01-24 09:48:05 │ 123 │
│ not_error │ 2022-11-24 09:48:05 │ 123 │
└───────────┴─────────────────────┴───────┘
Multiple delete conditions
ClickHouse allows specifying multiple TTL rules. This lets us be more flexible and specific about what to remove and when. Let's say we want to remove all non-error events after 1 month and all error events after 6 months:
CREATE TABLE events
(
`event` String,
`time` DateTime,
`value` UInt64
)
ENGINE = MergeTree
ORDER BY (event, time)
TTL time + INTERVAL 1 MONTH DELETE WHERE event != 'error',
time + INTERVAL 6 MONTH DELETE WHERE event = 'error'
We can have any number of rules configured in the TTL clause.
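TTL rules don't have to be fixed at table creation time; they can also be changed later with ALTER TABLE ... MODIFY TTL, which replaces the table's TTL expression. For example, a sketch against our existing events table:
ALTER TABLE events MODIFY TTL time + INTERVAL 1 MONTH DELETE WHERE event != 'error',
    time + INTERVAL 6 MONTH DELETE WHERE event = 'error'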
Move data to historical tables
We can use Materialized Views in combination with a TTL statement to address cases where we want to move outdated data to another table before it is removed from the main table.
Suppose we want to move error events to the errors_history table before removing them from the events table. First, we create a target table for the materialized view, with the same structure as the events table:
CREATE TABLE errors_history (
`event` String,
`time` DateTime,
`value` UInt64
)
ENGINE = MergeTree
ORDER BY (event, time)
Note that we can't use CREATE TABLE errors_history AS events, as this would also copy the TTL expression, which we don't want here. We then create a materialized view to automatically ingest data into the errors_history table:
CREATE MATERIALIZED VIEW errors_history_mv TO errors_history AS
SELECT * FROM events WHERE event = 'error'
Now, when we insert data into the events table, error events are automatically inserted into the errors_history table as well. On the other hand, when the TTL procedure removes records from the events table, they remain in the errors_history table.
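As a quick illustrative check (assuming the tables above, and that TTL merges have already run), errors that have been removed from the main table should still be queryable from the history table:
SELECT count() FROM events WHERE event = 'error';
SELECT count() FROM errors_history;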
Compacting historical data using aggregations
In many cases we don't want to delete the data, but we can afford to reduce its level of detail in order to save resources. For example, let's consider a situation where we don't want to remove error events from our table, but we no longer need per-second detail after one month and can keep daily aggregated numbers instead.
This can be implemented using the GROUP BY ... SET clause of the TTL statement:
CREATE TABLE events
(
`event` String,
`time` DateTime,
`value` UInt64
)
ENGINE = MergeTree
ORDER BY (toDate(time), event)
TTL time + INTERVAL 1 MONTH GROUP BY toDate(time), event SET value = SUM(value)
Several important points here:
- The GROUP BY toDate(time), event expression must be a prefix of the primary key, so we also changed the table's ORDER BY to (toDate(time), event).
- The SET value = SUM(value) clause will set the value column to the sum of all values in each group (the usual behavior of GROUP BY in queries).
- The time column value will be chosen arbitrarily within each group (as when using the any() aggregate function).
Let's insert the following data:
INSERT INTO events VALUES('error', now() - interval 2 month, 123),
('error', now() - interval 2 month, 321);
After some time, when a background merge happens, we can see that our data has been aggregated:
SELECT * FROM events
┌─event─┬────────────────time─┬─value─┐
│ error │ 2022-11-24 12:36:23 │ 444 │
└───────┴─────────────────────┴───────┘
Changing compression
While it may not be feasible to remove or aggregate data, you may have more relaxed query SLAs for your historical data. To address this, we can use higher compression levels for older data to save space. For example, we can ask ClickHouse to recompress data older than one month with a higher level of LZ4HC, using the RECOMPRESS clause:
CREATE TABLE events
(
`event` String,
`time` DateTime,
`value` UInt64
)
ENGINE = MergeTree
ORDER BY (toDate(time), event)
TTL time + INTERVAL 1 MONTH RECOMPRESS CODEC(LZ4HC(10))
Note that recompressed data will take less space but will also take more time to compress, adding overhead to the background merges that apply the recompression.
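To observe the effect, we can compare the compressed and uncompressed sizes of the table's active parts using the system.parts system table; a sketch:
SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.parts
WHERE table = 'events' AND active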
Column-level TTL
We can also manage the lifecycle of a single column using TTL. Suppose we have a debug column in our table that stores additional debug information (e.g., an error backtrace). This column is only useful for one week, and it consumes a lot of space. We can ask ClickHouse to reset it to the default value after a week:
CREATE TABLE events
(
`event` String,
`time` DateTime,
`value` UInt64,
`debug` String TTL time + INTERVAL 1 WEEK
)
ENGINE = MergeTree
ORDER BY (event, time)
Now let's insert an outdated record with a debug column value:
INSERT INTO events VALUES('error', now() - interval 1 month, 45, 'a lot of details');
ClickHouse will reset this debug column value to an empty string once TTL processing has completed:
SELECT * FROM events
┌─event─┬────────────────time─┬─value─┬─debug─┐
│ error │ 2022-12-24 15:13:54 │ 45 │ │
└───────┴─────────────────────┴───────┴───────┘
Note that a column TTL resets the column to its default value. So if a DEFAULT expression is present for the column, outdated values will be replaced with that expression's value instead.
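For example, a hypothetical variant of our table where expired debug values become the string 'expired' rather than an empty string:
CREATE TABLE events
(
    `event` String,
    `time` DateTime,
    `value` UInt64,
    `debug` String DEFAULT 'expired' TTL time + INTERVAL 1 WEEK
)
ENGINE = MergeTree
ORDER BY (event, time)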
Moving data between hot and cold storage
For self-managed deployments, consider using storage management for your data. Typically, users need faster but smaller disks (called hot, e.g., SSD) and larger but slower ones (called cold, e.g., HDD or S3). ClickHouse allows configuring a storage policy so that data is moved to a slower device once usage of the faster device reaches a specified threshold. The storage policy is configured in two steps: declaring a list of disks, and then defining a policy over them.
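On the table side, a TTL rule can then move older data to the cold tier. A minimal sketch, assuming a storage policy named hot_cold with a volume named cold has already been declared in the server configuration:
CREATE TABLE events
(
    `event` String,
    `time` DateTime,
    `value` UInt64
)
ENGINE = MergeTree
ORDER BY (event, time)
TTL time + INTERVAL 1 MONTH TO VOLUME 'cold'
SETTINGS storage_policy = 'hot_cold'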
Alternatively, consider using ClickHouse Cloud, which avoids this complexity by separating storage and compute using object storage. When coupled with intelligent caching for data queried more often, typically your newer "hot" data, tiered architectures are no longer required.
Summary
ClickHouse provides powerful data lifecycle management tools that enable automatic removal, compaction, or movement of data between different storage types. Compaction and retention can be configured at the table level using the TTL clause. Use disk policies to manage storage, or consider ClickHouse Cloud as a reliable solution for scaling.
For more details on TTL, see our recently published guide.