ClickHouse Release 25.12

neutral avatar 400804ae96
Jan 14, 2026 Β· 23 minutes read

Another month goes by, which means it’s time for another release!

ClickHouse version 25.12 contains 26 new features πŸŽ„ 31 performance optimizations `β›Έ` 129 bug fixes β˜ƒ

This release we have faster top-n queries with data skipping indexes, faster lazy reading with join-style execution model, and more!

New contributors

A special welcome to all the new contributors in 25.12! The growth of ClickHouse's community is humbling, and we are always grateful for the contributions that have made ClickHouse so popular.

Below are the names of the new contributors:

Alex Soffronow Pagonidis, Alexey Bakharew, Govind R Nair, Jeremy Aguilon, Kirill Kopnev, LeeChaeRok, MakarDev, Paresh Joshi, Revertionist, Sam Kaessner, Seva Potapov, Shaurya Mohan, SΓΌmer Cip, Yonatan-Dolan, alsugiliazova, htuall, ita004, jetsetbrand, mostafa, punithns97, rainac1

Hint: if you’re curious how we generate this list… here.

You can also view the slides from the presentation.

Faster Top-N queries with data skipping indexes

Contributed by Shankar Iyer

Why this matters
ClickHouse is the fastest production-grade analytical DBMS on ClickBench, a benchmark based on real-world web analytics traffic with results published for more than 70 database systems. It is also the fastest production-grade analytical system on JSONBench among all major engines with first-class JSON support.

ClickHouse’s performance leadership comes from many engine-level optimizations for common analytical query patterns. Top-N queries are one of these patterns. They appear frequently in real workloads. More than half of the queries in ClickBench and a significant fraction of those in JSONBench follow a Top-N pattern. This section focuses on the specific optimizations ClickHouse applies to make Top-N queries faster.


Top-N queries (e.g. ORDER BY … LIMIT N) are one of the most common analytical query patterns, appearing frequently in dashboards, monitoring, ranking reports, and exploratory analysis.

In ClickHouse 25.12, we introduce a new optimization that makes these queries significantly faster by using data skipping indexes to avoid reading data that cannot possibly affect the Top-N result.

Instead of scanning all relevant granules and then sorting or filtering, ClickHouse can now compare the current Top-N threshold against granule-level min/max metadata and skip entire granules before any data is read. This works both:

  • statically, for simple Top-N queries without predicates, and

  • dynamically, for queries with filters, where the threshold tightens as execution progresses.

In our tests, this reduced the amount of data read by one to two orders of magnitude and sped up Top-N queries by 5–10Γ—, with even larger benefits on large tables and cold cache.

For a detailed explanation of how this works internally, including examples and benchmarks, see the standalone deep dive: How ClickHouse makes Top-N queries faster with granule-level data skipping


Community feedback
We have received initial feedback from a community member, Michael Jarrett (@EmeraldShift), who has contributed across several recent issues and pull requests and tested these optimizations on very large production tables.

He observed that the skip index optimization is remarkably fast, under 0.2 seconds on a table with 50 billion rows, validating both the performance characteristics and the overall design in real-world settings. Further improvements are expected to make this even faster.

Faster lazy reading with join-style execution model

Contributed by Nikolai Kochetov

In the previous section, we introduced lazy reading as one of the low-level optimizations that helps accelerate Top-N queries by deferring reads of non-ordering columns until the final result set is known.

In ClickHouse 25.12, lazy reading itself became significantly faster. The underlying execution model was redesigned, allowing lazy reading to scale efficiently to much larger LIMIT values and unlock additional performance gains.

Lazy reading before 25.12

When lazy reading was first introduced, it intentionally shipped with a conservative default.

The execution model worked as follows:

  1. ClickHouse first read only the columns needed for sorting.

  2. After the Top-N rows were identified, it fetched the remaining selected columns row by row.

This meant that if a query returned N rows and selected M non-ordering columns, ClickHouse performed roughly N Γ— M individual column reads.

While this avoided reading unnecessary data, it also resulted in many small, scattered reads. For larger LIMIT values, the overhead of these per-row lookups could outweigh the benefits of deferred I/O.

For that reason, the original lazy materialization blog post noted:

Note: Lazy materialization is applied automatically for LIMIT N queries, but only up to a N threshold. This is controlled by the query_plan_max_limit_for_lazy_materialization setting (default: 10).

What changed in 25.12

In ClickHouse 25.12, lazy reading switches to a join-style execution model, benefiting from the same vectorized, parallel execution model as ”normal” joins, making it efficient even for large LIMIT values.

Conceptually, the pipeline now looks like this:

  1. Read the ORDER BY columns and determine the Top-N rows (unchanged).

  2. Materialize a compact set of row identifiers.

  3. Perform a single batched lookup that joins those row identifiers back to the base table.

The default value of query_plan_max_limit_for_lazy_materialization is now 10,000. This limit exists because lazy reading still requires a sort of the lazily materialized columns, which can become noticeable at very large LIMIT values.

Lazy reading continues to work well with read-in-order execution, which typically uses less memory than full sorting.

Demonstration

We illustrate the effect using the same anonymized web analytics dataset as in the previous section. The data was loaded on an AWS m6i.8xlarge instance (32 cores, 128 GB RAM) backed by a gp3 EBS volume.

To stress lazy reading, we use a Top-N query with a large LIMIT of 100,000 rows.

For this experiment, we explicitly disable the lazy-materialization limit by setting query_plan_max_limit_for_lazy_materialization to 0, allowing lazy reading to apply without restriction.

SELECT * 
FROM hits 
ORDER BY EventTime 
LIMIT 100000
FORMAT Hash
SETTINGS query_plan_max_limit_for_lazy_materialization = 0;

Lazy reading before 25.12

In ClickHouse 25.11, lazy reading still materialized remaining columns row by row.

For this query, that meant fetching 104 non-order columns for each of the 100,000 Top-N rows –> around 10 million individual column lookups.

The fastest of three runs completed in 38.7 seconds:

100000 rows in set. Elapsed: 38.702 sec. Processed 100.00 million rows, 1.20 GB (2.58 million rows/s., 31.01 MB/s.)
Peak memory usage: 981.92 MiB.

100000 rows in set. Elapsed: 34.193 sec. Processed 100.00 million rows, 1.20 GB (2.92 million rows/s., 35.09 MB/s.)
Peak memory usage: 977.59 MiB.

100000 rows in set. Elapsed: 34.152 sec. Processed 100.00 million rows, 1.20 GB (2.93 million rows/s., 35.14 MB/s.)
Peak memory usage: 953.25 MiB.

As discussed earlier, for large Top-N limits (such as the LIMIT 100,000 in this example), the overhead of per-row lazy materialization can outweigh the benefits of deferred I/O.

We can illustrate this by disabling lazy reading entirely. By setting query_plan_optimize_lazy_materialization = false, ClickHouse falls back to eagerly reading all selected columns up front, avoiding the row-by-row lookup overhead.

With lazy reading disabled, the fastest of three runs completed in 7.014 seconds:

100000 rows in set. Elapsed: 7.910 sec. Processed 100.00 million rows, 56.83 GB (12.64 million rows/s., 7.18 GB/s.)
Peak memory usage: 73.75 GiB.

100000 rows in set. Elapsed: 7.847 sec. Processed 100.00 million rows, 56.83 GB (12.74 million rows/s., 7.24 GB/s.)
Peak memory usage: 76.67 GiB.

100000 rows in set. Elapsed: 7.014 sec. Processed 100.00 million rows, 56.83 GB (14.26 million rows/s., 8.10 GB/s.)
Peak memory usage: 77.58 GiB.

This confirms why lazy reading originally shipped with a conservative default. For large LIMIT values, eager sequential reads can outperform row-by-row lazy materialization, even though they require reading significantly more data and consume substantially more memory.

Lazy reading in 25.12

Rather than fetching 104 columns for each row individually, ClickHouse now performs a single join-like pass to materialize all remaining columns for the Top-100,000 rows at once.

The fastest of three runs completed in 0.513 seconds:

100000 rows in set. Elapsed: 0.513 sec. Processed 104.87 million rows, 3.56 GB (204.62 million rows/s., 6.94 GB/s.)
Peak memory usage: 967.19 MiB.

100000 rows in set. Elapsed: 0.524 sec. Processed 104.87 million rows, 3.56 GB (200.13 million rows/s., 6.79 GB/s.)
Peak memory usage: 951.09 MiB.

100000 rows in set. Elapsed: 0.520 sec. Processed 104.87 million rows, 3.56 GB (201.77 million rows/s., 6.85 GB/s.)
Peak memory usage: 953.08 MiB.

This is roughly 75 times faster than with the previous lazy reading mechanics (and 14 times faster than without lazy reading).

Looking under the hood

We can confirm the change in lazy reading mechanics by inspecting the logical query plans for the same query using EXPLAIN PLAN:

EXPLAIN PLAN
SELECT * 
FROM hits 
ORDER BY EventTime 
Limit 100000
SETTINGS query_plan_max_limit_for_lazy_materialization = 0;

On 25.11, with the old mechanics for lazy reading the plan shows (read it from bottom to top) how ClickHouse plans to

  1. Read data from the table ( only the ORDER BY column).
  2. Sort the data
  3. Apply the LIMIT
  4. Then lazily fetch the remaining non-order columns row by row.
LazilyRead
...
Limit
...
Sorting
...
ReadFromMergeTree (default.hits)

On 25.12, the plan changes in a fundamental way.

The engine still:

  1. Reads only the ORDER BY column.
  2. Sorts the data
  3. Applies the LIMIT

But instead of row-by-row materialization, it now introduces a dedicated join step to fetch the remaining columns in bulk from the base table:

JoinLazyColumnsStep
    ...
    Limit
    ...
    Sorting
    ...
    ReadFromMergeTree (default.hits) 
LazilyReadFromMergeTree

Faster Joins with a more powerful join reordering algorithm

Contributed by Alexander Gololobov

Top-N queries got faster. Lazy reading got faster. And, unsurprisingly for a ClickHouse release, joins got faster as well. ClickHouse 25.12 ships with a simple (for now) but more powerful join reordering algorithm for INNER JOINs - DPsize - that explores more join orders than the existing greedy approach, often producing more efficient execution plans with less intermediate data.

Join reordering primer

As a quick reminder, when multiple tables are joined, the join order does not affect correctness, but it can dramatically affect performance. Different join orders can produce vastly different amounts of intermediate data. Since ClickHouse’s default hash-based join algorithms build in-memory structures from one side of each join, choosing a join order that keeps build inputs small is critical for fast and efficient execution.

DPsize join reordering algorithm

DPsize is one of the classic join reordering algorithms and is used, often with variations, in many database systems, including PostgreSQL and IBM DB2.

It can be considered the classical dynamic programming approach (hence the name DPsize) that constructs the optimal join order based on a given cost model in a bottom-up fashion:

  • It starts with the simplest plans: single tables.
  • It then builds optimal plans for pairs of tables.
  • Then for three tables, four tables, and so on.
  • At each step, it constructs larger join plans by combining two already-optimal smaller subplans.

You can think of it as gradually assembling the final join tree, increasing its size step by step while always reusing the cheapest plans found so far.

The trade-off is optimize-cost:

  • DPsize explores many more possible join orders than greedy algorithms.
  • In practice, this means DPsize is more powerful, but also more optimizer-time expensive, than greedy reordering.

This is why many systems historically default to greedy join ordering and only use DP-based approaches selectively.

DPsize in ClickHouse 25.12

In ClickHouse 25.12, DPsize is introduced as an additional option for INNER JOIN reordering: a simple but more expressive algorithm that explores a richer set of join orders and can produce better execution plans for more complex join graphs.

A new experimental setting controls what algorithms are used in which order, e.g. query_plan_optimize_join_order_algorithm='dpsize,greedy' means that DPsize is tried first with fallback to greedy.

Demonstration

Let’s see DPsize in action using the classic TPC-H join benchmark.

We reuse the same eight TPC-H tables (extended with column statistics) from an earlier release post that introduced global join reordering. The data is loaded with a scale factor of 100, and the benchmark is run on an AWS EC2 m6i.8xlarge instance (32 vCPUs, 128 GiB RAM).

Note that we run all join queries with these settings, to ensure that join reordering and statistics-based optimization are fully enabled:

SET allow_experimental_analyzer = 1;
SET query_plan_optimize_join_order_limit = 10;
SET allow_statistic_optimize = 1;
SET query_plan_join_swap_table='auto';
SET enable_parallel_replicas = 0;

We use the same eight-table TPC-H join query that we previously used to introduce global join reordering. The query itself is unchanged; the only difference is the join reordering strategy used by the optimizer. We run it once with the existing greedy join reordering algorithm, and once with DPsize, by setting:

  • query_plan_optimize_join_order_algorithm = 'greedy'
  • query_plan_optimize_join_order_algorithm = 'dpsize'

This allows us to directly compare the execution plans produced by the two algorithms on an identical workload.

SELECT
  n_name,
  sum(l_extendedprice * (1 - l_discount)) AS revenue
FROM
  customer,
  orders,
  lineitem,
  supplier,
  nation,
  region
WHERE
  c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND l_suppkey = s_suppkey
AND c_nationkey = s_nationkey
AND s_nationkey = n_nationkey
AND n_regionkey = r_regionkey
AND r_name = 'ASIA'
AND o_orderdate >= DATE '1994-01-01'
AND o_orderdate < DATE '1994-01-01' + INTERVAL '1' year
GROUP BY
  n_name
ORDER BY
  revenue DESC
SETTINGS
    query_plan_optimize_join_order_algorithm = 'greedy';
    -- query_plan_optimize_join_order_algorithm = 'dpsize';

When we first run the query with both algorithms using EXPLAIN, we can see the join order chosen by each optimizer.

Greedy join order:

(((lineitem β‹ˆ (orders β‹ˆ customer)) β‹ˆ (supplier β‹ˆ (nation β‹ˆ region))))

The greedy algorithm proceeds as follows:

  • orders is joined with customer
  • nation is joined with region, and the result is joined with supplier
  • The result of orders β‹ˆ customer is joined with lineitem
  • Finally, the two larger intermediate results are joined together

This plan is built incrementally by repeatedly extending the current plan with what looks like the cheapest next join at each step.

DPsize join order:

((lineitem β‹ˆ orders) β‹ˆ (supplier β‹ˆ (nation β‹ˆ region))) β‹ˆ customer

With DPsize, the join order is different:

  • lineitem is joined with orders.
  • nation is joined with region, and the result is joined with supplier
  • These two intermediate results are then joined together
  • customer is joined last

By considering larger combinations of tables, DPsize is able to delay joining customer until the end, resulting in a different, and, as we will see below, in this case more efficient execution plan.

When we execute the query with both join reordering algorithms and measure three consecutive runs for each, the difference becomes visible in actual runtime. With identical data, the plan produced by DPsize consistently runs faster than the greedy plan. Across the three runs, the DPsize plan is about 4.7% faster than the greedy one.

Three consecutive Greedy runs:

5 rows in set. Elapsed: 2.975 sec. Processed 638.85 million rows, 14.76 GB (214.76 million rows/s., 4.96 GB/s.)
Peak memory usage: 3.65 GiB.

5 rows in set. Elapsed: 2.718 sec. Processed 638.85 million rows, 14.76 GB (235.06 million rows/s., 5.43 GB/s.)
Peak memory usage: 3.64 GiB.

5 rows in set. Elapsed: 2.702 sec. Processed 638.85 million rows, 14.76 GB (236.44 million rows/s., 5.46 GB/s.)
Peak memory usage: 3.65 GiB.

Three consecutive DPsize runs:

5 rows in set. Elapsed: 2.667 sec. Processed 638.85 million rows, 14.76 GB (239.53 million rows/s., 5.53 GB/s.)
Peak memory usage: 3.83 GiB.

5 rows in set. Elapsed: 2.658 sec. Processed 638.85 million rows, 14.76 GB (240.37 million rows/s., 5.55 GB/s.)
Peak memory usage: 3.84 GiB.

5 rows in set. Elapsed: 2.672 sec. Processed 638.85 million rows, 14.76 GB (239.08 million rows/s., 5.52 GB/s.)
Peak memory usage: 3.83 GiB.

While this is a modest improvement for this particular eight-table query, the impact of join reordering grows with query complexity. For queries that join more tables, involve larger size differences between relations, or have less obvious join orders, exploring a richer set of join orders can lead to significantly larger performance gains.

Text index is beta

Contributed by Anton Popov, Elmi Ahmadov, Jimmy Aguilar Mena

Three months after version 3 of the text index was introduced in ClickHouse 25.9, in ClickHouse 25.12, it has moved to beta status.

Let’s remind ourselves how it works with help from the Hacker News example dataset. We’ll first create a table, and one change since 25.9 is that we need to specify a tokenizer - we can’t use the value default anymore.

SET enable_full_text_index=1;

CREATE TABLE hackernews
(
    `id` Int64,
    `deleted` Int64,
    `type` String,
    `by` String,
    `time` DateTime64(9),
    `text` String,
    `dead` Int64,
    `parent` Int64,
    `poll` Int64,
    `kids` Array(Int64),
    `url` String,
    `score` Int64,
    `title` String,
    `parts` Array(Int64),
    `descendants` Int64,
    INDEX inv_idx(text)
    TYPE text(tokenizer = 'splitByNonAlpha')
    GRANULARITY 128
)
ORDER BY time;

Valid values for the tokenizer, which you can also find in the docs, are as follows:

tokenizer = splitByNonAlpha
                                           | splitByString[(S)]
                                           | ngrams[(N)]
                                           | sparseGrams[(min_length[, max_length[, min_cutoff_length]])]
                                           | array

We can write queries against the text field to find specific terms using the hasToken, hasAllTokens, and hasAnyTokens functions.

The following query finds the users posting about OpenAI:

SELECT by, count()
FROM hackernews
WHERE hasToken(text, 'OpenAI')
GROUP BY ALL
ORDER BY count() DESC
LIMIT 10;
β”Œβ”€by──────────────┬─count()─┐
β”‚ minimaxir       β”‚      48 β”‚
β”‚ sillysaurusx    β”‚      43 β”‚
β”‚ gdb             β”‚      40 β”‚
β”‚ thejash         β”‚      24 β”‚
β”‚ YeGoblynQueenne β”‚      23 β”‚
β”‚ nl              β”‚      20 β”‚
β”‚ Voloskaya       β”‚      19 β”‚
β”‚ p1esk           β”‚      18 β”‚
β”‚ rvz             β”‚      17 β”‚
β”‚ visarga         β”‚      16 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Our next query finds users posting about OpenAI and Google in the same post:

SELECT by, count()
FROM hackernews
WHERE hasAllTokens(text, ['OpenAI', 'Google'])
GROUP BY ALL
ORDER BY count() DESC
LIMIT 10;
β”Œβ”€by──────────────┬─count()─┐
β”‚ thejash         β”‚      17 β”‚
β”‚ boulos          β”‚       8 β”‚
β”‚ p1esk           β”‚       6 β”‚
β”‚ nl              β”‚       5 β”‚
β”‚ colah3          β”‚       5 β”‚
β”‚ sillysaurusx    β”‚       5 β”‚
β”‚ Voloskaya       β”‚       4 β”‚
β”‚ YeGoblynQueenne β”‚       4 β”‚
β”‚ visarga         β”‚       4 β”‚
β”‚ rvz             β”‚       4 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

And our final query finds the users who posted about OpenAI or Google:

SELECT by, count()
FROM hackernews
WHERE hasAnyTokens(text, ['OpenAI', 'Google'])
GROUP BY ALL
ORDER BY count() DESC
LIMIT 10;
β”Œβ”€by───────────┬─count()─┐
β”‚ ocdtrekkie   β”‚    2506 β”‚
β”‚ nostrademons β”‚    2317 β”‚
β”‚ pjmlp        β”‚    1948 β”‚
β”‚ tptacek      β”‚    1626 β”‚
β”‚ ChuckMcM     β”‚    1523 β”‚
β”‚ dragonwriter β”‚    1417 β”‚
β”‚ mtgx         β”‚    1189 β”‚
β”‚ dredmorbius  β”‚    1142 β”‚
β”‚ jrockway     β”‚    1121 β”‚
β”‚ Animats      β”‚    1103 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

We can also use other clauses and functions to search text, but the text index will only be used if complete tokens can be extracted from the search term.

For example, the following query won’t use the full-text search index:

SELECT by, count()
FROM hackernews
WHERE text LIKE '%OpenAI%'
GROUP BY ALL
ORDER BY count() DESC
LIMIT 10;
β”Œβ”€by──────────────┬─count()─┐
β”‚ minimaxir       β”‚      49 β”‚
β”‚ sillysaurusx    β”‚      45 β”‚
β”‚ gdb             β”‚      40 β”‚
β”‚ thejash         β”‚      24 β”‚
β”‚ YeGoblynQueenne β”‚      23 β”‚
β”‚ nl              β”‚      20 β”‚
β”‚ Voloskaya       β”‚      19 β”‚
β”‚ p1esk           β”‚      18 β”‚
β”‚ rvz             β”‚      17 β”‚
β”‚ visarga         β”‚      16 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

10 rows in set. Elapsed: 2.161 sec. Processed 28.03 million rows, 9.42 GB (12.97 million rows/s., 4.36 GB/s.)
Peak memory usage: 171.12 MiB.

If we want this query to use the full-text index, we need to add spaces around OpenAI so that the query engine can extract complete tokens from the search term.

SELECT by, count()
FROM hackernews
WHERE text LIKE '% OpenAI %'
GROUP BY ALL
ORDER BY count() DESC
LIMIT 10;
β”Œβ”€by──────────────┬─count()─┐
β”‚ minimaxir       β”‚      33 β”‚
β”‚ sillysaurusx    β”‚      31 β”‚
β”‚ gdb             β”‚      19 β”‚
β”‚ thejash         β”‚      17 β”‚
β”‚ YeGoblynQueenne β”‚      16 β”‚
β”‚ rvz             β”‚      13 β”‚
β”‚ visarga         β”‚      13 β”‚
β”‚ Voloskaya       β”‚      13 β”‚
β”‚ ryanmercer      β”‚      11 β”‚
β”‚ backpropaganda  β”‚      11 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

10 rows in set. Elapsed: 0.529 sec. Processed 7.39 million rows, 2.61 GB (13.95 million rows/s., 4.92 GB/s.)
Peak memory usage: 171.45 MiB.

dictGetKeys

Contributed by Nihal Z. Miaji

ClickHouse supports various dictionaries, a special type of in-memory table that utilizes specialized data structures for fast key-value lookups.

The following dictionary allows us to look up the borough or zone for a given Location ID. It is populated from a CSV file of taxi zones in New York.

CREATE DICTIONARY taxi_zone_dictionary
(
  LocationID UInt16 DEFAULT 0, -- key
  Borough String,              -- attributes
  Zone String,
  service_zone String
)
PRIMARY KEY LocationID
SOURCE(HTTP(URL 'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/taxi_zone_lookup.csv' FORMAT 'CSVWithNames'))
LIFETIME(MIN 0 MAX 0) 
LAYOUT(HASHED_ARRAY());

We can use the dictGet function to find the borough and zone where LocationId=240:

SELECT dictGet('taxi_zone_dictionary', ('Borough', 'Zone'), 240);
β”Œβ”€dictGet('taxi_zβ‹―Zone'), '240')─┐
β”‚ {                             ↴│
│↳  "Borough": "Bronx",         ↴│
│↳  "Zone": "Van Cortlandt Park"↴│
│↳}                              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

As of ClickHouse 25.12, we can retrieve the keys for a given attribute using the dictGetKeys function. The following query returns the LocationIDs for the Bronx:

SELECT dictGetKeys('taxi_zone_dictionary', 'Borough', 'Bronx')
Row 1:
──────
dictGetKeys(β‹―', 'Bronx'): [81,46,18,254,183,185,147,58,31,167,119,3,51,59,259,212,47,174,199,60,213,200,169,248,184,235,242,126,241,168,136,208,78,20,94,250,32,182,220,247,159,69,240]

This function automatically creates a per-query cache, allowing bulk lookups to be fast. The size of the cache is controlled by the max_reverse_dictionary_lookup_cache_size_bytes setting.

Non-constant IN

Contributed by Yarik Briukhovetskyi

As of ClickHouse 25.12, we can now use non-constant lists as part of the IN clause of a query. Let’s have a look at how this works with help from the New York taxis dataset.

The following query returns the drop-offs to LaGuardia when the payment type was cash and to JFK for other payment types.

SELECT dropoff_nyct2010_gid, payment_type, count()
FROM trips
WHERE dropoff_nyct2010_gid IN (payment_type = 'CSH' ? [138] : [132])
GROUP BY ALL;

If we run this query before 25.12, we’ll get the following error:

Received exception:
Code: 1. DB::Exception: Function 'in' is supported only if second argument is constant or table expression. (UNSUPPORTED_METHOD)

If we run it with 25.12, we’ll see the following output:

β”Œβ”€dropoff_nyct2010_gid─┬─payment_type─┬─count()─┐
β”‚                  138 β”‚ CSH          β”‚   10356 β”‚
β”‚                  132 β”‚ CRE          β”‚   10824 β”‚
β”‚                  132 β”‚ NOC          β”‚      80 β”‚
β”‚                  132 β”‚ DIS          β”‚      39 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

HMAC

Contributed by Mikhail F. Shiryaev.

The 25.12 also sees the introduction of an HMAC function for message authentication using a (shared) secret key. This makes it possible to use the ClickHouse server as a webhook and validate message authenticity on INSERT.

To use ClickHouse like this, we need to allow reading of HTTP headers from incoming requests by configuring the following setting:

profiles:
  default:
    allow_get_client_http_header: 1

We’ll then create two tables:

  • webhook_staging, which will store all incoming payloads
  • webhook_prod, which will only store payloads where the signature is valid.

The staging table looks like this:

CREATE TABLE webhook_staging (
    received_at DateTime DEFAULT now(),
    raw_payload String,
    signature String DEFAULT getClientHTTPHeader('X-Hub-Signature-256')
) ENGINE = MergeTree()
ORDER BY received_at;

The production table is like this:

CREATE TABLE webhook_prod (
    received_at DateTime,
    event_type String,
    payload String
) ENGINE = MergeTree()
ORDER BY received_at;

And then we’ll have a materialized view that reads incoming rows to the staging table and forwards them to the production table if the signature is valid:

CREATE MATERIALIZED VIEW webhook_validator TO webhook_prod AS
SELECT 
    received_at,
    raw_payload::JSON.event as event_type,
    raw_payload as payload
FROM webhook_staging
WHERE signature = 'sha256=' || lower(hex(HMAC('SHA256', raw_payload, 'my_secret_key')));

We can simulate a webhook request to ClickHouse using the wrong key:

PAYLOAD='{"event":"user_signup","user_id":789}'
SIGNATURE=$(echo -n "$PAYLOAD" | openssl dgst -sha256 -hmac "my_secret_key2" | cut -d' ' -f2)

curl -X POST "http://localhost:8123/?query=INSERT%20INTO%20webhook_staging%20(raw_payload)%20FORMAT%20RawBLOB" 
  -H "X-Hub-Signature-256: sha256=$SIGNATURE" 
  -d "$PAYLOAD"

If we query the webhook_staging table, we’ll see the following entry:

Row 1:
──────
received_at: 2026-01-06 13:51:53
raw_payload: {"event":"user_signup","user_id":789}
signature:   sha256=8184a43ddb115fba57877a6e3f85b48ae60d678dbcf44407130e467b4106cd3b

But webhook_logs is empty! We can run a variation of the materialized view’s query to show that the signature was invalid:

SELECT
    received_at,
    raw_payload::JSON.event as event_type,
    raw_payload as payload,
    signature,
    'sha256=' || lower(hex(HMAC('SHA256', raw_payload, 'my_secret_key')))
FROM webhook_staging
Row 1:
──────
received_at:              2026-01-06 13:51:53
event_type:               user_signup
payload:                  {"event":"user_signup","user_id":789}
signature:                sha256=8184a43ddb115fba57877a6e3f85b48ae60d678dbcf44407130e467b4106cd3b
concat('sha2β‹―et_key')))): sha256=1f0480fde689cf4080e3b621f6c127df41506efabbf71767539f68b809a90203

We can see that the two signatures don’t match. If we submit a request using the correct key, the record will be inserted into the webhook_logs table.


Share this post

  • Y Combinator icon
  • X icon
  • Bluesky icon
  • Facebook icon
  • LinkedIn icon

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!

Recent posts

Tsvetan Stoychev Β· Jun 26, 2026
Aditya Chidurala Β· Jun 26, 2026
James Cunningham and Vadim Skipin Β· Jun 25, 2026