Lazy materialization
This article describes how lazy materialization works and how it fits into ClickHouse’s broader I/O optimization stack. It presents a real-world example demonstrating how lazy materialization increases query performance.
Lazy materialization was introduced in version 25.4 of ClickHouse, and is turned on by default.
Overview
Over the years, ClickHouse has introduced a series of layered optimizations to aggressively reduce I/O. These techniques form the foundation of its speed and efficiency:
| Optimization | Description |
|---|---|
| Columnar storage | Allows skipping entire columns that aren't needed for a query and also enables high compression by grouping similar values together, minimizing I/O during data loading. |
| Sparse primary indexes | secondary data-skipping indexes | projections | Prune irrelevant data by identifying which granules (row blocks) might match filters on indexed columns. These techniques operate at the granule level and can be used individually or in combination. |
| PREWHERE | Checks matches also for filters on non-indexed columns to skip data early that would otherwise be loaded and discarded. It can work independently or refine the granules selected by indexes, complementing granule pruning by skipping rows that don't match all column filters. |
| Query condition cache | Speeds up repeated queries by remembering which granules matched all filters last time. ClickHouse can then skip reading and filtering granules that didn't match, even if the query shape changes. |
While the aforementioned I/O optimizations can significantly reduce data read, they still assume that all columns for rows passing the WHERE clause must be loaded before running operations like sorting, aggregation, or LIMIT. But what if some columns aren’t needed until later, or some data, despite passing the WHERE clause, is never needed at all?
That’s where lazy materialization comes in. It is an orthogonal enhancement that completes the I/O optimization stack:
- Indexing, together with
PREWHERE, ensures that only rows matching column filters in theWHEREclause are processed. - Lazy materialization builds on this by deferring column reads until they’re actually required by the query execution plan.
Even after filtering, only the columns needed for the next operation - such as sorting - are loaded immediately.
Others are postponed and, due to
LIMIT, are often read only partially, just enough to produce the final result. This makes lazy materialization especially powerful for Top N queries, where the final result may only require a handful of rows from certain, often large, columns.
A worked example
We highly recommend the blogpost "ClickHouse gets lazier (and faster): Introducing lazy materialization" for a deep dive on lazy materialization. The example below is taken from the aforementioned blogpost and reproduced here to demonstrate how a ClickHouse query can go from 219 seconds to just 139 milliseconds (a 1576× speedup) with lazy materialization.
To benefit from indexing and PREWHERE, a query needs filters, on primary key columns for indexing, and on any columns for PREWHERE.
Lazy materialization layers cleanly on top, but unlike the other optimizations mentioned previously, it can also speed up queries with no column filters at all.
Consider the following example query which finds Amazon reviews with the highest number of helpful votes, regardless of date, product, rating, or verification status, and returns the top 3 along with their title, headline, and full text.
First running the query (with cold filesystem caches) with lazy materialization disabled (using query_plan_optimize_lazy_materialization):
Next the query is rerun (again with a cold filesystem cache), but this time with lazy materialization enabled:
Ordinarily you don't need to explicitly set query_plan_optimize_lazy_materialization = true to gain the benefit of lazy materialization.
It is enabled by default.
Consider the difference in performance with lazy materialization turned off and turned on:
| Metric | Lazy materialization off | Lazy materialization on | Improvement |
|---|---|---|---|
| Elapsed time | 219.071 sec | 0.139 sec | ~1576× faster |
| Data read | 71.38 GB | 1.81 GB | ~40× less |
| Peak memory | 1.11 GiB | 3.80 MiB | ~300× less |
How to confirm lazy materialization in the query execution plan
You can observe usage of lazy materialization for the previous query by inspecting the query’s logical execution plan using the EXPLAIN clause:
You can read the operator plan from bottom to top and observe that ClickHouse defers reading the three large String columns until after sorting and limiting.