In August 2024, ClickHouse v24.8 introduced a powerful JSON data type. Since then, we’ve continuously improved it with new features and optimizations.
In this post, we’ll show how we’ve just made it even better in ClickHouse v25.8.
ClickHouse is already the industry leader in analytical performance, and the latest changes in v25.8 make ClickHouse the leader in analytics over JSON data, too.
How the JSON type worked in v24.8 #
Let’s start with a quick recap of how JSON data is stored in MergeTree parts as designed in ClickHouse v24.8.
In the illustration below, we have JSON data with K unique paths with string values. The first N paths are stored as subcolumns in dynamic paths. All remaining K – N paths are stored in a shared data structure, which is represented as a `Map(String, String)` column containing both paths and values.

For example, if we query the data of the path `key1`, ClickHouse first checks metadata to see whether `key1` is in dynamic paths or shared data. In this example, `key1` is in dynamic paths, so its values are stored in dedicated data files, which can be read directly and efficiently.
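As a concrete (if simplified) sketch of that read path, the example below creates a JSON column, inserts one document, and reads a single path back; the `JSONDynamicPaths` introspection function lists which paths received dedicated subcolumns. Table and path names are illustrative.

```sql
-- Minimal sketch; table and path names are illustrative.
CREATE TABLE recap_demo
(
    json JSON
)
ENGINE = MergeTree
ORDER BY tuple();

INSERT INTO recap_demo VALUES ('{"key1": "hello", "key2": "world"}');

-- key1 fits within the dynamic-path budget, so this read goes straight
-- to its dedicated subcolumn files.
SELECT json.key1 FROM recap_demo;

-- List the paths that were stored as dynamic subcolumns.
SELECT DISTINCT arrayJoin(JSONDynamicPaths(json)) AS dynamic_path
FROM recap_demo;
```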

Now, if we query `key_n+1`, the metadata shows that it’s not part of dynamic paths. Instead, it resides in shared data. To extract its values, ClickHouse must read the entire `Map(String, String)` column and filter it in memory. This is much less efficient.
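Conceptually, extracting one path from shared data looks like the standalone snippet below: the whole map has to be materialized before a single key can be picked out (the map literal simply stands in for the shared data of one granule).

```sql
-- Illustration only: the map literal stands in for the shared data of one
-- granule; all of its key/value pairs are read before the requested path
-- can be returned.
SELECT shared_data['key_2001'] AS value
FROM
(
    SELECT map('key_2000', 'a', 'key_2001', 'b', 'key_2002', 'c') AS shared_data
);
```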

By default, the limit on dynamic paths is 1024. Raising this limit to handle more paths is usually a bad idea, especially when using remote storage like S3, because creating thousands of files per part increases memory usage during merges and complicates reads.
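For reference, this budget is a parameter of the JSON type itself and can be changed per column; given the costs just mentioned, raising it is rarely the right fix. A minimal sketch:

```sql
-- The default is 1024 dynamic paths. Every dynamic path gets its own set of
-- files in each wide part, so large values multiply files and merge-time
-- memory usage.
CREATE TABLE events
(
    json JSON(max_dynamic_paths = 4096)
)
ENGINE = MergeTree
ORDER BY tuple();
```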
As a result, workloads with thousands or tens of thousands of unique JSON paths suffer from poor performance.
This is the challenge we set out to solve.
New serializations for shared data in v25.8 #
ClickHouse v25.8 introduces two new serialization formats for shared data that dramatically improve the efficiency of reading specific paths.
Bucketed shared data #
The first new serialization splits shared data into N buckets. Each bucket contains its own `Map(String, String)` column, with a deterministic assignment of paths to buckets.

When querying a path like `key_m1`, ClickHouse can immediately determine which bucket contains the requested path. Only that bucket is read, while the rest are skipped.

This reduces the amount of data scanned and improves performance compared to reading the entire shared data. However, increasing the number of buckets too far creates too many files and reintroduces file-system overhead.
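The path-to-bucket mapping is deterministic, so it can be computed from the path name alone; the sketch below only illustrates the idea with a generic hash, since the actual assignment function and bucket count are internal details of the format.

```sql
-- Illustration only: a hash of the path modulo the bucket count stands in
-- for the real assignment function; 8 buckets is an arbitrary choice here.
SELECT
    path,
    cityHash64(path) % 8 AS bucket
FROM
(
    SELECT arrayJoin(['key_1', 'key_2', 'key_m1', 'key_9999']) AS path
);
```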
Advanced shared data #
The second and more powerful approach is the advanced serialization format.
Here, each bucket contains 3 files:
- `.structure` – metadata for each granule: number of rows, list of paths in the granule, and offset in the `.paths_marks` file.
- `.data` – the actual path data stored in columnar format per granule.
- `.paths_marks` – offsets pointing to the start of each path’s data in the `.data` file.

When querying the path `key_m1`, ClickHouse first checks the `.structure` file to see if the granule contains `key_m1`. If not, the granule is skipped entirely. If it does, ClickHouse looks up the offset in the `.paths_marks` file and jumps directly to the `key_m1` path’s data in the `.data` file and reads it.

This avoids reading the data of unrelated paths into memory and significantly improves query performance.
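Nothing changes on the query side; a selective read like the one below (table name illustrative) is exactly the access pattern this serialization accelerates.

```sql
-- With the advanced serialization, only granules whose .structure entry
-- lists key_m1 are touched, and within them only that path's slice of the
-- .data file is read.
SELECT json.key_m1
FROM json_table
LIMIT 10;
```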
Supporting nested paths #
Many JSON documents contain nested structures like arrays of objects. Extracting a nested path from such structures using the above approach still requires reading the entire array, which is inefficient for large payloads.
To address this, the advanced format was extended with additional files for subcolumn handling.
Here each bucket has 6 files:
- `.structure` – metadata for each granule: number of rows, list of paths in the granule, and offsets in the `.paths_marks` and `.substreams_metadata` files.
- `.data` – the actual path data stored in columnar format per granule, but split into substreams. A single path may have multiple substreams. This structure allows ClickHouse to read only the substreams needed to reconstruct the requested subcolumn, rather than scanning the entire path value.
- `.paths_marks` – offsets pointing to the start of each path’s data in the `.data` file.
- `.substreams_marks` – offsets pointing to the start of each path’s substream in the `.data` file.
- `.substreams` – list of substreams present for each path in a granule. This may differ across granules (e.g., some arrays may contain objects with different nested fields).
- `.substreams_metadata` – for each path, stores offsets in the `.substreams` and `.substreams_marks` files, effectively linking a path to its subcolumns and their data locations.

When querying the subcolumn of path `key_m1`, ClickHouse first reads data from the `.structure` file and checks if this granule contains the requested path `key_m1`. If not, the granule is skipped entirely. If it does, ClickHouse uses the offset stored in the `.structure` file to read the corresponding entry in `.substreams_metadata` and obtain the offsets for `key_m1`.
Using the first offset, ClickHouse then reads the list of substreams for this path in the granule. If this list does not include the substreams required for the requested subcolumn, the granule is skipped. If the required substreams are present, ClickHouse uses the second offset to read their positions from `.substreams_marks` and then reads only the data of those substreams from the `.data` file.
After reconstructing the requested subcolumn, ClickHouse proceeds to the next granule.

This avoids reading into memory the data of unrelated paths, as well as unrelated substreams of the requested path, and significantly improves the performance of nested subcolumn reads.
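In query terms, these are the nested subcolumn reads that benefit. The paths and table below are illustrative; the `.:Type` hint and the `^` sub-object prefix are the JSON type’s existing read syntax.

```sql
-- Read a single nested subcolumn; only the substreams needed for this path
-- are read in each granule.
SELECT json.request.client.id.:Int64 FROM events_json;

-- Read a whole nested sub-object under a prefix.
SELECT json.^request.client FROM events_json;
```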
Balancing efficiency and compatibility #
The advanced format is excellent for selective reads, but it doesn’t come for free: reading the entire JSON column or performing merges becomes slower, because the in-memory representation of the shared data (`Map(String, String)`) is very different from its storage layout, resulting in expensive conversions whenever the whole column is read.
To address this, ClickHouse v25.8 makes a trade-off by storing a copy of the data in the original format alongside the advanced one. This effectively doubles the storage requirement, but you get all the benefits of the new format without the sacrifices noted above.
In the illustration below, you can see that the copy is stored in three extra files:
- `.copy.offsets` – offsets of the original `Map(String, String)` column.
- `.copy.indexes` – indexes into the combined list of paths across buckets (avoids storing all paths again).
- `.copy.values` – values of the original `Map(String, String)` column.

This makes reading full JSON columns and performing merges as fast as before, while still keeping the benefits of the advanced format for selective reads.
Performance #
The new advanced shared data serialization delivers performance close to that of storing all paths as dynamic subcolumns, in both query speed and memory usage, while scaling to tens of thousands of unique paths.
Below we highlight some of the results for the 10,000-path tests, which demonstrate the improvements from the new serialization formats in v25.8.
Performance test 1 (selective reads) #
In this test, we compare reading a single JSON key from a table of 200k rows, each row containing a JSON document with 10k paths, using wide parts.
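A dataset of this shape can be generated along the following lines; this is a simplified sketch, not the exact benchmark harness (table name, value distribution, and the final query are illustrative).

```sql
-- Simplified sketch: 200k rows, each holding a JSON document with 10k paths.
CREATE TABLE json_bench
(
    json JSON
)
ENGINE = MergeTree
ORDER BY tuple();

INSERT INTO json_bench
SELECT toJSONString(
    mapFromArrays(
        arrayMap(i -> concat('key_', toString(i)), range(10000)),
        arrayMap(i -> toString(number + i), range(10000))
    )
)::JSON
FROM numbers(200000);

-- The selective read being measured: one path out of 10,000.
SELECT json.key_42 FROM json_bench FORMAT Null;
```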
Using the advanced serialization format dramatically improves both time (~58x) and memory (~3,300x) performance compared to the original JSON serialization.
Data type | Time (seconds) | Memory usage |
---|---|---|
JSON no shared data | 0.027 | 18.64 MiB |
JSON shared data "advanced" | 0.063 | 3.89 MiB |
JSON shared data "map_with_buckets" | 0.087 | 403.55 MiB |
String | 3.216 | 582.70 MiB |
Map | 3.594 | 538.37 MiB |
JSON shared data "map" | 3.63 | 12.53 GiB |
Performance test 2 (reading the whole JSON object) #
In this test, we compare reading the full JSON document from a table of 200k rows, each row containing a JSON document with 10k paths, using wide parts.
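The access pattern measured here is simply reading the entire JSON column back, e.g. against the json_bench sketch from the previous test:

```sql
-- Reads every path of every document; thanks to the storage-layout copy
-- described above, this stays as fast as the original serialization.
SELECT json FROM json_bench FORMAT Null;
```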
Using the advanced serialization is practically equivalent to the original JSON serialization. This shows that the advanced serialization delivers huge gains to selective reads without sacrificing performance when reading the whole document.
Data type | Time (seconds) | Memory usage |
---|---|---|
String | 2.734 | 581.15 MiB |
Map | 3.869 | 550.14 MiB |
JSON shared data "map" | 4.182 | 618.78 MiB |
JSON shared data "advanced" | 4.774 | 683.27 MiB |
JSON shared data "map_with_buckets" | 15.271 | 1.45 GiB |
JSON no shared data | - | OOM |
Conclusion #
The new shared data serialization takes JSON support in ClickHouse to the next level. You can now efficiently query JSON documents with tens of thousands of unique paths while maintaining excellent performance for selective reads.
This enhancement makes ClickHouse an even stronger choice for workloads dealing with semi-structured JSON data at scale.