Skip to main content
Skip to main content

Use JSON where appropriate

ClickHouse now offers a native JSON column type designed for semi-structured and dynamic data. It's important to clarify that this is a column type, not a data format—you can insert JSON into ClickHouse as a string or via supported formats like JSONEachRow, but that does not imply using the JSON column type. Users should only use the JSON type when the structure of their data is dynamic, not when they simply happen to store JSON.

When to use the JSON type

Use the JSON type when your data:

  • Has unpredictable keys that can change over time.
  • Contains values with varying types (e.g., a path might sometimes contain a string, sometimes a number).
  • Requires schema flexibility where strict typing isn't viable.

If your data structure is known and consistent, there is rarely a need for the JSON type, even if your data is in JSON format. Specifically, if your data has:

  • A flat structure with known keys: use standard column types e.g. String.
  • Predictable nesting: use Tuple, Array, or Nested types for these structures.
  • Predictable structure with varying types: consider Dynamic or Variant types instead.

You can also mix approaches - for example, use static columns for predictable top-level fields and a single JSON column for a dynamic section of the payload.

Considerations and tips for using JSON

The JSON type enables efficient columnar storage by flattening paths into subcolumns. But with flexibility comes responsibility. To use it effectively:

  • Specify path types using hints in the column definition to specify types for known sub columns, avoiding unnecessary type inference.
  • Skip paths if you don't need the values, with SKIP and SKIP REGEXP to reduce storage and improve performance.
  • Avoid setting max_dynamic_paths too high - large values increase resource consumption and reduce efficiency. As a rule of thumb, keep it below 10,000.
Type hints

Type hits offer more than just a way to avoid unnecessary type inference - they eliminate storage and processing indirection entirely. JSON paths with type hints are always stored just like traditional columns, bypassing the need for discriminator columns or dynamic resolution during query time. This means that with well-defined type hints, nested JSON fields achieve the same performance and efficiency as if they were modeled as top-level fields from the outset. As a result, for datasets that are mostly consistent but still benefit from the flexibility of JSON, type hints provide a convenient way to preserve performance without needing to restructure your schema or ingest pipeline.

Advanced Features

  • JSON columns can be used in primary keys like any other columns. Codecs cannot be specified for a sub-column.
  • They support introspection via functions like JSONAllPathsWithTypes() and JSONDynamicPaths().
  • You can read nested sub-objects using the .^ syntax.
  • Query syntax may differ from standard SQL and may require special casting or operators for nested fields.

For additional guidance, see ClickHouse JSON documentation or explore our blog post A New Powerful JSON Data Type for ClickHouse.

Examples

Consider the following JSON sample, representing a row from the Python PyPI dataset:

Lets assume this schema is static and the types can be well defined. Even if the data is in NDJSON format (JSON row per line), there is no need to use the JSON type for such a schema. Simply define the schema with classic types.

and insert JSON rows:

Consider the arXiv dataset containing 2.5m scholarly papers. Each row in this dataset, distributed as NDJSON, represents a published academic paper. An example row is shown below:

While the JSON here is complex, with nested structures, it is predictable. The number and type of the fields will not change. While we could use the JSON type for this example, we can also just define the structure explicitly using Tuples and Nested types:

Again we can insert the data as JSON:

Suppose another column called tags is added. If this was simply a list of strings we could model as an Array(String), but let's assume users can add arbitrary tag structures with mixed types (notice score is a string or integer). Our modified JSON document:

In this case, we could model the arXiv documents as either all JSON or simply add a JSON tags column. We provide both examples below:

note

We specify the update_date in the JSON definition as we use it in the ordering/primary key. This helps ClickHouse know this column won't be null. If not specified, the user must explicitly allow nullable primary keys (not recommended for performance reasons) via the setting allow_nullable_key=1

We can insert into this table and view the subsequently inferred schema using the JSONAllPathsWithTypes function and PrettyJSONEachRow output format:

Alternatively, we could model this using our earlier schema and a JSON tags column. This is generally preferred, minimizing the inference required by ClickHouse:

We can now infer the types of the sub column tags.