Loading JSON
The following examples demonstrate simple cases of loading structured and semi-structured JSON data. For more complex JSON, including nested structures, see the guide Designing JSON schema.
Loading Structured JSON
In this section, we assume the JSON data is in NDJSON
(Newline delimited JSON) format, known as JSONEachRow
in ClickHouse, and well structured, i.e. the column names and types are fixed. NDJSON
is the preferred format for loading JSON due to its brevity and efficient use of space, but others are supported for both input and output.
Consider the following JSON sample, representing a row from the Python PyPI dataset:
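An illustrative row is shown below. The column names are representative of the dataset and the values are hypothetical; the real dataset contains additional fields:

```json
{
  "date": "2022-11-15",
  "country_code": "ES",
  "project": "clickhouse-connect",
  "type": "bdist_wheel",
  "installer": "pip",
  "python_minor": "3.9",
  "system": "Linux",
  "version": "0.3.0"
}
```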
In order to load this JSON object into ClickHouse, a table schema must be defined.
In this simple case, our structure is static, our column names are known, and their types are well-defined.
While ClickHouse supports semi-structured data through a JSON type, where key names and their types can be dynamic, this is unnecessary here.
In cases where your columns have fixed names and types, and new columns are not expected, always prefer a statically defined schema in production.
The JSON type is preferred for highly dynamic data, where the names and types of columns are subject to change. This type is also useful in prototyping and data exploration.
A simple schema for this is shown below, where JSON keys are mapped to column names:
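A minimal sketch of such a schema, using the representative columns from the sample above (the table name pypi is illustrative), might look like:

```sql
-- Illustrative schema; column names and types mirror the sample row above
CREATE TABLE pypi
(
    `date` Date,
    `country_code` String,
    `project` String,
    `type` String,
    `installer` String,
    `python_minor` String,
    `system` String,
    `version` String
)
ENGINE = MergeTree
ORDER BY (project, date);
```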
We have selected an ordering key here via the ORDER BY
clause. For further details on ordering keys and how to choose them, see here.
ClickHouse can load JSON data in several formats, automatically inferring the format from the file extension and contents. We can read JSON files for the above table using the S3 function:
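For example, assuming the files are held in a publicly readable bucket (the bucket URL below is a placeholder):

```sql
-- Read a single row to confirm the files and inferred schema look as expected
SELECT *
FROM s3('https://storage.example.com/pypi/*.json.gz')
LIMIT 1;
```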
Note how we are not required to specify the file format. Instead, we use a glob pattern to read all *.json.gz
files in the bucket. ClickHouse automatically infers that the format is JSONEachRow
(NDJSON) from the file extension and contents. The format can be specified manually via the function's parameters should ClickHouse be unable to detect it.
The above files are also compressed. This is automatically detected and handled by ClickHouse.
To load the rows in these files, we can use an INSERT INTO SELECT:
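A sketch, reusing the hypothetical pypi table and placeholder bucket path from above:

```sql
-- Assumes the inferred column names match those of the target table
INSERT INTO pypi
SELECT *
FROM s3('https://storage.example.com/pypi/*.json.gz');
```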
Rows can also be loaded inline using the FORMAT
clause. For example:
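A sketch, using the hypothetical pypi table and an illustrative row:

```sql
INSERT INTO pypi FORMAT JSONEachRow
{"date":"2022-11-15","country_code":"ES","project":"clickhouse-connect","type":"bdist_wheel","installer":"pip","python_minor":"3.9","system":"Linux","version":"0.3.0"}
```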
These examples assume the use of the JSONEachRow
format. Other common JSON formats are supported, with examples of loading these provided here.
Loading Semi-structured JSON
Our previous example loaded JSON that was static, with well-known key names and types. This is often not the case: keys can be added, or their types can change. This is common in use cases such as Observability data.
ClickHouse handles this through a dedicated JSON
type.
Consider the following example from an extended version of the above Python PyPI dataset. Here we have added an arbitrary tags
column with random key-value pairs.
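An illustrative row (the same representative fields as before, with a hypothetical tags object added):

```json
{
  "date": "2022-11-15",
  "country_code": "ES",
  "project": "clickhouse-connect",
  "type": "bdist_wheel",
  "installer": "pip",
  "python_minor": "3.9",
  "system": "Linux",
  "version": "0.3.0",
  "tags": {
    "build": "442",
    "priority": 3,
    "region": "eu-west"
  }
}
```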
The tags column here is unpredictable and thus impossible for us to model. To load this data, we can use our previous schema but provide an additional tags
column of type JSON
:
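A sketch of such a schema is shown below. The table name is illustrative, and depending on your ClickHouse version the JSON type may first need to be enabled via a setting:

```sql
CREATE TABLE pypi_with_tags
(
    `date` Date,
    `country_code` String,
    `project` String,
    `type` String,
    `installer` String,
    `python_minor` String,
    `system` String,
    `version` String,
    `tags` JSON      -- dynamic key-value pairs land here
)
ENGINE = MergeTree
ORDER BY (project, date);
```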
We populate the table using the same approach as for the original dataset:
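For example, using the hypothetical table above and a placeholder bucket path:

```sql
INSERT INTO pypi_with_tags
SELECT *
FROM s3('https://storage.example.com/pypi_with_tags/*.json.gz');
```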
Notice the difference in performance when loading the data here. The JSON column requires type inference at insert time, as well as additional storage if columns exist that have more than one type. Although the JSON type can be configured (see Designing JSON schema) for equivalent performance to explicitly declaring columns, it is intentionally flexible out of the box. This flexibility, however, comes at some cost.
When to use the JSON type
Use the JSON type when your data:
- Has unpredictable keys that can change over time.
- Contains values with varying types (e.g., a path might sometimes contain a string, sometimes a number); a brief sketch of this case follows the list.
- Requires schema flexibility where strict typing isn't viable.
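The sketch below illustrates the varying-types case. The table and column names are hypothetical; the same path holds a string in one row and a number in the other, and both are accepted, with the underlying type preserved per value:

```sql
CREATE TABLE events
(
    `payload` JSON
)
ENGINE = MergeTree
ORDER BY tuple();

INSERT INTO events FORMAT JSONEachRow
{"payload": {"status": "ok"}}
{"payload": {"status": 200}}

-- JSON path values are returned as Dynamic; dynamicType() shows the type stored for each row
SELECT payload.status, dynamicType(payload.status)
FROM events;
```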
If your data structure is known and consistent, there is rarely a need for the JSON type, even if your data is in JSON format. Specifically, if your data has:
- A flat structure with known keys: use standard column types, e.g. String.
- Predictable nesting: use Tuple, Array, or Nested types for these structures.
- Predictable structure with varying types: consider Dynamic or Variant types instead (see the sketch after this list).
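A minimal sketch of these alternatives is shown below. The table and column names are illustrative, and Variant and Dynamic may need to be enabled via settings depending on your ClickHouse version:

```sql
CREATE TABLE example
(
    `id` UInt64,                           -- flat, known key: standard type
    `user` Tuple(name String, age UInt8),  -- predictable nesting: Tuple
    `scores` Array(UInt32),                -- predictable nesting: Array
    `status` Variant(String, UInt16),      -- known key, two possible types: Variant
    `extra` Dynamic                        -- known key, arbitrary type: Dynamic
)
ENGINE = MergeTree
ORDER BY id;
```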
You can also mix approaches as we have done in the above example, using static columns for predictable top-level keys and a single JSON column for a dynamic section of the payload.