Designing your schema
While schema inference can be used to establish an initial schema for JSON data and query JSON data files in place, e.g., in S3, users should aim to establish an optimized versioned schema for their data. We discuss the options for modeling JSON structures below.
Extract where possible
Where possible, users are encouraged to extract the JSON keys they query frequently to the columns on the root of the schema. As well as simplifying query syntax, this allows users to use these columns in their ORDER BY
clause if required or specify a secondary index.
Consider the arXiv dataset explored in the guide JSON schema inference:
Suppose we wish to make the first value of versions.created
the main ordering key - ideally under a name published_date
. This should be either extracted prior to insertion or at insert time using ClickHouse materialized views or materialized columns.
Materialized columns represent the simplest means of extracting data at query time and are preferred if the extraction logic can be captured as a simple SQL expression. As an example, the published_date
can be added to the arXiv schema as a materialized column and defined as an ordering key as follows:
The above requires us to access the tuple using the notation versions[1].1
, referring to the created
column by position, rather than the preferred syntax of versions.created_at[1]
.
On loading the data, the column will be extracted:
Values of materialized columns are always calculated at insert time and cannot be specified in INSERT
queries. Materialized columns will, by default, not be returned in a SELECT *
. This is to preserve the invariant that the result of a SELECT *
can always be inserted back into the table using INSERT. This behavior can be disabled by setting asterisk_include_materialized_columns=1
.
For more complex filtering and transformation tasks, we recommend using materialized views.
Static vs dynamic JSON
The principal task on defining a schema for JSON is to determine the appropriate type for each key's value. We recommended users apply the following rules recursively on each key in the JSON hierarchy to determine the appropriate type for each key.
- Primitive types - If the key's value is a primitive type, irrespective of whether it is part of a sub object or on the root, ensure you select its type according to general schema design best practices and type optimization rules. Arrays of primitives, such as
phone_numbers
below, can be modeled asArray(<type>)
e.g.,Array(String)
. - Static vs dynamic - If the key's value is a complex object i.e. either an object or an array of objects, establish whether it is subject to change. Objects that rarely have new keys, where the addition of a new key can be predicted and handled with a schema change via
ALTER TABLE ADD COLUMN
, can be considered static. This includes objects where only a subset of the keys may be provided on some JSON documents. Objects where new keys are added frequently and/or not predictable should be considered dynamic. To establish whether a value is static or dynamic, see the relevant sections Handling static objects and Handling dynamic objects below.
Important: The above rules should be applied recursively. If a key's value is determined to be dynamic, no further evaluation is required and the guidelines in Handling dynamic objects can be followed. If the object is static, continue to assess the subkeys until either key values are primitive or dynamic keys are encountered.
To illustrate these rules, we use the following JSON example representing a person:
Applying these rules:
- The root keys
name
,username
,email
,website
can be represented as typeString
. The columnphone_numbers
is an Array primitive of typeArray(String)
, withdob
andid
typeDate
andUInt32
respectively. - New keys will not be added to the
address
object (only new address objects), and it can thus be considered static. If we recurse, all of the sub-columns can be considered primitives (and typeString
) exceptgeo
. This is also a static structure with twoFloat32
columns,lat
andlon
. - The
tags
column is dynamic. We assume new arbitrary tags can be added to this object of any type and structure. - The
company
object is static and will always contain at most the 3 keys specified. The subkeysname
andcatchPhrase
are of typeString
. The keylabels
is dynamic. We assume new arbitrary tags can be added to this object. Values will always be key-value pairs of type string.
Handling static objects
We recommend static objects are handled using named tuples i.e. Tuple
. Arrays of objects can be held using arrays of tuples i.e. Array(Tuple)
. Within tuples themselves, columns and their respective types should be defined using the same rules. This can result in nested Tuples to represent nested objects as shown below.
To illustrate this, we use the earlier JSON person example, omitting the dynamic objects:
The schema for this table is shown below:
Note how the company
column is defined as a Tuple(catchPhrase String, name String)
. The address
field uses an Array(Tuple)
, with a nested Tuple
to represent the geo
column.
JSON can be inserted into this table in its current structure:
In our example above, we have minimal data, but as shown below, we can query the tuple fields by their period-delimited names.
Note how the address.street
column is returned as an Array
. To query a specific object inside an array by position, the array offset should be specified after the column name. For example, to access the street from the first address:
The principal disadvantage of tuples is that the sub columns cannot be used in ordering keys. The following will thus fail:
While tuple columns cannot be used in ordering keys, the entire tuple can be used. While possible, this rarely makes sense.
Handling default values
Even if JSON objects are structured, they are often sparse with only a subset of the known keys provided. Fortunately, the Tuple
type does not require all columns in the JSON payload. If not provided, default values will be used.
Consider our earlier people
table and the following sparse JSON, missing the keys suite
, geo
, phone_numbers
and catchPhrase
.
We can see below this row can be successfully inserted:
Querying this single row, we can see that default values are used for the columns (including sub-objects) that were omitted:
If users need to differentiate between a value being empty and not provided, the Nullable type can be used. This should be avoided unless absolutely required, as it will negatively impact storage and query performance on these columns.
Handling new columns
While a structured approach is simplest when the JSON keys are static, this approach can still be used if the changes to the schema can be planned, i.e., new keys are known in advance, and the schema can be modified accordingly.
Note that ClickHouse will by default ignore JSON keys which are provided in the payload and are not present in the schema. Consider the following modified JSON payload with the addition of a nickname
key:
This JSON can be successfully inserted with the nickname
key ignored:
Columns can be added to a schema using the ALTER TABLE ADD COLUMN
command. A default can be specified via the DEFAULT
clause, which will be used if it is not specified during the subsequent inserts. Rows for which this value is not present (as they were inserted prior to its creation) will also return this default value. If no DEFAULT
value is specified, the default value for the type will be used.
For example:
Handling dynamic objects
There are two recommended approaches to handling dynamic objects:
- Map(String,V) type
- String with JSON functions
The following rules can be applied to determine the most appropriate.
- If the objects are highly dynamic, with no predictable structure and contain arbitrary nested objects, users should use the
String
type. Values can be extracted at query time using JSON functions as we show below. - If the object is used to store arbitrary keys, mostly of one type, consider using the
Map
type. Ideally, the number of unique keys should not exceed several hundred. TheMap
type can also be considered for objects with sub-objects, provided the latter have uniformity in their types. Generally, we recommend theMap
type be used for labels and tags, e.g. Kubernetes pod labels in log data.
Different techniques may be applied to different objects in the same schema. Some objects can be best solved with a String
and others Map
. Note that once a String
type is used, no further schema decisions need to be made. Conversely, it is possible to nest sub-objects within a Map
key as we show below - including a String
representing JSON.
Using String
Handling data using the structured approach described above is often not viable for those users with dynamic JSON, which is either subject to change or for which the schema is not well understood. For absolute flexibility, users can simply store JSON as String
s before using functions to extract fields as required. This represents the extreme opposite of handling JSON as a structured object. This flexibility incurs costs with significant disadvantages - primarily an increase in query syntax complexity as well as degraded performance.
As noted earlier, for the original person object, we cannot ensure the structure of the tags
column. We insert the original row (we also include company.labels
, which we ignore for now), declaring the Tags
column as a String
:
We can select the tags
column and see that the JSON has been inserted as a string:
The JSONExtract functions can be used to retrieve values from this JSON. Consider the simple example below:
Notice how the functions require both a reference to the String
column tags
and a path in the JSON to extract. Nested paths require functions to be nested e.g. JSONExtractUInt(JSONExtractString(tags, 'car'), 'year')
which extracts the column tags.car.year
. The extraction of nested paths can be simplified through the functions JSON_QUERY AND JSON_VALUE.
Consider the extreme case with the arxiv
dataset where we consider the entire body to be a String
.
To insert into this schema, we need to use the JSONAsString
format:
Suppose we wish to count the number of papers released by year. Contrast the query against the structured version of the schema vs using only a string:
Notice the use of an XPath expression here to filter the JSON by method i.e. JSON_VALUE(body, '$.versions[0].created')
.
String functions are appreciably slower (> 10x) than explicit type conversions with indices. The above queries always require a full table scan and processing of every row. While these queries will still be fast on a small dataset such as this, performance will degrade on larger datasets.
This approach's flexibility comes at a clear performance and syntax cost, and it should be used only for highly dynamic objects in the schema.
Simple JSON Functions
The above examples use the JSON* family of functions. These utilize a full JSON parser based on simdjson, that is rigorous in its parsing and will distinguish between the same field nested at different levels. These functions are able to deal with JSON that is syntactically correct but not well-formatted, e.g. double spaces between keys.
A faster and more strict set of functions are available. These simpleJSON*
functions offer potentially superior performance, primarily by making strict assumptions as to the structure and format of the JSON. Specifically:
-
Field names must be constants
-
Consistent encoding of field names e.g.
simpleJSONHas('{"abc":"def"}', 'abc') = 1
, butvisitParamHas('{"\\u0061\\u0062\\u0063":"def"}', 'abc') = 0
-
The field names are unique across all nested structures. No differentiation is made between nesting levels and matching is indiscriminate. In the event of multiple matching fields, the first occurrence is used.
-
No special characters outside of string literals. This includes spaces. The following is invalid and will not parse.
Whereas, the following will parse correctly:
In some circumstances, where performance is critical and your JSON meets the above requirements, these may be appropriate. An example of the earlier query, re-written to use simpleJSON*
functions, is shown below:
The above uses the simpleJSONExtractString
to extract the created
key, exploiting the fact we want the first value only for the published date. In this case, the limitations of the simpleJSON*
functions are acceptable for the gain in performance.
Using Map
If an object is used to store arbitrary keys of mostly one type, consider using the Map
type. Ideally, the number of unique keys should not exceed several hundred. We recommend the Map
type be used for labels and tags e.g. Kubernetes pod labels in log data. While a simple way to represent nested structures, Map
s have some notable limitations:
- The fields must be of all the same type.
- Accessing sub-columns requires a special map syntax since the fields don’t exist as columns; the entire object is a column.
- Accessing a subcolumn loads the entire
Map
value i.e. all siblings and their respective values. For larger maps, this can result in a significant performance penalty.
When modelling objects as Map
s, a String
key is used to store the JSON key name. The map will therefore always be Map(String, T)
, where T
depends on the data.
Primitive values
The simplest application of a Map
is when the object contains the same primitive type as values. In most cases, this involves using the String
type for the value T
.
Consider our earlier person JSON where the company.labels
object was determined to be dynamic. Importantly, we only expect key-value pairs of type String to be added to this object. We can thus declare this as Map(String, String)
:
We can insert our original complete JSON object:
Querying these fields within the request object requires a map syntax e.g.:
A full set of Map
functions is available to query this time, described here. If your data is not of a consistent type, functions exist to perform the necessary type coercion.
Object values
The Map
type can also be considered for objects which have sub-objects, provided the latter have consistency in their types.
Suppose the tags
key for our persons
object requires a consistent structure, where the sub-object for each tag
has a name
and time
column. A simplified example of such a JSON document might look like the following:
This can be modelled with a Map(String, Tuple(name String, time DateTime))
as shown below:
The application of maps in this case is typically rare, and suggests that the data should be remodelled such that dynamic key names do not have sub-objects. For example, the above could be remodelled as follows allowing the use of Array(Tuple(key String, name String, time DateTime))
.