Other approaches to modeling JSON
The following are alternatives to modeling JSON in ClickHouse. These are documented for completeness and were applicable before the development of the JSON type and are thus generally not recommended or applicable in most use cases.
Different techniques may be applied to different objects in the same schema. For example, some objects can be best solved with a String
type and others with a Map
type. Note that once a String
type is used, no further schema decisions need to be made. Conversely, it is possible to nest sub-objects within a Map
key - including a String
representing JSON - as we show below:
Using String
If the objects are highly dynamic, with no predictable structure and contain arbitrary nested objects, users should use the String
type. Values can be extracted at query time using JSON functions as we show below.
Handling data using the structured approach described above is often not viable for those users with dynamic JSON, which is either subject to change or for which the schema is not well understood. For absolute flexibility, users can simply store JSON as String
s before using functions to extract fields as required. This represents the extreme opposite of handling JSON as a structured object. This flexibility incurs costs with significant disadvantages - primarily an increase in query syntax complexity as well as degraded performance.
As noted earlier, for the original person object, we cannot ensure the structure of the tags
column. We insert the original row (including company.labels
, which we ignore for now), declaring the Tags
column as a String
:
We can select the tags
column and see that the JSON has been inserted as a string:
The JSONExtract
functions can be used to retrieve values from this JSON. Consider the simple example below:
Notice how the functions require both a reference to the String
column tags
and a path in the JSON to extract. Nested paths require functions to be nested e.g. JSONExtractUInt(JSONExtractString(tags, 'car'), 'year')
which extracts the column tags.car.year
. The extraction of nested paths can be simplified through the functions JSON_QUERY
and JSON_VALUE
.
Consider the extreme case with the arxiv
dataset where we consider the entire body to be a String
.
To insert into this schema, we need to use the JSONAsString
format:
Suppose we wish to count the number of papers released by year. Contrast the following query using only a string versus the structured version of the schema:
Notice the use of an XPath expression here to filter the JSON by method i.e. JSON_VALUE(body, '$.versions[0].created')
.
String functions are appreciably slower (> 10x) than explicit type conversions with indices. The above queries always require a full table scan and processing of every row. While these queries will still be fast on a small dataset such as this, performance will degrade on larger datasets.
This approach's flexibility comes at a clear performance and syntax cost, and it should be used only for highly dynamic objects in the schema.
Simple JSON Functions
The above examples use the JSON* family of functions. These utilize a full JSON parser based on simdjson, that is rigorous in its parsing and will distinguish between the same field nested at different levels. These functions are able to deal with JSON that is syntactically correct but not well-formatted, e.g. double spaces between keys.
A faster and more strict set of functions are available. These simpleJSON*
functions offer potentially superior performance, primarily by making strict assumptions as to the structure and format of the JSON. Specifically:
-
Field names must be constants
-
Consistent encoding of field names e.g.
simpleJSONHas('{"abc":"def"}', 'abc') = 1
, butvisitParamHas('{"\\u0061\\u0062\\u0063":"def"}', 'abc') = 0
-
The field names are unique across all nested structures. No differentiation is made between nesting levels and matching is indiscriminate. In the event of multiple matching fields, the first occurrence is used.
-
No special characters outside of string literals. This includes spaces. The following is invalid and will not parse.
Whereas, the following will parse correctly:
The above query uses the simpleJSONExtractString
to extract the created
key, exploiting the fact we want the first value only for the published date. In this case, the limitations of the simpleJSON*
functions are acceptable for the gain in performance.
Using Map
If the object is used to store arbitrary keys, mostly of one type, consider using the Map
type. Ideally, the number of unique keys should not exceed several hundred. The Map
type can also be considered for objects with sub-objects, provided the latter have uniformity in their types. Generally, we recommend the Map
type be used for labels and tags, e.g. Kubernetes pod labels in log data.
Although Map
s give a simple way to represent nested structures, they have some notable limitations:
- The fields must all be of the same type.
- Accessing sub-columns requires a special map syntax since the fields don't exist as columns. The entire object is a column.
- Accessing a subcolumn loads the entire
Map
value i.e. all siblings and their respective values. For larger maps, this can result in a significant performance penalty.
When modelling objects as Map
s, a String
key is used to store the JSON key name. The map will therefore always be Map(String, T)
, where T
depends on the data.
Primitive values
The simplest application of a Map
is when the object contains the same primitive type as values. In most cases, this involves using the String
type for the value T
.
Consider our earlier person JSON where the company.labels
object was determined to be dynamic. Importantly, we only expect key-value pairs of type String to be added to this object. We can thus declare this as Map(String, String)
:
We can insert our original complete JSON object:
Querying these fields within the request object requires a map syntax e.g.:
A full set of Map
functions is available to query this time, described here. If your data is not of a consistent type, functions exist to perform the necessary type coercion.
Object values
The Map
type can also be considered for objects which have sub-objects, provided the latter have consistency in their types.
Suppose the tags
key for our persons
object requires a consistent structure, where the sub-object for each tag
has a name
and time
column. A simplified example of such a JSON document might look like the following:
This can be modelled with a Map(String, Tuple(name String, time DateTime))
as shown below:
The application of maps in this case is typically rare, and suggests that the data should be remodelled such that dynamic key names do not have sub-objects. For example, the above could be remodelled as follows allowing the use of Array(Tuple(key String, name String, time DateTime))
.
Using Nested
The Nested type can be used to model static objects which are rarely subject to change, offering an alternative to Tuple
and Array(Tuple)
. We generally recommend avoiding using this type for JSON as its behavior is often confusing. The primary benefit of Nested
is that sub-columns can be used in ordering keys.
Below, we provide an example of using the Nested type to model a static object. Consider the following simple log entry in JSON:
We can declare the request
key as Nested
. Similar to Tuple
, we are required to specify the sub columns.
flatten_nested
The setting flatten_nested
controls the behavior of nested.
flatten_nested=1
A value of 1
(the default) does not support an arbitrary level of nesting. With this value, it is easiest to think of a nested data structure as multiple Array columns of the same length. The fields method
, path
, and version
are all separate Array(Type)
columns in effect with one critical constraint: the length of the method
, path
, and version
fields must be the same. If we use SHOW CREATE TABLE
, this is illustrated:
Below, we insert into this table:
A few important points to note here:
-
We need to use the setting
input_format_import_nested_json
to insert the JSON as a nested structure. Without this, we are required to flatten the JSON i.e. -
The nested fields
method
,path
, andversion
need to be passed as JSON arrays i.e.
Columns can be queried using a dot notation:
Note the use of Array
for the sub-columns means the full breath Array functions can potentially be exploited, including the ARRAY JOIN
clause - useful if your columns have multiple values.
flatten_nested=0
This allows an arbitrary level of nesting and means nested columns stay as a single array of Tuple
s - effectively they become the same as Array(Tuple)
.
This represents the preferred way, and often the simplest way, to use JSON with Nested
. As we show below, it only requires all objects to be a list.
Below, we re-create our table and re-insert a row:
A few important points to note here:
-
input_format_import_nested_json
is not required to insert. -
The
Nested
type is preserved inSHOW CREATE TABLE
. Underneath this column is effectively aArray(Tuple(Nested(method LowCardinality(String), path String, version LowCardinality(String))))
-
As a result, we are required to insert
request
as an array i.e.
Columns can again be queried using a dot notation:
Example
A larger example of the above data is available in a public bucket in s3 at: s3://datasets-documentation/http/
.
Given the constraints and input format for the JSON, we insert this sample dataset using the following query. Here, we set flatten_nested=0
.
The following statement inserts 10 million rows, so this may take a few minutes to execute. Apply a LIMIT
if required:
Querying this data requires us to access the request fields as arrays. Below, we summarize the errors and http methods over a fixed time period.
Using Pairwise Arrays
Pairwise arrays provide a balance between the flexibility of representing JSON as Strings and the performance of a more structured approach. The schema is flexible in that any new fields can be potentially added to the root. This, however, requires a significantly more complex query syntax and isn't compatible with nested structures.
As an example, consider the following table:
To insert into this table, we need to structure the JSON as a list of keys and values. The following query illustrates the use of the JSONExtractKeysAndValues
to achieve this:
Note how the request column remains a nested structure represented as a string. We can insert any new keys to the root. We can also have arbitrary differences in the JSON itself. To insert into our local table, execute the following:
Querying this structure requires using the indexOf
function to identify the index of the required key (which should be consistent with the order of the values). This can be used to access the values array column i.e. values[indexOf(keys, 'status')]
. We still require a JSON parsing method for the request column - in this case, simpleJSONExtractString
.