ClickHouse 现在提供了适用于半结构化和动态数据的原生 JSON 列类型。需要特别说明的是,这是一种列类型,而不是一种数据格式——可以以字符串形式将 JSON 插入 ClickHouse,或者通过 JSONEachRow 等受支持的格式进行插入,但这并不意味着就在使用 JSON 列类型。只有在数据结构本身是动态的情况下,才应该选择 JSON 类型,而不是因为“刚好”以 JSON 形式存储数据就使用它。
何时使用 JSON 类型
在以下情况下使用 JSON 类型:
- 存在不可预测的键,并且这些键会随时间变化。
- 包含类型各异的值(例如,同一路径有时是字符串,有时是数字)。
- 需要更灵活的 schema,而严格类型不可行。
如果你的数据结构是已知且稳定的,即使数据本身是 JSON 格式,通常也不需要使用 JSON 类型。特别是在以下情况下,你的数据具有:
- 扁平结构且键是已知的:使用标准列类型,例如 String。
- 可预测的嵌套结构:对这些结构使用 Tuple、Array 或 Nested 类型。
- 结构可预测但值类型变化:可以考虑使用 Dynamic 或 Variant 类型。
你也可以混合使用多种方式——例如,对可预测的顶层字段使用固定列,对有效负载中动态部分使用单独的 JSON 列。
使用 JSON 的注意事项和技巧
JSON 类型通过将路径扁平化为子列,实现高效的列式存储。但灵活性也意味着更高的使用要求。要高效使用它,请:
类型提示
类型提示不仅仅是避免不必要类型推断的一种方式——它们还能完全消除存储和处理中的间接层。带有类型提示的 JSON 路径始终与传统列以相同方式存储,从而无需在查询时依赖判别列(discriminator column)或进行动态解析。这意味着,在合理定义类型提示的情况下,嵌套 JSON 字段可以获得与从一开始就被建模为顶层字段相同的性能和效率。因此,对于大部分结构一致、但仍希望利用 JSON 灵活性的数据集,类型提示提供了一种便捷方式,可以在无需重构 schema 或摄取管道的前提下保持性能。
高级特性
如需更多指导,请参阅 ClickHouse JSON 文档 或查看我们的博文 A New Powerful JSON Data Type for ClickHouse。
考虑如下 JSON 示例,它表示 Python PyPI 数据集 中的一行数据:
{
"date": "2022-11-15",
"country_code": "ES",
"project": "clickhouse-connect",
"type": "bdist_wheel",
"installer": "pip",
"python_minor": "3.9",
"system": "Linux",
"version": "0.3.0"
}
假设该 schema 是静态的,并且各个类型都可以被很好地定义。即使数据是 NDJSON 格式(每行一条 JSON 数据),对于这样一个 schema 也没有必要使用 JSON 类型。只需使用常规数据类型来定义该 schema 即可。
CREATE TABLE pypi (
`date` Date,
`country_code` String,
`project` String,
`type` String,
`installer` String,
`python_minor` String,
`system` String,
`version` String
)
ENGINE = MergeTree
ORDER BY (project, date)
并插入 JSON 行数据:
INSERT INTO pypi FORMAT JSONEachRow
{"date":"2022-11-15","country_code":"ES","project":"clickhouse-connect","type":"bdist_wheel","installer":"pip","python_minor":"3.9","system":"Linux","version":"0.3.0"}
考虑包含 250 万篇学术论文的 arXiv 数据集。该数据集以 NDJSON 格式分发,其中的每一行都代表一篇已发表的学术论文。示例行如下所示:
{
"id": "2101.11408",
"submitter": "Daniel Lemire",
"authors": "Daniel Lemire",
"title": "Number Parsing at a Gigabyte per Second",
"comments": "Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/",
"journal-ref": "Software: Practice and Experience 51 (8), 2021",
"doi": "10.1002/spe.2984",
"report-no": null,
"categories": "cs.DS cs.MS",
"license": "http://creativecommons.org/licenses/by/4.0/",
"abstract": "With disks and networks providing gigabytes per second ....\n",
"versions": [
{
"created": "Mon, 11 Jan 2021 20:31:27 GMT",
"version": "v1"
},
{
"created": "Sat, 30 Jan 2021 23:57:29 GMT",
"version": "v2"
}
],
"update_date": "2022-11-07",
"authors_parsed": [
[
"Lemire",
"Daniel",
""
]
]
}
尽管此处的 JSON 结构较为复杂并包含嵌套,但它是可预测的,字段的数量和类型不会变化。在本示例中,我们既可以使用 JSON 类型,也可以直接使用 Tuples 和 Nested 类型显式地定义该结构:
CREATE TABLE arxiv
(
`id` String,
`submitter` String,
`authors` String,
`title` String,
`comments` String,
`journal-ref` String,
`doi` String,
`report-no` String,
`categories` String,
`license` String,
`abstract` String,
`versions` Array(Tuple(created String, version String)),
`update_date` Date,
`authors_parsed` Array(Array(String))
)
ENGINE = MergeTree
ORDER BY update_date
同样可以将这些数据作为 JSON 插入:
INSERT INTO arxiv FORMAT JSONEachRow
{"id":"2101.11408","submitter":"Daniel Lemire","authors":"Daniel Lemire","title":"Number Parsing at a Gigabyte per Second","comments":"Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/","journal-ref":"Software: Practice and Experience 51 (8), 2021","doi":"10.1002/spe.2984","report-no":null,"categories":"cs.DS cs.MS","license":"http://creativecommons.org/licenses/by/4.0/","abstract":"With disks and networks providing gigabytes per second ....\n","versions":[{"created":"Mon, 11 Jan 2021 20:31:27 GMT","version":"v1"},{"created":"Sat, 30 Jan 2021 23:57:29 GMT","version":"v2"}],"update_date":"2022-11-07","authors_parsed":[["Lemire","Daniel",""]]}
假设又新增了一列名为 tags。如果它只是一个字符串列表,我们可以将其建模为 Array(String),但这里假设你可以添加具有混合类型的任意标签结构(注意 score 既可以是字符串也可以是整数)。我们修改后的 JSON 文档如下:
{
"id": "2101.11408",
"submitter": "Daniel Lemire",
"authors": "Daniel Lemire",
"title": "Number Parsing at a Gigabyte per Second",
"comments": "Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/",
"journal-ref": "Software: Practice and Experience 51 (8), 2021",
"doi": "10.1002/spe.2984",
"report-no": null,
"categories": "cs.DS cs.MS",
"license": "http://creativecommons.org/licenses/by/4.0/",
"abstract": "With disks and networks providing gigabytes per second ....\n",
"versions": [
{
"created": "Mon, 11 Jan 2021 20:31:27 GMT",
"version": "v1"
},
{
"created": "Sat, 30 Jan 2021 23:57:29 GMT",
"version": "v2"
}
],
"update_date": "2022-11-07",
"authors_parsed": [
[
"Lemire",
"Daniel",
""
]
],
"tags": {
"tag_1": {
"name": "ClickHouse user",
"score": "A+",
"comment": "A good read, applicable to ClickHouse"
},
"28_03_2025": {
"name": "professor X",
"score": 10,
"comment": "Didn't learn much",
"updates": [
{
"name": "professor X",
"comment": "Wolverine found more interesting"
}
]
}
}
}
在本例中,我们可以将 arXiv 文档全部建模为 JSON,或者只添加一个 JSON 类型的 tags 列。下面我们提供这两种示例:
CREATE TABLE arxiv
(
`doc` JSON(update_date Date)
)
ENGINE = MergeTree
ORDER BY doc.update_date
注意
我们在 JSON 定义中为 update_date 列提供了类型提示,因为我们在排序键/主键中使用了它。这有助于让 ClickHouse 确定该列不会为 null,并确保它能够知道应使用哪个 update_date 子列(每种类型可能都有多个子列,否则会产生歧义)。
我们可以向此表插入数据,并使用 JSONAllPathsWithTypes 函数和 PrettyJSONEachRow 输出格式查看之后推断出的表结构:
INSERT INTO arxiv FORMAT JSONAsObject
{"id":"2101.11408","submitter":"Daniel Lemire","authors":"Daniel Lemire","title":"Number Parsing at a Gigabyte per Second","comments":"Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/","journal-ref":"Software: Practice and Experience 51 (8), 2021","doi":"10.1002/spe.2984","report-no":null,"categories":"cs.DS cs.MS","license":"http://creativecommons.org/licenses/by/4.0/","abstract":"With disks and networks providing gigabytes per second ....\n","versions":[{"created":"Mon, 11 Jan 2021 20:31:27 GMT","version":"v1"},{"created":"Sat, 30 Jan 2021 23:57:29 GMT","version":"v2"}],"update_date":"2022-11-07","authors_parsed":[["Lemire","Daniel",""]],"tags":{"tag_1":{"name":"ClickHouse user","score":"A+","comment":"A good read, applicable to ClickHouse"},"28_03_2025":{"name":"professor X","score":10,"comment":"Didn't learn much","updates":[{"name":"professor X","comment":"Wolverine found more interesting"}]}}}
SELECT JSONAllPathsWithTypes(doc)
FROM arxiv
FORMAT PrettyJSONEachRow
{
"JSONAllPathsWithTypes(doc)": {
"abstract": "String",
"authors": "String",
"authors_parsed": "Array(Array(Nullable(String)))",
"categories": "String",
"comments": "String",
"doi": "String",
"id": "String",
"journal-ref": "String",
"license": "String",
"submitter": "String",
"tags.28_03_2025.comment": "String",
"tags.28_03_2025.name": "String",
"tags.28_03_2025.score": "Int64",
"tags.28_03_2025.updates": "Array(JSON(max_dynamic_types=16, max_dynamic_paths=256))",
"tags.tag_1.comment": "String",
"tags.tag_1.name": "String",
"tags.tag_1.score": "String",
"title": "String",
"update_date": "Date",
"versions": "Array(JSON(max_dynamic_types=16, max_dynamic_paths=256))"
}
}
1 row in set. Elapsed: 0.003 sec.
或者,我们也可以使用之前的 schema,并加上一列 JSON 类型的 tags 来进行建模。通常更推荐这种方式,因为这样可以最大限度减少 ClickHouse 所需的推断:
CREATE TABLE arxiv
(
`id` String,
`submitter` String,
`authors` String,
`title` String,
`comments` String,
`journal-ref` String,
`doi` String,
`report-no` String,
`categories` String,
`license` String,
`abstract` String,
`versions` Array(Tuple(created String, version String)),
`update_date` Date,
`authors_parsed` Array(Array(String)),
`tags` JSON()
)
ENGINE = MergeTree
ORDER BY update_date
INSERT INTO arxiv FORMAT JSONEachRow
{"id":"2101.11408","submitter":"Daniel Lemire","authors":"Daniel Lemire","title":"Number Parsing at a Gigabyte per Second","comments":"Software at https://github.com/fastfloat/fast_float and\n https://github.com/lemire/simple_fastfloat_benchmark/","journal-ref":"Software: Practice and Experience 51 (8), 2021","doi":"10.1002/spe.2984","report-no":null,"categories":"cs.DS cs.MS","license":"http://creativecommons.org/licenses/by/4.0/","abstract":"With disks and networks providing gigabytes per second ....\n","versions":[{"created":"Mon, 11 Jan 2021 20:31:27 GMT","version":"v1"},{"created":"Sat, 30 Jan 2021 23:57:29 GMT","version":"v2"}],"update_date":"2022-11-07","authors_parsed":[["Lemire","Daniel",""]],"tags":{"tag_1":{"name":"ClickHouse user","score":"A+","comment":"A good read, applicable to ClickHouse"},"28_03_2025":{"name":"professor X","score":10,"comment":"Didn't learn much","updates":[{"name":"professor X","comment":"Wolverine found more interesting"}]}}}
我们现在可以推断出子列 tags 的数据类型。
SELECT JSONAllPathsWithTypes(tags)
FROM arxiv
FORMAT PrettyJSONEachRow
{
"JSONAllPathsWithTypes(tags)": {
"28_03_2025.comment": "String",
"28_03_2025.name": "String",
"28_03_2025.score": "Int64",
"28_03_2025.updates": "Array(JSON(max_dynamic_types=16, max_dynamic_paths=256))",
"tag_1.comment": "String",
"tag_1.name": "String",
"tag_1.score": "String"
}
}
1 row in set. Elapsed: 0.002 sec.