在合适的情况下使用 JSON

ClickHouse 现在提供了原生的 JSON 列类型，用于处理半结构化和动态数据。需要明确的是，这是一种列类型，而不是数据格式——可以将 JSON 作为字符串插入 ClickHouse，或通过 JSONEachRow 等受支持的格式插入，但这并不意味着在使用 JSON 列类型。只有在数据结构本身是动态的情况下，才应使用 JSON 类型，而不是仅仅因为数据碰巧以 JSON 形式存储就选择该类型。

何时使用 `JSON` 类型

JSON 类型专为在结构动态或不可预测的 JSON 对象中，对特定字段进行查询、过滤和聚合而设计。它通过将 JSON 对象拆分为多个子列来实现这一点，与使用 Map 或解析字符串等方案相比，可以显著减少选定字段的数据读取量，从而加速查询。

不过，这也带来了一些重要的权衡：

INSERT 操作较慢 —— 将 JSON 拆分为子列、执行类型推断以及管理灵活的存储结构，会使插入操作比将 JSON 作为简单的 String 列存储更慢。
读取整个对象时较慢 —— 如果你需要获取完整的 JSON 文档（而不是特定字段），从 JSON 类型读取会比从 String 列读取更慢。当你不进行字段级查询时，从多个子列重建对象的开销并不会带来任何收益。
存储开销更大 —— 相比将 JSON 存为单个字符串值，维护独立的子列会增加结构性存储开销。

在以下情况使用 `JSON` 类型：

当数据结构具有动态性或不可预测性，不同文档之间的键各不相同
字段类型或模式（schema）会随时间变化，或在不同记录之间有所差异
你需要在无法预先预测结构的 JSON 对象中，对特定路径进行查询、过滤或聚合
你的使用场景涉及半结构化数据，如日志、事件，或模式（schema）不一致的用户生成内容

在以下情况下使用 `String` 列（或结构化类型）：

你的数据结构是已知且保持一致的——在这种情况下，应使用常规列、Tuple、Array、Dynamic 或 Variant 类型来代替
JSON 文档被当作不透明的二进制大对象（blob），只以整体形式存储和检索，而不进行字段级分析
你不需要在数据库中对单个 JSON 字段进行查询或过滤
JSON 只是传输/存储格式，不在 ClickHouse 内进行分析

提示

如果 JSON 是在数据库内部不被分析、仅用于存储和读取的不透明文档，则应将其存储为 String 字段。只有当你需要在动态 JSON 结构中的特定字段上进行高效的查询、过滤或聚合时，JSON 类型的优势才会体现出来。

你也可以组合使用多种方式——对可预测的顶层字段使用标准列，对负载中动态部分使用 JSON 列。

使用 JSON 的注意事项和提示

JSON 类型通过将路径展开为子列，实现高效的列式存储。但灵活性也意味着需要更谨慎的使用方式。要高效使用它：

在列定义中使用提示指定路径类型，通过在列定义中使用类型提示为已知子列显式指定类型，避免不必要的类型推断。
跳过不需要的路径，如果你不需要这些值，可以使用 SKIP 和 SKIP REGEXP 来减少存储并提升性能。
避免将 max_dynamic_paths 设置得过高——过大的值会增加资源消耗并降低效率。经验法则是将其保持在 10,000 以下。

Type hints

类型提示不仅仅是避免不必要类型推断的一种方式——它们还能彻底消除存储和处理过程中的间接层。带有类型提示的 JSON 路径始终与传统列以相同方式存储，无需使用判别（discriminator）列或在查询时进行动态解析。也就是说，在类型提示定义良好的情况下，嵌套 JSON 字段可以获得与一开始就建模为顶层字段相同的性能和效率。因此，对于大部分结构一致、但仍希望利用 JSON 灵活性的数据集，类型提示提供了一种便捷方式，在无需重构模式（schema）或摄取管道的前提下，保持高性能。

高级特性

JSON 列可以像其他列一样用于主键。无法为子列单独指定编解码器（codec）。
它们支持通过诸如 JSONAllPathsWithTypes() 和 JSONDynamicPaths() 这样的函数进行自省。
可以使用 .^ 语法读取嵌套子对象。
查询语法可能与标准 SQL 不同，并且对于嵌套字段可能需要特殊的类型转换或运算符。

如需更多指导，请参阅 ClickHouse JSON 文档，或阅读我们的博客文章 A New Powerful JSON Data Type for ClickHouse。

示例

考虑下面的 JSON 示例，表示来自 Python PyPI 数据集的一行数据：

{
  "date": "2022-11-15",
  "country_code": "ES",
  "project": "clickhouse-connect",
  "type": "bdist_wheel",
  "installer": "pip",
  "python_minor": "3.9",
  "system": "Linux",
  "version": "0.3.0"
}

假设这个 schema 是静态的，并且各字段的数据类型都可以被清晰定义。即使数据是 NDJSON 格式（每行一个 JSON 行），对于这样一个 schema 也没有必要使用 JSON 类型。只需使用传统的数据类型来定义该 schema 即可。

CREATE TABLE pypi (
  `date` Date,
  `country_code` String,
  `project` String,
  `type` String,
  `installer` String,
  `python_minor` String,
  `system` String,
  `version` String
)
ENGINE = MergeTree
ORDER BY (project, date)

然后插入 JSON 行：

INSERT INTO pypi FORMAT JSONEachRow
{"date":"2022-11-15","country_code":"ES","project":"clickhouse-connect","type":"bdist_wheel","installer":"pip","python_minor":"3.9","system":"Linux","version":"0.3.0"}

考虑包含 250 万篇学术论文的 arXiv 数据集。该数据集以 NDJSON 格式提供，其中每一行都代表一篇已发表的学术论文。下面给出一行示例数据：

{
  "id": "2101.11408",
  "submitter": "Daniel Lemire",
  "authors": "Daniel Lemire",
  "title": "Number Parsing at a Gigabyte per Second",
  "comments": "Software at https://github.com/fastfloat/fast_float and\n  https://github.com/lemire/simple_fastfloat_benchmark/",
  "journal-ref": "Software: Practice and Experience 51 (8), 2021",
  "doi": "10.1002/spe.2984",
  "report-no": null,
  "categories": "cs.DS cs.MS",
  "license": "http://creativecommons.org/licenses/by/4.0/",
  "abstract": "With disks and networks providing gigabytes per second ....\n",
  "versions": [
    {
      "created": "Mon, 11 Jan 2021 20:31:27 GMT",
      "version": "v1"
    },
    {
      "created": "Sat, 30 Jan 2021 23:57:29 GMT",
      "version": "v2"
    }
  ],
  "update_date": "2022-11-07",
  "authors_parsed": [
    [
      "Lemire",
      "Daniel",
      ""
    ]
  ]
}

虽然此处的 JSON 很复杂，并包含嵌套结构，但它是可预测的，字段的数量和类型不会改变。对于这个示例，我们本可以使用 JSON 类型，但也可以直接使用 Tuples 和 Nested 类型显式地定义结构：

CREATE TABLE arxiv
(
  `id` String,
  `submitter` String,
  `authors` String,
  `title` String,
  `comments` String,
  `journal-ref` String,
  `doi` String,
  `report-no` String,
  `categories` String,
  `license` String,
  `abstract` String,
  `versions` Array(Tuple(created String, version String)),
  `update_date` Date,
  `authors_parsed` Array(Array(String))
)
ENGINE = MergeTree
ORDER BY update_date

同样地，我们可以将数据以 JSON 格式插入：

INSERT INTO arxiv FORMAT JSONEachRow 
{"id":"2101.11408","submitter":"Daniel Lemire","authors":"Daniel Lemire","title":"Number Parsing at a Gigabyte per Second","comments":"Software at https://github.com/fastfloat/fast_float and\n  https://github.com/lemire/simple_fastfloat_benchmark/","journal-ref":"Software: Practice and Experience 51 (8), 2021","doi":"10.1002/spe.2984","report-no":null,"categories":"cs.DS cs.MS","license":"http://creativecommons.org/licenses/by/4.0/","abstract":"With disks and networks providing gigabytes per second ....\n","versions":[{"created":"Mon, 11 Jan 2021 20:31:27 GMT","version":"v1"},{"created":"Sat, 30 Jan 2021 23:57:29 GMT","version":"v2"}],"update_date":"2022-11-07","authors_parsed":[["Lemire","Daniel",""]]}

假设又添加了一个名为 tags 的列。如果这只是一个字符串列表，我们可以将其建模为 Array(String)，但我们再假设你可以添加由不同类型组成的任意标签结构（注意 score 可以是字符串或整数）。我们修改后的 JSON 文档：

{
 "id": "2101.11408",
 "submitter": "Daniel Lemire",
 "authors": "Daniel Lemire",
 "title": "Number Parsing at a Gigabyte per Second",
 "comments": "Software at https://github.com/fastfloat/fast_float and\n  https://github.com/lemire/simple_fastfloat_benchmark/",
 "journal-ref": "Software: Practice and Experience 51 (8), 2021",
 "doi": "10.1002/spe.2984",
 "report-no": null,
 "categories": "cs.DS cs.MS",
 "license": "http://creativecommons.org/licenses/by/4.0/",
 "abstract": "With disks and networks providing gigabytes per second ....\n",
 "versions": [
 {
   "created": "Mon, 11 Jan 2021 20:31:27 GMT",
   "version": "v1"
 },
 {
   "created": "Sat, 30 Jan 2021 23:57:29 GMT",
   "version": "v2"
 }
 ],
 "update_date": "2022-11-07",
 "authors_parsed": [
 [
   "Lemire",
   "Daniel",
   ""
 ]
 ],
 "tags": {
   "tag_1": {
     "name": "ClickHouse user",
     "score": "A+",
     "comment": "A good read, applicable to ClickHouse"
   },
   "28_03_2025": {
     "name": "professor X",
     "score": 10,
     "comment": "Didn't learn much",
     "updates": [
       {
         "name": "professor X",
         "comment": "Wolverine found more interesting"
       }
     ]
   }
 }
}

在这种情况下，我们可以将 arXiv 文档建模为全部使用 JSON，或者仅添加一个 JSON 格式的 tags 列。下面提供这两种示例：

CREATE TABLE arxiv
(
  `doc` JSON(update_date Date)
)
ENGINE = MergeTree
ORDER BY doc.update_date

注意

我们在 JSON 定义中为 update_date 列提供了类型提示，因为会在排序键/主键中使用它。这样可以让 ClickHouse 确认该列不会为 null，并确保它知道应使用哪个 update_date 子列（每种类型可能都有多个子列，否则会产生歧义）。

我们可以向此表插入数据，并使用 JSONAllPathsWithTypes 函数和 PrettyJSONEachRow 输出格式来查看随后推断出的表结构：

INSERT INTO arxiv FORMAT JSONAsObject 
{"id":"2101.11408","submitter":"Daniel Lemire","authors":"Daniel Lemire","title":"Number Parsing at a Gigabyte per Second","comments":"Software at https://github.com/fastfloat/fast_float and\n  https://github.com/lemire/simple_fastfloat_benchmark/","journal-ref":"Software: Practice and Experience 51 (8), 2021","doi":"10.1002/spe.2984","report-no":null,"categories":"cs.DS cs.MS","license":"http://creativecommons.org/licenses/by/4.0/","abstract":"With disks and networks providing gigabytes per second ....\n","versions":[{"created":"Mon, 11 Jan 2021 20:31:27 GMT","version":"v1"},{"created":"Sat, 30 Jan 2021 23:57:29 GMT","version":"v2"}],"update_date":"2022-11-07","authors_parsed":[["Lemire","Daniel",""]],"tags":{"tag_1":{"name":"ClickHouse user","score":"A+","comment":"A good read, applicable to ClickHouse"},"28_03_2025":{"name":"professor X","score":10,"comment":"Didn't learn much","updates":[{"name":"professor X","comment":"Wolverine found more interesting"}]}}}

SELECT JSONAllPathsWithTypes(doc)
FROM arxiv
FORMAT PrettyJSONEachRow

{
  "JSONAllPathsWithTypes(doc)": {
    "abstract": "String",
    "authors": "String",
    "authors_parsed": "Array(Array(Nullable(String)))",
    "categories": "String",
    "comments": "String",
    "doi": "String",
    "id": "String",
    "journal-ref": "String",
    "license": "String",
    "submitter": "String",
    "tags.28_03_2025.comment": "String",
    "tags.28_03_2025.name": "String",
    "tags.28_03_2025.score": "Int64",
    "tags.28_03_2025.updates": "Array(JSON(max_dynamic_types=16, max_dynamic_paths=256))",
    "tags.tag_1.comment": "String",
    "tags.tag_1.name": "String",
    "tags.tag_1.score": "String",
    "title": "String",
    "update_date": "Date",
    "versions": "Array(JSON(max_dynamic_types=16, max_dynamic_paths=256))"
  }
}

1 row in set. Elapsed: 0.003 sec.

或者，我们也可以使用之前的模式定义和一个 JSON tags 列来建模。一般更推荐这种方式，可以尽量减少 ClickHouse 所需的推断：

CREATE TABLE arxiv
(
    `id` String,
    `submitter` String,
    `authors` String,
    `title` String,
    `comments` String,
    `journal-ref` String,
    `doi` String,
    `report-no` String,
    `categories` String,
    `license` String,
    `abstract` String,
    `versions` Array(Tuple(created String, version String)),
    `update_date` Date,
    `authors_parsed` Array(Array(String)),
    `tags` JSON()
)
ENGINE = MergeTree
ORDER BY update_date

INSERT INTO arxiv FORMAT JSONEachRow 
{"id":"2101.11408","submitter":"Daniel Lemire","authors":"Daniel Lemire","title":"Number Parsing at a Gigabyte per Second","comments":"Software at https://github.com/fastfloat/fast_float and\n  https://github.com/lemire/simple_fastfloat_benchmark/","journal-ref":"Software: Practice and Experience 51 (8), 2021","doi":"10.1002/spe.2984","report-no":null,"categories":"cs.DS cs.MS","license":"http://creativecommons.org/licenses/by/4.0/","abstract":"With disks and networks providing gigabytes per second ....\n","versions":[{"created":"Mon, 11 Jan 2021 20:31:27 GMT","version":"v1"},{"created":"Sat, 30 Jan 2021 23:57:29 GMT","version":"v2"}],"update_date":"2022-11-07","authors_parsed":[["Lemire","Daniel",""]],"tags":{"tag_1":{"name":"ClickHouse user","score":"A+","comment":"A good read, applicable to ClickHouse"},"28_03_2025":{"name":"professor X","score":10,"comment":"Didn't learn much","updates":[{"name":"professor X","comment":"Wolverine found more interesting"}]}}}

此时我们就可以推断出子列 tags 的类型。

SELECT JSONAllPathsWithTypes(tags)
FROM arxiv
FORMAT PrettyJSONEachRow

{
  "JSONAllPathsWithTypes(tags)": {
    "28_03_2025.comment": "String",
    "28_03_2025.name": "String",
    "28_03_2025.score": "Int64",
    "28_03_2025.updates": "Array(JSON(max_dynamic_types=16, max_dynamic_paths=256))",
    "tag_1.comment": "String",
    "tag_1.name": "String",
    "tag_1.score": "String"
  }
}

1 row in set. Elapsed: 0.002 sec.

何时使用 JSON 类型​

在以下情况使用 JSON 类型：​

在以下情况下使用 String 列（或结构化类型）：​

使用 JSON 的注意事项和提示​

高级特性​

示例​

何时使用 `JSON` 类型

在以下情况使用 `JSON` 类型：

在以下情况下使用 `String` 列（或结构化类型）：

使用 JSON 的注意事项和提示

高级特性

示例