Merge table function
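The merge table function lets us query multiple tables in parallel. It does this by creating a temporary Merge table whose structure is derived by taking the union of the source tables' columns and deriving a common type for each column.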
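Setting up the tables
We'll learn how to use this function with the help of Jeff Sackmann's tennis dataset. We'll load CSV files containing matches going back to the 1960s, but we'll create a slightly different schema for each decade. We'll also add a couple of extra columns to the 1990s table.
The import statements are shown below: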
CREATE OR REPLACE TABLE atp_matches_1960s ORDER BY tourney_id AS
SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, score
FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1968..1969}.csv')
SETTINGS schema_inference_make_columns_nullable=0,
schema_inference_hints='winner_seed Nullable(String), loser_seed Nullable(UInt8)';
CREATE OR REPLACE TABLE atp_matches_1970s ORDER BY tourney_id AS
SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, splitByWhitespace(score) AS score
FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1970..1979}.csv')
SETTINGS schema_inference_make_columns_nullable=0,
schema_inference_hints='winner_seed Nullable(UInt8), loser_seed Nullable(UInt8)';
CREATE OR REPLACE TABLE atp_matches_1980s ORDER BY tourney_id AS
SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, splitByWhitespace(score) AS score
FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1980..1989}.csv')
SETTINGS schema_inference_make_columns_nullable=0,
schema_inference_hints='winner_seed Nullable(UInt16), loser_seed Nullable(UInt16)';
CREATE OR REPLACE TABLE atp_matches_1990s ORDER BY tourney_id AS
SELECT tourney_id, surface, winner_name, loser_name, winner_seed, loser_seed, splitByWhitespace(score) AS score,
toBool(arrayExists(x -> position(x, 'W/O') > 0, score))::Nullable(bool) AS walkover,
toBool(arrayExists(x -> position(x, 'RET') > 0, score))::Nullable(bool) AS retirement
FROM url('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/refs/heads/master/atp_matches_{1990..1999}.csv')
SETTINGS schema_inference_make_columns_nullable=0,
schema_inference_hints='winner_seed Nullable(UInt16), loser_seed Nullable(UInt16), surface Enum(\'Hard\', \'Grass\', \'Clay\', \'Carpet\')';
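Schema of multiple tables
We can run the following query to list the columns of each table side by side with their types, which makes the differences easier to see.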
SELECT * EXCEPT(position) FROM (
SELECT position, name,
any(if(table = 'atp_matches_1960s', type, null)) AS 1960s,
any(if(table = 'atp_matches_1970s', type, null)) AS 1970s,
any(if(table = 'atp_matches_1980s', type, null)) AS 1980s,
any(if(table = 'atp_matches_1990s', type, null)) AS 1990s
FROM system.columns
WHERE database = currentDatabase() AND table LIKE 'atp_matches%'
GROUP BY ALL
ORDER BY position ASC
)
SETTINGS output_format_pretty_max_value_width=25;
┌─name────────┬─1960s────────────┬─1970s───────────┬─1980s────────────┬─1990s─────────────────────┐
│ tourney_id  │ String           │ String          │ String           │ String                    │
│ surface     │ String           │ String          │ String           │ Enum8('Hard' = 1, 'Grass'⋯│
│ winner_name │ String           │ String          │ String           │ String                    │
│ loser_name  │ String           │ String          │ String           │ String                    │
│ winner_seed │ Nullable(String) │ Nullable(UInt8) │ Nullable(UInt16) │ Nullable(UInt16)          │
│ loser_seed  │ Nullable(UInt8)  │ Nullable(UInt8) │ Nullable(UInt16) │ Nullable(UInt16)          │
│ score       │ String           │ Array(String)   │ Array(String)    │ Array(String)             │
│ walkover    │ ᴺᵁᴸᴸ             │ ᴺᵁᴸᴸ            │ ᴺᵁᴸᴸ             │ Nullable(Bool)            │
│ retirement  │ ᴺᵁᴸᴸ             │ ᴺᵁᴸᴸ            │ ᴺᵁᴸᴸ             │ Nullable(Bool)            │
└─────────────┴──────────────────┴─────────────────┴──────────────────┴───────────────────────────┘
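Let's go through the differences:
- The 1970s table changes the type of winner_seed from Nullable(String) to Nullable(UInt8) and the type of score from String to Array(String).
- The 1980s table changes the type of winner_seed and loser_seed from Nullable(UInt8) to Nullable(UInt16).
- The 1990s table changes the type of surface from String to Enum('Hard', 'Grass', 'Clay', 'Carpet') and adds the walkover and retirement columns.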
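Before querying, it can help to check the structure that merge derives from these four tables. The statement below is a minimal sketch, assuming DESCRIBE accepts the merge table function the same way it accepts other table functions; the exact types reported may differ between ClickHouse versions, so no output is shown here.

```sql
-- Show the column names and derived types of the temporary Merge table
-- built from every table whose name matches the pattern.
DESCRIBE TABLE merge('atp_matches*');
```

The columns to look at are the ones whose types differ between decades, such as winner_seed and score.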
Querying multiple tables with merge
Let's write a query to find the matches that John McEnroe won against an opponent seeded #1:
SELECT loser_name, score
FROM merge('atp_matches*')
WHERE winner_name = 'John McEnroe'
AND loser_seed = 1;
┌─loser_name────┬─score───────────────────────────┐
│ Bjorn Borg    │ ['6-3','6-4']                   │
│ Bjorn Borg    │ ['7-6','6-1','6-7','5-7','6-4'] │
│ Bjorn Borg    │ ['7-6','6-4']                   │
│ Bjorn Borg    │ ['4-6','7-6','7-6','6-4']       │
│ Jimmy Connors │ ['6-1','6-3']                   │
│ Ivan Lendl    │ ['6-2','4-6','6-3','6-7','7-6'] │
│ Ivan Lendl    │ ['6-3','3-6','6-3','7-6']       │
│ Ivan Lendl    │ ['6-1','6-3']                   │
│ Stefan Edberg │ ['6-2','6-3']                   │
│ Stefan Edberg │ ['7-6','6-2']                   │
│ Stefan Edberg │ ['6-2','6-2']                   │
│ Jakob Hlasek  │ ['6-3','7-6']                   │
└───────────────┴─────────────────────────────────┘
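The table-name argument can also be used to narrow the set of tables being queried. The following is a sketch under two assumptions: that the pattern is evaluated as a regular expression against table names, and that merge accepts an optional database argument before the pattern. The anchored pattern here is illustrative and should match only the 1970s and 1980s tables.

```sql
-- Assumption: the second argument is a regular expression, so an anchored
-- pattern restricts the query to atp_matches_1970s and atp_matches_1980s.
SELECT loser_name, score
FROM merge(currentDatabase(), '^atp_matches_19[78]0s$')
WHERE winner_name = 'John McEnroe'
AND loser_seed = 1;
```

Anchoring the pattern with ^ and $ avoids accidentally picking up other tables that merely contain the same prefix.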
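Next, suppose we want to narrow those matches down to the ones where McEnroe was seeded #3 or lower (that is, his seed number was 3 or greater). This is a bit trickier because winner_seed
uses a different type in each table: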
SELECT loser_name, score, winner_seed
FROM merge('atp_matches*')
WHERE winner_name = 'John McEnroe'
AND loser_seed = 1
AND multiIf(
variantType(winner_seed) = 'UInt8', variantElement(winner_seed, 'UInt8') >= 3,
variantType(winner_seed) = 'UInt16', variantElement(winner_seed, 'UInt16') >= 3,
variantElement(winner_seed, 'String')::UInt16 >= 3
);
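We use the variantType function to check the type of winner_seed for each row and then variantElement to extract the underlying value. When the type is String, we cast the value to a number before doing the comparison. Running the query gives the following result: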
┌─loser_name────┬─score─────────┬─winner_seed─┐
│ Bjorn Borg    │ ['6-3','6-4'] │ 3           │
│ Stefan Edberg │ ['6-2','6-3'] │ 6           │
│ Stefan Edberg │ ['7-6','6-2'] │ 4           │
│ Stefan Edberg │ ['6-2','6-2'] │ 7           │
└───────────────┴───────────────┴─────────────┘
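When writing a filter like the one above, it can be useful to first check which runtime types actually occur in the column. The query below is a small sketch of that idea; the aliases seed_type and matches are made up for illustration, and no output is shown because the counts depend on the loaded data.

```sql
-- Count rows per runtime type of winner_seed in the merged result.
-- variantType returns 'None' for rows where the value is NULL.
SELECT variantType(winner_seed) AS seed_type, count() AS matches
FROM merge('atp_matches*')
GROUP BY seed_type
ORDER BY matches DESC;
```

Each type name that appears in the result needs its own branch in the multiIf expression, with the last branch acting as the fallback.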
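Which table do rows come from when using merge?
What if we want to know which table each row comes from? We can use the _table virtual column for that, as shown in the following query: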
SELECT _table, loser_name, score, winner_seed
FROM merge('atp_matches*')
WHERE winner_name = 'John McEnroe'
AND loser_seed = 1
AND multiIf(
variantType(winner_seed) = 'UInt8', variantElement(winner_seed, 'UInt8') >= 3,
variantType(winner_seed) = 'UInt16', variantElement(winner_seed, 'UInt16') >= 3,
variantElement(winner_seed, 'String')::UInt16 >= 3
);
┌─_table────────────┬─loser_name────┬─score─────────┬─winner_seed─┐
│ atp_matches_1970s │ Bjorn Borg    │ ['6-3','6-4'] │ 3           │
│ atp_matches_1980s │ Stefan Edberg │ ['6-2','6-3'] │ 6           │
│ atp_matches_1980s │ Stefan Edberg │ ['7-6','6-2'] │ 4           │
│ atp_matches_1980s │ Stefan Edberg │ ['6-2','6-2'] │ 7           │
└───────────────────┴───────────────┴───────────────┴─────────────┘
We can also use this virtual column in a query that counts the values of the walkover column for each table:
SELECT _table, walkover, count()
FROM merge('atp_matches*')
GROUP BY ALL
ORDER BY _table;
┌─_table────────────┬─walkover─┬─count()─┐
│ atp_matches_1960s │ ᴺᵁᴸᴸ     │    7542 │
│ atp_matches_1970s │ ᴺᵁᴸᴸ     │   39165 │
│ atp_matches_1980s │ ᴺᵁᴸᴸ     │   36233 │
│ atp_matches_1990s │ true     │     128 │
│ atp_matches_1990s │ false    │   37022 │
└───────────────────┴──────────┴─────────┘
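We can see that the walkover column is NULL for every row that doesn't come from atp_matches_1990s, because only that table has the column. We need to update our query so that, when walkover is NULL, it checks whether the score column contains the string W/O: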
SELECT _table,
multiIf(
walkover IS NOT NULL,
walkover,
variantType(score) = 'Array(String)',
toBool(arrayExists(
x -> position(x, 'W/O') > 0,
variantElement(score, 'Array(String)')
)),
variantElement(score, 'String') LIKE '%W/O%'
),
count()
FROM merge('atp_matches*')
GROUP BY ALL
ORDER BY _table;
If the underlying type of score is Array(String), we have to go through the array looking for W/O, whereas if its type is String we can search for W/O directly in the string.
┌─_table────────────┬─multiIf(isNo⋯, '%W/O%'))─┬─count()─┐
│ atp_matches_1960s │ true                     │     242 │
│ atp_matches_1960s │ false                    │    7300 │
│ atp_matches_1970s │ true                     │     422 │
│ atp_matches_1970s │ false                    │   38743 │
│ atp_matches_1980s │ true                     │      92 │
│ atp_matches_1980s │ false                    │   36141 │
│ atp_matches_1990s │ true                     │     128 │
│ atp_matches_1990s │ false                    │   37022 │
└───────────────────┴──────────────────────────┴─────────┘
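The generated column header in the result above is truncated and hard to read. One possible refinement, sketched below, is to give the expression an alias; the names is_walkover and matches are illustrative and not part of the original query. The logic is otherwise unchanged, so the counts should match the table above.

```sql
SELECT _table,
    multiIf(
        -- Prefer the precomputed column where it exists (1990s table only).
        walkover IS NOT NULL, walkover,
        -- Array-typed scores: look for W/O in any element.
        variantType(score) = 'Array(String)',
        toBool(arrayExists(
            x -> position(x, 'W/O') > 0,
            variantElement(score, 'Array(String)')
        )),
        -- Fallback: String-typed scores are searched directly.
        variantElement(score, 'String') LIKE '%W/O%'
    ) AS is_walkover,
    count() AS matches
FROM merge('atp_matches*')
GROUP BY ALL
ORDER BY _table, is_walkover;
```

Sorting by the alias as well as _table keeps the true and false rows for each decade in a stable order.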