Merge table function
The merge table function lets us query multiple tables in parallel. It does this by creating a temporary Merge table and derives this table's structure by taking a union of their columns and by deriving common types.
Setup tables
We're going to learn how to use this function with help from Jeff Sackmann's tennis dataset. We're going to process CSV files that contain matches going back to the 1960s, but we'll create a slightly different schema for each decade. We'll also add a couple of extra columns for the 1990s decade.
The import statements are shown below:
Schema of multiple tables
We can run the following query to list the columns in each table along with their types side by side, so that it's easier to see the differences.
Let's go through the differences:
- 1970s changes the type of
winner_seed
fromNullable(String)
toNullable(UInt8)
andscore
fromString
toArray(String)
. - 1980s changes
winner_seed
andloser_seed
fromNullable(UInt8)
toNullable(UInt16)
. - 1990s changes
surface
fromString
toEnum('Hard', 'Grass', 'Clay', 'Carpet')
and adds thewalkover
andretirement
columns.
Querying multiple tables with merge
Let's write a query to find the matches that John McEnroe won against someone who was seeded #1:
Next, let's say we want to filter those matches to find the ones where McEnroe was seeded #3 or lower.
This is a bit trickier because winner_seed
uses different types across the various tables:
We use the variantType
function to check the type of winner_seed
for each row and then variantElement
to extract the underlying value.
When the type is String
, we cast to a number and then do the comparison.
The result of running the query is shown below:
Which table do rows come from when using merge?
What if we want to know which table rows come from?
We can use the _table
virtual column to do this, as shown in the following query:
We could also use this virtual column as part of a query to count the values for the walkover
column:
We can see that the walkover
column is NULL
for everything except atp_matches_1990s
.
We'll need to update our query to check whether the score
column contains the string W/O
if the walkover
column is NULL
:
If the underlying type of score
is Array(String)
we have to go over the array and look for W/O
, whereas if it has a type of String
we can just search for W/O
in the string.