Skip to main content
Skip to main content
Edit this page

Merge table function

The merge table function lets us query multiple tables in parallel. It does this by creating a temporary Merge table and derives this table's structure by taking a union of their columns and by deriving common types.

Setup tables

We're going to learn how to use this function with help from Jeff Sackmann's tennis dataset. We're going to process CSV files that contain matches going back to the 1960s, but we'll create a slightly different schema for each decade. We'll also add a couple of extra columns for the 1990s decade.

The import statements are shown below:

Schema of multiple tables

We can run the following query to list the columns in each table along with their types side by side, so that it's easier to see the differences.

Let's go through the differences:

  • 1970s changes the type of winner_seed from Nullable(String) to Nullable(UInt8) and score from String to Array(String).
  • 1980s changes winner_seed and loser_seed from Nullable(UInt8) to Nullable(UInt16).
  • 1990s changes surface from String to Enum('Hard', 'Grass', 'Clay', 'Carpet') and adds the walkover and retirement columns.

Querying multiple tables with merge

Let's write a query to find the matches that John McEnroe won against someone who was seeded #1:

Next, let's say we want to filter those matches to find the ones where McEnroe was seeded #3 or lower. This is a bit trickier because winner_seed uses different types across the various tables:

We use the variantType function to check the type of winner_seed for each row and then variantElement to extract the underlying value. When the type is String, we cast to a number and then do the comparison. The result of running the query is shown below:

Which table do rows come from when using merge?

What if we want to know which table rows come from? We can use the _table virtual column to do this, as shown in the following query:

We could also use this virtual column as part of a query to count the values for the walkover column:

We can see that the walkover column is NULL for everything except atp_matches_1990s. We'll need to update our query to check whether the score column contains the string W/O if the walkover column is NULL:

If the underlying type of score is Array(String) we have to go over the array and look for W/O, whereas if it has a type of String we can just search for W/O in the string.