Working with JOINs in ClickHouse
ClickHouse fully supports standard SQL joins, enabling efficient data analysis. In this guide, you'll explore some of the available commonly used join types and how to use them with the help of Venn diagrams and example queries on a normalized IMDB dataset originating from the relational dataset repository.
Test Data and Resources
Instructions for creating and loading the tables can be found here. The dataset is also available in the playground for users who don't want to create and load the tables locally.
You'll use the following four tables from the example dataset:
The data in these four tables represent movies which can have one or many genres. The roles in a movie are played by actors.
The arrows in the diagram above represent foreign-to-primary-key-relationships. e.g. the movie_id column of a row in the genres table contains the id value from a row in the movies table.
There is a many-to-many relationship between movies and actors.
This many-to-many relationship is normalized into two one-to-many relationships by using the roles table.
Each row in the roles table contains the values of the id columns of the movies table and the actors table.
Join types supported in ClickHouse
ClickHouse supports the following join types:
You'll write example queries for each of the JOIN types above in the following sections.
INNER JOIN
The INNER JOIN returns, for each pair of rows matching on join keys, the column values of the row from the left table, combined with the column values of the row from the right table.
If a row has more than one match, then all matches are returned (meaning that the cartesian product is produced for rows with matching join keys).
This query finds the genre(s) for each movie by joining the movies table with the genres table:
The INNER keyword can be omitted.
The behavior of the INNER JOIN can be extended or changed, by using one of the following other join types.
(LEFT / RIGHT / FULL) OUTER JOIN
The LEFT OUTER JOIN behaves like INNER JOIN; plus, for non-matching left table rows, ClickHouse returns default values for the right table’s columns.
A RIGHT OUTER JOIN query is similar and also returns values from non-matching rows from the right table together with default values for the columns of the left table.
A FULL OUTER JOIN query combines the LEFT and RIGHT OUTER JOIN and returns values from non-matching rows from the left and the right table, together with default values for the columns of the right and left table, respectively.
ClickHouse can be configured to return NULLs instead of default values (however, for performance reasons, that is less recommended).
This query finds all movies that have no genre by querying for all rows from the movies table that don’t have matches in the genres table, and therefore get (at query time) the default value 0 for the movie_id column:
The OUTER keyword can be omitted.
CROSS JOIN
The CROSS JOIN produces the full cartesian product of the two tables without considering join keys.
Each row from the left table is combined with each row from the right table.
The following query, therefore, is combing each row from the movies table with each row from the genres table:
While the previous example query alone didn’t make much sense, it can be extended with a WHERE clause for associating matching rows to replicate INNER JOIN behavior for finding the genre(s) for each movie:
An alternative syntax for CROSS JOIN specifies multiple tables in the FROM clause separated by commas.
ClickHouse is rewriting a CROSS JOIN to an INNER JOIN if there are joining expressions in the WHERE section of the query.
You can check that for the example query via EXPLAIN SYNTAX (that returns the syntactically optimized version into which a query gets rewritten before being executed):
The INNER JOIN clause in the syntactically optimized CROSS JOIN query version contains the ALL keyword, that got explicitly added in order to keep the cartesian product semantics of the CROSS JOIN even when being rewritten into an INNER JOIN, for which the cartesian product can be disabled.
And because, as mentioned above, the OUTER keyword can be omitted for a RIGHT OUTER JOIN, and the optional ALL keyword can be added, you can write ALL RIGHT JOIN and it will work all right.
(LEFT / RIGHT) SEMI JOIN
A LEFT SEMI JOIN query returns column values for each row from the left table that has at least one join key match in the right table.
Only the first found match is returned (the cartesian product is disabled).
A RIGHT SEMI JOIN query is similar and returns values for all rows from the right table with at least one match in the left table, but only the first found match is returned.
This query finds all actors/actresses that performed in a movie in 2023.
Note that with a normal (INNER) join, the same actor/actress would show up more than one time if they had more than one role in 2023:
(LEFT / RIGHT) ANTI JOIN
A LEFT ANTI JOIN returns column values for all non-matching rows from the left table.
Similarly, the RIGHT ANTI JOIN returns column values for all non-matching right table rows.
An alternative formulation of the previous outer join example query is using an anti join for finding movies that have no genre in the dataset:
(LEFT / RIGHT / INNER) ANY JOIN
A LEFT ANY JOIN is the combination of the LEFT OUTER JOIN + the LEFT SEMI JOIN, meaning that ClickHouse returns column values for each row from the left table, either combined with the column values of a matching row from the right table or combined with default column values for the right table, in case no match exists.
If a row from the left table has more than one match in the right table, ClickHouse only returns the combined column values from the first found match (the cartesian product is disabled).
Similarly, the RIGHT ANY JOIN is the combination of the RIGHT OUTER JOIN + the RIGHT SEMI JOIN.
And the INNER ANY JOIN is the INNER JOIN with a disabled cartesian product.
The following example demonstrates the LEFT ANY JOIN with an abstract example using two temporary tables (left_table and right_table) constructed with the values table function:
This is the same query using a RIGHT ANY JOIN:
This is the query with an INNER ANY JOIN:
ASOF JOIN
The ASOF JOIN, provides non-exact matching capabilities.
If a row from the left table doesn’t have an exact match in the right table, then the closest matching row from the right table is used as a match instead.
This is particularly useful for time-series analytics and can drastically reduce query complexity.
The following example performs time-series analytics of stock market data.
A quotes table contains stock symbol quotes based on specific times of the day.
The price is updated every 10 seconds in the example data.
A trades table lists symbol trades - a specific volume of a symbol got bought at a specific time:
In order to calculate the concrete cost of each trade, we need to match the trades with their closest quote time.
This is easy and compact with the ASOF JOIN, where you use the ON clause for specifying an exact match condition and the AND clause for specifying the closest match condition - for a specific symbol (exact match) you are looking for the row with the ‘closest’ time from the quotes table at exactly or before the time (non-exact match) of a trade of that symbol:
The ON clause of the ASOF JOIN is required and specifies an exact match condition next to the non-exact match condition of the AND clause.
Summary
This guide shows how ClickHouse supports all standard SQL JOIN types, plus specialized joins to power analytical queries. See the documentation for the JOIN statement for more details on JOINs.