ClickHouse Cloud: Fast, Updatable Lookups with the Join Table Engine

Dictionaries in ClickHouse

When you move data from your transactional or event-based data sources to an analytical database like ClickHouse, you will likely consider modeling a dimensional schema according to the Kimball methodology.

Dimensional modeling always uses the concepts of facts (measures), and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts.

It follows that fact tables are typically immutable, and data are appended to them; whereas dimension tables are smaller, and subject to (infrequent) updates (slowly changing dimensions). When you run an analytical query, you have to join those dimensions against the fact table.

One common approach to do this in ClickHouse is to have a copy of the dimension data in memory in a Dictionary. This approach enables Direct Joins and is recommended for optimizing join performance.

A Dictionary is set up by specifying, among others, the SOURCE and LIFETIME attributes. ClickHouse pulls fresh data from the source and uses LIFETIME to determine how often it should refresh the Dictionary. But some customers asked me: Isn't there a way to update a Dictionary like a regular table? And indeed, there is a way to achieve this, using another special table engine.

The `Join` table engine

The Join table engine is just what we need here. It is an in-memory structure, laying out data for a specific type of join that has to be stated in the table definition, and it is backed by a persistence layer. Setting up the Join table, you need to configure

join strictness
join type
the key column(s) you want to use in the join.

Join strictness

This can be ANY or ALL. With ALL, all matching rows are taken from the Join table; ANY takes only the latest one.

This means with ANY type an INSERT becomes an UPSERT by key: you update a dimension row by inserting another row with the same key.

Join Type

One of ClickHouse's join types, like INNER or LEFT or RIGHT. For dimensional modeling, you will mostly use LEFT.

Querying a `Join` table

While a Join table can be queried like any regular table using SELECT, there are two more ways of using it:

If you place the Join table in a JOIN query where the join parameters match the ones in the definition of the table, ClickHouse will automatically know to use the Direct Join algorithm.
You can look up values for a given key using joinGet. This works just like dictGet for Dictionaries.

Drawbacks in the open source implementation

So a Join table with ANY LEFT join condition would be just the thing to implement a Type 1 slowly changing dimension, right? You can update values for a given key and have a high performing join. Why don't we use it all the time, then?

It turns out that the implementation of the Join table engine in open source ClickHouse has a couple of drawbacks that make it less suitable for this use case:

Join tables are not distributed; each cluster node would have to maintain its own copy/version of the table.
The persistence layer is not built for frequent inserts/updates. The Join table engine persists data as compressed Native-format .bin files in the table's data directory on disk (one file per INSERT batch). On server startup, these files are read back sequentially and the in-memory HashJoin hash table is reconstructed from them. This means each update will create a new numbered .bin file. There is no background compaction process — files are never merged automatically. Over time, this leads to performance degradation.

Implementation in ClickHouse Cloud

These issues have been solved very elegantly in ClickHouse Cloud. In ClickHouse Cloud, a Join table is actually transparently implemented as a SharedJoin table with a MergeTree family backing table:

for ALL join, this is a MergeTree table
for ANY join, a ReplacingMergeTree table.

You can find these tables in system.tables. The naming convention for the underlying table is .inner_id.SharedJoin.<uuid of Join table>.

Note: there is a setting join_any_take_last_row that is not honored

The in-memory table is populated from the persistent (underlying) table using a query (which includes FINAL in the case of ANY join) on insert to Join table (with filter to select only newest data) and on table load on startup.

Example: Data Enrichment

Probably the most meaningful use case is enrichment / dimensional modeling with an ANY LEFT join. To illustrate this specific case, let's take the example from the ClickHouse docs and modify it a bit:

1-- Create the fact table and insert some data
2CREATE OR REPLACE TABLE id_val (
3    `id` UInt32,
4    `val` UInt32
5) ENGINE = MergeTree
6ORDER BY (id);
7
8INSERT INTO id_val VALUES
9    (1, 11), (2, 12), (3, 13);
10
11-- Creating the right-side Join table:
12CREATE OR REPLACE TABLE id_val_join (
13    `id` UInt32,
14    `val` UInt8
15) ENGINE = Join(ANY, LEFT, id);
16
17-- Insert some values
18INSERT INTO id_val_join VALUES
19    (1, 21), (1, 22), (3, 23);
20
21-- Enrichment query
22SELECT *
23FROM id_val
24ANY LEFT JOIN id_val_join USING (id);

1┌─id─┬─val─┬─id_val_join.val─┐
21. │  1 │  11 │              22 │
32. │  2 │  12 │               0 │
43. │  3 │  13 │              23 │
5   └────┴─────┴─────────────────┘

Now, let's find out what happens in the Join table and in the underlying table when we upsert an entry for key 1.

1-- And another insert
2INSERT INTO id_val_join VALUES (1,42);

Look up the underlying table:

1SELECT database, name, uuid, engine
2FROM system.tables
3WHERE name = 'id_val_join'
4FORMAT Vertical;

1database:                         default
2name:                             id_val_join
3uuid:                             64f169ee-977d-46c2-b067-580fdf8c1d4b
4engine:                           SharedJoin

The Join table deduplicates and keeps the latest entry only:

1SELECT * FROM id_val_join;

1┌─id─┬─val─┐
21. │  3 │  23 │
32. │  1 │  42 │
4   └────┴─────┘

Constructing the underlying ReplacingMergeTree table from the UUID, we see this one retains the duplicates until they are merged out:

1SELECT * FROM default.`.inner_id.SharedJoin.64f169ee-977d-46c2-b067-580fdf8c1d4b`;

1┌─id─┬─val─┐
21. │  1 │  22 │
32. │  3 │  23 │
43. │  1 │  42 │
5   └────┴─────┘

Finally, running the enrichment query again, we see how the updated dimension entry reflects in the result:

1SELECT *
2FROM id_val
3ANY LEFT JOIN id_val_join USING (id);

1┌─id─┬─val─┬─id_val_join.val─┐
21. │  1 │  11 │              42 │
32. │  2 │  12 │               0 │
43. │  3 │  13 │              23 │
5   └────┴─────┴─────────────────┘

Every time you insert a new row or a set of rows:

The data is inserted into the underlying ReplacingMergeTree table.
The in-memory representation is updated on insert to the Join table (with a filter on block ID to select only newest data) and on table load on startup.
This query also applies FINAL, so the in-memory Join table will never have duplicates.
The join_any_take_last_row is ignored. You always get the latest entry.

Conclusion

The Join table engine in ClickHouse provides a precomputed hash map that can be used to speed up JOINs.
Like a dictionary, a Join table is kept in memory. But it is backed by a persistence layer (saved in files).
In ClickHouse Cloud, Join tables are automatically clustered and backed by full MergeTree tables, making them suitable for frequent updates.
In particular, for dimensional modeling in ClickHouse Cloud, use Join(ANY, LEFT, id) — upserting, deduplication, and data compaction is automatically handled by the underlying ReplacingMergeTree!

Get started today

Interested in seeing how ClickHouse works on your data? Get started with ClickHouse Cloud in minutes and receive $300 in free credits.

ClickHouse Cloud: Fast, Updatable Lookups with the Join Table Engine

Dictionaries in ClickHouse

The `Join` table engine

Join strictness

Join Type

Querying a `Join` table

Drawbacks in the open source implementation

Implementation in ClickHouse Cloud

Example: Data Enrichment

Conclusion

Get started today

Subscribe to our newsletter

Recent posts

What's new in clickhousectl v0.4.0

ClickHouse joins the Open Secure AI Alliance

How Jump Trading uses ClickHouse with Iceberg for analytics

Choosing Between ClickStack and Grafana for ClickHouse Observability

ClickHouse Cloud: Fast, Updatable Lookups with the Join Table Engine

Dictionaries in ClickHouse #

The Join table engine #

Join strictness #

Join Type #

Querying a Join table #

Drawbacks in the open source implementation #

Implementation in ClickHouse Cloud #

Example: Data Enrichment #

Conclusion #

Get started today

Subscribe to our newsletter

Recent posts

What's new in clickhousectl v0.4.0

ClickHouse joins the Open Secure AI Alliance

How Jump Trading uses ClickHouse with Iceberg for analytics

Choosing Between ClickStack and Grafana for ClickHouse Observability

Dictionaries in ClickHouse

The `Join` table engine

Join strictness

Join Type

Querying a `Join` table

Drawbacks in the open source implementation

Implementation in ClickHouse Cloud

Example: Data Enrichment

Conclusion