Refreshable materialized view

Refreshable materialized views are conceptually similar to materialized views in traditional OLTP databases, storing the result of a specified query for quick retrieval and reducing the need to repeatedly execute resource-intensive queries. Unlike ClickHouse's incremental materialized views, this requires the periodic execution of the query over the full dataset - the results of which are stored in a target table for querying. This result set should, in theory, be smaller than the original dataset, allowing the subsequent query to execute faster.

The diagram explains how Refreshable Materialized Views work:

You can also see the following video:

When should refreshable materialized views be used?

ClickHouse incremental materialized views are enormously powerful and typically scale much better than the approach used by refreshable materialized views, especially in cases where an aggregate over a single table needs to be performed. By only computing the aggregation over each block of data as it is inserted and merging the incremental states in the final table, the query only ever executes on a subset of the data. This method scales to potentially petabytes of data and is usually the preferred method.

However, there are use cases where this incremental process is not required or is not applicable. Some problems are either incompatible with an incremental approach or don't require real-time updates, with a periodic rebuild being more appropriate. For example, you may want to regularly perform a complete re-computation of a view over the full dataset because it uses a complex join, which is incompatible with an incremental approach.

Refreshable materialized views can run batch processes performing tasks such as denormalization. Dependencies can be created between refreshable materialized views such that one view depends on the results of another and only executes once it is complete. This can replace scheduled workflows or simple DAGs such as a dbt job. To find out more about how to set dependencies between refreshable materialized views go to CREATE VIEW, Dependencies section.

How do you refresh a refreshable materialized view?

Refreshable materialized views are refreshed automatically on an interval that's defined during creation. For example, the following materialized view is refreshed every minute:

If you want to force refresh a materialized view, you can use the SYSTEM REFRESH VIEW clause:

You can also cancel, stop, or start a view. For more details, see the managing refreshable materialized views documentation.

When was a refreshable materialized view last refreshed?

To find out when a refreshable materialized view was last refreshed, you can query the system.view_refreshes system table, as shown below:

How can I change the refresh rate?

To change the refresh rate of a refreshable materialized view, use the ALTER TABLE...MODIFY REFRESH syntax.

Once you've done that, you can use When was a refreshable materialized view last refreshed? query to check that the rate has been updated:

Using `APPEND` to add new rows

The APPEND functionality allows you to add new rows to the end of the table instead of replacing the whole view.

One use of this feature is to capture snapshots of values at a point in time. For example, let's imagine that we have an events table populated by a stream of messages from Kafka, Redpanda, or another streaming data platform.

This dataset has 4096 values in the uuid column. We can write the following query to find the ones with the highest total count:

Let's say we want to capture the count for each uuid every 10 seconds and store it in a new table called events_snapshot. The schema of events_snapshot would look like this:

We could then create a refreshable materialized view to populate this table:

We can then query events_snapshot to get the count over time for a specific uuid:

Examples

Lets now have a look at how to use refreshable materialized views with some example datasets.

Stack Overflow

The denormalizing data guide shows various techniques for denormalizing data using a Stack Overflow dataset. We populate data into the following tables: votes, users, badges, posts, and postlinks.

In that guide, we showed how to denormalize the postlinks dataset onto the posts table with the following query:

We then showed how to do a one-time insert of this data into the posts_with_links table, but in a production system, we'd want to run this operation periodically.

Both the posts and postlinks table could potentially be updated. Therefore, rather than attempt to implement this join using incremental materialized views, it may be sufficient to simply schedule this query to run at a set interval, e.g., once every hour, storing the results in a post_with_links table.

This is where a refreshable materialized view helps, and we can create one with the following query:

The view will execute immediately and every hour thereafter as configured to ensure updates to the source table are reflected. Importantly, when the query re-runs, the result set is atomically and transparently updated.

Note

The syntax here is identical to an incremental materialized view, except we include a REFRESH clause:

IMDb

In the dbt and ClickHouse integration guide we populated an IMDb dataset with the following tables: actors, directors, genres, movie_directors, movies, and roles.

We can then write the following query can be used to compute a summary of each actor, ordered by the most movie appearances.

It doesn't take too long to return a result, but let's say we want it to be even quicker and less computationally expensive. Suppose that this dataset is also subject to constant updates - movies are constantly released with new actors and directors also emerging.

It's time for a refreshable materialized view, so let's first create a target table for the results:

And now we can define the view:

The view will execute immediately and every minute thereafter as configured to ensure updates to the source table are reflected. Our previous query to obtain a summary of actors becomes syntactically simpler and significantly faster!

Suppose we add a new actor, "Clicky McClickHouse" to our source data who happens to have appeared in a lot of films!

Less than 60 seconds later, our target table is updated to reflect the prolific nature of Clicky's acting:

When should refreshable materialized views be used?​

How do you refresh a refreshable materialized view?​

When was a refreshable materialized view last refreshed?​

How can I change the refresh rate?​

Using APPEND to add new rows​

Examples​

Stack Overflow​

IMDb​