
Denormalizing Data

Data denormalization is a technique in ClickHouse that uses flattened tables to help minimize query latency by avoiding joins.

Comparing Normalized vs. Denormalized Schemas

Denormalizing data involves intentionally reversing the normalization process to optimize database performance for specific query patterns. In normalized databases, data is split into multiple related tables to minimize redundancy and ensure data integrity. Denormalization reintroduces redundancy by combining tables, duplicating data, and incorporating calculated fields into either a single table or fewer tables - effectively moving any joins from query to insert time.

This process reduces the need for complex joins at query time and can significantly speed up read operations, making it ideal for applications with heavy read requirements and complex queries. However, it can increase the complexity of write operations and maintenance, as any changes to the duplicated data must be propagated across all instances to maintain consistency.

Denormalization in ClickHouse

A common technique popularized by NoSQL solutions is to denormalize data in the absence of JOIN support, effectively storing all statistics or related rows on a parent row as columns and nested objects. For example, in a schema for a blog, we can store all Comments as an Array of objects on their respective posts.
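A minimal sketch of what such a schema might look like (the table and column names here are purely illustrative):

CREATE TABLE posts_denormalized
(
    `Id` UInt32,
    `Title` String,
    `Body` String,
    -- comments denormalized onto each post as an array of named tuples
    `Comments` Array(Tuple(Author String, Text String, CreatedAt DateTime))
)
ENGINE = MergeTree
ORDER BY Id

INSERT INTO posts_denormalized VALUES
    (1, 'Hello world', 'First post', [('alice', 'Nice post!', now()), ('bob', 'Welcome!', now())])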

When to use denormalization

In general, we would recommend denormalizing in the following cases:

  • Denormalize tables which change infrequently or for which a delay before data is available for analytical queries can be tolerated i.e. the data can be completely reloaded in a batch.
  • Avoid denormalizing many-to-many relationships. This can result in the need to update many rows if a single source row changes.
  • Avoid denormalizing high cardinality relationships. If each row in a table has thousands of related entries in another table, these will need to be represented as an Array - either of a primitive type or tuples. Generally, arrays with more than 1000 tuples would not be recommended.
  • Rather than denormalizing all columns as nested objects, consider denormalizing just a statistic using materialized views (see below).

Not all information needs to be denormalized - just the key information that needs to be accessed frequently.

The denormalization work can be handled in either ClickHouse or upstream e.g. using Apache Flink.

Avoid denormalization on frequently updated data

For ClickHouse, denormalization is one of several options users can employ to optimize query performance, but it should be used carefully. If data is updated frequently and those updates need to be reflected in near real time, this approach should be avoided. Use it if the main table is largely append-only or can be reloaded periodically as a batch, e.g. daily.

As an approach it suffers from one principal challenge - write performance and updating data. More specifically, denormalization effectively shifts the responsibility of the data join from query time to ingestion time. While this can significantly improve query performance, it complicates ingestion and means that data pipelines need to re-insert a row into ClickHouse should any of the rows used to compose it change. A change in a single source row can therefore require many rows in ClickHouse to be updated. In complicated schemas, where rows have been composed from complex joins, a single row change in a nested component of a join can potentially mean millions of rows need to be updated.

Achieving this in real-time is often unrealistic and requires significant engineering, due to two challenges:

  1. Triggering the correct join statements when a table row changes. This should ideally not cause all objects for the join to be updated - rather just those that have been impacted. Modifying the joins to filter to the correct rows efficiently, and achieving this under high throughput, requires external tooling or engineering.
  2. Row updates in ClickHouse need to be carefully managed, introducing additional complexity.

A batch update process is thus more common, where all of the denormalized objects are periodically reloaded.

Practical cases for denormalization

Let's consider a few practical examples where denormalizing might make sense, and others where alternative approaches are more desirable.

Consider a Posts table that has already been denormalized with statistics such as AnswerCount and CommentCount - the source data is provided in this form. In reality, we might prefer to normalize this information, as it is likely to change frequently. Many of these columns are also available through other tables, e.g. comments for a post are available via the PostId column and the Comments table. For the purposes of this example, we assume posts are reloaded in a batch process.

We also only consider denormalizing other tables onto Posts, as we consider this our main table for analytics. Denormalizing in the other direction would be appropriate for some queries, with the same considerations as above applying.

For each of the following examples, assume a query exists which requires both tables to be used in a join.

Posts and Votes

Votes for posts are stored in a separate table. The optimized schema for this is shown below, along with the insert command to load the data:

CREATE TABLE votes
(
    `Id` UInt32,
    `PostId` Int32,
    `VoteTypeId` UInt8,
    `CreationDate` DateTime64(3, 'UTC'),
    `UserId` Int32,
    `BountyAmount` UInt8
)
ENGINE = MergeTree
ORDER BY (VoteTypeId, CreationDate, PostId)

INSERT INTO votes SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/votes/*.parquet')

0 rows in set. Elapsed: 26.272 sec. Processed 238.98 million rows, 2.13 GB (9.10 million rows/s., 80.97 MB/s.)

At first glance, votes might be a candidate for denormalizing onto the posts table. There are, however, a few challenges with this approach.

Votes are added to posts frequently. While this might diminish per post over time, the following query shows that we have around 40k votes per hour across 30k posts.

SELECT round(avg(c)) AS avg_votes_per_hr, round(avg(posts)) AS avg_posts_per_hr
FROM
(
    SELECT
        toStartOfHour(CreationDate) AS hr,
        count() AS c,
        uniq(PostId) AS posts
    FROM votes
    GROUP BY hr
)

┌─avg_votes_per_hr─┬─avg_posts_per_hr─┐
│            41759 │            33322 │
└──────────────────┴──────────────────┘

This could be addressed by batching if a delay can be tolerated, but this still requires us to handle updates unless we periodically reload all posts (unlikely to be desirable).

More troublesome is that some posts have an extremely high number of votes:

SELECT PostId, concat('https://stackoverflow.com/questions/', PostId) AS url, count() AS c
FROM votes
GROUP BY PostId
ORDER BY c DESC
LIMIT 5

┌───PostId─┬─url──────────────────────────────────────────┬─────c─┐
│ 11227902 │ https://stackoverflow.com/questions/11227902 │ 35123 │
│   927386 │ https://stackoverflow.com/questions/927386   │ 29090 │
│ 11227809 │ https://stackoverflow.com/questions/11227809 │ 27475 │
│   927358 │ https://stackoverflow.com/questions/927358   │ 26409 │
│  2003515 │ https://stackoverflow.com/questions/2003515  │ 25899 │
└──────────┴──────────────────────────────────────────────┴───────┘

The main observation here is that aggregated vote statistics for each post would be sufficient for most analysis - we do not need to denormalize all of the vote information. For example, the current Score column represents such a statistic, i.e. total upvotes minus downvotes. Ideally, we would just be able to retrieve these statistics at query time with a simple lookup (see dictionaries).
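One possible shape for this, sketched below with hypothetical table and dictionary names (and assuming VoteTypeId values 2 and 3 correspond to up and down votes), is to pre-aggregate the votes and expose the result as a dictionary for query-time lookups:

-- pre-aggregated vote statistics per post (hypothetical table)
CREATE TABLE vote_stats
(
    `PostId` UInt64,
    `UpVotes` UInt32,
    `DownVotes` UInt32
)
ENGINE = MergeTree
ORDER BY PostId

INSERT INTO vote_stats
SELECT
    PostId,
    countIf(VoteTypeId = 2) AS UpVotes,
    countIf(VoteTypeId = 3) AS DownVotes
FROM votes
GROUP BY PostId

-- dictionary over the statistics table; source settings may need adapting to your setup
CREATE DICTIONARY vote_stats_dict
(
    `PostId` UInt64,
    `UpVotes` UInt32,
    `DownVotes` UInt32
)
PRIMARY KEY PostId
SOURCE(CLICKHOUSE(TABLE 'vote_stats'))
LIFETIME(MIN 600 MAX 900)
LAYOUT(HASHED())

-- query-time lookup, no join required
SELECT Id, dictGet('vote_stats_dict', 'UpVotes', toUInt64(Id)) AS UpVotes
FROM posts
LIMIT 5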

Users and Badges

Now let's consider our Users and Badges:

Users and Badges schema

We first create the tables and load the data with the following commands:

CREATE TABLE users
(
    `Id` Int32,
    `Reputation` LowCardinality(String),
    `CreationDate` DateTime64(3, 'UTC') CODEC(Delta(8), ZSTD(1)),
    `DisplayName` String,
    `LastAccessDate` DateTime64(3, 'UTC'),
    `AboutMe` String,
    `Views` UInt32,
    `UpVotes` UInt32,
    `DownVotes` UInt32,
    `WebsiteUrl` String,
    `Location` LowCardinality(String),
    `AccountId` Int32
)
ENGINE = MergeTree
ORDER BY (Id, CreationDate)

CREATE TABLE badges
(
    `Id` UInt32,
    `UserId` Int32,
    `Name` LowCardinality(String),
    `Date` DateTime64(3, 'UTC'),
    `Class` Enum8('Gold' = 1, 'Silver' = 2, 'Bronze' = 3),
    `TagBased` Bool
)
ENGINE = MergeTree
ORDER BY UserId

INSERT INTO users SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/users.parquet')

0 rows in set. Elapsed: 26.229 sec. Processed 22.48 million rows, 1.36 GB (857.21 thousand rows/s., 51.99 MB/s.)

INSERT INTO badges SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/badges.parquet')

0 rows in set. Elapsed: 18.126 sec. Processed 51.29 million rows, 797.05 MB (2.83 million rows/s., 43.97 MB/s.)

While users may acquire badges frequently, this is unlikely to be a dataset we need to update more than daily. The relationship between badges and users is one-to-many. Could we simply denormalize badges onto users as a list of tuples? While possible, a quick check of the highest number of badges per user suggests this isn't ideal:

SELECT UserId, count() AS c FROM badges GROUP BY UserId ORDER BY c DESC LIMIT 5

┌─UserId─┬─────c─┐
│  22656 │ 19334 │
│   6309 │ 10516 │
│ 100297 │  7848 │
│ 157882 │  7574 │
│  29407 │  6512 │
└────────┴───────┘

It's probably not realistic to denormalize 19k objects onto a single row. This relationship may be best left as separate tables or with statistics added.

We may wish to denormalize statistics from badges onto users, e.g. the number of badges. We consider such an example when using dictionaries for this dataset at insert time.
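A rough sketch of that idea, using hypothetical names (the CLICKHOUSE dictionary source with an inline query is shown here and may need adapting to your setup):

-- dictionary of badge counts per user
CREATE DICTIONARY badge_count_dict
(
    `UserId` UInt64,
    `BadgeCount` UInt32
)
PRIMARY KEY UserId
SOURCE(CLICKHOUSE(QUERY 'SELECT UserId, count() AS BadgeCount FROM badges GROUP BY UserId'))
LIFETIME(MIN 600 MAX 900)
LAYOUT(HASHED())

-- hypothetical copy of users with an extra BadgeCount column, enriched at insert time
CREATE TABLE users_with_badge_count AS users

ALTER TABLE users_with_badge_count ADD COLUMN BadgeCount UInt32

INSERT INTO users_with_badge_count
SELECT
    u.*,
    dictGet('badge_count_dict', 'BadgeCount', toUInt64(u.Id)) AS BadgeCount
FROM users AS u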

Posts and PostLinks

PostLinks connect Posts which users consider to be related or duplicated. The schema and load command are shown below:

CREATE TABLE postlinks
(
    `Id` UInt64,
    `CreationDate` DateTime64(3, 'UTC'),
    `PostId` Int32,
    `RelatedPostId` Int32,
    `LinkTypeId` Enum('Linked' = 1, 'Duplicate' = 3)
)
ENGINE = MergeTree
ORDER BY (PostId, RelatedPostId)

INSERT INTO postlinks SELECT * FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/stackoverflow/parquet/postlinks.parquet')

0 rows in set. Elapsed: 4.726 sec. Processed 6.55 million rows, 129.70 MB (1.39 million rows/s., 27.44 MB/s.)

We can confirm that no posts have an excessive number of links that would prevent denormalization:

SELECT PostId, count() AS c
FROM postlinks
GROUP BY PostId
ORDER BY c DESC LIMIT 5

┌───PostId─┬───c─┐
│ 22937618 │ 125 │
│  9549780 │ 120 │
│  3737139 │ 109 │
│ 18050071 │ 103 │
│ 25889234 │  82 │
└──────────┴─────┘

Likewise, these links are not events which occur overly frequently:

SELECT
    round(avg(c)) AS avg_links_per_hr,
    round(avg(posts)) AS avg_posts_per_hr
FROM
(
    SELECT
        toStartOfHour(CreationDate) AS hr,
        count() AS c,
        uniq(PostId) AS posts
    FROM postlinks
    GROUP BY hr
)

┌─avg_links_per_hr─┬─avg_posts_per_hr─┐
│               54 │               44 │
└──────────────────┴──────────────────┘

We use this as our denormalization example below.

Simple statistic example

In most cases, denormalization involves adding a single column or statistic to a parent row. For example, we may just wish to enrich our posts with the number of duplicate posts, in which case we simply need to add a column.

CREATE TABLE posts_with_duplicate_count
(
    `Id` Int32 CODEC(Delta(4), ZSTD(1)),
    -- ...other columns
    `DuplicatePosts` UInt16
) ENGINE = MergeTree
ORDER BY (PostTypeId, toDate(CreationDate), CommentCount)

To populate this table, we use an INSERT INTO SELECT that joins our duplicate statistic with our posts.

INSERT INTO posts_with_duplicate_count SELECT
    posts.*,
    DuplicatePosts
FROM posts AS posts
LEFT JOIN
(
    SELECT PostId, countIf(LinkTypeId = 'Duplicate') AS DuplicatePosts
    FROM postlinks
    GROUP BY PostId
) AS postlinks ON posts.Id = postlinks.PostId
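The statistic can then be queried directly with no join required, e.g. to find the posts with the most duplicates (output omitted):

SELECT Id, DuplicatePosts
FROM posts_with_duplicate_count
ORDER BY DuplicatePosts DESC
LIMIT 5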

Exploiting complex types for one-to-many relationships

In order to perform denormalization, we often need to exploit complex types. If a one-to-one relationship is being denormalized, with a low number of columns, users can simply add these as columns with their original types, as shown above. However, this is often undesirable for larger objects and not possible for one-to-many relationships.

In cases of complex objects or one-to-many relationships, users can use:

  • Named Tuples - These allow a related structure to be represented as a set of columns (see the sketch after this list).
  • Array(Tuple) or Nested - An array of named tuples, also known as Nested, with each entry representing an object. Applicable to one-to-many relationships.
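For the one-to-one case, a named Tuple keeps the related object in a single column. As a minimal sketch, assuming the posts table carries an OwnerUserId column as in the Stack Overflow schema (the target table name here is hypothetical), the owning user could be denormalized onto each post as follows:

CREATE TABLE posts_with_owner
(
    `Id` Int32,
    -- the owning user represented as a single named Tuple column
    `Owner` Tuple(DisplayName String, Reputation String, Location String)
)
ENGINE = MergeTree
ORDER BY Id

INSERT INTO posts_with_owner
SELECT
    p.Id,
    (u.DisplayName, toString(u.Reputation), toString(u.Location)) AS Owner
FROM posts AS p
LEFT JOIN users AS u ON p.OwnerUserId = u.Id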

As an example of the one-to-many case, we demonstrate denormalizing PostLinks onto Posts below.

Each post can contain a number of links to other posts, as shown in the PostLinks schema earlier. As a Nested type, we might represent these linked and duplicate posts as follows:

SET flatten_nested=0

CREATE TABLE posts_with_links
(
    `Id` Int32 CODEC(Delta(4), ZSTD(1)),
    -- ...other columns
    `LinkedPosts` Nested(CreationDate DateTime64(3, 'UTC'), PostId Int32),
    `DuplicatePosts` Nested(CreationDate DateTime64(3, 'UTC'), PostId Int32)
) ENGINE = MergeTree
ORDER BY (PostTypeId, toDate(CreationDate), CommentCount)

Note the use of the setting flatten_nested=0, which we recommend. With flattening enabled (the default), each Nested column is split into separate arrays per sub-column; disabling it keeps each Nested column as a single array of named tuples - the representation we rely on here.

We can perform this denormalization using an INSERT INTO SELECT with an OUTER JOIN query:

INSERT INTO posts_with_links
SELECT
    posts.*,
    arrayMap(p -> (p.1, p.2), arrayFilter(p -> p.3 = 'Linked' AND p.2 != 0, Related)) AS LinkedPosts,
    arrayMap(p -> (p.1, p.2), arrayFilter(p -> p.3 = 'Duplicate' AND p.2 != 0, Related)) AS DuplicatePosts
FROM posts
LEFT JOIN
(
    SELECT
        PostId,
        groupArray((CreationDate, RelatedPostId, LinkTypeId)) AS Related
    FROM postlinks
    GROUP BY PostId
) AS postlinks ON posts.Id = postlinks.PostId

0 rows in set. Elapsed: 155.372 sec. Processed 66.37 million rows, 76.33 GB (427.18 thousand rows/s., 491.25 MB/s.)
Peak memory usage: 6.98 GiB.

Note the timing here. We've managed to denormalize 66m rows in around 2.5 minutes. As we'll see later, this is an operation we can schedule.

Note the use of the groupArray function to collapse the PostLinks into an array for each PostId prior to joining. This array is then filtered into two sublists, LinkedPosts and DuplicatePosts, which also exclude any empty results from the outer join.

We can select some rows to see our new denormalized structure:

SELECT LinkedPosts, DuplicatePosts
FROM posts_with_links
WHERE (length(LinkedPosts) > 2) AND (length(DuplicatePosts) > 0)
LIMIT 1
FORMAT Vertical

Row 1:
──────
LinkedPosts: [('2017-04-11 11:53:09.583',3404508),('2017-04-11 11:49:07.680',3922739),('2017-04-11 11:48:33.353',33058004)]
DuplicatePosts: [('2017-04-11 12:18:37.260',3922739),('2017-04-11 12:18:37.260',33058004)]

Orchestrating and scheduling denormalization

Batch

Exploiting denormalization requires a process in which the transformation can be performed and orchestrated.

We have shown above how ClickHouse can be used to perform this transformation once data has been loaded through an INSERT INTO SELECT. This is appropriate for periodic batch transformations.

Users have several options for orchestrating this in ClickHouse, assuming a periodic batch load process is acceptable:

  • External tooling - Utilizing tools such as dbt and Airflow to periodically schedule the transformation. The ClickHouse integration for dbt ensures this is performed atomically with a new version of the target table created and then atomically swapped with the version receiving queries (via the EXCHANGE command).
  • Refreshable Materialized Views (experimental) - Refreshable materialized views can be used to periodically schedule a query with the results sent to a target table. On query execution, the view ensures the target table is atomically updated. This provides a ClickHouse native means of scheduling this work.
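As a rough sketch of the second option, the PostLinks denormalization above could be scheduled as a refreshable materialized view (the view name and daily schedule are assumptions; exact syntax may vary by version):

CREATE MATERIALIZED VIEW posts_with_links_mv
REFRESH EVERY 1 DAY TO posts_with_links AS
SELECT
    posts.*,
    arrayMap(p -> (p.1, p.2), arrayFilter(p -> p.3 = 'Linked' AND p.2 != 0, Related)) AS LinkedPosts,
    arrayMap(p -> (p.1, p.2), arrayFilter(p -> p.3 = 'Duplicate' AND p.2 != 0, Related)) AS DuplicatePosts
FROM posts
LEFT JOIN
(
    SELECT
        PostId,
        groupArray((CreationDate, RelatedPostId, LinkTypeId)) AS Related
    FROM postlinks
    GROUP BY PostId
) AS postlinks ON posts.Id = postlinks.PostId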

Streaming

Users may alternatively wish to perform this transformation outside of ClickHouse, prior to insertion, using streaming technologies such as Apache Flink. Incremental materialized views can also be used to perform this process as data is inserted.
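As an illustration of the incremental approach, a materialized view might maintain just a per-post statistic as postlinks rows are inserted, rather than the fully denormalized row - a sketch with hypothetical table names:

-- partial counts are merged asynchronously by SummingMergeTree
CREATE TABLE post_duplicate_counts
(
    `PostId` Int32,
    `DuplicatePosts` UInt32
)
ENGINE = SummingMergeTree
ORDER BY PostId

-- aggregates each block of rows inserted into postlinks
-- (only rows inserted after the view is created are reflected)
CREATE MATERIALIZED VIEW post_duplicate_counts_mv TO post_duplicate_counts AS
SELECT PostId, countIf(LinkTypeId = 'Duplicate') AS DuplicatePosts
FROM postlinks
GROUP BY PostId

Since merges happen asynchronously, the partial counts should be summed at query time, e.g. sum(DuplicatePosts) with a GROUP BY PostId.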