Deduplication strategies (using CDC)

Updates and deletes replicated from Postgres to ClickHouse result in duplicated rows in ClickHouse due to its data storage structure and the replication process. This page covers why this happens and the strategies to use in ClickHouse to handle duplicates.

How does data get replicated?

PostgreSQL logical decoding

ClickPipes uses Postgres Logical Decoding to consume changes as they happen in Postgres. The Logical Decoding process in Postgres enables clients like ClickPipes to receive changes in a human-readable format, i.e., a series of INSERTs, UPDATEs, and DELETEs.
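For context, below is a minimal sketch of the Postgres-side prerequisites that logical decoding builds on. The publication and slot names are hypothetical; ClickPipes provisions and manages its own.

```sql
-- Logical decoding requires wal_level = logical (takes effect after a restart).
ALTER SYSTEM SET wal_level = 'logical';

-- A publication and a logical replication slot, similar to what a CDC client
-- consumes from. The names here are illustrative only.
CREATE PUBLICATION cdc_publication FOR TABLE users;
SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput');
```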

ReplacingMergeTree

ClickPipes maps Postgres tables to ClickHouse using the ReplacingMergeTree engine. ClickHouse performs best with append-only workloads, and frequent UPDATEs are not recommended. This is where ReplacingMergeTree is particularly powerful.

With ReplacingMergeTree, updates are modeled as inserts with a newer version (_peerdb_version) of the row, while deletes are inserts with a newer version and _peerdb_is_deleted marked as true. The ReplacingMergeTree engine deduplicates/merges data in the background, and retains the latest version of the row for a given primary key (id), enabling efficient handling of UPDATEs and DELETEs as versioned inserts.

Below is an example of a CREATE TABLE statement executed by ClickPipes to create the table in ClickHouse.
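The exact statement depends on the source table; a representative sketch for the users table from the example below might look like this (the column types are assumptions, while _peerdb_version and _peerdb_is_deleted are the metadata columns described above):

```sql
CREATE TABLE users
(
    id Int64,
    name String,
    -- metadata columns maintained by ClickPipes
    _peerdb_is_deleted Int8,
    _peerdb_version Int64
)
ENGINE = ReplacingMergeTree(_peerdb_version)
ORDER BY id;
```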

Illustrative example

The illustration below walks through a basic example of synchronization of a table users between PostgreSQL and ClickHouse using ClickPipes.

Step 1 shows the initial snapshot of the 2 rows in PostgreSQL and ClickPipes performing the initial load of those 2 rows to ClickHouse. As you can observe, both rows are copied as-is to ClickHouse.

Step 2 shows three operations on the users table: inserting a new row, updating an existing row, and deleting another row.

Step 3 shows how ClickPipes replicates the INSERT, UPDATE, and DELETE operations to ClickHouse as versioned inserts. The UPDATE appears as a new version of the row with ID 2, while the DELETE appears as a new version of ID 1 with _peerdb_is_deleted set to true. Because of this, ClickHouse has three additional rows compared to PostgreSQL.
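To make the row counts concrete, here is a hypothetical snapshot of the ClickHouse table after Step 3 (all values are invented for illustration):

```sql
-- id │ name  │ _peerdb_version │ _peerdb_is_deleted
--  1  │ alice │               1 │ 0   ← initial load
--  2  │ bob   │               1 │ 0   ← initial load
--  3  │ carol │               1 │ 0   ← INSERT
--  2  │ bobby │               2 │ 0   ← UPDATE: newer version of ID 2
--  1  │ alice │               2 │ 1   ← DELETE: tombstone for ID 1
SELECT * FROM users ORDER BY id, _peerdb_version;
```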

As a result, running a simple query like SELECT count(*) FROM users; may produce different results in ClickHouse and PostgreSQL. According to the ClickHouse merge documentation, outdated row versions are eventually discarded during the merge process. However, the timing of this merge is unpredictable, meaning queries in ClickHouse may return inconsistent results until it occurs.

How can we ensure identical query results in both ClickHouse and PostgreSQL?

Deduplicate using FINAL Keyword

The recommended way to deduplicate data in ClickHouse queries is to use the FINAL modifier. This ensures only the deduplicated rows are returned.

Let's look at how to apply it to the following queries.

Take note of the WHERE clause in the following queries, used to filter out deleted rows.

  • Simple count query: Count the number of posts.

This is the simplest query you can run to check that synchronization is working correctly. The two queries should return the same count.
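A sketch using the posts table, with the Postgres query included for comparison:

```sql
-- PostgreSQL
SELECT count(*) FROM posts;

-- ClickHouse: FINAL deduplicates versions, WHERE drops deleted rows
SELECT count(*)
FROM posts FINAL
WHERE _peerdb_is_deleted = 0;
```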

  • Simple aggregation with JOIN: Top 10 users who have accumulated the most views.

An example of an aggregation with a JOIN across two tables. Having duplicates here would greatly affect the result of the sum function.
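A sketch of such a query; the owner_user_id, view_count, and display_name columns are assumptions made for illustration:

```sql
SELECT
    users.display_name,
    sum(posts.view_count) AS total_views
FROM posts FINAL
JOIN users FINAL ON posts.owner_user_id = users.id
WHERE posts._peerdb_is_deleted = 0
  AND users._peerdb_is_deleted = 0
GROUP BY users.display_name
ORDER BY total_views DESC
LIMIT 10;
```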

FINAL setting

Rather than adding the FINAL modifier to each table name in the query, you can use the FINAL setting to apply it automatically to all tables in the query.

This setting can be applied either per query or for an entire session.
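For example, with the final setting (available in recent ClickHouse versions):

```sql
-- Per query
SELECT count(*)
FROM posts
WHERE _peerdb_is_deleted = 0
SETTINGS final = 1;

-- Per session
SET final = 1;
SELECT count(*) FROM posts WHERE _peerdb_is_deleted = 0;
```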

ROW policy

An easy way to hide the redundant _peerdb_is_deleted = 0 filter is to use a row policy. Below is an example that creates a row policy to exclude the deleted rows from all queries on the table votes.
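A sketch of such a policy (the policy name is illustrative):

```sql
CREATE ROW POLICY cdc_policy ON votes
FOR SELECT USING _peerdb_is_deleted = 0 TO ALL;
```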

Row policies are applied to a list of users and roles. In this example, it is applied to all users and roles. This can be adjusted to only specific users or roles.

Query like with Postgres

Migrating an analytical dataset from PostgreSQL to ClickHouse often requires modifying application queries to account for differences in data handling and query execution.

This section will explore techniques for deduplicating data while keeping the original queries unchanged.

Views

Views are a great way to hide the FINAL keyword from the query, as they do not store any data and simply perform a read from another table on each access.

Below is an example of creating views for each table of our database in ClickHouse with the FINAL keyword and filter for the deleted rows.
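A sketch for the tables used in this guide (the view names are illustrative):

```sql
CREATE VIEW posts_view AS
SELECT * FROM posts FINAL WHERE _peerdb_is_deleted = 0;

CREATE VIEW users_view AS
SELECT * FROM users FINAL WHERE _peerdb_is_deleted = 0;

CREATE VIEW votes_view AS
SELECT * FROM votes FINAL WHERE _peerdb_is_deleted = 0;
```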

Then, we can query the views using the same query we would use in PostgreSQL.
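For example (view_count is an assumed column):

```sql
-- Reads exactly like the PostgreSQL query, no FINAL or filter needed
SELECT id, view_count
FROM posts_view
ORDER BY view_count DESC
LIMIT 10;
```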

Refreshable Materialized view

Another approach is to use a Refreshable Materialized View, which enables you to schedule query execution for deduplicating rows and storing the results in a destination table. With each scheduled refresh, the destination table is replaced with the latest query results.

The key advantage of this method is that the query using the FINAL keyword runs only once during the refresh, eliminating the need for subsequent queries on the destination table to use FINAL.

However, a drawback is that the data in the destination table is only as up-to-date as the most recent refresh. That said, for many use cases, refresh intervals ranging from several minutes to a few hours may be sufficient.
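A sketch, assuming an hourly refresh into the deduplicated_posts table referenced below (the interval and view name are illustrative; depending on your ClickHouse version, refreshable materialized views may need to be enabled explicitly):

```sql
-- Destination table holding the deduplicated data
CREATE TABLE deduplicated_posts AS posts;

-- On each scheduled refresh, the destination table is replaced
-- with the latest deduplicated query result
CREATE MATERIALIZED VIEW deduplicated_posts_mv
REFRESH EVERY 1 HOUR TO deduplicated_posts AS
SELECT * FROM posts FINAL
WHERE _peerdb_is_deleted = 0;
```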

Then, you can query the table deduplicated_posts normally.
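For example, the earlier count query no longer needs FINAL or the deleted-row filter:

```sql
SELECT count(*) FROM deduplicated_posts;
```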