Integrating dbt and ClickHouse

ClickHouse Supported

dbt (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles materializing these select statements into objects in the database in the form of tables and views - performing the T of Extract, Load, and Transform (ELT). Each such object is a model defined by a SELECT statement.

Within dbt, these models can be cross-referenced and layered to allow the construction of higher-level concepts. The boilerplate SQL required to connect models is automatically generated. Furthermore, dbt identifies dependencies between models and ensures they are created in the appropriate order using a directed acyclic graph (DAG).
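To make the dependency ordering concrete, here is a minimal sketch of how a build order can be derived from a DAG. It uses Python's standard graphlib module rather than anything from dbt itself, and the model names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each key depends on the models in its value set.
dependencies = {
    "stg_actors": set(),
    "stg_roles": set(),
    "actor_summary": {"stg_actors", "stg_roles"},
    "top_actors": {"actor_summary"},
}


def build_order(deps):
    """Return one valid order in which every model follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = build_order(dependencies)
print(order)
```

Any order returned places actor_summary after both staging models and before top_actors, which is the guarantee that matters when models reference each other.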

dbt is compatible with ClickHouse through a ClickHouse-supported plugin. We describe the process for connecting ClickHouse with a simple example based on a publicly available IMDB dataset. We additionally highlight some of the limitations of the current connector.

Concepts
Setup of dbt and the ClickHouse plugin
Connecting to ClickHouse
Creating a Simple View Materialization
Creating a Table Materialization
Creating an Incremental Materialization
Creating a Snapshot
Using Seeds
Limitations
Fivetran

Concepts

dbt introduces the concept of a model. This is defined as a SQL statement, potentially joining many tables. A model can be "materialized" in a number of ways. A materialization represents a build strategy for the model's select query. The code behind a materialization is boilerplate SQL that wraps your SELECT query in a statement that creates a new relation or updates an existing one.

dbt provides 4 types of materialization:

view (default): The model is built as a view in the database.
table: The model is built as a table in the database.
ephemeral: The model is not directly built in the database but is instead pulled into dependent models as common table expressions.
incremental: The model is initially materialized as a table, and in subsequent runs, dbt inserts new rows and updates changed rows in the table.

Additional syntax and clauses define how these models should be updated if their underlying data changes. dbt generally recommends starting with the view materialization until performance becomes a concern. The table materialization provides a query time performance improvement by capturing the results of the model's query as a table at the expense of increased storage. The incremental approach builds on this further to allow subsequent updates to the underlying data to be captured in the target table.
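The idea of a materialization wrapping a SELECT can be sketched in a few lines. This is a simplified illustration only: the function name wrap_select and the generated statements are assumptions for this example and are not the SQL the plugin actually emits. The incremental case is omitted here.

```python
def wrap_select(materialization: str, name: str, select_sql: str) -> str:
    """Wrap a model's SELECT in illustrative SQL for the chosen materialization."""
    if materialization == "view":
        return f"CREATE VIEW {name} AS {select_sql}"
    if materialization == "table":
        # Engine and ordering are placeholders chosen for this sketch.
        return f"CREATE TABLE {name} ENGINE = MergeTree ORDER BY tuple() AS {select_sql}"
    if materialization == "ephemeral":
        # Nothing is created; dependent models inline the query as a CTE.
        return f"WITH {name} AS ({select_sql})"
    raise ValueError(f"unsupported materialization: {materialization}")


model_sql = "SELECT id, first_name FROM imdb.actors"
for kind in ("view", "table", "ephemeral"):
    print(wrap_select(kind, "actor_names", model_sql))
```

The model author only ever writes the SELECT; switching the materialization changes the wrapper, not the query.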

The current plugin for ClickHouse supports the view, table, ephemeral and incremental materializations. The plugin also supports dbt snapshots and seeds, which we explore in this guide.

For the following guides, we assume you have a ClickHouse instance available.

Setup of dbt and the ClickHouse plugin

dbt

We assume the use of the dbt CLI for the following examples. Users may also wish to consider dbt Cloud, which offers a web-based Integrated Development Environment (IDE) allowing users to edit and run projects.

dbt offers a number of options for CLI installation. Follow the instructions described here. At this stage install dbt-core only. We recommend the use of pip.

Important: The following is tested under Python 3.9.

ClickHouse plugin

Install the dbt ClickHouse plugin, dbt-clickhouse, for example with pip install dbt-clickhouse.

Prepare ClickHouse

dbt excels when modeling highly relational data. For the purposes of example, we provide a small IMDB dataset with the following relational schema. This dataset originates from the relational dataset repository. This is trivial relative to common schemas used with dbt but represents a manageable sample:

We use a subset of these tables as shown.

Create the following tables:

Note: The roles table has a column created_at, which defaults to a value of now(). We use this later to identify incremental updates to our models - see Incremental Models.

We use the s3 function to read the source data from public endpoints to insert data. Run the following commands to populate the tables:

The execution time of these commands may vary depending on your bandwidth, but each should only take a few seconds to complete. To confirm the data was loaded successfully, execute the following query, which computes a summary of each actor ordered by the most movie appearances:

The response should look like:

In the later guides, we will convert this query into a model - materializing it in ClickHouse as a dbt view and table.

Connecting to ClickHouse

Create a dbt project. In this case we name this after our imdb source. When prompted, select clickhouse as the database source.
cd into your project folder:
At this point, you will need the text editor of your choice. In the examples below, we use the popular VS Code. Opening the IMDB directory, you should see a collection of yml and sql files:
Update your dbt_project.yml file to specify our first model - actor_summary and set profile to clickhouse_imdb.
We next need to provide dbt with the connection details for our ClickHouse instance. Add the following to your ~/.dbt/profiles.yml.

Note the need to modify the user and password. There are additional available settings documented here.
From the IMDB directory, execute the dbt debug command to confirm whether dbt is able to connect to ClickHouse.

Confirm the response includes Connection test: [OK connection ok] indicating a successful connection.

Creating a Simple View Materialization

When using the view materialization, a model is rebuilt as a view on each run, via a CREATE VIEW AS statement in ClickHouse. This doesn't require any additional storage of data but will be slower to query than table materializations.

From the imdb folder, delete the directory models/example:
Create a new directory actors within the models folder. Here we create files that each represent an actor model:
Create the files schema.yml and actor_summary.sql in the models/actors folder.

The file schema.yml defines our tables. These will subsequently be available for use in macros. Edit models/actors/schema.yml to contain this content:

The file actor_summary.sql defines our actual model. Note that in the config function we also request the model be materialized as a view in ClickHouse. Our tables are referenced from the schema.yml file via the function source, e.g. source('imdb', 'movies') refers to the movies table in the imdb database. Edit models/actors/actor_summary.sql to contain this content:

Note how we include the column updated_at in our final actor_summary. We use this later for incremental materializations.
From the imdb directory execute the command dbt run.
dbt will represent the model as a view in ClickHouse as requested. We can now query this view directly. This view will have been created in the imdb_dbt database - this is determined by the schema parameter in the file ~/.dbt/profiles.yml under the clickhouse_imdb profile.

Querying this view, we can replicate the results of our earlier query with a simpler syntax:

Creating a Table Materialization

In the previous example, our model was materialized as a view. While this might offer sufficient performance for some queries, more complex SELECTs or frequently executed queries may be better materialized as a table. This materialization is useful for models that will be queried by BI tools, to ensure users have a faster experience. The query results are stored as a new table, with the associated storage overheads - effectively, an INSERT INTO ... SELECT is executed. Note that this table will be reconstructed on each run, i.e. it is not incremental. Large result sets may therefore result in long execution times - see Limitations.

Modify the file actor_summary.sql such that the materialized parameter is set to table. Notice how ORDER BY is defined, and notice we use the MergeTree table engine:
From the imdb directory execute the command dbt run. This run may take a little longer - around 10s on most machines.
Confirm the creation of the table imdb_dbt.actor_summary:

You should see the table with the appropriate data types:
Confirm the results from this table are consistent with previous responses. Notice an appreciable improvement in the response time now that the model is a table:

Feel free to issue other queries against this model. For example, which actors have the highest ranking movies with more than 5 appearances?

Creating an Incremental Materialization

The previous example created a table to materialize the model. This table will be reconstructed for each dbt execution. This may be infeasible and extremely costly for larger result sets or complex transformations. To address this challenge and reduce the build time, dbt offers incremental materializations. These allow dbt to insert or update only the records that have changed since the last execution, making them appropriate for event-style data. Under the hood, a temporary table is created with all the updated records, and then all the untouched records as well as the updated records are inserted into a new target table. This results in similar limitations for large result sets as for the table model.

To overcome these limitations for large sets, the plugin supports 'inserts_only' mode, where all the updates are inserted into the target table without creating a temporary table (more about it below).

To illustrate this example, we will add the actor "Clicky McClickHouse", who will appear in an incredible 910 movies - ensuring he has appeared in more films than even Mel Blanc.

First, we modify our model to be of type incremental. This addition requires:
1. unique_key - To ensure the plugin can uniquely identify rows, we must provide a unique_key - in this case, the id field from our query will suffice. This ensures we will have no row duplicates in our materialized table. For more details on uniqueness constraints, see here.
2. Incremental filter - We also need to tell dbt how it should identify which rows have changed on an incremental run. This is achieved by providing a delta expression. Typically this involves a timestamp for event data; hence our updated_at timestamp field. This column, which defaults to the value of now() when rows are inserted, allows new roles to be identified. Additionally, we need to identify the alternative case where new actors are added. Using the {{this}} variable, to denote the existing materialized table, this gives us the expression where id > (select max(id) from {{ this }}) or updated_at > (select max(updated_at) from {{this}}). We embed this inside the {% if is_incremental() %} condition, ensuring it is only used on incremental runs and not when the table is first constructed. For more details on filtering rows for incremental models, see this discussion in the dbt docs.
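To illustrate which rows the delta expression above selects, here is a small sketch that applies the same two conditions to in-memory rows. The function name and sample data are hypothetical; treating an empty existing table as a first run is a simplification of is_incremental() made for this example.

```python
from datetime import datetime


def incremental_delta(source_rows, existing_rows):
    """Return the rows an incremental run would pick up under the filter above."""
    if not existing_rows:
        # First build: the filter is skipped and every row is selected.
        return list(source_rows)
    max_id = max(row["id"] for row in existing_rows)
    max_updated_at = max(row["updated_at"] for row in existing_rows)
    return [
        row for row in source_rows
        if row["id"] > max_id or row["updated_at"] > max_updated_at
    ]


t0 = datetime(2024, 1, 1)
t1 = datetime(2024, 1, 2)
existing = [{"id": 1, "updated_at": t0}, {"id": 2, "updated_at": t0}]
source = [
    {"id": 1, "updated_at": t0},  # unchanged actor
    {"id": 2, "updated_at": t1},  # existing actor with a newer role
    {"id": 3, "updated_at": t0},  # new actor
]
delta = incremental_delta(source, existing)
print([row["id"] for row in delta])
```

The unchanged actor is skipped, while the actor with a newer role and the newly added actor are both selected.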
Update the file actor_summary.sql as follows:

Note that our model will only respond to updates and additions to the roles and actors tables. To respond to all tables, users would be encouraged to split this model into multiple sub-models - each with their own incremental criteria. These models can in turn be referenced and connected. For further details on cross-referencing models see here.
Execute a dbt run and confirm the results of the resulting table:
We will now add data to our model to illustrate an incremental update. Add our actor "Clicky McClickHouse" to the actors table:
Let's have "Clicky" star in 910 random movies:
Confirm he is indeed now the actor with the most appearances by querying the underlying source table and bypassing any dbt models:
Execute a dbt run and confirm our model has been updated and matches the above results:

Internals

We can identify the statements executed to achieve the above incremental update by querying ClickHouse's query log.

Adjust the above query to the period of execution. We leave result inspection to the user but highlight the general strategy used by the plugin to perform incremental updates:

The plugin creates a temporary table actor_summary__dbt_tmp. Rows that have changed are streamed into this table.
A new table, actor_summary_new, is created. The rows from the old table are, in turn, streamed from the old to new, with a check to make sure row ids do not exist in the temporary table. This effectively handles updates and duplicates.
The results from the temporary table are streamed into the new actor_summary table:
Finally, the new table is exchanged atomically with the old version via an EXCHANGE TABLES statement. The old and temporary tables are in turn dropped.
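The four steps above can be sketched with plain Python lists standing in for tables. This is an illustration of the strategy only, with a hypothetical function name and sample rows, and is not the plugin's implementation.

```python
def default_incremental(target, changed, key="id"):
    """Rebuild the target from untouched rows plus changed rows, then swap."""
    tmp = list(changed)                        # step 1: temporary table
    changed_keys = {row[key] for row in tmp}
    new = [row for row in target
           if row[key] not in changed_keys]    # step 2: copy untouched rows
    new.extend(tmp)                            # step 3: add the changed rows
    return new                                 # step 4: new table replaces old


target = [{"id": 1, "num_movies": 10}, {"id": 2, "num_movies": 5}]
changed = [{"id": 2, "num_movies": 7}, {"id": 3, "num_movies": 1}]
target = default_incremental(target, changed)
print(sorted(target, key=lambda row: row["id"]))
```

Note that every untouched row is copied on each run, which is why the cost grows with the size of the whole table rather than with the size of the change.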

This is visualized below:

This strategy may encounter challenges on very large models. For further details see Limitations.

Append Strategy (inserts-only mode)

To overcome the limitations of large datasets in incremental models, the plugin uses the dbt configuration parameter incremental_strategy. This can be set to the value append. When set, updated rows are inserted directly into the target table (i.e. imdb_dbt.actor_summary) and no temporary table is created. Note: Append-only mode requires your data to be immutable or for duplicates to be acceptable. If you want an incremental table model that supports altered rows, don't use this mode!
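The duplicate caveat is easy to see in a minimal sketch, again using lists as stand-in tables. The function name and rows are hypothetical.

```python
def append_incremental(target, changed):
    """Insert changed rows directly; existing rows are never touched."""
    return target + list(changed)


target = [{"id": 1, "num_movies": 10}, {"id": 2, "num_movies": 5}]
changed = [{"id": 2, "num_movies": 7}, {"id": 3, "num_movies": 1}]
target = append_incremental(target, changed)

rows_for_id_2 = [row for row in target if row["id"] == 2]
print(len(target), rows_for_id_2)
```

Because id 2 was altered rather than new, the table now holds both the old and the new version of that row.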

To illustrate this mode, we will add another new actor and re-execute dbt run with incremental_strategy='append'.

Configure append only mode in actor_summary.sql:
Let's add another famous actor - Danny DeBito.
Let's star Danny in 920 random movies.
Execute a dbt run and confirm that Danny was added to the actor_summary table.

Note how much faster that incremental run was compared to the insertion of "Clicky".

Checking again the query_log table reveals the differences between the 2 incremental runs:

In this run, only the new rows are added, straight to the imdb_dbt.actor_summary table, and there is no table creation involved.

Delete+Insert mode (Experimental)

Historically ClickHouse has had only limited support for updates and deletes, in the form of asynchronous Mutations. These can be extremely IO-intensive and should generally be avoided.

ClickHouse 22.8 introduced lightweight deletes and ClickHouse 25.7 introduced lightweight updates. With the introduction of these features, modifications from single update queries, even when being materialized asynchronously, will occur instantly from the user's perspective.

This mode can be configured for a model via the incremental_strategy parameter i.e.

This strategy operates directly on the target model's table, so if there is an issue during the operation, the data in the incremental model is likely to be in an invalid state - there is no atomic update.

In summary, this approach works as follows:

The plugin creates a temporary table actor_summary__dbt_tmp. Rows that have changed are streamed into this table.
A DELETE is issued against the current actor_summary table. Rows are deleted by id from actor_summary__dbt_tmp.
The rows from actor_summary__dbt_tmp are inserted into actor_summary using an INSERT INTO actor_summary SELECT * FROM actor_summary__dbt_tmp.
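The same steps can be sketched in Python, this time mutating the target list in place to mirror the fact that the strategy operates directly on the target table. The function name and rows are hypothetical, and the sketch does not model lightweight deletes themselves.

```python
def delete_insert(target, changed, key="id"):
    """Delete rows matching the changed keys in place, then insert the new rows."""
    tmp = list(changed)                        # temporary table of changed rows
    changed_keys = {row[key] for row in tmp}
    # DELETE: remove matching rows from the target itself (no copy, no swap).
    target[:] = [row for row in target if row[key] not in changed_keys]
    # A failure at this point would leave the target missing rows.
    target.extend(tmp)                         # INSERT the changed rows


target = [{"id": 1, "num_movies": 10}, {"id": 2, "num_movies": 5}]
changed = [{"id": 2, "num_movies": 7}, {"id": 3, "num_movies": 1}]
delete_insert(target, changed)
print(target)
```

Only the changed rows are rewritten, which is the source of the speed-up, and the gap between the delete and the insert is the source of the non-atomicity noted above.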

This process is shown below:

insert_overwrite mode (Experimental)

This strategy performs the following steps:

Create a staging (temporary) table with the same structure as the incremental model relation: CREATE TABLE {staging} AS {target}.
Insert only new records (produced by SELECT) into the staging table.
Replace only new partitions (present in the staging table) into the target table.

This approach has the following advantages:

It is faster than the default strategy because it doesn't copy the entire table.
It is safer than other strategies because it doesn't modify the original table until the INSERT operation completes successfully: in case of intermediate failure, the original table is not modified.
It implements the "partition immutability" data engineering best practice, which simplifies incremental and parallel data processing, rollbacks, etc.
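A minimal sketch of partition replacement follows, with a dictionary mapping partition keys to row lists standing in for a partitioned table. The function name, the month-based partition key and the rows are assumptions made for this example.

```python
from collections import defaultdict


def insert_overwrite(target, new_rows, partition_of):
    """Replace only the partitions that appear in the newly selected rows."""
    staging = defaultdict(list)                # staging table, grouped by partition
    for row in new_rows:
        staging[partition_of(row)].append(row)
    result = dict(target)                      # untouched partitions carry over
    for partition, rows in staging.items():
        result[partition] = rows               # whole partition is swapped in
    return result


target = {
    "2024-01": [{"id": 1, "month": "2024-01"}],
    "2024-02": [{"id": 2, "month": "2024-02"}],
}
new_rows = [{"id": 3, "month": "2024-02"}, {"id": 4, "month": "2024-02"}]
updated = insert_overwrite(target, new_rows, lambda row: row["month"])
print(updated)
```

In this sketch the original target is left unmodified until the result is assigned, and a replaced partition contains only the rows produced by the new SELECT, so the query must return every row that should remain in that partition.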

Creating a Snapshot

dbt snapshots allow a record to be made of changes to a mutable model over time. This in turn allows point-in-time queries on models, where analysts can "look back in time" at the previous state of a model. This is achieved using type-2 Slowly Changing Dimensions where from and to date columns record when a row was valid. This functionality is supported by the ClickHouse plugin and is demonstrated below.

This example assumes you have completed Creating an Incremental Materialization. Make sure your actor_summary.sql doesn't use the append (inserts-only) strategy. Your models/actors/actor_summary.sql should look like this:

Create a file actor_summary.sql in the snapshots directory.
Update the contents of the actor_summary.sql file with the following content:

A few observations regarding this content:

The select query defines the results you wish to snapshot over time. The function ref is used to reference our previously created actor_summary model.
We require a timestamp column to indicate record changes. Our updated_at column (see Creating an Incremental Materialization) can be used here. The parameter strategy indicates our use of a timestamp to denote updates, with the parameter updated_at specifying the column to use. If this is not present in your model, you can alternatively use the check strategy. This is significantly less efficient and requires the user to specify a list of columns to compare. dbt compares the current and historical values of these columns, recording any changes (or doing nothing if identical).

Run the command dbt snapshot.

Note how a table actor_summary_snapshot has been created in the snapshots db (determined by the target_schema parameter).

Sampling this data you will see how dbt has included the columns dbt_valid_from and dbt_valid_to. The latter has values set to null. Subsequent runs will update this.
Make our favorite actor Clicky McClickHouse appear in another 10 films.
Re-run the dbt run command from the imdb directory. This will update the incremental model. Once this is complete, run dbt snapshot to capture the changes.
If we now query our snapshot, notice we have 2 rows for Clicky McClickHouse. Our previous entry now has a dbt_valid_to value. Our new value is recorded with the same value in the dbt_valid_from column, and a dbt_valid_to value of null. If we did have new rows, these would also be appended to the snapshot.
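The bookkeeping behind dbt_valid_from and dbt_valid_to under the timestamp strategy can be sketched as follows. This is a simplified model with a hypothetical function name and sample values; it is not dbt's implementation and ignores deleted rows.

```python
from datetime import datetime


def snapshot(history, current_rows, key="id"):
    """Close the open row for each changed record and append its new version."""
    open_rows = {row[key]: row for row in history if row["dbt_valid_to"] is None}
    for row in current_rows:
        previous = open_rows.get(row[key])
        if previous is not None and row["updated_at"] <= previous["updated_at"]:
            continue                           # unchanged since the last snapshot
        if previous is not None:
            previous["dbt_valid_to"] = row["updated_at"]
        history.append(
            {**row, "dbt_valid_from": row["updated_at"], "dbt_valid_to": None}
        )
    return history


t1, t2 = datetime(2024, 1, 1), datetime(2024, 1, 2)
history = snapshot([], [{"id": 1, "num_movies": 910, "updated_at": t1}])
history = snapshot(history, [{"id": 1, "num_movies": 920, "updated_at": t2}])
for row in history:
    print(row["num_movies"], row["dbt_valid_from"], row["dbt_valid_to"])
```

After the second run there are two rows for the same id: the earlier one is closed at the moment the later one becomes valid, and the later one remains open.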

For further details on dbt snapshots see here.

Using Seeds

dbt provides the ability to load data from CSV files. This capability is not suited to loading large exports of a database and is designed for small files typically used for code tables and dictionaries, e.g. mapping country codes to country names. For a simple example, we generate and then upload a list of genre codes using the seed functionality.

We generate a list of genre codes from our existing dataset. From the dbt directory, use the clickhouse-client to create a file seeds/genre_codes.csv:
Execute the dbt seed command. This will create a new table genre_codes in our database imdb_dbt (as defined by our schema configuration) with the rows from our csv file.
Confirm these have been loaded:

Limitations

The current ClickHouse plugin for dbt has several limitations users should be aware of:

The plugin currently materializes models as tables using an INSERT INTO ... SELECT. This effectively means data duplication. Very large datasets (PB) can result in extremely long run times, making some models unviable. Aim to minimize the number of rows returned by any query, utilizing GROUP BY where possible. Prefer models which summarize data over those which simply perform a transform whilst maintaining row counts of the source.
To use Distributed tables to represent a model, users must create the underlying replicated tables on each node manually. The Distributed table can, in turn, be created on top of these. The plugin does not manage cluster creation.
When dbt creates a relation (table/view) in a database, it usually creates it as: {{ database }}.{{ schema }}.{{ table/view id }}. ClickHouse has no notion of schemas. The plugin therefore uses {{schema}}.{{ table/view id }}, where schema is the ClickHouse database.

Further Information

The previous guides only touch the surface of dbt functionality. We recommend reading the excellent dbt documentation.

Additional configuration for the plugin is described here.

Fivetran

The dbt-clickhouse connector is also available for use in Fivetran transformations, allowing seamless integration and transformation capabilities directly within the Fivetran platform using dbt.
