Features and Configurations
Date: 2023-10-01
In this section, we provide documentation about some of the features available for dbt with ClickHouse.
- Profile.yml configurations
- General information about features
- Features
- Materialization: view
- Materialization: table
- Materialization: incremental
- Materialization: materialized_view (Experimental)
- Materialization: dictionary (experimental)
- Materialization: distributed_table (experimental)
- Materialization: distributed_incremental (experimental)
- Snapshot
- Contracts and Constraints
- Additional ClickHouse Macros
Profile.yml configurations
To connect to ClickHouse from dbt, you'll need to add a profile to your
profiles.yml file. A ClickHouse profile conforms to the following syntax:
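The sketch below illustrates the general shape of such a profile. The profile name, target name, cluster name, and all values are placeholders, and only a handful of commonly used options are shown; consult the adapter's own reference for the full option list and current defaults.

```yaml
# profiles.yml (illustrative sketch; names and values are placeholders)
my_clickhouse_profile:
  target: dev
  outputs:
    dev:
      type: clickhouse
      schema: analytics        # ClickHouse database used for dbt models
      host: localhost
      port: 8123               # HTTP port; adjust for HTTPS or native connections
      user: default
      password: ""
      secure: false            # set to true to use TLS/HTTPS
      # cluster: my_cluster    # optional; hypothetical cluster name, enables ON CLUSTER DDL
      threads: 1
```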
Schema vs Database
The dbt model relation identifier database.schema.table is not compatible with ClickHouse because ClickHouse does not support a schema. So we use a simplified approach, schema.table, where schema is the ClickHouse database. Using the default database is not recommended.
SET Statement Warning
In many environments, using the SET statement to persist a ClickHouse setting across all dbt queries is not reliable and can cause unexpected failures. This is particularly true when using HTTP connections through a load balancer that distributes queries across multiple nodes (such as ClickHouse Cloud), although in some circumstances this can also happen with native ClickHouse connections. Accordingly, we recommend configuring any required ClickHouse settings in the "custom_settings" property of the dbt profile as a best practice, instead of relying on a pre-hook "SET" statement as has been occasionally suggested.
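As a rough illustration of the recommendation above, a profile might carry its settings like this. The profile name is a placeholder and join_use_nulls is used only as an example of a ClickHouse setting, not as a recommended value.

```yaml
# profiles.yml fragment (illustrative sketch)
my_clickhouse_profile:
  target: dev
  outputs:
    dev:
      type: clickhouse
      host: localhost
      custom_settings:
        # Settings listed here are applied through the connection
        # rather than through a pre-hook SET statement.
        join_use_nulls: 1
```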
Setting quote_columns
To prevent a warning, make sure to explicitly set a value for quote_columns in your dbt_project.yml. See the doc on quote_columns for more information.
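A minimal sketch of such a setting, assuming it should apply to every seed in the project (the value itself depends on whether your seed column names need quoting):

```yaml
# dbt_project.yml fragment
seeds:
  +quote_columns: false   # or true, if seed column names must be quoted
```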
About the ClickHouse Cluster
The cluster setting in the profile enables dbt-clickhouse to run against a ClickHouse cluster. If cluster is set in the profile, all models will be created with the ON CLUSTER clause by default, except for those using a Replicated engine. This includes:
- Database creation
- View materializations
- Table and incremental materializations
- Distributed materializations
Replicated engines will not include the ON CLUSTER clause, as they are designed to manage replication internally.
To opt out of cluster-based creation for a specific model, add the disable_on_cluster config:
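For instance, assuming a model named my_model in a project named my_project (both hypothetical), the opt-out could be expressed in dbt_project.yml roughly as follows. The same key can also be passed in the model's own config block.

```yaml
# dbt_project.yml fragment (project and model names are hypothetical)
models:
  my_project:
    my_model:
      +materialized: table
      +engine: MergeTree()
      +disable_on_cluster: true   # create this model without ON CLUSTER
```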
Table and incremental materializations with a non-replicated engine will not be affected by the cluster setting (the model would be created on the connected node only).
Compatibility
If a model has been created without a cluster setting, dbt-clickhouse will detect the situation and run all DDL/DML without the ON CLUSTER clause for this model.
General information about features
General table configurations
| Option | Description | Default if any |
|---|---|---|
| engine | The table engine (type of table) to use when creating tables | MergeTree() |
| order_by | A tuple of column names or arbitrary expressions. This allows you to create a small sparse index that helps find data faster. | tuple() |
| partition_by | A partition is a logical combination of records in a table by a specified criterion. The partition key can be any expression from the table columns. | |
| sharding_key | The sharding key determines the destination server when inserting into a distributed engine table. The sharding key can be random or the output of a hash function. | rand() |
| primary_key | Like order_by, a ClickHouse primary key expression. If not specified, ClickHouse will use the order by expression as the primary key. | |
| unique_key | A tuple of column names that uniquely identify rows. Used with incremental models for updates. | |
| settings | A map/dictionary of "TABLE" settings to be used in DDL statements like 'CREATE TABLE' with this model | |
| query_settings | A map/dictionary of ClickHouse user level settings to be used with INSERT or DELETE statements in conjunction with this model | |
| ttl | A TTL expression to be used with the table. The TTL expression is a string that can be used to specify the TTL for the table. | |
| indexes | A list of indexes to create, available only for table materialization. For examples look at (#397) | |
| sql_security | Allows you to specify which ClickHouse user to use when executing the view's underlying query. SQL SECURITY has two legal values: definer and invoker. | |
| definer | If sql_security was set to definer, you have to specify any existing user or CURRENT_USER in the definer clause. | |
Supported table engines
Experimental supported table engines
| Type | Details |
|---|---|
| Distributed Table | https://clickhouse.com/docs/en/engines/table-engines/special/distributed |
| Dictionary | https://clickhouse.com/docs/en/engines/table-engines/special/dictionary |
If you encounter issues connecting to ClickHouse from dbt with one of the above engines, please report an issue here.
A note on model settings
ClickHouse has several types/levels of "settings". In the model configuration above, two types of these are configurable. settings means the SETTINGS clause used in CREATE TABLE/VIEW types of DDL statements, so these are generally settings that are specific to the ClickHouse table engine in use. The new query_settings is used to add a SETTINGS clause to the INSERT and DELETE queries used for model materialization (including incremental materializations).
There are hundreds of ClickHouse settings, and it's not always clear which is a "table" setting and which is a "user" setting (although the latter are generally available in the system.settings table). In general the defaults are recommended, and any use of these properties should be carefully researched and tested.
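To make the distinction concrete, here is a hedged sketch that sets one value of each kind on a single model. The project name, model name, and column names are hypothetical, and the two settings shown are only examples of a table-level and a user-level setting, not recommended values.

```yaml
# dbt_project.yml fragment (names and values are illustrative)
models:
  my_project:
    events:
      +materialized: table
      +engine: MergeTree()
      +order_by: "(event_date, event_id)"
      # Goes into the SETTINGS clause of the CREATE TABLE statement.
      +settings:
        index_granularity: 8192
      # Goes into the SETTINGS clause of INSERT/DELETE queries for this model.
      +query_settings:
        max_threads: 4
```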
Column Configuration
NOTE: The column configuration options below require model contracts to be enforced.
| Option | Description | Default if any |
|---|---|---|
| codec | A string consisting of arguments passed to CODEC() in the column's DDL. For example: codec: "Delta, ZSTD" will be compiled as CODEC(Delta, ZSTD). | |
| ttl | A string consisting of a TTL (time-to-live) expression that defines a TTL rule in the column's DDL. For example: ttl: ts + INTERVAL 1 DAY will be compiled as TTL ts + INTERVAL 1 DAY. | |
Example
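The following sketch shows how the two column options might appear in a model's properties file, with the contract enforced as the note above requires. The model name, column names, and data types are placeholders.

```yaml
# models/schema.yml fragment (model and column names are hypothetical)
models:
  - name: table_column_configs
    config:
      contract:
        enforced: true        # column options require an enforced contract
    columns:
      - name: ts
        data_type: DateTime
        codec: ZSTD           # compiled as CODEC(ZSTD)
      - name: x
        data_type: UInt8
        ttl: ts + INTERVAL 1 DAY   # compiled as TTL ts + INTERVAL 1 DAY
```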
Features
Materialization: view
A dbt model can be created as a ClickHouse view and configured using the following syntax:
Project File (dbt_project.yml):
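A minimal sketch of the project-file form, where <resource-path> stands for your own project or folder path:

```yaml
models:
  <resource-path>:
    +materialized: view
```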
Or config block (models/<model_name>.sql):
Materialization: table
A dbt model can be created as a ClickHouse table and configured using the following syntax:
Project File (dbt_project.yml):
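A minimal sketch of the project-file form, using options from the general table configurations listed earlier. <resource-path> and <column-name> are placeholders to replace with your own values.

```yaml
models:
  <resource-path>:
    +materialized: table
    +engine: MergeTree()
    +order_by: [<column-name>]
    +partition_by: [<column-name>]
```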
Or config block (models/<model_name>.sql):
Materialization: incremental
A table model is rebuilt on every dbt execution. This may be infeasible or extremely costly for larger result sets or complex transformations. To address this challenge and reduce the build time, a dbt model can be created as an incremental ClickHouse table and configured using the following syntax:
Model definition in dbt_project.yml:
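A minimal sketch of the project-file form. <resource-path> and <column-name> are placeholders, and the options shown are a small subset of those described in the configuration tables in this section.

```yaml
models:
  <resource-path>:
    +materialized: incremental
    +engine: MergeTree()
    +order_by: [<column-name>]
    +unique_key: [<column-name>]
```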
Or config block in models/<model_name>.sql:
Configurations
Configurations that are specific for this materialization type are listed below:
| Option | Description | Required? |
|---|---|---|
| unique_key | A tuple of column names that uniquely identify rows. For more details on uniqueness constraints, see here. | Required. If not provided, altered rows will be added twice to the incremental table. |
| inserts_only | Deprecated in favor of the append incremental strategy, which operates in the same way. If set to True for an incremental model, incremental updates will be inserted directly into the target table without creating an intermediate table. If inserts_only is set, incremental_strategy is ignored. | Optional (default: False) |
| incremental_strategy | The strategy to use for incremental materialization. delete+insert, append, insert_overwrite, or microbatch are supported. For additional details on strategies, see here. | Optional (default: 'default') |
| incremental_predicates | Additional conditions to be applied to the incremental materialization (only applied to the delete+insert strategy) | Optional |
Incremental Model Strategies
dbt-clickhouse supports five incremental model strategies: default (legacy), delete+insert, microbatch, append, and insert_overwrite.
The Default (Legacy) Strategy
Historically ClickHouse has had only limited support for updates and deletes, in the form of asynchronous "mutations." To emulate expected dbt behavior, dbt-clickhouse by default creates a new temporary table containing all unaffected (not deleted, not changed) "old" records, plus any new or updated records, and then swaps or exchanges this temporary table with the existing incremental model relation. This is the only strategy that preserves the original relation if something goes wrong before the operation completes; however, since it involves a full copy of the original table, it can be quite expensive and slow to execute.
The Delete+Insert Strategy
ClickHouse added "lightweight deletes" as an experimental feature in version 22.8. Lightweight deletes are significantly faster than ALTER TABLE ... DELETE operations, because they don't require rewriting ClickHouse data parts. The incremental strategy delete+insert utilizes lightweight deletes to implement incremental materializations that perform significantly better than the "legacy" strategy. However, there are important caveats to using this strategy:
- Lightweight deletes must be enabled on your ClickHouse server using the setting allow_experimental_lightweight_delete=1, or you must set use_lw_deletes=true in your profile (which will enable that setting for your dbt sessions).
- Lightweight deletes are now production ready, but there may be performance and other problems on ClickHouse versions earlier than 23.3.
- This strategy operates directly on the affected table/relation (without creating any intermediate or temporary tables), so if there is an issue during the operation, the data in the incremental model is likely to be in an invalid state.
- When using lightweight deletes, dbt-clickhouse enables the setting allow_nondeterministic_mutations. In some very rare cases using non-deterministic incremental_predicates this could result in a race condition for the updated/deleted items (and related log messages in the ClickHouse logs). To ensure consistent results the incremental predicates should only include sub-queries on data that will not be modified during the incremental materialization.
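As an illustration, a model could opt into this strategy along the following lines. The project, model, and column names are hypothetical, and the sketch assumes lightweight deletes are already enabled on the server or through use_lw_deletes in the profile.

```yaml
# dbt_project.yml fragment (names are hypothetical)
models:
  my_project:
    events:
      +materialized: incremental
      +incremental_strategy: delete+insert
      +unique_key: event_id   # rows matching this key are deleted, then re-inserted
```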
The Microbatch Strategy (Requires dbt-core >= 1.9)
The incremental strategy microbatch has been a dbt-core feature since version 1.9, designed to handle large time-series data transformations efficiently. In dbt-clickhouse, it builds on top of the existing delete_insert incremental strategy by splitting the increment into predefined time-series batches based on the event_time and batch_size model configurations.
Beyond handling large transformations, microbatch provides the ability to:
- Reprocess failed batches.
- Auto-detect parallel batch execution.
- Eliminate the need for complex conditional logic in backfilling.
For detailed microbatch usage, refer to the official documentation.
Available Microbatch Configurations
| Option | Description | Default if any |
|---|---|---|
| event_time | The column indicating "at what time did the row occur." Required for your microbatch model and any direct parents that should be filtered. | |
| begin | The "beginning of time" for the microbatch model. This is the starting point for any initial or full-refresh builds. For example, a daily-grain microbatch model run on 2024-10-01 with begin = '2023-10-01' will process 366 batches (it's a leap year!) plus the batch for "today." | |
| batch_size | The granularity of your batches. Supported values are hour, day, month, and year. | |
| lookback | Process X batches prior to the latest bookmark to capture late-arriving records. | 1 |
| concurrent_batches | Overrides dbt's auto detect for running batches concurrently (at the same time). Read more about configuring concurrent batches. Setting to true runs batches concurrently (in parallel). false runs batches sequentially (one after the other). | |
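The options in the table above could be combined as in the sketch below. The model name, the event_time column, and the begin date are hypothetical values chosen for illustration.

```yaml
# models/schema.yml fragment (model and column names are hypothetical)
models:
  - name: sessions
    config:
      materialized: incremental
      incremental_strategy: microbatch
      event_time: session_start   # timestamp column used to slice batches
      begin: "2024-01-01"         # earliest date processed on a full refresh
      batch_size: day
      lookback: 1                 # reprocess one prior batch for late rows
```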
The Append Strategy
This strategy replaces the inserts_only setting in previous versions of dbt-clickhouse. This approach simply appends new rows to the existing relation. As a result duplicate rows are not eliminated, and there is no temporary or intermediate table. It is the fastest approach if duplicates are either permitted in the data or excluded by the incremental query WHERE clause/filter.
The insert_overwrite Strategy (Experimental)
[IMPORTANT]
Currently, the insert_overwrite strategy is not fully functional with distributed materializations.
This strategy performs the following steps:
- Create a staging (temporary) table with the same structure as the incremental model relation: CREATE TABLE <staging> AS <target>.
- Insert only new records (produced by SELECT) into the staging table.
- Replace only new partitions (present in the staging table) into the target table.
This approach has the following advantages:
- It is faster than the default strategy because it doesn't copy the entire table.
- It is safer than other strategies because it doesn't modify the original table until the INSERT operation completes successfully: in case of intermediate failure, the original table is not modified.
- It implements the "partitions immutability" data engineering best practice, which simplifies incremental and parallel data processing, rollbacks, etc.
The strategy requires partition_by to be set in the model configuration. It ignores all other strategy-specific parameters of the model config.
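A hedged sketch of a model configured for this strategy follows. The project, model, and column names are hypothetical, and the partition expression is only one possible choice.

```yaml
# dbt_project.yml fragment (names are hypothetical)
models:
  my_project:
    daily_events:
      +materialized: incremental
      +incremental_strategy: insert_overwrite
      +partition_by: toYYYYMMDD(event_date)   # required: partitions are the unit of replacement
```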
Materialization: materialized_view (Experimental)
A materialized_view materialization should be a SELECT from an existing (source) table. The adapter will create a target table with the model name and a ClickHouse MATERIALIZED VIEW with the name <model_name>_mv. Unlike PostgreSQL, a ClickHouse materialized view is not "static" (and has no corresponding REFRESH operation). Instead, it acts as an "insert trigger", and will insert new rows into the target table using the defined SELECT "transformation" in the view definition on rows inserted into the source table. See the test file for an introductory example of how to use this functionality.
ClickHouse provides the ability for more than one materialized view to write records to the same target table. To support this in dbt-clickhouse, you can construct a UNION in your model file, such that the SQL for each of your materialized views is wrapped with comments of the form --my_mv_name:begin and --my_mv_name:end.
For example the following will build two materialized views both writing data to the same destination table of the model. The names of the materialized views will take the form <model_name>_mv1 and <model_name>_mv2:
IMPORTANT!
When updating a model with multiple materialized views (MVs), especially when renaming one of the MV names, dbt-clickhouse does not automatically drop the old MV. Instead, you will encounter the following warning:
Warning - Table <previous table name> was detected with the same pattern as model name <your model name> but was not found in this run. In case it is a renamed mv that was previously part of this model, drop it manually (!!!)
Data catch-up
Currently, when creating a materialized view (MV), the target table is first populated with historical data before the MV itself is created.
In other words, dbt-clickhouse initially creates the target table and preloads it with historical data based on the query defined for the MV. Only after this step is the MV created.
If you prefer not to preload historical data during MV creation, you can disable this behavior by setting the catch-up config to False:
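A sketch of what this could look like in dbt_project.yml, assuming the config key is spelled catchup (an assumption to verify against your adapter version). The project, model, and column names are hypothetical.

```yaml
# dbt_project.yml fragment (names are hypothetical; key spelling is an assumption)
models:
  my_project:
    events_mv:
      +materialized: materialized_view
      +engine: MergeTree()
      +order_by: "(id)"
      +catchup: false   # skip preloading historical data into the target table
```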
Refreshable Materialized Views
To use Refreshable Materialized View, please adjust the following configs as needed in your MV model (all these configs are supposed to be set inside a refreshable config object):
| Option | Description | Required | Default Value |
|---|---|---|---|
| refresh_interval | The interval clause (required) | Yes | |
| randomize | The randomization clause, will appear after RANDOMIZE FOR | | |
| append | If set to True, each refresh inserts rows into the table without deleting existing rows. The insert is not atomic, just like a regular INSERT SELECT. | | False |
| depends_on | A dependencies list for the refreshable mv. Please provide the dependencies in the following format: {schema}.{view_name} | | |
| depends_on_validation | Whether to validate the existence of the dependencies provided in depends_on. In case a dependency doesn't contain a schema, the validation occurs on schema default | | False |
A config example for refreshable materialized view:
Limitations
- When creating a refreshable materialized view (MV) in ClickHouse that has a dependency, ClickHouse does not throw an error if the specified dependency does not exist at the time of creation. Instead, the refreshable MV remains in an inactive state, waiting for the dependency to be satisfied before it starts processing updates or refreshing. This behavior is by design, but it may lead to delays in data availability if the required dependency is not addressed promptly. Users are advised to ensure all dependencies are correctly defined and exist before creating a refreshable materialized view.
- As of today, there is no actual "dbt linkage" between the mv and its dependencies, therefore the creation order is not guaranteed.
- The refreshable feature was not tested with multiple mvs directing to the same target model.
Materialization: dictionary (experimental)
See the tests in https://github.com/ClickHouse/dbt-clickhouse/blob/main/tests/integration/adapter/dictionary/test_dictionary.py for examples of how to implement materializations for ClickHouse dictionaries
Materialization: distributed_table (experimental)
A distributed table is created with the following steps:
- Create a temp view with the SQL query to get the right structure.
- Create empty local tables based on the view.
- Create a distributed table based on the local tables.
- Insert data into the distributed table, so it is distributed across shards without duplication.
Notes:
- dbt-clickhouse queries now automatically include the setting insert_distributed_sync = 1 in order to ensure that downstream incremental materialization operations execute correctly. This could cause some distributed table inserts to run more slowly than expected.
Distributed table model example
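As a rough sketch, a distributed table model might be configured as follows. The project, model, and column names are hypothetical, and the example assumes that cluster is set in the profile.

```yaml
# dbt_project.yml fragment (names are hypothetical)
models:
  my_project:
    events_distributed:
      +materialized: distributed_table
      +engine: ReplacingMergeTree
      +order_by: "id, created_at"
      +sharding_key: cityHash64(id)   # hash-based placement of rows on shards
```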
Generated migrations
Materialization: distributed_incremental (experimental)
This is an incremental model based on the same idea as the distributed table. The main difficulty is processing all incremental strategies correctly.
- The Append Strategy just inserts data into the distributed table.
- The Delete+Insert Strategy creates a distributed temp table to work with all data on every shard.
- The Default (Legacy) Strategy creates distributed temp and intermediate tables for the same reason.
Only the shard tables are replaced, because the distributed table does not hold data. The distributed table is reloaded only when full_refresh mode is enabled or the table structure may have changed.
Distributed incremental model example
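A comparable sketch for the incremental variant is shown below. The names are again hypothetical, cluster is assumed to be set in the profile, and the append strategy is chosen only to keep the example simple.

```yaml
# dbt_project.yml fragment (names are hypothetical)
models:
  my_project:
    events_distributed_incremental:
      +materialized: distributed_incremental
      +engine: MergeTree
      +incremental_strategy: append
      +order_by: "id, created_at"
      +sharding_key: cityHash64(id)
```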
Generated migrations
Snapshot
dbt snapshots allow a record to be made of changes to a mutable model over time. This in turn allows point-in-time queries on models, where analysts can “look back in time” at the previous state of a model. This functionality is supported by the ClickHouse connector and is configured using the following syntax:
Config block in snapshots/<model_name>.sql:
For more information on configuration, check out the snapshot configs reference page.
Contracts and Constraints
Only exact column type contracts are supported. For example, a contract with a UInt32 column type will fail if the model returns a UInt64 or other integer type.
ClickHouse also supports only CHECK constraints on the entire table/model. Primary key, foreign key, unique, and column level CHECK constraints are not supported. (See ClickHouse documentation on primary/order by keys.)
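For illustration, a contract with exact ClickHouse types and a single table-level CHECK constraint might be declared as in the sketch below. The model name, column names, and constraint are hypothetical.

```yaml
# models/schema.yml fragment (names are hypothetical)
models:
  - name: orders
    config:
      materialized: table
      contract:
        enforced: true
    constraints:
      - type: check               # table-level CHECK, the only supported kind
        name: positive_amount
        expression: "amount > 0"
    columns:
      - name: id
        data_type: UInt32         # must match the model's output type exactly
      - name: amount
        data_type: Float64
```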
Additional ClickHouse Macros
Model Materialization Utility Macros
The following macros are included to facilitate creating ClickHouse specific tables and views:
- engine_clause -- Uses the engine model configuration property to assign a ClickHouse table engine. dbt-clickhouse uses the MergeTree engine by default.
- partition_cols -- Uses the partition_by model configuration property to assign a ClickHouse partition key. No partition key is assigned by default.
- order_cols -- Uses the order_by model configuration to assign a ClickHouse order by/sorting key. If not specified, ClickHouse will use an empty tuple() and the table will be unsorted.
- primary_key_clause -- Uses the primary_key model configuration property to assign a ClickHouse primary key. By default, no primary key is set and ClickHouse will use the order by clause as the primary key.
- on_cluster_clause -- Uses the cluster profile property to add an ON CLUSTER clause to certain dbt operations: distributed materializations, view creation, and database creation.
- ttl_config -- Uses the ttl model configuration property to assign a ClickHouse table TTL expression. No TTL is assigned by default.
s3Source Helper Macro
The s3source macro simplifies the process of selecting ClickHouse data directly from S3 using the ClickHouse S3 table function. It works by populating the S3 table function parameters from a named configuration dictionary (the name of the dictionary must end in s3). The macro first looks for the dictionary in the profile vars, and then in the model configuration. The dictionary can contain any of the following keys used to populate the parameters of the S3 table function:
| Argument Name | Description |
|---|---|
| bucket | The bucket base url, such as https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi. https:// is assumed if no protocol is provided. |
| path | The S3 path to use for the table query, such as /trips_4.gz. S3 wildcards are supported. |
| fmt | The expected ClickHouse input format (such as TSV or CSVWithNames) of the referenced S3 objects. |
| structure | The column structure of the data in bucket, as a list of name/datatype pairs, such as ['id UInt32', 'date DateTime', 'value String']. If not provided ClickHouse will infer the structure. |
| aws_access_key_id | The S3 access key id. |
| aws_secret_access_key | The S3 secret key. |
| role_arn | The ARN of a ClickhouseAccess IAM role to use to securely access the S3 objects. See this documentation for more information. |
| compression | The compression method used with the S3 objects. If not provided ClickHouse will attempt to determine compression based on the file name. |
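A configuration dictionary built from these keys might look like the sketch below. The dictionary name taxi_s3 is hypothetical (it only needs to end in s3), placing it under vars in dbt_project.yml is an assumption, and the fmt value is illustrative rather than a statement about the actual file.

```yaml
# dbt_project.yml fragment (dictionary name and placement are assumptions)
vars:
  taxi_s3:
    bucket: https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi
    path: /trips_4.gz
    fmt: TSV
    # structure is omitted here, so ClickHouse infers the columns
```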
See the S3 test file for examples of how to use this macro.
Cross database macro support
dbt-clickhouse supports most of the cross database macros now included in
dbt Core with the following exceptions:
- The split_part SQL function is implemented in ClickHouse using the splitByChar function. This function requires using a constant string for the "split" delimiter, so the delimiter parameter used for this macro will be interpreted as a string, not a column name.
- Similarly, the replace SQL function in ClickHouse requires constant strings for the old_chars and new_chars parameters, so those parameters will be interpreted as strings rather than column names when invoking this macro.