Connect Apache Airflow to ClickHouse
Apache Airflow is an open-source platform for authoring, scheduling, and monitoring workflows as code. Workflows are defined as directed acyclic graphs (DAGs) of tasks written in Python.
The apache-airflow-providers-clickhousedb provider connects Airflow to ClickHouse, letting you run queries, create tables, and load data as part of a DAG. It connects over the HTTP interface using the clickhouse-connect client, and exposes ClickHouse through Airflow's common SQL framework, so the standard SQLExecuteQueryOperator handles DDL, DML, and analytical queries with no ClickHouse-specific operator required.
Install the provider
Install the provider into the environment where your Airflow scheduler and workers run:
The provider depends on apache-airflow-providers-common-sql and clickhouse-connect, which are installed alongside it. To pass query results to pandas or polars DataFrames, install the optional extras:
Create a ClickHouse connection
The provider registers a connection type of clickhouse. Create a connection from the Airflow UI under Admin > Connections, or define one through the CLI or an environment variable.
In the UI, select ClickHouse as the connection type and fill in the fields:
| Field | Description | Default |
|---|---|---|
| Host | ClickHouse server hostname, for example abc123.clickhouse.cloud | localhost |
| Port | HTTP(S) port | 8123 (plain), 8443 (TLS) |
| Login | ClickHouse username | default |
| Password | ClickHouse user password | (empty) |
| Database | Default database for the connection. The UI labels this Database; it is the schema field when you define the connection by URI or JSON. | default |
For ClickHouse Cloud or any self-hosted cluster with TLS enabled, set secure to true in the Extra field and use the TLS port (8443).
Extra connection options
The provider exposes additional options as dedicated fields in the connection form. When you define the connection by URI, JSON, or environment variable instead, supply them as keys in the extra JSON object. All are optional:
extra key | UI field | Default | Description |
|---|---|---|---|
secure | Use TLS (HTTPS) | false | Enable HTTPS/TLS. |
verify | Verify SSL Certificate | true | Verify the server TLS certificate when secure is true. Set to false for self-signed certificates. |
connect_timeout | Connection Timeout (seconds) | 10 | HTTP connection timeout in seconds. |
send_receive_timeout | Query Timeout (seconds) | 300 | Query read/write timeout in seconds. Increase this for long-running analytical queries. |
compress | Enable LZ4 Compression | true | Enable LZ4 result compression. |
client_name | Client Name | (empty) | A label appended to the Airflow version identifier in the ClickHouse User-Agent and the client_name column of system.query_log. |
session_settings | Session Settings (JSON) | (empty) | ClickHouse session settings applied to every query on the connection, for example {"max_execution_time": 300, "max_threads": 8}. |
client_kwargs | Client kwargs (JSON) | (empty) | Additional keyword arguments forwarded to clickhouse_connect.get_client(), for example an http_proxy. |
Define a connection without the UI
Set the connection through an environment variable. The URI form covers host, credentials, and database:
All components of the URI must be URL-encoded. For TLS, timeouts, or session settings, use the JSON form, which exposes the Extra fields:
All hooks and operators use the connection ID clickhouse_default unless you specify another.
Run queries with SQLExecuteQueryOperator
Set the operator's conn_id to your ClickHouse connection. The following DAG creates a table, inserts rows, reads them back, and drops the table:
Query results are fetched with the default handler (fetch_all_handler). To return something other than the full result set, pass a different handler, such as fetch_one_handler for the first row only.
Target a different database per task
When one connection points at a cluster and individual tasks query different databases, override the database through hook_params instead of creating a separate connection:
Use the hook directly
For work that doesn't fit a SQL operator — bulk inserts, streaming, or ClickHouse-specific client calls — use ClickHouseHook inside a Python task.
The hook's bulk_insert_rows method uses the native columnar insert path in clickhouse-connect, which is much faster than row-by-row inserts for large datasets. Set batch_size to bound peak memory on very large inputs:
Call get_client() to reach the underlying clickhouse-connect client for anything the hook doesn't expose directly:
Apply session settings
Pass session settings when constructing the hook, either directly or through an operator's hook_params. Settings passed to the constructor are merged on top of any session_settings defined in the connection's Extra field, and the constructor values win on conflicting keys: