Connect Streamkap to ClickHouse
Streamkap is a real-time data integration platform that specializes in streaming Change Data Capture (CDC) and stream processing. It is built on a high-throughput, scalable stack using Apache Kafka, Apache Flink, and Debezium, and is offered as a fully managed service in SaaS or BYOC (Bring Your Own Cloud) deployments.
Streamkap allows you to stream every insert, update, and delete from source databases like PostgreSQL, MySQL, SQL Server, MongoDB, and more directly into ClickHouse with millisecond latency.
This makes it ideal for powering real-time analytical dashboards, operational analytics, and feeding live data to machine learning models.
Key Features
- Real-time Streaming CDC: Streamkap captures changes directly from your database's logs, ensuring data in ClickHouse is a real-time replica of the source.
- Simplified Stream Processing: Transform, enrich, route, format, and create embeddings from data in real time before it lands in ClickHouse. Powered by Flink with none of the complexity.
- Fully Managed and Scalable: It provides a production-ready, zero-maintenance pipeline, eliminating the need to manage your own Kafka, Flink, Debezium, or schema registry infrastructure. The platform is designed for high throughput and can scale linearly to handle billions of events.
- Automated Schema Evolution: Streamkap automatically detects schema changes in the source database and propagates them to ClickHouse. It can handle adding new columns or changing column types without manual intervention.
- Optimized for ClickHouse: The integration is built to work efficiently with ClickHouse's features. By default, it uses the ReplacingMergeTree engine to seamlessly handle updates and deletes from the source system.
- Resilient Delivery: The platform offers an at-least-once delivery guarantee, ensuring data consistency between your source and ClickHouse. For upsert operations, it performs deduplication based on the primary key.
Getting Started
This guide provides a high-level overview of setting up a Streamkap pipeline to load data into ClickHouse.
Prerequisites
- A Streamkap account.
- Your ClickHouse cluster connection details: Hostname, Port, Username, and Password.
- A source database (e.g., PostgreSQL, SQL Server) configured to allow CDC. You can find detailed setup guides in the Streamkap documentation.
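For example, on PostgreSQL this typically means enabling logical replication and creating a dedicated replication user. A minimal sketch follows; the user name and password are illustrative, managed services such as RDS set `wal_level` through parameter groups instead, and the Streamkap setup guides remain the authoritative reference:

```sql
-- Enable logical decoding so log-based CDC can read the WAL
-- (requires a server restart)
ALTER SYSTEM SET wal_level = 'logical';

-- Dedicated user for Streamkap (name and password are illustrative)
CREATE USER streamkap_user WITH REPLICATION PASSWORD 'secure_password';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO streamkap_user;
```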
Step 1: Configure the Source in Streamkap
- Log into your Streamkap account.
- In the sidebar, navigate to Connectors and select the Sources tab.
- Click + Add and select your source database type (e.g., SQL Server RDS).
- Fill in the connection details, including the endpoint, port, database name, and user credentials.
- Save the connector.
Step 2: Configure the ClickHouse Destination
- In the Connectors section, select the Destinations tab.
- Click + Add and choose ClickHouse from the list.
- Enter the connection details for your ClickHouse service:
  - Hostname: The host of your ClickHouse instance (e.g., abc123.us-west-2.aws.clickhouse.cloud)
  - Port: The secure HTTPS port, typically 8443
  - Username and Password: The credentials for your ClickHouse user
  - Database: The target database name in ClickHouse
- Save the destination.
Step 3: Create and Run the Pipeline
- Navigate to Pipelines in the sidebar and click + Create.
- Select the Source and Destination you just configured.
- Choose the schemas and tables you wish to stream.
- Give your pipeline a name and click Save.
Once created, the pipeline will become active. Streamkap will first take a snapshot of the existing data and then begin streaming any new changes as they occur.
Step 4: Verify the Data in ClickHouse
Connect to your ClickHouse cluster and run a query to see the data arriving in the target table.
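For example, assuming Streamkap has created a table named customers (table names mirror your source tables):

```sql
-- Confirm rows are arriving
SELECT count() FROM customers;

-- Inspect the most recently processed events
-- (_STREAMKAP_TS_MS is a metadata column described below)
SELECT *
FROM customers
ORDER BY _STREAMKAP_TS_MS DESC
LIMIT 10;
```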
How it Works with ClickHouse
Streamkap's integration is designed to efficiently manage CDC data within ClickHouse.
Table Engine and Data Handling
By default, Streamkap uses an upsert ingestion mode. When it creates a table in ClickHouse, it uses the ReplacingMergeTree engine. This engine is ideal for handling CDC events:
- The source table's primary key is used as the ORDER BY key in the ReplacingMergeTree table definition.
- Updates in the source are written as new rows in ClickHouse. During its background merge process, ReplacingMergeTree collapses these rows, keeping only the latest version based on the ordering key.
- Deletes are handled by a metadata flag feeding the ReplacingMergeTree `is_deleted` parameter. Rows deleted at the source are not removed immediately but are marked as deleted.
- Optionally, deleted records can be kept in ClickHouse for analytics purposes.
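The resulting table has roughly the following shape. This is a minimal sketch using a hypothetical customers table; the actual DDL is generated by Streamkap and includes the metadata columns described in the next section:

```sql
CREATE TABLE customers
(
    id UInt64,
    name String,
    -- Streamkap metadata columns (see the table below)
    _STREAMKAP_SOURCE_TS_MS UInt64,
    _STREAMKAP_TS_MS UInt64,
    _STREAMKAP_OFFSET UInt64,
    __DELETED UInt8  -- 0/1 boolean flag
)
-- The version column keeps the latest row per key during merges;
-- the is_deleted flag lets merges drop rows deleted at the source
ENGINE = ReplacingMergeTree(_STREAMKAP_SOURCE_TS_MS, __DELETED)
ORDER BY id;
```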
Metadata Columns
Streamkap adds several metadata columns to each table to manage the state of the data:
| Column Name | Description |
|---|---|
| `_STREAMKAP_SOURCE_TS_MS` | Timestamp (in milliseconds) of the event in the source database. |
| `_STREAMKAP_TS_MS` | Timestamp (in milliseconds) when Streamkap processed the event. |
| `__DELETED` | A boolean flag (`true`/`false`) indicating whether the row was deleted at the source. |
| `_STREAMKAP_OFFSET` | Offset value from Streamkap's internal logs, useful for ordering and debugging. |
Querying the Latest Data
Because ReplacingMergeTree processes updates and deletes in the background, a simple SELECT * query might show historical or deleted rows before a merge is complete. To get the most current state of your data, you must filter out the deleted records and select only the latest version of each row.
You can do this using the FINAL modifier, which is convenient but can impact query performance:
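For example, with the hypothetical customers table from the sketch above:

```sql
-- FINAL performs deduplication at query time;
-- the WHERE clause excludes rows marked deleted at the source
SELECT *
FROM customers FINAL
WHERE __DELETED = 0;
```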
For better performance on large tables, especially when you don't need to read every column or are running one-off analytical queries, you can use the argMax function to manually select the latest record for each primary key:
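A sketch of the equivalent query against the same hypothetical table:

```sql
-- Pick each column's value from the row with the highest source timestamp,
-- then drop keys whose latest state is a delete
SELECT
    id,
    argMax(name, _STREAMKAP_SOURCE_TS_MS) AS name
FROM customers
GROUP BY id
HAVING argMax(__DELETED, _STREAMKAP_SOURCE_TS_MS) = 0;
```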
For production use cases with recurring, concurrent end-user queries, Materialized Views can be used to model the data to better fit the downstream access patterns.
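One possible pattern, sketched under the same assumptions and not Streamkap-generated DDL, is an incrementally maintained latest-state table fed by a materialized view:

```sql
-- Target table holding partial aggregate states per key
CREATE TABLE customers_latest
(
    id UInt64,
    name AggregateFunction(argMax, String, UInt64),
    deleted AggregateFunction(argMax, UInt8, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY id;

-- Updated incrementally on every insert into customers
CREATE MATERIALIZED VIEW customers_latest_mv TO customers_latest AS
SELECT
    id,
    argMaxState(name, _STREAMKAP_SOURCE_TS_MS) AS name,
    argMaxState(__DELETED, _STREAMKAP_SOURCE_TS_MS) AS deleted
FROM customers
GROUP BY id;

-- End-user query: finalize the states with the -Merge combinator
SELECT id, argMaxMerge(name) AS name
FROM customers_latest
GROUP BY id
HAVING argMaxMerge(deleted) = 0;
```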