
Integrating Amazon MSK with ClickHouse

Prerequisites

We assume:

The official Kafka connector from ClickHouse with Amazon MSK

Gather your connection details

To connect to ClickHouse with HTTP(S) you need this information:

  • The HOST and PORT: typically, the port is 8443 when using TLS or 8123 when not using TLS.

  • The DATABASE NAME: out of the box, there is a database named default. Use the name of the database that you want to connect to.

  • The USERNAME and PASSWORD: out of the box, the username is default. Use the username appropriate for your use case.

The details for your ClickHouse Cloud service are available in the ClickHouse Cloud console. Select the service that you will connect to and click Connect:

ClickHouse Cloud service connect button

Choose HTTPS, and the details are available in an example curl command.

ClickHouse Cloud HTTPS connection details

If you are using self-managed ClickHouse, the connection details are set by your ClickHouse administrator.
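
As a quick way to check the details gathered above before configuring the connector, the following minimal sketch builds a request for the ClickHouse HTTP interface using only the Python standard library. The function name `build_query_request` and the example hostname are hypothetical; the `X-ClickHouse-User` and `X-ClickHouse-Key` headers are the ones the HTTP interface accepts for credentials.

```python
import urllib.parse
import urllib.request


def build_query_request(host, port, username, password,
                        database="default", query="SELECT 1", secure=True):
    """Build (but do not send) a request for the ClickHouse HTTP interface.

    Hypothetical helper for illustration; it is not part of any ClickHouse client.
    """
    scheme = "https" if secure else "http"
    params = urllib.parse.urlencode({"database": database})
    url = f"{scheme}://{host}:{port}/?{params}"
    # The query is sent as the POST body; credentials travel in headers.
    request = urllib.request.Request(url, data=query.encode("utf-8"), method="POST")
    request.add_header("X-ClickHouse-User", username)
    request.add_header("X-ClickHouse-Key", password)
    return request


# Example (placeholder hostname and password, replace with your own):
# request = build_query_request("example.clickhouse.cloud", 8443, "default", "<password>")
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.read().decode())
```

If the request succeeds from your workstation but the connector later fails, the cause is more likely networking on the MSK side than the credentials themselves.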

Steps

  1. Make sure you're familiar with the ClickHouse Connector Sink.
  2. Create an MSK instance.
  3. Create and assign an IAM role.
  4. Download a JAR file from the ClickHouse Connect Sink release page.
  5. Install the downloaded JAR file on the Custom plugin page of the Amazon MSK console.
  6. If the connector communicates with a public ClickHouse instance, enable internet access.
  7. Provide a topic name, the ClickHouse instance hostname, and the password in the connector configuration:
connector.class=com.clickhouse.kafka.connect.ClickHouseSinkConnector
tasks.max=1
topics=<topic_name>
ssl=true
security.protocol=SSL
hostname=<hostname>
database=<database_name>
password=<password>
ssl.truststore.location=/tmp/kafka.client.truststore.jks
port=8443
value.converter.schemas.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
exactlyOnce=true
username=default
schemas.enable=false
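
The configuration above still contains placeholders such as `<topic_name>` and `<hostname>`. As a small sanity check before pasting it into the MSK console, the following sketch parses the properties text and lists keys whose values were never filled in. The helper names `parse_properties` and `unfilled_placeholders` are hypothetical and exist only for this example.

```python
import re

# A value that is entirely "<something>" is treated as an unfilled placeholder.
PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def unfilled_placeholders(settings):
    """Return the sorted keys whose values are still <placeholders>."""
    return sorted(key for key, value in settings.items() if PLACEHOLDER.match(value))


example = """
connector.class=com.clickhouse.kafka.connect.ClickHouseSinkConnector
topics=<topic_name>
hostname=<hostname>
database=default
password=s3cret=value
port=8443
"""
print(unfilled_placeholders(parse_properties(example)))
```

For the example text, the check reports `hostname` and `topics` as still unfilled.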

Performance tuning

One way to increase performance is to adjust the batch size and the number of records fetched from Kafka per poll by adding the following to the worker configuration:

consumer.max.poll.records=[NUMBER OF RECORDS]
consumer.max.partition.fetch.bytes=[NUMBER OF RECORDS * RECORD SIZE IN BYTES]

The values to use depend on the desired number of records per poll and the record size. For reference, the default values are:

consumer.max.poll.records=500
consumer.max.partition.fetch.bytes=1048576

You can find more details (both implementation and other considerations) in the official Kafka and Amazon MSK documentation.
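
To make the sizing rule above concrete, here is a minimal sketch that derives both settings from a target number of records per poll and an average record size. The function name `consumer_overrides` and the sample figures (5000 records of 1024 bytes each) are assumptions chosen for illustration, not recommended values.

```python
def consumer_overrides(records_per_poll, avg_record_bytes):
    """Derive the two worker settings from a record count and an average record size."""
    if records_per_poll <= 0 or avg_record_bytes <= 0:
        raise ValueError("both arguments must be positive")
    return {
        "consumer.max.poll.records": records_per_poll,
        # NUMBER OF RECORDS * RECORD SIZE IN BYTES, as in the template above.
        "consumer.max.partition.fetch.bytes": records_per_poll * avg_record_bytes,
    }


# Illustrative figures only: 5000 records per poll at roughly 1 KiB each.
for key, value in consumer_overrides(5000, 1024).items():
    print(f"{key}={value}")
```

With these sample figures the sketch prints `consumer.max.poll.records=5000` and `consumer.max.partition.fetch.bytes=5120000`.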

Notes on Networking for MSK Connect

For MSK Connect to reach ClickHouse, we recommend placing your MSK cluster in a private subnet with a NAT gateway attached for internet access. Instructions on how to set this up are provided below. Public subnets are supported but not recommended, because you would need to continually assign an Elastic IP address to your ENI. The AWS documentation provides more details.

  1. Create a Private Subnet: Create a new subnet within your VPC, designating it as a private subnet. This subnet should not have direct access to the internet.
  2. Create a NAT Gateway: Create a NAT gateway in a public subnet of your VPC. The NAT gateway enables instances in your private subnet to connect to the internet or other AWS services, but prevents the internet from initiating a connection with those instances.
  3. Update the Route Table: Add a route that directs internet-bound traffic to the NAT gateway.
  4. Ensure Security Group(s) and Network ACLs Configuration: Configure your security groups and network ACLs (Access Control Lists) to allow relevant traffic to and from your ClickHouse instance.
    1. For ClickHouse Cloud, configure your security group to allow inbound traffic on ports 9440 and 8443.
    2. For self-hosted ClickHouse, configure your security group to allow inbound traffic on the port in your config file (default is 8123).
  5. Attach Security Group(s) to MSK: Ensure that the security groups configured above are attached to your MSK cluster.
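
Once the steps above are in place, one way to check the network path is to attempt a TCP connection to the ClickHouse port from a host in the same private subnet. The sketch below assumes you have such a host available (for example, a temporary EC2 instance, which is not part of the setup described here); the function name `port_is_reachable` and the example hostname are hypothetical.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example (placeholder hostname, replace with your ClickHouse host):
# print(port_is_reachable("example.clickhouse.cloud", 8443))
```

A `False` result only shows that the TCP connection could not be opened; it does not say whether the route table, a security group, or a network ACL is responsible.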