Separation of Storage and Compute

Overview

This guide explores how you can use ClickHouse and S3 to implement an architecture with separated storage and compute.

Separation of storage and compute means that computing resources and storage resources are managed independently. In ClickHouse, this allows for better scalability, cost-efficiency, and flexibility. You can scale storage and compute resources separately as needed, optimizing performance and costs.

Using ClickHouse backed by S3 is especially useful for use cases where query performance on "cold" data is less critical. ClickHouse supports using S3 as the storage backend for the MergeTree engine via the S3BackedMergeTree table engine, which lets you exploit the scalability and cost benefits of S3 while maintaining the insert and query performance of MergeTree.

Please note that implementing and managing a separation of storage and compute architecture is more complicated than a standard ClickHouse deployment. While self-managed ClickHouse allows for separation of storage and compute as discussed in this guide, we recommend using ClickHouse Cloud, which lets you use ClickHouse in this architecture without configuration, using the SharedMergeTree table engine.

This guide assumes you are using ClickHouse version 22.8 or higher.
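You can check which version your server is running with:

SELECT version();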

danger

Do not configure any AWS/GCS lifecycle policy. This is not supported and could lead to broken tables.

1. Use S3 as a ClickHouse disk

Creating a disk

Create a new file in the ClickHouse config.d directory to store the storage configuration:

vim /etc/clickhouse-server/config.d/storage_config.xml

Copy the following XML into the newly created file, replacing BUCKET, ACCESS_KEY_ID, and SECRET_ACCESS_KEY with the AWS bucket details where you'd like to store your data:

<clickhouse>
    <storage_configuration>
        <disks>
            <s3_disk>
                <type>s3</type>
                <endpoint>$BUCKET</endpoint>
                <access_key_id>$ACCESS_KEY_ID</access_key_id>
                <secret_access_key>$SECRET_ACCESS_KEY</secret_access_key>
                <metadata_path>/var/lib/clickhouse/disks/s3_disk/</metadata_path>
            </s3_disk>
            <s3_cache>
                <type>cache</type>
                <disk>s3_disk</disk>
                <path>/var/lib/clickhouse/disks/s3_cache/</path>
                <max_size>10Gi</max_size>
            </s3_cache>
        </disks>
        <policies>
            <s3_main>
                <volumes>
                    <main>
                        <disk>s3_disk</disk>
                    </main>
                </volumes>
            </s3_main>
        </policies>
    </storage_configuration>
</clickhouse>

If you need to specify further settings for the S3 disk, for example a region or a custom HTTP header, you can find the list of relevant settings in the ClickHouse documentation.
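For example, here is a sketch of the same s3_disk block with an explicit region and a custom HTTP header added. The region value and header below are illustrative assumptions; adjust them to your environment:

<s3_disk>
    <type>s3</type>
    <endpoint>$BUCKET</endpoint>
    <access_key_id>$ACCESS_KEY_ID</access_key_id>
    <secret_access_key>$SECRET_ACCESS_KEY</secret_access_key>
    <!-- assumption: the bucket lives in us-east-1 -->
    <region>us-east-1</region>
    <!-- hypothetical header, shown only to illustrate the setting -->
    <header>X-My-Custom-Header: some-value</header>
    <metadata_path>/var/lib/clickhouse/disks/s3_disk/</metadata_path>
</s3_disk>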

You can also replace access_key_id and secret_access_key with the following, which will attempt to obtain credentials from environment variables and Amazon EC2 metadata:

<use_environment_credentials>true</use_environment_credentials>
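For example, a sketch of the s3_disk block from above with the static keys replaced by environment-based credentials:

<s3_disk>
    <type>s3</type>
    <endpoint>$BUCKET</endpoint>
    <!-- credentials are resolved from environment variables or EC2 metadata -->
    <use_environment_credentials>true</use_environment_credentials>
    <metadata_path>/var/lib/clickhouse/disks/s3_disk/</metadata_path>
</s3_disk>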

After you've created your configuration file, you need to update the owner of the file to the clickhouse user and group:

chown clickhouse:clickhouse /etc/clickhouse-server/config.d/storage_config.xml
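You can confirm the ownership with a quick listing:

ls -l /etc/clickhouse-server/config.d/storage_config.xml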

You can now restart the ClickHouse server to have the changes take effect:

service clickhouse-server restart
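Once the server is back up, you can confirm that the disk and policy were registered by querying the system tables. A quick check, assuming the names used in the configuration above:

SELECT name, path FROM system.disks WHERE name IN ('s3_disk', 's3_cache');

SELECT policy_name, volume_name, disks FROM system.storage_policies WHERE policy_name = 's3_main';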

2. Create a table backed by S3

To test that we've configured the S3 disk properly, we can attempt to create and query a table.

Create a table specifying the new S3 storage policy:

CREATE TABLE my_s3_table
(
    `id` UInt64,
    `column1` String
)
ENGINE = MergeTree
ORDER BY id
SETTINGS storage_policy = 's3_main';

Note that we did not have to specify the engine as S3BackedMergeTree. ClickHouse automatically converts the engine type internally if it detects the table is using S3 for storage.

Show that the table was created with the correct policy:

SHOW CREATE TABLE my_s3_table;

You should see the following result:

┌─statement────────────────────────────────────────────────────┐
│ CREATE TABLE default.my_s3_table
(
    `id` UInt64,
    `column1` String
)
ENGINE = MergeTree
ORDER BY id
SETTINGS storage_policy = 's3_main', index_granularity = 8192 │
└───────────────────────────────────────────────────────────────┘

Let's now insert some rows into our new table:

INSERT INTO my_s3_table (id, column1)
VALUES (1, 'abc'), (2, 'xyz');

Let's verify that our rows were inserted:

SELECT * FROM my_s3_table;
┌─id─┬─column1─┐
│  1 │ abc     │
│  2 │ xyz     │
└────┴─────────┘

2 rows in set. Elapsed: 0.284 sec.

If your data was successfully written to S3, you should see in the AWS console that ClickHouse has created new files in your specified bucket.
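You can also confirm from within ClickHouse that the table's data parts were written to the S3-backed disk; a quick check using the system.parts table:

SELECT name, disk_name
FROM system.parts
WHERE table = 'my_s3_table' AND active;

Each active part should report disk_name as s3_disk.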

If everything worked successfully, you are now using ClickHouse with separated storage and compute!

(Image: S3 bucket example using separation of compute and storage)

3. Implementing replication for fault tolerance (optional)

danger

Do not configure any AWS/GCS lifecycle policy. This is not supported and could lead to broken tables.

For fault tolerance, you can use multiple ClickHouse server nodes distributed across multiple AWS regions, with an S3 bucket for each node.

Replication with S3 disks can be accomplished by using the ReplicatedMergeTree table engine. See the guide on replicating a single shard across two AWS regions using S3 object storage for details.
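As an illustration, here is a sketch of a replicated table definition that uses the same S3 storage policy. The cluster name, ZooKeeper path, and replica macros below are assumptions that depend on your cluster configuration:

CREATE TABLE my_replicated_s3_table ON CLUSTER 'my_cluster'
(
    `id` UInt64,
    `column1` String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/my_replicated_s3_table', '{replica}')
ORDER BY id
SETTINGS storage_policy = 's3_main';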

Further Reading