Data Replication
In ClickHouse Cloud replication is managed for you. Please create your tables without adding arguments. For example, in the text below you would replace:
ENGINE = ReplicatedMergeTree(
'/clickhouse/tables/{shard}/table_name',
'{replica}',
ver
)
with:
ENGINE = ReplicatedMergeTree
Replication is only supported for tables in the MergeTree family:
- ReplicatedMergeTree
- ReplicatedSummingMergeTree
- ReplicatedReplacingMergeTree
- ReplicatedAggregatingMergeTree
- ReplicatedCollapsingMergeTree
- ReplicatedVersionedCollapsingMergeTree
- ReplicatedGraphiteMergeTree
Replication works at the level of an individual table, not the entire server. A server can store both replicated and non-replicated tables at the same time.
Replication does not depend on sharding. Each shard has its own independent replication.
Compressed data for INSERT
and ALTER
queries is replicated (for more information, see the documentation for ALTER).
CREATE
, DROP
, ATTACH
, DETACH
and RENAME
queries are executed on a single server and are not replicated:
- The
CREATE TABLE
query creates a new replicatable table on the server where the query is run. If this table already exists on other servers, it adds a new replica. - The
DROP TABLE
query deletes the replica located on the server where the query is run. - The
RENAME
query renames the table on one of the replicas. In other words, replicated tables can have different names on different replicas.
ClickHouse uses ClickHouse Keeper for storing replicas meta information. It is possible to use ZooKeeper version 3.4.5 or newer, but ClickHouse Keeper is recommended.
To use replication, set parameters in the zookeeper server configuration section.
Don’t neglect the security setting. ClickHouse supports the digest
ACL scheme of the ZooKeeper security subsystem.
Example of setting the addresses of the ClickHouse Keeper cluster:
<zookeeper>
<node>
<host>example1</host>
<port>2181</port>
</node>
<node>
<host>example2</host>
<port>2181</port>
</node>
<node>
<host>example3</host>
<port>2181</port>
</node>
</zookeeper>
ClickHouse also supports storing replicas meta information in an auxiliary ZooKeeper cluster. Do this by providing the ZooKeeper cluster name and path as engine arguments. In other words, it supports storing the metadata of different tables in different ZooKeeper clusters.
Example of setting the addresses of the auxiliary ZooKeeper cluster:
<auxiliary_zookeepers>
<zookeeper2>
<node>
<host>example_2_1</host>
<port>2181</port>
</node>
<node>
<host>example_2_2</host>
<port>2181</port>
</node>
<node>
<host>example_2_3</host>
<port>2181</port>
</node>
</zookeeper2>
<zookeeper3>
<node>
<host>example_3_1</host>
<port>2181</port>
</node>
</zookeeper3>
</auxiliary_zookeepers>
To store table metadata in an auxiliary ZooKeeper cluster instead of the default ZooKeeper cluster, we can use SQL to create the table with ReplicatedMergeTree engine as follows:
CREATE TABLE table_name ( ... ) ENGINE = ReplicatedMergeTree('zookeeper_name_configured_in_auxiliary_zookeepers:path', 'replica_name') ...
You can specify any existing ZooKeeper cluster and the system will use a directory on it for its own data (the directory is specified when creating a replicatable table).
If ZooKeeper is not set in the config file, you can’t create replicated tables, and any existing replicated tables will be read-only.
ZooKeeper is not used in SELECT
queries because replication does not affect the performance of SELECT
and queries run just as fast as they do for non-replicated tables. When querying distributed replicated tables, ClickHouse behavior is controlled by the settings max_replica_delay_for_distributed_queries and fallback_to_stale_replicas_for_distributed_queries.
For each INSERT
query, approximately ten entries are added to ZooKeeper through several transactions. (To be more precise, this is for each inserted block of data; an INSERT query contains one block or one block per max_insert_block_size = 1048576
rows.) This leads to slightly longer latencies for INSERT
compared to non-replicated tables. But if you follow the recommendations to insert data in batches of no more than one INSERT
per second, it does not create any problems. The entire ClickHouse cluster used for coordinating one ZooKeeper cluster has a total of several hundred INSERTs
per second. The throughput on data inserts (the number of rows per second) is just as high as for non-replicated data.
For very large clusters, you can use different ZooKeeper clusters for different shards. However, from our experience this has not proven necessary based on production clusters with approximately 300 servers.
Replication is asynchronous and multi-master. INSERT
queries (as well as ALTER
) can be sent to any available server. Data is inserted on the server where the query is run, and then it is copied to the other servers. Because it is asynchronous, recently inserted data appears on the other replicas with some latency. If part of the replicas are not available, the data is written when they become available. If a replica is available, the latency is the amount of time it takes to transfer the block of compressed data over the network. The number of threads performing background tasks for replicated tables can be set by background_schedule_pool_size setting.
ReplicatedMergeTree
engine uses a separate thread pool for replicated fetches. Size of the pool is limited by the background_fetches_pool_size setting which can be tuned with a server restart.
By default, an INSERT query waits for confirmation of writing the data from only one replica. If the data was successfully written to only one replica and the server with this replica ceases to exist, the stored data will be lost. To enable getting confirmation of data writes from multiple replicas, use the insert_quorum
option.
Each block of data is written atomically. The INSERT query is divided into blocks up to max_insert_block_size = 1048576
rows. In other words, if the INSERT
query has less than 1048576 rows, it is made atomically.
Data blocks are deduplicated. For multiple writes of the same data block (data blocks of the same size containing the same rows in the same order), the block is only written once. The reason for this is in case of network failures when the client application does not know if the data was written to the DB, so the INSERT
query can simply be repeated. It does not matter which replica INSERTs were sent to with identical data. INSERTs
are idempotent. Deduplication parameters are controlled by merge_tree server settings.
During replication, only the source data to insert is transferred over the network. Further data transformation (merging) is coordinated and performed on all the replicas in the same way. This minimizes network usage, which means that replication works well when replicas reside in different datacenters. (Note that duplicating data in different datacenters is the main goal of replication.)
You can have any number of replicas of the same data. Based on our experiences, a relatively reliable and convenient solution could use double replication in production, with each server using RAID-5 or RAID-6 (and RAID-10 in some cases).
The system monitors data synchronicity on replicas and is able to recover after a failure. Failover is automatic (for small differences in data) or semi-automatic (when data differs too much, which may indicate a configuration error).
Creating Replicated Tables
In ClickHouse Cloud replication is managed for you. Please create your tables without adding arguments. For example, in the text below you would replace:
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/table_name', '{replica}', ver)
with:
ENGINE = ReplicatedMergeTree
The Replicated
prefix is added to the table engine name. For example:ReplicatedMergeTree
.
Adding Replicated
is optional in ClickHouse Cloud, as all of the tables are replicated.
Replicated*MergeTree parameters
zoo_path
zoo_path
— The path to the table in ClickHouse Keeper.
replica_name
replica_name
— The replica name in ClickHouse Keeper.